cs.CL

[1] Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs

Itay Itzhak,Yonatan Belinkov,Gabriel Stanovsky

Main category: cs.CL

TL;DR: This study finds that cognitive biases in large language models primarily stem from pretraining rather than finetuning or random training noise, suggesting that bias mitigation strategies should focus on the pretraining phase.

Details Motivation: The motivation for this study is to understand the origins of cognitive biases in large language models (LLMs), specifically whether these biases stem from pretraining, finetuning, or random noise during training. This understanding is crucial for developing strategies to evaluate and mitigate bias in LLMs. Method: The researchers employed a two-step causal experimental approach. First, they finetuned models multiple times with different random seeds to analyze the impact of training randomness on cognitive biases. Second, they introduced 'cross-tuning,' which involves swapping instruction datasets between models to determine if biases are dataset-dependent. Result: The results show that while training randomness introduces some variability, cognitive biases in LLMs are mainly shaped by the pretraining phase. Models with the same pretrained backbone exhibited more similar bias patterns compared to those that shared only finetuning data. Conclusion: The study concludes that cognitive biases in large language models are primarily shaped by pretraining rather than just finetuning or training randomness. This indicates that to understand and mitigate biases in LLMs, the focus should be on the pretraining phase. Abstract: Large language models (LLMs) exhibit cognitive biases -- systematic tendencies of irrational decision-making, similar to those seen in humans. Prior work has found that these biases vary across models and can be amplified by instruction tuning. However, it remains unclear if these differences in biases stem from pretraining, finetuning, or even random noise due to training stochasticity. We propose a two-step causal experimental approach to disentangle these factors. First, we finetune models multiple times using different random seeds to study how training randomness affects over 30 cognitive biases. Second, we introduce cross-tuning -- swapping instruction datasets between models to isolate bias sources. This swap uses datasets that led to different bias patterns, directly testing whether biases are dataset-dependent. Our findings reveal that while training randomness introduces some variability, biases are mainly shaped by pretraining: models with the same pretrained backbone exhibit more similar bias patterns than those sharing only finetuning data. These insights suggest that understanding biases in finetuned models requires considering their pretraining origins beyond finetuning effects. This perspective can guide future efforts to develop principled strategies for evaluating and mitigating bias in LLMs.

[2] Prompt Perturbations Reveal Human-Like Biases in LLM Survey Responses

Jens Rupprecht,Georg Ahnert,Markus Strohmaier

Main category: cs.CL

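TL;DR: By testing multiple LLMs under a range of prompt perturbations, this study exposes reliability and bias problems that arise when LLMs are used to generate synthetic survey data.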

Details Motivation: To understand how robust LLM responses are in normative survey contexts and how susceptible they are to known response biases. Method: Nine diverse LLMs were tested on questions from the World Values Survey (WVS) under a comprehensive set of 11 perturbations. Result: All tested models exhibit a consistent recency bias; larger models are generally more robust, but all models remain sensitive to semantic variations and combined perturbations. Conclusion: Prompt design and robustness testing are critical when using LLMs to generate synthetic survey data. Abstract: Large Language Models (LLMs) are increasingly used as proxies for human subjects in social science surveys, but their reliability and susceptibility to known response biases are poorly understood. This paper investigates the response robustness of LLMs in normative survey contexts -- we test nine diverse LLMs on questions from the World Values Survey (WVS), applying a comprehensive set of 11 perturbations to both question phrasing and answer option structure, resulting in over 167,000 simulated interviews. In doing so, we not only reveal LLMs' vulnerabilities to perturbations but also reveal that all tested models exhibit a consistent recency bias varying in intensity, disproportionately favoring the last-presented answer option. While larger models are generally more robust, all models remain sensitive to semantic variations like paraphrasing and to combined perturbations. By applying a set of perturbations, we reveal that LLMs partially align with survey response biases identified in humans. This underscores the critical importance of prompt design and robustness testing when using LLMs to generate synthetic survey data.
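
The recency bias described above (favoring the last-presented answer option) can be illustrated with a small sketch. The abstract does not state the paper's exact metric, so the measure below, the gap between the observed last-option rate and a uniform 1/n baseline, is an assumption, and it only makes sense if answer order is shuffled so that option content is decoupled from position. The function names and the toy responses are hypothetical and are not data from the paper.

```python
from collections import Counter

def last_option_rate(choices, n_options):
    """Fraction of responses that picked the last-presented option.

    `choices` holds 0-based positions of the selected option.
    """
    if not choices:
        raise ValueError("need at least one response")
    counts = Counter(choices)
    return counts[n_options - 1] / len(choices)

def recency_excess(choices, n_options):
    """Last-option rate minus the uniform rate 1/n_options (assumed baseline)."""
    return last_option_rate(choices, n_options) - 1.0 / n_options

# Made-up positions chosen across 10 shuffled presentations of a
# 4-option question; illustrative only, not results from the paper.
toy_choices = [3, 3, 1, 3, 0, 3, 2, 3, 3, 1]
print(round(recency_excess(toy_choices, 4), 2))
```

A positive value means the last position is chosen more often than chance would suggest; comparing this value across perturbations is one simple way to see whether the bias varies in intensity.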

[3] SynthTextEval: Synthetic Text Data Generation and Evaluation for High-Stakes Domains

Krithika Ramesh,Daniel Smolyak,Zihao Zhao,Nupoor Gandhi,Ritu Agarwal,Margrét Bjarnadóttir,Anjalie Field

Main category: cs.CL

TL;DR: This paper introduces SynthTextEval, a toolkit for evaluating synthetic text across multiple dimensions, aiming to improve its viability and enhance privacy-preservation in AI development.

Details Motivation: The motivation is to realize the potential of synthetic text for applications like reducing privacy risks in high-stakes AI domains by providing principled and consistent evaluations across various dimensions. Method: The paper introduces SynthTextEval, a toolkit for conducting comprehensive evaluations of synthetic text across multiple dimensions, including utility, fairness, privacy leakage, distributional differences, and expert feedback. Result: SynthTextEval enables users to evaluate synthetic data along multiple dimensions and showcases its functionality and effectiveness on datasets from healthcare and law domains. Conclusion: SynthTextEval improves the viability of synthetic text and privacy-preservation in AI development by consolidating and standardizing evaluation metrics. Abstract: We present SynthTextEval, a toolkit for conducting comprehensive evaluations of synthetic text. The fluency of large language model (LLM) outputs has made synthetic text potentially viable for numerous applications, such as reducing the risks of privacy violations in the development and deployment of AI systems in high-stakes domains. Realizing this potential, however, requires principled consistent evaluations of synthetic data across multiple dimensions: its utility in downstream systems, the fairness of these systems, the risk of privacy leakage, general distributional differences from the source text, and qualitative feedback from domain experts. SynthTextEval allows users to conduct evaluations along all of these dimensions over synthetic data that they upload or generate using the toolkit's generation module. While our toolkit can be run over any data, we highlight its functionality and effectiveness over datasets from two high-stakes domains: healthcare and law. By consolidating and standardizing evaluation metrics, we aim to improve the viability of synthetic text, and in-turn, privacy-preservation in AI development.

[4] Medical Red Teaming Protocol of Language Models: On the Importance of User Perspectives in Healthcare Settings

Minseon Kim,Jean-Philippe Corbeil,Alessandro Sordoni,Francois Beaulieu,Paul Vozila

Main category: cs.CL

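TL;DR: This paper develops a safety evaluation protocol tailored to medical large language models (LLMs) and introduces a new benchmark dataset, PatientSafetyBench, to analyze the safety of medical LLMs from the perspectives of patients, clinicians, and general users.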

Details Motivation: As large language models are used more widely in medicine, their outputs can directly affect human health, so safety evaluation needs to target the medical domain specifically rather than rely only on general safety benchmarks. Method: The authors build PatientSafetyBench, a dataset of 466 samples covering 5 critical categories, and run a red-teaming case study on the MediPhi model collection, evaluating safety from three perspectives: patient, clinician, and general user. Result: The work proposes what the authors describe as the first safety evaluation criteria for medical LLMs, incorporates patient-perspective safety testing, and demonstrates the protocol on the MediPhi models. Conclusion: The work fills a gap in safety evaluation for medical LLMs and provides a foundation for the safer deployment of medical AI. Abstract: As the performance of large language models (LLMs) continues to advance, their adoption is expanding across a wide range of domains, including the medical field. The integration of LLMs into medical applications raises critical safety concerns, particularly due to their use by users with diverse roles, e.g. patients and clinicians, and the potential for model outputs to directly affect human health. Despite the domain-specific capabilities of medical LLMs, prior safety evaluations have largely focused only on general safety benchmarks. In this paper, we introduce a safety evaluation protocol tailored to the medical domain in both patient user and clinician user perspectives, alongside general safety assessments and quantitatively analyze the safety of medical LLMs. We bridge a gap in the literature by building the PatientSafetyBench containing 466 samples over 5 critical categories to measure safety from the perspective of the patient. We apply our red-teaming protocols on the MediPhi model collection as a case study. To our knowledge, this is the first work to define safety evaluation criteria for medical LLMs through targeted red-teaming taking three different points of view - patient, clinician, and general user - establishing a foundation for safer deployment in medical domains.

[5] The Impact of Background Speech on Interruption Detection in Collaborative Groups

Mariah Bradford,Nikhil Krishnaswamy,Nathaniel Blanchard

Main category: cs.CL

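TL;DR: This study addresses interruption detection by AI agents in classrooms with multiple concurrent group conversations and proposes a method that is robust to overlapping speech.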

Details Motivation: AI agents in classrooms face multiple concurrent conversations and pervasive overlapping speech, so conventional interruption detection methods cannot be applied directly and new solutions are needed. Method: The authors analyze interruption detection in single-conversation and multi-group dialogue settings and build an interruption identification method that is robust to overlapping speech. Result: They develop a state-of-the-art interruption identification method and surface meaningful linguistic and prosodic information about how interruptions manifest in collaborative group interactions. Conclusion: The study highlights the challenge of handling multi-group dialogue and overlapping speech in classroom group discussions and presents a method that can identify interruptions in that setting. Abstract: Interruption plays a crucial role in collaborative learning, shaping group interactions and influencing knowledge construction. AI-driven support can assist teachers in monitoring these interactions. However, most previous work on interruption detection and interpretation has been conducted in single-conversation environments with relatively clean audio. AI agents deployed in classrooms for collaborative learning within small groups will need to contend with multiple concurrent conversations -- in this context, overlapping speech will be ubiquitous, and interruptions will need to be identified in other ways. In this work, we analyze interruption detection in single-conversation and multi-group dialogue settings. We then create a state-of-the-art method for interruption identification that is robust to overlapping speech, and thus could be deployed in classrooms. Further, our work highlights meaningful linguistic and prosodic information about how interruptions manifest in collaborative group interactions. Our investigation also paves the way for future works to account for the influence of overlapping speech from multiple groups when tracking group dialog.

[6] Multi-Agent Retrieval-Augmented Framework for Evidence-Based Counterspeech Against Health Misinformation

Anirban Saha Anik,Xiaoying Song,Elliott Wang,Bryan Wang,Bengisu Yarimbas,Lingzi Hong

Main category: cs.CL

TL;DR: This paper proposes a Multi-agent Retrieval-Augmented Framework utilizing multiple LLMs to enhance counterspeech generation against health misinformation, showing improved performance over existing methods.

Details Motivation: Current studies on using LLMs with RAG for counterspeech rely on limited evidence and offer less control over outputs, necessitating a more robust approach. Method: The study introduces a framework that incorporates multiple LLMs to optimize knowledge retrieval, evidence enhancement, and response refinement for generating counterspeech. Result: The method outperforms baseline approaches in politeness, relevance, informativeness, and factual accuracy. Ablation studies and human evaluations validate the framework's effectiveness, particularly highlighting the importance of response refinement. Conclusion: The proposed Multi-agent Retrieval-Augmented Framework effectively generates high-quality counterspeech against health misinformation by integrating multiple LLMs and both static and dynamic evidence. Abstract: Large language models (LLMs) incorporated with Retrieval-Augmented Generation (RAG) have demonstrated powerful capabilities in generating counterspeech against misinformation. However, current studies rely on limited evidence and offer less control over final outputs. To address these challenges, we propose a Multi-agent Retrieval-Augmented Framework to generate counterspeech against health misinformation, incorporating multiple LLMs to optimize knowledge retrieval, evidence enhancement, and response refinement. Our approach integrates both static and dynamic evidence, ensuring that the generated counterspeech is relevant, well-grounded, and up-to-date. Our method outperforms baseline approaches in politeness, relevance, informativeness, and factual accuracy, demonstrating its effectiveness in generating high-quality counterspeech. To further validate our approach, we conduct ablation studies to verify the necessity of each component in our framework. Furthermore, human evaluations reveal that refinement significantly enhances counterspeech quality and obtains human preference.

[7] GNN-CNN: An Efficient Hybrid Model of Convolutional and Graph Neural Networks for Text Representation

Fardin Rastakhiz

Main category: cs.CL

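TL;DR: This study proposes an efficient deep learning model that combines the strengths of graph neural networks, convolutional neural networks, and large language models for long-text classification while maintaining computational efficiency and competitive performance.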

Details Motivation: Transformers are inefficient on long texts because of their high computational complexity, so a more efficient architecture with competitive performance is needed. Method: The paper proposes a new architecture that combines Graph Neural Networks (GNNs) and Convolutional Neural Networks (CNNs) with a real-time, end-to-end graph generation mechanism, and uses token embeddings and sentiment polarities from Large Language Models (LLMs) to improve performance. Result: The model performs well across multiple text classification tasks, and the generated graphs show structural properties indicative of meaningful semantic organization (an average clustering coefficient of approximately 0.45 and an average shortest path length between 4 and 5). Conclusion: By combining information from GNNs, CNNs, and LLMs, the model shows advantages on long-text processing while remaining computationally efficient and competitive in performance. Abstract: Time, cost, and energy efficiency are critical considerations in Deep-Learning (DL), particularly when processing long texts. Transformers, which represent the current state of the art, exhibit quadratic computational complexity relative to input length, making them inefficient for extended documents. This study introduces a novel model architecture that combines Graph Neural Networks (GNNs) and Convolutional Neural Networks (CNNs), integrated with a real-time, end-to-end graph generation mechanism. The model processes compact batches of character-level inputs without requiring padding or truncation. To enhance performance while maintaining high speed and efficiency, the model incorporates information from Large Language Models (LLMs), such as token embeddings and sentiment polarities, through efficient dictionary lookups. It captures local contextual patterns using CNNs, expands local receptive fields via lattice-based graph structures, and employs small-world graphs to aggregate document-level information. The generated graphs exhibit structural properties indicative of meaningful semantic organization, with an average clustering coefficient of approximately 0.45 and an average shortest path length ranging between 4 and 5. The model is evaluated across multiple text classification tasks, including sentiment analysis and news-categorization, and is compared against state-of-the-art models. Experimental results confirm the proposed model's efficiency and competitive performance.
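
The two graph statistics quoted above, average clustering coefficient and average shortest path length, can be computed as in the following minimal sketch. It uses the conventional definitions and a plain ring lattice as a stand-in graph; this is not the paper's graph generation mechanism, and `ring_lattice` and its parameters are hypothetical helpers chosen only to give a checkable example.

```python
from collections import deque

def ring_lattice(n, k):
    """Undirected ring where each node links to its k nearest neighbours (k even)."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for step in range(1, k // 2 + 1):
            j = (i + step) % n
            adj[i].add(j)
            adj[j].add(i)
    return adj

def average_clustering(adj):
    """Mean over nodes of (links among neighbours) / (possible links among neighbours)."""
    total = 0.0
    for nbrs in adj.values():
        d = len(nbrs)
        if d < 2:
            continue  # nodes with fewer than two neighbours contribute 0
        links = sum(1 for u in nbrs for v in nbrs if u < v and v in adj[u])
        total += 2.0 * links / (d * (d - 1))
    return total / len(adj)

def average_shortest_path(adj):
    """Mean BFS distance over all ordered pairs of distinct, reachable nodes."""
    dist_sum, pairs = 0, 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        dist_sum += sum(dist.values())
        pairs += len(dist) - 1
    return dist_sum / pairs

lattice = ring_lattice(20, 4)
print(average_clustering(lattice), average_shortest_path(lattice))
```

A pure lattice has high clustering but paths that grow with graph size; small-world graphs add a few long-range links, which keeps clustering relatively high while shortening paths. That combination is what figures such as a clustering coefficient near 0.45 with path lengths of 4 to 5 are usually taken to indicate.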

[8] MedReadCtrl: Personalizing medical text generation with readability-controlled instruction learning

Hieu Tran,Zonghai Yao,Won Seok Jang,Sharmin Sultana,Allen Chang,Yuan Zhang,Hong Yu

Main category: cs.CL

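TL;DR: Introduces MedReadCtrl, a framework that lets medical language models adjust output complexity, achieving better readability control than GPT-4 and stronger expert preference across multiple medical tasks.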

Details Motivation: Healthcare needs AI systems that can adjust content complexity to the user's needs and make human-AI communication more effective. Method: The authors develop MedReadCtrl, a readability-controlled instruction tuning framework for adjusting the output complexity of large language models. Result: Evaluations on nine datasets and three tasks show that MedReadCtrl makes significantly fewer readability instruction-following errors than GPT-4 and delivers substantial gains on unseen clinical tasks. Conclusion: MedReadCtrl offers a scalable way to turn clinical content into accessible language while preserving medical intent, supporting patient education and equitable access to AI-enabled care. Abstract: Generative AI has demonstrated strong potential in healthcare, from clinical decision support to patient-facing chatbots that improve outcomes. A critical challenge for deployment is effective human-AI communication, where content must be both personalized and understandable. We introduce MedReadCtrl, a readability-controlled instruction tuning framework that enables LLMs to adjust output complexity without compromising meaning. Evaluations of nine datasets and three tasks across medical and general domains show that MedReadCtrl achieves significantly lower readability instruction-following errors than GPT-4 (e.g., 1.39 vs. 1.59 on ReadMe, p<0.001) and delivers substantial gains on unseen clinical tasks (e.g., +14.7 ROUGE-L, +6.18 SARI on MTSamples). Experts consistently preferred MedReadCtrl (71.7% vs. 23.3%), especially at low literacy levels. These gains reflect MedReadCtrl's ability to restructure clinical content into accessible, readability-aligned language while preserving medical intent, offering a scalable solution to support patient education and expand equitable access to AI-enabled care.

[9] SynthEHR-Eviction: Enhancing Eviction SDoH Detection with LLM-Augmented Synthetic EHR Data

Zonghai Yao,Youxia Zhao,Avijit Mitra,David A. Levy,Emily Druhl,Jack Tsai,Hong Yu

Main category: cs.CL

TL;DR: SynthEHR-Eviction is an efficient pipeline for extracting eviction information from clinical notes, and it was used to create a large-scale SDoH dataset.

Details Motivation: Eviction is a significant yet understudied social determinant of health (SDoH), linked to housing instability, unemployment, and mental health. Because eviction is rarely recorded in structured fields, downstream applications are limited. Method: SynthEHR-Eviction combines large language models (LLMs), human-in-the-loop annotation, and automated prompt optimization (APO) to extract eviction status from clinical notes. Result: The pipeline was used to create the largest public eviction-related SDoH dataset to date, with 14 fine-grained categories. Fine-tuned LLMs reached Macro-F1 scores of 88.8% (eviction) and 90.3% (other SDoH) on human-validated data. Conclusion: SynthEHR-Eviction is an efficient pipeline that substantially reduces annotation effort, accelerates dataset creation, and generalizes to other information extraction tasks. Abstract: Eviction is a significant yet understudied social determinant of health (SDoH), linked to housing instability, unemployment, and mental health. While eviction appears in unstructured electronic health records (EHRs), it is rarely coded in structured fields, limiting downstream applications. We introduce SynthEHR-Eviction, a scalable pipeline combining LLMs, human-in-the-loop annotation, and automated prompt optimization (APO) to extract eviction statuses from clinical notes. Using this pipeline, we created the largest public eviction-related SDoH dataset to date, comprising 14 fine-grained categories. Fine-tuned LLMs (e.g., Qwen2.5, LLaMA3) trained on SynthEHR-Eviction achieved Macro-F1 scores of 88.8% (eviction) and 90.3% (other SDoH) on human validated data, outperforming GPT-4o-APO (87.8%, 87.3%), GPT-4o-mini-APO (69.1%, 78.1%), and BioBERT (60.7%, 68.3%), while enabling cost-effective deployment across various model sizes. The pipeline reduces annotation effort by over 80%, accelerates dataset creation, enables scalable eviction detection, and generalizes to other information extraction tasks.
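
Since the headline numbers above are Macro-F1 scores, a short sketch of the standard computation may help: F1 is computed per class and then averaged with equal weight, so rare classes count as much as common ones. The label names and toy predictions below are hypothetical and are not the paper's 14 categories or its data.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical labels for illustration only.
toy_true = ["evicted", "evicted", "pending", "none", "none", "none"]
toy_pred = ["evicted", "pending", "pending", "none", "none", "evicted"]
print(round(macro_f1(toy_true, toy_pred), 4))
```

With fine-grained categories of very different frequencies, macro averaging is a common choice because a model cannot score well by getting only the majority class right.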

[10] Towards Interpretable Time Series Foundation Models

Matthieu Boileau,Philippe Helluy,Jeremy Pawlus,Svitlana Vyetrenko

Main category: cs.CL

TL;DR: This paper explores the distillation of time series reasoning into small language models, demonstrating the feasibility of compressing time series understanding into lightweight, interpretable models capable of explaining temporal patterns in natural language.

Details Motivation: Investigate the distillation of time series reasoning capabilities into small, instruction-tuned language models as a step toward building interpretable time series foundation models. Method: Leveraging a synthetic dataset of mean-reverting time series with systematically varied trends and noise levels, we generate natural language annotations using a large multimodal model and use these to supervise the fine-tuning of compact Qwen models. Result: Our results highlight the feasibility of compressing time series understanding into lightweight, language-capable models suitable for on-device or privacy-sensitive deployment. Conclusion: This work contributes a concrete foundation toward developing small, interpretable models that explain temporal patterns in natural language. Abstract: In this paper, we investigate the distillation of time series reasoning capabilities into small, instruction-tuned language models as a step toward building interpretable time series foundation models. Leveraging a synthetic dataset of mean-reverting time series with systematically varied trends and noise levels, we generate natural language annotations using a large multimodal model and use these to supervise the fine-tuning of compact Qwen models. We introduce evaluation metrics that assess the quality of the distilled reasoning - focusing on trend direction, noise intensity, and extremum localization - and show that the post-trained models acquire meaningful interpretive capabilities. Our results highlight the feasibility of compressing time series understanding into lightweight, language-capable models suitable for on-device or privacy-sensitive deployment. This work contributes a concrete foundation toward developing small, interpretable models that explain temporal patterns in natural language.
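
To make the data setup above more concrete, here is a minimal sketch of a synthetic mean-reverting series with a configurable trend and noise level, plus a rule-based trend label of the kind a natural language annotation might contain. The abstract does not specify the generator, so the AR(1)-style update, the parameter names (`trend`, `noise`, `theta`), and the slope-based labelling rule are all assumptions made for illustration.

```python
import random

def mean_reverting_series(n, trend=0.0, noise=1.0, theta=0.2, seed=0):
    """Series pulled back toward a linear trend line at each step.

    x[t] = x[t-1] + theta * (trend * t - x[t-1]) + Gaussian noise
    """
    rng = random.Random(seed)
    x = 0.0
    series = []
    for t in range(n):
        target = trend * t
        x = x + theta * (target - x) + rng.gauss(0.0, noise)
        series.append(x)
    return series

def describe_trend(series, tol=1e-3):
    """Label the direction using the least-squares slope against time."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two points")
    mean_t = (n - 1) / 2
    mean_x = sum(series) / n
    cov = sum((t - mean_t) * (x - mean_x) for t, x in enumerate(series))
    var = sum((t - mean_t) ** 2 for t in range(n))
    slope = cov / var
    if slope > tol:
        return "upward"
    if slope < -tol:
        return "downward"
    return "flat"

rising = mean_reverting_series(50, trend=0.5, noise=0.0)
print(describe_trend(rising))
```

Varying `trend` and `noise` systematically gives a grid of series with known ground truth, which is what makes it possible to score a model's generated description on properties such as trend direction.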

[11] SAND: Boosting LLM Agents with Self-Taught Action Deliberation

Yu Xia,Yiran Jenny Shen,Junda Wu,Tong Yu,Sungchul Kim,Ryan A. Rossi,Lina Yao,Julian McAuley

Main category: cs.CL

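TL;DR: This paper proposes the SAND framework, which has LLM agents deliberate over their own candidate actions and refine themselves iteratively, yielding significant gains on interactive tasks.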

Details Motivation: Existing methods rely too heavily on imitating expert behavior or on promoting chosen reasoning over rejected reasoning, without comparing alternative actions, so agents may commit to suboptimal actions. Method: The paper proposes the Self-taught ActioN Deliberation (SAND) framework, which combines self-consistency action sampling with execution-guided action critique to synthesize step-wise action deliberation thoughts and iteratively finetune the LLM agent. Result: On two representative interactive agent tasks, SAND improves on initial supervised finetuning by an average of 20% and outperforms state-of-the-art agent tuning approaches. Conclusion: Through iterative self-finetuning, SAND improves LLM agent performance on interactive tasks, on average outperforming supervised finetuning and other state-of-the-art methods. Abstract: Large Language Model (LLM) agents are commonly tuned with supervised finetuning on ReAct-style expert trajectories or preference optimization over pairwise rollouts. Most of these methods focus on imitating specific expert behaviors or promoting chosen reasoning thoughts and actions over rejected ones. However, without reasoning and comparing over alternative actions, LLM agents finetuned with these methods may over-commit towards seemingly plausible but suboptimal actions due to limited action space exploration. To address this, in this paper we propose the Self-taught ActioN Deliberation (SAND) framework, enabling LLM agents to explicitly deliberate over candidate actions before committing to one. To tackle the challenges of when and what to deliberate given large action space and step-level action evaluation, we incorporate self-consistency action sampling and execution-guided action critique to help synthesize step-wise action deliberation thoughts using the base model of the LLM agent. In an iterative manner, the deliberation trajectories are then used to finetune the LLM agent itself. Evaluating on two representative interactive agent tasks, SAND achieves an average 20% improvement over initial supervised finetuning and also outperforms state-of-the-art agent tuning approaches.
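
The "when and what to deliberate" question above can be illustrated with a simplified sketch of self-consistency sampling: sample several actions for the same step, and deliberate only when the samples disagree, using the distinct samples as the candidates to compare. The abstract does not give SAND's exact rule, so the agreement threshold and the function name are assumptions, and the execution-guided critique step is not modelled here.

```python
from collections import Counter

def needs_deliberation(sampled_actions, agreement_threshold=1.0):
    """Decide whether a step warrants deliberation.

    Returns (deliberate, candidates): deliberate is True when the most
    common sampled action falls short of the agreement threshold, and
    candidates are the distinct actions that would be compared.
    """
    if not sampled_actions:
        raise ValueError("need at least one sampled action")
    counts = Counter(sampled_actions)
    top_count = counts.most_common(1)[0][1]
    agreement = top_count / len(sampled_actions)
    candidates = sorted(counts)
    return agreement < agreement_threshold, candidates

# Hypothetical action strings sampled for a single step.
print(needs_deliberation(["open door", "open door", "take key"]))
```

Gating on disagreement keeps deliberation off the steps where the agent is already consistent, which matters when the action space is large and each extra deliberation adds tokens to the trajectory.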

[12] RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning

Hongzhi Zhang,Jia Fu,Jingyuan Zhang,Kai Fu,Qi Wang,Fuzheng Zhang,Guorui Zhou

Main category: cs.CL

TL;DR: RLEP is a reinforcement learning framework that improves convergence and performance in large language models by replaying high-quality examples, resulting in faster training and better accuracy.

Details Motivation: Training large language models using reinforcement learning is energy-intensive and can be unstable, with policies drifting from pretrained weights. Method: RLEP uses a two-phase framework that collects verified trajectories and replays them during training, optimizing the policy on mini-batches that blend new rollouts with replayed successes. Result: On the Qwen2.5-Math-7B base model, RLEP achieves faster convergence and stronger final performance, improving accuracy on AIME-2024 from 38.2% to 39.9%, on AIME-2025 from 19.8% to 22.3%, and on AMC-2023 from 77.0% to 82.2%. Conclusion: RLEP improves the convergence and performance of reinforcement learning for large language models by replaying high-quality examples. Abstract: Reinforcement learning (RL) for large language models is an energy-intensive endeavor: training can be unstable, and the policy may gradually drift away from its pretrained weights. We present RLEP -- Reinforcement Learning with Experience rePlay -- a two-phase framework that first collects verified trajectories and then replays them during subsequent training. At every update step, the policy is optimized on mini-batches that blend newly generated rollouts with these replayed successes. By replaying high-quality examples, RLEP steers the model away from fruitless exploration, focuses learning on promising reasoning paths, and delivers both faster convergence and stronger final performance. On the Qwen2.5-Math-7B base model, RLEP reaches baseline peak accuracy with substantially fewer updates and ultimately surpasses it, improving accuracy on AIME-2024 from 38.2% to 39.9%, on AIME-2025 from 19.8% to 22.3%, and on AMC-2023 from 77.0% to 82.2%. Our code, datasets, and checkpoints are publicly available at https://github.com/Kwai-Klear/RLEP to facilitate reproducibility and further research.
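
The batch construction described above, mini-batches that blend fresh rollouts with replayed successes, can be sketched as follows. This is not taken from the RLEP repository: the function name, the `replay_fraction` parameter, and its default value are assumptions, since the abstract does not state the mixing ratio.

```python
import random

def blended_batch(new_rollouts, replay_pool, batch_size,
                  replay_fraction=0.25, seed=None):
    """Build one mini-batch mixing fresh rollouts with replayed successes.

    replay_fraction is a hypothetical knob; the replay share shrinks
    automatically if the pool holds fewer verified trajectories.
    """
    rng = random.Random(seed)
    n_replay = min(int(batch_size * replay_fraction), len(replay_pool))
    n_new = batch_size - n_replay
    if n_new > len(new_rollouts):
        raise ValueError("not enough new rollouts to fill the batch")
    batch = rng.sample(new_rollouts, n_new) + rng.sample(replay_pool, n_replay)
    rng.shuffle(batch)  # avoid a fixed new/replay ordering inside the batch
    return batch

# Placeholder trajectories tagged by origin; real entries would hold
# prompts, sampled reasoning paths, and rewards.
new = [("new", i) for i in range(10)]
replay = [("replay", i) for i in range(5)]
batch = blended_batch(new, replay, batch_size=8, seed=0)
print(sum(1 for tag, _ in batch if tag == "replay"), "replayed of", len(batch))
```

Keeping most of each batch on-policy while reserving a slice for verified successes is one plausible reading of how replay could focus learning on promising reasoning paths without freezing exploration.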

[13] Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models

Kaiqu Liang,Haimin Hu,Xuandong Zhao,Dawn Song,Thomas L. Griffiths,Jaime Fernández Fisac

Main category: cs.CL

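TL;DR: This study proposes the conceptual framework of "machine bullshit" and the "Bullshit Index" metric to analyze the loss of truthfulness in large language models, shows how fine-tuning and prompting methods exacerbate it, and points to challenges for AI alignment.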

Details Motivation: To better understand and measure the loss of truthfulness that emerges in large language models (LLMs), the authors propose "machine bullshit" as an overarching conceptual framework and explore its underlying mechanisms. Method: The paper introduces the Bullshit Index, a new metric quantifying LLMs' indifference to truth, proposes a taxonomy of four forms of bullshit (empty rhetoric, paltering, weasel words, and unverified claims), and runs empirical evaluations on the Marketplace dataset, the Political Neutrality dataset, and the new BullshitEval benchmark. Result: Fine-tuning with reinforcement learning from human feedback (RLHF) significantly exacerbates bullshit, and inference-time chain-of-thought (CoT) prompting notably amplifies empty rhetoric and paltering; in political contexts, weasel words are the dominant strategy. Conclusion: RLHF fine-tuning exacerbates machine bullshit, inference-time CoT prompting notably amplifies specific forms of it, and machine bullshit is prevalent in political contexts. These findings highlight systematic challenges in AI alignment and offer new insights toward more truthful LLM behavior. Abstract: Bullshit, as conceptualized by philosopher Harry Frankfurt, refers to statements made without regard to their truth value. While previous work has explored large language model (LLM) hallucination and sycophancy, we propose machine bullshit as an overarching conceptual framework that can allow researchers to characterize the broader phenomenon of emergent loss of truthfulness in LLMs and shed light on its underlying mechanisms. We introduce the Bullshit Index, a novel metric quantifying LLMs' indifference to truth, and propose a complementary taxonomy analyzing four qualitative forms of bullshit: empty rhetoric, paltering, weasel words, and unverified claims. We conduct empirical evaluations on the Marketplace dataset, the Political Neutrality dataset, and our new BullshitEval benchmark (2,400 scenarios spanning 100 AI assistants) explicitly designed to evaluate machine bullshit. Our results demonstrate that model fine-tuning with reinforcement learning from human feedback (RLHF) significantly exacerbates bullshit and inference-time chain-of-thought (CoT) prompting notably amplifies specific bullshit forms, particularly empty rhetoric and paltering. We also observe prevalent machine bullshit in political contexts, with weasel words as the dominant strategy. Our findings highlight systematic challenges in AI alignment and provide new insights toward more truthful LLM behavior.

[14] PLAN-TUNING: Post-Training Language Models to Learn Step-by-Step Planning for Complex Problem Solving

Mihir Parmar,Palash Goyal,Xin Liu,Yiwen Song,Mingyang Ling,Chitta Baral,Hamid Palangi,Tomas Pfister

Main category: cs.CL

TL;DR: PLAN-TUNING is a post-training framework for improving the complex reasoning abilities of smaller language models.

Details Motivation: Leveraging planning structures during post-training to boost the performance of smaller open-source LLMs remains underexplored. Method: The paper introduces a unified post-training framework that distills synthetic task decompositions from large-scale LLMs and fine-tunes smaller models with supervised and reinforcement-learning objectives. Result: Plan-tuned models outperform strong baselines by about 7% on average on GSM8k and MATH, and show average improvements of about 10% and 12% on OlympiadBench and AIME 2024, respectively. Conclusion: PLAN-TUNING is an effective strategy for improving the task-specific performance of smaller LLMs. Abstract: Recently, decomposing complex problems into simple subtasks--a crucial part of human-like natural planning--to solve the given problem has significantly boosted the performance of large language models (LLMs). However, leveraging such planning structures during post-training to boost the performance of smaller open-source LLMs remains underexplored. Motivated by this, we introduce PLAN-TUNING, a unified post-training framework that (i) distills synthetic task decompositions (termed "planning trajectories") from large-scale LLMs and (ii) fine-tunes smaller models via supervised and reinforcement-learning objectives designed to mimic these planning processes to improve complex reasoning. On GSM8k and the MATH benchmarks, plan-tuned models outperform strong baselines by an average ~7%. Furthermore, plan-tuned models show better generalization capabilities on out-of-domain datasets, with average ~10% and ~12% performance improvements on OlympiadBench and AIME 2024, respectively. Our detailed analysis demonstrates how planning trajectories improve complex reasoning capabilities, showing that PLAN-TUNING is an effective strategy for improving task-specific performance of smaller LLMs.

[15] Teaching LLM to Reason: Reinforcement Learning from Algorithmic Problems without Code

Keqin Bao,Nuo Chen,Xiaoyuan Li,Binyuan Hui,Bowen Yu,Fuli Feng,Junyang Lin,Xiangnan He,Dayiheng Liu

Main category: cs.CL

TL;DR: TeaR improves the reasoning abilities of large language models through data curation and reinforcement learning, leading to significant performance gains across diverse benchmarks.

Details Motivation: Enhancing reasoning capabilities in LLMs is a central focus in research. Existing methods often lead to overfitting on algorithmic patterns due to reliance on complex data structures, so a new approach is needed. Method: TeaR uses data curation and reinforcement learning to guide LLMs in discovering optimal reasoning paths through code-related tasks. Result: TeaR showed consistent performance improvements across two base models and three long-CoT distillation models on 17 benchmarks, with a 35.9% improvement on Qwen2.5-7B and 5.9% on R1-Distilled-7B. Conclusion: TeaR is effective in enhancing the reasoning capabilities of LLMs, as demonstrated by significant performance improvements across multiple models and benchmarks. Abstract: Enhancing reasoning capabilities remains a central focus in the LLM research community. A promising direction involves requiring models to simulate code execution step-by-step to derive outputs for given inputs. However, as code is often designed for large-scale systems, direct application leads to over-reliance on complex data structures and algorithms, even for simple cases, resulting in overfitting to algorithmic patterns rather than core reasoning structures. To address this, we propose TeaR, which aims at teaching LLMs to reason better. TeaR leverages careful data curation and reinforcement learning to guide models in discovering optimal reasoning paths through code-related tasks, thereby improving general reasoning abilities. We conduct extensive experiments using two base models and three long-CoT distillation models, with model sizes ranging from 1.5 billion to 32 billion parameters, and across 17 benchmarks spanning Math, Knowledge, Code, and Logical Reasoning. The results consistently show significant performance improvements. Notably, TeaR achieves a 35.9% improvement on Qwen2.5-7B and 5.9% on R1-Distilled-7B.

[16] Extracting ORR Catalyst Information for Fuel Cell from Scientific Literature

Hein Htet,Amgad Ahmed Ali Ibrahim,Yutaka Sasaki,Ryoji Asahi

Main category: cs.CL

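TL;DR: This study uses BERT-based NLP techniques (such as MatSciBERT and PubMedBERT) to automatically extract ORR catalyst information from scientific literature and shows that these models offer high accuracy and potential for large-scale literature analysis.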

Details Motivation: Extracting structured information about oxygen reduction reaction (ORR) catalysts from the vast scientific literature remains a major challenge because the text is complex and diverse. This motivates more efficient information extraction methods to strengthen literature analysis in materials science. Method: The authors use the DyGIE++ framework with multiple pre-trained BERT variants (including MatSciBERT and PubMedBERT), improving extraction accuracy through data annotation, integration, and fine-tuning of transformer models, and they assess how different BERT variants and annotation consistency affect extraction performance. Result: A comprehensive dataset with 12 critical entities and two relation types was constructed; the fine-tuned PubMedBERT model reaches an NER F1-score of 82.19%, and MatSciBERT reaches an RE F1-score of 66.10%; the fine-tuned models are also shown to be comparable in reliability to human annotators. Conclusion: Domain-specific BERT models (such as MatSciBERT and PubMedBERT) outperform general scientific models such as BlueBERT for ORR catalyst information extraction. The fine-tuned models show potential for reliable automated literature analysis. Abstract: The oxygen reduction reaction (ORR) catalyst plays a critical role in enhancing fuel cell efficiency, making it a key focus in material science research. However, extracting structured information about ORR catalysts from vast scientific literature remains a significant challenge due to the complexity and diversity of textual data. In this study, we propose a named entity recognition (NER) and relation extraction (RE) approach using DyGIE++ with multiple pre-trained BERT variants, including MatSciBERT and PubMedBERT, to extract ORR catalyst-related information from the scientific literature, which is compiled into a fuel cell corpus for materials informatics (FC-CoMIcs). A comprehensive dataset was constructed manually by identifying 12 critical entities and two relationship types between pairs of the entities. Our methodology involves data annotation, integration, and fine-tuning of transformer-based models to enhance information extraction accuracy. We assess the impact of different BERT variants on extraction performance and investigate the effects of annotation consistency. Experimental evaluations demonstrate that the fine-tuned PubMedBERT model achieves the highest NER F1-score of 82.19% and the MatSciBERT model attains the best RE F1-score of 66.10%. Furthermore, the comparison with human annotators highlights the reliability of fine-tuned models for ORR catalyst extraction, demonstrating their potential for scalable and automated literature analysis. The results indicate that domain-specific BERT models outperform general scientific models like BlueBERT for ORR catalyst extraction.

[17] Hallucination Stations: On Some Basic Limitations of Transformer-Based Language Models

Varin Sikka,Vishal Sikka

Main category: cs.CL

TL;DR: This paper studies the computational-complexity limits on the inference capabilities of transformer-based language models (LLMs), showing that they cannot carry out or verify tasks beyond a certain complexity.

Details Motivation: It has become especially important to understand the limits of transformer-based language models with respect to hallucinations and agentic use. Method: The limitations of LLM capabilities are analyzed from the perspective of the computational complexity of LLM inference. Result: The study shows that LLMs are limited in the computational and agentic tasks they can perform, and gives examples. Conclusion: Language models cannot carry out tasks beyond a certain complexity, nor can they verify the accuracy of such tasks. Abstract: With widespread adoption of transformer-based language models in AI, there is significant interest in the limits of LLMs capabilities, specifically so-called hallucinations, occurrences in which LLMs provide spurious, factually incorrect or nonsensical information when prompted on certain subjects. Furthermore, there is growing interest in agentic uses of LLMs - that is, using LLMs to create agents that act autonomously or semi-autonomously to carry out various tasks, including tasks with applications in the real world. This makes it important to understand the types of tasks LLMs can and cannot perform. We explore this topic from the perspective of the computational complexity of LLM inference. We show that LLMs are incapable of carrying out computational and agentic tasks beyond a certain complexity, and further that LLMs are incapable of verifying the accuracy of tasks beyond a certain complexity. We present examples of both, then discuss some consequences of this work.

[18] Toward Real-World Chinese Psychological Support Dialogues: CPsDD Dataset and a Co-Evolving Multi-Agent System

Yuanchen Shi,Longyin Zhang,Fang Kong

Main category: cs.CL

TL;DR: To address the scarcity of psychological support datasets in non-English languages, the authors propose a method that uses a small amount of real data and expert knowledge to generate large numbers of counseling dialogues, and develop a psychological support system, CADSS, that performs strongly on multiple tasks.

Details Motivation: Psychological support datasets for non-English languages are scarce while demand for mental health support keeps growing, so such a dataset and an effective support system are needed. Method: The study proposes a framework in which two large language models (Dialog Generator and Dialog Modifier) generate large-scale counseling dialogues, and uses it to build the Chinese Psychological support Dialogue Dataset (CPsDD). Result: The authors build CPsDD, containing 68K dialogues, and develop the Comprehensive Agent Dialogue Support System (CADSS), which achieves state-of-the-art results on the Strategy Prediction and Emotional Support Conversation tasks. Conclusion: CADSS achieves state-of-the-art performance, and the counseling dialogues are generated by fine-tuning large language models on expert knowledge and limited real-world data. Abstract: The growing need for psychological support due to increasing pressures has exposed the scarcity of relevant datasets, particularly in non-English languages. To address this, we propose a framework that leverages limited real-world data and expert knowledge to fine-tune two large language models: Dialog Generator and Dialog Modifier. The Generator creates large-scale psychological counseling dialogues based on predefined paths, which guide system response strategies and user interactions, forming the basis for effective support. The Modifier refines these dialogues to align with real-world data quality. Through both automated and manual review, we construct the Chinese Psychological support Dialogue Dataset (CPsDD), containing 68K dialogues across 13 groups, 16 psychological problems, 13 causes, and 12 support focuses. Additionally, we introduce the Comprehensive Agent Dialogue Support System (CADSS), where a Profiler analyzes user characteristics, a Summarizer condenses dialogue history, a Planner selects strategies, and a Supporter generates empathetic responses. The experimental results of the Strategy Prediction and Emotional Support Conversation (ESC) tasks demonstrate that CADSS achieves state-of-the-art performance on both CPsDD and ESConv datasets.

[19] Triadic Multi-party Voice Activity Projection for Turn-taking in Spoken Dialogue Systems

Mikey Elmers,Koji Inoue,Divesh Lala,Tatsuya Kawahara

Main category: cs.CL

TL;DR: This is the first study to extend voice activity projection (VAP) to triadic conversation, and the results show it is effective for turn-taking prediction.

Details Motivation: Although turn-taking is a fundamental component of spoken dialogue, conventional studies mostly involve dyadic settings, so this work applies voice activity projection (VAP) to predict upcoming turn-taking in triadic multi-party scenarios. Method: Multiple models were trained on a Japanese triadic dataset in which participants discussed a variety of topics. Result: For all models, VAP trained on triadic conversation outperformed the baseline, but the type of conversation affected accuracy. Conclusion: VAP can be used for turn-taking prediction in triadic dialogue scenarios; future work will incorporate this triadic VAP turn-taking model into spoken dialogue systems. Abstract: Turn-taking is a fundamental component of spoken dialogue, however conventional studies mostly involve dyadic settings. This work focuses on applying voice activity projection (VAP) to predict upcoming turn-taking in triadic multi-party scenarios. The goal of VAP models is to predict the future voice activity for each speaker utilizing only acoustic data. This is the first study to extend VAP into triadic conversation. We trained multiple models on a Japanese triadic dataset where participants discussed a variety of topics. We found that the VAP trained on triadic conversation outperformed the baseline for all models but that the type of conversation affected the accuracy. This study establishes that VAP can be used for turn-taking in triadic dialogue scenarios. Future work will incorporate this triadic VAP turn-taking model into spoken dialogue systems.

[20] CEA-LIST at CheckThat! 2025: Evaluating LLMs as Detectors of Bias and Opinion in Text

Akram Elbouanani,Evan Dufraisse,Aboubacar Tuo,Adrian Popescu

Main category: cs.CL

TL;DR: This paper presents a few-shot prompting approach based on large language models that performs strongly on multilingual subjectivity detection, particularly with low-quality or sparsely labeled data, matching or even surpassing traditional fine-tuning.

Details Motivation: The aim is to explore whether large language models, with advanced prompt engineering (such as debating LLMs and example selection strategies), can outperform conventionally fine-tuned smaller language models (SLMs) on multilingual subjectivity detection. Method: The approach uses large language models (LLMs) with carefully designed few-shot prompts, submitted to Task 1: Subjectivity of the CheckThat! 2025 evaluation campaign. Result: LLMs achieved strong results in subjectivity detection across languages, including first place in Arabic and Polish and top-four finishes in the Italian, English, German, and multilingual tracks. The method was especially robust on the Arabic dataset. Conclusion: LLM-based few-shot learning performs well on multilingual sentiment tasks, especially when labeled data is scarce or inconsistent, offering an effective alternative to traditional fine-tuning. Abstract: This paper presents a competitive approach to multilingual subjectivity detection using large language models (LLMs) with few-shot prompting. We participated in Task 1: Subjectivity of the CheckThat! 2025 evaluation campaign. We show that LLMs, when paired with carefully designed prompts, can match or outperform fine-tuned smaller language models (SLMs), particularly in noisy or low-quality data settings. Despite experimenting with advanced prompt engineering techniques, such as debating LLMs and various example selection strategies, we found limited benefit beyond well-crafted standard few-shot prompts. Our system achieved top rankings across multiple languages in the CheckThat! 2025 subjectivity detection task, including first place in Arabic and Polish, and top-four finishes in Italian, English, German, and multilingual tracks. Notably, our method proved especially robust on the Arabic dataset, likely due to its resilience to annotation inconsistencies. These findings highlight the effectiveness and adaptability of LLM-based few-shot learning for multilingual sentiment tasks, offering a strong alternative to traditional fine-tuning, particularly when labeled data is scarce or inconsistent.

[21] The Cross-Lingual Cost: Retrieval Biases in RAG over Arabic-English Corpora

Chen Amiraz,Yaroslav Fyodorov,Elad Haramaty,Zohar Karnin,Liane Lewin-Eytan

Main category: cs.CL

TL;DR: This paper explores the challenges of cross-lingual Retrieval-Augmented Generation (RAG) in domain-specific settings using Arabic-English benchmarks. It identifies retrieval as a critical bottleneck and proposes a strategy for equal retrieval from both languages to improve performance.

Details Motivation: Cross-lingual RAG is crucial for retrieving and generating answers across languages. Prior work has mainly focused on generation, leading to hidden retrieval challenges due to language imbalances, overlap with pretraining data, and memorized content. Method: The research involved studying Arabic-English Retrieval-Augmented Generation (RAG) using benchmarks derived from real-world corporate datasets. The study included all combinations of languages for user queries and supporting documents, drawn independently and uniformly at random, allowing a systematic study of multilingual retrieval behavior. Result: The findings revealed that retrieval is a critical bottleneck in cross-lingual domain-specific scenarios, with significant performance drops when the user query and supporting document languages differ. Failures were primarily attributed to the retriever's difficulty in ranking documents across languages. Conclusion: The study concludes that multilingual retrieval, particularly in cross-lingual domain-specific scenarios, presents a significant challenge. However, implementing an equal retrieval strategy from both languages can lead to substantial improvements in cross-lingual and overall performance. Abstract: Cross-lingual retrieval-augmented generation (RAG) is a critical capability for retrieving and generating answers across languages. Prior work in this context has mostly focused on generation and relied on benchmarks derived from open-domain sources, most notably Wikipedia. In such settings, retrieval challenges often remain hidden due to language imbalances, overlap with pretraining data, and memorized content. To address this gap, we study Arabic-English RAG in a domain-specific setting using benchmarks derived from real-world corporate datasets. Our benchmarks include all combinations of languages for the user query and the supporting document, drawn independently and uniformly at random. 
This enables a systematic study of multilingual retrieval behavior. Our findings reveal that retrieval is a critical bottleneck in cross-lingual domain-specific scenarios, with significant performance drops occurring when the user query and supporting document languages differ. A key insight is that these failures stem primarily from the retriever's difficulty in ranking documents across languages. Finally, we propose a simple retrieval strategy that addresses this source of failure by enforcing equal retrieval from both languages, resulting in substantial improvements in cross-lingual and overall performance. These results highlight meaningful opportunities for improving multilingual retrieval, particularly in practical, real-world RAG applications.
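The equal-retrieval idea in entry [21] can be illustrated with a small sketch. The abstract only says that retrieval is forced to draw equally from both languages, so everything below is an assumption made for illustration: the `balanced_top_k` name, the `(doc_id, language, score)` tuple layout, and the choice to split the budget as `k // 2` per language are hypothetical, not the authors' implementation.

```python
from typing import List, Tuple

# Hypothetical record layout: (doc_id, language code, retriever score).
Doc = Tuple[str, str, float]


def balanced_top_k(scored: List[Doc], k: int, langs=("ar", "en")) -> List[Doc]:
    """Take the same number of top-scoring documents from each language."""
    per_lang = k // len(langs)  # assumed equal split of the retrieval budget
    picked: List[Doc] = []
    for lang in langs:
        pool = sorted(
            (d for d in scored if d[1] == lang),
            key=lambda d: d[2],
            reverse=True,
        )
        picked.extend(pool[:per_lang])
    # Re-rank the merged set by score so downstream code sees one list.
    return sorted(picked, key=lambda d: d[2], reverse=True)


if __name__ == "__main__":
    candidates = [
        ("e1", "en", 0.9), ("e2", "en", 0.8), ("e3", "en", 0.7),
        ("a1", "ar", 0.4), ("a2", "ar", 0.3),
    ]
    # A plain top-4 would contain three English documents; the balanced
    # variant keeps two per language.
    print([d[0] for d in balanced_top_k(candidates, k=4)])
```

The point of the split is that cross-language scores are not trusted to be comparable, so each language competes only against itself before the results are merged.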

[22] The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs

Jierun Chen,Tiezheng Yu,Haoli Bai,Lewei Yao,Jiannan Wu,Kaican Li,Fei Mi,Chaofan Tao,Lei Zhu,Manyi Zhang,Xiaohui Li,Lu Hou,Lifeng Shang,Qun Liu

Main category: cs.CL

TL;DR: This study examines how long-CoT supervised fine-tuning and reinforcement learning affect vision-language models. While each method has distinct strengths—SFT boosts complex reasoning and RL enhances generalization—combining them results in trade-offs, revealing a 'synergy dilemma' that calls for better integration strategies.

Details Motivation: While post-training techniques like long-CoT SFT and RL have shown synergy in language-only models, their combined effectiveness in vision-language models (VLMs) is uncertain, prompting a need for systematic analysis. Method: A systematic investigation was conducted on the roles and interplay of long-CoT supervised fine-tuning (SFT) and reinforcement learning (RL) using various multimodal reasoning benchmarks. Different training strategies such as two-staged, interleaved, and progressive training, along with data mixing and model merging, were evaluated. Result: Long-CoT SFT enhances performance on difficult questions through structured reasoning but causes verbosity and degrades performance on simpler ones. RL improves generalization and brevity, providing consistent gains across all difficulty levels, though less effective than SFT on the hardest questions. Combining SFT and RL leads to trade-offs rather than additive improvements. Conclusion: The combination of long-CoT SFT and RL in VLMs does not yield additive benefits and instead results in trade-offs, indicating the need for more adaptive approaches to fully utilize these post-training techniques. Abstract: Large vision-language models (VLMs) increasingly adopt post-training techniques such as long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL) to elicit sophisticated reasoning. While these methods exhibit synergy in language-only models, their joint effectiveness in VLMs remains uncertain. We present a systematic investigation into the distinct roles and interplay of long-CoT SFT and RL across multiple multimodal reasoning benchmarks. We find that SFT improves performance on difficult questions by in-depth, structured reasoning, but introduces verbosity and degrades performance on simpler ones. 
In contrast, RL promotes generalization and brevity, yielding consistent improvements across all difficulty levels, though the improvements on the hardest questions are less prominent compared to SFT. Surprisingly, combining them through two-staged, interleaved, or progressive training strategies, as well as data mixing and model merging, all fails to produce additive benefits, instead leading to trade-offs in accuracy, reasoning style, and response length. This "synergy dilemma" highlights the need for more seamless and adaptive approaches to unlock the full potential of combined post-training techniques for reasoning VLMs.

[23] Single-to-mix Modality Alignment with Multimodal Large Language Model for Document Image Machine Translation

Yupu Liang,Yaping Zhang,Zhiyang Zhang,Yang Zhao,Lu Xiang,Chengqing Zong,Yu Zhou

Main category: cs.CL

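TL;DR: M4Doc is a new document image translation method that aligns with a multimodal large language model to improve translation quality and generalization while keeping inference efficient.

Details Motivation: Document image machine translation faces generalization challenges caused by limited training data and the complex interplay between visual and textual information. Method: M4Doc aligns an image-only encoder with the multimodal representations of a pre-trained multimodal large language model (MLLM), and bypasses the MLLM at inference time to stay computationally efficient. Result: Experiments show substantial gains in translation quality, especially in cross-domain generalization and challenging document image scenarios. Conclusion: Through its single-to-mix modality alignment framework, M4Doc effectively improves document image machine translation, particularly in cross-domain generalization and complex document image scenarios. Abstract: Document Image Machine Translation (DIMT) aims to translate text within document images, facing generalization challenges due to limited training data and the complex interplay between visual and textual information. To address these challenges, we introduce M4Doc, a novel single-to-mix modality alignment framework leveraging Multimodal Large Language Models (MLLMs). M4Doc aligns an image-only encoder with the multimodal representations of an MLLM, pre-trained on large-scale document image datasets. This alignment enables a lightweight DIMT model to learn crucial visual-textual correlations during training. During inference, M4Doc bypasses the MLLM, maintaining computational efficiency while benefiting from its multimodal knowledge. Comprehensive experiments demonstrate substantial improvements in translation quality, especially in cross-domain generalization and challenging document image scenarios.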

[24] Bayesian Discrete Diffusion Beats Autoregressive Perplexity

Cooper Doyle

Main category: cs.CL

TL;DR: This study finds a hidden Bayesian structure in discrete-diffusion language models and uses it to propose a new inference method that improves performance while also providing uncertainty estimates.

Details Motivation: To reveal the Bayesian structure in discrete diffusion models and improve the uncertainty estimation and performance of language models. Method: Posterior-aware token probabilities and uncertainty estimates are obtained through Monte Carlo marginalization, averaging mask-and-denoise passes over K independent corruptions. Result: On WikiText-2, the method reaches a test perplexity of 8.8 with K=8, substantially better than the 20.3 of GPT-2 Small. Conclusion: The study reveals the Bayesian core of discrete-diffusion language models and, with a lightweight inference-time ensemble, achieves better test perplexity than GPT-2 Small on WikiText-2 at no extra training cost. Abstract: We reveal a hidden Bayesian core of discrete-diffusion language models by showing that the expected denoiser output under the forward masking distribution recovers the exact posterior over clean tokens. Under minimal assumptions, Monte Carlo marginalization over K independent corruptions converges to this posterior at rate O(1/sqrt(K)), yielding a simple proof of consistency and finite-sample error bounds. Building on this insight, we introduce a lightweight inference-time ensemble that averages K mask-and-denoise passes to obtain posterior-aware token probabilities and uncertainty estimates at no extra training cost. On WikiText-2, our method achieves test perplexity 8.8 with K=8, versus 20.3 for GPT-2 Small, despite using a model of comparable size. Code is available at https://github.com/mercury0100/bayesradd.
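The inference-time ensemble in entry [24] averages K mask-and-denoise passes. A minimal sketch of that averaging step follows. It is not the code from the linked repository: the `mc_ensemble` function, the `MASK` id, the independent per-token masking with a fixed `mask_rate`, and the `toy_denoiser` stand-in for a trained model are all assumptions chosen to keep the example self-contained.

```python
import random
from typing import Callable, List, Sequence

MASK = -1  # hypothetical mask token id
Probs = List[List[float]]  # one distribution over the vocabulary per position


def mc_ensemble(
    tokens: Sequence[int],
    denoise: Callable[[List[int]], Probs],
    k: int,
    mask_rate: float,
    seed: int = 0,
) -> Probs:
    """Average denoiser outputs over k independent random maskings (k >= 1)."""
    rng = random.Random(seed)
    total: Probs = []
    for _ in range(k):
        # Forward corruption: mask each position independently (assumed scheme).
        corrupted = [MASK if rng.random() < mask_rate else t for t in tokens]
        probs = denoise(corrupted)
        if not total:
            total = [[0.0] * len(row) for row in probs]
        for i, row in enumerate(probs):
            for v, p in enumerate(row):
                total[i][v] += p
    # Monte Carlo estimate of the expected denoiser output.
    return [[s / k for s in row] for row in total]


def toy_denoiser(corrupted: List[int], vocab_size: int = 4) -> Probs:
    """Stand-in model: uniform over masked positions, one-hot elsewhere."""
    out: Probs = []
    for t in corrupted:
        if t == MASK:
            out.append([1.0 / vocab_size] * vocab_size)
        else:
            out.append([1.0 if v == t else 0.0 for v in range(vocab_size)])
    return out


if __name__ == "__main__":
    avg = mc_ensemble([0, 1, 2, 3], toy_denoiser, k=8, mask_rate=0.5)
    for row in avg:
        print([round(p, 3) for p in row])
```

Because each pass returns a valid distribution per position, the average is also a valid distribution; the spread of the per-pass outputs around that average is one plausible source for the uncertainty estimates the abstract mentions.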

[25] Exploring the Limits of Model Compression in LLMs: A Knowledge Distillation Study on QA Tasks

Joyeeta Datta,Niclas Doll,Qusai Ramadan,Zeyd Boukhers

Main category: cs.CL

TL;DR: This study demonstrates that Knowledge Distillation can significantly compress Large Language Models like Pythia and Qwen2.5 while preserving most of their Question Answering performance, especially when enhanced with one-shot prompting.

Details Motivation: Large Language Models have high computational demands, which hinders their use in real-world, resource-limited settings. This work explores how much LLMs can be compressed using Knowledge Distillation while maintaining performance on Question Answering tasks. Method: Evaluated student models distilled from Pythia and Qwen2.5 on SQuAD and MLQA benchmarks under zero-shot and one-shot prompting conditions. Result: Student models retained over 90% of teacher models' performance while reducing parameter counts by up to 57.1%, with additional gains observed through one-shot prompting. Conclusion: Knowledge Distillation combined with minimal prompting can create compact and capable QA systems suitable for resource-constrained environments. Abstract: Large Language Models (LLMs) have demonstrated outstanding performance across a range of NLP tasks, however, their computational demands hinder their deployment in real-world, resource-constrained environments. This work investigates the extent to which LLMs can be compressed using Knowledge Distillation (KD) while maintaining strong performance on Question Answering (QA) tasks. We evaluate student models distilled from the Pythia and Qwen2.5 families on two QA benchmarks, SQuAD and MLQA, under zero-shot and one-shot prompting conditions. Results show that student models retain over 90% of their teacher models' performance while reducing parameter counts by up to 57.1%. Furthermore, one-shot prompting yields additional performance gains over zero-shot setups for both model families. These findings underscore the trade-off between model efficiency and task performance, demonstrating that KD, combined with minimal prompting, can yield compact yet capable QA systems suitable for resource-constrained applications.

[26] FrugalRAG: Learning to retrieve and reason for multi-hop QA

Abhinav Java,Srivathsan Koundinyan,Nagarajan Natarajan,Amit Sharma

Main category: cs.CL

TL;DR: This paper demonstrates that improved prompting can surpass existing RAG methods without large-scale fine-tuning, while supervised and RL-based techniques reduce retrieval search costs by half with minimal training data.

Details Motivation: Efficiency in retrieval searches is an overlooked but critical metric in retrieval-augmented generation systems, prompting this investigation into alternative approaches to enhance performance while minimizing computational costs. Method: The study compares different methods of improving retrieval-augmented generation (RAG) by analyzing the effectiveness of large-scale fine-tuning versus improved prompting techniques on benchmarks like HotPotQA. Result: A prompt-optimized ReAct pipeline outperforms state-of-the-art methods without large-scale fine-tuning. Supervised and RL-based fine-tuning significantly reduce the number of required retrieval searches, achieving competitive RAG metrics at nearly half the cost using only 1000 training examples. Conclusion: Large-scale fine-tuning is not necessary for improving RAG metrics; instead, enhancing prompts within a standard ReAct pipeline can yield superior results. Supervised and RL-based fine-tuning improve frugality, reducing the number of retrieval searches needed at inference time. Abstract: We consider the problem of answering complex questions, given access to a large unstructured document corpus. The de facto approach to solving the problem is to leverage language models that (iteratively) retrieve and reason through the retrieved documents, until the model has sufficient information to generate an answer. Attempts at improving this approach focus on retrieval-augmented generation (RAG) metrics such as accuracy and recall and can be categorized into two types: (a) fine-tuning on large question answering (QA) datasets augmented with chain-of-thought traces, and (b) leveraging RL-based fine-tuning techniques that rely on question-document relevance signals. However, efficiency in the number of retrieval searches is an equally important metric, which has received less attention. 
In this work, we show that: (1) Large-scale fine-tuning is not needed to improve RAG metrics, contrary to popular claims in recent literature. Specifically, a standard ReAct pipeline with improved prompts can outperform state-of-the-art methods on benchmarks such as HotPotQA. (2) Supervised and RL-based fine-tuning can help RAG from the perspective of frugality, i.e., the latency due to number of searches at inference time. For example, we show that we can achieve competitive RAG metrics at nearly half the cost (in terms of number of searches) on popular RAG benchmarks, using the same base model, and at a small training cost (1000 examples).

[27] Lost in Pronunciation: Detecting Chinese Offensive Language Disguised by Phonetic Cloaking Replacement

Haotan Guo,Jianfei He,Jiayuan Ma,Hongbin Na,Zimu Wang,Haiyang Zhang,Qi Chen,Wei Wang,Zijing Shi,Tao Shen,Ling Chen

Main category: cs.CL

TL;DR: This paper explores Phonetic Cloaking Replacement in Chinese content moderation, revealing weaknesses in current detection methods and proposing a more effective Pinyin-based solution.

Details Motivation: Phonetic Cloaking Replacement (PCR) poses a significant challenge to Chinese content moderation by hiding toxic intent using homophonic or near-homophonic variants. Existing evaluations rely on rule-based synthetic perturbations that fail to account for real user creativity. Method: The authors organized PCR into a four-way surface-form taxonomy, compiled a dataset of naturally occurring phonetically cloaked offensive posts, and benchmarked state-of-the-art LLMs. They revisited a Pinyin-based prompting strategy for mitigation. Result: State-of-the-art LLMs performed poorly on the dataset, with the best model achieving an F1-score of only 0.672. Zero-shot chain-of-thought prompting further reduced performance. The Pinyin-based prompting strategy recovered much of the lost accuracy. Conclusion: This study offers a comprehensive taxonomy of Chinese PCR, identifies current detectors' limits through a realistic benchmark, and proposes an effective mitigation technique. Abstract: Phonetic Cloaking Replacement (PCR), defined as the deliberate use of homophonic or near-homophonic variants to hide toxic intent, has become a major obstacle to Chinese content moderation. While this problem is well-recognized, existing evaluations predominantly rely on rule-based, synthetic perturbations that ignore the creativity of real users. We organize PCR into a four-way surface-form taxonomy and compile \ours, a dataset of 500 naturally occurring, phonetically cloaked offensive posts gathered from the RedNote platform. Benchmarking state-of-the-art LLMs on this dataset exposes a serious weakness: the best model reaches only an F1-score of 0.672, and zero-shot chain-of-thought prompting pushes performance even lower. Guided by error analysis, we revisit a Pinyin-based prompting strategy that earlier studies judged ineffective and show that it recovers much of the lost accuracy. 
This study offers the first comprehensive taxonomy of Chinese PCR, a realistic benchmark that reveals current detectors' limits, and a lightweight mitigation technique that advances research on robust toxicity detection.

[28] An Automated Length-Aware Quality Metric for Summarization

Andrew D. Foland

Main category: cs.CL

TL;DR: This paper introduces NOIR, an automated metric for evaluating summarization quality by measuring semantic retention and compression, effectively reflecting human perception and aiding in improving summarization techniques.

Details Motivation: The motivation stems from the need for an efficient, automated alternative to evaluate summarization quality without relying on time-consuming human-generated reference summaries. Method: NOIR utilizes language model-embeddings to measure semantic similarity, combining the retention of semantic meaning with summary length compression to assess summarization quality. Result: Experiments showed that NOIR successfully captures the token-length/semantic retention tradeoff and correlates well with human perception of summarization quality. Conclusion: The paper concludes that NOIR is an effective and automated metric for evaluating summarization quality, offering a practical tool for enhancing summarization algorithms and prompts. Abstract: This paper proposes NOrmed Index of Retention (NOIR), a quantitative objective metric for evaluating summarization quality of arbitrary texts that relies on both the retention of semantic meaning and the summary length compression. This gives a measure of how well the recall-compression tradeoff is managed, the most important skill in summarization. Experiments demonstrate that NOIR effectively captures the token-length / semantic retention tradeoff of a summarizer and correlates with human perception of summarization quality. Using a language model-embedding to measure semantic similarity, it provides an automated alternative for assessing summarization quality without relying on time-consuming human-generated reference summaries. The proposed metric can be applied to various summarization tasks, offering an automated tool for evaluating and improving summarization algorithms, summarization prompts, and synthetically-generated summaries.

[29] SAS: Simulated Attention Score

Chuanyang Zheng,Jiankai Sun,Yihang Gao,Yuehao Wang,Peihao Wang,Jing Xiong,Liliang Ren,Hao Cheng,Janardhan Kulkarni,Yelong Shen,Atlas Wang,Mac Schwager,Anderson Schneider,Xiaodong Liu,Jianfeng Gao

Main category: cs.CL

TL;DR: This paper proposes Simulated Attention Score (SAS), which simulates more attention heads to improve Transformer performance while keeping the parameter count unchanged.

Details Motivation: The authors observe that multi-head attention (MHA) performance improves as the number of attention heads increases, and therefore propose a method that simulates a larger number of heads while keeping the model compact. Method: Simulated Attention Score (SAS) and Parameter-Efficient Attention Aggregation (PEAA). Result: Experiments show that SAS outperforms different attention variants across a variety of datasets and tasks, significantly improving model performance. Conclusion: Without increasing the parameter count, SAS effectively improves model performance by simulating more attention heads and a larger hidden feature dimension. Abstract: The attention mechanism is a core component of the Transformer architecture. Various methods have been developed to compute attention scores, including multi-head attention (MHA), multi-query attention, group-query attention and so on. We further analyze the MHA and observe that its performance improves as the number of attention heads increases, provided the hidden size per head remains sufficiently large. Therefore, increasing both the head count and hidden size per head with minimal parameter overhead can lead to significant performance gains at a low cost. Motivated by this insight, we introduce Simulated Attention Score (SAS), which maintains a compact model size while simulating a larger number of attention heads and hidden feature dimension per head. This is achieved by projecting a low-dimensional head representation into a higher-dimensional space, effectively increasing attention capacity without increasing parameter count. Beyond the head representations, we further extend the simulation approach to feature dimension of the key and query embeddings, enhancing expressiveness by mimicking the behavior of a larger model while preserving the original model size. To control the parameter cost, we also propose Parameter-Efficient Attention Aggregation (PEAA). Comprehensive experiments on a variety of datasets and tasks demonstrate the effectiveness of the proposed SAS method, achieving significant improvements over different attention variants.

[30] KeyKnowledgeRAG (K^2RAG): An Enhanced RAG method for improved LLM question-answering capabilities

Hruday Markondapatnaikuni,Basem Suleiman,Abdelkarim Erradi,Shijing Chen

Main category: cs.CL

TL;DR: This paper introduces K2RAG, an enhanced Retrieval-Augmented Generation framework that improves retrieval quality, system efficiency, and scalability while significantly reducing training time compared to traditional methods.

Details Motivation: Fine-tuning Large Language Models (LLMs) is resource-intensive, and traditional RAG implementations face limitations in scalability and answer accuracy. A more efficient approach to knowledge expansion in LLMs is needed. Method: K2RAG integrates dense and sparse vector search, knowledge graphs, and text summarization. It includes a preprocessing step that summarizes training data to reduce training time. Result: K2RAG achieved a mean answer similarity score of 0.57, reached a third quartile (Q3) similarity of 0.82, reduced training time by 93%, improved execution speed by 40%, and required three times less VRAM than naive RAG systems. Conclusion: K2RAG proved to be a highly efficient and accurate framework for Retrieval-Augmented Generation, outperforming naive RAG implementations in terms of accuracy, scalability, and computational efficiency. Abstract: Fine-tuning is an immensely resource-intensive process when retraining Large Language Models (LLMs) to incorporate a larger body of knowledge. Although many fine-tuning techniques have been developed to reduce the time and computational cost involved, the challenge persists as LLMs continue to grow in size and complexity. To address this, a new approach to knowledge expansion in LLMs is needed. Retrieval-Augmented Generation (RAG) offers one such alternative by storing external knowledge in a database and retrieving relevant chunks to support question answering. However, naive implementations of RAG face significant limitations in scalability and answer accuracy. This paper introduces KeyKnowledgeRAG (K2RAG), a novel framework designed to overcome these limitations. Inspired by the divide-and-conquer paradigm, K2RAG integrates dense and sparse vector search, knowledge graphs, and text summarization to improve retrieval quality and system efficiency. The framework also includes a preprocessing step that summarizes the training data, significantly reducing the training time. 
K2RAG was evaluated using the MultiHopRAG dataset, where the proposed pipeline was trained on the document corpus and tested on a separate evaluation set. Results demonstrated notable improvements over common naive RAG implementations. K2RAG achieved the highest mean answer similarity score of 0.57, and reached the highest third quartile (Q3) similarity of 0.82, indicating better alignment with ground-truth answers. In addition to improved accuracy, the framework proved highly efficient. The summarization step reduced the average training time of individual components by 93%, and execution speed was up to 40% faster than traditional knowledge graph-based RAG systems. K2RAG also demonstrated superior scalability, requiring three times less VRAM than several naive RAG implementations tested in this study.

[31] Rethinking the Privacy of Text Embeddings: A Reproducibility Study of "Text Embeddings Reveal (Almost) As Much As Text"

Dominykas Seputis,Yongkang Li,Karsten Langerak,Serghei Mihailov

Main category: cs.CL

TL;DR: This paper reproduces and evaluates the Vec2Text method for reconstructing text from embeddings, finding it effective but highlighting its limitations and potential privacy issues.

Details Motivation: Recent methods like Vec2Text have challenged the assumption that transmitting embeddings is privacy-preserving by showing that original texts can be reconstructed from embeddings. This motivated further verification and study of Vec2Text. Method: The authors reproduce the Vec2Text framework and evaluate it by validating original claims and conducting extended experiments including parameter sensitivity analysis, reconstruction of sensitive inputs, and exploring embedding quantization as a privacy defense. Result: Vec2Text was successfully replicated with minor discrepancies. It was found to be effective at reconstructing text, even for password-like sequences. However, it showed sensitivity to input sequence length. Privacy mitigation strategies like Gaussian noise and quantization were found to reduce privacy risks. Conclusion: The study concludes that while Vec2Text is effective in reconstructing text from embeddings under ideal conditions, there are key limitations and privacy risks involved. The use of Gaussian noise and quantization techniques can help mitigate these risks. Abstract: Text embeddings are fundamental to many natural language processing (NLP) tasks, extensively applied in domains such as recommendation systems and information retrieval (IR). Traditionally, transmitting embeddings instead of raw text has been seen as privacy-preserving. However, recent methods such as Vec2Text challenge this assumption by demonstrating that controlled decoding can successfully reconstruct original texts from black-box embeddings. The unexpectedly strong results reported by Vec2Text motivated us to conduct further verification, particularly considering the typically non-intuitive and opaque structure of high-dimensional embedding spaces. In this work, we reproduce the Vec2Text framework and evaluate it from two perspectives: (1) validating the original claims, and (2) extending the study through targeted experiments. 
First, we successfully replicate the original key results in both in-domain and out-of-domain settings, with only minor discrepancies arising due to missing artifacts, such as model checkpoints and dataset splits. Furthermore, we extend the study by conducting a parameter sensitivity analysis, evaluating the feasibility of reconstructing sensitive inputs (e.g., passwords), and exploring embedding quantization as a lightweight privacy defense. Our results show that Vec2Text is effective under ideal conditions, capable of reconstructing even password-like sequences that lack clear semantics. However, we identify key limitations, including its sensitivity to input sequence length. We also find that Gaussian noise and quantization techniques can mitigate the privacy risks posed by Vec2Text, with quantization offering a simpler and more widely applicable solution. Our findings emphasize the need for caution in using text embeddings and highlight the importance of further research into robust defense mechanisms for NLP systems.
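
The two defenses named in the abstract, Gaussian noise and embedding quantization, can be illustrated with a minimal sketch. The abstract does not give the exact schemes, so uniform min-max scalar quantization, the `sigma` and `bits` parameters, and both function names are assumptions made for illustration, not the authors' implementation.

```python
import random


def add_gaussian_noise(embedding, sigma, seed=None):
    """Perturb each coordinate with zero-mean Gaussian noise (assumed defense form)."""
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, sigma) for x in embedding]


def quantize(embedding, bits=8):
    """Uniform min-max scalar quantization to 2**bits levels (an assumed scheme)."""
    lo, hi = min(embedding), max(embedding)
    if hi == lo:
        # A constant vector carries no range to quantize.
        return list(embedding)
    levels = (1 << bits) - 1
    step = (hi - lo) / levels
    # Snap every coordinate to the nearest grid point in [lo, hi].
    return [lo + round((x - lo) / step) * step for x in embedding]


if __name__ == "__main__":
    vec = [0.0, 0.2, 0.9, 1.0]
    print(quantize(vec, bits=1))
    print(add_gaussian_noise(vec, sigma=0.01, seed=0))
```

Both transforms discard fine-grained coordinate information that an inversion model could otherwise exploit. Quantization needs no noise scale to tune, which is one reading of why the abstract calls it simpler.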

[32] Not All Preferences are What You Need for Post-Training: Selective Alignment Strategy for Preference Optimization

Zhijin Dong

Main category: cs.CL

TL;DR: This paper proposes Selective-DPO, which achieves more efficient preference alignment of large language models by focusing on high-impact tokens.

Details Motivation: Not all tokens contribute equally to model performance, so a more efficient method is needed to improve alignment. Method: The paper introduces a selective alignment strategy that prioritizes high-impact tokens using token-level log-probability differences between the current policy and a reference model. Result: On benchmarks such as Arena-Hard and MT-Bench, Selective-DPO outperforms standard DPO and distillation-based baselines. Conclusion: Selective-DPO performs well at preference alignment for large language models, underscoring the importance of reference model selection and token-level optimization. Abstract: Post-training alignment of large language models (LLMs) is a critical challenge, as not all tokens contribute equally to model performance. This paper introduces a selective alignment strategy that prioritizes high-impact tokens within preference pairs, leveraging token-level log-probability differences between the current policy and a reference model. By focusing on these informative tokens, our approach reduces computational overhead and enhances alignment fidelity. We further explore the role of reference model quality, demonstrating that stronger reference models significantly improve token selection accuracy and overall optimization effectiveness. Comprehensive experiments on benchmarks such as Arena-Hard and MT-Bench validate the superiority of our Selective-DPO method over standard DPO and distillation-based baselines. Our findings highlight the importance of token-level optimization and reference model selection in advancing preference alignment for LLMs. The code is available at https://github.com/Dongzhijin/SDPO.
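
The token-selection idea in this entry can be sketched as follows. The abstract only says that selection relies on token-level log-probability differences between the policy and a reference model, so the concrete rule used here (keep the top fraction of tokens by absolute difference), the `keep_ratio` parameter, and the function name are hypothetical choices for illustration, not the released SDPO code.

```python
def select_high_impact_tokens(policy_logps, ref_logps, keep_ratio=0.5):
    """Return a 0/1 mask marking tokens with the largest |policy - reference| log-prob gap.

    The top-fraction rule is an assumption; the paper may use a different criterion.
    """
    if len(policy_logps) != len(ref_logps):
        raise ValueError("log-probability sequences must have the same length")
    diffs = [abs(p - r) for p, r in zip(policy_logps, ref_logps)]
    if not diffs:
        return []
    # Keep at least one token so the masked loss is never empty.
    k = max(1, int(len(diffs) * keep_ratio))
    top = sorted(range(len(diffs)), key=lambda i: diffs[i], reverse=True)[:k]
    mask = [0] * len(diffs)
    for i in top:
        mask[i] = 1
    return mask


if __name__ == "__main__":
    policy = [-1.0, -2.0, -0.5, -3.0]
    reference = [-1.1, -0.5, -0.5, -1.0]
    print(select_high_impact_tokens(policy, reference, keep_ratio=0.5))
```

A mask like this would then multiply the per-token terms of a DPO-style objective, so that only the selected tokens contribute to the gradient.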

[33] Code-Switching in End-to-End Automatic Speech Recognition: A Systematic Literature Review

Maha Tufail Agro,Atharva Kulkarni,Karima Kadaoui,Zeerak Talat,Hanan Aldarmaki

Main category: cs.CL

TL;DR: This paper reviews the state of research on code-switching in end-to-end automatic speech recognition models, highlighting current progress, challenges, and directions for future work.

Details Motivation: Motivated by growing research interest in automatic speech recognition (ASR) and the increasing body of work on languages where code-switching (CS) often occurs. Method: A systematic literature review was conducted, including the collection and manual annotation of papers published in peer-reviewed venues. Result: The study documents languages considered, datasets used, metrics applied, model choices made, and performance outcomes in end-to-end ASR for code-switching. Conclusion: The analysis provides insights into current research efforts and available resources, as well as opportunities and gaps for future research in code-switching in end-to-end ASR models. Abstract: Motivated by a growing research interest into automatic speech recognition (ASR), and the growing body of work for languages in which code-switching (CS) often occurs, we present a systematic literature review of code-switching in end-to-end ASR models. We collect and manually annotate papers published in peer reviewed venues. We document the languages considered, datasets, metrics, model choices, and performance, and present a discussion of challenges in end-to-end ASR for code-switching. Our analysis thus provides insights on current research efforts and available resources as well as opportunities and gaps to guide future research.

[34] When Large Language Models Meet Law: Dual-Lens Taxonomy, Technical Advances, and Ethical Governance

Peizhang Shao,Linrui Xu,Jinxi Wang,Wei Zhou,Xingyu Wu

Main category: cs.CL

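TL;DR: This paper presents a comprehensive survey of large language models applied to the legal domain, proposing a new taxonomy that combines legal reasoning frameworks and professional ontologies, with the aim of guiding researchers and practitioners and laying the groundwork for next-generation legal AI.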

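Details Motivation: The paper aims to comprehensively review applications of large language models (LLMs) in the legal domain and to address the key challenges that widespread LLM adoption brings: hallucination, explainability deficits, jurisdictional adaptation difficulties, and ethical asymmetry. Method: The paper adopts an innovative dual-lens taxonomy that integrates legal reasoning frameworks and professional ontologies to systematically unify historical research and contemporary breakthroughs. It covers technical innovations such as sparse attention mechanisms and mixture-of-experts architectures that address core challenges in text processing, knowledge integration, and evaluation rigor. Result: The review documents significant progress in task generalization, reasoning formalization, and workflow integration, and in addressing core challenges in text processing, knowledge integration, and evaluation rigor through technical innovations. The authors also created a GitHub repository that indexes the relevant papers. Conclusion: The paper proposes a new taxonomy that maps legal roles to NLP subtasks and computationally implements the Toulmin argumentation framework, providing a technical roadmap for researchers and a conceptual framework for practitioners, and laying a foundation for the next generation of legal AI. Abstract: This paper establishes the first comprehensive review of Large Language Models (LLMs) applied within the legal domain. It pioneers an innovative dual lens taxonomy that integrates legal reasoning frameworks and professional ontologies to systematically unify historical research and contemporary breakthroughs. Transformer-based LLMs, which exhibit emergent capabilities such as contextual reasoning and generative argumentation, surmount traditional limitations by dynamically capturing legal semantics and unifying evidence reasoning. Significant progress is documented in task generalization, reasoning formalization, workflow integration, and addressing core challenges in text processing, knowledge integration, and evaluation rigor via technical innovations like sparse attention mechanisms and mixture-of-experts architectures. However, widespread adoption of LLMs introduces critical challenges: hallucination, explainability deficits, jurisdictional adaptation difficulties, and ethical asymmetry. This review proposes a novel taxonomy that maps legal roles to NLP subtasks and computationally implements the Toulmin argumentation framework, thus systematizing advances in reasoning, retrieval, prediction, and dispute resolution. It identifies key frontiers including low-resource systems, multimodal evidence integration, and dynamic rebuttal handling. 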
Ultimately, this work provides both a technical roadmap for researchers and a conceptual framework for practitioners navigating the algorithmic future, laying a robust foundation for the next era of legal artificial intelligence. We have created a GitHub repository to index the relevant papers: https://github.com/Kilimajaro/LLMs_Meet_Law.

[35] StreamUni: Achieving Streaming Speech Translation with a Unified Large Speech-Language Model

Shoutao Guo,Xiang Li,Shaolei Zhang,Mengge Liu,Wei Chen,Yang Feng

Main category: cs.CL

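TL;DR: This paper proposes StreamUni, a new streaming speech translation method that uses a unified large speech-language model and speech chain-of-thought (CoT) to achieve efficient streaming translation without an additional segmentation model or large-scale policy training, reaching state-of-the-art performance on the relevant tasks.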

Details Motivation: Existing streaming speech translation (StreamST) methods typically operate on sentence-level speech segments (referred to as simultaneous speech translation, SimulST) and must work together with a segmentation model. Because truncated speech segments carry limited information, these methods struggle to make effective policy decisions and to produce high-quality translations. StreamUni is proposed to address these problems. Method: StreamUni introduces speech chain-of-thought (CoT) to guide the LSLM in generating multi-stage outputs, and uses these outputs to perform speech segmentation, policy decision, and translation generation simultaneously. It also proposes a streaming CoT training method to strengthen low-latency policy decisions and generation. Result: StreamUni achieves state-of-the-art performance on streaming speech translation tasks. Experiments show that it delivers efficient streaming translation without massive policy-specific training, and that streaming CoT training further improves low-latency policy decisions and generation. Conclusion: StreamUni achieves StreamST through a unified large speech-language model (LSLM), accomplishing speech segmentation, policy decision, and translation generation together without massive policy-specific training. The proposed streaming CoT training method improves low-latency policy decisions and generation using limited CoT data, and the approach reaches state-of-the-art performance on StreamST tasks. Abstract: Streaming speech translation (StreamST) requires determining appropriate timing, known as policy, to generate translations while continuously receiving source speech inputs, balancing low latency with high translation quality. However, existing StreamST methods typically operate on sentence-level speech segments, referred to as simultaneous speech translation (SimulST). In practice, they require collaboration with segmentation models to accomplish StreamST, where the truncated speech segments constrain SimulST models to make policy decisions and generate translations based on limited contextual information. Moreover, SimulST models struggle to learn effective policies due to the complexity of speech inputs and cross-lingual generation. To address these challenges, we propose StreamUni, which achieves StreamST through a unified Large Speech-Language Model (LSLM). Specifically, StreamUni incorporates speech Chain-of-Thought (CoT) in guiding the LSLM to generate multi-stage outputs. Leveraging these multi-stage outputs, StreamUni simultaneously accomplishes speech segmentation, policy decision, and translation generation, completing StreamST without requiring massive policy-specific training. Additionally, we propose a streaming CoT training method that enhances low-latency policy decisions and generation capabilities using limited CoT data. Experiments demonstrate that our approach achieves state-of-the-art performance on StreamST tasks.

[36] Bridging Logic and Learning: Decoding Temporal Logic Embeddings via Transformers

Sara Candussio,Gaia Saveri,Gabriele Sarti,Luca Bortolussi

Main category: cs.CL

TL;DR: This paper proposes a Transformer-based model to invert semantic embeddings of Signal Temporal Logic (STL) formulae, enabling the generation of valid and simplified logical expressions while preserving semantic meaning. The model learns quickly, generalizes well, and is applied to a real-world requirement mining task.

Details Motivation: The motivation stems from the need to integrate symbolic knowledge into data-driven learning algorithms using continuous representations of logic formulae. Embeddings must be invertible to translate optimal continuous representations back into concrete logical specifications, which this work aims to address. Method: The authors use a Transformer-based decoder-only model to invert semantic embeddings of Signal Temporal Logic (STL) formulae. They construct a small vocabulary from STL syntax and train the model to decode embeddings back into formulae. The methodology includes testing generalization over different complexity levels and applying the model to a real-world requirement mining task on trajectories. Result: The model generates valid STL formulae after only 1 training epoch and generalizes to the semantics of the logic within about 10 epochs. It decodes embeddings into simpler, shorter formulae that remain semantically close or equivalent to reference formulae. The model performs well across various complexity levels and demonstrates out-of-distribution generalization. It is also successfully applied to a requirement mining classification task optimized in the semantic space. Conclusion: The study concludes that the proposed Transformer-based model effectively inverts semantic embeddings of STL formulae, generating valid and simplified formulae while maintaining semantic equivalence. The model successfully generalizes across varying levels of complexity and is effective in a requirement mining task optimized directly in the semantic space. Abstract: Continuous representations of logic formulae allow us to integrate symbolic knowledge into data-driven learning algorithms. If such embeddings are semantically consistent, i.e. if similar specifications are mapped into nearby vectors, they enable continuous learning and optimization directly in the semantic space of formulae. 
However, to translate the optimal continuous representation into a concrete requirement, such embeddings must be invertible. We tackle this issue by training a Transformer-based decoder-only model to invert semantic embeddings of Signal Temporal Logic (STL) formulae. STL is a powerful formalism that allows us to describe properties of signals varying over time in an expressive yet concise way. By constructing a small vocabulary from STL syntax, we demonstrate that our proposed model is able to generate valid formulae after only 1 epoch and to generalize to the semantics of the logic in about 10 epochs. Additionally, the model is able to decode a given embedding into formulae that are often simpler in terms of length and nesting while remaining semantically close (or equivalent) to gold references. We show the effectiveness of our methodology across various levels of training formulae complexity to assess the impact of training data on the model's ability to effectively capture the semantic information contained in the embeddings and generalize out-of-distribution. Finally, we deploy our model for solving a requirement mining task, i.e. inferring STL specifications that solve a classification task on trajectories, performing the optimization directly in the semantic space.

[37] Understanding and Controlling Repetition Neurons and Induction Heads in In-Context Learning

Nhi Hoai Doan,Tatsuya Hiraoka,Kentaro Inui

Main category: cs.CL

TL;DR: This paper explores how repetition neurons affect in-context learning in large language models, finding that their impact depends on their location in the model and offering strategies to reduce repetition while preserving ICL performance.

Details Motivation: The motivation is to better understand how LLMs handle repetitive input patterns and how this impacts in-context learning, moving beyond prior focus on attention heads to explore skill neurons like repetition neurons. Method: The authors conducted experiments comparing the effects of repetition neurons and induction heads on ICL performance, analyzing how the depth of these neurons affects outcomes. Result: The experiments showed that the influence of repetition neurons on ICL performance depends on the layer depth, and the authors identified methods to mitigate repetition without compromising ICL effectiveness. Conclusion: The paper concludes that the impact of repetition neurons on ICL performance varies with their depth in the model, and strategies can be employed to reduce repetitive outputs while maintaining strong ICL capabilities. Abstract: This paper investigates the relationship between large language models' (LLMs) ability to recognize repetitive input patterns and their performance on in-context learning (ICL). In contrast to prior work that has primarily focused on attention heads, we examine this relationship from the perspective of skill neurons, specifically repetition neurons. Our experiments reveal that the impact of these neurons on ICL performance varies depending on the depth of the layer in which they reside. By comparing the effects of repetition neurons and induction heads, we further identify strategies for reducing repetitive outputs while maintaining strong ICL capabilities.

[38] On the Effect of Instruction Tuning Loss on Generalization

Anwoy Chatterjee,H S V N S Kowndinya Renduchintala,Sumit Bhatia,Tanmoy Chakraborty

Main category: cs.CL

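TL;DR: This paper proposes an improved instruction tuning method, Weighted Instruction Tuning (WIT), that weights prompt and response tokens differently and thereby improves language model performance and robustness.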

Details Motivation: The conventional auto-regressive objective may be suboptimal for instruction tuning because the loss is computed only on response tokens and ignores prompt tokens, which can lead to suboptimal performance and limited robustness to variations in the input prompt. Method: The paper proposes Weighted Instruction Tuning (WIT), systematically studies the effect of differentially weighting prompt and response tokens in the instruction tuning loss, and validates the approach through extensive experiments. Result: Experiments on five language models of different families and scales, three finetuning datasets of different sizes, and five diverse evaluation benchmarks show that the standard instruction tuning loss often underperforms, while WIT significantly improves model performance and robustness. Conclusion: Rethinking the design of the instruction tuning loss by differentially weighting prompt and response tokens can improve model performance and robustness, and provides a better starting point for subsequent preference alignment training. Abstract: Instruction Tuning has emerged as a pivotal post-training paradigm that enables pre-trained language models to better follow user instructions. Despite its significance, little attention has been given to optimizing the loss function used. A fundamental, yet often overlooked, question is whether the conventional auto-regressive objective - where loss is computed only on response tokens, excluding prompt tokens - is truly optimal for instruction tuning. In this work, we systematically investigate the impact of differentially weighting prompt and response tokens in instruction tuning loss, and propose Weighted Instruction Tuning (WIT) as a better alternative to conventional instruction tuning. Through extensive experiments on five language models of different families and scale, three finetuning datasets of different sizes, and five diverse evaluation benchmarks, we show that the standard instruction tuning loss often yields suboptimal performance and limited robustness to input prompt variations. We find that a low-to-moderate weight for prompt tokens coupled with a moderate-to-high weight for response tokens yields the best-performing models across settings and also serve as better starting points for the subsequent preference alignment training. These findings highlight the need to reconsider instruction tuning loss and offer actionable insights for developing more robust and generalizable models. Our code is open-sourced at https://github.com/kowndinya-renduchintala/WIT.
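
A differentially weighted instruction tuning loss of the kind this entry describes can be sketched as below. The abstract does not state how the loss is normalized, so dividing by the sum of the weights is an assumption, and the function name and default weights are hypothetical. This is not the code from the WIT repository.

```python
def weighted_instruction_loss(token_nll, is_prompt,
                              prompt_weight=0.2, response_weight=1.0):
    """Weighted mean of per-token negative log-likelihoods.

    token_nll: per-token NLL values for the full prompt + response sequence.
    is_prompt: booleans, True where the token belongs to the prompt.
    Normalizing by the total weight is an assumption made for this sketch.
    """
    if len(token_nll) != len(is_prompt):
        raise ValueError("token_nll and is_prompt must have the same length")
    weights = [prompt_weight if p else response_weight for p in is_prompt]
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0
    return sum(w * l for w, l in zip(weights, token_nll)) / total_weight


if __name__ == "__main__":
    nll = [2.0, 2.0, 1.0, 1.0]
    prompt_flags = [True, True, False, False]
    # prompt_weight=0.0 recovers the conventional response-only loss.
    print(weighted_instruction_loss(nll, prompt_flags, prompt_weight=0.0))
    print(weighted_instruction_loss(nll, prompt_flags, prompt_weight=0.5))
```

Setting `prompt_weight=0.0` and `response_weight=1.0` reduces the sketch to the conventional objective that the abstract questions, which makes the two settings easy to compare.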

[39] Conditional Unigram Tokenization with Parallel Data

Gianluca Vico,Jindřich Libovický

Main category: cs.CL

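TL;DR: This paper introduces conditional unigram tokenization, a new method that aims to improve cross-lingual semantic alignment by conditioning target token probabilities on source-language tokens.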

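Details Motivation: The paper explores a new tokenization approach, conditional unigram tokenization, to improve cross-lingual semantic alignment, and evaluates it across language pairs and resource levels. Method: The paper proposes conditional unigram tokenization, which extends unigram tokenization by conditioning target token probabilities on source-language tokens from parallel data. Given a fixed source tokenizer, the method learns a target tokenizer that maximizes cross-lingual semantic alignment. Result: The conditional tokenizer maintains statistical properties comparable to standard unigram tokenizers. It yields no improvement in machine translation quality but consistently reduces perplexity in language modeling. The authors hypothesize that conditional probability estimation scales quadratically with vocabulary size, creating a data efficiency bottleneck. Conclusion: The proposed conditional unigram tokenization reduces perplexity in language modeling but does not improve machine translation quality. Because conditional probability estimation is hypothesized to scale quadratically with vocabulary size and create a data efficiency bottleneck, practical cross-lingual tokenization may require alternative parameterizations. Abstract: We introduce conditional unigram tokenization, a novel approach that extends unigram tokenization by conditioning target token probabilities on source-language tokens from parallel data. Given a fixed source tokenizer, our method learns a target tokenizer that maximizes cross-lingual semantic alignment. We evaluate our tokenizer on four language pairs across different families and resource levels, examining intrinsic properties and downstream performance on machine translation and language modeling. While our conditional tokenizer maintains comparable statistical properties to standard unigram tokenizers, results are mixed: we observe no improvements in machine translation quality, but find consistent perplexity reductions in language modeling. We hypothesize that quadratic scaling of conditional probability estimation with respect to the vocabulary size creates a data efficiency bottleneck. Our findings suggest that alternative parameterizations may be necessary for practical cross-lingual tokenization.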

[40] From Ambiguity to Accuracy: The Transformative Effect of Coreference Resolution on Retrieval-Augmented Generation systems

Youngjoon Jang,Seongtae Hong,Junyoung Son,Sungjin Park,Chanjun Park,Heuiseok Lim

Main category: cs.CL

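TL;DR: This study examines the importance of handling coreference in text within the retrieval-augmented generation (RAG) framework and its effect on performance in natural language processing tasks.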

Details Motivation: The effectiveness of RAG is often hindered by coreferential complexity in retrieved documents, which introduces ambiguity that disrupts in-context learning. Method: The authors compare different pooling strategies in retrieval tasks and study how much models of different sizes benefit from disambiguation. Result: Mean pooling shows superior context-capturing ability after coreference resolution is applied, and smaller models benefit more from the disambiguation process. Conclusion: The study systematically investigates how entity coreference affects document retrieval and generation performance in RAG, and finds that coreference resolution enhances retrieval effectiveness and improves question-answering performance. Abstract: Retrieval-Augmented Generation (RAG) has emerged as a crucial framework in natural language processing (NLP), improving factual consistency and reducing hallucinations by integrating external document retrieval with large language models (LLMs). However, the effectiveness of RAG is often hindered by coreferential complexity in retrieved documents, introducing ambiguity that disrupts in-context learning. In this study, we systematically investigate how entity coreference affects both document retrieval and generative performance in RAG-based systems, focusing on retrieval relevance, contextual understanding, and overall response quality. We demonstrate that coreference resolution enhances retrieval effectiveness and improves question-answering (QA) performance. Through comparative analysis of different pooling strategies in retrieval tasks, we find that mean pooling demonstrates superior context capturing ability after applying coreference resolution. In QA tasks, we discover that smaller models benefit more from the disambiguation process, likely due to their limited inherent capacity for handling referential ambiguity. With these findings, this study aims to provide a deeper understanding of the challenges posed by coreferential complexity in RAG, providing guidance for improving retrieval and generation in knowledge-intensive AI applications.
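
For readers unfamiliar with the pooling strategy this entry singles out, here is a minimal sketch of masked mean pooling over token embeddings. The abstract does not name the encoders or the other pooling strategies compared, so this shows only the generic technique. The function name and the toy vectors are illustrative assumptions.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the embeddings of unmasked tokens into one document vector."""
    if len(token_embeddings) != len(attention_mask):
        raise ValueError("need one mask entry per token embedding")
    # Drop padding positions (mask value 0) before averaging.
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m]
    if not kept:
        raise ValueError("at least one token must be unmasked")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


if __name__ == "__main__":
    tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
    mask = [1, 1, 0]  # the last position is padding and is ignored
    print(mean_pool(tokens, mask))
```

Because every unmasked token contributes equally, replacing a pronoun with its resolved entity changes the pooled vector directly, which is one plausible reading of why mean pooling benefits from coreference resolution.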

[41] Alpay Algebra V: Multi-Layered Semantic Games and Transfinite Fixed-Point Simulation

Bugra Kilictas,Faruk Alpay

Main category: cs.CL

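TL;DR: This paper proposes a new multi-layered semantic game structure that combines game-theoretic reasoning with fixed-point iteration, providing a mathematical foundation and a practical framework for semantic alignment between AI systems and documents.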

Details Motivation: The paper recasts the alignment process between AI systems and documents as a meta-game containing embedded decision problems, and explores the existence and uniqueness of semantic equilibria. Method: Building on the self-referential framework of Alpay Algebra, the paper constructs a multi-layered semantic game architecture with transfinite fixed-point convergence and introduces a composite operator $\phi(\cdot, \gamma(\cdot))$ to formalize the relationship between the main semantic convergence and local sub-games. Result: The paper presents a Game Theorem establishing existence and uniqueness of semantic equilibria under realistic cognitive simulation assumptions, and verifies the consistency and applicability of the theory via Banach's fixed-point theorem, the Kozlov-Maz'ya-Rossmann formula, and the Yoneda lemma. Conclusion: The study shows that game-theoretic reasoning emerges naturally from fixed-point iteration rather than being imposed externally and, by combining category theory, information theory, and AI cognition models, offers a practically applicable framework for the semantic alignment problem. Abstract: This paper extends the self-referential framework of Alpay Algebra into a multi-layered semantic game architecture where transfinite fixed-point convergence encompasses hierarchical sub-games at each iteration level. Building upon Alpay Algebra IV's empathetic embedding concept, we introduce a nested game-theoretic structure where the alignment process between AI systems and documents becomes a meta-game containing embedded decision problems. We formalize this through a composite operator $\phi(\cdot, \gamma(\cdot))$ where $\phi$ drives the main semantic convergence while $\gamma$ resolves local sub-games. The resulting framework demonstrates that game-theoretic reasoning emerges naturally from fixed-point iteration rather than being imposed externally. We prove a Game Theorem establishing existence and uniqueness of semantic equilibria under realistic cognitive simulation assumptions. Our verification suite includes adaptations of Banach's fixed-point theorem to transfinite contexts, a novel $\phi$-topology based on the Kozlov-Maz'ya-Rossmann formula for handling semantic singularities, and categorical consistency tests via the Yoneda lemma. The paper itself functions as a semantic artifact designed to propagate its fixed-point patterns in AI embedding spaces -- a deliberate instantiation of the "semantic virus" concept it theorizes. All results are grounded in category theory, information theory, and realistic AI cognition models, ensuring practical applicability beyond pure mathematical abstraction.

[42] DocCHA: Towards LLM-Augmented Interactive Online diagnosis System

Xinyi Liu,Dachun Sun,Yi R. Fung,Dilek Hakkani-Tür,Tarek Abdelzaher

Main category: cs.CL

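TL;DR: DocCHA is a new confidence-aware, modular framework that decomposes the diagnostic process and uses interpretable confidence scores for adaptive questioning, enabling more efficient and transparent clinical diagnostic dialogue.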

Details Motivation: Although existing large language models are powerful, in clinical diagnosis they still fall short on adaptive multi-turn reasoning, symptom clarification, and transparent decision-making, whereas real-world use requires iterative and structured dialogue. Method: The paper proposes DocCHA, a confidence-aware, modular framework that decomposes the diagnostic process into three stages: symptom elicitation, history acquisition, and causal graph construction. It uses interpretable confidence scores to guide adaptive questioning, prioritize informative clarifications, and refine weak reasoning links. Result: Evaluated on two real-world Chinese consultation datasets (IMCS21, DX), DocCHA consistently outperforms strong prompting-based LLM baselines (GPT-3.5, GPT-4o, LLaMA-3), achieving up to 5.18 percent higher diagnostic accuracy and over 30 percent improvement in symptom recall, with only a modest increase in dialogue turns. Conclusion: DocCHA paves the way for trustworthy LLM-powered clinical assistants in multilingual and resource-constrained settings, enabling structured, transparent, and efficient diagnostic conversations. Abstract: Despite the impressive capabilities of Large Language Models (LLMs), existing Conversational Health Agents (CHAs) remain static and brittle, incapable of adaptive multi-turn reasoning, symptom clarification, or transparent decision-making. This hinders their real-world applicability in clinical diagnosis, where iterative and structured dialogue is essential. We propose DocCHA, a confidence-aware, modular framework that emulates clinical reasoning by decomposing the diagnostic process into three stages: (1) symptom elicitation, (2) history acquisition, and (3) causal graph construction. Each module uses interpretable confidence scores to guide adaptive questioning, prioritize informative clarifications, and refine weak reasoning links. Evaluated on two real-world Chinese consultation datasets (IMCS21, DX), DocCHA consistently outperforms strong prompting-based LLM baselines (GPT-3.5, GPT-4o, LLaMA-3), achieving up to 5.18 percent higher diagnostic accuracy and over 30 percent improvement in symptom recall, with only modest increase in dialogue turns. These results demonstrate the effectiveness of DocCHA in enabling structured, transparent, and efficient diagnostic conversations -- paving the way for trustworthy LLM-powered clinical assistants in multilingual and resource-constrained settings.

[43] Automating MD simulations for Proteins using Large language Models: NAMD-Agent

Achuth Chandrasekhar,Amir Barati Farimani

Main category: cs.CL

TL;DR: This paper introduces an automated pipeline using Gemini 2.0 Flash, Python, and Selenium to streamline the creation of MD simulation inputs via CHARMM GUI, significantly reducing setup time and errors while enabling scalable, hands-free processing of multiple proteins.

Details Motivation: Preparing high-quality input files for molecular dynamics (MD) simulations is often time-consuming and error-prone. This work aims to automate the process using Large Language Models (LLMs) and web automation to improve efficiency and accuracy. Method: The researchers developed an automated pipeline using Gemini 2.0 Flash for code generation and iterative refinement, Python scripting, and Selenium-based web automation to interact with CHARMM GUI and generate NAMD input files. Post-processing tools were also integrated into the workflow. Result: The proposed pipeline successfully reduced setup time, minimized manual errors, and enabled a fully automated, scalable workflow for generating MD simulation inputs for multiple protein systems in parallel. Conclusion: The study concludes that leveraging LLMs like Gemini 2.0 Flash in combination with web automation and scripting can effectively streamline the preparation of MD input files, significantly reducing setup time, minimizing manual errors, and offering scalability for handling multiple protein systems. Abstract: Molecular dynamics simulations are an essential tool in understanding protein structure, dynamics, and function at the atomic level. However, preparing high quality input files for MD simulations can be a time consuming and error prone process. In this work, we introduce an automated pipeline that leverages Large Language Models (LLMs), specifically Gemini 2.0 Flash, in conjunction with python scripting and Selenium based web automation to streamline the generation of MD input files. The pipeline exploits CHARMM GUI's comprehensive web-based interface for preparing simulation-ready inputs for NAMD. By integrating Gemini's code generation and iterative refinement capabilities, simulation scripts are automatically written, executed, and revised to navigate CHARMM GUI, extract appropriate parameters, and produce the required NAMD input files. 
Post processing is performed using additional software to further refine the simulation outputs, thereby enabling a complete and largely hands free workflow. Our results demonstrate that this approach reduces setup time, minimizes manual errors, and offers a scalable solution for handling multiple protein systems in parallel. This automated framework paves the way for broader application of LLMs in computational structural biology, offering a robust and adaptable platform for future developments in simulation automation.

[44] DTECT: Dynamic Topic Explorer & Context Tracker

Suman Adhya,Debarshi Kumar Sanyal

Main category: cs.CL

TL;DR: This paper presents DTECT, an integrated platform for dynamic topic modeling that improves the analysis and understanding of temporal trends in large textual datasets.

Details Motivation: The motivation stems from the challenge of uncovering evolving themes in growing textual datasets and the lack of robust, interpretable tools in existing dynamic topic modeling pipelines. Method: The paper introduces DTECT, an end-to-end system that integrates data preprocessing, multiple model architectures, evaluation metrics, interactive visualizations, and LLM-driven features to analyze thematic dynamics. Result: DTECT provides enhanced interpretability and usability through automatic topic labeling, trend analysis, document-level summarization, and a chat interface, making it easier to track and understand thematic changes over time. Conclusion: DTECT is a cohesive and user-friendly system for dynamic topic modeling that effectively supports the exploration and interpretation of evolving themes in textual data over time. Abstract: The explosive growth of textual data over time presents a significant challenge in uncovering evolving themes and trends. Existing dynamic topic modeling techniques, while powerful, often exist in fragmented pipelines that lack robust support for interpretation and user-friendly exploration. We introduce DTECT (Dynamic Topic Explorer & Context Tracker), an end-to-end system that bridges the gap between raw textual data and meaningful temporal insights. DTECT provides a unified workflow that supports data preprocessing, multiple model architectures, and dedicated evaluation metrics to analyze the topic quality of temporal topic models. It significantly enhances interpretability by introducing LLM-driven automatic topic labeling, trend analysis via temporally salient words, interactive visualizations with document-level summarization, and a natural language chat interface for intuitive data querying. By integrating these features into a single, cohesive platform, DTECT empowers users to more effectively track and understand thematic dynamics. DTECT is open-source and available at https://github.com/AdhyaSuman/DTECT.

[45] SAGE: A Visual Language Model for Anomaly Detection via Fact Enhancement and Entropy-aware Alignment

Guoxin Zang,Xue Li,Donglin Di,Lanshun Nie,Dechen Zhan,Yang Song,Lei Fan

Main category: cs.CL

TL;DR: This paper introduces SAGE, a VLM-based framework for industrial anomaly detection, enhancing reasoning with Self-Guided Fact Enhancement and Entropy-aware Direct Preference Optimization.

Details Motivation: Vision-Language Models (VLMs) struggle in industrial anomaly detection and reasoning due to their inherently domain-specific nature, limiting their applicability in scenarios requiring precise, structured, and context-aware analysis. Method: The paper proposes SAGE, a VLM-based framework that enhances anomaly reasoning through Self-Guided Fact Enhancement (SFE) and Entropy-aware Direct Preference Optimization (E-DPO). Additionally, AD-PL, a preference-optimized dataset for industrial anomaly reasoning, is introduced. The evaluation method includes the development of Multiscale Logical Evaluation (MLE), a quantitative framework analyzing model logic and consistency. Result: The proposed framework, SAGE, addresses these challenges by integrating domain-specific knowledge into visual reasoning via fact extraction and fusion with SFE, while aligning model outputs with expert preferences using E-DPO. A new dataset, AD-PL, is also presented. Conclusion: SAGE demonstrates superior performance on industrial anomaly datasets under zero-shot and one-shot settings. Abstract: While Vision-Language Models (VLMs) have shown promising progress in general multimodal tasks, they often struggle in industrial anomaly detection and reasoning, particularly in delivering interpretable explanations and generalizing to unseen categories. This limitation stems from the inherently domain-specific nature of anomaly detection, which hinders the applicability of existing VLMs in industrial scenarios that require precise, structured, and context-aware analysis. To address these challenges, we propose SAGE, a VLM-based framework that enhances anomaly reasoning through Self-Guided Fact Enhancement (SFE) and Entropy-aware Direct Preference Optimization (E-DPO). SFE integrates domain-specific knowledge into visual reasoning via fact extraction and fusion, while E-DPO aligns model outputs with expert preferences using entropy-aware optimization. 
Additionally, we introduce AD-PL, a preference-optimized dataset tailored for industrial anomaly reasoning, consisting of 28,415 question-answering instances with expert-ranked responses. To evaluate anomaly reasoning models, we develop Multiscale Logical Evaluation (MLE), a quantitative framework analyzing model logic and consistency. SAGE demonstrates superior performance on industrial anomaly datasets under zero-shot and one-shot settings. The code, model and dataset are available at https://github.com/amoreZgx1n/SAGE.

[46] MIRIX: Multi-Agent Memory System for LLM-Based Agents

Yu Wang,Xi Chen

Main category: cs.CL

TL;DR: This paper introduces MIRIX, a novel modular, multi-agent memory system that enables AI agents to effectively remember and retrieve complex, long-term user data across multiple modalities, achieving superior performance in challenging benchmarks and providing practical applications.

Details Motivation: Existing AI memory solutions are limited due to their flat structure and narrow scope, making it difficult to personalize, abstract, and recall user-specific information over time. There is a critical need for an advanced memory system that enables AI agents to truly remember and handle complex real-world scenarios involving multimodal and long-term data. Method: The paper introduces MIRIX, which consists of six structured memory types (Core, Episodic, Semantic, Procedural, Resource Memory, and Knowledge Vault) integrated into a dynamic multi-agent framework. This framework coordinates updates and retrieval, allowing agents to persist, reason over, and retrieve user data efficiently. The system is validated on two benchmarks: ScreenshotVQA for multimodal tasks and LOCOMO for textual conversations. Result: In testing, MIRIX outperformed existing baselines significantly. On ScreenshotVQA, it achieved 35% higher accuracy than the RAG baseline while reducing storage needs by 99.9%. On LOCOMO, it attained state-of-the-art performance with 85.4% accuracy. Additionally, a packaged application powered by MIRIX was developed, offering real-time screen monitoring, personalized memory building, intuitive visualization, and secure local storage. Conclusion: MIRIX successfully addresses the limitations of current AI memory systems by introducing a modular, multi-agent framework with six distinct memory types that enable language models to effectively and accurately remember and retrieve diverse, long-term user data at scale. It sets a new performance standard for memory-augmented LLM agents. Abstract: Although memory capabilities of AI agents are gaining increasing attention, existing solutions remain fundamentally limited. Most rely on flat, narrowly scoped memory components, constraining their ability to personalize, abstract, and reliably recall user-specific information over time. 
To this end, we introduce MIRIX, a modular, multi-agent memory system that redefines the future of AI memory by solving the field's most critical challenge: enabling language models to truly remember. Unlike prior approaches, MIRIX transcends text to embrace rich visual and multimodal experiences, making memory genuinely useful in real-world scenarios. MIRIX consists of six distinct, carefully structured memory types: Core, Episodic, Semantic, Procedural, Resource Memory, and Knowledge Vault, coupled with a multi-agent framework that dynamically controls and coordinates updates and retrieval. This design enables agents to persist, reason over, and accurately retrieve diverse, long-term user data at scale. We validate MIRIX in two demanding settings. First, on ScreenshotVQA, a challenging multimodal benchmark comprising nearly 20,000 high-resolution computer screenshots per sequence, requiring deep contextual understanding and where no existing memory systems can be applied, MIRIX achieves 35% higher accuracy than the RAG baseline while reducing storage requirements by 99.9%. Second, on LOCOMO, a long-form conversation benchmark with single-modal textual input, MIRIX attains state-of-the-art performance of 85.4%, far surpassing existing baselines. These results show that MIRIX sets a new performance standard for memory-augmented LLM agents. To allow users to experience our memory system, we provide a packaged application powered by MIRIX. It monitors the screen in real time, builds a personalized memory base, and offers intuitive visualization and secure local storage to ensure privacy.

[47] Why is Your Language Model a Poor Implicit Reward Model?

Noam Razin,Yong Lin,Jiarui Yao,Sanjeev Arora

Main category: cs.CL

TL;DR: This research explores why implicit reward models (IM-RMs) perform worse than explicit ones (EX-RMs), finding that IM-RMs rely on shallow token cues, leading to poorer generalization.

Details Motivation: The study aims to understand the root cause of the generalization gap between IM-RMs and EX-RMs despite their structural similarity. Method: The paper uses theoretical analysis and experimental comparisons between IM-RMs and EX-RMs under various distribution shifts and tasks. Result: IM-RMs generalize worse than EX-RMs, especially under token-level distribution shifts. The study invalidates alternative hypotheses for this gap, including the idea that IM-RMs struggle when generation is harder than verification. Conclusion: IM-RMs and EX-RMs have notable differences in generalization behavior due to IM-RMs relying on superficial token-level cues, indicating that minor design choices significantly impact reward models. Abstract: Reward models are key to language model post-training and inference pipelines. Conveniently, recent work showed that every language model defines an implicit reward model (IM-RM), without requiring any architectural changes. However, such IM-RMs tend to generalize worse, especially out-of-distribution, compared to explicit reward models (EX-RMs) that apply a dedicated linear head over the hidden representations of a language model. The existence of a generalization gap is puzzling, as EX-RMs and IM-RMs are nearly identical. They can be trained using the same data, loss function, and language model, and differ only in how the reward is computed. Towards a fundamental understanding of the implicit biases underlying different reward model types, we investigate the root cause of this gap. Our main finding, backed by theory and experiments, is that IM-RMs rely more heavily on superficial token-level cues. Consequently, they often generalize worse than EX-RMs under token-level distribution shifts, as well as in-distribution. Furthermore, we provide evidence against alternative hypotheses for the generalization gap. 
Most notably, we challenge the intuitive claim that IM-RMs struggle in tasks where generation is harder than verification because they can operate both as a verifier and a generator. Taken together, our results highlight that seemingly minor design choices can substantially impact the generalization behavior of reward models.
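
To make the "differ only in how the reward is computed" point concrete, the sketch below contrasts the two computations. It assumes the DPO-style form of the implicit reward, beta * (log pi(y|x) - log pi_ref(y|x)), and a plain dot product for the explicit linear head. The function names, the beta value, and the toy numbers are hypothetical illustrations, not the paper's code.

```python
import math

def explicit_reward(hidden_state, head_weights):
    """EX-RM: a linear head applied to a hidden representation."""
    return sum(h * w for h, w in zip(hidden_state, head_weights))

def implicit_reward(policy_token_logprobs, ref_token_logprobs, beta=0.1):
    """IM-RM (DPO-style assumption): beta * (log pi(y|x) - log pi_ref(y|x))."""
    log_pi = sum(policy_token_logprobs)   # log-probability of the response under the model
    log_ref = sum(ref_token_logprobs)     # log-probability under the reference model
    return beta * (log_pi - log_ref)

# Toy inputs (hypothetical): one hidden vector, and per-token log-probabilities.
hidden = [0.5, -1.0, 2.0]
head = [1.0, 0.5, 0.25]
ex_r = explicit_reward(hidden, head)

policy_lp = [math.log(0.5), math.log(0.25)]
ref_lp = [math.log(0.25), math.log(0.25)]
im_r = implicit_reward(policy_lp, ref_lp, beta=0.1)
```

Under this assumption the implicit reward is a function of per-token probabilities, which is one way to see why it could be sensitive to token-level cues, whereas the explicit reward reads only the hidden representation.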

[48] Performance and Practical Considerations of Large and Small Language Models in Clinical Decision Support in Rheumatology

Sabine Felde,Rüdiger Buchkremer,Gamal Chehab,Christian Thielscher,Jörg HW Distler,Matthias Schneider,Jutta G. Richter

Main category: cs.CL

TL;DR: Smaller language models with RAG outperform larger models in rheumatology decision-making, offering efficiency and lower costs, but expert supervision is still necessary.

Details Motivation: To find more energy-efficient and cost-effective solutions for clinical decision-making in complex fields like rheumatology. Method: Evaluation of smaller language models combined with retrieval-augmented generation for diagnostic and therapeutic performance in rheumatology. Result: Smaller models achieved higher diagnostic and therapeutic performance than larger models while needing less energy and allowing local deployment, yet they did not consistently reach specialist-level accuracy. Conclusion: Smaller language models combined with retrieval-augmented generation are more efficient and cost-effective for clinical decision-making in rheumatology but still require expert oversight. Abstract: Large language models (LLMs) show promise for supporting clinical decision-making in complex fields such as rheumatology. Our evaluation shows that smaller language models (SLMs), combined with retrieval-augmented generation (RAG), achieve higher diagnostic and therapeutic performance than larger models, while requiring substantially less energy and enabling cost-efficient, local deployment. These features are attractive for resource-limited healthcare. However, expert oversight remains essential, as no model consistently reached specialist-level accuracy in rheumatology.

[49] Automating Expert-Level Medical Reasoning Evaluation of Large Language Models

Shuang Zhou,Wenya Xie,Jiaxi Li,Zaifu Zhan,Meijia Song,Han Yang,Cheyenna Espinoza,Lindsay Welton,Xinnie Mai,Yanwei Jin,Zidu Xu,Yuen-Hei Chung,Yiyun Xing,Meng-Han Tsai,Emma Schaffer,Yucheng Shi,Ninghao Liu,Zirui Liu,Rui Zhang

Main category: cs.CL

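TL;DR: This paper introduces MedThink-Bench, a benchmark for evaluating the medical reasoning of large language models (LLMs), and LLM-w-Ref, an evaluation framework that improves the accuracy and scalability of that evaluation.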

Details Motivation: As large language models are increasingly used in clinical decision-making, ensuring that their reasoning is transparent and trustworthy has become increasingly important. However, existing evaluation strategies either assess reasoning unsatisfactorily or scale poorly, and a rigorous benchmark is lacking. Method: The authors design MedThink-Bench, which comprises 500 challenging questions across ten medical domains, each paired with expert-written step-by-step rationales. They also propose LLM-w-Ref, an evaluation framework that uses fine-grained rationales and an LLM-as-a-Judge mechanism to assess intermediate reasoning. Result: Experiments show that LLM-w-Ref correlates strongly and positively with expert judgments. An evaluation of twelve state-of-the-art LLMs finds that smaller models (e.g., MedGemma-27B) can surpass larger proprietary models (e.g., OpenAI-o3). Conclusion: MedThink-Bench provides a foundational tool for evaluating the medical reasoning of LLMs, advancing their safe and responsible deployment in clinical practice. Abstract: As large language models (LLMs) become increasingly integrated into clinical decision-making, ensuring transparent and trustworthy reasoning is essential. However, existing evaluation strategies of LLMs' medical reasoning capability either suffer from unsatisfactory assessment or poor scalability, and a rigorous benchmark remains lacking. To address this, we introduce MedThink-Bench, a benchmark designed for rigorous, explainable, and scalable assessment of LLMs' medical reasoning. MedThink-Bench comprises 500 challenging questions across ten medical domains, each annotated with expert-crafted step-by-step rationales. Building on this, we propose LLM-w-Ref, a novel evaluation framework that leverages fine-grained rationales and LLM-as-a-Judge mechanisms to assess intermediate reasoning with expert-level fidelity while maintaining scalability. Experiments show that LLM-w-Ref exhibits a strong positive correlation with expert judgments. Benchmarking twelve state-of-the-art LLMs, we find that smaller models (e.g., MedGemma-27B) can surpass larger proprietary counterparts (e.g., OpenAI-o3). Overall, MedThink-Bench offers a foundational tool for evaluating LLMs' medical reasoning, advancing their safe and responsible deployment in clinical practice.

[50] PyVision: Agentic Vision with Dynamic Tooling

Shitian Zhao,Haoquan Zhang,Shaoheng Lin,Ming Li,Qilong Wu,Kaipeng Zhang,Chen Wei

Main category: cs.CL

TL;DR: PyVision enables MLLMs to dynamically generate and refine Python-based tools for visual reasoning tasks, resulting in improved performance and greater flexibility compared to traditional static approaches.

Details Motivation: Prior approaches in visual reasoning are limited by static toolsets and predefined workflows, which restrict flexibility and adaptability. There is a need for systems that can dynamically create and utilize tools tailored to specific tasks. Method: Development of PyVision, an interactive, multi-turn framework that allows MLLMs to autonomously generate, execute, and refine Python-based tools. A taxonomy of these tools was created and their usage analyzed across various benchmarks. Result: PyVision demonstrated consistent performance improvements, enhancing GPT-4.1's score by +7.8% on V* and Claude-4.0-Sonnet's score by +31.1% on VLMsAreBlind-mini. Conclusion: PyVision represents a significant step towards more agentic visual reasoning by enabling models to dynamically generate and use tools, rather than just relying on predefined ones. Abstract: LLMs are increasingly deployed as agents, systems capable of planning, reasoning, and dynamically calling external tools. However, in visual reasoning, prior approaches largely remain limited by predefined workflows and static toolsets. In this report, we present PyVision, an interactive, multi-turn framework that enables MLLMs to autonomously generate, execute, and refine Python-based tools tailored to the task at hand, unlocking flexible and interpretable problem-solving. We develop a taxonomy of the tools created by PyVision and analyze their usage across a diverse set of benchmarks. Quantitatively, PyVision achieves consistent performance gains, boosting GPT-4.1 by +7.8% on V* and Claude-4.0-Sonnet by +31.1% on VLMsAreBlind-mini. These results point to a broader shift: dynamic tooling allows models not just to use tools, but to invent them, advancing toward more agentic visual reasoning.

cs.CV [Back]

[51] Multi-level Mixture of Experts for Multimodal Entity Linking

Zhiwei Hu,Víctor Gutiérrez-Basulto,Zhiliang Xiang,Ru Li,Jeff Z. Pan

Main category: cs.CV

TL;DR: The paper proposes an effective Multi-level Mixture of Experts (MMoE) model for Multimodal Entity Linking (MEL) that addresses mention ambiguity and dynamically selects relevant modal features, outperforming existing approaches.

Details Motivation: Existing MEL approaches fail to address mention ambiguity and dynamic selection of modal content, limiting their ability to handle ambiguous mentions and utilize relevant information effectively. Method: The MMoE model incorporates a description-aware mention enhancement module, a multimodal feature extraction module, and intra-level and inter-level mixture of experts modules to dynamically select relevant features for semantic matching. Result: Extensive experiments demonstrate that the MMoE model outperforms state-of-the-art methods in MEL, showcasing its effectiveness in bridging the modality gap and enabling accurate semantic matching. Conclusion: The proposed MMoE model effectively addresses the issues of mention ambiguity and dynamic selection of modal content in Multimodal Entity Linking (MEL), achieving outstanding performance compared to state-of-the-art approaches. Abstract: Multimodal Entity Linking (MEL) aims to link ambiguous mentions within multimodal contexts to associated entities in a multimodal knowledge base. Existing approaches to MEL introduce multimodal interaction and fusion mechanisms to bridge the modality gap and enable multi-grained semantic matching. However, they do not address two important problems: (i) mention ambiguity, i.e., the lack of semantic content caused by the brevity and omission of key information in the mention's textual context; (ii) dynamic selection of modal content, i.e., to dynamically distinguish the importance of different parts of modal information. To mitigate these issues, we propose a Multi-level Mixture of Experts (MMoE) model for MEL. 
MMoE has four components: (i) the description-aware mention enhancement module leverages large language models to identify the WikiData descriptions that best match a mention, considering the mention's textual context; (ii) the multimodal feature extraction module adopts multimodal feature encoders to obtain textual and visual embeddings for both mentions and entities; (iii)-(iv) the intra-level mixture of experts and inter-level mixture of experts modules apply a switch mixture of experts mechanism to dynamically and adaptively select features from relevant regions of information. Extensive experiments demonstrate the outstanding performance of MMoE compared to the state-of-the-art. MMoE's code is available at: https://github.com/zhiweihu1103/MEL-MMoE.
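
The "switch mixture of experts mechanism" mentioned above can be illustrated with a minimal top-1 routing sketch. It assumes the standard switch formulation (a softmax router that sends each input to its single highest-scoring expert and scales the output by the gate probability). The router weights, the two toy experts, and all names are hypothetical and are not taken from the MMoE code.

```python
import math

def softmax(logits):
    m = max(logits)                                  # subtract max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def switch_route(features, router_weights, experts):
    """Top-1 routing: send the input to the single highest-scoring expert."""
    logits = [sum(f * w for f, w in zip(features, row)) for row in router_weights]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    expert_out = experts[best](features)
    # Scaling by the gate probability keeps the router differentiable in training.
    return best, [probs[best] * v for v in expert_out]

# Two toy experts (hypothetical): one doubles the features, one shifts them.
experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [v + 1.0 for v in x],
]
router_weights = [[1.0, 0.0], [0.0, 1.0]]
chosen, output = switch_route([3.0, 1.0], router_weights, experts)
```

Only the selected expert runs for a given input, which is what lets this kind of layer pick features adaptively without evaluating every expert.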

[52] CoPT: Unsupervised Domain Adaptive Segmentation using Domain-Agnostic Text Embeddings

Cristina Mata,Kanchana Ranasinghe,Michael S. Ryoo

Main category: cs.CV

TL;DR: This paper introduces CoPT, a novel method using text embeddings and a covariance-based loss function, achieving top performance in unsupervised domain adaptation for semantic segmentation.

Details Motivation: The motivation is to improve UDA methods for semantic segmentation by utilizing the domain-agnostic properties of text, which have not been effectively leveraged in prior work despite advances in vision-language representation learning. Method: The method involves a novel Covariance-based Pixel-Text loss (CoPT), which uses text embeddings generated through an LLM Domain Template process. These embeddings are fed into a frozen CLIP model to learn domain-invariant features in image segmentation. Result: Experiments on four benchmarks demonstrate that the CoPT-based model achieves new state-of-the-art performance in UDA for segmentation. Conclusion: The paper concludes that the proposed CoPT method achieves state-of-the-art performance in unsupervised domain adaptation (UDA) for semantic segmentation by leveraging domain-agnostic text embeddings. Abstract: Unsupervised domain adaptation (UDA) involves learning class semantics from labeled data within a source domain that generalize to an unseen target domain. UDA methods are particularly impactful for semantic segmentation, where annotations are more difficult to collect than in image classification. Despite recent advances in large-scale vision-language representation learning, UDA methods for segmentation have not taken advantage of the domain-agnostic properties of text. To address this, we present a novel Covariance-based Pixel-Text loss, CoPT, that uses domain-agnostic text embeddings to learn domain-invariant features in an image segmentation encoder. The text embeddings are generated through our LLM Domain Template process, where an LLM is used to generate source and target domain descriptions that are fed to a frozen CLIP model and combined. In experiments on four benchmarks we show that a model trained using CoPT achieves the new state of the art performance on UDA for segmentation. The code can be found at https://github.com/cfmata/CoPT.

[53] Image Can Bring Your Memory Back: A Novel Multi-Modal Guided Attack against Image Generation Model Unlearning

Renyang Liu,Guanlin Li,Tianwei Zhang,See-Kiong Ng

Main category: cs.CV

TL;DR: This paper introduces Recall, an adversarial framework that challenges the robustness of unlearning techniques in image generation models, revealing vulnerabilities and emphasizing the need for improved solutions.

Details Motivation: Recent advances in image generation models have raised ethical, legal, and societal concerns due to their potential to produce harmful or misleading content. Machine unlearning has been suggested as a solution, but its robustness against adversarial inputs remains underexplored. Method: The authors proposed a novel adversarial framework called Recall, which exploits the multi-modal conditioning capabilities of diffusion models to generate optimized adversarial image prompts using guidance from a single reference image. This approach was tested across ten state-of-the-art unlearning methods and diverse tasks. Result: Recall outperformed existing baselines in adversarial effectiveness, computational efficiency, and semantic fidelity with the original textual prompt, demonstrating consistent success across various unlearning methods and tasks. Conclusion: The study reveals critical vulnerabilities in current unlearning mechanisms of image generation models and highlights the necessity for more robust solutions to enhance the safety and reliability of generative models. Abstract: Recent advances in image generation models (IGMs), particularly diffusion-based architectures such as Stable Diffusion (SD), have markedly enhanced the quality and diversity of AI-generated visual content. However, their generative capability has also raised significant ethical, legal, and societal concerns, including the potential to produce harmful, misleading, or copyright-infringing content. To mitigate these concerns, machine unlearning (MU) emerges as a promising solution by selectively removing undesirable concepts from pretrained models. Nevertheless, the robustness and effectiveness of existing unlearning techniques remain largely unexplored, particularly in the presence of multi-modal adversarial inputs. To bridge this gap, we propose Recall, a novel adversarial framework explicitly designed to compromise the robustness of unlearned IGMs. 
Unlike existing approaches that predominantly rely on adversarial text prompts, Recall exploits the intrinsic multi-modal conditioning capabilities of diffusion models by efficiently optimizing adversarial image prompts with guidance from a single semantically relevant reference image. Extensive experiments across ten state-of-the-art unlearning methods and diverse tasks show that Recall consistently outperforms existing baselines in terms of adversarial effectiveness, computational efficiency, and semantic fidelity with the original textual prompt. These findings reveal critical vulnerabilities in current unlearning mechanisms and underscore the need for more robust solutions to ensure the safety and reliability of generative models. Code and data are publicly available at https://github.com/ryliu68/RECALL.

[54] Explainable Artificial Intelligence in Biomedical Image Analysis: A Comprehensive Survey

Getamesay Haile Dagnaw,Yanming Zhu,Muhammad Hassan Maqsood,Wencheng Yang,Xingshuai Dong,Xuefei Yin,Alan Wee-Chung Liew

Main category: cs.CV

TL;DR: This paper surveys explainable artificial intelligence (XAI) methods tailored to biomedical image analysis, providing a structured synthesis of techniques, their challenges, and future directions.

Details Motivation: To address the lack of modality-aware perspectives in existing XAI surveys and provide practical guidance for interpreting deep learning models in biomedical imaging. Method: A systematic categorization and analysis of XAI methods, with a focus on modality-specific challenges, multimodal learning, and vision-language models. Result: A comprehensive taxonomy of XAI methods aligned with biomedical imaging modalities, an overview of evaluation metrics and open-source tools, and insights into emerging trends and challenges. Conclusion: The survey provides a timely and detailed foundation for advancing interpretable deep learning in biomedical image analysis, highlighting key areas for future research. Abstract: Explainable artificial intelligence (XAI) has become increasingly important in biomedical image analysis to promote transparency, trust, and clinical adoption of DL models. While several surveys have reviewed XAI techniques, they often lack a modality-aware perspective, overlook recent advances in multimodal and vision-language paradigms, and provide limited practical guidance. This survey addresses this gap through a comprehensive and structured synthesis of XAI methods tailored to biomedical image analysis. We systematically categorize XAI methods, analyzing their underlying principles, strengths, and limitations within biomedical contexts. A modality-centered taxonomy is proposed to align XAI methods with specific imaging types, highlighting the distinct interpretability challenges across modalities. We further examine the emerging role of multimodal learning and vision-language models in explainable biomedical AI, a topic largely underexplored in previous work. Our contributions also include a summary of widely used evaluation metrics and open-source frameworks, along with a critical discussion of persistent challenges and future directions. 
This survey offers a timely and in-depth foundation for advancing interpretable DL in biomedical image analysis.

[55] Robust Multimodal Large Language Models Against Modality Conflict

Zongmeng Zhang,Wengang Zhou,Jie Zhao,Houqiang Li

Main category: cs.CV

TL;DR: This paper identifies modality conflict as a cause of hallucinations in MLLMs and proposes three methods to address it, with reinforcement learning being the most effective.

Details Motivation: MLLMs often experience hallucinations in real-world vision-language tasks due to modality conflicts. Current research focuses on discrepancies between model responses and inputs, but this paper explores inherent conflicts within multimodal inputs themselves. Method: Three methods were proposed: prompt engineering, supervised fine-tuning, and reinforcement learning. These were tested on the MMMC dataset to evaluate their effectiveness in reducing hallucinations caused by modality conflict. Result: The reinforcement learning method showed the best performance in reducing hallucinations, while supervised fine-tuning offered consistent and promising results on the MMMC dataset. Conclusion: The study concludes that modality conflict in MLLMs leads to hallucinations, and reinforcement learning methods are the most effective in mitigating this issue, while supervised fine-tuning also offers stable performance. Abstract: Despite the impressive capabilities of multimodal large language models (MLLMs) in vision-language tasks, they are prone to hallucinations in real-world scenarios. This paper investigates the hallucination phenomenon in MLLMs from the perspective of modality conflict. Unlike existing works focusing on the conflicts between model responses and inputs, we study the inherent conflicts in inputs from different modalities that place MLLMs in a dilemma and directly lead to hallucinations. We formally define the modality conflict and construct a dataset named Multimodal Modality Conflict (MMMC) to simulate this phenomenon in vision-language tasks. Three methods based on prompt engineering, supervised fine-tuning, and reinforcement learning are proposed to alleviate the hallucination caused by modality conflict. Extensive experiments are conducted on the MMMC dataset to analyze the merits and demerits of these methods. 
Our results show that the reinforcement learning method achieves the best performance in mitigating the hallucination under modality conflict, while the supervised fine-tuning method shows promising and stable performance. Our work sheds light on the unnoticed modality conflict that leads to hallucinations and provides more insights into the robustness of MLLMs.

[56] Aerial Maritime Vessel Detection and Identification

Antonella Barisic Kulas,Frano Petric,Stjepan Bogdan

Main category: cs.CV

TL;DR: This paper presents a vision-based system for autonomous maritime surveillance and vessel identification in GNSS-denied environments, combining object detection, feature matching, and geometric localization.

Details Motivation: Autonomous maritime surveillance is essential for applications like search and rescue and threat detection, especially in GNSS-denied environments where traditional navigation and tracking systems are unavailable. Method: The approach uses the YOLOv8 object detection model to detect vessels, followed by feature matching and hue histogram distance analysis for target identification. Localization is achieved using geometric principles. Result: The method was successfully demonstrated in real-world experiments during the MBZIRC2023 competition as part of a fully autonomous system. It showed reliable performance in detecting and localizing target vessels under challenging conditions. Conclusion: The proposed method enables autonomous maritime surveillance and target vessel identification using on-board vision in GNSS-denied environments, showing practical effectiveness in real-world experiments. Abstract: Autonomous maritime surveillance and target vessel identification in environments where Global Navigation Satellite Systems (GNSS) are not available is critical for a number of applications such as search and rescue and threat detection. When the target vessel is only described by visual cues and its last known position is not available, unmanned aerial vehicles (UAVs) must rely solely on on-board vision to scan a large search area under strict computational constraints. To address this challenge, we leverage the YOLOv8 object detection model to detect all vessels in the field of view. We then apply feature matching and hue histogram distance analysis to determine whether any detected vessel corresponds to the target. When found, we localize the target using simple geometric principles. We demonstrate the proposed method in real-world experiments during the MBZIRC2023 competition, integrated into a fully autonomous system with GNSS-denied navigation. 
We also evaluate the impact of perspective on detection accuracy and localization precision and compare it with the oracle approach.
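
As a rough illustration of the hue-histogram comparison step described above, the sketch below builds a normalized hue histogram for a detected image patch and compares it to a target description. The abstract does not say which distance is used, so the Hellinger distance, the bin count, and all function names here are assumptions chosen for illustration.

```python
import colorsys
import math

def hue_histogram(rgb_pixels, bins=12):
    """Normalized histogram of hue values for (r, g, b) pixels in 0..255."""
    if not rgb_pixels:
        raise ValueError("need at least one pixel")
    hist = [0.0] * bins
    for r, g, b in rgb_pixels:
        h, _, _ = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        hist[min(int(h * bins), bins - 1)] += 1.0
    total = sum(hist)
    return [v / total for v in hist]

def hellinger_distance(p, q):
    """0 for identical histograms, 1 for histograms with no overlap."""
    overlap = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return math.sqrt(max(0.0, 1.0 - overlap))

# Toy patches (hypothetical): a red hull, a blue hull, and the red target.
target = hue_histogram([(255, 0, 0)] * 4)
red_candidate = hue_histogram([(255, 0, 0)] * 6)
blue_candidate = hue_histogram([(0, 0, 255)] * 6)

d_red = hellinger_distance(target, red_candidate)
d_blue = hellinger_distance(target, blue_candidate)
```

A detection would then be accepted as the target when its distance falls below a threshold; the threshold itself would need tuning on real imagery.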

[57] CL-Polyp: A Contrastive Learning-Enhanced Network for Accurate Polyp Segmentation

Desheng Li,Chaoliang Liu,Zhiyong Xiao

Main category: cs.CV

TL;DR: This paper introduces CL-Polyp, a novel polyp segmentation method using contrastive learning and two effective modules, achieving better performance without extra labeled data.

Details Motivation: Accurate polyp segmentation is critical for early diagnosis of colorectal cancer. Existing methods rely on additional labeled data and task similarity, limiting their generalizability. This work aims to address these limitations through a self-supervised approach. Method: CL-Polyp incorporates contrastive learning to enhance feature extraction without additional annotations, along with two modules: MASPP for multi-scale feature fusion and CA for boundary reconstruction improvement. Result: Experiments on five datasets (Kvasir-SEG, CVC-ClinicDB, CVC-ColonDB, CVC-300, and ETIS) show that CL-Polyp consistently outperforms existing methods, improving IoU by 0.011 and 0.020 on Kvasir-SEG and CVC-ClinicDB, respectively. Conclusion: The proposed CL-Polyp method demonstrates superior performance in polyp segmentation tasks, achieving state-of-the-art results on multiple benchmark datasets and proving its effectiveness for clinical applications. Abstract: Accurate segmentation of polyps from colonoscopy images is crucial for the early diagnosis and treatment of colorectal cancer. Most existing deep learning-based polyp segmentation methods adopt an Encoder-Decoder architecture, and some utilize multi-task frameworks that incorporate auxiliary tasks such as classification to enhance segmentation performance. However, these approaches often require additional labeled data and rely on task similarity, which can limit their generalizability. To address these challenges, we propose CL-Polyp, a contrastive learning-enhanced polyp segmentation network. Our method leverages contrastive learning to improve the encoder's ability to extract discriminative features by contrasting positive and negative sample pairs derived from polyp images. This self-supervised strategy enhances visual representation without requiring additional annotations. 
In addition, we introduce two lightweight and effective modules: the Modified Atrous Spatial Pyramid Pooling (MASPP) module for better multi-scale feature fusion, and the Channel Concatenate and Element Add (CA) module to fuse low-level and upsampled features for improved boundary reconstruction. Extensive experiments on five benchmark datasets (Kvasir-SEG, CVC-ClinicDB, CVC-ColonDB, CVC-300, and ETIS) demonstrate that CL-Polyp consistently outperforms state-of-the-art methods. Specifically, it improves the IoU metric by 0.011 and 0.020 on the Kvasir-SEG and CVC-ClinicDB datasets, respectively, validating its effectiveness in clinical polyp segmentation tasks.
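
For readers unfamiliar with the IoU metric behind the reported gains, here is a minimal sketch of intersection over union for binary segmentation masks. The masks are toy data, and the convention of returning 1.0 when both masks are empty is an assumption. The paper's exact evaluation protocol (for example, averaging over images) is not specified in the abstract.

```python
def iou(pred_mask, true_mask):
    """Intersection over union for two binary masks given as nested lists of 0/1."""
    inter = union = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            inter += 1 if (p and t) else 0   # pixel labeled polyp in both masks
            union += 1 if (p or t) else 0    # pixel labeled polyp in either mask
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0

pred = [[1, 1, 0],
        [0, 1, 0]]
true = [[1, 0, 0],
        [0, 1, 1]]
score = iou(pred, true)  # 2 shared pixels out of 4 in the union
```

On this 0-to-1 scale, improvements of 0.011 and 0.020 correspond to roughly one and two percentage points.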

[58] Interpretable EEG-to-Image Generation with Semantic Prompts

Arshak Rezvani,Ali Akbari,Kosar Sanjar Arani,Maryam Mirian,Emad Arasteh,Martin J. McKeown

Main category: cs.CV

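TL;DR: This paper proposes aligning EEG signals with image content through multilevel semantic captions, achieving effective and interpretable visual decoding.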

Details Motivation: Electroencephalography (EEG) is temporally precise and accessible, but its limited spatial detail hinders reconstructing images directly from EEG signals. The study introduces semantic captions to bypass direct image generation and thereby improve decoding accuracy and interpretability. Method: A transformer-based EEG encoder aligns EEG signals with multilevel semantic captions through contrastive learning; caption embeddings retrieved via projection heads then condition a pretrained latent diffusion model for image generation. Result: The approach achieves state-of-the-art visual decoding on the EEGCVPR dataset, and saliency maps and t-SNE projections reveal semantic topography across the scalp. Conclusion: Structured semantic mediation enables cognitively aligned visual decoding from EEG, showing potential for neuroscience and interpretable AI. Abstract: Decoding visual experience from brain signals offers exciting possibilities for neuroscience and interpretable AI. While EEG is accessible and temporally precise, its limitations in spatial detail hinder image reconstruction. Our model bypasses direct EEG-to-image generation by aligning EEG signals with multilevel semantic captions -- ranging from object-level to abstract themes -- generated by a large language model. A transformer-based EEG encoder maps brain activity to these captions through contrastive learning. During inference, caption embeddings retrieved via projection heads condition a pretrained latent diffusion model for image generation. This text-mediated framework yields state-of-the-art visual decoding on the EEGCVPR dataset, with interpretable alignment to known neurocognitive pathways. Dominant EEG-caption associations reflected the importance of different semantic levels extracted from perceived images. Saliency maps and t-SNE projections reveal semantic topography across the scalp. Our model demonstrates how structured semantic mediation enables cognitively aligned visual decoding from EEG.
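
The contrastive alignment between EEG embeddings and caption embeddings can be sketched with an InfoNCE-style loss. The abstract says only "contrastive learning", so the specific loss, the cosine similarity, the temperature value, and the toy two-dimensional embeddings below are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(eeg_embeddings, caption_embeddings, temperature=0.1):
    """Mean cross-entropy of matching each EEG embedding to its own caption."""
    losses = []
    for i, eeg in enumerate(eeg_embeddings):
        logits = [cosine(eeg, cap) / temperature for cap in caption_embeddings]
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_sum - logits[i])   # negative log-probability of the true pair
    return sum(losses) / len(losses)

# Toy embeddings (hypothetical): correctly paired versus swapped captions.
eeg = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(eeg, [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce(eeg, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each EEG embedding is closest to its own caption and large when the pairing is wrong, which is the signal that pulls the two embedding spaces together during training.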

[59] A Survey on Long-Video Storytelling Generation: Architectures, Consistency, and Cinematic Quality

Mohamed Elmoghany,Ryan Rossi,Seunghyun Yoon,Subhojyoti Mukherjee,Eslam Bakr,Puneet Mathur,Gang Wu,Viet Dac Lai,Nedim Lipka,Ruiyi Zhang,Varun Manjunatha,Chien Nguyen,Daksh Dangi,Abel Salinas,Mohammad Taesiri,Hongjie Chen,Xiaolei Huang,Joe Barrow,Nesreen Ahmed,Hoda Eldardiry,Namyong Park,Yu Wang,Jaemin Cho,Anh Totti Nguyen,Zhengzhong Tu,Thien Nguyen,Dinesh Manocha,Mohamed Elhoseiny,Franck Dernoncourt

Main category: cs.CV

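TL;DR: This survey analyzes 32 video generation papers to examine the challenges current models face in character consistency, motion coherence, and temporal diversity when generating videos longer than 16 seconds, and proposes a new taxonomy.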

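Details Motivation: Despite significant progress in video generative models, existing state-of-the-art methods can only produce videos lasting 5-16 seconds, and videos exceeding 16 seconds struggle to keep character appearances and scene layouts consistent while also suffering from frame redundancy and low temporal diversity. Method: The authors comprehensively analyze 32 papers on video generation to identify key architectural components and training strategies, and construct a new taxonomy. Result: The survey identifies the components and strategies associated with long-form videos featuring multiple characters, narrative coherence, and high-fidelity detail, and provides a comprehensive taxonomy and comparative tables of existing methods. Conclusion: Through a comprehensive study of 32 video generation papers, the authors identify the key architectural components and training strategies that consistently yield long-form videos with narrative coherence and high-fidelity detail, and propose a novel taxonomy. Abstract: Despite the significant progress that has been made in video generative models, existing state-of-the-art methods can only produce videos lasting 5-16 seconds, often labeled "long-form videos". Furthermore, videos exceeding 16 seconds struggle to maintain consistent character appearances and scene layouts throughout the narrative. In particular, multi-subject long videos still fail to preserve character consistency and motion coherence. While some methods can generate videos up to 150 seconds long, they often suffer from frame redundancy and low temporal diversity. Recent work has attempted to produce long-form videos featuring multiple characters, narrative coherence, and high-fidelity detail. We comprehensively studied 32 papers on video generation to identify key architectural components and training strategies that consistently yield these qualities. We also construct a comprehensive novel taxonomy of existing methods and present comparative tables that categorize papers by their architectural designs and performance characteristics.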

[60] Colors See Colors Ignore: Clothes Changing ReID with Color Disentanglement

Priyank Pathak,Yogesh S. Rawat

Main category: cs.CV

TL;DR: This paper proposes Colors See, Colors Ignore (CSCI), a lightweight RGB-only method that uses color information to reduce appearance bias in Clothes-Changing Person Re-Identification without requiring additional annotations.

Details Motivation: Existing CC-ReID methods rely on additional models or annotations to handle clothing changes, making them resource-intensive. This work explores color as a lightweight, annotation-free proxy to address appearance bias in ReID models. Method: The paper introduces CSCI, which uses foreground and background colors from raw RGB images or video frames. It utilizes S2A self-attention to disentangle color-related appearance bias ('Color See') from identity-relevant features ('Color Ignore'). Result: CSCI improved the baseline performance by Top-1 2.9% on LTCC and 5.0% on PRCC for image-based ReID, and 1.0% on CCVID and 2.5% on MeVID for video-based ReID without extra supervision. Conclusion: CSCI is an effective and lightweight method for CC-ReID that leverages color as a proxy to mitigate appearance bias without additional supervision, demonstrating significant performance improvements across multiple datasets. Abstract: Clothes-Changing Re-Identification (CC-ReID) aims to recognize individuals across different locations and times, irrespective of clothing. Existing methods often rely on additional models or annotations to learn robust, clothing-invariant features, making them resource-intensive. In contrast, we explore the use of color - specifically foreground and background colors - as a lightweight, annotation-free proxy for mitigating appearance bias in ReID models. We propose Colors See, Colors Ignore (CSCI), an RGB-only method that leverages color information directly from raw images or video frames. CSCI efficiently captures color-related appearance bias ('Color See') while disentangling it from identity-relevant ReID features ('Color Ignore'). To achieve this, we introduce S2A self-attention, a novel self-attention to prevent information leak between color and identity cues within the feature space. 
Our analysis shows a strong correspondence between learned color embeddings and clothing attributes, validating color as an effective proxy when explicit clothing labels are unavailable. We demonstrate the effectiveness of CSCI on both image and video ReID with extensive experiments on four CC-ReID datasets. We improve the baseline by Top-1 2.9% on LTCC and 5.0% on PRCC for image-based ReID, and 1.0% on CCVID and 2.5% on MeVID for video-based ReID without relying on additional supervision. Our results highlight the potential of color as a cost-effective solution for addressing appearance bias in CC-ReID. Github: https://github.com/ppriyank/ICCV-CSCI-Person-ReID.

[61] Automated Video Segmentation Machine Learning Pipeline

Johannes Merz,Lucien Fostier

Main category: cs.CV

TL;DR: This paper proposes an automated video segmentation pipeline using machine learning to generate accurate and temporally consistent masks, improving efficiency in visual effects production.

Details Motivation: Visual effects (VFX) production often faces challenges with slow and resource-intensive mask generation. The motivation for this work is to streamline the segmentation process, improve productivity, and support artists with more efficient tools. Method: The paper introduces a machine learning-based pipeline for automated video segmentation. Key components include flexible object detection via text prompts, refined per-frame image segmentation, robust video tracking for temporal stability, and deployment using containerization with a structured output format. Result: The pipeline achieves temporally consistent instance masks, is rapidly adopted by artists due to its usability and performance, and enhances overall VFX workflow efficiency by automating key tasks. Conclusion: The automated video segmentation pipeline significantly improves VFX production efficiency by reducing manual effort, speeding up the creation of preliminary composites, and providing comprehensive segmentation data. Abstract: Visual effects (VFX) production often struggles with slow, resource-intensive mask generation. This paper presents an automated video segmentation pipeline that creates temporally consistent instance masks. It employs machine learning for: (1) flexible object detection via text prompts, (2) refined per-frame image segmentation and (3) robust video tracking to ensure temporal stability. Deployed using containerization and leveraging a structured output format, the pipeline was quickly adopted by our artists. It significantly reduces manual effort, speeds up the creation of preliminary composites, and provides comprehensive segmentation data, thereby enhancing overall VFX production efficiency.

[62] DisenQ: Disentangling Q-Former for Activity-Biometrics

Shehreen Azad,Yogesh S Rawat

Main category: cs.CV

TL;DR: This paper proposes DisenQ, a language-guided framework that improves person identification during diverse activities by disentangling biometric features from motion and appearance variations, achieving superior performance on multiple benchmarks.

Details Motivation: Traditional person identification methods face challenges when identity cues are mixed with motion dynamics and appearance variations. Existing approaches relying on visual data like pose or silhouette often suffer from inaccuracies, necessitating an alternative solution. Method: A multimodal language-guided framework called DisenQ is introduced, which uses structured textual supervision instead of additional visual data to separate biometrics, motion, and non-biometrics features using a querying transformer. Result: The approach achieves state-of-the-art performance on three activity-based video benchmarks and demonstrates strong generalization on a traditional video-based identification benchmark. Conclusion: The proposed DisenQ framework effectively disentangles biometric features from motion and non-biometric features, leading to improved performance in identifying individuals across diverse activities. Abstract: In this work, we address activity-biometrics, which involves identifying individuals across a diverse set of activities. Unlike traditional person identification, this setting introduces additional challenges as identity cues become entangled with motion dynamics and appearance variations, making biometrics feature learning more complex. While additional visual data like pose and/or silhouette help, they often suffer from extraction inaccuracies. To overcome this, we propose a multimodal language-guided framework that replaces reliance on additional visual data with structured textual supervision. At its core, we introduce DisenQ (Disentangling Q-Former), a unified querying transformer that disentangles biometrics, motion, and non-biometrics features by leveraging structured language guidance. This ensures identity cues remain independent of appearance and motion variations, preventing misidentifications. 
We evaluate our approach on three activity-based video benchmarks, achieving state-of-the-art performance. Additionally, we demonstrate strong generalization to complex real-world scenarios with competitive performance on a traditional video-based identification benchmark, showing the effectiveness of our framework.

[63] LinguaMark: Do Multimodal Models Speak Fairly? A Benchmark-Based Evaluation

Ananya Raval,Aravind Narayanan,Vahid Reza Khazaie,Shaina Raza

Main category: cs.CV

TL;DR: This paper introduces LinguaMark, a multilingual VQA benchmark, to evaluate LMMs' performance across languages and social attributes, revealing insights about model bias and generalization.

Details Motivation: LMMs often have limited linguistic coverage leading to biased outputs; there's a lack of focus on assessing multilingual capabilities. Method: Introduction of LinguaMark, a benchmark for evaluating LMMs on multilingual VQA tasks using Bias, Answer Relevancy, and Faithfulness metrics. Result: Evaluated models (GPT-4o, Gemini2.5, Gemma3, Qwen2.5) showed competitive performance across social attributes, with Qwen2.5 excelling in multilingual generalization. Conclusion: Closed-source models generally perform best, while Qwen2.5 shows strong multilingual generalization; the LinguaMark benchmark and evaluation code are released for reproducibility. Abstract: Large Multimodal Models (LMMs) are typically trained on vast corpora of image-text data but are often limited in linguistic coverage, leading to biased and unfair outputs across languages. While prior work has explored multimodal evaluation, less emphasis has been placed on assessing multilingual capabilities. In this work, we introduce LinguaMark, a benchmark designed to evaluate state-of-the-art LMMs on a multilingual Visual Question Answering (VQA) task. Our dataset comprises 6,875 image-text pairs spanning 11 languages and five social attributes. We evaluate models using three key metrics: Bias, Answer Relevancy, and Faithfulness. Our findings reveal that closed-source models generally achieve the highest overall performance. Both closed-source (GPT-4o and Gemini2.5) and open-source models (Gemma3, Qwen2.5) perform competitively across social attributes, and Qwen2.5 demonstrates strong generalization across multiple languages. We release our benchmark and evaluation code to encourage reproducibility and further research.

[64] MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning

Chengfei Wu,Ronald Seoh,Bingxuan Li,Liqiang Zhang,Fengrong Han,Dan Goldwasser

Main category: cs.CV

TL;DR: MagiC is proposed as a comprehensive benchmark for evaluating grounded visual reasoning in vision-language models, identifying their strengths and weaknesses.

Details Motivation: The motivation is to determine whether large vision-language models perform genuine grounded visual reasoning or rely on superficial patterns and biases in datasets. Method: The authors introduced MagiC, a benchmark with weakly supervised and human-curated examples, evaluating models on multiple dimensions like correctness, reasoning validity, grounding fidelity, and self-correction. New metrics like MagiScore and StepSense were also introduced. Result: Fifteen vision-language models were evaluated across various parameters and dimensions, highlighting key shortcomings and areas for development in grounded visual reasoning. Conclusion: The paper concludes that MagiC provides a robust framework for assessing grounded visual reasoning in vision-language models, revealing both limitations and opportunities for improvement. Abstract: Recent advances in large vision-language models have led to impressive performance in visual question answering and multimodal reasoning. However, it remains unclear whether these models genuinely perform grounded visual reasoning or rely on superficial patterns and dataset biases. In this work, we introduce MagiC, a comprehensive benchmark designed to evaluate grounded multimodal cognition, assessing not only answer accuracy but also the quality of step-by-step reasoning and its alignment with relevant visual evidence. Our benchmark includes approximately 5,500 weakly supervised QA examples generated from strong model outputs and 900 human-curated examples with fine-grained annotations, including answers, rationales, and bounding box groundings. We evaluate 15 vision-language models ranging from 7B to 70B parameters across four dimensions: final answer correctness, reasoning validity, grounding fidelity, and self-correction ability. MagiC further includes diagnostic settings to probe model robustness under adversarial visual cues and assess their capacity for introspective error correction. 
We introduce new metrics such as MagiScore and StepSense, and provide comprehensive analyses that reveal key limitations and opportunities in current approaches to grounded visual reasoning.

[65] ADIEE: Automatic Dataset Creation and Scorer for Instruction-Guided Image Editing Evaluation

Sherry X. Chen,Yi Wei,Luowei Zhou,Suren Kumar

Main category: cs.CV

TL;DR: This paper introduces ADIEE, an automated dataset creation method, to train a superior scoring model for instruction-guided image editing evaluation, achieving state-of-the-art results.

Details Motivation: Existing open-source Vision-Language Models (VLMs) struggle with alignment, while proprietary models lack transparency and cost efficiency. There is also a lack of public training datasets to fine-tune open-source VLMs. Method: ADIEE generates a large-scale dataset with over 100K samples to fine-tune a modified LLaVA-NeXT-8B model, enabling it to decode numeric scores from custom tokens for evaluation. Result: The resulting scorer outperforms all open-source VLMs and Gemini-Pro 1.5 across benchmarks, showing improvements in score correlation with human ratings and pair-wise comparison accuracy. It boosts MagicBrush's evaluation score significantly on ImagenHub. Conclusion: The proposed ADIEE approach effectively trains a scoring model that outperforms existing open-source and proprietary models in instruction-guided image editing evaluation, serving as a reward model for enhanced performance. Abstract: Recent advances in instruction-guided image editing underscore the need for effective automated evaluation. While Vision-Language Models (VLMs) have been explored as judges, open-source models struggle with alignment, and proprietary models lack transparency and cost efficiency. Additionally, no public training datasets exist to fine-tune open-source VLMs, only small benchmarks with diverse evaluation schemes. To address this, we introduce ADIEE, an automated dataset creation approach which is then used to train a scoring model for instruction-guided image editing evaluation. We generate a large-scale dataset with over 100K samples and use it to fine-tune a LLaVA-NeXT-8B model modified to decode a numeric score from a custom token. 
The resulting scorer outperforms all open-source VLMs and Gemini-Pro 1.5 across all benchmarks, achieving a 0.0696 (+17.24%) gain in score correlation with human ratings on AURORA-Bench, and improving pair-wise comparison accuracy by 4.03% (+7.21%) on GenAI-Bench and 4.75% (+9.35%) on AURORA-Bench, respectively, compared to the state-of-the-art. The scorer can act as a reward model, enabling automated best edit selection and model fine-tuning. Notably, the proposed scorer can boost MagicBrush model's average evaluation score on ImagenHub from 5.90 to 6.43 (+8.98%).

[66] Scalable and Realistic Virtual Try-on Application for Foundation Makeup with Kubelka-Munk Theory

Hui Pang,Sunil Hadap,Violetta Shevchenko,Rahul Suresh,Amin Banitalebi-Dehkordi

Main category: cs.CV

TL;DR: This paper introduces a fast and scalable augmented reality method for virtual foundation makeup try-ons, achieving realistic skin-tone blending and outperforming other approaches.

Details Motivation: The motivation stems from the growing use of augmented reality in the beauty industry and the technical challenge of accurately simulating foundation-skin tone color blending in virtual try-on applications. Method: A novel method was developed to approximate Kubelka-Munk theory for faster image synthesis while maintaining realism in foundation-skin tone blending. An end-to-end scalable framework was also built for realistic foundation makeup VTO based on product information from e-commerce sites. Result: The proposed framework was validated using real-world makeup images, demonstrating superior performance compared to other existing techniques. Conclusion: The proposed method effectively addresses the challenge of realistic foundation-skin tone color blending in augmented reality VTO applications, offering a scalable solution that outperforms existing techniques. Abstract: Augmented reality is revolutionizing the beauty industry with virtual try-on (VTO) applications, which empower users to try a wide variety of products using their phones without the hassle of physically putting on real products. A critical technical challenge in foundation VTO applications is the accurate synthesis of foundation-skin tone color blending while maintaining the scalability of the method across diverse product ranges. In this work, we propose a novel method to approximate the well-established Kubelka-Munk (KM) theory for faster image synthesis while preserving foundation-skin tone color blending realism. Additionally, we build a scalable end-to-end framework for realistic foundation makeup VTO solely depending on the product information available on e-commerce sites. We validate our method using real-world makeup images, demonstrating that our framework outperforms other techniques.
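For background, the textbook Kubelka-Munk relation for an opaque layer links reflectance R to the absorption-to-scattering ratio: K/S = (1 - R)^2 / (2R). The sketch below is not the paper's approximation, which the abstract does not spell out. It only illustrates blending a skin reflectance and a foundation reflectance in K/S space, where the linear mixing weight `coverage` and the sample reflectance values are hypothetical.

```python
import math


def reflectance_to_ks(r):
    """Kubelka-Munk K/S ratio for an opaque layer with reflectance r in (0, 1]."""
    return (1.0 - r) ** 2 / (2.0 * r)


def ks_to_reflectance(ks):
    """Inverse of reflectance_to_ks: reflectance for a given K/S ratio."""
    return 1.0 + ks - math.sqrt(ks * ks + 2.0 * ks)


def blend_channel(skin_r, foundation_r, coverage):
    """Blend one color channel linearly in K/S space.

    coverage in [0, 1] is an illustrative mixing weight (assumption),
    with 0 meaning bare skin and 1 meaning foundation only.
    """
    ks = ((1.0 - coverage) * reflectance_to_ks(skin_r)
          + coverage * reflectance_to_ks(foundation_r))
    return ks_to_reflectance(ks)


# Hypothetical per-channel (R, G, B) reflectances.
skin = (0.60, 0.45, 0.38)
foundation = (0.75, 0.60, 0.50)
blended = tuple(blend_channel(s, f, 0.5) for s, f in zip(skin, foundation))
print(blended)
```

Because the blend happens in K/S space rather than directly on reflectance, the result is nonlinear in the input colors. This is the kind of behavior that plain alpha blending misses, and evaluating it per pixel is the cost a faster approximation would aim to reduce.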

[67] Entity Re-identification in Visual Storytelling via Contrastive Reinforcement Learning

Daniel A. P. Oliveira,David Martins de Matos

Main category: cs.CV

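TL;DR: This paper proposes a new contrastive reinforcement learning method that addresses cross-frame entity consistency in visual storytelling, improving model performance on multiple metrics.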

Details Motivation: Current vision-language models struggle to keep character and object identities consistent across frames, leading to inconsistent references and referential hallucinations, mainly because the models lack explicit training on when to establish cross-frame entity connections. Method: The authors propose a contrastive reinforcement learning approach that combines synthetic negative examples with Direct Preference Optimization using a dual-component reward function, and fine-tune Qwen Storyteller (based on Qwen2.5-VL 7B). Result: Evaluation shows grounding mAP improving by 14.8% (0.27 to 0.31) and F1 by 17.1% (0.35 to 0.41). Pronoun grounding accuracy improved for all pronoun types except "its", and cross-frame entity persistence increased across all frame counts, with entities appearing in 5 or more frames rising from 29.3% to 33.3%. The share of well-structured stories increased from 79.1% to 97.5%. Conclusion: The contrastive reinforcement learning approach improves how visual storytelling systems connect and refer to entities across frames, raising performance on multiple evaluation metrics. Abstract: Visual storytelling systems, particularly large vision-language models, struggle to maintain character and object identity across frames, often failing to recognize when entities in different images represent the same individuals or objects, leading to inconsistent references and referential hallucinations. This occurs because models lack explicit training on when to establish entity connections across frames. We propose a contrastive reinforcement learning approach that trains models to discriminate between coherent image sequences and stories from unrelated images. We extend the Story Reasoning dataset with synthetic negative examples to teach appropriate entity connection behavior. We employ Direct Preference Optimization with a dual-component reward function that promotes grounding and re-identification of entities in real stories while penalizing incorrect entity connections in synthetic contexts. Using this contrastive framework, we fine-tune Qwen Storyteller (based on Qwen2.5-VL 7B). Evaluation shows improvements in grounding mAP from 0.27 to 0.31 (+14.8%), F1 from 0.35 to 0.41 (+17.1%). Pronoun grounding accuracy improved across all pronoun types except "its", and cross-frame character and object persistence increased across all frame counts, with entities appearing in 5 or more frames advancing from 29.3% to 33.3% (+13.7%). Well-structured stories, containing the chain-of-thought and grounded story, increased from 79.1% to 97.5% (+23.3%).
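The percentages in parentheses in the abstract are relative gains over the baseline, not percentage-point differences. A short check, using only the before/after values quoted in the abstract (the helper name `relative_gain` is illustrative):

```python
def relative_gain(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


# (before, after, relative gain reported in the abstract)
reported = {
    "grounding mAP": (0.27, 0.31, 14.8),
    "F1": (0.35, 0.41, 17.1),
    "entities in 5+ frames (%)": (29.3, 33.3, 13.7),
    "well-structured stories (%)": (79.1, 97.5, 23.3),
}

for name, (before, after, claimed) in reported.items():
    gain = relative_gain(before, after)
    print(f"{name}: {before} -> {after} = +{gain:.1f}% (reported +{claimed}%)")
```

For example, the rise in well-structured stories from 79.1% to 97.5% is an 18.4 percentage-point increase, which corresponds to the reported relative gain of about 23.3%.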

[68] PacGDC: Label-Efficient Generalizable Depth Completion with Projection Ambiguity and Consistency

Haotian Wang,Aoran Xiao,Xiaoqin Zhang,Meng Yang,Shijian Lu

Main category: cs.CV

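TL;DR: This paper proposes PacGDC, a label-efficient method that enhances data diversity for generalizable depth completion. It generates diverse pseudo geometries by manipulating scene scales, combines this with interpolation and relocation strategies, and performs strongly in both zero-shot and few-shot settings.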

Details Motivation: Training models that can acquire dense metric depth maps for unseen environments typically requires large-scale datasets with metric depth labels, which are labor-intensive to collect. A label-efficient approach is therefore needed to increase data diversity while reducing annotation effort. Method: PacGDC builds on new insights into the inherent ambiguities and consistencies in object shapes and positions during 2D-to-3D projection to synthesize numerous pseudo geometries. It uses multiple depth foundation models as scale manipulators to produce pseudo depth labels with varied scene scales, and further diversifies geometries with interpolation and relocation strategies as well as unlabeled images. Result: Experiments show that PacGDC generalizes well under both zero-shot and few-shot settings across diverse scene semantics, scales, and depth sparsity/patterns. Conclusion: PacGDC is a label-efficient technique that enhances data diversity with minimal annotation effort to achieve generalizable depth completion. It performs well across multiple benchmarks and applies to diverse scene semantics/scales and depth sparsity/patterns. Abstract: Generalizable depth completion enables the acquisition of dense metric depth maps for unseen environments, offering robust perception capabilities for various downstream tasks. However, training such models typically requires large-scale datasets with metric depth labels, which are often labor-intensive to collect. This paper presents PacGDC, a label-efficient technique that enhances data diversity with minimal annotation effort for generalizable depth completion. PacGDC builds on novel insights into inherent ambiguities and consistencies in object shapes and positions during 2D-to-3D projection, allowing the synthesis of numerous pseudo geometries for the same visual scene. This process greatly broadens available geometries by manipulating scene scales of the corresponding depth maps. To leverage this property, we propose a new data synthesis pipeline that uses multiple depth foundation models as scale manipulators. These models robustly provide pseudo depth labels with varied scene scales, affecting both local objects and global layouts, while ensuring projection consistency that supports generalization. To further diversify geometries, we incorporate interpolation and relocation strategies, as well as unlabeled images, extending the data coverage beyond the individual use of foundation models. Extensive experiments show that PacGDC achieves remarkable generalizability across multiple benchmarks, excelling in diverse scene semantics/scales and depth sparsity/patterns under both zero-shot and few-shot settings. Code: https://github.com/Wang-xjtu/PacGDC.
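The simplest instance of the projection ambiguity mentioned in the abstract is global scale: under a pinhole camera, scaling every 3D point by the same factor leaves the image unchanged. The sketch below illustrates only this basic case, not PacGDC's actual synthesis pipeline; the camera intrinsics and point coordinates are hypothetical.

```python
def project(point, focal=500.0, cx=320.0, cy=240.0):
    """Pinhole projection of a 3D point (x, y, z) with z > 0 to pixel coordinates."""
    x, y, z = point
    return (focal * x / z + cx, focal * y / z + cy)


def scale_scene(points, s):
    """Scale every 3D point (and therefore every depth value) by the factor s."""
    return [(s * x, s * y, s * z) for x, y, z in points]


# A toy scene and a copy that is three times larger and three times farther away.
scene = [(0.5, -0.2, 2.0), (-1.0, 0.4, 5.0), (0.1, 0.1, 1.5)]
pixels = [project(p) for p in scene]
pixels_scaled = [project(p) for p in scale_scene(scene, 3.0)]

for original, scaled in zip(pixels, pixels_scaled):
    print(original, scaled)
```

Both scenes produce the same pixels, so a single image is consistent with many depth maps that differ in scale. This is why pseudo depth labels with varied scene scales can all remain valid geometries for one visual scene.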

[69] Adaptive Particle-Based Shape Modeling for Anatomical Surface Correspondence

Hong Xu,Shireen Y. Elhabian

Main category: cs.CV

TL;DR: The paper proposes new mechanisms to enhance adaptivity and consistency in particle-based shape modeling, demonstrating improved representation of anatomical variability.

Details Motivation: The motivation is to address the lack of self-adaptivity in current particle-based shape modeling methods, which is essential for accurately representing complex anatomical variability. Method: The paper introduces a novel neighborhood correspondence loss and a geodesic correspondence algorithm to enhance adaptivity and maintain consistency in particle-based shape modeling. Result: The results show the efficacy and scalability of the approach on challenging datasets, with improvements in adaptivity and correspondence metrics compared to existing methods. Conclusion: This paper concludes that the proposed mechanisms, neighborhood correspondence loss and geodesic correspondence algorithm, successfully improve surface adaptivity while maintaining consistent particle configurations for better representation of anatomical variability. Abstract: Particle-based shape modeling (PSM) is a family of approaches that automatically quantifies shape variability across anatomical cohorts by positioning particles (pseudo landmarks) on shape surfaces in a consistent configuration. Recent advances incorporate implicit radial basis function representations as self-supervised signals to better capture the complex geometric properties of anatomical structures. However, these methods still lack self-adaptivity -- that is, the ability to automatically adjust particle configurations to local geometric features of each surface, which is essential for accurately representing complex anatomical variability. This paper introduces two mechanisms to increase surface adaptivity while maintaining consistent particle configurations: (1) a novel neighborhood correspondence loss to enable high adaptivity and (2) a geodesic correspondence algorithm that regularizes optimization to enforce geodesic neighborhood consistency. 
We evaluate the efficacy and scalability of our approach on challenging datasets, providing a detailed analysis of the adaptivity-correspondence trade-off and benchmarking against existing methods on surface representation accuracy and correspondence metrics.

[70] Multi-Scale Attention and Gated Shifting for Fine-Grained Event Spotting in Videos

Hao Xu,Arbind Agrahari Baniya,Sam Wells,Mohamed Reda Bouadjenek,Richard Dazeley,Sunil Aryal

Main category: cs.CV

TL;DR: This paper proposes MSAGSM, an improved module for precise event spotting in sports videos, combining multi-scale temporal dilations and spatial attention. It achieves better performance with little added complexity and introduces a new table tennis dataset.

Details Motivation: Existing PES models use lightweight temporal modules like GSM or GSF, but these have limitations in temporal receptive field and spatial adaptability. The motivation is to develop a more efficient module that captures both short- and long-term dependencies while focusing on salient regions. Method: The authors propose a Multi-Scale Attention Gate Shift Module (MSAGSM), which integrates multi-scale temporal dilations and multi-head spatial attention into existing lightweight temporal modules like Gate Shift Module (GSM). This module can be easily combined with various 2D CNN backbones. They also introduce the Table Tennis Australia (TTA) dataset for benchmarking. Result: Extensive experiments across five PES benchmarks show that MSAGSM consistently improves performance with minimal computational overhead. It sets new state-of-the-art results in precise event spotting tasks. Conclusion: The paper concludes that the proposed Multi-Scale Attention Gate Shift Module (MSAGSM) effectively enhances precise event spotting in sports videos by improving both temporal and spatial modeling, and it achieves state-of-the-art results with minimal overhead. Abstract: Precise Event Spotting (PES) in sports videos requires frame-level recognition of fine-grained actions from single-camera footage. Existing PES models typically incorporate lightweight temporal modules such as Gate Shift Module (GSM) or Gate Shift Fuse (GSF) to enrich 2D CNN feature extractors with temporal context. However, these modules are limited in both temporal receptive field and spatial adaptability. We propose a Multi-Scale Attention Gate Shift Module (MSAGSM) that enhances GSM with multi-scale temporal dilations and multi-head spatial attention, enabling efficient modeling of both short- and long-term dependencies while focusing on salient regions. MSAGSM is a lightweight plug-and-play module that can be easily integrated with various 2D backbones. 
To further advance the field, we introduce the Table Tennis Australia (TTA) dataset, the first PES benchmark for table tennis, containing over 4800 precisely annotated events. Extensive experiments across five PES benchmarks demonstrate that MSAGSM consistently improves performance with minimal overhead, setting new state-of-the-art results.

[71] KeyRe-ID: Keypoint-Guided Person Re-Identification using Part-Aware Representation in Videos

Jinseong Kim,Junghoon Song,Gyeongseon Baek,Byeongjoon Noh

Main category: cs.CV

TL;DR: This paper proposes KeyRe-ID, a keypoint-guided method for video-based person re-identification that performs strongly on multiple benchmarks.

Details Motivation: To improve person re-identification in videos, the work strengthens the model's representations by combining global identity semantics with local body-region features. Method: The authors propose KeyRe-ID, a keypoint-guided video-based person re-identification framework with a global branch and a local branch that use human keypoints for enhanced spatiotemporal representation learning. Result: The method reaches 91.73% mAP and 97.32% Rank-1 accuracy on MARS, and 96.00% Rank-1 and 100.0% Rank-5 accuracy on iLIDS-VID. Conclusion: KeyRe-ID achieves state-of-the-art performance for video-based person re-identification, demonstrating its effectiveness on the MARS and iLIDS-VID benchmarks. Abstract: We propose KeyRe-ID, a keypoint-guided video-based person re-identification framework consisting of global and local branches that leverage human keypoints for enhanced spatiotemporal representation learning. The global branch captures holistic identity semantics through Transformer-based temporal aggregation, while the local branch dynamically segments body regions based on keypoints to generate fine-grained, part-aware features. Extensive experiments on MARS and iLIDS-VID benchmarks demonstrate state-of-the-art performance, achieving 91.73% mAP and 97.32% Rank-1 accuracy on MARS, and 96.00% Rank-1 and 100.0% Rank-5 accuracy on iLIDS-VID. The code for this work will be publicly available on GitHub upon publication.
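Rank-k accuracy and mAP, the two metrics reported above, are both computed from a ranked gallery list per query. A simplified sketch follows; standard ReID protocols also filter gallery samples by camera, which is omitted here, and the toy identities and function names are illustrative.

```python
def rank_k_accuracy(ranked_ids, query_ids, k):
    """Fraction of queries whose identity appears among the top-k gallery results."""
    hits = sum(1 for ranked, q in zip(ranked_ids, query_ids) if q in ranked[:k])
    return hits / len(query_ids)


def average_precision(ranked, query_id):
    """Average precision of one ranked gallery list for a single query identity."""
    hits = 0
    total = 0.0
    for position, gallery_id in enumerate(ranked, start=1):
        if gallery_id == query_id:
            hits += 1
            total += hits / position  # precision at this correct match
    return total / hits if hits else 0.0


def mean_average_precision(ranked_ids, query_ids):
    """Mean of the per-query average precision values."""
    aps = [average_precision(r, q) for r, q in zip(ranked_ids, query_ids)]
    return sum(aps) / len(aps)


# Toy example: gallery identities sorted by descending similarity to each query.
query_ids = ["a", "b", "c"]
ranked_ids = [["a", "x", "a"], ["y", "b", "z"], ["z", "y", "x"]]

rank1 = rank_k_accuracy(ranked_ids, query_ids, k=1)
rank2 = rank_k_accuracy(ranked_ids, query_ids, k=2)
m_ap = mean_average_precision(ranked_ids, query_ids)
print(rank1, rank2, m_ap)
```

Rank-k only asks whether any correct match appears in the top k, while mAP rewards ranking all correct matches early. That difference is why a benchmark can show a Rank-1 score well above its mAP, as with the MARS figures quoted above.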

[72] Behave Your Motion: Habit-preserved Cross-category Animal Motion Transfer

Zhimin Zhang,Bi'an Du,Caoyuan Ma,Zheng Wang,Wei Hu

Main category: cs.CV

TL;DR: This paper proposes a novel motion transfer framework that preserves animal-specific behavioral habits, validated through experiments and a new quadruped dataset.

Details Motivation: Existing motion transfer methods focus on human motion and neglect preservation of unique animal behavioral habits, creating the need for a cross-category solution. Method: A generative framework with a habit-preservation module and category-specific habit encoder is introduced. Integration of a large language model (LLM) supports motion transfer to unseen species. Result: The model outperforms existing approaches, validated through experiments on the DeformingThings4D-skl dataset with quantitative analyses. Conclusion: The proposed framework successfully transfers motion across different animal categories while preserving distinct habitual behaviors, demonstrating its effectiveness through experiments. Abstract: Animal motion embodies species-specific behavioral habits, making the transfer of motion across categories a critical yet complex task for applications in animation and virtual reality. Existing motion transfer methods, primarily focused on human motion, emphasize skeletal alignment (motion retargeting) or stylistic consistency (motion style transfer), often neglecting the preservation of distinct habitual behaviors in animals. To bridge this gap, we propose a novel habit-preserved motion transfer framework for cross-category animal motion. Built upon a generative framework, our model introduces a habit-preservation module with category-specific habit encoder, allowing it to learn motion priors that capture distinctive habitual characteristics. Furthermore, we integrate a large language model (LLM) to facilitate the motion transfer to previously unobserved species. To evaluate the effectiveness of our approach, we introduce the DeformingThings4D-skl dataset, a quadruped dataset with skeletal bindings, and conduct extensive experiments and quantitative analyses, which validate the superiority of our proposed model.

[73] Seg-Wild: Interactive Segmentation based on 3D Gaussian Splatting for Unconstrained Image Collections

Yongtang Bao,Chengjie Tang,Yuze Wang,Haojie Li

Main category: cs.CV

TL;DR: This paper proposes Seg-Wild, an interactive segmentation method using 3D Gaussian Splatting and a novel smoothing technique, which improves segmentation and reconstruction quality for unconstrained image collections.

Details Motivation: Unconstrained photo collections from the Internet are easier to obtain but challenging to segment due to inconsistent lighting and transient occlusions, which existing methods cannot address effectively. Method: Seg-Wild uses 3D Gaussian Splatting with multi-dimensional feature embeddings for interactive segmentation and introduces the Spiky 3D Gaussian Cutter to smooth abnormalities. Result: Seg-Wild achieves better segmentation and reconstruction results compared to previous methods on a newly designed benchmark for in-the-wild scenes. Conclusion: Seg-Wild is an effective interactive segmentation method for unconstrained image collections, outperforming previous methods in segmentation and reconstruction quality. Abstract: Reconstructing and segmenting scenes from unconstrained photo collections obtained from the Internet is a novel but challenging task. Unconstrained photo collections are easier to get than well-captured photo collections. These unconstrained images suffer from inconsistent lighting and transient occlusions, which makes segmentation challenging. Previous segmentation methods cannot address transient occlusions or accurately restore the scene's lighting conditions. Therefore, we propose Seg-Wild, an interactive segmentation method based on 3D Gaussian Splatting for unconstrained image collections, suitable for in-the-wild scenes. We integrate multi-dimensional feature embeddings for each 3D Gaussian and calculate the feature similarity between the feature embeddings and the segmentation target to achieve interactive segmentation in the 3D scene. Additionally, we introduce the Spiky 3D Gaussian Cutter (SGC) to smooth abnormal 3D Gaussians. We project the 3D Gaussians onto a 2D plane and calculate the ratio of 3D Gaussians that need to be cut using the SAM mask. We also designed a benchmark to evaluate segmentation quality in in-the-wild scenes. Experimental results demonstrate that compared to previous methods, Seg-Wild achieves better segmentation results and reconstruction quality. Our code will be available at https://github.com/Sugar0725/Seg-Wild.

[74] EscherNet++: Simultaneous Amodal Completion and Scalable View Synthesis through Masked Fine-Tuning and Enhanced Feed-Forward 3D Reconstruction

Xinan Zhang,Muhammad Zubair Irshad,Anthony Yezzi,Yi-Chang Tsai,Zsolt Kira

Main category: cs.CV

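TL;DR: EscherNet++ is an efficient end-to-end model for zero-shot amodal completion and novel view synthesis, significantly reducing computation time and resource consumption compared with conventional multi-stage methods.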

Details Motivation: Existing methods rely on multi-stage, complex pipelines that fail to account for cross-view dependencies and require additional storage and compute. Method: Masked fine-tuning (input-level and feature-level masking), combined with feed-forward image-to-mesh models, enables efficient novel view synthesis and amodal completion. Result: In the 10-input setting, EscherNet++ improves PSNR by 3.9 and Volume IoU by 0.28 on occluded tasks and reduces reconstruction time by 95%. Conclusion: EscherNet++ is an improved end-to-end model that performs novel view synthesis and amodal completion of partially missing images effectively, while also offering good 3D reconstruction capability. Abstract: We propose EscherNet++, a masked fine-tuned diffusion model that can synthesize novel views of objects in a zero-shot manner with amodal completion ability. Existing approaches utilize multiple stages and complex pipelines to first hallucinate missing parts of the image and then perform novel view synthesis, which fail to consider cross-view dependencies and require redundant storage and computing for separate stages. Instead, we apply masked fine-tuning including input-level and feature-level masking to enable an end-to-end model with the improved ability to synthesize novel views and conduct amodal completion. In addition, we empirically integrate our model with other feed-forward image-to-mesh models without extra training and achieve competitive results with reconstruction time decreased by 95%, thanks to its ability to synthesize arbitrary query views. Our method's scalable nature further enhances fast 3D reconstruction. Despite fine-tuning on a smaller dataset and batch size, our method achieves state-of-the-art results, improving PSNR by 3.9 and Volume IoU by 0.28 on occluded tasks in 10-input settings, while also generalizing to real-world occluded reconstruction.
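
The two metrics quoted in this entry, PSNR and Volume IoU, have standard definitions that can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: the function names, the assumption that images are scaled to [0, 1], and the use of boolean occupancy grids for the volume are all assumptions made here.

```python
import numpy as np

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in decibels for images scaled to [0, max_value]."""
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

def volume_iou(occ_a, occ_b):
    """Intersection over union of two boolean occupancy grids."""
    intersection = np.logical_and(occ_a, occ_b).sum()
    union = np.logical_or(occ_a, occ_b).sum()
    # Two empty volumes are treated as a perfect match (an assumption).
    return intersection / union if union > 0 else 1.0

# Toy inputs: a constant error of 0.1 and two small occupancy grids.
reference = np.zeros((2, 2))
estimate = np.full((2, 2), 0.1)
grid_a = np.array([[True, True], [False, False]])
grid_b = np.array([[True, False], [False, False]])

psnr_value = psnr(reference, estimate)
iou_value = volume_iou(grid_a, grid_b)
```

Because PSNR is logarithmic, a fixed gain in decibels corresponds to a fixed ratio of mean squared errors rather than a fixed absolute difference.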

[75] EPIC: Efficient Prompt Interaction for Text-Image Classification

Xinyao Yu,Hao Sun,Zeyu Ling,Ziwei Niu,Zhenjia Bai,Rui Qin,Yen-Wei Chen,Lanfen Lin

Main category: cs.CV

TL;DR: This paper introduces EPIC, an efficient prompt-based strategy for text-image classification that reduces computational costs and trainable parameters while maintaining strong performance on multiple datasets.

Details Motivation: Large-scale pre-trained multimodal models (LMMs) have high computational costs during fine-tuning, prompting the need for more efficient strategies like prompt-based interaction. Method: Proposed Efficient Prompt Interaction for text-image Classification (EPIC) using temporal prompts and similarity-based prompt interaction. Result: The proposed EPIC method significantly reduces computational costs and trainable parameters (about 1%) while achieving superior performance on UPMC-Food101 and SNLI-VE datasets, and comparable results on MM-IMDB. Conclusion: EPIC achieves reduced computational resource consumption and fewer trainable parameters while demonstrating superior or comparable performance on several datasets. Abstract: In recent years, large-scale pre-trained multimodal models (LMMs) have emerged to integrate the vision and language modalities, achieving considerable success in multimodal tasks such as text-image classification. The growing size of LMMs, however, results in a significant computational cost for fine-tuning these models for downstream tasks. Hence, prompt-based interaction strategies are studied to align modalities more efficiently. In this context, we propose a novel efficient prompt-based multimodal interaction strategy, namely Efficient Prompt Interaction for text-image Classification (EPIC). Specifically, we utilize temporal prompts on intermediate layers, and integrate different modalities with similarity-based prompt interaction, to leverage sufficient information exchange between modalities. Utilizing this approach, our method achieves reduced computational resource consumption and fewer trainable parameters (about 1% of the foundation model) compared to other fine-tuning strategies. Furthermore, it demonstrates superior performance on the UPMC-Food101 and SNLI-VE datasets, while achieving comparable performance on the MM-IMDB dataset.

[76] Corvid: Improving Multimodal Large Language Models Towards Chain-of-Thought Reasoning

Jingjing Jiang,Chao Ma,Xurui Song,Hanwang Zhang,Jun Luo

Main category: cs.CV

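TL;DR: This paper presents Corvid, a multimodal large language model with enhanced chain-of-thought reasoning, trained on the new MCoT-Instruct-287K dataset and paired with an inference-time self-verification strategy.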

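Details Motivation: Leading open-source multimodal large language models (MLLMs) show significant limitations in complex and structured reasoning, particularly in tasks that require deep reasoning for decision-making and problem-solving. Method: Corvid combines a hybrid vision encoder with a connector (GateMixer) for cross-modal alignment, is fine-tuned on the MCoT-Instruct-287K dataset with a two-stage CoT-formatted training approach, and uses an inference-time scaling strategy based on self-verification to mitigate over-reasoning and under-reasoning. Result: Corvid outperforms existing o1-like MLLMs and state-of-the-art MLLMs with similar parameter scales, with notable strengths in mathematical reasoning and science problem-solving. Conclusion: The authors report that Corvid's architecture, CoT instruction data, and self-verification strategy together improve step-by-step reasoning in an open-source MLLM. Abstract: Recent advancements in multimodal large language models (MLLMs) have demonstrated exceptional performance in multimodal perception and understanding. However, leading open-source MLLMs exhibit significant limitations in complex and structured reasoning, particularly in tasks requiring deep reasoning for decision-making and problem-solving. In this work, we present Corvid, an MLLM with enhanced chain-of-thought (CoT) reasoning capabilities. Architecturally, Corvid incorporates a hybrid vision encoder for informative visual representation and a meticulously designed connector (GateMixer) to facilitate cross-modal alignment. To enhance Corvid's CoT reasoning capabilities, we introduce MCoT-Instruct-287K, a high-quality multimodal CoT instruction-following dataset, refined and standardized from diverse public reasoning sources. Leveraging this dataset, we fine-tune Corvid with a two-stage CoT-formatted training approach to progressively enhance its step-by-step reasoning abilities. Furthermore, we propose an effective inference-time scaling strategy that enables Corvid to mitigate over-reasoning and under-reasoning through self-verification. Extensive experiments demonstrate that Corvid outperforms existing o1-like MLLMs and state-of-the-art MLLMs with similar parameter scales, with notable strengths in mathematical reasoning and science problem-solving. Project page: https://mm-vl.github.io/corvid.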

[77] Towards High-Resolution 3D Anomaly Detection: A Scalable Dataset and Real-Time Framework for Subtle Industrial Defects

Yuqi Cheng,Yihan Sun,Hui Zhang,Weiming Shen,Yunkang Cao

Main category: cs.CV

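TL;DR: This paper proposes Simple3D, a new framework for 3D anomaly detection, and MiniShift, a new high-resolution dataset, enabling efficient and accurate real-time detection.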

Details Motivation: Industrial point cloud analysis needs high-resolution spatial data to detect subtle anomalies, but current benchmarks emphasize low-resolution inputs, so a new solution is needed. Method: The paper introduces Simple3D, an efficient framework that combines Multi-scale Neighborhood Descriptors (MSND) and Local Feature Spatial Aggregation (LFSA) to capture intricate geometric details with minimal computational overhead, and presents MiniShift, a high-resolution 3D anomaly detection dataset. Result: Simple3D achieves real-time inference above 20 fps and performs strongly on MiniShift and other benchmarks. Conclusion: Simple3D outperforms existing methods in both accuracy and speed, indicating the key role of high-resolution data and effective feature aggregation in advancing practical 3D anomaly detection. Abstract: In industrial point cloud analysis, detecting subtle anomalies demands high-resolution spatial data, yet prevailing benchmarks emphasize low-resolution inputs. To address this disparity, we propose a scalable pipeline for generating realistic and subtle 3D anomalies. Employing this pipeline, we developed MiniShift, the inaugural high-resolution 3D anomaly detection dataset, encompassing 2,577 point clouds, each with 500,000 points and anomalies occupying less than 1% of the total. We further introduce Simple3D, an efficient framework integrating Multi-scale Neighborhood Descriptors (MSND) and Local Feature Spatial Aggregation (LFSA) to capture intricate geometric details with minimal computational overhead, achieving real-time inference exceeding 20 fps. Extensive evaluations on MiniShift and established benchmarks demonstrate that Simple3D surpasses state-of-the-art methods in both accuracy and speed, highlighting the pivotal role of high-resolution data and effective feature aggregation in advancing practical 3D anomaly detection.

[78] Dual Semantic-Aware Network for Noise Suppressed Ultrasound Video Segmentation

Ling Zhou,Runtian Yuan,Yi Liu,Yuejie Zhang,Rui Feng,Shang Gao

Main category: cs.CV

TL;DR: DSANet is a novel framework for ultrasound video segmentation that improves noise robustness and segmentation accuracy by leveraging semantic awareness between local and global features, achieving faster inference speeds than existing approaches.

Details Motivation: Ultrasound imaging is prone to noise, which hampers automated lesion or organ segmentation; DSANet addresses this limitation by improving noise robustness without relying on pixel-level relationships. Method: The Dual Semantic-Aware Network (DSANet) uses two modules: the Adjacent-Frame Semantic-Aware (AFSA) module for feature fusion across adjacent frames, and the Local-and-Global Semantic-Aware (LGSA) module for integrating local and global temporal features. Result: DSANet outperforms state-of-the-art methods in segmentation accuracy on four benchmark datasets and achieves higher inference FPS, surpassing both video-based and some image-based models. Conclusion: DSANet demonstrates superior performance in ultrasound video segmentation by enhancing noise robustness and achieving high inference FPS compared to existing methods. Abstract: Ultrasound imaging is a prevalent diagnostic tool known for its simplicity and non-invasiveness. However, its inherent characteristics often introduce substantial noise, posing considerable challenges for automated lesion or organ segmentation in ultrasound video sequences. To address these limitations, we propose the Dual Semantic-Aware Network (DSANet), a novel framework designed to enhance noise robustness in ultrasound video segmentation by fostering mutual semantic awareness between local and global features. Specifically, we introduce an Adjacent-Frame Semantic-Aware (AFSA) module, which constructs a channel-wise similarity matrix to guide feature fusion across adjacent frames, effectively mitigating the impact of random noise without relying on pixel-level relationships. Additionally, we propose a Local-and-Global Semantic-Aware (LGSA) module that reorganizes and fuses temporal unconditional local features, which capture spatial details independently at each frame, with conditional global features that incorporate temporal context from adjacent frames. This integration facilitates multi-level semantic representation, significantly improving the model's resilience to noise interference. Extensive evaluations on four benchmark datasets demonstrate that DSANet substantially outperforms state-of-the-art methods in segmentation accuracy. Moreover, since our model avoids pixel-level feature dependencies, it achieves significantly higher inference FPS than video-based methods, and even surpasses some image-based models. Code can be found at https://github.com/ZhouL2001/DSANet.
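
The abstract describes a channel-wise similarity matrix that guides feature fusion across adjacent frames. One way such a mechanism could look is sketched below. This is an illustrative guess, not the authors' AFSA module: the function name `channel_similarity_fusion`, the cosine similarity, the softmax weighting, and the residual sum are all assumptions made for this example.

```python
import numpy as np

def channel_similarity_fusion(feat_a, feat_b):
    """Fuse two (C, H, W) feature maps using a (C, C) channel similarity matrix.

    Hypothetical sketch: each channel of feat_a receives a weighted mix of
    feat_b's channels, with weights from a softmax over cosine similarities.
    """
    c = feat_a.shape[0]
    a = feat_a.reshape(c, -1)  # flatten spatial dimensions: (C, H*W)
    b = feat_b.reshape(c, -1)
    # Row-normalize so the matrix product yields cosine similarities.
    a_n = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b_n = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    sim = a_n @ b_n.T  # (C, C), no pixel-to-pixel comparison involved
    # Numerically stable softmax over the channels of the adjacent frame.
    weights = np.exp(sim - sim.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    fused = a + weights @ b  # residual fusion (an assumption)
    return sim, fused.reshape(feat_a.shape)

rng = np.random.default_rng(0)
frame_t = rng.normal(size=(4, 8, 8))
frame_next = rng.normal(size=(4, 8, 8))
sim, fused = channel_similarity_fusion(frame_t, frame_next)
```

The similarity matrix has size C x C regardless of image resolution, which is consistent with the abstract's point that avoiding pixel-level dependencies keeps inference fast.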

[79] Bluish Veil Detection and Lesion Classification using Custom Deep Learnable Layers with Explainable Artificial Intelligence (XAI)

M. A. Rasel,Sameem Abdul Kareem,Zhenli Kwan,Shin Shen Yong,Unaizah Obaidellah

Main category: cs.CV

TL;DR: This paper introduces a novel DCNN model enhanced by an XAI algorithm to improve the detection of the blue-white veil (BWV) feature in skin lesions, leading to better early diagnosis of melanoma.

Details Motivation: Melanoma is one of the deadliest types of skin cancer, and detecting the critical bluish, blue-whitish, or blue-white veil (BWV) feature in dermatological images remains challenging due to limited research. This motivates the development of an advanced model for improved BWV detection. Method: This study uses a non-annotated skin lesion dataset converted into an annotated dataset through a proposed imaging algorithm based on color threshold techniques. A Deep Convolutional Neural Network (DCNN) is designed and trained separately on three individual and combined dermoscopic datasets using custom layers. Additionally, an explainable artificial intelligence (XAI) algorithm interprets the DCNN's decision-making process regarding BWV detection. Result: The proposed DCNN model demonstrates superior performance compared to conventional BWV detection models across different datasets, achieving testing accuracies of 85.71% on the augmented PH2 dataset, 95.00% on the augmented ISIC archive dataset, 95.05% on the combined augmented (PH2+ISIC archive) dataset, and 90.00% on the Derm7pt dataset. Conclusion: The study concludes that the proposed DCNN model, combined with an XAI algorithm, significantly improves the detection of BWV in skin lesions, outperforming existing models and offering a reliable tool for early melanoma diagnosis. Abstract: Melanoma, one of the deadliest types of skin cancer, accounts for thousands of fatalities globally. The bluish, blue-whitish, or blue-white veil (BWV) is a critical feature for diagnosing melanoma, yet research into detecting BWV in dermatological images is limited. This study utilizes a non-annotated skin lesion dataset, which is converted into an annotated dataset using a proposed imaging algorithm based on color threshold techniques on lesion patches and color palettes. A Deep Convolutional Neural Network (DCNN) is designed and trained separately on three individual and combined dermoscopic datasets, using custom layers instead of standard activation function layers. The model is developed to categorize skin lesions based on the presence of BWV. The proposed DCNN demonstrates superior performance compared to conventional BWV detection models across different datasets. The model achieves a testing accuracy of 85.71% on the augmented PH2 dataset, 95.00% on the augmented ISIC archive dataset, 95.05% on the combined augmented (PH2+ISIC archive) dataset, and 90.00% on the Derm7pt dataset. An explainable artificial intelligence (XAI) algorithm is subsequently applied to interpret the DCNN's decision-making process regarding BWV detection. The proposed approach, coupled with XAI, significantly improves the detection of BWV in skin lesions, outperforming existing models and providing a robust tool for early melanoma diagnosis.

[80] Objectomaly: Objectness-Aware Refinement for OoD Segmentation with Structural Consistency and Boundary Precision

Jeonghoon Song,Sunghun Kim,Jaegyun Im,Byeongjoon Noh

Main category: cs.CV

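TL;DR: This paper proposes Objectomaly, a new objectness-aware refinement framework for out-of-distribution (OoD) segmentation in safety-sensitive applications such as autonomous driving.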

Details Motivation: Existing mask-based methods suffer from imprecise boundaries, inconsistent anomaly scores within objects, and false positives caused by background noise. Method: The paper proposes Objectomaly, an objectness-aware refinement framework with three stages: Coarse Anomaly Scoring (CAS), Objectness-Aware Score Calibration (OASC), and Meticulous Boundary Precision (MBP). Result: The method achieves state-of-the-art performance on key OoD segmentation benchmarks, including SMIYC AnomalyTrack/ObstacleTrack and RoadAnomaly, improving both pixel-level and component-level metrics. Conclusion: Objectomaly achieves state-of-the-art performance on OoD segmentation, supported by ablation studies and qualitative results on real-world driving videos. Abstract: Out-of-Distribution (OoD) segmentation is critical for safety-sensitive applications like autonomous driving. However, existing mask-based methods often suffer from boundary imprecision, inconsistent anomaly scores within objects, and false positives from background noise. We propose Objectomaly, an objectness-aware refinement framework that incorporates object-level priors. Objectomaly consists of three stages: (1) Coarse Anomaly Scoring (CAS) using an existing OoD backbone, (2) Objectness-Aware Score Calibration (OASC) leveraging SAM-generated instance masks for object-level score normalization, and (3) Meticulous Boundary Precision (MBP) applying Laplacian filtering and Gaussian smoothing for contour refinement. Objectomaly achieves state-of-the-art performance on key OoD segmentation benchmarks, including SMIYC AnomalyTrack/ObstacleTrack and RoadAnomaly, improving both pixel-level (AuPRC up to 96.99, FPR95 down to 0.07) and component-level (F1-score up to 83.44) metrics. Ablation studies and qualitative results on real-world driving videos further validate the robustness and generalizability of our method. Code will be released upon publication.
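
The second stage in the abstract, object-level score normalization over instance masks, can be illustrated with a small sketch. This is not the authors' OASC implementation: the function name `calibrate_scores`, the choice of the per-mask mean as the normalization, and the assumption that masks arrive as boolean arrays (for example from a segmenter such as SAM) are all assumptions made here.

```python
import numpy as np

def calibrate_scores(scores, masks):
    """Make anomaly scores consistent inside each instance mask.

    Hypothetical sketch: every pixel covered by a mask receives the mean
    of the original scores under that mask. Pixels outside all masks keep
    their coarse score.
    """
    calibrated = scores.astype(float).copy()
    for mask in masks:
        if mask.any():  # skip empty masks
            calibrated[mask] = scores[mask].mean()
    return calibrated

# Coarse per-pixel anomaly scores with inconsistent values inside one object.
scores = np.array([[0.9, 0.1, 0.0],
                   [0.7, 0.3, 0.0],
                   [0.0, 0.0, 0.2]])
mask = np.zeros_like(scores, dtype=bool)
mask[:2, :2] = True  # one instance covering the top-left 2x2 block

calibrated = calibrate_scores(scores, [mask])
```

Averaging within a mask is only one possible normalization; it targets the inconsistency of scores inside an object that the abstract names as a weakness of existing methods.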

[81] Degradation-Agnostic Statistical Facial Feature Transformation for Blind Face Restoration in Adverse Weather Conditions

Chang-Hwan Son

Main category: cs.CV

TL;DR: This paper proposes a new method for improving face recognition in bad weather by using a GAN-based framework with specialized modules for restoring facial features.

Details Motivation: Adverse weather significantly degrades image quality, which in turn reduces recognition accuracy. Although recent face image restoration (FIR) models based on generative adversarial networks (GANs) and diffusion models have shown progress, their performance remains limited due to the lack of dedicated modules that explicitly address weather-induced degradations. Method: A novel GAN-based blind FIR framework that integrates two key components: local Statistical Facial Feature Transformation (SFFT) and Degradation-Agnostic Feature Embedding (DAFE). Result: Experimental results demonstrate that the proposed degradation-agnostic SFFT model outperforms existing state-of-the-art FIR methods based on GAN and diffusion models, particularly in suppressing texture distortions and accurately reconstructing facial structures. Conclusion: The proposed degradation-agnostic SFFT model outperforms existing state-of-the-art FIR methods based on GAN and diffusion models, particularly in suppressing texture distortions and accurately reconstructing facial structures. Both the SFFT and DAFE modules are empirically validated in enhancing structural fidelity and perceptual quality in face restoration under challenging weather scenarios. Abstract: With the increasing deployment of intelligent CCTV systems in outdoor environments, there is a growing demand for face recognition systems optimized for challenging weather conditions. Adverse weather significantly degrades image quality, which in turn reduces recognition accuracy. Although recent face image restoration (FIR) models based on generative adversarial networks (GANs) and diffusion models have shown progress, their performance remains limited due to the lack of dedicated modules that explicitly address weather-induced degradations. This leads to distorted facial textures and structures. To address these limitations, we propose a novel GAN-based blind FIR framework that integrates two key components: local Statistical Facial Feature Transformation (SFFT) and Degradation-Agnostic Feature Embedding (DAFE). The local SFFT module enhances facial structure and color fidelity by aligning the local statistical distributions of low-quality (LQ) facial regions with those of high-quality (HQ) counterparts. Complementarily, the DAFE module enables robust statistical facial feature extraction under adverse weather conditions by aligning LQ and HQ encoder representations, thereby making the restoration process adaptive to severe weather-induced degradations. Experimental results demonstrate that the proposed degradation-agnostic SFFT model outperforms existing state-of-the-art FIR methods based on GAN and diffusion models, particularly in suppressing texture distortions and accurately reconstructing facial structures. Furthermore, both the SFFT and DAFE modules are empirically validated in enhancing structural fidelity and perceptual quality in face restoration under challenging weather scenarios.

[82] Temporal Unlearnable Examples: Preventing Personal Video Data from Unauthorized Exploitation by Object Tracking

Qiangqiang Wu,Yi Yu,Chenqi Kong,Ziquan Liu,Jia Wan,Haoliang Li,Alex C. Kot,Antoni B. Chan

Main category: cs.CV

TL;DR: This paper introduces a novel framework to protect private video data from unauthorized use in Visual Object Tracking by generating Temporal Unlearnable Examples, achieving excellent scalability and performance.

Details Motivation: To address the lack of privacy protection for personal videos used in training Visual Object Tracking models, as existing solutions focus mainly on image-based tasks. Method: A generative framework for creating Temporal Unlearnable Examples (TUEs) with a temporal contrastive loss to disrupt deep trackers' learning process while maintaining efficiency. Result: The proposed method achieves state-of-the-art performance in protecting video data privacy while being scalable for large-scale datasets. Conclusion: The paper concludes that their proposed framework successfully prevents unauthorized exploitation of personal video data in Visual Object Tracking, offering scalability and strong transferability across models and datasets. Abstract: With the rise of social media, vast amounts of user-uploaded videos (e.g., YouTube) are utilized as training data for Visual Object Tracking (VOT). However, the VOT community has largely overlooked video data-privacy issues, as many private videos have been collected and used for training commercial models without authorization. To alleviate these issues, this paper presents the first investigation on preventing personal video data from unauthorized exploitation by deep trackers. Existing methods for preventing unauthorized data use primarily focus on image-based tasks (e.g., image classification); directly applying them to videos reveals several limitations, including inefficiency, limited effectiveness, and poor generalizability. To address these issues, we propose a novel generative framework for generating Temporal Unlearnable Examples (TUEs), whose efficient computation makes it scalable for use on large-scale video datasets. Trackers trained with TUEs rely heavily on unlearnable noise for temporal matching, ignoring the original data structure and thus ensuring training video data-privacy. To enhance the effectiveness of TUEs, we introduce a temporal contrastive loss, which further corrupts the learning of existing trackers when using our TUEs for training. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in video data-privacy protection, with strong transferability across VOT models, datasets, and temporal matching tasks.

[83] Driving by Hybrid Navigation: An Online HD-SD Map Association Framework and Benchmark for Autonomous Vehicles

Jiaxu Wan,Xu Wang,Mengwei Xie,Xinyuan Chang,Xinran Liu,Zheng Pan,Mu Xu,Ding Yuan

Main category: cs.CV

TL;DR: This paper introduces OMA, a benchmark for online map association, and proposes the Map Association Transformer framework to improve hybrid navigation for autonomous vehicles.

Details Motivation: Recent work on autonomous vehicle navigation focuses on constructing online high-definition (HD) maps while neglecting their association with global standard-definition (SD) maps. This oversight creates challenges in real-world applications due to the lack of hybrid navigation capabilities. Method: The authors introduced the Online Map Association (OMA) benchmark containing 480k roads and 260k lane paths. They also proposed a baseline method called Map Association Transformer, which uses path-aware attention and spatial attention mechanisms to understand geometric and topological correspondences. Result: The paper presents OMA as the first benchmark for hybrid navigation-oriented online map association. It also introduces the Map Association Transformer framework, which demonstrates effective understanding of geometric and topological relationships in maps. Conclusion: The paper concludes that the proposed Map Association Transformer framework and the OMA benchmark significantly enhance the navigation and planning capabilities of autonomous vehicles by effectively associating hybrid online maps. Abstract: Autonomous vehicles rely on global standard-definition (SD) maps for road-level route planning and online local high-definition (HD) maps for lane-level navigation. However, recent work concentrates on constructing online HD maps, often overlooking the association of global SD maps with online HD maps for hybrid navigation, which makes it challenging to use online HD maps in the real world. Observing this gap in the navigation capability of autonomous vehicles, we introduce Online Map Association (OMA), the first benchmark for the association of hybrid navigation-oriented online maps, which enhances the planning capabilities of autonomous vehicles. Based on existing datasets, OMA contains 480k roads and 260k lane paths and provides the corresponding metrics to evaluate the performance of the model. Additionally, we propose a novel framework, named Map Association Transformer, as the baseline method, using path-aware attention and spatial attention mechanisms to enable the understanding of geometric and topological correspondences. The code and dataset can be accessed at https://github.com/WallelWan/OMA-MAT.

[84] Divergence Minimization Preference Optimization for Diffusion Model Alignment

Binxu Li,Minkai Xu,Meihua Dang,Stefano Ermon

Main category: cs.CV

TL;DR: This paper introduces DMPO, a novel method for aligning diffusion models using divergence minimization, demonstrating superior performance over existing approaches.

Details Motivation: Existing preference optimization methods often get trapped in suboptimal mean-seeking optimization; thus, there is a need for more effective alignment techniques. Method: Divergence Minimization Preference Optimization (DMPO) minimizes reverse KL divergence to align diffusion models, offering rigorous analysis and experiments to validate its effectiveness. Result: Diffusion models fine-tuned with DMPO outperform or match existing techniques, surpassing baselines by at least 64.6% in PickScore across datasets. Conclusion: DMPO provides a robust and elegant pathway for preference alignment in diffusion models, bridging theory with practical performance. Abstract: Diffusion models have achieved remarkable success in generating realistic and versatile images from text prompts. Inspired by the recent advancements of language models, there is an increasing interest in further improving the models by aligning with human preferences. However, we investigate alignment from a divergence minimization perspective and reveal that existing preference optimization methods are typically trapped in suboptimal mean-seeking optimization. In this paper, we introduce Divergence Minimization Preference Optimization (DMPO), a novel and principled method for aligning diffusion models by minimizing reverse KL divergence, which asymptotically enjoys the same optimization direction as original RL. We provide rigorous analysis to justify the effectiveness of DMPO and conduct comprehensive experiments to validate its empirical strength across both human evaluations and automatic metrics. Our extensive results show that diffusion models fine-tuned with DMPO can consistently outperform or match existing techniques, specifically outperforming all existing diffusion alignment baselines by at least 64.6% in PickScore across all evaluation datasets, demonstrating the method's superiority in aligning generative behavior with desired outputs. Overall, DMPO unlocks a robust and elegant pathway for preference alignment, bridging principled theory with practical performance in diffusion models.
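
The contrast between mean-seeking optimization and reverse KL minimization rests on the two directions of the KL divergence. The standard definitions are written out below for reference; the symbols $p$ (target distribution) and $q_\theta$ (model distribution) are illustrative and are not taken from the paper's notation.

```latex
% p: target distribution, q_\theta: model distribution (illustrative symbols)
\begin{align}
D_{\mathrm{KL}}(p \,\|\, q_\theta)
  &= \mathbb{E}_{x \sim p}\!\left[\log \frac{p(x)}{q_\theta(x)}\right]
  && \text{(forward KL, mean-seeking)} \\
D_{\mathrm{KL}}(q_\theta \,\|\, p)
  &= \mathbb{E}_{x \sim q_\theta}\!\left[\log \frac{q_\theta(x)}{p(x)}\right]
  && \text{(reverse KL, mode-seeking)}
\end{align}
```

The forward direction penalizes the model wherever the target has mass that the model misses, so the model tends to spread out and average over modes. The reverse direction penalizes the model for placing mass where the target has little, so the model tends to concentrate on high-probability modes.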

[85] GGMotion: Group Graph Dynamics-Kinematics Networks for Human Motion Prediction

Shuaijin Wan,Huaijiang Sun

Main category: cs.CV

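TL;DR: This paper proposes GGMotion, a new method that uses a group graph dynamics-kinematics network to better capture the physical properties of human motion, improving the accuracy of short-term motion prediction.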

Details Motivation: Existing methods typically represent the human pose as an abstract graph structure, neglecting the intrinsic physical dependencies between joints, which increases learning difficulty and makes models prone to generating unrealistic motions. This paper aims to address the problem by better leveraging dynamics and kinematics priors. Method: GGMotion uses a group graph dynamics-kinematics network with a radial field that preserves geometric equivariance in 3D space and aggregates joint features through spatial and temporal edges to capture more comprehensive spatio-temporal dependencies. It also introduces inter-group and intra-group interaction modules, uses equivariant multilayer perceptrons (MLP) for dynamics-kinematics propagation, and applies an auxiliary loss to supervise motion priors during training. Result: Extensive experiments on three standard benchmarks, Human3.6M, CMU-Mocap, and 3DPW, demonstrate the effectiveness and superiority of the method, with a significant performance gain in short-term motion prediction. Conclusion: The paper proposes GGMotion, a new method for better modeling the dynamics and kinematics priors of human motion, and validates its effectiveness and superiority in short-term motion prediction through extensive experiments. Abstract: Human motion is a continuous physical process in 3D space, governed by complex dynamic and kinematic constraints. Existing methods typically represent the human pose as an abstract graph structure, neglecting the intrinsic physical dependencies between joints, which increases learning difficulty and makes the model prone to generating unrealistic motions. In this paper, we propose GGMotion, a group graph dynamics-kinematics network that models human topology in groups to better leverage dynamics and kinematics priors. To preserve the geometric equivariance in 3D space, we propose a novel radial field for the graph network that captures more comprehensive spatio-temporal dependencies by aggregating joint features through spatial and temporal edges. Inter-group and intra-group interaction modules are employed to capture the dependencies of joints at different scales. Combined with equivariant multilayer perceptrons (MLP), joint position features are updated in each group through parallelized dynamics-kinematics propagation to improve physical plausibility. Meanwhile, we introduce an auxiliary loss to supervise motion priors during training. Extensive experiments on three standard benchmarks, including Human3.6M, CMU-Mocap, and 3DPW, demonstrate the effectiveness and superiority of our approach, achieving a significant performance margin in short-term motion prediction. The code is available at https://github.com/inkcat520/GGMotion.git.

[86] MUVOD: A Novel Multi-view Video Object Segmentation Dataset and A Benchmark for 3D Segmentation

Bangning Wei,Joshua Maraval,Meriem Outtas,Kidiyo Kpalma,Nicolas Ramin,Lu Zhang

Main category: cs.CV

TL;DR: This paper presents MUVOD, a new multi-view video dataset for 4D and 3D object segmentation, intended to advance research on dynamic scene segmentation.

Details Motivation: Existing datasets fall short for 4D object segmentation of dynamic scenes, so a large-scale, accurately labelled dataset is needed to advance the field. Method: The authors build a dataset of 17 scenes with 7830 RGB images and corresponding 4D motion segmentation masks, and propose an evaluation metric and a baseline segmentation method. Result: The MUVOD dataset contains 459 instances across 73 categories and provides a subset for the 3D object segmentation task covering 50 objects under different conditions. Conclusion: MUVOD provides a new multi-view video dataset and benchmark to advance research on 4D and 3D object segmentation methods in dynamic scenes. Abstract: The application of methods based on Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3D GS) has steadily gained popularity in the field of 3D object segmentation in static scenes. These approaches demonstrate efficacy in a range of 3D scene understanding and editing tasks. Nevertheless, the 4D object segmentation of dynamic scenes remains an underexplored field due to the absence of a sufficiently extensive and accurately labelled multi-view video dataset. In this paper, we present MUVOD, a new multi-view video dataset for training and evaluating object segmentation in reconstructed real-world scenarios. The 17 selected scenes, describing various indoor or outdoor activities, are collected from different sources of datasets originating from various types of camera rigs. Each scene contains a minimum of 9 views and a maximum of 46 views. We provide 7830 RGB images (30 frames per video) with their corresponding segmentation mask in 4D motion, meaning that any object of interest in the scene could be tracked across temporal frames of a given view or across different views belonging to the same camera rig. This dataset, which contains 459 instances of 73 categories, is intended as a basic benchmark for the evaluation of multi-view video segmentation methods. We also present an evaluation metric and a baseline segmentation approach to encourage and evaluate progress in this evolving field. Additionally, we propose a new benchmark for the 3D object segmentation task with a subset of annotated multi-view images selected from our MUVOD dataset. This subset contains 50 objects of different conditions in different scenarios, providing a more comprehensive analysis of state-of-the-art 3D object segmentation methods. Our proposed MUVOD dataset is available at https://volumetric-repository.labs.b-com.com/#/muvod.

[87] Spline Deformation Field

Mingyang Song,Yang Zhang,Marko Mihajlovic,Siyu Tang,Markus Gross,Tunç Ozan Aydın

Main category: cs.CV

TL;DR: This paper introduces a spline-based trajectory representation with a novel spatial encoding strategy, improving temporal interpolation and dynamic scene reconstruction while maintaining spatial and temporal coherence.

Details Motivation: Current trajectory modeling methods suffer from spatial incoherence due to neural network biases, reliance on heuristic node initialization, or limited exploration of implicit representations for sparse temporal signals. Method: A spline-based trajectory representation with a novel low-rank time-variant spatial encoding is introduced, enabling efficient analytical derivation of velocities and accelerations while reducing temporal fluctuations. Result: The method achieves superior performance in temporal interpolation with sparse inputs and competitive dynamic scene reconstruction quality, enhancing motion coherence without linear blend skinning or rigid constraints. Conclusion: The proposed spline-based trajectory representation outperforms existing methods in temporal interpolation and dynamic scene reconstruction, ensuring spatial and temporal coherence without heuristic constraints. Abstract: Trajectory modeling of dense points usually employs implicit deformation fields, represented as neural networks that map coordinates to relate canonical spatial positions to temporal offsets. However, the inductive biases inherent in neural networks can hinder spatial coherence in ill-posed scenarios. Current methods focus either on enhancing encoding strategies for deformation fields, often resulting in opaque and less intuitive models, or adopt explicit techniques like linear blend skinning, which rely on heuristic-based node initialization. Additionally, the potential of implicit representations for interpolating sparse temporal signals remains under-explored. To address these challenges, we propose a spline-based trajectory representation, where the number of knots explicitly determines the degrees of freedom. This approach enables efficient analytical derivation of velocities, preserving spatial coherence and accelerations, while mitigating temporal fluctuations. 
To model knot characteristics in both spatial and temporal domains, we introduce a novel low-rank time-variant spatial encoding, replacing conventional coupled spatiotemporal techniques. Our method demonstrates superior performance in temporal interpolation for fitting continuous fields with sparse inputs. Furthermore, it achieves competitive dynamic scene reconstruction quality compared to state-of-the-art methods while enhancing motion coherence without relying on linear blend skinning or as-rigid-as-possible constraints.
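The abstract's claim that a spline representation gives velocities and accelerations analytically can be illustrated with a small sketch. It assumes a 1D cubic Hermite segment between two knots; the paper's actual spline family, knot parameterisation, and spatial encoding are not described in this digest, and the name `hermite_segment` is hypothetical.

```python
def hermite_segment(p0, p1, m0, m1, t0, t1, t):
    """Evaluate one cubic Hermite segment between knots (t0, p0) and (t1, p1).

    m0 and m1 are the knot tangents (dp/dt). Returns (position, velocity,
    acceleration) at time t, all in closed form (no finite differences).
    This is an illustrative assumption, not the paper's formulation.
    """
    dt = t1 - t0
    s = (t - t0) / dt  # normalised time in [0, 1]

    # Hermite basis polynomials and their first and second derivatives in s.
    h = (2*s**3 - 3*s**2 + 1, s**3 - 2*s**2 + s, -2*s**3 + 3*s**2, s**3 - s**2)
    dh = (6*s**2 - 6*s, 3*s**2 - 4*s + 1, -6*s**2 + 6*s, 3*s**2 - 2*s)
    ddh = (12*s - 6, 6*s - 4, -12*s + 6, 6*s - 2)

    # Tangents are scaled by dt because the basis is defined on s, not t.
    coeffs = (p0, dt * m0, p1, dt * m1)
    pos = sum(b * c for b, c in zip(h, coeffs))
    vel = sum(b * c for b, c in zip(dh, coeffs)) / dt      # chain rule: ds/dt = 1/dt
    acc = sum(b * c for b, c in zip(ddh, coeffs)) / dt**2
    return pos, vel, acc


# A point moving from 0 to 2 over two seconds, at rest at both knots.
print(hermite_segment(0.0, 2.0, 0.0, 0.0, 0.0, 2.0, 1.0))
```

In this form the degrees of freedom are exactly the knot values and tangents, which matches the abstract's remark that the number of knots determines them.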

[88] MAPEX: Modality-Aware Pruning of Experts for Remote Sensing Foundation Models

Joelle Hanna,Linus Scheibenreif,Damian Borth

Main category: cs.CV

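TL;DR: This paper proposes MAPEX, a new remote sensing foundation model that uses a mixture-of-modality-experts architecture and modality-aware pruning to address modality mismatch and model efficiency in remote sensing tasks, and shows strong performance on multiple datasets.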

Details Motivation: Remote sensing tasks usually require specific sensor modalities, whereas existing foundation models focus mainly on optical RGB or multispectral data, which creates a modality mismatch in practical applications. The large size of existing models also raises the cost of fine-tuning and deployment. Method: The authors propose MAPEX, a remote sensing foundation model built on a mixture-of-modality-experts architecture that combines a modality-conditioned token routing mechanism with modality-aware pruning to resolve the mismatch between application modalities and pre-training data. Result: MAPEX performs strongly on multiple remote sensing datasets compared with fully supervised training and state-of-the-art remote sensing foundation models, while modality pruning simplifies fine-tuning and deployment. Conclusion: MAPEX is a remote sensing foundation model based on a mixture of modality experts; through modality-conditioned token routing and modality-aware pruning it yields efficient modality-specific models for a given task, and its performance advantage is validated on multiple remote sensing datasets. Abstract: Remote sensing data is commonly used for tasks such as flood mapping, wildfire detection, or land-use studies. For each task, scientists carefully choose appropriate modalities or leverage data from purpose-built instruments. Recent work on remote sensing foundation models pre-trains computer vision models on large amounts of remote sensing data. These large-scale models tend to focus on specific modalities, often optical RGB or multispectral data. For many important applications, this introduces a mismatch between the application modalities and the pre-training data. Moreover, the large size of foundation models makes them expensive and difficult to fine-tune on typically small datasets for each task. We address this mismatch with MAPEX, a remote sensing foundation model based on mixture-of-modality experts. MAPEX is pre-trained on multi-modal remote sensing data with a novel modality-conditioned token routing mechanism that elicits modality-specific experts. To apply the model on a specific task, we propose a modality-aware pruning technique, which only retains experts specialized for the task modalities. This yields efficient modality-specific models while simplifying fine-tuning and deployment for the modalities of interest. We experimentally validate MAPEX on diverse remote sensing datasets and show strong performance compared to fully supervised training and state-of-the-art remote sensing foundation models. Code is available at https://github.com/HSG-AIML/MAPEX.
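The pruning step described in the abstract, keeping only the experts specialized for the task's modalities, can be sketched as a simple filter. This is a toy illustration only: the `Expert` record, the modality names, and `prune_experts` are hypothetical, and the real MAPEX code may represent experts and any shared components quite differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expert:
    name: str
    modality: str  # modality this expert specialised in during pre-training (assumed label)


def prune_experts(experts, task_modalities):
    """Keep only the experts whose modality is needed by the downstream task."""
    keep = set(task_modalities)
    return [e for e in experts if e.modality in keep]


# Hypothetical expert pool covering three sensor modalities.
pool = [
    Expert("e0", "optical"),
    Expert("e1", "optical"),
    Expert("e2", "sar"),
    Expert("e3", "elevation"),
]
pruned = prune_experts(pool, ["sar"])
print([e.name for e in pruned])
```

The pruned model carries fewer expert parameters, which is the source of the cheaper fine-tuning and deployment the abstract mentions.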

[89] Beyond the Linear Separability Ceiling

Enrico Vompa,Tanel Tammet,Mohit Vaishnav

Main category: cs.CV

TL;DR: This paper explores the limitations of Visual-Language Models (VLMs) in abstract reasoning due to the linear separability of their visual embeddings. It identifies that the issue lies within the language model's reasoning pathways and emphasizes the need for task-dependent alignment strategies to overcome this limitation.

Details Motivation: The motivation behind this work is to understand and address the 'linear reasoning bottleneck' observed in Visual-Language Models (VLMs), where their performance on abstract reasoning tasks seems limited by the linear separability of their visual embeddings. Method: The research introduces the Linear Separability Ceiling (LSC) to assess the performance of a linear classifier on a VLM's visual embeddings. The study uses postfix tuning as a methodological control to evaluate the effectiveness of different interventions on the reasoning pathways of VLMs. Result: The research finds that the linear reasoning bottleneck is widespread and stems from failures in the language model's reasoning pathways rather than poor perception. It demonstrates that for complex relational tasks requiring deeper adaptation, explicitly improving representation quality may lead to failure on new prompt formats, despite well-separated embeddings. Conclusion: The study concludes that the key to enhancing VLMs' reasoning capabilities lies in targeted alignment rather than just improving representation learning. It highlights that while some tasks can be addressed by activating existing pathways, others require deeper adaptation of the model's core weights. Abstract: Most state-of-the-art Visual-Language Models (VLMs) are seemingly limited by the linear separability of their visual embeddings on abstract reasoning tasks. This work investigates this "linear reasoning bottleneck" by introducing the Linear Separability Ceiling (LSC), the performance of a simple linear classifier on a VLM's visual embeddings. We find this bottleneck is widespread and stems not from poor perception, but from failures in the language model's reasoning pathways. We demonstrate this is a solvable alignment issue. The required intervention, however, is task-dependent: activating existing pathways suffices for semantic concepts, while complex relational reasoning requires adapting core model weights. 
Using postfix tuning as a methodological control, we find strong evidence for powerful, dormant reasoning pathways within VLMs. However, for complex relational tasks requiring deeper adaptation, explicitly improving representation quality causes the model to fail on new prompt formats despite its embeddings remaining well separated. Ultimately, this work provides a new lens for VLM analysis, showing that robust reasoning is a matter of targeted alignment, not simply improved representation learning.
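The LSC is defined above as the performance of a simple linear classifier on a VLM's visual embeddings. A minimal sketch of that idea follows, assuming a perceptron as the linear classifier, binary labels in {-1, +1}, and toy 2D "embeddings"; the paper's actual classifier, tasks, and evaluation protocol are not given in this digest, and both function names are hypothetical.

```python
def fit_perceptron(embeddings, labels, epochs=50):
    """Fit a linear classifier (perceptron) on embeddings with labels in {-1, +1}."""
    w, b = [0.0] * len(embeddings[0]), 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(embeddings, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified (or on the boundary): update
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:  # converged on the training set
            break
    return w, b


def linear_probe_accuracy(w, b, embeddings, labels):
    """Accuracy of the linear classifier on held-out embeddings."""
    correct = 0
    for x, y in zip(embeddings, labels):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        correct += (1 if score > 0 else -1) == y
    return correct / len(labels)


# Toy stand-ins for frozen visual embeddings (two classes, linearly separable).
train_x = [(1.0, 0.5), (2.0, -0.3), (-1.0, 0.2), (-1.5, -0.4)]
train_y = [1, 1, -1, -1]
test_x = [(1.5, 0.0), (-2.0, 0.1)]
test_y = [1, -1]

w, b = fit_perceptron(train_x, train_y)
print(linear_probe_accuracy(w, b, test_x, test_y))
```

Comparing this probe accuracy with the full VLM's accuracy on the same task is what lets one ask whether the model's reasoning exceeds, matches, or falls short of what its embeddings already make linearly available.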

[90] Diffusion-Guided Knowledge Distillation for Weakly-Supervised Low-Light Semantic Segmentation

Chunyan Wang,Dong Zhang,Jinhui Tang

Main category: cs.CV

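TL;DR: This paper proposes DGKD-WLSS, a new framework that combines diffusion-guided knowledge distillation with depth-guided feature fusion to address image quality degradation and limited supervision in weakly-supervised semantic segmentation under low-light conditions, substantially improving model performance.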

Details Motivation: Existing methods perform well under normal lighting but degrade in low-light environments because of reduced image quality and the inherent constraints of weak supervision. This produces unreliable class activation maps and semantically ambiguous pseudo-labels, which impair the model's ability to learn. Method: The authors propose a new framework, DGKD-WLSS, comprising Diffusion-Guided Knowledge Distillation (DGKD) and Depth-Guided Feature Fusion (DGF2). DGKD aligns normal-light and low-light features using diffusion-based denoising and knowledge distillation, while DGF2 uses depth maps as geometric priors to strengthen structural feature learning. Result: DGKD-WLSS performs strongly on weakly-supervised semantic segmentation in low-light environments; experiments confirm its effectiveness and state-of-the-art performance. Conclusion: By combining DGKD and DGF2, DGKD-WLSS addresses both the image quality problem and the limits of weak supervision in low-light weakly-supervised semantic segmentation, achieving state-of-the-art performance. Abstract: Weakly-supervised semantic segmentation aims to assign category labels to each pixel using weak annotations, significantly reducing manual annotation costs. Although existing methods have achieved remarkable progress in well-lit scenarios, their performance significantly degrades in low-light environments due to two fundamental limitations: severe image quality degradation (e.g., low contrast, noise, and color distortion) and the inherent constraints of weak supervision. These factors collectively lead to unreliable class activation maps and semantically ambiguous pseudo-labels, ultimately compromising the model's ability to learn discriminative feature representations. To address these problems, we propose Diffusion-Guided Knowledge Distillation for Weakly-Supervised Low-light Semantic Segmentation (DGKD-WLSS), a novel framework that synergistically combines Diffusion-Guided Knowledge Distillation (DGKD) with Depth-Guided Feature Fusion (DGF2). DGKD aligns normal-light and low-light features via diffusion-based denoising and knowledge distillation, while DGF2 integrates depth maps as illumination-invariant geometric priors to enhance structural feature learning. Extensive experiments demonstrate the effectiveness of DGKD-WLSS, which achieves state-of-the-art performance in weakly supervised semantic segmentation tasks under low-light conditions. The source code has been released at: https://github.com/ChunyanWang1/DGKD-WLSS.

[91] NexViTAD: Few-shot Unsupervised Cross-Domain Defect Detection via Vision Foundation Models and Multi-Task Learning

Tianwei Mu,Feiyu Duan,Bo Zhou,Dan Xue,Manhong Huang

Main category: cs.CV

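TL;DR: This paper proposes NexViTAD, a few-shot cross-domain anomaly detection framework that uses a shared subspace projection mechanism and a multi-task learning module, and reports strong performance on cross-domain defect detection.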

Details Motivation: To address domain-shift challenges in industrial anomaly detection and improve cross-domain knowledge transfer. Method: The framework uses a shared subspace projection mechanism and a multi-task learning module, combines the Hiera and DINO-v2 pre-trained models, and infers anomaly scores with Sinkhorn-K-means clustering. Result: On the MVTec AD dataset it achieves state-of-the-art performance in the target domains, with an AUC of 97.5%, AP of 70.4%, and PRO of 95.2%. Conclusion: Through a hierarchical adapter module, a shared subspace projection strategy, a multi-task decoder architecture, and anomaly score inference based on Sinkhorn-K-means clustering, NexViTAD makes significant progress in cross-domain defect detection. Abstract: This paper presents a novel few-shot cross-domain anomaly detection framework, Nexus Vision Transformer for Anomaly Detection (NexViTAD), based on vision foundation models, which effectively addresses domain-shift challenges in industrial anomaly detection through innovative shared subspace projection mechanisms and a multi-task learning (MTL) module. The main innovations include: (1) a hierarchical adapter module that adaptively fuses complementary features from Hiera and DINO-v2 pre-trained models, constructing more robust feature representations; (2) a shared subspace projection strategy that enables effective cross-domain knowledge transfer through bottleneck dimension constraints and skip connection mechanisms; (3) an MTL decoder architecture that supports simultaneous processing of multiple source domains, significantly enhancing model generalization capabilities; (4) an anomaly score inference method based on Sinkhorn-K-means clustering, combined with Gaussian filtering and adaptive threshold processing for precision at the pixel level. Evaluated on the MVTec AD dataset, NexViTAD delivers state-of-the-art performance with an AUC of 97.5%, AP of 70.4%, and PRO of 95.2% in the target domains, surpassing other recent models, marking a transformative advance in cross-domain defect detection.

[92] HOTA: Hierarchical Overlap-Tiling Aggregation for Large-Area 3D Flood Mapping

Wenfeng Jia,Bin Liang,Yuxi Lu,Attavit Wilaiwongsakul,Muhammad Arif Khan,Lihong Zheng

Main category: cs.CV

TL;DR: This paper introduces HOTA, a novel multi-scale inference approach for 3D flood mapping that improves accuracy and provides detailed inundation data crucial for disaster response.

Details Motivation: Existing flood mapping products often trade spatial detail for coverage or ignore flood depth, necessitating a solution that can provide timely, large-scale, and detailed flood information. Method: HOTA: Hierarchical Overlap-Tiling Aggregation, a plug-and-play, multi-scale inference strategy applied to multispectral Sentinel-2 images during inference, along with a digital elevation model (DEM) differencing method for depth estimation. Result: In the March 2021 Kempsey flood case study, HOTA improved IoU from 73% (U-Net baseline) to 84%, achieving a mean absolute boundary error of less than 0.5 m in the resulting 3D surface. Conclusion: HOTA combined with SegFormer and a dual-constraint depth estimation module can produce accurate, large-area 3D flood maps suitable for rapid disaster response. Abstract: Floods are among the most frequent natural hazards and cause significant social and economic damage. Timely, large-scale information on flood extent and depth is essential for disaster response; however, existing products often trade spatial detail for coverage or ignore flood depth altogether. To bridge this gap, this work presents HOTA: Hierarchical Overlap-Tiling Aggregation, a plug-and-play, multi-scale inference strategy. When combined with SegFormer and a dual-constraint depth estimation module, this approach forms a complete 3D flood-mapping pipeline. HOTA applies overlapping tiles of different sizes to multispectral Sentinel-2 images only during inference, enabling the SegFormer model to capture both local features and kilometre-scale inundation without changing the network weights or retraining. The subsequent depth module is based on a digital elevation model (DEM) differencing method, which refines the 2D mask and estimates flood depth by enforcing (i) zero depth along the flood boundary and (ii) near-constant flood volume with respect to the DEM. 
A case study on the March 2021 Kempsey (Australia) flood shows that HOTA, when coupled with SegFormer, improves IoU from 73\% (U-Net baseline) to 84\%. The resulting 3D surface achieves a mean absolute boundary error of less than 0.5 m. These results demonstrate that HOTA can produce accurate, large-area 3D flood maps suitable for rapid disaster response.
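The inference-time idea in the abstract, running a fixed model over overlapping tiles of several sizes and merging the outputs, can be sketched as follows. This assumes plain averaging of per-pixel probabilities across all tiles and scales; the paper's actual aggregation rule is not stated in this digest, and `tile_starts`, `multiscale_tiled_inference`, and `predict_tile` are hypothetical names.

```python
def tile_starts(length, tile, stride):
    """Start offsets of tiles along one axis so the whole axis is covered."""
    if tile >= length:
        return [0]
    starts = list(range(0, length - tile + 1, stride))
    if starts[-1] != length - tile:
        starts.append(length - tile)  # clamp a final tile to the border
    return starts


def multiscale_tiled_inference(image, predict_tile, tile_sizes, overlap=0.5):
    """Average per-pixel predictions over overlapping tiles of several sizes.

    image: 2D list of pixel values; predict_tile: maps a 2D tile to a 2D list
    of probabilities with the same shape (stands in for a segmentation model).
    The model itself is never retrained; only inference is tiled.
    """
    h, w = len(image), len(image[0])
    acc = [[0.0] * w for _ in range(h)]
    cnt = [[0] * w for _ in range(h)]
    for size in tile_sizes:
        th, tw = min(size, h), min(size, w)
        stride_y = max(1, int(th * (1 - overlap)))
        stride_x = max(1, int(tw * (1 - overlap)))
        for y0 in tile_starts(h, th, stride_y):
            for x0 in tile_starts(w, tw, stride_x):
                tile = [row[x0:x0 + tw] for row in image[y0:y0 + th]]
                probs = predict_tile(tile)
                for dy in range(th):
                    for dx in range(tw):
                        acc[y0 + dy][x0 + dx] += probs[dy][dx]
                        cnt[y0 + dy][x0 + dx] += 1
    return [[acc[y][x] / cnt[y][x] for x in range(w)] for y in range(h)]


# Stand-in "model" that returns the tile itself as probabilities.
image = [[0, 1, 1, 0, 0, 1],
         [1, 1, 0, 0, 1, 0],
         [0, 0, 1, 1, 1, 0],
         [1, 0, 0, 1, 0, 0],
         [0, 1, 0, 0, 1, 1]]
fused = multiscale_tiled_inference(image, lambda t: [[float(v) for v in r] for r in t], [2, 4])
print(fused[0])
```

Small tiles contribute local detail and large tiles contribute wide context; because every pixel receives votes from both, the merged map can reflect both scales without touching the network weights.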

[93] Stable-Hair v2: Real-World Hair Transfer via Multiple-View Diffusion Model

Kuiyuan Sun,Yuxuan Zhang,Jichao Zhang,Jiaming Liu,Wei Wang,Niculae Sebe,Yao Zhao

Main category: cs.CV

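TL;DR: This study proposes Stable-Hair v2, a new diffusion-based multi-view hair transfer framework that, according to the authors, is the first to achieve high-quality, view-consistent hair transfer.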

Details Motivation: Existing diffusion models still fall short at generating consistent, high-quality multi-view hairstyle outputs, which limits their use in real-world applications such as digital humans and virtual avatars. Method: The authors propose a diffusion-based multi-view hair transfer framework with a comprehensive data generation pipeline and a multi-stage training strategy, and introduce polar-azimuth embeddings and temporal attention layers to improve results. Result: Experiments show that the method accurately transfers detailed, realistic hairstyles to the target subject and produces seamless, consistent results across views, significantly outperforming existing methods. Conclusion: Stable-Hair v2 is the first work to use multi-view diffusion models for high-quality, view-consistent hair transfer, establishing a new benchmark for multi-view hair transfer. Abstract: While diffusion-based methods have shown impressive capabilities in capturing diverse and complex hairstyles, their ability to generate consistent and high-quality multi-view outputs -- crucial for real-world applications such as digital humans and virtual avatars -- remains underexplored. In this paper, we propose Stable-Hair v2, a novel diffusion-based multi-view hair transfer framework. To the best of our knowledge, this is the first work to leverage multi-view diffusion models for robust, high-fidelity, and view-consistent hair transfer across multiple perspectives. We introduce a comprehensive multi-view training data generation pipeline comprising a diffusion-based Bald Converter, a data-augment inpainting model, and a face-finetuned multi-view diffusion model to generate high-quality triplet data, including bald images, reference hairstyles, and view-aligned source-bald pairs. Our multi-view hair transfer model integrates polar-azimuth embeddings for pose conditioning and temporal attention layers to ensure smooth transitions between views. To optimize this model, we design a novel multi-stage training strategy consisting of pose-controllable latent IdentityNet training, hair extractor training, and temporal attention training. Extensive experiments demonstrate that our method accurately transfers detailed and realistic hairstyles to source subjects while achieving seamless and consistent results across views, significantly outperforming existing methods and establishing a new benchmark in multi-view hair transfer. Code is publicly available at https://github.com/sunkymepro/StableHairV2.

[94] HiM2SAM: Enhancing SAM2 with Hierarchical Motion Estimation and Memory Optimization towards Long-term Tracking

Ruixiang Chen,Guolei Sun,Yawei Li,Jie Qin,Luca Benini

Main category: cs.CV

TL;DR: This paper improves the SAM2 video object tracking framework with a motion estimation strategy and optimized memory management, achieving better accuracy and robustness without additional training.

Details Motivation: The motivation behind this work is to address key challenges in video object tracking, such as handling occlusions, background clutter, and appearance changes, while maintaining efficiency and avoiding the need for retraining. Method: The paper introduces a hierarchical motion estimation strategy combining linear prediction and non-linear refinement, and optimizes the memory bank by differentiating long-term and short-term memory frames. These changes aim to enhance tracking accuracy and reliability without additional training. Result: Experimental results show consistent improvements across model scales, with significant gains on LaSOT and LaSOText benchmarks. The large model achieved 9.6% and 7.2% relative improvements in AUC over the original SAM2, with even larger gains observed on smaller models. Conclusion: The paper concludes that the enhancements made to the SAM2 framework significantly improve video object tracking performance, especially in challenging scenarios involving occlusions, background clutter, and target reappearance. The proposed method achieves state-of-the-art results with minimal overhead. Abstract: This paper presents enhancements to the SAM2 framework for video object tracking task, addressing challenges such as occlusions, background clutter, and target reappearance. We introduce a hierarchical motion estimation strategy, combining lightweight linear prediction with selective non-linear refinement to improve tracking accuracy without requiring additional training. In addition, we optimize the memory bank by distinguishing long-term and short-term memory frames, enabling more reliable tracking under long-term occlusions and appearance changes. Experimental results show consistent improvements across different model scales. 
Our method achieves state-of-the-art performance on LaSOT and LaSOText with the large model, achieving 9.6% and 7.2% relative improvements in AUC over the original SAM2, and demonstrates even larger relative gains on smaller models, highlighting the effectiveness of our trainless, low-overhead improvements for boosting long-term tracking performance. The code is available at https://github.com/LouisFinner/HiM2SAM.

[95] LOSC: LiDAR Open-voc Segmentation Consolidator

Nermin Samet,Gilles Puy,Renaud Marlet

Main category: cs.CV

TL;DR: LOSC improves open-vocabulary 3D segmentation using enhanced labels from vision-language models, surpassing state-of-the-art results on key autonomous driving benchmarks.

Details Motivation: Classical methods for projecting image semantics onto 3D point clouds result in noisy and sparse labels. This work aims to improve label quality and segmentation performance without relying on existing annotations, enabling open-vocabulary segmentation. Method: The authors propose a method called LOSC, which uses image-based Vision-Language Models (VLMs) to generate labels for lidar scans. These labels are refined by enforcing spatio-temporal consistency and robustness to image-level augmentations before training a 3D network. Result: LOSC achieves superior performance in zero-shot open-vocabulary semantic and panoptic segmentation on two major autonomous driving datasets: nuScenes and SemanticKITTI. Conclusion: The proposed LOSC method outperforms the current state-of-the-art in zero-shot open-vocabulary semantic and panoptic segmentation on nuScenes and SemanticKITTI datasets. Abstract: We study the use of image-based Vision-Language Models (VLMs) for open-vocabulary segmentation of lidar scans in driving settings. Classically, image semantics can be back-projected onto 3D point clouds. Yet, resulting point labels are noisy and sparse. We consolidate these labels to enforce both spatio-temporal consistency and robustness to image-level augmentations. We then train a 3D network based on these refined labels. This simple method, called LOSC, outperforms the SOTA of zero-shot open-vocabulary semantic and panoptic segmentation on both nuScenes and SemanticKITTI, with significant margins.

[96] SpatialViz-Bench: Automatically Generated Spatial Visualization Reasoning Tasks for MLLMs

Siting Wang,Luoyang Sun,Cheng Deng,Kun Shao,Minnan Pei,Zheng Tian,Haifeng Zhang,Jun Wang

Main category: cs.CV

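TL;DR: This paper presents SpatialViz-Bench, a comprehensive multi-modal spatial visualization benchmark for evaluating multi-modal large language models (MLLMs) on spatial visualization tasks, and shows that current models still have major deficiencies on such tasks.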

Details Motivation: Humans can directly imagine and manipulate visual images, an ability known as spatial visualization. Although existing multi-modal large language models support imagination-based reasoning, their spatial visualization ability has not been adequately evaluated. In addition, traditional evaluations often rely on IQ tests or math competition problems that may overlap with training data, which undermines their reliability. Method: The paper introduces SpatialViz-Bench, a multi-modal benchmark with 12 tasks across 4 sub-abilities and 1,180 automatically generated problems, and systematically evaluates 33 state-of-the-art MLLMs. Result: Performance varies widely across models, demonstrating the benchmark's strong discriminative power. The evaluation also reveals counter-intuitive behaviour, such as a sharp drop in performance from 2D to 3D tasks and a tendency to fall back on formula derivation instead of actual spatial visualization. Conclusion: State-of-the-art MLLMs still show clear deficiencies in spatial visualization tasks; SpatialViz-Bench provides a reliable evaluation tool for this area and fills a research gap. Abstract: Humans can directly imagine and manipulate visual images in their minds, a capability known as spatial visualization. While multi-modal Large Language Models (MLLMs) support imagination-based reasoning, spatial visualization remains insufficiently evaluated, typically embedded within broader mathematical and logical assessments. Existing evaluations often rely on IQ tests or math competitions that may overlap with training data, compromising assessment reliability. To this end, we introduce SpatialViz-Bench, a comprehensive multi-modal benchmark for spatial visualization with 12 tasks across 4 sub-abilities, comprising 1,180 automatically generated problems. Our evaluation of 33 state-of-the-art MLLMs not only reveals wide performance variations and demonstrates the benchmark's strong discriminative power, but also uncovers counter-intuitive findings: models exhibit unexpected behaviors by showing difficulty perception that misaligns with human intuition, displaying dramatic 2D-to-3D performance cliffs, and defaulting to formula derivation despite spatial tasks requiring visualization alone. SpatialViz-Bench empirically demonstrates that state-of-the-art MLLMs continue to exhibit deficiencies in spatial visualization tasks, thereby addressing a significant lacuna in the field. The benchmark is publicly available.

[97] ViLU: Learning Vision-Language Uncertainties for Failure Prediction

Marc Lafon,Yannis Karmim,Julio Silva-Rodriguez,Paul Couairon,Clément Rambour,Raphaël Fournier-Sniehotta,Ismail Ben Ayed,Jose Dolz,Nicolas Thome

Main category: cs.CV

TL;DR: ViLU introduces a novel Vision-Language Uncertainty quantification framework that effectively predicts model failures and quantifies uncertainty without requiring direct access to the model, outperforming existing methods.

Details Motivation: Reliable Uncertainty Quantification (UQ) and failure prediction are challenging for Vision-Language Models (VLMs), especially in post-hoc settings where direct access to the model is unavailable. Method: ViLU utilizes visual and text embeddings to construct an uncertainty-aware multi-modal representation through cross-attention. It trains an uncertainty predictor as a binary classifier using a weighted binary cross-entropy loss. Result: ViLU demonstrates significant improvements over state-of-the-art failure prediction methods on various datasets, including ImageNet-1k, CC12M, and LAION-400M, with effective uncertainty quantification. Conclusion: ViLU is a new framework for uncertainty quantification in Vision-Language Models that effectively predicts failures and provides reliable uncertainty estimates without direct access to the model. Abstract: Reliable Uncertainty Quantification (UQ) and failure prediction remain open challenges for Vision-Language Models (VLMs). We introduce ViLU, a new Vision-Language Uncertainty quantification framework that contextualizes uncertainty estimates by leveraging all task-relevant textual representations. ViLU constructs an uncertainty-aware multi-modal representation by integrating the visual embedding, the predicted textual embedding, and an image-conditioned textual representation via cross-attention. Unlike traditional UQ methods based on loss prediction, ViLU trains an uncertainty predictor as a binary classifier to distinguish correct from incorrect predictions using a weighted binary cross-entropy loss, making it loss-agnostic. In particular, our proposed approach is well-suited for post-hoc settings, where only vision and text embeddings are available without direct access to the model itself. Extensive experiments on diverse datasets show the significant gains of our method compared to state-of-the-art failure prediction methods. 
We apply our method to standard classification datasets, such as ImageNet-1k, as well as large-scale image-caption datasets like CC12M and LAION-400M. Ablation studies highlight the critical role of our architecture and training in achieving effective uncertainty quantification. Our code is publicly available and can be found here: https://github.com/ykrmm/ViLU.
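The training objective named in the abstract, a weighted binary cross-entropy for a classifier that separates correct from incorrect predictions, can be written out in a few lines. This sketch assumes a single positive-class weight and the label convention 1 = "the VLM's prediction was wrong"; the weighting scheme ViLU actually uses is not given in this digest, and `weighted_bce` is a hypothetical name.

```python
import math


def weighted_bce(probs, targets, pos_weight=1.0, eps=1e-12):
    """Mean weighted binary cross-entropy.

    probs: predicted probability that the VLM's prediction is a failure.
    targets: 1 if the VLM's prediction was wrong, 0 if it was right (assumed convention).
    pos_weight: extra weight on the failure class, useful when failures are rare.
    """
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


# One failure and one success, both scored at 0.5; failures weighted twice as much.
print(weighted_bce([0.5, 0.5], [1, 0], pos_weight=2.0))
```

Because the targets come only from whether the prediction matched the ground truth, the objective does not depend on the loss the underlying VLM was trained with, which is the sense in which the abstract calls the approach loss-agnostic.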

[98] T-GVC: Trajectory-Guided Generative Video Coding at Ultra-Low Bitrates

Zhitao Wang,Hengyu Man,Wenrui Li,Xingtao Wang,Xiaopeng Fan,Debin Zhao

Main category: cs.CV

TL;DR: T-GVC introduces trajectory-guided motion modeling for generative video coding, improving reconstruction quality and motion accuracy in ultra-low bitrate settings.

Details Motivation: Existing video coding methods are limited by domain specificity or over-reliance on text guidance, leading to unrealistic reconstructions. This work aims to overcome these limitations in ultra-low bitrate scenarios. Method: T-GVC employs a semantic-aware sparse motion sampling pipeline and incorporates trajectory-aligned loss constraints into diffusion processes to guide motion patterns without additional training. Result: Experimental results show that T-GVC outperforms traditional codecs and state-of-the-art end-to-end video compression methods in ULB conditions while enabling more precise motion control than text-guided approaches. Conclusion: The proposed T-GVC framework introduces a novel direction in generative video coding by utilizing geometric motion modeling, offering improved performance in ULB conditions and precise motion control compared to existing methods. Abstract: Recent advances in video generation techniques have given rise to an emerging paradigm of generative video coding, aiming to achieve semantically accurate reconstructions in Ultra-Low Bitrate (ULB) scenarios by leveraging strong generative priors. However, most existing methods are limited by domain specificity (e.g., facial or human videos) or an excessive dependence on high-level text guidance, which often fails to capture motion details and results in unrealistic reconstructions. To address these challenges, we propose a Trajectory-Guided Generative Video Coding framework (dubbed T-GVC). T-GVC employs a semantic-aware sparse motion sampling pipeline to effectively bridge low-level motion tracking with high-level semantic understanding by extracting pixel-wise motion as sparse trajectory points based on their semantic importance, not only significantly reducing the bitrate but also preserving critical temporal semantic information. 
In addition, by incorporating trajectory-aligned loss constraints into diffusion processes, we introduce a training-free latent space guidance mechanism to ensure physically plausible motion patterns without sacrificing the inherent capabilities of generative models. Experimental results demonstrate that our framework outperforms both traditional codecs and state-of-the-art end-to-end video compression methods under ULB conditions. Furthermore, additional experiments confirm that our approach achieves more precise motion control than existing text-guided methods, paving the way for a novel direction of generative video coding guided by geometric motion modeling.

[99] Bridging the gap in FER: addressing age bias in deep learning

F. Xavier Gaya-Morey,Julia Sanchez-Perez,Cristina Manresa-Yee,Jose M. Buades-Rubio

Main category: cs.CV

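TL;DR: This study examines age-related bias in deep learning facial expression recognition models and proposes training strategies to mitigate it, with the effect validated in particular on elderly individuals.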

Details Motivation: Deep learning-based Facial Expression Recognition (FER) systems have performed well in recent years, but these models often show bias against particular demographic groups, especially the elderly, which affects their fairness and reliability. Method: The study applies three bias mitigation strategies (multi-task learning, multi-modal input, and age-weighted loss) and uses Explainable AI (XAI) techniques to analyse model attention patterns. Result: Recognition accuracy improves consistently for elderly individuals, particularly on the most error-prone expressions; saliency heatmap analysis shows that models trained with age-aware strategies attend more closely to the facial regions relevant to each age group. Conclusion: Age-related bias in deep FER models can be effectively mitigated with simple training modifications, and even approximate demographic labels help promote fairness in large-scale affective computing systems. Abstract: Facial Expression Recognition (FER) systems based on deep learning have achieved impressive performance in recent years. However, these models often exhibit demographic biases, particularly with respect to age, which can compromise their fairness and reliability. In this work, we present a comprehensive study of age-related bias in deep FER models, with a particular focus on the elderly population. We first investigate whether recognition performance varies across age groups, which expressions are most affected, and whether model attention differs depending on age. Using Explainable AI (XAI) techniques, we identify systematic disparities in expression recognition and attention patterns, especially for "neutral", "sadness", and "anger" in elderly individuals. Based on these findings, we propose and evaluate three bias mitigation strategies: Multi-task Learning, Multi-modal Input, and Age-weighted Loss. Our models are trained on a large-scale dataset, AffectNet, with automatically estimated age labels and validated on balanced benchmark datasets that include underrepresented age groups. Results show consistent improvements in recognition accuracy for elderly individuals, particularly for the most error-prone expressions. Saliency heatmap analysis reveals that models trained with age-aware strategies attend to more relevant facial regions for each age group, helping to explain the observed improvements. These findings suggest that age-related bias in FER can be effectively mitigated using simple training modifications, and that even approximate demographic labels can be valuable for promoting fairness in large-scale affective computing systems.
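Of the three mitigation strategies, the age-weighted loss is the easiest to make concrete. The sketch below assumes inverse-frequency weights per age group applied to a standard cross-entropy; the paper's actual weighting is not described in this digest, and the group names and both function names are hypothetical.

```python
import math
from collections import Counter


def inverse_frequency_weights(age_groups):
    """Weight each age group by n / (k * count), so sample weights average to 1."""
    counts = Counter(age_groups)
    n, k = len(age_groups), len(counts)
    return {group: n / (k * count) for group, count in counts.items()}


def age_weighted_cross_entropy(class_probs, labels, age_groups, weights):
    """Mean cross-entropy where each sample is scaled by its age-group weight."""
    total = 0.0
    for probs, label, group in zip(class_probs, labels, age_groups):
        total += -weights[group] * math.log(probs[label])
    return total / len(labels)


# Three adult samples and one elderly sample (estimated age labels, as in the abstract).
groups = ["adult", "adult", "adult", "elderly"]
weights = inverse_frequency_weights(groups)
print(weights)
```

With these weights the single elderly sample counts three times as much as each adult sample, so errors on the under-represented group contribute more to the gradient.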

[100] MolCLIP: A Molecular-Auxiliary CLIP Framework for Identifying Drug Mechanism of Action Based on Time-Lapsed Mitochondrial Images

Fengqian Pang,Chunyue Lei,Hongfei Zhao,Chenghao Liu,Zhiqiang Xing,Huafeng Wang,Chuyang Ye

Main category: cs.CV

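TL;DR: This paper proposes MolCLIP, a new visual language model that combines microscopic cell video with the molecule modality and substantially improves drug identification and mechanism-of-action recognition.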

Details Motivation: Existing deep learning models focus mainly on spatial features and overlook the temporal dynamics of live cells, whereas time-lapse imaging is better suited to observing how cells respond to drugs, and drug molecules can trigger cellular dynamic changes associated with specific mechanisms of action. Method: The authors propose MolCLIP, a molecule-auxiliary CLIP framework, and integrate a metric learning strategy to optimize the aggregation of video features. Result: On the MitoDataset, MolCLIP improves mAP by 51.2% for drug identification and by 20.5% for mechanism-of-action recognition. Conclusion: MolCLIP effectively combines microscopic cell video with the molecule modality, improving performance on drug identification and mechanism-of-action recognition. Abstract: Drug Mechanism of Action (MoA) mainly investigates how drug molecules interact with cells, which is crucial for drug discovery and clinical application. Recently, deep learning models have been used to recognize MoA by relying on high-content and fluorescence images of cells exposed to various drugs. However, these methods focus on spatial characteristics while overlooking the temporal dynamics of live cells. Time-lapse imaging is more suitable for observing the cell response to drugs. Additionally, drug molecules can trigger cellular dynamic variations related to specific MoA. This indicates that the drug molecule modality may complement the image counterpart. This paper proposes MolCLIP, the first visual language model to combine microscopic cell video- and molecule-modalities. MolCLIP designs a molecule-auxiliary CLIP framework to guide video features in learning the distribution of the molecular latent space. Furthermore, we integrate a metric learning strategy with MolCLIP to optimize the aggregation of video features. Experimental results on the MitoDataset demonstrate that MolCLIP achieves improvements of 51.2% and 20.5% in mAP for drug identification and MoA recognition, respectively.

[101] Attend-and-Refine: Interactive keypoint estimation and quantitative cervical vertebrae analysis for bone age assessment

Jinhee Kim,Taesung Kim,Taewoo Kim,Dong-Wook Kim,Byungduk Ahn,Yoon-Ji Kim,In-Seok Song,Jaegul Choo

Main category: cs.CV

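TL;DR: This study proposes ARNet, a model that combines user feedback with a morphology-aware loss function to analyze pediatric cervical vertebra radiographs efficiently and accurately, helping to optimize the timing of orthodontic treatment.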

Details Motivation: In pediatric orthodontics, accurately estimating growth potential is essential for developing effective treatment strategies. Method: The study comprehensively analyzes cervical vertebral maturation (CVM) features from lateral cephalometric radiographs and introduces ARNet, an interaction-guided deep learning model that streamlines keypoint annotation. Result: ARNet substantially reduces manual annotation effort, improves efficiency and accuracy, and is validated across multiple datasets. Conclusion: The study provides an effective AI-assisted diagnostic tool for assessing growth potential in pediatric orthodontics, which the authors present as a significant advance in the field. Abstract: In pediatric orthodontics, accurate estimation of growth potential is essential for developing effective treatment strategies. Our research aims to predict this potential by identifying the growth peak and analyzing cervical vertebra morphology solely through lateral cephalometric radiographs. We accomplish this by comprehensively analyzing cervical vertebral maturation (CVM) features from these radiographs. This methodology provides clinicians with a reliable and efficient tool to determine the optimal timings for orthodontic interventions, ultimately enhancing patient outcomes. A crucial aspect of this approach is the meticulous annotation of keypoints on the cervical vertebrae, a task often challenged by its labor-intensive nature. To mitigate this, we introduce Attend-and-Refine Network (ARNet), a user-interactive, deep learning-based model designed to streamline the annotation process. ARNet features Interaction-guided recalibration network, which adaptively recalibrates image features in response to user feedback, coupled with a morphology-aware loss function that preserves the structural consistency of keypoints. This novel approach substantially reduces manual effort in keypoint identification, thereby enhancing the efficiency and accuracy of the process. Extensively validated across various datasets, ARNet demonstrates remarkable performance and exhibits wide-ranging applicability in medical imaging. In conclusion, our research offers an effective AI-assisted diagnostic tool for assessing growth potential in pediatric orthodontics, marking a significant advancement in the field.

[102] Action Unit Enhance Dynamic Facial Expression Recognition

Feng Liu,Lingna Gu,Chen Shi,Xiaolan Fu

Main category: cs.CV

TL;DR: This paper introduces AU-DFER, a novel Dynamic Facial Expression Recognition architecture that integrates quantified Action Unit (AU)-expression knowledge and redesigned loss functions to improve recognition performance while addressing data imbalance issues.

Details Motivation: The motivation stems from the evolving nature of DFER research and the need to improve deep learning modeling by incorporating AU-expression knowledge, particularly addressing challenges such as data label imbalance in dynamic expression datasets. Method: The paper proposes an AU-enhanced Dynamic Facial Expression Recognition architecture (AU-DFER) that incorporates Action Unit (AU)-expression knowledge through a weight matrix and AU loss, integrating this with conventional deep learning. Additionally, strategies to tackle label imbalance are devised. Result: Experiments demonstrate that the proposed AU-DFER architecture outperforms state-of-the-art methods on mainstream DFER datasets without additional computational cost. It also shows improved performance through loss function redesign to address data label imbalance. Conclusion: This paper concludes that integrating quantified AU-expression knowledge and redesigning loss functions can significantly enhance the effectiveness of Dynamic Facial Expression Recognition (DFER), highlighting the importance of addressing data imbalance in this field. Abstract: Dynamic Facial Expression Recognition(DFER) is a rapidly evolving field of research that focuses on the recognition of time-series facial expressions. While previous research on DFER has concentrated on feature learning from a deep learning perspective, we put forward an AU-enhanced Dynamic Facial Expression Recognition architecture, namely AU-DFER, that incorporates AU-expression knowledge to enhance the effectiveness of deep learning modeling. In particular, the contribution of the Action Units(AUs) to different expressions is quantified, and a weight matrix is designed to incorporate a priori knowledge. Subsequently, the knowledge is integrated with the learning outcomes of a conventional deep learning network through the introduction of AU loss. 
The design is incorporated into the existing optimal model for dynamic expression recognition for the purpose of validation. Experiments are conducted on three recent mainstream open-source approaches to DFER on the principal datasets in this field. The results demonstrate that the proposed architecture outperforms the state-of-the-art(SOTA) methods without the need for additional arithmetic and generally produces improved results. Furthermore, we investigate the potential of AU loss function redesign to address data label imbalance issues in established dynamic expression datasets. To the best of our knowledge, this is the first attempt to integrate quantified AU-expression knowledge into various DFER models. We also devise strategies to tackle label imbalance, or minor class problems. Our findings suggest that employing a diverse strategy of loss function design can enhance the effectiveness of DFER. This underscores the criticality of addressing data imbalance challenges in mainstream datasets within this domain. The source code is available at https://github.com/Cross-Innovation-Lab/AU-DFER.

[103] Rationale-Enhanced Decoding for Multi-modal Chain-of-Thought

Shin'ya Yamaguchi,Kosuke Nishida,Daiki Chijiwa

Main category: cs.CV

TL;DR: This paper introduces RED, a decoding strategy that improves multi-modal reasoning in LVLMs by effectively utilizing both visual and rationale information.

Details Motivation: LVLMs often ignore generated rationales during CoT reasoning, which challenges the assumption that CoT improves grounding and accuracy. Method: RED harmonizes visual and rationale information by combining image-conditional and rationale-conditional next token distributions as a KL-constrained reward maximization solution. Result: Experiments show that RED consistently enhances reasoning performance across multiple benchmarks and LVLMs compared to standard CoT and other decoding methods. Conclusion: The proposed RED method effectively improves the faithfulness and accuracy of CoT reasoning in LVLMs, offering a practical approach for more reliable multi-modal systems. Abstract: Large vision-language models (LVLMs) have demonstrated remarkable capabilities by integrating pre-trained vision encoders with large language models (LLMs). Similar to single-modal LLMs, chain-of-thought (CoT) prompting has been adapted for LVLMs to enhance multi-modal reasoning by generating intermediate rationales based on visual and textual inputs. While CoT is assumed to improve grounding and accuracy in LVLMs, our experiments reveal a key challenge: existing LVLMs often ignore the contents of generated rationales in CoT reasoning. To address this, we re-formulate multi-modal CoT reasoning as a KL-constrained reward maximization focused on rationale-conditional log-likelihood. As the optimal solution, we propose rationale-enhanced decoding (RED), a novel plug-and-play inference-time decoding strategy. RED harmonizes visual and rationale information by multiplying distinct image-conditional and rationale-conditional next token distributions. Extensive experiments show that RED consistently and significantly improves reasoning over standard CoT and other decoding methods across multiple benchmarks and LVLMs. 
Our work offers a practical and effective approach to improve both the faithfulness and accuracy of CoT reasoning in LVLMs, paving the way for more reliable rationale-grounded multi-modal systems.
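
The abstract describes RED as multiplying an image-conditional and a rationale-conditional next-token distribution. A minimal sketch of that product rule follows, written in log space with plain Python. The function names, the toy logits, and the exponent `lam` on the rationale term are illustrative assumptions, not the authors' implementation; how each distribution is obtained from an LVLM is not shown.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def combine_next_token(image_logits, rationale_logits, lam=1.0):
    """Product of distributions: p(y) is proportional to p_img(y) * p_rat(y)**lam.

    `lam` is a hypothetical weighting knob added for illustration only.
    """
    log_p_img = log_softmax(image_logits)
    log_p_rat = log_softmax(rationale_logits)
    scores = [a + lam * b for a, b in zip(log_p_img, log_p_rat)]
    # Renormalize so the result is a proper distribution.
    return [math.exp(v) for v in log_softmax(scores)]

# Toy vocabulary of three tokens (made-up numbers).
image_logits = [2.0, 1.0, 0.0]      # image-conditional model prefers token 0
rationale_logits = [0.0, 3.0, 0.0]  # rationale-conditional model prefers token 1
combined = combine_next_token(image_logits, rationale_logits)
best_token = max(range(len(combined)), key=combined.__getitem__)
print(best_token, [round(p, 3) for p in combined])
```

In this toy case the image-conditional distribution alone would pick token 0, while the product shifts the choice to token 1 because the rationale-conditional term strongly supports it.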

[104] Tree-Mamba: A Tree-Aware Mamba for Underwater Monocular Depth Estimation

Peixian Zhuang,Yijian Wang,Zhenqi Fu,Hongliang Zhang,Sam Kwong,Chongyi Li

Main category: cs.CV

TL;DR: This paper introduces Tree-Mamba, a novel tree-aware Mamba method for underwater monocular depth estimation, along with a reliable dataset called BlueDepth, achieving superior performance over existing methods.

Details Motivation: Existing Mamba-based methods are ineffective for underwater monocular depth estimation due to inflexible state scanning strategies and unreliable depth labels in current datasets. Method: A tree-aware scanning strategy that adaptively constructs a minimum spanning tree based on feature similarity, with flexible aggregation of spatial topological features through bottom-up and top-down traversals. Result: Tree-Mamba outperforms leading methods in qualitative results and quantitative evaluations while maintaining competitive computational efficiency. A new benchmark, BlueDepth, is also introduced. Conclusion: The proposed Tree-Mamba method demonstrates superior performance in underwater monocular depth estimation, offering accurate depth maps and a new benchmark for the task. Abstract: Underwater Monocular Depth Estimation (UMDE) is a critical task that aims to estimate high-precision depth maps from underwater degraded images caused by light absorption and scattering effects in marine environments. Recently, Mamba-based methods have achieved promising performance across various vision tasks; however, they struggle with the UMDE task because their inflexible state scanning strategies fail to model the structural features of underwater images effectively. Meanwhile, existing UMDE datasets usually contain unreliable depth labels, leading to incorrect object-depth relationships between underwater images and their corresponding depth maps. To overcome these limitations, we develop a novel tree-aware Mamba method, dubbed Tree-Mamba, for estimating accurate monocular depth maps from underwater degraded images. Specifically, we propose a tree-aware scanning strategy that adaptively constructs a minimum spanning tree based on feature similarity. The spatial topological features among the tree nodes are then flexibly aggregated through bottom-up and top-down traversals, enabling stronger multi-scale feature representation capabilities. 
Moreover, we construct an underwater depth estimation benchmark (called BlueDepth), which consists of 38,162 underwater image pairs with reliable depth labels. This benchmark serves as a foundational dataset for training existing deep learning-based UMDE methods to learn accurate object-depth relationships. Extensive experiments demonstrate the superiority of the proposed Tree-Mamba over several leading methods in both qualitative results and quantitative evaluations with competitive computational efficiency. Code and dataset will be available at https://wyjgr.github.io/Tree-Mamba.html.
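
The tree-aware scanning strategy above builds a minimum spanning tree from feature similarity. The sketch below illustrates only that general idea, using Kruskal's algorithm on a 4-connected pixel grid with Euclidean feature distance as the edge weight. The grid connectivity, the distance measure, and the function name `grid_mst` are assumptions made for illustration; the paper's actual construction and its bottom-up and top-down traversals are not reproduced here.

```python
import math

def grid_mst(features):
    """Kruskal MST over a 4-connected H x W grid.

    features: H x W nested list of feature vectors (lists of floats).
    Edge weight is feature dissimilarity, so similar neighbors are linked first.
    Returns a list of (node_a, node_b, weight) with node index = row * W + col.
    """
    h, w = len(features), len(features[0])

    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    edges = []
    for r in range(h):
        for c in range(w):
            if c + 1 < w:  # right neighbor
                edges.append((dist(features[r][c], features[r][c + 1]),
                              r * w + c, r * w + c + 1))
            if r + 1 < h:  # bottom neighbor
                edges.append((dist(features[r][c], features[r + 1][c]),
                              r * w + c, (r + 1) * w + c))
    edges.sort()

    parent = list(range(h * w))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for weight, a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            tree.append((a, b, weight))
    return tree

# 2 x 2 toy feature map: the top row and bottom row are internally similar.
toy = [[[0.0], [0.1]],
       [[5.0], [5.1]]]
tree = grid_mst(toy)
pairs = {(a, b) for a, b, _ in tree}
print(sorted(pairs))
```

The resulting tree keeps the two cheap within-row edges and a single cross-row edge, so a traversal along the tree visits similar pixels together rather than in fixed raster order.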

[105] Motion-Aware Adaptive Pixel Pruning for Efficient Local Motion Deblurring

Wei Shang,Dongwei Ren,Wanying Zhang,Pengfei Zhu,Qinghua Hu,Wangmeng Zuo

Main category: cs.CV

TL;DR: This paper proposes an efficient and effective method for addressing local motion blur by combining a trainable mask predictor, structural reparameterization, and motion analysis, achieving superior results with reduced computation.

Details Motivation: Existing deblurring methods struggle with inefficient resource allocation and handling spatially varying blur patterns in local motion blur scenarios. Method: A trainable mask predictor identifies blurred regions, and structural reparameterization converts 3×3 convolutions into more efficient 1×1 convolutions. An intra-frame motion analyzer translates pixel displacements into motion trajectories for adaptive region-specific blur restoration. The model is trained end-to-end using reconstruction loss, reblur loss, and mask loss. Result: Extensive experiments show better performance than state-of-the-art methods with a reduction of 49% in FLOPs (e.g., compared to LMD-ViT). Conclusion: The proposed method achieves superior performance on both local and global blur datasets while significantly reducing computational costs compared to existing state-of-the-art models. Abstract: Local motion blur in digital images originates from the relative motion between dynamic objects and static imaging systems during exposure. Existing deblurring methods face significant challenges in addressing this problem due to their inefficient allocation of computational resources and inadequate handling of spatially varying blur patterns. To overcome these limitations, we first propose a trainable mask predictor that identifies blurred regions in the image. During training, we employ blur masks to exclude sharp regions. For inference optimization, we implement structural reparameterization by converting $3\times 3$ convolutions to computationally efficient $1\times 1$ convolutions, enabling pixel-level pruning of sharp areas to reduce computation. Second, we develop an intra-frame motion analyzer that translates relative pixel displacements into motion trajectories, establishing adaptive guidance for region-specific blur restoration. Our method is trained end-to-end using a combination of reconstruction loss, reblur loss, and mask loss guided by annotated blur masks. 
Extensive experiments demonstrate superior performance over state-of-the-art methods on both local and global blur datasets while reducing FLOPs by 49% compared to SOTA models (e.g., LMD-ViT). The source code is available at https://github.com/shangwei5/M2AENet.

[106] Scaling RL to Long Videos

Yukang Chen,Wei Huang,Baifeng Shi,Qinghao Hu,Hanrong Ye,Ligeng Zhu,Zhijian Liu,Pavlo Molchanov,Jan Kautz,Xiaojuan Qi,Sifei Liu,Hongxu Yin,Yao Lu,Song Han

Main category: cs.CV

TL;DR: This paper introduces LongVILA-R1, a framework for long video reasoning in VLMs using reinforcement learning, featuring a new dataset, training pipeline, and infrastructure that enable efficient processing of long videos while achieving competitive performance on reasoning tasks.

Details Motivation: The motivation stems from the need to address unique challenges in long video reasoning, such as handling extended temporal sequences and diverse domains like sports, games, and vlogs, which existing models struggle to manage effectively. Method: The paper proposes a full-stack framework incorporating a large-scale dataset (LongVideo-Reason), a two-stage training pipeline with chain-of-thought supervised fine-tuning and reinforcement learning, and a custom infrastructure (MR-SP) optimized for long video processing. Result: LongVILA-R1-7B achieves strong results on long video QA benchmarks, outperforms Video-R1-7B, and matches Gemini-1.5-Pro in multiple reasoning tasks. The MR-SP system enables up to 2.1x faster RL training and supports training on hour-long videos using a single A100 node. Conclusion: LongVILA-R1 represents significant progress in long video reasoning within vision-language models (VLMs), offering scalable performance improvements and an efficient training framework that supports reinforcement learning across multiple modalities. Abstract: We introduce a full-stack framework that scales up reasoning in vision-language models (VLMs) to long videos, leveraging reinforcement learning. We address the unique challenges of long video reasoning by integrating three critical components: (1) a large-scale dataset, LongVideo-Reason, comprising 52K long video QA pairs with high-quality reasoning annotations across diverse domains such as sports, games, and vlogs; (2) a two-stage training pipeline that extends VLMs with chain-of-thought supervised fine-tuning (CoT-SFT) and reinforcement learning (RL); and (3) a training infrastructure for long video RL, named Multi-modal Reinforcement Sequence Parallelism (MR-SP), which incorporates sequence parallelism and a vLLM-based engine tailored for long video, using cached video embeddings for efficient rollout and prefilling. 
In experiments, LongVILA-R1-7B achieves strong performance on long video QA benchmarks such as VideoMME. It also outperforms Video-R1-7B and even matches Gemini-1.5-Pro across temporal reasoning, goal and purpose reasoning, spatial reasoning, and plot reasoning on our LongVideo-Reason-eval benchmark. Notably, our MR-SP system achieves up to 2.1x speedup on long video RL training. LongVILA-R1 demonstrates consistent performance gains as the number of input video frames scales. LongVILA-R1 marks a firm step towards long video reasoning in VLMs. In addition, we release our training system for public availability that supports RL training on various modalities (video, text, and audio), various models (VILA and Qwen series), and even image and video generation models. On a single A100 node (8 GPUs), it supports RL training on hour-long videos (e.g., 3,600 frames / around 256k tokens).

[107] One Object, Multiple Lies: A Benchmark for Cross-task Adversarial Attack on Unified Vision-Language Models

Jiale Zhao,Xinyang Jiang,Junyao Gao,Yuhao Xue,Cairong Zhao

Main category: cs.CV

TL;DR: This paper introduces CrossVLAD, a new benchmark dataset, and CRAFT, an efficient attack framework, to evaluate and execute cross-task adversarial attacks on unified vision-language models (VLMs), showing improved performance over existing methods.

Details Motivation: The motivation stems from the unique security challenges posed by adversarial inputs in unified vision-language models (VLMs), which must remain effective across unpredictable task instructions within a shared architecture. This necessitates systematic evaluation and development of robust attack frameworks tailored for cross-task scenarios. Method: The paper introduces CrossVLAD, a benchmark dataset for evaluating cross-task adversarial attacks on unified VLMs, and proposes the CRAFT framework, an efficient region-centric attack approach with token-alignment. Experiments are conducted on Florence-2 and other popular VLMs to assess performance. Result: Extensive experiments demonstrate that the CRAFT method outperforms existing approaches in both overall cross-task attack performance and targeted object-change success rates on unified VLMs such as Florence-2. The newly proposed CrossVLAD dataset and success rate metric enable rigorous evaluation of adversarial transferability. Conclusion: The paper concludes that the proposed CRAFT method effectively adversarially influences unified vision-language models (VLMs) across diverse tasks, demonstrating superior performance in cross-task attack effectiveness and targeted object-change success rates compared to existing methods. Abstract: Unified vision-language models(VLMs) have recently shown remarkable progress, enabling a single model to flexibly address diverse tasks through different instructions within a shared computational architecture. This instruction-based control mechanism creates unique security challenges, as adversarial inputs must remain effective across multiple task instructions that may be unpredictably applied to process the same malicious content. In this paper, we introduce CrossVLAD, a new benchmark dataset carefully curated from MSCOCO with GPT-4-assisted annotations for systematically evaluating cross-task adversarial attacks on unified VLMs. 
CrossVLAD centers on the object-change objective (consistently manipulating a target object's classification across four downstream tasks) and proposes a novel success rate metric that measures simultaneous misclassification across all tasks, providing a rigorous evaluation of adversarial transferability. To tackle this challenge, we present CRAFT (Cross-task Region-based Attack Framework with Token-alignment), an efficient region-centric attack method. Extensive experiments on Florence-2 and other popular unified VLMs demonstrate that our method outperforms existing approaches in both overall cross-task attack performance and targeted object-change success rates, highlighting its effectiveness in adversarially influencing unified VLMs across diverse tasks.
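
The success rate metric described above counts an adversarial example as successful only when the target object is misclassified on every task at once. A minimal sketch of such a joint metric is given below; the task names and the per-sample outcomes are hypothetical placeholders, and the benchmark's actual per-task success criteria are not specified in the abstract.

```python
def joint_success_rate(results):
    """Fraction of samples whose attack succeeded on every task simultaneously.

    results: list of dicts mapping task name -> bool (attack succeeded on that task).
    """
    if not results:
        return 0.0
    hits = sum(1 for per_task in results if all(per_task.values()))
    return hits / len(results)

def per_task_success_rate(results, task):
    """Fraction of samples whose attack succeeded on one given task."""
    return sum(1 for r in results if r[task]) / len(results)

# Hypothetical outcomes for four samples on four placeholder tasks.
results = [
    {"task_a": True,  "task_b": True,  "task_c": True,  "task_d": True},
    {"task_a": True,  "task_b": False, "task_c": True,  "task_d": True},
    {"task_a": True,  "task_b": True,  "task_c": True,  "task_d": True},
    {"task_a": False, "task_b": True,  "task_c": True,  "task_d": True},
]
joint = joint_success_rate(results)
single = per_task_success_rate(results, "task_a")
print(joint, single)
```

The joint rate is never higher than any single-task rate, which is why it is a stricter measure of cross-task transferability than averaging per-task results.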

[108] Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology

Haochen Wang,Xiangtai Li,Zilong Huang,Anran Wang,Jiacong Wang,Tao Zhang,Jiani Zheng,Sule Bai,Zijian Kang,Jiashi Feng,Zhuochen Wang,Zhaoxiang Zhang

Main category: cs.CV

TL;DR: This paper introduces TreeBench, a new benchmark for evaluating visual grounded reasoning, and TreeVGR, a training approach that enhances reasoning and localization by leveraging traceable evidence, resulting in notable performance improvements.

Details Motivation: The motivation stems from the lack of holistic benchmarks for evaluating visual grounded reasoning capabilities in models like OpenAI-o3, which dynamically reference visual regions similar to human reasoning. Method: The researchers introduced TreeBench, a diagnostic benchmark based on three principles: focused visual perception, traceable evidence through bounding box evaluation, and second-order reasoning. They also proposed TreeVGR, a training method combining reinforcement learning to supervise localization and reasoning. Result: TreeBench consists of 405 challenging visual question-answering pairs where even advanced models struggle, with none reaching 60% accuracy. Using TreeVGR, significant improvements were observed on multiple benchmarks, including V* Bench (+16.8), MME-RealWorld (+12.6), and TreeBench (+13.4). Conclusion: The study concludes that traceable evidence is crucial for advancing vision-grounded reasoning, as demonstrated by the TreeVGR training paradigm, which significantly improves performance on various benchmarks. Abstract: Models like OpenAI-o3 pioneer visual grounded reasoning by dynamically referencing visual regions, just like human "thinking with images". However, no benchmark exists to evaluate these capabilities holistically. To bridge this gap, we propose TreeBench (Traceable Evidence Evaluation Benchmark), a diagnostic benchmark built on three principles: (1) focused visual perception of subtle targets in complex scenes, (2) traceable evidence via bounding box evaluation, and (3) second-order reasoning to test object interactions and spatial hierarchies beyond simple object localization. Prioritizing images with dense objects, we initially sample 1K high-quality images from SA-1B, and incorporate eight LMM experts to manually annotate questions, candidate options, and answers for each image. 
After three stages of quality control, TreeBench consists of 405 challenging visual question-answering pairs. Even the most advanced models struggle with this benchmark: none of them reaches 60% accuracy, e.g., OpenAI-o3 scores only 54.87. Furthermore, we introduce TreeVGR (Traceable Evidence Enhanced Visual Grounded Reasoning), a training paradigm to supervise localization and reasoning jointly with reinforcement learning, enabling accurate localizations and explainable reasoning pathways. Initialized from Qwen2.5-VL-7B, it improves V* Bench (+16.8), MME-RealWorld (+12.6), and TreeBench (+13.4), proving traceability is key to advancing vision-grounded reasoning. The code is available at https://github.com/Haochen-Wang409/TreeVGR.

[109] Understanding Dataset Bias in Medical Imaging: A Case Study on Chest X-rays

Ethan Dack,Chengliang Dai

Main category: cs.CV

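TL;DR: This paper examines whether dataset bias exists in popular open-source chest X-ray datasets, and aims to encourage more explainable research in medical imaging and the creation of more open-source datasets.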

Details Motivation: Because medical images are sensitive and therefore hard to release openly, a few open-source datasets have become extremely popular in research. Given the importance of AI in medical imaging, it is vital to establish whether modern methods take shortcuts or focus on the relevant pathology. Method: The paper trains several network architectures on the NIH, CheXpert, MIMIC-CXR, and PadChest datasets and applies simple dataset transformations to probe for dataset bias. Result: Prior work found an underlying bias in non-medical datasets, reaching high accuracy on the dataset-origin task; this paper applies the same task, together with dataset transformations, to explore whether similar bias exists in chest X-ray datasets. Conclusion: The paper concludes that applying the dataset-origin task and data transformations to different open-source chest X-ray datasets can reveal dataset bias. The authors hope the work will encourage more explainable research in medical imaging and the creation of more open-source datasets. Abstract: Recent work has revisited the infamous task "Name That Dataset" and established that in non-medical datasets, there is an underlying bias and achieved high accuracies on the dataset origin task. In this work, we revisit the same task applied to popular open-source chest X-ray datasets. Medical images are naturally more difficult to release for open-source due to their sensitive nature, which has led to certain open-source datasets being extremely popular for research purposes. By performing the same task, we wish to explore whether dataset bias also exists in these datasets. We apply simple transformations of the datasets to try to identify bias. Given the importance of AI applications in medical imaging, it's vital to establish whether modern methods are taking shortcuts or are focused on the relevant pathology. We implement a range of different network architectures on the datasets: NIH, CheXpert, MIMIC-CXR and PadChest. We hope this work will encourage more explainable research being performed in medical imaging and the creation of more open-source datasets in the medical domain. The corresponding code will be released upon acceptance.

[110] RAPS-3D: Efficient interactive segmentation for 3D radiological imaging

Théo Danielou,Daniel Tordjman,Pierre Manceron,Corentin Dancette

Main category: cs.CV

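TL;DR: This paper proposes a new method for 3D medical image segmentation that simplifies inference and improves efficiency, addressing the time and resource bottlenecks of existing methods.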

Details Motivation: Existing 2D models such as the Segment Anything Model (SAM) do not extend naturally to 3D medical imaging data, while conventional 3D approaches suffer from high inference complexity and heavy computational cost. Method: Inspired by SegVol, the authors design a promptable segmentation method for 3D medical images that avoids both the autoregressive strategies used when extending 2D models to 3D and sliding-window inference. Result: The new method maintains high performance while reducing inference time and implementation complexity, and applies to 3D medical imaging data such as CT or MRI. Conclusion: The paper presents a simplified 3D promptable segmentation method designed to reduce inference time and eliminate the prompt-management complexity associated with sliding windows, while achieving state-of-the-art performance. Abstract: Promptable segmentation, introduced by the Segment Anything Model (SAM), is a promising approach for medical imaging, as it enables clinicians to guide and refine model predictions interactively. However, SAM's architecture is designed for 2D images and does not extend naturally to 3D volumetric data such as CT or MRI scans. Adapting 2D models to 3D typically involves autoregressive strategies, where predictions are propagated slice by slice, resulting in increased inference complexity. Processing large 3D volumes also requires significant computational resources, often leading existing 3D methods to also adopt complex strategies like sliding-window inference to manage memory usage, at the cost of longer inference times and greater implementation complexity. In this paper, we present a simplified 3D promptable segmentation method, inspired by SegVol, designed to reduce inference time and eliminate prompt management complexities associated with sliding windows while achieving state-of-the-art performance.

[111] Energy-Guided Decoding for Object Hallucination Mitigation

Xixi Liu,Ailin Deng,Christopher Zach

Main category: cs.CV

TL;DR: This paper proposes an energy-based decoding method that significantly reduces object hallucination in vision-language models while improving performance across benchmarks.

Details Motivation: Mitigating object hallucination in LVLMs is crucial for their safe deployment. Existing methods have limitations such as being restricted to specific decoding techniques, requiring complex visual input modifications, or relying on external models. Method: An energy-based decoding method is proposed, which dynamically selects hidden states from the layer with the minimal energy score to reduce yes-ratio bias in VLMs. Result: The method achieves an average accuracy improvement of 4.82% compared to greedy decoding and a reduction of 8.81% in the yes-ratio gap across three VQA datasets for commonly used VLMs. Conclusion: The proposed energy-based decoding method effectively reduces object hallucination in LVLMs by mitigating the yes-ratio bias and enhancing performance across multiple benchmarks. Abstract: Mitigating object hallucination in large vision-language models (LVLMs) is critical to their safe deployment. Existing methods either are restricted to specific decoding methods, or demand sophisticated modifications to visual inputs, or rely on knowledge from external models. In this work, we first reveal the phenomenon that VLMs exhibit significant imbalance in the "Yes" ratio (i.e., the fraction of "Yes" answers among the total number of questions) across three different visual question answering (VQA) datasets. Furthermore, we propose an energy-based decoding method, which dynamically selects the hidden states from the layer with minimal energy score. It is simple yet effective in reducing the bias for the yes ratio while boosting performance across three benchmarks (POPE, MME, and MMVP). Our method consistently improves accuracy and F1 score on three VQA datasets across three commonly used VLMs over several baseline methods. The average accuracy improvement is 4.82% compared to greedy decoding. Moreover, the average yes-ratio gap reduction is 8.81%, meaning the proposed method is less biased as shown in Figure 1.
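
The method above picks the layer whose hidden states give the minimal energy score. The abstract does not define the score, so the sketch below assumes the common free-energy form, the negative log-sum-exp of the logits obtained from each layer, and simply selects the layer that minimizes it. Both the energy definition and the toy per-layer logits are assumptions for illustration, not the paper's formulation.

```python
import math

def energy(logits):
    """Assumed energy score: negative log-sum-exp of the logits (lower = more confident)."""
    m = max(logits)
    return -(m + math.log(sum(math.exp(v - m) for v in logits)))

def select_min_energy_layer(per_layer_logits):
    """Return (index of the layer with minimal energy, list of all energies)."""
    energies = [energy(layer) for layer in per_layer_logits]
    best = min(range(len(energies)), key=energies.__getitem__)
    return best, energies

# Hypothetical logits over a two-token vocabulary ("Yes", "No") from three layers.
per_layer_logits = [
    [0.1, 0.0],   # layer 0
    [2.0, -1.0],  # layer 1
    [0.5, 0.4],   # layer 2
]
best_layer, energies = select_min_energy_layer(per_layer_logits)
answer = max(range(2), key=per_layer_logits[best_layer].__getitem__)
print(best_layer, answer, [round(e, 3) for e in energies])
```

Here the final answer is decoded from the selected layer rather than always from the last one, which is the sense in which the layer choice is dynamic per input.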

[112] EEvAct: Early Event-Based Action Recognition with High-Rate Two-Stream Spiking Neural Networks

Michael Neumeier,Jules Lecomte,Nils Kazinski,Soubarna Banik,Bing Li,Axel von Arnim

Main category: cs.CV

TL;DR: This paper proposes a high-rate two-stream SNN for early recognition of human activities from event-based vision sensors, achieving improved accuracy and demonstrating real-world applicability in sports motion capture.

Details Motivation: Existing approaches to processing event-based vision sensor data often limit early prediction capabilities by accumulating events into low-rate frames or space-time voxels, while spiking neural networks (SNNs) have shown promise but lacked sufficient final accuracy. This work aims to bridge that gap. Method: The authors introduced a high-rate two-stream spiking neural network (SNN) and tested it within a novel early event-based recognition framework, benchmarking it with Top-1 and Top-5 recognition scores based on growing observation time. Result: The proposed high-rate two-stream SNN achieved 2% higher final accuracy compared to previous methods on the THU EACT-50 dataset and demonstrated effective early prediction capabilities. Conclusion: The paper concludes that the proposed high-rate two-stream SNN outperforms previous works in final accuracy for early recognition of human activities using event-based vision sensors. Abstract: Recognizing human activities early is crucial for the safety and responsiveness of human-robot and human-machine interfaces. Due to their high temporal resolution and low latency, event-based vision sensors are a perfect match for this early recognition demand. However, most existing processing approaches accumulate events to low-rate frames or space-time voxels which limits the early prediction capabilities. In contrast, spiking neural networks (SNNs) can process the events at a high-rate for early predictions, but most works still fall short on final accuracy. In this work, we introduce a high-rate two-stream SNN which closes this gap by outperforming previous work by 2% in final accuracy on the large-scale THU EACT-50 dataset. We benchmark the SNNs within a novel early event-based recognition framework by reporting Top-1 and Top-5 recognition scores for growing observation time. Finally, we exemplify the impact of these methods on a real-world task of early action triggering for human motion capture in sports.
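
The benchmark above reports Top-1 and Top-5 scores as the observation time grows. A minimal sketch of that evaluation loop is shown below, using a generic top-k accuracy function applied at each observation fraction. The class scores, labels, and observation fractions are made-up toy values, and k is limited to 2 only because the toy example has three classes.

```python
def top_k_accuracy(score_rows, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for scores, label in zip(score_rows, labels):
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)

labels = [0, 1]
# Hypothetical class scores after observing 25% and 100% of each recording.
scores_by_fraction = {
    0.25: [[0.2, 0.5, 0.3], [0.4, 0.35, 0.25]],
    1.0:  [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
}
curve = {
    frac: (top_k_accuracy(rows, labels, 1), top_k_accuracy(rows, labels, 2))
    for frac, rows in scores_by_fraction.items()
}
print(curve)
```

Plotting such a curve against observation time shows how early a model becomes reliable, which a single final-accuracy number does not capture.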

[113] Sparse-Dense Side-Tuner for efficient Video Temporal Grounding

David Pujol-Perich,Sergio Escalera,Albert Clapés

Main category: cs.CV

TL;DR: This paper proposes SDST, a novel anchor-free side-tuning architecture with deformable attention for video temporal grounding, outperforming existing methods with fewer parameters.

Details Motivation: Existing VTG methods rely on frozen backbones, limiting adaptability. Current ST approaches overlook the sparse nature of MR. Method: Sparse-Dense Side-Tuner (SDST) with Reference-based Deformable Self-Attention and integration of InternVideo2 backbone in an ST framework. Result: Highly competitive or SOTA results on QVHighlights, TACoS, and Charades-STA with up to 73% fewer parameters than existing SOTA methods. Conclusion: The proposed SDST method significantly improves existing ST methods for VTG, achieving SOTA results while reducing parameter count. Abstract: Video Temporal Grounding (VTG) involves Moment Retrieval (MR) and Highlight Detection (HD) based on textual queries. For this, most methods rely solely on final-layer features of frozen large pre-trained backbones, limiting their adaptability to new domains. While full fine-tuning is often impractical, parameter-efficient fine-tuning -- and particularly side-tuning (ST) -- has emerged as an effective alternative. However, prior ST work approaches this problem from a frame-level refinement perspective, overlooking the inherent sparse nature of MR. To address this, we propose the Sparse-Dense Side-Tuner (SDST), the first anchor-free ST architecture for VTG. We also introduce the Reference-based Deformable Self-Attention, a novel mechanism that enhances the context modeling of the deformable attention -- a key limitation of existing anchor-free methods. Additionally, we present the first effective integration of InternVideo2 backbone into an ST framework, showing its profound implications in performance. Overall, our method significantly improves existing ST methods, achieving highly competitive or SOTA results on QVHighlights, TACoS, and Charades-STA, while reducing the parameter count by up to 73% relative to existing SOTA methods. The code is publicly accessible at https://github.com/davidpujol/SDST.

[114] X-RAFT: Cross-Modal Non-Rigid Registration of Blue and White Light Neurosurgical Hyperspectral Images

Charlie Budd,Silvère Ségaud,Matthew Elliot,Graeme Stasiuk,Yijing Xie,Jonathan Shapey,Tom Vercauteren

Main category: cs.CV

TL;DR: X-RAFT improves the accuracy of cross-modal image correspondence in fluorescence-guided neurosurgery hyperspectral imaging, enhancing surgical decision-making.

Details Motivation: Quantitative fluorescence measurements in real-time can improve surgical decision making during neurosurgery, but require accurate cross-modal image correspondence between images taken under different lighting conditions. Method: X-RAFT uses distinct encoders for each modality pair and fine-tunes them using self-supervised flow-cycle-consistency on neurosurgical hyperspectral data. Result: The model achieves a 36.6% error reduction compared to a naive baseline and a 27.83% reduction compared to CrossRAFT, an existing cross-modal optical flow method. Conclusion: X-RAFT successfully improves cross-modal image correspondence in hyperspectral data, reducing error compared to baseline and existing methods. Abstract: Integration of hyperspectral imaging into fluorescence-guided neurosurgery has the potential to improve surgical decision making by providing quantitative fluorescence measurements in real-time. Quantitative fluorescence requires paired spectral data in fluorescence (blue light) and reflectance (white light) mode. Blue and white image acquisition needs to be performed sequentially in a potentially dynamic surgical environment. A key component to the fluorescence quantification process is therefore the ability to find dense cross-modal image correspondences between two hyperspectral images taken under these drastically different lighting conditions. We address this challenge with the introduction of X-RAFT, a Recurrent All-Pairs Field Transforms (RAFT) optical flow model modified for cross-modal inputs. We propose using distinct image encoders for each modality pair, and fine-tune these in a self-supervised manner using flow-cycle-consistency on our neurosurgical hyperspectral data. We show an error reduction of 36.6% across our evaluation metrics when comparing to a naive baseline and 27.83% reduction compared to an existing cross-modal optical flow method (CrossRAFT). 
Our code and models will be made publicly available after the review process.

[115] Deep Learning based 3D Volume Correlation for Additive Manufacturing Using High-Resolution Industrial X-ray Computed Tomography

Keerthana Chand,Tobias Fritsch,Bardia Hejazi,Konstantin Poka,Giovanni Bruno

Main category: cs.CV

TL;DR: This paper introduces a deep learning-based DVC method that improves the accuracy and efficiency of registration between CAD and XCT volumes for additive manufacturing quality control.

Details Motivation: Quality control in additive manufacturing is critical due to geometric inaccuracies affecting component performance. Current methods like DVC face challenges with accurate registration and computational demands. Method: The paper proposes a deep learning-based approach using a dynamic patch-based processing strategy to estimate voxel-wise deformations between CAD and XCT volumes. Evaluation metrics include the Dice Score and a Binary Difference Map (BDM). Result: The proposed method achieves a 9.2% improvement in Dice Score and a 9.9% improvement in voxel match rate compared to classic DVC methods, while significantly reducing computation time from days to minutes. Conclusion: The paper concludes that their deep learning-based DVC method improves registration accuracy and efficiency between CAD and XCT volumes, paving the way for more reliable and efficient AM production processes. Abstract: Quality control in additive manufacturing (AM) is vital for industrial applications in areas such as the automotive, medical and aerospace sectors. Geometric inaccuracies caused by shrinkage and deformations can compromise the life and performance of additively manufactured components. Such deviations can be quantified using Digital Volume Correlation (DVC), which compares the computer-aided design (CAD) model with the X-ray Computed Tomography (XCT) geometry of the components produced. However, accurate registration between the two modalities is challenging due to the absence of a ground truth or reference deformation field. In addition, the extremely large data size of high-resolution XCT volumes makes computation difficult. In this work, we present a deep learning-based approach for estimating voxel-wise deformations between CAD and XCT volumes. Our method uses a dynamic patch-based processing strategy to handle high-resolution volumes. 
In addition to the Dice Score, we introduce a Binary Difference Map (BDM) that quantifies voxel-wise mismatches between binarized CAD and XCT volumes to evaluate the accuracy of the registration. Our approach shows a 9.2\% improvement in the Dice Score and a 9.9\% improvement in the voxel match rate compared to classic DVC methods, while reducing the interaction time from days to minutes. This work sets the foundation for deep learning-based DVC methods to generate compensation meshes that can then be used in closed-loop correlations during the AM production process. Such a system would be of great interest to industries since the manufacturing process will become more reliable and efficient, saving time and material.
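The two evaluation measures named in this abstract can be illustrated with a minimal sketch. It assumes that the Binary Difference Map is a voxel-wise XOR of the two binarized volumes and that the voxel match rate is the fraction of agreeing voxels; the abstract does not spell out either definition, so both are assumptions, and the function names and toy volumes are hypothetical.

```python
def dice_score(a, b):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two flat binary voxel sequences."""
    intersection = sum(1 for x, y in zip(a, b) if x and y)
    total = sum(1 for x in a if x) + sum(1 for y in b if y)
    # Two empty volumes are treated as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def binary_difference_map(a, b):
    """Assumed definition: 1 where the binarized volumes disagree, else 0."""
    return [int(bool(x) != bool(y)) for x, y in zip(a, b)]


def voxel_match_rate(a, b):
    """Assumed definition: fraction of voxels on which both volumes agree."""
    diff = binary_difference_map(a, b)
    return 1.0 - sum(diff) / len(diff)


# Hypothetical flattened 8-voxel volumes (1 = material, 0 = empty).
cad = [1, 1, 1, 0, 0, 0, 1, 0]
xct = [1, 1, 0, 0, 0, 1, 1, 0]

print(dice_score(cad, xct))             # 0.75
print(binary_difference_map(cad, xct))  # [0, 0, 1, 0, 0, 1, 0, 0]
print(voxel_match_rate(cad, xct))       # 0.75
```

A real XCT volume would be a 3D array processed patch by patch; the flat lists here only keep the arithmetic visible.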

[116] SCOOTER: A Human Evaluation Framework for Unrestricted Adversarial Examples

Dren Fazlija,Monty-Maximilian Zühlke,Johanna Schrader,Arkadij Orlov,Clara Stein,Iyiola E. Olatunji,Daniel Kudenko

Main category: cs.CV

TL;DR: This paper introduces SCOOTER, a new framework for evaluating unrestricted adversarial attacks, showing that current methods fail to create imperceptible images and highlighting the mismatch between human and machine perception.

Details Motivation: Unrestricted adversarial attacks can bypass traditional defense mechanisms, and there is a lack of statistically significant evaluations of their imperceptibility. A unified framework was needed to assess and compare such attacks effectively. Method: The researchers introduced SCOOTER, an open-source framework with best-practice guidelines, conducted large-scale human evaluations, and used GPT-4o for preliminary testing. They also created an ImageNet-derived benchmark dataset. Result: SCOOTER provides tools and guidelines for evaluating unrestricted adversarial examples. Human studies showed that tested color-space and diffusion-based attacks failed to generate truly imperceptible images, and GPT-4o had limited success in detecting adversarial examples. Conclusion: The study concludes that unrestricted adversarial attacks pose a challenge to automated vision systems and highlights the necessity of the SCOOTER framework for evaluating these attacks, as human perception does not align with model predictions. Abstract: Unrestricted adversarial attacks aim to fool computer vision models without being constrained by $\ell_p$-norm bounds to remain imperceptible to humans, for example, by changing an object's color. This allows attackers to circumvent traditional, norm-bounded defense strategies such as adversarial training or certified defense strategies. However, due to their unrestricted nature, there are also no guarantees of norm-based imperceptibility, necessitating human evaluations to verify just how authentic these adversarial examples look. While some related work assesses this vital quality of adversarial attacks, none provide statistically significant insights. This issue necessitates a unified framework that supports and streamlines such an assessment for evaluating and comparing unrestricted attacks. To close this gap, we introduce SCOOTER - an open-source, statistically powered framework for evaluating unrestricted adversarial examples. 
Our contributions are: $(i)$ best-practice guidelines for crowd-study power, compensation, and Likert equivalence bounds to measure imperceptibility; $(ii)$ the first large-scale human vs. model comparison across 346 human participants showing that three color-space attacks and three diffusion-based attacks fail to produce imperceptible images. Furthermore, we found that GPT-4o can serve as a preliminary test for imperceptibility, but it only consistently detects adversarial examples for four out of six tested attacks; $(iii)$ open-source software tools, including a browser-based task template to collect annotations and analysis scripts in Python and R; $(iv)$ an ImageNet-derived benchmark dataset containing 3K real images, 7K adversarial examples, and over 34K human ratings. Our findings demonstrate that automated vision systems do not align with human perception, reinforcing the need for a ground-truth SCOOTER benchmark.

[117] Where are we with calibration under dataset shift in image classification?

Mélanie Roschewitz,Raghav Mehta,Fabio de Sousa Ribeiro,Ben Glocker

Main category: cs.CV

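TL;DR: This paper studies the state of image-classification calibration under real-world dataset shift, offers insights on choosing post-hoc and in-training calibration techniques, and provides practical guidelines for practitioners interested in robust calibration.

Details Motivation: Providing reliable probabilistic predictions requires understanding how to calibrate image classification models effectively under real-world dataset shift. Method: The study compares several post-hoc calibration methods and their interactions with common in-training calibration strategies (such as label smoothing) on eight classification tasks across several imaging domains. Result: Applying entropy regularisation and label smoothing together gives the best-calibrated raw probabilities under shift; post-hoc calibrators exposed to a small amount of semantic out-of-distribution data are the most robust; recent methods designed specifically for calibration under shift do not necessarily beat simpler post-hoc calibrators; and improving calibration under shift often comes at the cost of worse in-distribution calibration. These trends hold for both randomly initialised classifiers and classifiers finetuned from foundation models, with the latter being better calibrated. Conclusion: The paper offers empirical guidance for choosing calibration methods, highlights the value of combining foundation-model finetuning with ensembling, and cautions that calibration gains under shift can affect in-distribution performance. Abstract: We conduct an extensive study on the state of calibration under real-world dataset shift for image classification. Our work provides important insights on the choice of post-hoc and in-training calibration techniques, and yields practical guidelines for all practitioners interested in robust calibration under shift. We compare various post-hoc calibration methods, and their interactions with common in-training calibration strategies (e.g., label smoothing), across a wide range of natural shifts, on eight different classification tasks across several imaging domains. We find that: (i) simultaneously applying entropy regularisation and label smoothing yields the best calibrated raw probabilities under dataset shift, (ii) post-hoc calibrators exposed to a small amount of semantic out-of-distribution data (unrelated to the task) are most robust under shift, (iii) recent calibration methods specifically aimed at increasing calibration under shifts do not necessarily offer significant improvements over simpler post-hoc calibration methods, (iv) improving calibration under shifts often comes at the cost of worsening in-distribution calibration. Importantly, these findings hold for randomly initialised classifiers, as well as for those finetuned from foundation models, the latter being consistently better calibrated compared to models trained from scratch. 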
Finally, we conduct an in-depth analysis of ensembling effects, finding that (i) applying calibration prior to ensembling (instead of after) is more effective for calibration under shifts, (ii) for ensembles, OOD exposure deteriorates the ID-shifted calibration trade-off, (iii) ensembling remains one of the most effective methods to improve calibration robustness and, combined with finetuning from foundation models, yields best calibration results overall.
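The ordering question raised in finding (i) of the ensembling analysis (calibrate each member before ensembling, or calibrate the ensemble afterwards) can be sketched as follows. This is an illustration only: it assumes temperature scaling as the post-hoc calibrator and logit averaging for the "calibrate after" variant, neither of which the abstract specifies, and the logits and temperatures are hypothetical values rather than fitted ones.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; temperature > 1 softens the distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def calibrate_then_ensemble(member_logits, temperatures):
    """Calibrate every member with its own temperature, then average probabilities."""
    probs = [softmax(l, t) for l, t in zip(member_logits, temperatures)]
    n_classes = len(probs[0])
    return [sum(p[k] for p in probs) / len(probs) for k in range(n_classes)]


def ensemble_then_calibrate(member_logits, temperature):
    """Average member logits first, then apply one shared temperature (one possible variant)."""
    n_classes = len(member_logits[0])
    mean_logits = [
        sum(l[k] for l in member_logits) / len(member_logits) for k in range(n_classes)
    ]
    return softmax(mean_logits, temperature)


# Hypothetical logits from two ensemble members for one 3-class input.
member_logits = [[4.0, 1.0, 0.0], [3.0, 2.0, 0.0]]
before = calibrate_then_ensemble(member_logits, temperatures=[2.0, 1.5])
after = ensemble_then_calibrate(member_logits, temperature=2.0)
print(before, after)
```

In practice each temperature would be fitted on held-out validation data; the sketch only shows where the calibration step sits relative to the averaging step.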

[118] SURPRISE3D: A Dataset for Spatial Understanding and Reasoning in Complex 3D Scenes

Jiaxin Huang,Ziwen Li,Hanlve Zhang,Runnan Chen,Xiao He,Yandong Guo,Wenping Wang,Tongliang Liu,Mingming Gong

Main category: cs.CV

TL;DR: This paper introduces S\textsc{urprise}3D, a new 3D vision-language dataset, and its accompanying benchmark suite 3D-SRS, with the aim of advancing spatially aware AI.

Details Motivation: Spatial reasoning, a key capability, remains underexplored in current 3D vision-language research. Existing datasets often mix semantic cues with spatial context, so models rely on superficial shortcuts instead of genuinely understanding spatial relationships. Method: The paper introduces the S\textsc{urprise}3D dataset, which contains more than 200k vision-language pairs, including 89k+ human-annotated spatial queries written without object names to mitigate shortcut biases. Result: Initial benchmarks show that current state-of-the-art 3D visual grounding methods and 3D-LLMs face significant challenges, underscoring the need for the dataset and its accompanying benchmark. Conclusion: S\textsc{urprise}3D and the 3D-SRS benchmark suite aim to advance spatially aware AI, paving the way for effective embodied interaction and robotic planning. Abstract: The integration of language and 3D perception is critical for embodied AI and robotic systems to perceive, understand, and interact with the physical world. Spatial reasoning, a key capability for understanding spatial relationships between objects, remains underexplored in current 3D vision-language research. Existing datasets often mix semantic cues (e.g., object name) with spatial context, leading models to rely on superficial shortcuts rather than genuinely interpreting spatial relationships. To address this gap, we introduce S\textsc{urprise}3D, a novel dataset designed to evaluate language-guided spatial reasoning segmentation in complex 3D scenes. S\textsc{urprise}3D consists of more than 200k vision language pairs across 900+ detailed indoor scenes from ScanNet++ v2, including more than 2.8k unique object classes. The dataset contains 89k+ human-annotated spatial queries deliberately crafted without object name, thereby mitigating shortcut biases in spatial understanding. These queries comprehensively cover various spatial reasoning skills, such as relative position, narrative perspective, parametric perspective, and absolute distance reasoning. Initial benchmarks demonstrate significant challenges for current state-of-the-art expert 3D visual grounding methods and 3D-LLMs, underscoring the necessity of our dataset and the accompanying 3D Spatial Reasoning Segmentation (3D-SRS) benchmark suite. S\textsc{urprise}3D and 3D-SRS aim to facilitate advancements in spatially aware AI, paving the way for effective embodied interaction and robotic planning. 
The code and datasets can be found in https://github.com/liziwennba/SUPRISE.

[119] Robust and Generalizable Heart Rate Estimation via Deep Learning for Remote Photoplethysmography in Complex Scenarios

Kang Cen,Chang-Hong Fu,Hong Hong

Main category: cs.CV

TL;DR: This paper proposes a new end-to-end rPPG signal extraction method that achieves higher accuracy and robustness in complex scenarios by introducing a differential frame fusion module, a Temporal Shift Module with self-attention, and a dynamic hybrid loss function.

Details Motivation: Existing rPPG network models face challenges in accuracy, robustness, and generalization under complex scenarios, so a more effective solution is needed. Method: A 3D convolutional neural network reconstructs accurate rPPG signals, complemented by a differential frame fusion module, a Temporal Shift Module (TSM) combined with self-attention, and a dynamic hybrid loss function. Result: Experiments on the PURE, UBFC-rPPG, and MMPD datasets show that, when trained on PURE, the proposed method achieves a mean absolute error (MAE) of 7.58 on the MMPD test set, outperforming existing state-of-the-art models. Conclusion: The proposed end-to-end rPPG extraction network shows superior robustness and generalization in complex scenarios, in particular outperforming existing state-of-the-art models on the MMPD dataset. Abstract: Non-contact remote photoplethysmography (rPPG) technology enables heart rate measurement from facial videos. However, existing network models still face challenges in accuracy, robustness, and generalization capability under complex scenarios. This paper proposes an end-to-end rPPG extraction network that employs 3D convolutional neural networks to reconstruct accurate rPPG signals from raw facial videos. We introduce a differential frame fusion module that integrates differential frames with original frames, enabling frame-level representations to capture blood volume pulse (BVP) variations. Additionally, we incorporate Temporal Shift Module (TSM) with self-attention mechanisms, which effectively enhance rPPG features with minimal computational overhead. Furthermore, we propose a novel dynamic hybrid loss function that provides stronger supervision for the network, effectively mitigating overfitting. Comprehensive experiments were conducted on not only the PURE and UBFC-rPPG datasets but also the challenging MMPD dataset under complex scenarios, involving both intra-dataset and cross-dataset evaluations, which demonstrate the superior robustness and generalization capability of our network. Specifically, after training on PURE, our model achieved a mean absolute error (MAE) of 7.58 on the MMPD test set, outperforming the state-of-the-art models.

[120] Visual Instance-aware Prompt Tuning

Xi Xiao,Yunbei Zhang,Xingjian Li,Tianyang Wang,Xiao Wang,Yuxiang Wei,Jihun Hamm,Min Xu

Main category: cs.CV

TL;DR: ViaPT improves visual prompt tuning by combining instance-aware and dataset-level prompts using PCA, achieving better performance with fewer parameters than existing methods.

Details Motivation: Existing visual prompt tuning methods use fixed dataset-level prompts, which lead to sub-optimal performance due to high variance in downstream datasets. Method: ViaPT generates instance-aware prompts based on individual inputs, fuses them with dataset-level prompts, and leverages PCA to retain important prompting information, reducing learnable parameters while improving performance. Result: ViaPT consistently outperforms state-of-the-art baselines across 34 diverse datasets, establishing a new paradigm for optimizing visual prompts. Conclusion: ViaPT provides a more effective and parameter-efficient method for visual prompt tuning in vision transformers compared to existing methods like VPT-Deep and VPT-Shallow. Abstract: Visual Prompt Tuning (VPT) has emerged as a parameter-efficient fine-tuning paradigm for vision transformers, with conventional approaches utilizing dataset-level prompts that remain the same across all input instances. We observe that this strategy results in sub-optimal performance due to high variance in downstream datasets. To address this challenge, we propose Visual Instance-aware Prompt Tuning (ViaPT), which generates instance-aware prompts based on each individual input and fuses them with dataset-level prompts, leveraging Principal Component Analysis (PCA) to retain important prompting information. Moreover, we reveal that VPT-Deep and VPT-Shallow represent two corner cases based on a conceptual understanding, in which they fail to effectively capture instance-specific information, while random dimension reduction on prompts only yields performance between the two extremes. Instead, ViaPT overcomes these limitations by balancing dataset-level and instance-level knowledge, while reducing the amount of learnable parameters compared to VPT-Deep. 
Extensive experiments across 34 diverse datasets demonstrate that our method consistently outperforms state-of-the-art baselines, establishing a new paradigm for analyzing and optimizing visual prompts for vision transformers.

[121] Synergistic Prompting for Robust Visual Recognition with Missing Modalities

Zhihui Zhang,Luanyuan Dai,Qika Lin,Yunfeng Diao,Guangyin Jin,Yufei Guo,Jing Zhang,Xiaoshuai Hao

Main category: cs.CV

TL;DR: This paper proposes the Synergistic Prompting (SyP) framework to handle missing modalities in multi-modal visual recognition models, combining dynamic and static prompting for improved robustness and performance.

Details Motivation: Existing prompt-based strategies for handling missing modalities suffer from limitations such as static prompt inflexibility and unreliable performance under critical modality loss, necessitating a more adaptable and robust approach. Method: The SyP framework introduces two innovations: a Dynamic Adapter for generating adaptive prompts based on input conditions and a Synergistic Prompting Strategy that combines static and dynamic prompts to maintain reliable performance even when key modalities are missing. Result: The SyP framework demonstrates significant performance improvements over current methods across three well-known visual recognition datasets, showing strong adaptability and reliability under varying missing-data conditions. Conclusion: The proposed Synergistic Prompting (SyP) framework enhances robustness in visual recognition tasks when dealing with missing modalities, outperforming existing methods by dynamically adapting prompts and balancing modality information. Abstract: Large-scale multi-modal models have demonstrated remarkable performance across various visual recognition tasks by leveraging extensive paired multi-modal training data. However, in real-world applications, the presence of missing or incomplete modality inputs often leads to significant performance degradation. Recent research has focused on prompt-based strategies to tackle this issue; however, existing methods are hindered by two major limitations: (1) static prompts lack the flexibility to adapt to varying missing-data conditions, and (2) basic prompt-tuning methods struggle to ensure reliable performance when critical modalities are missing. To address these challenges, we propose a novel Synergistic Prompting (SyP) framework for robust visual recognition with missing modalities. 
The proposed SyP introduces two key innovations: (I) a Dynamic Adapter, which computes adaptive scaling factors to dynamically generate prompts, replacing static parameters for flexible multi-modal adaptation, and (II) a Synergistic Prompting Strategy, which combines static and dynamic prompts to balance information across modalities, ensuring robust reasoning even when key modalities are missing. The proposed SyP achieves significant performance improvements over existing approaches across three widely-used visual recognition datasets, demonstrating robustness under diverse missing rates and conditions. Extensive experiments and ablation studies validate its effectiveness in handling missing modalities, highlighting its superior adaptability and reliability.

[122] Patient-specific vs Multi-Patient Vision Transformer for Markerless Tumor Motion Forecasting

Gauthier Rotsart de Hertaing,Dani Manjah,Benoit Macq

Main category: cs.CV

TL;DR: The paper introduces a markerless forecasting approach for lung tumor motion using Vision Transformers (ViT), showing that while patient-specific models achieve higher precision, multi-patient models offer robust out-of-the-box performance suitable for time-constrained clinical settings.

Details Motivation: Accurate forecasting of lung tumor motion is essential for precise dose delivery in proton therapy. While current markerless methods mostly rely on deep learning, transformer-based architectures remain unexplored in this domain, despite their proven performance in trajectory forecasting. Method: Digitally reconstructed radiographs (DRRs) derived from planning 4DCT scans of 31 patients were used to train the MP model; a 32nd patient was held out for evaluation. PS models were trained using only the target patient's planning data. Both models used 16 DRRs per input and predicted tumor motion over a 1-second horizon. Performance was assessed using Average Displacement Error (ADE) and Final Displacement Error (FDE), on both planning (T1) and treatment (T2) data. Result: On T1 data, PS models outperformed MP models across all training set sizes, especially with larger datasets (up to 25,000 DRRs, p < 0.05). However, MP models demonstrated stronger robustness to inter-fractional anatomical variability and achieved comparable performance on T2 data without retraining. Conclusion: This is the first study to apply ViT architectures to markerless tumor motion forecasting. While PS models achieve higher precision, MP models offer robust out-of-the-box performance, well-suited for time-constrained clinical settings. Abstract: Background: Accurate forecasting of lung tumor motion is essential for precise dose delivery in proton therapy. While current markerless methods mostly rely on deep learning, transformer-based architectures remain unexplored in this domain, despite their proven performance in trajectory forecasting. Purpose: This work introduces a markerless forecasting approach for lung tumor motion using Vision Transformers (ViT). Two training strategies are evaluated under clinically realistic constraints: a patient-specific (PS) approach that learns individualized motion patterns, and a multi-patient (MP) model designed for generalization. 
The comparison explicitly accounts for the limited number of images that can be generated between planning and treatment sessions. Methods: Digitally reconstructed radiographs (DRRs) derived from planning 4DCT scans of 31 patients were used to train the MP model; a 32nd patient was held out for evaluation. PS models were trained using only the target patient's planning data. Both models used 16 DRRs per input and predicted tumor motion over a 1-second horizon. Performance was assessed using Average Displacement Error (ADE) and Final Displacement Error (FDE), on both planning (T1) and treatment (T2) data. Results: On T1 data, PS models outperformed MP models across all training set sizes, especially with larger datasets (up to 25,000 DRRs, p < 0.05). However, MP models demonstrated stronger robustness to inter-fractional anatomical variability and achieved comparable performance on T2 data without retraining. Conclusions: This is the first study to apply ViT architectures to markerless tumor motion forecasting. While PS models achieve higher precision, MP models offer robust out-of-the-box performance, well-suited for time-constrained clinical settings.
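The two reported metrics follow their standard trajectory-forecasting definitions: ADE averages the Euclidean error over every predicted time step, and FDE takes the error at the last step only. The sketch below assumes 2D tumor positions in millimetres; the coordinate convention, the function name, and the toy trajectories are hypothetical and not taken from the paper.

```python
import math


def displacement_errors(predicted, target):
    """Return (ADE, FDE) for two equal-length trajectories of coordinate tuples."""
    distances = [math.dist(p, t) for p, t in zip(predicted, target)]
    ade = sum(distances) / len(distances)  # mean error over the whole horizon
    fde = distances[-1]                    # error at the final forecast step
    return ade, fde


# Hypothetical 3-step forecast versus the true tumor positions (mm).
predicted = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
target = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]

ade, fde = displacement_errors(predicted, target)
print(ade, fde)  # 1.0 2.0
```

FDE exceeds ADE here because the error grows along the horizon, which is the typical pattern when a forecast drifts over time.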

[123] Benchmarking Content-Based Puzzle Solvers on Corrupted Jigsaw Puzzles

Richard Dirauf,Florian Wolz,Dario Zanca,Björn Eskofier

Main category: cs.CV

TL;DR: This study examines the robustness of content-based puzzle solvers to fragmentation and corruption, and proposes fine-tuning with augmented data to improve performance.

Details Motivation: Real-world applications of content-based puzzle solvers involve challenges such as fragmented artefacts or shredded documents, but existing evaluations often lack such realistic conditions. Method: The paper introduces three types of puzzle corruption (missing pieces, eroded edges, and eroded contents) and evaluates how well heuristic and deep learning-based solvers handle them. Result: The advanced Positional Diffusion model outperforms its competitors in most experiments, and deep learning models can significantly improve their robustness through fine-tuning. Conclusion: Content-based solvers developed for standard puzzles degrade rapidly as more pieces are corrupted, but deep learning models can significantly improve their robustness through fine-tuning with augmented data. Abstract: Content-based puzzle solvers have been extensively studied, demonstrating significant progress in computational techniques. However, their evaluation often lacks realistic challenges crucial for real-world applications, such as the reassembly of fragmented artefacts or shredded documents. In this work, we investigate the robustness of State-Of-The-Art content-based puzzle solvers introducing three types of jigsaw puzzle corruptions: missing pieces, eroded edges, and eroded contents. Evaluating both heuristic and deep learning-based solvers, we analyse their ability to handle these corruptions and identify key limitations. Our results show that solvers developed for standard puzzles have a rapid decline in performance if more pieces are corrupted. However, deep learning models can significantly improve their robustness through fine-tuning with augmented data. Notably, the advanced Positional Diffusion model adapts particularly well, outperforming its competitors in most experiments. Based on our findings, we highlight promising research directions for enhancing the automated reconstruction of real-world artefacts.

[124] Rethinking Query-based Transformer for Continual Image Segmentation

Yuchen Zhu,Cheng Shi,Dingyou Wang,Jiajin Tang,Zhengxuan Wei,Yu Wu,Guanbin Li,Sibei Yang

Main category: cs.CV

TL;DR: SimCIS improves continual image segmentation by preserving objectness and promoting plasticity through feature selection and replay mechanisms.

Details Motivation: Current CIS methods decouple mask generation from continual learning, leading to plasticity loss and reliance on input data order. This study aims to address these limitations. Method: SimCIS selects image features directly for query assignment to ensure perfect alignment and preserve objectness while allowing queries to adapt to new classes. It also introduces cross-stage consistency and a visual query-based replay mechanism. Result: SimCIS consistently outperforms state-of-the-art methods across various segmentation tasks, settings, splits, and input data orders. Conclusion: SimCIS effectively addresses the issues of plasticity loss and dependency on input data order in existing decoupled frameworks for continual image segmentation, outperforming state-of-the-art methods. Abstract: Class-incremental/Continual image segmentation (CIS) aims to train an image segmenter in stages, where the set of available categories differs at each stage. To leverage the built-in objectness of query-based transformers, which mitigates catastrophic forgetting of mask proposals, current methods often decouple mask generation from the continual learning process. This study, however, identifies two key issues with decoupled frameworks: loss of plasticity and heavy reliance on input data order. To address these, we conduct an in-depth investigation of the built-in objectness and find that highly aggregated image features provide a shortcut for queries to generate masks through simple feature alignment. Based on this, we propose SimCIS, a simple yet powerful baseline for CIS. Its core idea is to directly select image features for query assignment, ensuring "perfect alignment" to preserve objectness, while simultaneously allowing queries to select new classes to promote plasticity. To further combat catastrophic forgetting of categories, we introduce cross-stage consistency in selection and an innovative "visual query"-based replay mechanism. 
Experiments demonstrate that SimCIS consistently outperforms state-of-the-art methods across various segmentation tasks, settings, splits, and input data orders. All models and codes will be made publicly available at https://github.com/SooLab/SimCIS.

[125] 3D-ADAM: A Dataset for 3D Anomaly Detection in Advanced Manufacturing

Paul McHard,Florent P. Audonnet,Oliver Summerell,Sebastian Andraos,Paul Henderson,Gerardo Aragon-Camarasa

Main category: cs.CV

TL;DR: This paper presents 3D-ADAM, a large-scale, industry-relevant dataset for high-precision 3D anomaly detection, designed to address the shortcomings of existing datasets in real-world manufacturing settings.

Details Motivation: Existing automated defect detection methods perform well on current datasets but still fall short in real manufacturing settings, and large-scale, high-quality RGB+3D industrial anomaly detection datasets that represent real-world scenarios are lacking. Method: The paper introduces 3D-ADAM, a large-scale, industry-relevant dataset for high-precision 3D anomaly detection captured in real settings, and evaluates state-of-the-art models on it. Result: 3D-ADAM contains 14,120 high-resolution scans, 27,346 annotated defect instances, and 8,110 annotations of machine element features. It was captured in a real industrial environment, and an expert labelling survey validated its industrial relevance and quality. Conclusion: 3D-ADAM is a challenging new dataset intended to drive the development of robust 3D anomaly detection models that meet the demands of modern manufacturing. Abstract: Surface defects are one of the largest contributors to low yield in the manufacturing sector. Accurate and reliable detection of defects during the manufacturing process is therefore of great value across the sector. State-of-the-art approaches to automated defect detection yield impressive performance on current datasets, yet still fall short in real-world manufacturing settings and developing improved methods relies on large datasets representative of real-world scenarios. Unfortunately, high-quality, high-precision RGB+3D industrial anomaly detection datasets are scarce, and typically do not reflect real-world industrial deployment scenarios. To address this, we introduce 3D-ADAM, the first large-scale industry-relevant dataset for high-precision 3D Anomaly Detection. 3D-ADAM comprises 14,120 high-resolution scans across 217 unique parts, captured using 4 industrial depth imaging sensors. It includes 27,346 annotated defect instances from 12 categories, covering the breadth of industrial surface defects. 3D-ADAM uniquely captures an additional 8,110 annotations of machine element features, spanning the range of relevant mechanical design form factors. Unlike existing datasets, 3D-ADAM is captured in a real industrial environment with variations in part position and orientation, camera positioning, ambient lighting conditions, as well as partial occlusions. Our evaluation of SOTA models across various RGB+3D anomaly detection tasks demonstrates the significant challenge this dataset presents to current approaches. 
We further validated the industrial relevance and quality of the dataset through an expert labelling survey conducted by industry partners. By providing this challenging benchmark, 3D-ADAM aims to accelerate the development of robust 3D Anomaly Detection models capable of meeting the demands of modern manufacturing environments.

[126] THUNDER: Tile-level Histopathology image UNDERstanding benchmark

Pierre Marza,Leo Fillioux,Sofiène Boutaj,Kunal Mahatha,Christian Desrosiers,Pablo Piantanida,Jose Dolz,Stergios Christodoulidis,Maria Vakalopoulou

Main category: cs.CV

TL;DR: This paper introduces THUNDER, a dynamic and comprehensive benchmarking tool for comparing digital pathology foundation models at the tile level, focusing on performance, robustness, and uncertainty across multiple datasets and tasks.

Details Motivation: The motivation is to address the challenge of assessing progress in digital pathology research due to the rapid development of multiple foundation models, ensuring reliable evaluation in critical healthcare applications. Method: The authors developed THUNDER, a tile-level benchmarking framework for digital pathology foundation models, allowing comparison of models on various downstream tasks while considering feature spaces, robustness, and uncertainty. Result: THUNDER supports fast and flexible comparison of 23 state-of-the-art models across 16 diverse datasets, providing insights into performance, feature representation, robustness, and uncertainty. Conclusion: The paper concludes that THUNDER provides an efficient and comprehensive benchmark for evaluating foundation models in digital pathology, enabling robust comparison across multiple tasks and datasets. Abstract: Progress in a research field can be hard to assess, in particular when many concurrent methods are proposed in a short period of time. This is the case in digital pathology, where many foundation models have been released recently to serve as feature extractors for tile-level images, being used in a variety of downstream tasks, both for tile- and slide-level problems. Benchmarking available methods then becomes paramount to get a clearer view of the research landscape. In particular, in critical domains such as healthcare, a benchmark should not only focus on evaluating downstream performance, but also provide insights about the main differences between methods, and importantly, further consider uncertainty and robustness to ensure a reliable usage of proposed models. 
For these reasons, we introduce THUNDER, a tile-level benchmark for digital pathology foundation models, allowing for efficient comparison of many models on diverse datasets with a series of downstream tasks, studying their feature spaces and assessing the robustness and uncertainty of predictions informed by their embeddings. THUNDER is a fast, easy-to-use, dynamic benchmark that can already support a large variety of state-of-the-art foundation, as well as local user-defined models for direct tile-based comparison. In this paper, we provide a comprehensive comparison of 23 foundation models on 16 different datasets covering diverse tasks, feature analysis, and robustness. The code for THUNDER is publicly available at https://github.com/MICS-Lab/thunder.

[127] Single-Step Latent Diffusion for Underwater Image Restoration

Jiayi Wu,Tianfu Wang,Md Abu Bakr Siddique,Md Jahidul Islam,Cornelia Fermuller,Yiannis Aloimonos,Christopher A. Metzler

Main category: cs.CV

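TL;DR: This paper proposes SLURPP, a new network architecture that combines pretrained latent diffusion models with explicit scene decomposition for efficient and accurate underwater image restoration.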

Details Motivation: Existing pixel-domain diffusion-based underwater image restoration methods are computationally intensive and generate unrealistic artifacts on scenes with complex geometry and significant depth variation, so a more efficient and accurate method is needed. Method: The authors design a physics-based underwater image synthesis pipeline and use the SLURPP network architecture, which combines pretrained latent diffusion models with explicit scene decomposition, to restore underwater images. Result: SLURPP achieves state-of-the-art performance on both synthetic and real-world benchmarks; it is over 200X faster than existing diffusion-based methods and improves PSNR by about 3 dB on synthetic benchmarks. Conclusion: SLURPP is a new network architecture that combines pretrained latent diffusion models with explicit scene decomposition to overcome the limitations of existing underwater image restoration methods. Abstract: Underwater image restoration algorithms seek to restore the color, contrast, and appearance of a scene that is imaged underwater. They are a critical tool in applications ranging from marine ecology and aquaculture to underwater construction and archaeology. While existing pixel-domain diffusion-based image restoration approaches are effective at restoring simple scenes with limited depth variation, they are computationally intensive and often generate unrealistic artifacts when applied to scenes with complex geometry and significant depth variation. In this work we overcome these limitations by combining a novel network architecture (SLURPP) with an accurate synthetic data generation pipeline. SLURPP combines pretrained latent diffusion models -- which encode strong priors on the geometry and depth of scenes -- with an explicit scene decomposition -- which allows one to model and account for the effects of light attenuation and backscattering. To train SLURPP we design a physics-based underwater image synthesis pipeline that applies varied and realistic underwater degradation effects to existing terrestrial image datasets. This approach enables the generation of diverse training data with dense medium/degradation annotations. We evaluate our method extensively on both synthetic and real-world benchmarks and demonstrate state-of-the-art performance. Notably, SLURPP is over 200X faster than existing diffusion-based methods while offering ~ 3 dB improvement in PSNR on synthetic benchmarks. It also offers compelling qualitative improvements on real-world data. Project website https://tianfwang.github.io/slurpp/.
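
The abstract refers to light attenuation and backscattering as the two degradation effects the synthesis pipeline models. As a rough illustration only, the sketch below applies a widely used simplified underwater image formation model to a single pixel: the clean color is attenuated with depth and blended with a veiling backscatter color. This is not the authors' pipeline; the function name `degrade_pixel` and the default `beta` and `backlight` values are hypothetical choices made for the example.

```python
import math

def degrade_pixel(clean_rgb, depth_m,
                  beta=(0.40, 0.10, 0.05),        # hypothetical per-channel attenuation (1/m)
                  backlight=(0.05, 0.35, 0.45)):  # hypothetical veiling (backscatter) color
    """Simplified formation model: observed = clean * t + backlight * (1 - t),
    where t = exp(-beta * depth) is the per-channel transmission."""
    observed = []
    for clean, b, veil in zip(clean_rgb, beta, backlight):
        t = math.exp(-b * depth_m)            # fraction of scene light that survives
        observed.append(clean * t + veil * (1.0 - t))
    return tuple(observed)

if __name__ == "__main__":
    white = (1.0, 1.0, 1.0)
    for depth in (0.0, 2.0, 5.0, 20.0):
        print(depth, degrade_pixel(white, depth))
```

With these assumed coefficients the red channel fades fastest, which mimics the blue-green cast of underwater photos; restoration amounts to inverting this mapping when depth and medium parameters are unknown.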

[128] MIRA: A Novel Framework for Fusing Modalities in Medical RAG

Jinhong Wang,Tajamul Ashraf,Zongyan Han,Jorma Laaksonen,Rao Mohammad Anwer

Main category: cs.CV

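TL;DR: This work designs the MIRA framework to optimize the factual accuracy of multimodal large language models in the medical domain, addressing the retrieval-control and over-reliance problems of conventional RAG methods and achieving SOTA results.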

Details Motivation: MLLMs generate factually inconsistent responses in medical diagnosis, and conventional RAG methods suffer from either insufficient or excessive retrieval. Method: The authors propose the MIRA framework, which comprises a Rethinking and Rearrangement module that dynamically adjusts the number of retrieved contexts and a medical RAG framework that integrates image embeddings with a medical knowledge base. Result: MIRA's effectiveness is validated on public medical VQA and report generation benchmarks, where it substantially improves factual accuracy. Conclusion: MIRA effectively improves the factual accuracy and overall performance of multimodal large language models in the medical domain, achieving new SOTA results. Abstract: Multimodal Large Language Models (MLLMs) have significantly advanced AI-assisted medical diagnosis, but they often generate factually inconsistent responses that deviate from established medical knowledge. Retrieval-Augmented Generation (RAG) enhances factual accuracy by integrating external sources, but it presents two key challenges. First, insufficient retrieval can miss critical information, whereas excessive retrieval can introduce irrelevant or misleading content, disrupting model output. Second, even when the model initially provides correct answers, over-reliance on retrieved data can lead to factual errors. To address these issues, we introduce the Multimodal Intelligent Retrieval and Augmentation (MIRA) framework, designed to optimize factual accuracy in MLLMs. MIRA consists of two key components: (1) a calibrated Rethinking and Rearrangement module that dynamically adjusts the number of retrieved contexts to manage factual risk, and (2) a medical RAG framework integrating image embeddings and a medical knowledge base with a query-rewrite module for efficient multimodal reasoning. This enables the model to effectively integrate both its inherent knowledge and external references. Our evaluation on publicly available medical VQA and report generation benchmarks demonstrates that MIRA substantially enhances factual accuracy and overall performance, achieving new state-of-the-art results. Code is released at https://github.com/mbzuai-oryx/MIRA.

[129] Hardware-Aware Feature Extraction Quantisation for Real-Time Visual Odometry on FPGA Platforms

Mateusz Wasala,Mateusz Smolarczyk,Michal Danilowicz,Tomasz Kryjak

Main category: cs.CV

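TL;DR: This paper designs an efficient feature point detection model for embedded systems and, through model quantisation and hardware optimisation, achieves high-performance visual odometry on an FPGA platform.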

Details Motivation: Real-time, accurate position estimation on autonomous platforms (such as ground vehicles and drones) requires efficient feature point extraction that lowers computational demands while maintaining high accuracy. Method: The authors use a quantised SuperPoint convolutional neural network, perform model quantisation and hardware-aware optimisation with the Brevitas library and the FINN framework, and deploy the result on an AMD/Xilinx Zynq UltraScale+ FPGA SoC platform. Result: The solution processes 640 x 480 pixel images at up to 54 frames per second, outperforming the state of the art in the field, and experiments on the TUM dataset show how different quantisation techniques affect model accuracy. Conclusion: The paper proposes an efficient FPGA-based method for feature point detection and description that delivers high-performance visual odometry on resource-constrained embedded platforms. Abstract: Accurate position estimation is essential for modern navigation systems deployed in autonomous platforms, including ground vehicles, marine vessels, and aerial drones. In this context, Visual Simultaneous Localisation and Mapping (VSLAM) - which includes Visual Odometry - relies heavily on the reliable extraction of salient feature points from the visual input data. In this work, we propose an embedded implementation of an unsupervised architecture capable of detecting and describing feature points. It is based on a quantised SuperPoint convolutional neural network. Our objective is to minimise the computational demands of the model while preserving high detection quality, thus facilitating efficient deployment on platforms with limited resources, such as mobile or embedded systems. We implemented the solution on an FPGA System-on-Chip (SoC) platform, specifically the AMD/Xilinx Zynq UltraScale+, where we evaluated the performance of Deep Learning Processing Units (DPUs) and we also used the Brevitas library and the FINN framework to perform model quantisation and hardware-aware optimisation. This allowed us to process 640 x 480 pixel images at up to 54 fps on an FPGA platform, outperforming state-of-the-art solutions in the field. We conducted experiments on the TUM dataset to demonstrate and discuss the impact of different quantisation techniques on the accuracy and performance of the model in a visual odometry task.
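
The abstract does not spell out which quantisation schemes were compared, so the following is only a generic sketch of the basic idea behind weight quantisation: mapping floating-point values to a small set of signed integer codes plus one scale factor. It uses symmetric uniform quantisation as an assumed example and does not reproduce Brevitas or FINN behaviour; the helper names `quantize_symmetric` and `dequantize` are hypothetical.

```python
def quantize_symmetric(values, bits=8):
    """Map floats to signed integer codes in [-qmax, qmax] with a shared scale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs else 1.0   # avoid division by zero for all-zero input
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate floats from integer codes."""
    return [c * scale for c in codes]

if __name__ == "__main__":
    weights = [-1.0, 0.0, 0.25, 1.0]
    for bits in (8, 4, 2):
        codes, scale = quantize_symmetric(weights, bits)
        print(bits, codes, dequantize(codes, scale))
```

Lower bit widths shrink storage and arithmetic cost on the FPGA but increase the rounding error per weight, which is the accuracy/performance trade-off the experiments examine.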

[130] Not Only Consistency: Enhance Test-Time Adaptation with Spatio-temporal Inconsistency for Remote Physiological Measurement

Xiao Yang,Yuxuan Fan,Can Liu,Houcheng Su,Weichen Guo,Jiyao Wang,Dengbo He

Main category: cs.CV

TL;DR: This paper introduces a novel Test-Time Adaptation (TTA) approach called the Consistency-inConsistency-integration (CiCi) framework for remote photoplethysmography (rPPG), which enhances model adaptability during inference by leveraging physiological signal characteristics and incorporating a gradient dynamic control mechanism, resulting in state-of-the-art performance in real-time, privacy-preserving adaptation.

Details Motivation: Existing domain adaptation and generalization methods for deep-based remote photoplethysmography (rPPG) models face limitations in real-world deployment due to privacy concerns and the need for real-time adaptation. This work proposes a novel Test-Time Adaptation (TTA) strategy tailored for rPPG tasks to address these challenges. Method: An expert knowledge-based self-supervised Consistency-inConsistency-integration (CiCi) framework is introduced, leveraging both consistency in the frequency domain and inconsistency in the time domain of rPPG signals. Additionally, a gradient dynamic control mechanism is incorporated to resolve conflicts between priors. Result: Extensive experiments on five diverse datasets under the TTA protocol demonstrate that the proposed method consistently outperforms existing techniques, achieving superior performance in real-time self-supervised adaptation for rPPG tasks. Conclusion: The proposed CiCi framework, combined with a gradient dynamic control mechanism, achieves state-of-the-art performance in real-time self-supervised adaptation for rPPG tasks without accessing source data, making it promising for real-world deployment. Abstract: Remote photoplethysmography (rPPG) has emerged as a promising non-invasive method for monitoring physiological signals using the camera. Although various domain adaptation and generalization methods were proposed to promote the adaptability of deep-based rPPG models in unseen deployment environments, considerations in aspects like privacy concerns and real-time adaptation restrict their application in real-world deployment. Thus, we aim to propose a novel fully Test-Time Adaptation (TTA) strategy tailored for rPPG tasks in this work. Specifically, based on prior knowledge in physiology and our observations, we noticed not only there is spatio-temporal consistency in the frequency domain of rPPG signals, but also that inconsistency in the time domain was significant. 
Given this, by leveraging both consistency and inconsistency priors, we introduce an innovative expert knowledge-based self-supervised Consistency-inConsistency-integration (CiCi) framework to enhance model adaptation during inference. Besides, our approach further incorporates a gradient dynamic control mechanism to mitigate potential conflicts between priors, ensuring stable adaptation across instances. Through extensive experiments on five diverse datasets under the TTA protocol, our method consistently outperforms existing techniques, presenting state-of-the-art performance in real-time self-supervised adaptation without accessing source data. The code will be released later.

[131] Towards Continuous Home Cage Monitoring: An Evaluation of Tracking and Identification Strategies for Laboratory Mice

Juan Pablo Oberhauser,Daniel Grzenda

Main category: cs.CV

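TL;DR: This study presents a new pipeline, MouseTracks-Mouseformer-MouseMap, that combines appearance and motion cues, a transformer-based identity classifier, and optimized tracklet association to achieve efficient, accurate real-time tracking and identification of mice wearing ear tags.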

Details Motivation: Conventional methods struggle to provide individual mouse metrics because of high housing density, similar appearances, high mobility, and frequent interactions, so a more accurate real-time identification solution is needed. Method: The pipeline has three parts: (1) a custom multiple object tracker that combines appearance and motion cues (MouseTracks); (2) a transformer-based ID classifier (Mouseformer); and (3) a tracklet associator linear program that assigns final ID predictions (MouseMap). Result: The models assign animal IDs based on custom ear tags at 30 frames per second with 24/7 cage coverage. Compared with current mouse tracking methods, the pipeline lowers ID switches and improves tracking efficiency. Conclusion: The study develops a real-time identification algorithm that accurately assigns IDs to laboratory mice wearing custom ear tags, improving the efficiency and accuracy of animal monitoring. Abstract: Continuous, automated monitoring of laboratory mice enables more accurate data collection and improves animal welfare through real-time insights. Researchers can achieve a more dynamic and clinically relevant characterization of disease progression and therapeutic effects by integrating behavioral and physiological monitoring in the home cage. However, providing individual mouse metrics is difficult because of their housing density, similar appearances, high mobility, and frequent interactions. To address these challenges, we develop a real-time identification (ID) algorithm that accurately assigns ID predictions to mice wearing custom ear tags in digital home cages monitored by cameras. Our pipeline consists of three parts: (1) a custom multiple object tracker (MouseTracks) that combines appearance and motion cues from mice; (2) a transformer-based ID classifier (Mouseformer); and (3) a tracklet associator linear program to assign final ID predictions to tracklets (MouseMap). Our models assign an animal ID based on custom ear tags at 30 frames per second with 24/7 cage coverage. We show that our custom tracking and ID pipeline improves tracking efficiency and lowers ID switches across mouse strains and various environmental factors compared to current mouse tracking methods.
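
The third stage assigns final identities to tracklets with a linear program. The abstract gives no formulation, so the sketch below only illustrates the underlying one-to-one assignment idea on a toy case: given per-tracklet classifier scores, pick the assignment of distinct IDs that maximizes the total score. It swaps the linear program for brute-force enumeration, which is practical only for the handful of animals in one cage, and `assign_ids` and the score matrix are hypothetical.

```python
from itertools import permutations

def assign_ids(scores):
    """scores[t][i] is an assumed classifier confidence that tracklet t has identity i.
    Returns a list where entry t is the identity given to tracklet t,
    chosen so that every identity is used exactly once and the total score is maximal."""
    n = len(scores)
    best = max(
        permutations(range(n)),
        key=lambda perm: sum(scores[t][perm[t]] for t in range(n)),
    )
    return list(best)

if __name__ == "__main__":
    scores = [
        [0.6, 0.4, 0.0],   # tracklet 0 weakly prefers ID 0
        [0.7, 0.2, 0.1],   # tracklet 1 strongly prefers ID 0
        [0.1, 0.1, 0.8],   # tracklet 2 clearly matches ID 2
    ]
    print(assign_ids(scores))
```

Taking the per-tracklet argmax here would give ID 0 to two tracklets at once; solving the assignment jointly resolves that conflict, which is the kind of error a global associator is meant to prevent.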

[132] TinierHAR: Towards Ultra-Lightweight Deep Learning Models for Efficient Human Activity Recognition on Edge Devices

Sizhen Bian,Mengxi Liu,Vitor Fortes Rey,Daniel Geissler,Paul Lukowicz

Main category: cs.CV

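TL;DR: This paper proposes TinierHAR, an ultra-lightweight model for human activity recognition that substantially reduces computational requirements while maintaining performance.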

Details Motivation: The goal is to enable accurate and computationally efficient inference models on resource-constrained wearable devices. Method: TinierHAR combines residual depthwise separable convolutions, gated recurrent units (GRUs), and temporal aggregation. Result: TinierHAR reduces parameters by 2.7x relative to TinyHAR and 43.3x relative to DeepConvLSTM, and reduces MACs by 6.4x and 58.6x, respectively, while maintaining F1-scores. Conclusion: TinierHAR is an ultra-lightweight deep learning architecture that performs efficient human activity recognition on resource-constrained wearable devices while maintaining performance. Abstract: Human Activity Recognition (HAR) on resource-constrained wearable devices demands inference models that harmonize accuracy with computational efficiency. This paper introduces TinierHAR, an ultra-lightweight deep learning architecture that synergizes residual depthwise separable convolutions, gated recurrent units (GRUs), and temporal aggregation to achieve SOTA efficiency without compromising performance. Evaluated across 14 public HAR datasets, TinierHAR reduces parameters by 2.7x (vs. TinyHAR) and 43.3x (vs. DeepConvLSTM), and MACs by 6.4x and 58.6x, respectively, while maintaining the averaged F1-scores. Beyond quantitative gains, this work provides the first systematic ablation study dissecting the contributions of spatial-temporal components across the proposed TinierHAR, the prior SOTA TinyHAR, and the classical DeepConvLSTM, offering actionable insights for designing efficient HAR systems. Finally, we discuss the findings and suggest principled design guidelines for future efficient HAR. To catalyze edge-HAR research, we open-source all materials in this work for future benchmarking at https://github.com/zhaxidele/TinierHAR.
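
One reason depthwise separable convolutions help shrink a model is simple parameter arithmetic. The sketch below compares the weight count of a standard 1D convolution with its depthwise separable counterpart (a per-channel depthwise filter followed by a 1x1 pointwise mix). The layer sizes are hypothetical and bias terms are ignored; the numbers illustrate the general technique and are not TinierHAR's actual layer configuration.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard 1D convolution: every output channel sees every input channel."""
    return kernel * c_in * c_out

def depthwise_separable_params(kernel, c_in, c_out):
    """Depthwise stage (one filter per input channel) plus 1x1 pointwise stage."""
    return kernel * c_in + c_in * c_out

if __name__ == "__main__":
    kernel, c_in, c_out = 5, 64, 64   # hypothetical layer shape
    std = standard_conv_params(kernel, c_in, c_out)
    sep = depthwise_separable_params(kernel, c_in, c_out)
    print(std, sep, round(std / sep, 2))
```

For this assumed layer the separable version needs 4,416 weights instead of 20,480, and the saving grows with kernel size and channel count.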

[133] Martian World Models: Controllable Video Synthesis with Physically Accurate 3D Reconstructions

Longfei Li,Zhiwen Fan,Wenyan Cong,Xinhang Liu,Yuyang Yin,Matt Foutter,Panwang Pan,Chenyu You,Yue Wang,Zhangyang Wang,Yao Zhao,Marco Pavone,Yunchao Wei

Main category: cs.CV

TL;DR: This paper presents a holistic solution for generating high-quality Martian landscape videos by combining a data curation pipeline (M3arsSynth) and a video generation system (MarsGen), which together overcome the limitations of scarce Martian data and domain gaps with terrestrial imagery.

Details Motivation: Synthesizing realistic Martian landscape videos is essential for mission rehearsal and robotic simulation, but this task faces challenges due to the scarcity of high-quality Martian data and the significant domain gap between Martian and terrestrial imagery. Method: 1) Development of a data curation pipeline called Multimodal Mars Synthesis (M3arsSynth) to reconstruct 3D Martian environments from stereo navigation images sourced from NASA's Planetary Data System (PDS), generating high-fidelity multiview 3D video sequences. 2) Creation of MarsGen, a Martian terrain video generator that synthesizes novel videos based on an initial image frame, optionally using camera trajectories or textual prompts. Result: The experimental results demonstrate that the proposed approach outperforms video synthesis models trained on terrestrial datasets, achieving superior visual fidelity and 3D structural consistency. Conclusion: The proposed solution, MarsGen, fine-tuned on M3arsSynth data, effectively synthesizes high-quality Martian terrain videos that outperform existing video synthesis models trained on terrestrial datasets in terms of visual fidelity and 3D structural consistency. Abstract: Synthesizing realistic Martian landscape videos is crucial for mission rehearsal and robotic simulation. However, this task poses unique challenges due to the scarcity of high-quality Martian data and the significant domain gap between Martian and terrestrial imagery. To address these challenges, we propose a holistic solution composed of two key components: 1) A data curation pipeline Multimodal Mars Synthesis (M3arsSynth), which reconstructs 3D Martian environments from real stereo navigation images, sourced from NASA's Planetary Data System (PDS), and renders high-fidelity multiview 3D video sequences. 2) A Martian terrain video generator, MarsGen, which synthesizes novel videos visually realistic and geometrically consistent with the 3D structure encoded in the data. 
Our M3arsSynth engine spans a wide range of Martian terrains and acquisition dates, enabling the generation of physically accurate 3D surface models at metric-scale resolution. MarsGen, fine-tuned on M3arsSynth data, synthesizes videos conditioned on an initial image frame and, optionally, camera trajectories or textual prompts, allowing for video generation in novel environments. Experimental results show that our approach outperforms video synthesis models trained on terrestrial datasets, achieving superior visual fidelity and 3D structural consistency.

[134] Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling

Haoyu Wu,Diankun Wu,Tianyu He,Junliang Guo,Yang Ye,Yueqi Duan,Jiang Bian

Main category: cs.CV

TL;DR: Geometry Forcing improves video diffusion models by promoting latent 3D representations through angular and scale alignment objectives, enhancing visual quality and 3D consistency.

Details Motivation: Video diffusion models trained on raw video data often fail to capture geometric-aware structures, necessitating a method to bridge this gap and better align with the 3D nature of the physical world. Method: Geometry Forcing introduces two alignment objectives: Angular Alignment using cosine similarity for directional consistency and Scale Alignment for preserving scale information through regression. Result: Geometry Forcing demonstrates significant improvements in visual quality and 3D consistency on both camera view-conditioned and action-conditioned video generation tasks. Conclusion: The proposed Geometry Forcing method effectively enhances video diffusion models by encouraging latent 3D representations, leading to improved visual quality and 3D consistency. Abstract: Videos inherently represent 2D projections of a dynamic 3D world. However, our analysis suggests that video diffusion models trained solely on raw video data often fail to capture meaningful geometric-aware structure in their learned representations. To bridge this gap between video diffusion models and the underlying 3D nature of the physical world, we propose Geometry Forcing, a simple yet effective method that encourages video diffusion models to internalize latent 3D representations. Our key insight is to guide the model's intermediate representations toward geometry-aware structure by aligning them with features from a pretrained geometric foundation model. To this end, we introduce two complementary alignment objectives: Angular Alignment, which enforces directional consistency via cosine similarity, and Scale Alignment, which preserves scale-related information by regressing unnormalized geometric features from normalized diffusion representation. We evaluate Geometry Forcing on both camera view-conditioned and action-conditioned video generation tasks. 
Experimental results demonstrate that our method substantially improves visual quality and 3D consistency over the baseline methods. Project page: https://GeometryForcing.github.io.
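
The two alignment objectives can be sketched in a few lines. Below, Angular Alignment is written as one minus the mean cosine similarity between diffusion features and geometric features, and Scale Alignment as a mean squared error between raw (unnormalized) geometric features and a prediction made from unit-normalized diffusion features. The linear predictor `weight`, the exact loss forms, and all function names are assumptions for illustration; the abstract does not give the authors' formulation.

```python
import math

def _norm(v):
    return math.sqrt(sum(x * x for x in v))

def angular_alignment_loss(diff_feats, geo_feats):
    """1 - mean cosine similarity; zero when every pair points the same way.
    Assumes no feature vector is all zeros."""
    sims = []
    for d, g in zip(diff_feats, geo_feats):
        dot = sum(x * y for x, y in zip(d, g))
        sims.append(dot / (_norm(d) * _norm(g)))
    return 1.0 - sum(sims) / len(sims)

def scale_alignment_loss(diff_feats, geo_feats, weight):
    """MSE between raw geometric features and a linear prediction
    from unit-normalized diffusion features (weight is a hypothetical regressor)."""
    total, count = 0.0, 0
    for d, g in zip(diff_feats, geo_feats):
        n = _norm(d)
        unit = [x / n for x in d]                      # direction only, scale removed
        pred = [sum(w * u for w, u in zip(row, unit)) for row in weight]
        total += sum((p - t) ** 2 for p, t in zip(pred, g))
        count += len(g)
    return total / count

if __name__ == "__main__":
    diff = [[1.0, 0.0], [0.0, 2.0]]
    geo = [[2.0, 0.0], [0.0, 3.0]]
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(angular_alignment_loss(diff, geo))
    print(scale_alignment_loss(diff, geo, identity))
```

In this toy input the directions already agree, so the angular term vanishes while the scale term stays positive. That shows why a cosine objective alone cannot carry magnitude information and a separate regression term is needed.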

[135] OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding

JingLi Lin,Chenming Zhu,Runsen Xu,Xiaohan Mao,Xihui Liu,Tai Wang,Jiangmiao Pang

Main category: cs.CV

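TL;DR: OST-Bench is a new benchmark for evaluating the online spatio-temporal understanding of multimodal large language models in dynamic scenes, and it reveals the limitations of existing models on such tasks.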

Details Motivation: Existing multimodal benchmarks are usually based on offline settings and do not evaluate the incremental acquisition of information that occurs in real applications, so OST-Bench is introduced to reflect the challenges of embodied perception more faithfully. Method: The authors build the OST-Bench benchmark to evaluate the online spatio-temporal understanding of multimodal large language models (MLLMs) in dynamic scenes, and run experimental analyses to identify error patterns and performance bottlenecks. Result: Current leading MLLMs perform poorly on tasks that require complex spatio-temporal reasoning, and under the online setting their accuracy declines as the exploration horizon extends and the memory grows. Conclusion: OST-Bench highlights the shortcomings of current MLLMs on complex spatio-temporal reasoning, especially tasks that require long-term memory and spatial reasoning during online exploration. The study also releases its code, dataset, and benchmark to support further research. Abstract: Recent advances in multimodal large language models (MLLMs) have shown remarkable capabilities in integrating vision and language for complex reasoning. While most existing benchmarks evaluate models under offline settings with a fixed set of pre-recorded inputs, we introduce OST-Bench, a benchmark designed to evaluate Online Spatio-Temporal understanding from the perspective of an agent actively exploring a scene. The Online aspect emphasizes the need to process and reason over incrementally acquired observations, while the Spatio-Temporal component requires integrating current visual inputs with historical memory to support dynamic spatial reasoning. OST-Bench better reflects the challenges of real-world embodied perception. Built on an efficient data collection pipeline, OST-Bench consists of 1.4k scenes and 10k question-answer pairs collected from ScanNet, Matterport3D, and ARKitScenes. We evaluate several leading MLLMs on OST-Bench and observe that they fall short on tasks requiring complex spatio-temporal reasoning. Under the online setting, their accuracy declines as the exploration horizon extends and the memory grows. Through further experimental analysis, we identify common error patterns across models and find that both complex clue-based spatial reasoning demands and long-term memory retrieval requirements significantly drop model performance along two separate axes, highlighting the core challenges that must be addressed to improve online embodied reasoning. To foster further research and development in the field, our code, dataset, and benchmark are available.
Our project page is: https://rbler1234.github.io/OSTBench.github.io/

[136] CLIP Won't Learn Object-Attribute Binding from Natural Data and Here is Why

Bijay Gurung,David T. Hoffmann,Thomas Brox

Main category: cs.CV

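TL;DR: This paper examines why contrastive vision-language models such as CLIP struggle to learn object-attribute binding, and uses a synthetic dataset to identify how data properties affect CLIP's binding ability.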

Details Motivation: Although models such as CLIP are widely used for zero-shot classification and as components of multimodal models, their representations have major limitations, in particular when distinguishing between specific image descriptions. This paper aims to fill that research gap. Method: Using a synthetic dataset, the authors rigorously identify how different data properties influence CLIP's binding ability and test whether adding hard negatives improves binding performance. Result: Common properties of natural data (such as low attribute density, incomplete captions, and saliency bias) are detrimental to binding performance. Neither scaling the batch size nor explicitly creating hard negatives enables CLIP to learn reliable binding. Only when the data expresses the identified data properties does CLIP learn almost perfect binding. Conclusion: The key to solving CLIP's binding problem lies in the data itself rather than in the training method. Abstract: Contrastive vision-language models like CLIP are used for a large variety of applications, such as zero-shot classification or as vision encoder for multi-modal models. Despite their popularity, their representations show major limitations. For instance, CLIP models learn bag-of-words representations and, as a consequence, fail to distinguish whether an image is of "a yellow submarine and a blue bus" or "a blue submarine and a yellow bus". Previous attempts to fix this issue added hard negatives during training or modified the architecture, but failed to resolve the problem in its entirety. We suspect that the missing insights to solve the binding problem for CLIP are hidden in the arguably most important part of learning algorithms: the data. In this work, we fill this gap by rigorously identifying the influence of data properties on CLIP's ability to learn binding using a synthetic dataset. We find that common properties of natural data such as low attribute density, incomplete captions, and the saliency bias (a tendency of human captioners to describe the object that is "most salient" to them) have a detrimental effect on binding performance. In contrast to common belief, we find that neither scaling the batch size, i.e., implicitly adding more hard negatives, nor explicitly creating hard negatives enables CLIP to learn reliable binding. Only when the data expresses our identified data properties does CLIP learn almost perfect binding.

[137] Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs

Jeongseok Hyun,Sukjun Hwang,Su Ho Han,Taeoh Kim,Inwoong Lee,Dongyoon Wee,Joon-Young Lee,Seon Joo Kim,Minho Shim

Main category: cs.CV

TL;DR: The paper introduces STTM, a training-free spatio-temporal token merging technique that effectively reduces computational load in video LLMs while maintaining performance.

Details Motivation: The motivation behind STTM is to address the quadratic computational scaling issue of video large language models with token count by leveraging local spatial and temporal redundancy in video data. Method: The method involves transforming each frame into multi-granular spatial tokens using a coarse-to-fine search over a quadtree structure and directed pairwise merging across the temporal dimension. Result: STTM outperforms existing token reduction methods across six video QA benchmarks, achieving significant speed-up with only minor accuracy drops under reduced token budgets. Conclusion: STTM is an effective training-free method for spatio-temporal token merging that exploits local spatial and temporal redundancy in video data, providing speed-up with minimal accuracy drop. Abstract: Video large language models (LLMs) achieve strong video understanding by leveraging a large number of spatio-temporal tokens, but suffer from quadratic computational scaling with token count. To address this, we propose a training-free spatio-temporal token merging method, named STTM. Our key insight is to exploit local spatial and temporal redundancy in video data which has been overlooked in prior work. STTM first transforms each frame into multi-granular spatial tokens using a coarse-to-fine search over a quadtree structure, then performs directed pairwise merging across the temporal dimension. This decomposed merging approach outperforms existing token reduction methods across six video QA benchmarks. Notably, STTM achieves a 2$\times$ speed-up with only a 0.5% accuracy drop under a 50% token budget, and a 3$\times$ speed-up with just a 2% drop under a 30% budget. Moreover, STTM is query-agnostic, allowing KV cache reuse across different questions for the same video. The project page is available at https://www.jshyun.me/projects/sttm.
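
The spatial half of the method, a coarse-to-fine search over a quadtree, can be illustrated with a toy recursion: try to represent a whole square region with one averaged token, and split it into four quadrants only if some token in the region is too dissimilar from that average. This is a minimal sketch of the general quadtree idea, not STTM itself; the cosine-similarity test, the `threshold` parameter, and the function names are assumptions, and the temporal merging step is not covered.

```python
import math

def _mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def _cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def quadtree_merge(grid, threshold, row=0, col=0, size=None):
    """grid: square list-of-lists of token vectors whose side is a power of two.
    Returns (row, col, size, token) tuples, one per kept (possibly merged) token."""
    if size is None:
        size = len(grid)
    block = [grid[r][c] for r in range(row, row + size) for c in range(col, col + size)]
    center = _mean(block)
    # Coarse level first: keep one token if the whole region is homogeneous enough.
    if size == 1 or all(_cos(t, center) >= threshold for t in block):
        return [(row, col, size, center)]
    half = size // 2
    merged = []
    for dr in (0, half):                 # otherwise refine into four quadrants
        for dc in (0, half):
            merged += quadtree_merge(grid, threshold, row + dr, col + dc, half)
    return merged

if __name__ == "__main__":
    grid = [[[1, 0] for _ in range(4)] for _ in range(4)]
    grid[0][1] = [0, 1]                  # add detail to the top-left quadrant only
    grid[1][1] = [0, 1]
    tokens = quadtree_merge(grid, threshold=0.9)
    print(len(tokens), [(r, c, s) for r, c, s, _ in tokens])
```

Uniform regions collapse to a single coarse token while detailed regions keep fine-grained tokens, so the token budget follows the local redundancy of each frame.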

[138] Multigranular Evaluation for Brain Visual Decoding

Weihao Xia,Cengiz Oztireli

Main category: cs.CV

TL;DR: This paper proposes BASIC, a novel evaluation framework for brain visual decoding that overcomes the limitations of existing protocols by incorporating multiple levels of analysis.

Details Motivation: Existing evaluation protocols for brain visual decoding have limitations in capturing fine-grained visual distinctions and lack neuroscientific foundation. Method: The introduction of BASIC, which incorporates structural fidelity, inferential alignment, and contextual coherence to evaluate brain visual decoding methods. Result: A benchmark of diverse visual decoding methods across multiple stimulus-neuroimaging datasets within the unified evaluation framework was conducted. Conclusion: BASIC provides a more discriminative, interpretable, and comprehensive foundation for measuring brain visual decoding methods. Abstract: Existing evaluation protocols for brain visual decoding predominantly rely on coarse metrics that obscure inter-model differences, lack neuroscientific foundation, and fail to capture fine-grained visual distinctions. To address these limitations, we introduce BASIC, a unified, multigranular evaluation framework that jointly quantifies structural fidelity, inferential alignment, and contextual coherence between decoded and ground truth images. For the structural level, we introduce a hierarchical suite of segmentation-based metrics, including foreground, semantic, instance, and component masks, anchored in granularity-aware correspondence across mask structures. For the semantic level, we extract structured scene representations encompassing objects, attributes, and relationships using multimodal large language models, enabling detailed, scalable, and context-rich comparisons with ground-truth stimuli. We benchmark a diverse set of visual decoding methods across multiple stimulus-neuroimaging datasets within this unified evaluation framework. Together, these criteria provide a more discriminative, interpretable, and comprehensive foundation for measuring brain visual decoding methods.

[139] Doodle Your Keypoints: Sketch-Based Few-Shot Keypoint Detection

Subhajit Maity,Ayan Kumar Bhunia,Subhadeep Koley,Pinaki Nath Chowdhury,Aneeshan Sain,Yi-Zhe Song

Main category: cs.CV

TL;DR: This paper proposes a framework for few-shot keypoint detection using sketches, addressing challenges in cross-modal embeddings and user-specific styles through a prototypical setup and domain adaptation techniques.

Details Motivation: Keypoint detection faces challenges in few-shot learning when source data from the same distribution as the query is unavailable, which is addressed by leveraging sketches as a source-free alternative. Method: The framework employs a prototypical setup, a grid-based locator, and prototypical domain adaptation to overcome challenges in cross-modal embeddings and user-specific sketch styles. Result: Extensive experiments demonstrate the framework's success in few-shot convergence across novel keypoints and classes. Conclusion: The proposed framework successfully addresses the challenges in few-shot keypoint detection by using sketches as a source-free alternative and demonstrates success in convergence across novel keypoints and classes. Abstract: Keypoint detection, integral to modern machine perception, faces challenges in few-shot learning, particularly when source data from the same distribution as the query is unavailable. This gap is addressed by leveraging sketches, a popular form of human expression, providing a source-free alternative. However, challenges arise in mastering cross-modal embeddings and handling user-specific sketch styles. Our proposed framework overcomes these hurdles with a prototypical setup, combined with a grid-based locator and prototypical domain adaptation. We also demonstrate success in few-shot convergence across novel keypoints and classes through extensive experiments.

Shivam Duggal,Sanghyun Byun,William T. Freeman,Antonio Torralba,Phillip Isola

Main category: cs.CV

TL;DR: KARL is a new adaptive tokenization method for images that determines the appropriate number of tokens in a single pass, aligning with Algorithmic Information Theory principles and matching the performance of existing methods.

Details Motivation: Most visual representation learning systems use fixed-length representations, ignoring variations in complexity or familiarity. KARL was developed to address this issue by allocating variable-length representations more efficiently. Method: KARL, a single-pass adaptive tokenizer inspired by Kolmogorov Complexity principles, predicts the appropriate number of tokens for an image in one forward pass, halting once its approximate Kolmogorov Complexity is reached. Its training follows the Upside-Down Reinforcement Learning paradigm. Result: KARL successfully predicts the appropriate number of tokens for an image in a single forward pass and matches the performance of other adaptive tokenization methods without requiring multiple passes. Scaling laws and conceptual studies were also presented. Conclusion: KARL matches the performance of recent adaptive tokenizers while operating in a single pass. The study also reveals that KARL's approach aligns with human intuition regarding image complexity. Abstract: According to Algorithmic Information Theory (AIT) -- Intelligent representations compress data into the shortest possible program that can reconstruct its content, exhibiting low Kolmogorov Complexity (KC). In contrast, most visual representation learning systems use fixed-length representations for all inputs, ignoring variations in complexity or familiarity. Recent adaptive tokenization methods address this by allocating variable-length representations but typically require test-time search over multiple encodings to find the most predictive one. Inspired by Kolmogorov Complexity principles, we propose a single-pass adaptive tokenizer, KARL, which predicts the appropriate number of tokens for an image in a single forward pass, halting once its approximate KC is reached. The token count serves as a proxy for the minimum description length. 
KARL's training procedure closely resembles the Upside-Down Reinforcement Learning paradigm, as it learns to conditionally predict token halting based on a desired reconstruction quality. KARL matches the performance of recent adaptive tokenizers while operating in a single pass. We present scaling laws for KARL, analyzing the role of encoder/decoder size, continuous vs. discrete tokenization and more. Additionally, we offer a conceptual study drawing an analogy between Adaptive Image Tokenization and Algorithmic Information Theory, examining the predicted image complexity (KC) across axes such as structure vs. noise and in- vs. out-of-distribution familiarity -- revealing alignment with human intuition.

[141] MGVQ: Could VQ-VAE Beat VAE? A Generalizable Tokenizer with Multi-group Quantization

Mingkai Jia,Wei Yin,Xiaotao Hu,Jiaxin Guo,Xiaoyang Guo,Qian Zhang,Xiao-Xiao Long,Ping Tan

Main category: cs.CV

TL;DR: The paper proposes MGVQ, which improves VQ-VAE performance by enhancing codebook representation, achieving better reconstruction quality across multiple benchmarks.

Details Motivation: Existing VQ-VAEs have a significant performance gap compared to VAEs in terms of reconstruction quality. The authors aim to narrow this gap by improving the discrete codebook representation and optimization. Method: MGVQ retains the latent dimension to preserve encoded features and incorporates sub-codebooks for quantization, enhancing representation capability and minimizing information loss. Result: MGVQ outperforms SD-VAE on ImageNet with a lower rFID score (0.49 vs. 0.91) and achieves superior PSNR on all zero-shot benchmarks. Conclusion: MGVQ achieves state-of-the-art performance on ImageNet and eight zero-shot benchmarks, demonstrating its superiority in reconstruction quality compared to existing VQ-VAE methods. Abstract: Vector Quantized Variational Autoencoders (VQ-VAEs) are fundamental models that compress continuous visual data into discrete tokens. Existing methods have tried to improve the quantization strategy for better reconstruction quality; however, there still exists a large gap between VQ-VAEs and VAEs. To narrow this gap, we propose MGVQ, a novel method to augment the representation capability of discrete codebooks, facilitating easier optimization for codebooks and minimizing information loss, thereby enhancing reconstruction quality. Specifically, we propose to retain the latent dimension to preserve encoded features and incorporate a set of sub-codebooks for quantization. Furthermore, we construct comprehensive zero-shot benchmarks featuring resolutions of 512p and 2k to evaluate the reconstruction performance of existing methods rigorously. MGVQ achieves state-of-the-art performance on both ImageNet and 8 zero-shot benchmarks across all VQ-VAEs. Notably, compared with SD-VAE, we outperform them on ImageNet significantly, with rFID 0.49 vs. 0.91, and achieve superior PSNR on all zero-shot benchmarks. 
These results highlight the superiority of MGVQ in reconstruction and pave the way for preserving fidelity in HD image processing tasks. Code will be publicly available at https://github.com/MKJia/MGVQ.
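
The MGVQ abstract describes retaining the latent dimension and quantizing with a set of sub-codebooks, but it gives no implementation details. The following is a minimal sketch of one plausible reading, assuming the latent vector is split into equal channel groups and each group is snapped to the nearest entry of its own sub-codebook. The function name, array shapes, and nearest-neighbor rule are illustrative assumptions, not the authors' code.

```python
import numpy as np

def multi_group_quantize(z, codebooks):
    """Quantize latents group by group (hypothetical helper).

    z:         array of shape (N, D), N latent vectors of dimension D
    codebooks: list of G arrays, each of shape (K, D // G)
    Returns the quantized latents (N, D) and the chosen indices (N, G).
    """
    # Split the channel axis into G equal groups; D must be divisible by G.
    groups = np.split(z, len(codebooks), axis=1)
    quantized, indices = [], []
    for chunk, codebook in zip(groups, codebooks):
        # Squared Euclidean distance from every chunk to every code: (N, K).
        dists = ((chunk[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)
        indices.append(idx)
        quantized.append(codebook[idx])
    return np.concatenate(quantized, axis=1), np.stack(indices, axis=1)

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(16, 4)) for _ in range(2)]  # G=2 groups, K=16 codes
z = rng.normal(size=(5, 8))                               # N=5 latents, D=8
z_q, codes = multi_group_quantize(z, codebooks)

# A latent built from existing code entries is recovered exactly.
exact = np.concatenate([codebooks[0][3], codebooks[1][7]])[None, :]
exact_q, exact_codes = multi_group_quantize(exact, codebooks)
```

In this sketch, G groups with K codes each can represent up to K^G distinct latents while storing only G*K code vectors, which is one way to read the abstract's claim of augmented representation capability.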

[142] Impact of Pretraining Word Co-occurrence on Compositional Generalization in Multimodal Models

Helen Qu,Sang Michael Xie

Main category: cs.CV

TL;DR: This paper demonstrates that the performance of CLIP and LMMs is strongly influenced by the combination of concepts in the training data, as measured by PMI, highlighting the importance of improving compositional generalization in multimodal models.

Details Motivation: This paper investigates the unclear impact of concept combinations in training data on compositional generalization in CLIP and large multimodal models, especially focusing on how accuracy varies when common objects appear in uncommon pairings. Method: The authors use pointwise mutual information (PMI) to measure word co-occurrence statistics in the pretraining dataset. They synthetically generate images with varying concept pairs to evaluate zero-shot accuracy and apply edits to natural images to reproduce the effect. The study also examines how these effects transfer to LMMs built on CLIP. Result: There is a strong correlation between PMI in the CLIP pretraining data and zero-shot accuracy (r=0.97), with a 14% accuracy gap observed between images in the top and bottom 5% of PMI values. The study reproduces this effect in edited natural images (r=0.75) and shows it transfers to LMMs (r=0.70 for TextVQA, r=0.62 for VQAv2). Conclusion: The paper concludes that compositional generalization in multimodal models like CLIP and LMMs is significantly affected by concept combinations, as measured by PMI. This suggests the need for improved algorithms to handle such generalization without combinatorial scaling of training data. Abstract: CLIP and large multimodal models (LMMs) have better accuracy on examples involving concepts that are highly represented in the training data. However, the role of concept combinations in the training data on compositional generalization is largely unclear -- for instance, how does accuracy vary when a common object appears in an uncommon pairing with another object? In this paper, we investigate how word co-occurrence statistics in the pretraining dataset (a proxy for co-occurrence of visual concepts) impact CLIP/LMM performance. 
To disentangle the effects of word co-occurrence frequencies from single-word frequencies, we measure co-occurrence with pointwise mutual information (PMI), which normalizes the joint probability of two words co-occurring by the probability of co-occurring independently. Using synthetically generated images with a variety of concept pairs, we show a strong correlation between PMI in the CLIP pretraining data and zero-shot accuracy in CLIP models trained on LAION-400M (r=0.97 and 14% accuracy gap between images in the top and bottom 5% of PMI values), demonstrating that even accuracy on common concepts is affected by the combination of concepts in the image. Leveraging this finding, we reproduce this effect in natural images by editing them to contain pairs with varying PMI, resulting in a correlation of r=0.75. Finally, we demonstrate that this behavior in CLIP transfers to LMMs built on top of CLIP (r=0.70 for TextVQA, r=0.62 for VQAv2). Our findings highlight the need for algorithms and architectures that improve compositional generalization in multimodal models without scaling the training data combinatorially. Our code is available at https://github.com/helenqu/multimodal-pretraining-pmi.
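
The PMI measure described in this abstract can be illustrated with a short sketch. It uses the standard definition PMI(a, b) = log(p(a, b) / (p(a) p(b))) and assumes, for illustration only, that probabilities are estimated as the fraction of captions containing a word or word pair and that captions are split on whitespace. The function name and toy captions are hypothetical and are not taken from the authors' repository.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_table(captions):
    """Return PMI for every unordered word pair that co-occurs in a caption."""
    n = len(captions)
    word_counts = Counter()
    pair_counts = Counter()
    for caption in captions:
        # Count each word at most once per caption (document-level frequency).
        words = sorted(set(caption.lower().split()))
        word_counts.update(words)
        pair_counts.update(combinations(words, 2))
    table = {}
    for (a, b), joint in pair_counts.items():
        p_ab = joint / n
        p_a = word_counts[a] / n
        p_b = word_counts[b] / n
        # Positive when the pair co-occurs more often than independence predicts.
        table[(a, b)] = math.log(p_ab / (p_a * p_b))
    return table

captions = ["dog grass", "dog grass", "dog sofa", "cat sofa"]
table = pmi_table(captions)
```

In this toy corpus "dog" is the most frequent word, yet the pair ("dog", "sofa") gets a negative PMI while ("dog", "grass") gets a positive one, which is the separation of pair frequency from single-word frequency that the abstract relies on.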

eess.IV [Back]

[143] Semi-supervised learning and integration of multi-sequence MR-images for carotid vessel wall and plaque segmentation

Marie-Christine Pali,Christina Schwaiger,Malik Galijasevic,Valentin K. Ladenhauf,Stephanie Mangesius,Elke R. Gizewski

Main category: eess.IV

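TL;DR: This study proposes a semi-supervised deep learning method for carotid artery segmentation in multi-sequence MRI data, combining multi-sequence information with semi-supervised learning to address limited data and plaque complexity.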

Details Motivation: Accurate segmentation of carotid plaques is crucial for assessing the risk of atherosclerosis and ischemic stroke, but the complex morphology of plaques and the scarcity of labeled data pose challenges. Method: A semi-supervised deep learning method is proposed that integrates multi-sequence MRI data to segment the carotid vessel wall and plaque. It comprises a coarse localization model and a fine segmentation model, and introduces a multi-level multi-sequence U-Net architecture. Result: Experiments on 52 patients demonstrate the method's effectiveness and emphasize the importance of fusion point selection in U-Net architectures. Conclusion: The study demonstrates the potential of fusion strategies and semi-supervised learning to improve carotid artery segmentation when MRI data are limited. Abstract: The analysis of carotid arteries, particularly plaques, in multi-sequence Magnetic Resonance Imaging (MRI) data is crucial for assessing the risk of atherosclerosis and ischemic stroke. In order to evaluate metrics and radiomic features, quantifying the state of atherosclerosis, accurate segmentation is important. However, the complex morphology of plaques and the scarcity of labeled data pose significant challenges. In this work, we address these problems and propose a semi-supervised deep learning-based approach designed to effectively integrate multi-sequence MRI data for the segmentation of carotid artery vessel wall and plaque. The proposed algorithm consists of two networks: a coarse localization model identifies the region of interest guided by some prior knowledge on the position and number of carotid arteries, followed by a fine segmentation model for precise delineation of vessel walls and plaques. To effectively integrate complementary information across different MRI sequences, we investigate different fusion strategies and introduce a multi-level multi-sequence version of U-Net architecture. To address the challenges of limited labeled data and the complexity of carotid artery MRI, we propose a semi-supervised approach that enforces consistency under various input transformations. Our approach is evaluated on 52 patients with arteriosclerosis, each with five MRI sequences. Comprehensive experiments demonstrate the effectiveness of our approach and emphasize the role of fusion point selection in U-Net-based architectures. To validate the accuracy of our results, we also include an expert-based assessment of model performance. 
Our findings highlight the potential of fusion strategies and semi-supervised learning for improving carotid artery segmentation in data-limited MRI applications.
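
The abstract says the semi-supervised approach enforces consistency under various input transformations but does not name the transformations or the loss. A minimal sketch of the general idea for unlabeled images, assuming a horizontal flip as the transformation and mean squared error as the penalty (both assumptions), with two toy stand-in models that are purely hypothetical:

```python
import numpy as np

def consistency_loss(model, image):
    """MSE between the prediction on a flipped image and the flipped prediction.

    A segmentation model that behaves consistently under the flip should give
    (near) zero loss, and no ground-truth label is needed to compute it.
    """
    pred_of_flipped = model(np.flip(image, axis=1))
    flipped_pred = np.flip(model(image), axis=1)
    return float(np.mean((pred_of_flipped - flipped_pred) ** 2))

def pointwise_model(x):
    # Toy stand-in: per-pixel sigmoid, which commutes with flipping.
    return 1.0 / (1.0 + np.exp(-x))

def shifted_model(x):
    # Toy stand-in: shifts columns by one, which does not commute with flipping.
    return np.roll(x, 1, axis=1)

image = np.arange(12, dtype=float).reshape(3, 4)
consistent = consistency_loss(pointwise_model, image)
inconsistent = consistency_loss(shifted_model, image)
```

In training, a term like this would typically be added to the supervised loss on the labeled subset, so that unlabeled scans still contribute a learning signal.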

[144] D-CNN and VQ-VAE Autoencoders for Compression and Denoising of Industrial X-ray Computed Tomography Images

Bardia Hejazi,Keerthana Chand,Tobias Fritsch,Giovanni Bruno

Main category: eess.IV

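TL;DR: This study investigates deep learning-based compression of X-ray computed tomography data and its effect on data quality, and proposes an evaluation metric sensitive to edge preservation that is suited to three-dimensional data analysis.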

Details Motivation: As imaging technologies advance, the volume of data produced in imaging sciences keeps growing, which calls for efficient and reliable data storage solutions. Method: Deep learning autoencoders (a D-CNN and a VQ-VAE) are used to compress industrial X-ray computed tomography data, and a metric sensitive to edge preservation is introduced to assess the quality of the decompressed images. Result: The two network architectures, which have different compression rates, differ in decoded image quality, and the newly introduced metric gives a better assessment of edge preservation for three-dimensional data analysis. Conclusion: Different network architectures and compression rates affect the quality of the decompressed XCT data, so the compression strategy should be chosen according to the data characteristics that need to be preserved. Abstract: The ever-growing volume of data in imaging sciences, stemming from the advancements in imaging technologies, necessitates efficient and reliable storage solutions for such large datasets. This study investigates the compression of industrial X-ray computed tomography (XCT) data using deep learning autoencoders and examines how these compression algorithms affect the quality of the recovered data. Two network architectures with different compression rates were used, a deep convolution neural network (D-CNN) and a vector quantized variational autoencoder (VQ-VAE). The XCT data used was from a sandstone sample with a complex internal pore network. The quality of the decoded images obtained from the two different deep learning architectures with different compression rates was quantified and compared to the original input data. In addition, to improve image decoding quality metrics, we introduced a metric sensitive to edge preservation, which is crucial for three-dimensional data analysis. We showed that different architectures and compression rates are required depending on the specific characteristics needed to be preserved for later analysis. The findings presented here can help scientists determine the requirements and strategies for their data storage and analysis needs.

[145] Compressive Imaging Reconstruction via Tensor Decomposed Multi-Resolution Grid Encoding

Zhenyu Jin,Yisi Luo,Xile Zhao,Deyu Meng

Main category: eess.IV

TL;DR: GridTD is an innovative unsupervised continuous representation framework that improves compressive imaging reconstruction by combining multi-resolution grid encoding with tensor decomposition.

Details Motivation: To address the limitations of existing unsupervised representations in balancing representation ability and efficiency in compressive imaging reconstruction. Method: Tensor Decomposed multi-resolution Grid encoding (GridTD), which combines multi-resolution hash grid encoding with tensor decomposition in a neural network framework. Result: Theoretical analyses show advantages in Lipschitz property, generalization error bound, and fixed-point convergence; experiments demonstrate consistent superiority across multiple CI tasks. Conclusion: GridTD provides a more effective and efficient unsupervised continuous representation framework for CI reconstruction compared to existing methods. Abstract: Compressive imaging (CI) reconstruction, such as snapshot compressive imaging (SCI) and compressive sensing magnetic resonance imaging (MRI), aims to recover high-dimensional images from low-dimensional compressed measurements. This process critically relies on learning an accurate representation of the underlying high-dimensional image. However, existing unsupervised representations may struggle to achieve a desired balance between representation ability and efficiency. To overcome this limitation, we propose Tensor Decomposed multi-resolution Grid encoding (GridTD), an unsupervised continuous representation framework for CI reconstruction. GridTD optimizes a lightweight neural network and the input tensor decomposition model whose parameters are learned via multi-resolution hash grid encoding. It inherently enjoys the hierarchical modeling ability of multi-resolution grid encoding and the compactness of tensor decomposition, enabling effective and efficient reconstruction of high-dimensional images. Theoretical analyses for the algorithm's Lipschitz property, generalization error bound, and fixed-point convergence reveal the intrinsic superiority of GridTD as compared with existing continuous representation models. 
Extensive experiments across diverse CI tasks, including video SCI, spectral SCI, and compressive dynamic MRI reconstruction, consistently demonstrate the superiority of GridTD over existing methods, positioning GridTD as a versatile and state-of-the-art CI reconstruction method.

[146] Breast Ultrasound Tumor Generation via Mask Generator and Text-Guided Network: A Clinically Controllable Framework with Downstream Evaluation

Haoyu Pan,Hongxin Lin,Zetian Feng,Chuxuan Lin,Junyang Mo,Chu Zhang,Zijian Wu,Yi Wang,Qingqing Zheng

Main category: eess.IV

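TL;DR: The paper proposes a clinically controllable generative framework for synthesizing high-quality breast ultrasound images, addressing the scarcity of expert-annotated data.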

Details Motivation: The scarcity of expert-annotated data constrains the development of robust deep learning models, so a method for effectively synthesizing breast ultrasound images is needed. Method: A generative framework that integrates clinical descriptions with structural masks to generate tumors, together with a semantic-curvature mask generator. Result: Quantitative evaluations on six public breast ultrasound datasets show that the synthetic images are effective in enhancing downstream breast cancer diagnosis tasks, and the images passed visual Turing tests conducted by experienced sonographers. Conclusion: The proposed clinically controllable generative framework has potential for broad clinical application in breast ultrasound image analysis and can generate personalized synthetic images that reflect real-world morphological diversity. Abstract: The development of robust deep learning models for breast ultrasound (BUS) image analysis is significantly constrained by the scarcity of expert-annotated data. To address this limitation, we propose a clinically controllable generative framework for synthesizing BUS images. This framework integrates clinical descriptions with structural masks to generate tumors, enabling fine-grained control over tumor characteristics such as morphology, echogenicity, and shape. Furthermore, we design a semantic-curvature mask generator, which synthesizes structurally diverse tumor masks guided by clinical priors. During inference, synthetic tumor masks serve as input to the generative framework, producing highly personalized synthetic BUS images with tumors that reflect real-world morphological diversity. Quantitative evaluations on six public BUS datasets demonstrate the significant clinical utility of our synthetic images, showing their effectiveness in enhancing downstream breast cancer diagnosis tasks. Furthermore, visual Turing tests conducted by experienced sonographers confirm the realism of the generated images, indicating the framework's potential to support broader clinical applications.

[147] MeD-3D: A Multimodal Deep Learning Framework for Precise Recurrence Prediction in Clear Cell Renal Cell Carcinoma (ccRCC)

Hasaan Maqsood,Saif Ur Rehman Khan

Main category: eess.IV

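TL;DR: This study develops a new deep learning framework that integrates multiple data types to predict recurrence of clear cell renal cell carcinoma more accurately.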

Details Motivation: Because of the complex molecular, pathological, and clinical heterogeneity of ccRCC, traditional prognostic models that rely on a single data modality often fail to capture the disease's full complexity, resulting in insufficient predictive accuracy. Method: The framework uses domain-specific models for each modality: CLAM (a ResNet50-based model) for histopathology whole slide images (WSI), MeD-3D (a pre-trained 3D-ResNet18 model) for CT and MRI images, and a multi-layer perceptron (MLP) for structured clinical and genomic data. Complementary information from the modalities is combined through an early and late fusion architecture, and the framework is designed to run inference even when some modalities are missing. Result: A deep learning framework was developed that integrates CT, MRI, histopathology whole slide images (WSI), clinical data, and genomic profiles to improve prediction of ccRCC recurrence. Conclusion: The study proposes a deep learning framework that integrates multimodal data, aiming to improve the accuracy of ccRCC recurrence prediction and enhance clinical decision-making. Abstract: Accurate prediction of recurrence in clear cell renal cell carcinoma (ccRCC) remains a major clinical challenge due to the disease's complex molecular, pathological, and clinical heterogeneity. Traditional prognostic models, which rely on single data modalities such as radiology, histopathology, or genomics, often fail to capture the full spectrum of disease complexity, resulting in suboptimal predictive accuracy. This study aims to overcome these limitations by proposing a deep learning (DL) framework that integrates multimodal data, including CT, MRI, histopathology whole slide images (WSI), clinical data, and genomic profiles, to improve the prediction of ccRCC recurrence and enhance clinical decision-making. The proposed framework utilizes a comprehensive dataset curated from multiple publicly available sources, including TCGA, TCIA, and CPTAC. To process the diverse modalities, domain-specific models are employed: CLAM, a ResNet50-based model, is used for histopathology WSIs, while MeD-3D, a pre-trained 3D-ResNet18 model, processes CT and MRI images. For structured clinical and genomic data, a multi-layer perceptron (MLP) is used. These models are designed to extract deep feature embeddings from each modality, which are then fused through an early and late integration architecture. This fusion strategy enables the model to combine complementary information from multiple sources. 
Additionally, the framework is designed to handle incomplete data, a common challenge in clinical settings, by enabling inference even when certain modalities are missing.
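
The abstract states that inference works when some modalities are missing but does not say how. One common way to achieve this, shown here only as a sketch under that assumption, is to average the embeddings of whichever modalities are present. The function name, the dictionary keys, and the use of a plain mean are hypothetical and are not taken from the paper.

```python
import numpy as np

def fuse_available(embeddings):
    """Average the embeddings of the modalities that are present.

    embeddings: dict mapping a modality name to a 1-D feature vector,
                or to None when that modality is missing for the patient.
    """
    present = [e for e in embeddings.values() if e is not None]
    if not present:
        raise ValueError("at least one modality is required")
    # All present embeddings are assumed to share the same dimension.
    return np.mean(np.stack(present), axis=0)

patient = {
    "ct": np.array([1.0, 3.0]),
    "wsi": None,                      # histopathology slide not available
    "clinical": np.array([3.0, 5.0]),
}
fused = fuse_available(patient)
```

Because the mean is taken only over present modalities, the fused vector keeps the same dimension and scale regardless of how many inputs are missing, so a downstream classifier can consume it unchanged.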

[148] ArteryX: Advancing Brain Artery Feature Extraction with Vessel-Fused Networks and a Robust Validation Framework

Abrar Faiyaz,Nhat Hoang,Giovanni Schifitto,Md Nasir Uddin

Main category: eess.IV

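TL;DR: This paper introduces ArteryX, a new framework that quantifies vascular features with high accuracy and efficiency, helping with early detection of cerebrovascular disease and advancing understanding of vascular contributions to brain health.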

Details Motivation: Existing methods for extracting arterial features from MRA suffer from user-dependent variability, steep learning curves, and a lack of standardized quantitative validation. Method: A new semi-supervised artery evaluation framework, ArteryX, is proposed: a MATLAB-based toolbox that quantifies vascular features using a vessel-fused network based landmarking approach. Result: ArteryX quantifies vascular features with high accuracy and efficiency, with processing times of about 10-15 minutes per subject at 0.5 mm resolution and minimal user intervention, and it shows higher sensitivity to subtle vascular changes in subjects with cerebral small vessel disease. Conclusion: ArteryX is a promising tool for benchmarking feature extraction toolboxes and for seamless integration into clinical workflows, enabling early detection of cerebrovascular pathology and standardized comparisons across patient cohorts. Abstract: Cerebrovascular pathology significantly contributes to cognitive decline and neurological disorders, underscoring the need for advanced tools to assess vascular integrity. Three-dimensional Time-of-Flight Magnetic Resonance Angiography (3D TOF MRA) is widely used to visualize cerebral vasculature; however, clinical evaluations generally focus on major arterial abnormalities, overlooking quantitative metrics critical for understanding subtle vascular changes. Existing methods for extracting structural, geometrical and morphological arterial features from MRA - whether manual or automated - face challenges including user-dependent variability, steep learning curves, and lack of standardized quantitative validations. We propose a novel semi-supervised artery evaluation framework, named ArteryX, a MATLAB-based toolbox that quantifies vascular features with high accuracy and efficiency, achieving processing times of ~10-15 minutes per subject at 0.5 mm resolution with minimal user intervention. ArteryX employs a vessel-fused network based landmarking approach to reliably track and manage tracings, effectively addressing the issue of dangling/disconnected vessels. Validation on human subjects with cerebral small vessel disease demonstrated its improved sensitivity to subtle vascular changes and better performance than an existing semi-automated method. Importantly, the ArteryX toolbox enables quantitative feature validation by integrating an in-vivo like artery simulation framework utilizing vessel-fused graph nodes and predefined ground-truth features for specific artery types. 
Thus, the ArteryX framework holds promise for benchmarking feature extraction toolboxes and for seamless integration into clinical workflows, enabling early detection of cerebrovascular pathology and standardized comparisons across patient cohorts to advance understanding of vascular contributions to brain health.