cs.CL

[1] Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs

Itay Itzhak,Yonatan Belinkov,Gabriel Stanovsky

Main category: cs.CL

TL;DR: This paper investigates the origins of cognitive biases in large language models and finds that they are primarily shaped by pretraining rather than finetuning or random training variations.

Details

Motivation: The study aims to disentangle whether differences in cognitive biases among large language models stem from pretraining, finetuning, or training stochasticity, as prior work has shown variability in biases but their sources remain unclear. Method: A two-step causal experimental approach was used: (1) multiple finetuning runs with different random seeds to assess the impact of training randomness on cognitive biases, and (2) cross-tuning, which involved swapping instruction datasets between models to determine if biases are dataset-dependent. Result: While training randomness introduces some variability, the study found that biases are mainly influenced by pretraining. Models sharing the same pretrained backbone exhibited more similar bias patterns compared to those that only shared finetuning data. Conclusion: Biases in large language models are primarily shaped by pretraining rather than just finetuning or training randomness, suggesting that future bias mitigation strategies should focus on pretraining origins. Abstract: Large language models (LLMs) exhibit cognitive biases -- systematic tendencies of irrational decision-making, similar to those seen in humans. Prior work has found that these biases vary across models and can be amplified by instruction tuning. However, it remains unclear if these differences in biases stem from pretraining, finetuning, or even random noise due to training stochasticity. We propose a two-step causal experimental approach to disentangle these factors. First, we finetune models multiple times using different random seeds to study how training randomness affects over 30 cognitive biases. Second, we introduce cross-tuning -- swapping instruction datasets between models to isolate bias sources. This swap uses datasets that led to different bias patterns, directly testing whether biases are dataset-dependent.
Our findings reveal that while training randomness introduces some variability, biases are mainly shaped by pretraining: models with the same pretrained backbone exhibit more similar bias patterns than those sharing only finetuning data. These insights suggest that understanding biases in finetuned models requires considering their pretraining origins beyond finetuning effects. This perspective can guide future efforts to develop principled strategies for evaluating and mitigating bias in LLMs.

[2] Prompt Perturbations Reveal Human-Like Biases in LLM Survey Responses

Jens Rupprecht,Georg Ahnert,Markus Strohmaier

Main category: cs.CL

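TL;DR: This paper studies the response robustness of large language models in normative survey contexts and finds a consistent recency bias across models.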

Details

Motivation: Large language models (LLMs) are increasingly used as proxies for human subjects in social science surveys, but their reliability and susceptibility to known response biases are poorly understood. Method: A comprehensive set of 11 perturbations is applied to questions from the World Values Survey (WVS), yielding over 167,000 simulated interviews. Result: LLMs are vulnerable to perturbations and show a consistent recency bias; larger models are generally more robust. Conclusion: Prompt design and robustness testing are critical when using LLMs to generate synthetic survey data. Abstract: Large Language Models (LLMs) are increasingly used as proxies for human subjects in social science surveys, but their reliability and susceptibility to known response biases are poorly understood. This paper investigates the response robustness of LLMs in normative survey contexts -- we test nine diverse LLMs on questions from the World Values Survey (WVS), applying a comprehensive set of 11 perturbations to both question phrasing and answer option structure, resulting in over 167,000 simulated interviews. In doing so, we not only reveal LLMs' vulnerabilities to perturbations but also reveal that all tested models exhibit a consistent recency bias varying in intensity, disproportionately favoring the last-presented answer option. While larger models are generally more robust, all models remain sensitive to semantic variations like paraphrasing and to combined perturbations. By applying a set of perturbations, we reveal that LLMs partially align with survey response biases identified in humans. This underscores the critical importance of prompt design and robustness testing when using LLMs to generate synthetic survey data.
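
Illustration (not from the paper): the recency-bias finding can be made concrete with a small measurement sketch. It assumes, hypothetically, that each simulated interview is logged as a pair (chosen position, number of options) and that option order is randomized per interview, so content preferences average out across positions. Neither the record format nor the uniform baseline is taken from the paper.

```python
def last_position_rate(records):
    """Fraction of answers that picked the last-presented option.

    `records` is a list of (chosen_index, n_options) pairs, a hypothetical
    log format with zero-based positions.
    """
    if not records:
        raise ValueError("records must not be empty")
    hits = sum(1 for chosen, n_options in records if chosen == n_options - 1)
    return hits / len(records)


def recency_excess(records):
    """Observed last-position rate minus the rate expected by chance.

    Assumption: with randomized option order, a position-insensitive
    respondent picks the last slot with probability 1 / n_options.
    """
    baseline = sum(1 / n_options for _, n_options in records) / len(records)
    return last_position_rate(records) - baseline
```

A positive `recency_excess` would suggest that the last slot is chosen more often than order-independent answering predicts; how large an excess counts as meaningful is left open here.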

[3] SynthTextEval: Synthetic Text Data Generation and Evaluation for High-Stakes Domains

Krithika Ramesh,Daniel Smolyak,Zihao Zhao,Nupoor Gandhi,Ritu Agarwal,Margrét Bjarnadóttir,Anjalie Field

Main category: cs.CL

TL;DR: SynthTextEval is a toolkit that evaluates synthetic text across various dimensions, enhancing its utility and promoting privacy in AI systems.

Details

Motivation: The motivation stems from the need for principled evaluations of synthetic text generated by large language models, particularly in high-stakes domains like healthcare and law, to ensure its safe and effective use while preserving privacy. Method: The toolkit provides evaluation modules for assessing synthetic text on utility, fairness, privacy risks, distributional differences, and qualitative expert feedback. It includes a generation module for creating synthetic data. Result: SynthTextEval enables comprehensive evaluations of synthetic text over custom or generated datasets, with demonstrated functionality and effectiveness in high-stakes domains. Conclusion: SynthTextEval is a toolkit designed to evaluate synthetic text across multiple dimensions, aiming to improve the viability of synthetic text and privacy-preservation in AI development. Abstract: We present SynthTextEval, a toolkit for conducting comprehensive evaluations of synthetic text. The fluency of large language model (LLM) outputs has made synthetic text potentially viable for numerous applications, such as reducing the risks of privacy violations in the development and deployment of AI systems in high-stakes domains. Realizing this potential, however, requires principled consistent evaluations of synthetic data across multiple dimensions: its utility in downstream systems, the fairness of these systems, the risk of privacy leakage, general distributional differences from the source text, and qualitative feedback from domain experts. SynthTextEval allows users to conduct evaluations along all of these dimensions over synthetic data that they upload or generate using the toolkit's generation module. While our toolkit can be run over any data, we highlight its functionality and effectiveness over datasets from two high-stakes domains: healthcare and law. 
By consolidating and standardizing evaluation metrics, we aim to improve the viability of synthetic text, and in-turn, privacy-preservation in AI development.

[4] Medical Red Teaming Protocol of Language Models: On the Importance of User Perspectives in Healthcare Settings

Minseon Kim,Jean-Philippe Corbeil,Alessandro Sordoni,Francois Beaulieu,Paul Vozila

Main category: cs.CL

TL;DR: This paper introduces a new safety evaluation framework for medical large language models, focusing on patient and clinician perspectives, and presents PatientSafetyBench as a benchmark for assessing safety in the medical domain.

Details

Motivation: The motivation stems from the increasing adoption of large language models (LLMs) in the medical field and the associated safety concerns due to their impact on human health. Prior evaluations have mainly focused on general safety benchmarks, leaving a lack of domain-specific safety assessments for medical LLMs. Method: The authors developed a safety evaluation protocol tailored to the medical domain, incorporating perspectives from both patients and clinicians, as well as general safety benchmarks. They created PatientSafetyBench, which contains 466 samples across five critical categories, and applied red-teaming methods on the MediPhi model collection as a case study. Result: The work produces PatientSafetyBench and red-teaming protocols that reveal gaps in current safety evaluations of medical LLMs. This approach enables a more comprehensive understanding of safety risks from different user perspectives: patients, clinicians, and general users. Conclusion: This paper concludes that there is a significant gap in safety evaluation for medical LLMs, particularly from the patient and clinician perspectives. By introducing PatientSafetyBench and applying red-teaming protocols on MediPhi models, it establishes a foundation for more thorough and domain-specific safety assessments before deploying these models in medical settings. Abstract: As the performance of large language models (LLMs) continues to advance, their adoption is expanding across a wide range of domains, including the medical field. The integration of LLMs into medical applications raises critical safety concerns, particularly due to their use by users with diverse roles, e.g., patients and clinicians, and the potential for model outputs to directly affect human health. Despite the domain-specific capabilities of medical LLMs, prior safety evaluations have largely focused only on general safety benchmarks.
In this paper, we introduce a safety evaluation protocol tailored to the medical domain from both patient and clinician user perspectives, alongside general safety assessments, and quantitatively analyze the safety of medical LLMs. We bridge a gap in the literature by building the PatientSafetyBench containing 466 samples over 5 critical categories to measure safety from the perspective of the patient. We apply our red-teaming protocols on the MediPhi model collection as a case study. To our knowledge, this is the first work to define safety evaluation criteria for medical LLMs through targeted red-teaming taking three different points of view - patient, clinician, and general user - establishing a foundation for safer deployment in medical domains.

[5] The Impact of Background Speech on Interruption Detection in Collaborative Groups

Mariah Bradford,Nikhil Krishnaswamy,Nathaniel Blanchard

Main category: cs.CL

TL;DR: This paper presents a new approach to detecting interruptions in classroom settings with multiple simultaneous conversations, improving AI support for collaborative learning.

Details

Motivation: Previous work on interruption detection has primarily focused on single-conversation environments with clean audio, while real classroom settings involve multiple concurrent conversations and overlapping speech. Method: The authors analyze interruption detection in both single-conversation and multi-group dialogue settings to create a robust method for identifying interruptions in environments with overlapping speech. Result: The study develops a state-of-the-art method for interruption identification that is effective in handling overlapping speech and provides insights into the linguistic and prosodic features of interruptions in group interactions. Conclusion: The paper concludes that interruption plays a significant role in collaborative learning, and AI-driven support can assist teachers in monitoring these interactions even with multiple concurrent conversations. Abstract: Interruption plays a crucial role in collaborative learning, shaping group interactions and influencing knowledge construction. AI-driven support can assist teachers in monitoring these interactions. However, most previous work on interruption detection and interpretation has been conducted in single-conversation environments with relatively clean audio. AI agents deployed in classrooms for collaborative learning within small groups will need to contend with multiple concurrent conversations -- in this context, overlapping speech will be ubiquitous, and interruptions will need to be identified in other ways. In this work, we analyze interruption detection in single-conversation and multi-group dialogue settings. We then create a state-of-the-art method for interruption identification that is robust to overlapping speech, and thus could be deployed in classrooms. Further, our work highlights meaningful linguistic and prosodic information about how interruptions manifest in collaborative group interactions. 
Our investigation also paves the way for future works to account for the influence of overlapping speech from multiple groups when tracking group dialog.

[6] Multi-Agent Retrieval-Augmented Framework for Evidence-Based Counterspeech Against Health Misinformation

Anirban Saha Anik,Xiaoying Song,Elliott Wang,Bryan Wang,Bengisu Yarimbas,Lingzi Hong

Main category: cs.CL

TL;DR: This paper proposes a Multi-agent Retrieval-Augmented Framework to generate effective counterspeech against health misinformation, demonstrating superior performance compared to existing methods.

Details

Motivation: Current studies on using LLMs with RAG for generating counterspeech against misinformation rely on limited evidence and offer less control over outputs, necessitating a more robust approach. Method: The method uses a Multi-agent Retrieval-Augmented Framework that incorporates multiple LLMs for optimizing knowledge retrieval, evidence enhancement, and response refinement. Result: The proposed method outperforms baseline approaches in terms of politeness, relevance, informativeness, and factual accuracy. Ablation studies and human evaluations further validate the necessity of each component and the effectiveness of response refinement. Conclusion: The proposed Multi-agent Retrieval-Augmented Framework effectively generates high-quality counterspeech against health misinformation by integrating multiple LLMs and both static and dynamic evidence. Abstract: Large language models (LLMs) incorporated with Retrieval-Augmented Generation (RAG) have demonstrated powerful capabilities in generating counterspeech against misinformation. However, current studies rely on limited evidence and offer less control over final outputs. To address these challenges, we propose a Multi-agent Retrieval-Augmented Framework to generate counterspeech against health misinformation, incorporating multiple LLMs to optimize knowledge retrieval, evidence enhancement, and response refinement. Our approach integrates both static and dynamic evidence, ensuring that the generated counterspeech is relevant, well-grounded, and up-to-date. Our method outperforms baseline approaches in politeness, relevance, informativeness, and factual accuracy, demonstrating its effectiveness in generating high-quality counterspeech. To further validate our approach, we conduct ablation studies to verify the necessity of each component in our framework. Furthermore, human evaluations reveal that refinement significantly enhances counterspeech quality and obtains human preference.

[7] GNN-CNN: An Efficient Hybrid Model of Convolutional and Graph Neural Networks for Text Representation

Fardin Rastakhiz

Main category: cs.CL

TL;DR: This paper proposes an efficient deep learning architecture combining GNNs, CNNs, and real-time graph generation for text classification, avoiding the inefficiencies of Transformers in handling long texts.

Details

Motivation: Transformers have quadratic computational complexity for long texts, making them inefficient. This study aims to develop a more time-, cost-, and energy-efficient model for processing extended documents. Method: A novel architecture integrating Graph Neural Networks (GNNs) and Convolutional Neural Networks (CNNs), coupled with end-to-end real-time graph generation using character-level inputs and enhanced by LLM-based dictionary lookups. Result: The model captures local context with CNNs, expands receptive fields through lattice-based graphs, and aggregates document-level information via small-world graphs. The generated graphs show meaningful semantic structures with clustering coefficients of ~0.45 and shortest path lengths of 4–5. Conclusion: The proposed model combining GNNs and CNNs with real-time graph generation proves to be efficient and competitive for text classification tasks, demonstrating meaningful semantic organization and high performance without padding or truncation. Abstract: Time, cost, and energy efficiency are critical considerations in Deep-Learning (DL), particularly when processing long texts. Transformers, which represent the current state of the art, exhibit quadratic computational complexity relative to input length, making them inefficient for extended documents. This study introduces a novel model architecture that combines Graph Neural Networks (GNNs) and Convolutional Neural Networks (CNNs), integrated with a real-time, end-to-end graph generation mechanism. The model processes compact batches of character-level inputs without requiring padding or truncation. To enhance performance while maintaining high speed and efficiency, the model incorporates information from Large Language Models (LLMs), such as token embeddings and sentiment polarities, through efficient dictionary lookups. 
It captures local contextual patterns using CNNs, expands local receptive fields via lattice-based graph structures, and employs small-world graphs to aggregate document-level information. The generated graphs exhibit structural properties indicative of meaningful semantic organization, with an average clustering coefficient of approximately 0.45 and an average shortest path length ranging between 4 and 5. The model is evaluated across multiple text classification tasks, including sentiment analysis and news-categorization, and is compared against state-of-the-art models. Experimental results confirm the proposed model's efficiency and competitive performance.
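
Illustration (not from the paper): the two graph statistics quoted above, average clustering coefficient and average shortest path length, can be computed with a few lines of standard-library Python. This sketch only shows how the statistics are defined; the adjacency-dictionary format and the `ring_lattice` helper are hypothetical and do not reproduce the paper's graph generation mechanism.

```python
from collections import deque
from itertools import combinations


def average_clustering(adj):
    """Mean local clustering coefficient; nodes with degree < 2 contribute 0."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        # Count edges that exist between pairs of this node's neighbours.
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2 * links / (k * (k - 1))
    return total / len(adj)


def average_shortest_path(adj):
    """Mean BFS distance over ordered pairs of distinct, mutually reachable nodes."""
    total, pairs = 0, 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


def ring_lattice(n, k):
    """Hypothetical test graph: each node links to k neighbours on each side of a ring."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for step in range(1, k + 1):
            j = (i + step) % n
            adj[i].add(j)
            adj[j].add(i)
    return adj
```

Small-world graphs are usually characterized by relatively high clustering together with short average paths, which is the combination the reported values (about 0.45 and 4 to 5) point to.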

[8] MedReadCtrl: Personalizing medical text generation with readability-controlled instruction learning

Hieu Tran,Zonghai Yao,Won Seok Jang,Sharmin Sultana,Allen Chang,Yuan Zhang,Hong Yu

Main category: cs.CL

TL;DR: The paper introduces MedReadCtrl, a readability-controlled instruction tuning framework for LLMs in healthcare, which adjusts output complexity while preserving meaning, resulting in improved performance and expert preference especially at low literacy levels.

Details

Motivation: A critical challenge for deployment of generative AI in healthcare is effective human-AI communication where content must be both personalized and understandable. Method: MedReadCtrl, a readability-controlled instruction tuning framework is introduced, enabling LLMs to adjust output complexity without compromising meaning. Result: Evaluations show that MedReadCtrl achieves significantly lower readability instruction-following errors than GPT-4 and delivers substantial gains on unseen clinical tasks. Experts preferred MedReadCtrl especially at low literacy levels. Conclusion: MedReadCtrl offers a scalable solution to support patient education and expand equitable access to AI-enabled care. Abstract: Generative AI has demonstrated strong potential in healthcare, from clinical decision support to patient-facing chatbots that improve outcomes. A critical challenge for deployment is effective human-AI communication, where content must be both personalized and understandable. We introduce MedReadCtrl, a readability-controlled instruction tuning framework that enables LLMs to adjust output complexity without compromising meaning. Evaluations of nine datasets and three tasks across medical and general domains show that MedReadCtrl achieves significantly lower readability instruction-following errors than GPT-4 (e.g., 1.39 vs. 1.59 on ReadMe, p<0.001) and delivers substantial gains on unseen clinical tasks (e.g., +14.7 ROUGE-L, +6.18 SARI on MTSamples). Experts consistently preferred MedReadCtrl (71.7% vs. 23.3%), especially at low literacy levels. These gains reflect MedReadCtrl's ability to restructure clinical content into accessible, readability-aligned language while preserving medical intent, offering a scalable solution to support patient education and expand equitable access to AI-enabled care.

[9] SynthEHR-Eviction: Enhancing Eviction SDoH Detection with LLM-Augmented Synthetic EHR Data

Zonghai Yao,Youxia Zhao,Avijit Mitra,David A. Levy,Emily Druhl,Jack Tsai,Hong Yu

Main category: cs.CL

TL;DR: This paper introduces SynthEHR-Eviction, a novel pipeline that uses LLMs and automated techniques to efficiently extract eviction information from clinical notes, creating the largest eviction-related health dataset and outperforming existing models.

Details

Motivation: Eviction is a significant social determinant of health but is rarely coded in structured fields, limiting its use in downstream applications. There is a need for scalable methods to extract this information from unstructured clinical notes. Method: SynthEHR-Eviction combines LLMs, human-in-the-loop annotation, and automated prompt optimization (APO) to extract eviction statuses from unstructured EHR data. Result: The fine-tuned LLMs trained on SynthEHR-Eviction achieved higher Macro-F1 scores compared to GPT-4o-APO, GPT-4o-mini-APO, and BioBERT. The dataset created includes 14 fine-grained categories, marking it as the largest public eviction-related SDoH dataset to date. Conclusion: SynthEHR-Eviction is an effective and scalable pipeline for extracting eviction statuses from clinical notes, significantly reducing annotation effort while achieving high performance across various model sizes. Abstract: Eviction is a significant yet understudied social determinant of health (SDoH), linked to housing instability, unemployment, and mental health. While eviction appears in unstructured electronic health records (EHRs), it is rarely coded in structured fields, limiting downstream applications. We introduce SynthEHR-Eviction, a scalable pipeline combining LLMs, human-in-the-loop annotation, and automated prompt optimization (APO) to extract eviction statuses from clinical notes. Using this pipeline, we created the largest public eviction-related SDoH dataset to date, comprising 14 fine-grained categories. Fine-tuned LLMs (e.g., Qwen2.5, LLaMA3) trained on SynthEHR-Eviction achieved Macro-F1 scores of 88.8% (eviction) and 90.3% (other SDoH) on human-validated data, outperforming GPT-4o-APO (87.8%, 87.3%), GPT-4o-mini-APO (69.1%, 78.1%), and BioBERT (60.7%, 68.3%), while enabling cost-effective deployment across various model sizes.
The pipeline reduces annotation effort by over 80%, accelerates dataset creation, enables scalable eviction detection, and generalizes to other information extraction tasks.
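
Illustration (not from the paper): the Macro-F1 metric used for the comparison above averages per-class F1 scores with equal weight, so rare categories count as much as frequent ones. The sketch below is a plain-Python version of that standard definition; the label names in the usage note are hypothetical and are not the dataset's 14 categories.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); guard against an empty denominator.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

For example, with the made-up labels `y_true = ["evicted", "evicted", "none", "none"]` and `y_pred = ["evicted", "none", "none", "none"]`, the per-class F1 scores are 2/3 and 4/5, and the function returns their mean.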

[10] Towards Interpretable Time Series Foundation Models

Matthieu Boileau,Philippe Helluy,Jeremy Pawlus,Svitlana Vyetrenko

Main category: cs.CL

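TL;DR: This study explores the feasibility of distilling time series reasoning capabilities into small language models and demonstrates these models' potential for understanding and explaining time series data.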

Details

Motivation: Distilling time series reasoning into small, instruction-tuned language models is pursued as a step toward building interpretable time series foundation models. Method: Compact Qwen models are fine-tuned on a synthetic time series dataset with natural language annotations generated by a large multimodal model, and evaluation metrics are introduced to measure the quality of the distilled reasoning. Result: The post-trained models acquire meaningful interpretive capabilities and are suitable for on-device or privacy-sensitive deployment. Conclusion: The results show that compressing time series understanding into lightweight, language-capable models is feasible, laying a foundation for small, interpretable models that explain temporal patterns in natural language. Abstract: In this paper, we investigate the distillation of time series reasoning capabilities into small, instruction-tuned language models as a step toward building interpretable time series foundation models. Leveraging a synthetic dataset of mean-reverting time series with systematically varied trends and noise levels, we generate natural language annotations using a large multimodal model and use these to supervise the fine-tuning of compact Qwen models. We introduce evaluation metrics that assess the quality of the distilled reasoning - focusing on trend direction, noise intensity, and extremum localization - and show that the post-trained models acquire meaningful interpretive capabilities. Our results highlight the feasibility of compressing time series understanding into lightweight, language-capable models suitable for on-device or privacy-sensitive deployment. This work contributes a concrete foundation toward developing small, interpretable models that explain temporal patterns in natural language.

[11] SAND: Boosting LLM Agents with Self-Taught Action Deliberation

Yu Xia,Yiran Jenny Shen,Junda Wu,Tong Yu,Sungchul Kim,Ryan A. Rossi,Lina Yao,Julian McAuley

Main category: cs.CL

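TL;DR: This paper proposes the Self-taught ActioN Deliberation (SAND) framework, which improves the decision-making of large language model (LLM) agents by explicitly deliberating over candidate actions, and shows experimentally that it outperforms existing methods.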

Details

Motivation: Existing tuning methods for LLM agents often lack reasoning over and comparison of alternative actions, so agents may over-commit to seemingly plausible but suboptimal actions. Method: The Self-taught ActioN Deliberation (SAND) framework uses self-consistency action sampling and execution-guided action critique to synthesize step-wise action deliberation thoughts, which are then used iteratively to finetune the LLM agent itself. Result: On two representative interactive agent tasks, SAND achieves an average 20% improvement over initial supervised finetuning and outperforms state-of-the-art agent tuning methods. Conclusion: By explicitly deliberating over candidate actions, SAND improves the decision-making of LLM agents and outperforms existing tuning methods on two representative interactive tasks. Abstract: Large Language Model (LLM) agents are commonly tuned with supervised finetuning on ReAct-style expert trajectories or preference optimization over pairwise rollouts. Most of these methods focus on imitating specific expert behaviors or promoting chosen reasoning thoughts and actions over rejected ones. However, without reasoning and comparing over alternative actions, LLM agents finetuned with these methods may over-commit towards seemingly plausible but suboptimal actions due to limited action space exploration. To address this, in this paper we propose the Self-taught ActioN Deliberation (SAND) framework, enabling LLM agents to explicitly deliberate over candidate actions before committing to one. To tackle the challenges of when and what to deliberate given large action space and step-level action evaluation, we incorporate self-consistency action sampling and execution-guided action critique to help synthesize step-wise action deliberation thoughts using the base model of the LLM agent. In an iterative manner, the deliberation trajectories are then used to finetune the LLM agent itself. Evaluating on two representative interactive agent tasks, SAND achieves an average 20% improvement over initial supervised finetuning and also outperforms state-of-the-art agent tuning approaches.
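
Illustration (not from the paper): one plausible reading of "self-consistency action sampling" is to sample several actions at a step and deliberate only when the samples disagree. The sketch below encodes that reading as an assumption; the function name, the `agreement_threshold` parameter, and the return format are hypothetical, and the paper's actual trigger rule may differ.

```python
from collections import Counter


def needs_deliberation(sampled_actions, agreement_threshold=1.0):
    """Decide whether a step warrants deliberation over candidate actions.

    Assumption: deliberate when the most frequent sampled action accounts
    for less than `agreement_threshold` of all samples.
    Returns (deliberate, most_common_action, distinct_candidates).
    """
    if not sampled_actions:
        raise ValueError("sampled_actions must not be empty")
    counts = Counter(sampled_actions)
    top_action, top_count = counts.most_common(1)[0]
    agreement = top_count / len(sampled_actions)
    return agreement < agreement_threshold, top_action, sorted(counts)
```

When the samples all agree, the step can be executed directly; otherwise the distinct candidates are what a critique step would compare.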

[12] RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning

Hongzhi Zhang,Jia Fu,Jingyuan Zhang,Kai Fu,Qi Wang,Fuzheng Zhang,Guorui Zhou

Main category: cs.CL

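TL;DR: RLEP is a new reinforcement learning framework that improves training efficiency and model performance by caching and replaying high-quality experience.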

Details

Motivation: Reinforcement learning for large language models is unstable, and the policy tends to drift away from its pretrained weights, so a method is needed that improves training efficiency and avoids fruitless exploration. Method: RLEP uses a two-phase framework: it first collects verified trajectories, then replays these cached successes during subsequent training, optimizing at every update step on a blend of newly generated rollouts and replayed experience. Result: On Qwen2.5-Math-7B, RLEP reaches the baseline's peak accuracy with fewer updates and further improves accuracy on AIME-2024, AIME-2025, and AMC-2023. Conclusion: By replaying cached high-quality experience, RLEP improves the stability and efficiency of reinforcement learning training and outperforms the baseline. Abstract: Reinforcement learning (RL) for large language models is an energy-intensive endeavor: training can be unstable, and the policy may gradually drift away from its pretrained weights. We present RLEP -- Reinforcement Learning with Experience rePlay -- a two-phase framework that first collects verified trajectories and then replays them during subsequent training. At every update step, the policy is optimized on mini-batches that blend newly generated rollouts with these replayed successes. By replaying high-quality examples, RLEP steers the model away from fruitless exploration, focuses learning on promising reasoning paths, and delivers both faster convergence and stronger final performance. On the Qwen2.5-Math-7B base model, RLEP reaches baseline peak accuracy with substantially fewer updates and ultimately surpasses it, improving accuracy on AIME-2024 from 38.2% to 39.9%, on AIME-2025 from 19.8% to 22.3%, and on AMC-2023 from 77.0% to 82.2%. Our code, datasets, and checkpoints are publicly available at https://github.com/Kwai-Klear/RLEP to facilitate reproducibility and further research.
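
Illustration (not from the paper): the mini-batch blending step described in the abstract can be sketched as sampling part of each batch from fresh rollouts and the rest from a pool of verified successes. The function name, the `replay_fraction` parameter, and the tuple format are hypothetical; the abstract does not state the mixing ratio or the RL objective, and this is not the released RLEP code.

```python
import random


def build_mixed_batch(fresh_rollouts, replay_pool, replay_fraction, batch_size, rng):
    """Blend newly generated rollouts with replayed successful trajectories.

    Assumption: a fixed share of each batch (`replay_fraction`) comes from
    the replay pool, capped by the pool size; the rest is fresh rollouts.
    """
    n_replay = min(int(batch_size * replay_fraction), len(replay_pool))
    n_fresh = batch_size - n_replay
    if n_fresh > len(fresh_rollouts):
        raise ValueError("not enough fresh rollouts to fill the batch")
    batch = rng.sample(fresh_rollouts, n_fresh) + rng.sample(replay_pool, n_replay)
    rng.shuffle(batch)  # avoid a fixed fresh-then-replay ordering
    return batch
```

Passing an explicit `random.Random` instance keeps the sampling reproducible across runs, which matters when comparing against a baseline trained without replay.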

[13] Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models

Kaiqu Liang,Haimin Hu,Xuandong Zhao,Dawn Song,Thomas L. Griffiths,Jaime Fernández Fisac

Main category: cs.CL

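TL;DR: This paper introduces the concept of "machine bullshit", combining philosophical theory with empirical methods to analyze large language models' disregard for truth when generating content, and finds that methods such as RLHF and CoT prompting can exacerbate the problem.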

Details

Motivation: Philosopher Harry Frankfurt's concept of "bullshit" refers to statements made without regard to their truth value. Although prior work has examined hallucination and sycophancy in large language models (LLMs), this study proposes "machine bullshit" as a broader conceptual framework to help researchers characterize the emergent loss of truthfulness in LLMs and its underlying mechanisms. Method: The Bullshit Index is introduced as a new metric of an LLM's indifference to truth, together with a complementary taxonomy of four qualitative forms of bullshit: empty rhetoric, paltering, weasel words, and unverified claims. Empirical evaluation uses the Marketplace dataset, the Political Neutrality dataset, and the newly built BullshitEval benchmark. Result: Fine-tuning with reinforcement learning from human feedback (RLHF) significantly exacerbates bullshit, and inference-time chain-of-thought (CoT) prompting notably amplifies specific forms, particularly empty rhetoric and paltering. In political contexts, weasel words are the most common strategy. Conclusion: RLHF fine-tuning exacerbates bullshit and inference-time CoT prompting amplifies specific forms of it, while weasel words dominate in political contexts. These findings highlight systematic challenges and offer new insights toward more truthful LLM behavior. Abstract: Bullshit, as conceptualized by philosopher Harry Frankfurt, refers to statements made without regard to their truth value. While previous work has explored large language model (LLM) hallucination and sycophancy, we propose machine bullshit as an overarching conceptual framework that can allow researchers to characterize the broader phenomenon of emergent loss of truthfulness in LLMs and shed light on its underlying mechanisms. We introduce the Bullshit Index, a novel metric quantifying LLMs' indifference to truth, and propose a complementary taxonomy analyzing four qualitative forms of bullshit: empty rhetoric, paltering, weasel words, and unverified claims. We conduct empirical evaluations on the Marketplace dataset, the Political Neutrality dataset, and our new BullshitEval benchmark (2,400 scenarios spanning 100 AI assistants) explicitly designed to evaluate machine bullshit. Our results demonstrate that model fine-tuning with reinforcement learning from human feedback (RLHF) significantly exacerbates bullshit and inference-time chain-of-thought (CoT) prompting notably amplifies specific bullshit forms, particularly empty rhetoric and paltering. We also observe prevalent machine bullshit in political contexts, with weasel words as the dominant strategy. Our findings highlight systematic challenges in AI alignment and provide new insights toward more truthful LLM behavior.

[14] PLAN-TUNING: Post-Training Language Models to Learn Step-by-Step Planning for Complex Problem Solving

Mihir Parmar,Palash Goyal,Xin Liu,Yiwen Song,Mingyang Ling,Chitta Baral,Hamid Palangi,Tomas Pfister

Main category: cs.CL

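TL;DR: This paper proposes PLAN-TUNING, a post-training framework that improves the complex reasoning of smaller models by distilling synthetic task decompositions from large-scale LLMs.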

Details

Motivation: Using planning structures during post-training to improve smaller open-source LLMs remains underexplored. Method: A unified post-training framework distills synthetic task decompositions from large-scale LLMs and fine-tunes smaller models with supervised and reinforcement-learning objectives. Result: On the GSM8k and MATH benchmarks, plan-tuned models outperform strong baselines by about 7% on average. Conclusion: PLAN-TUNING is an effective strategy for improving the task-specific performance of smaller LLMs. Abstract: Recently, decomposing complex problems into simple subtasks--a crucial part of human-like natural planning--to solve the given problem has significantly boosted the performance of large language models (LLMs). However, leveraging such planning structures during post-training to boost the performance of smaller open-source LLMs remains underexplored. Motivated by this, we introduce PLAN-TUNING, a unified post-training framework that (i) distills synthetic task decompositions (termed "planning trajectories") from large-scale LLMs and (ii) fine-tunes smaller models via supervised and reinforcement-learning objectives designed to mimic these planning processes to improve complex reasoning. On GSM8k and the MATH benchmarks, plan-tuned models outperform strong baselines by an average of ~7%. Furthermore, plan-tuned models show better generalization capabilities on out-of-domain datasets, with average ~10% and ~12% performance improvements on OlympiadBench and AIME 2024, respectively. Our detailed analysis demonstrates how planning trajectories improve complex reasoning capabilities, showing that PLAN-TUNING is an effective strategy for improving task-specific performance of smaller LLMs.

[15] Teaching LLM to Reason: Reinforcement Learning from Algorithmic Problems without Code

Keqin Bao,Nuo Chen,Xiaoyuan Li,Binyuan Hui,Bowen Yu,Fuli Feng,Junyang Lin,Xiangnan He,Dayiheng Liu

Main category: cs.CL

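TL;DR: TeaR improves the reasoning ability of LLMs through data curation and reinforcement learning, showing significant performance gains across multiple benchmarks.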

Details

Motivation: Existing models overfit to complex algorithmic patterns when simulating code execution, so a new method is needed to strengthen the core reasoning ability of LLMs. Method: TeaR combines careful data curation with reinforcement learning to guide models toward optimal reasoning paths in code-related tasks. Result: TeaR performs strongly across multiple benchmarks, notably achieving a 35.9% improvement on Qwen2.5-7B and 5.9% on R1-Distilled-7B. Conclusion: TeaR effectively improves the reasoning ability of LLMs through data curation and reinforcement learning, with significant gains across a range of benchmarks. Abstract: Enhancing reasoning capabilities remains a central focus in the LLM research community. A promising direction involves requiring models to simulate code execution step-by-step to derive outputs for given inputs. However, as code is often designed for large-scale systems, direct application leads to over-reliance on complex data structures and algorithms, even for simple cases, resulting in overfitting to algorithmic patterns rather than core reasoning structures. To address this, we propose TeaR, which aims at teaching LLMs to reason better. TeaR leverages careful data curation and reinforcement learning to guide models in discovering optimal reasoning paths through code-related tasks, thereby improving general reasoning abilities. We conduct extensive experiments using two base models and three long-CoT distillation models, with model sizes ranging from 1.5 billion to 32 billion parameters, and across 17 benchmarks spanning Math, Knowledge, Code, and Logical Reasoning. The results consistently show significant performance improvements. Notably, TeaR achieves a 35.9% improvement on Qwen2.5-7B and 5.9% on R1-Distilled-7B.

[16] Extracting ORR Catalyst Information for Fuel Cell from Scientific Literature

Hein Htet,Amgad Ahmed Ali Ibrahim,Yutaka Sasaki,Ryoji Asahi

Main category: cs.CL

TL;DR: This paper proposes an information extraction approach using DyGIE++ and domain-specific BERT variants to effectively extract ORR catalyst-related data from scientific literature, showing that these models outperform general scientific ones.

Details

Motivation: Extracting structured information about ORR catalysts from vast scientific literature is challenging due to the complexity and diversity of textual data, making it essential to develop efficient automated methods. Method: The researchers used a named entity recognition (NER) and relation extraction (RE) approach with DyGIE++ and multiple pre-trained BERT variants. They manually constructed a comprehensive dataset identifying 12 critical entities and two relationship types, followed by data annotation, integration, and fine-tuning of transformer-based models. Result: The fine-tuned PubMedBERT model achieved the highest NER F1-score of 82.19%, while the MatSciBERT model attained the best RE F1-score of 66.10%. The comparison with human annotators highlighted the reliability of these models. Conclusion: The study concludes that domain-specific BERT models, such as PubMedBERT and MatSciBERT, outperform general scientific models in extracting ORR catalyst-related information, demonstrating their potential for scalable and automated literature analysis. Abstract: The oxygen reduction reaction (ORR) catalyst plays a critical role in enhancing fuel cell efficiency, making it a key focus in material science research. However, extracting structured information about ORR catalysts from vast scientific literature remains a significant challenge due to the complexity and diversity of textual data. In this study, we propose a named entity recognition (NER) and relation extraction (RE) approach using DyGIE++ with multiple pre-trained BERT variants, including MatSciBERT and PubMedBERT, to extract ORR catalyst-related information from the scientific literature, which is compiled into a fuel cell corpus for materials informatics (FC-CoMIcs). A comprehensive dataset was constructed manually by identifying 12 critical entities and two relationship types between pairs of the entities. 
Our methodology involves data annotation, integration, and fine-tuning of transformer-based models to enhance information extraction accuracy. We assess the impact of different BERT variants on extraction performance and investigate the effects of annotation consistency. Experimental evaluations demonstrate that the fine-tuned PubMedBERT model achieves the highest NER F1-score of 82.19% and the MatSciBERT model attains the best RE F1-score of 66.10%. Furthermore, the comparison with human annotators highlights the reliability of fine-tuned models for ORR catalyst extraction, demonstrating their potential for scalable and automated literature analysis. The results indicate that domain-specific BERT models outperform general scientific models like BlueBERT for ORR catalyst extraction.

[17] Hallucination Stations: On Some Basic Limitations of Transformer-Based Language Models

Varin Sikka,Vishal Sikka

Main category: cs.CL

TL;DR: This paper examines the computational limits of large language models (LLMs), showing they cannot handle or verify complex tasks, impacting their reliability in advanced autonomous applications.

Details

Motivation: With the increasing use of transformer-based language models (LLMs) in AI, there is a need to understand their limitations, especially regarding hallucinations and their use in autonomous or semi-autonomous agents performing real-world tasks. Method: The paper analyzes the capabilities and limitations of LLMs from the perspective of computational complexity theory. Result: The study finds that LLMs cannot perform or verify computational and agentic tasks beyond a certain level of complexity, providing examples to illustrate these limitations. Conclusion: LLMs have inherent limitations in handling tasks beyond a certain complexity and verifying the accuracy of such tasks, which has implications for their application in agentic roles. Abstract: With the widespread adoption of transformer-based language models in AI, there is significant interest in the limits of LLM capabilities, specifically so-called hallucinations, occurrences in which LLMs provide spurious, factually incorrect or nonsensical information when prompted on certain subjects. Furthermore, there is growing interest in agentic uses of LLMs - that is, using LLMs to create agents that act autonomously or semi-autonomously to carry out various tasks, including tasks with applications in the real world. This makes it important to understand the types of tasks LLMs can and cannot perform. We explore this topic from the perspective of the computational complexity of LLM inference. We show that LLMs are incapable of carrying out computational and agentic tasks beyond a certain complexity, and further that LLMs are incapable of verifying the accuracy of tasks beyond a certain complexity. We present examples of both, then discuss some consequences of this work.

[18] Toward Real-World Chinese Psychological Support Dialogues: CPsDD Dataset and a Co-Evolving Multi-Agent System

Yuanchen Shi,Longyin Zhang,Fang Kong

Main category: cs.CL

TL;DR: This paper proposes a framework called CADSS and a new dataset, CPsDD, to improve psychological support through AI-generated dialogues, showing state-of-the-art results in empathetic response generation.

Details

Motivation: The scarcity of non-English psychological support datasets has limited the availability of psychological assistance through AI systems. This work aims to bridge that gap by generating a comprehensive dataset and an effective dialogue support system tailored for psychological counseling. Method: A framework named CADSS is introduced, which includes a Profiler, Summarizer, Planner, and Supporter to analyze user characteristics, condense dialogue history, select strategies, and generate empathetic responses. Two large language models, Dialog Generator and Dialog Modifier, are fine-tuned using predefined paths and real-world data to create the Chinese Psychological support Dialogue Dataset (CPsDD). Result: The proposed approach resulted in the creation of the Chinese Psychological support Dialogue Dataset (CPsDD) with 68K dialogues across multiple categories. The CADSS framework demonstrated superior performance on Strategy Prediction and Emotional Support Conversation tasks on both CPsDD and ESConv datasets. Conclusion: The CADSS framework achieves state-of-the-art performance in psychological support dialogue generation on both the CPsDD and ESConv datasets, demonstrating its effectiveness in leveraging limited real-world data and expert knowledge for empathetic response generation. Abstract: The growing need for psychological support due to increasing pressures has exposed the scarcity of relevant datasets, particularly in non-English languages. To address this, we propose a framework that leverages limited real-world data and expert knowledge to fine-tune two large language models: Dialog Generator and Dialog Modifier. The Generator creates large-scale psychological counseling dialogues based on predefined paths, which guide system response strategies and user interactions, forming the basis for effective support. The Modifier refines these dialogues to align with real-world data quality. 
Through both automated and manual review, we construct the Chinese Psychological support Dialogue Dataset (CPsDD), containing 68K dialogues across 13 groups, 16 psychological problems, 13 causes, and 12 support focuses. Additionally, we introduce the Comprehensive Agent Dialogue Support System (CADSS), where a Profiler analyzes user characteristics, a Summarizer condenses dialogue history, a Planner selects strategies, and a Supporter generates empathetic responses. The experimental results of the Strategy Prediction and Emotional Support Conversation (ESC) tasks demonstrate that CADSS achieves state-of-the-art performance on both CPsDD and ESConv datasets.

[19] Triadic Multi-party Voice Activity Projection for Turn-taking in Spoken Dialogue Systems

Mikey Elmers,Koji Inoue,Divesh Lala,Tatsuya Kawahara

Main category: cs.CL

TL;DR: This paper introduces the application of Voice Activity Projection (VAP) in triadic spoken dialogues, showing its effectiveness in predicting turn-taking and potential use in dialogue systems.

Details

Motivation: Turn-taking is a key aspect of spoken dialogue, yet most studies focus on dyadic settings; this work extends the analysis to triadic multi-party interactions. Method: The researchers trained multiple VAP models on a Japanese triadic conversation dataset to predict future voice activity using only acoustic data. Result: VAP models trained on triadic conversations outperformed baselines, although the type of conversation influenced prediction accuracy. Conclusion: This study demonstrates that VAP can effectively predict turn-taking in triadic dialogue scenarios, providing a foundation for its incorporation into spoken dialogue systems. Abstract: Turn-taking is a fundamental component of spoken dialogue; however, conventional studies mostly involve dyadic settings. This work focuses on applying voice activity projection (VAP) to predict upcoming turn-taking in triadic multi-party scenarios. The goal of VAP models is to predict the future voice activity for each speaker utilizing only acoustic data. This is the first study to extend VAP into triadic conversation. We trained multiple models on a Japanese triadic dataset where participants discussed a variety of topics. We found that the VAP trained on triadic conversation outperformed the baseline for all models but that the type of conversation affected the accuracy. This study establishes that VAP can be used for turn-taking in triadic dialogue scenarios. Future work will incorporate this triadic VAP turn-taking model into spoken dialogue systems.

[20] CEA-LIST at CheckThat! 2025: Evaluating LLMs as Detectors of Bias and Opinion in Text

Akram Elbouanani,Evan Dufraisse,Aboubacar Tuo,Adrian Popescu

Main category: cs.CL

TL;DR: This paper shows that well-designed few-shot prompts for LLMs can outperform traditional fine-tuned models in multilingual subjectivity detection, especially under challenging data conditions.

Details

Motivation: To explore whether LLMs can match or outperform fine-tuned smaller language models (SLMs), particularly in noisy or low-quality data settings. Method: Competitive approach using large language models (LLMs) with few-shot prompting was applied in the CheckThat! 2025 evaluation campaign's subjectivity detection task. Advanced prompt engineering techniques were experimented with, including debating LLMs and example selection strategies. Result: The system achieved top rankings across multiple languages, including first place in Arabic and Polish, and top-four finishes in Italian, English, German, and multilingual tracks. It demonstrated robustness on the Arabic dataset, likely due to resilience to annotation inconsistencies. Conclusion: LLM-based few-shot learning proves effective and adaptable for multilingual sentiment tasks, offering a strong alternative to traditional fine-tuning when labeled data is scarce or inconsistent. Abstract: This paper presents a competitive approach to multilingual subjectivity detection using large language models (LLMs) with few-shot prompting. We participated in Task 1: Subjectivity of the CheckThat! 2025 evaluation campaign. We show that LLMs, when paired with carefully designed prompts, can match or outperform fine-tuned smaller language models (SLMs), particularly in noisy or low-quality data settings. Despite experimenting with advanced prompt engineering techniques, such as debating LLMs and various example selection strategies, we found limited benefit beyond well-crafted standard few-shot prompts. Our system achieved top rankings across multiple languages in the CheckThat! 2025 subjectivity detection task, including first place in Arabic and Polish, and top-four finishes in Italian, English, German, and multilingual tracks. Notably, our method proved especially robust on the Arabic dataset, likely due to its resilience to annotation inconsistencies. 
These findings highlight the effectiveness and adaptability of LLM-based few-shot learning for multilingual sentiment tasks, offering a strong alternative to traditional fine-tuning, particularly when labeled data is scarce or inconsistent.

[21] The Cross-Lingual Cost: Retrieval Biases in RAG over Arabic-English Corpora

Chen Amiraz,Yaroslav Fyodorov,Elad Haramaty,Zohar Karnin,Liane Lewin-Eytan

Main category: cs.CL

TL;DR: This paper investigates cross-lingual retrieval-augmented generation (RAG) in a domain-specific setting using Arabic-English benchmarks derived from corporate data. It identifies retrieval as a key bottleneck and proposes a strategy that improves cross-lingual performance by balancing retrieval across languages.

Details

Motivation: Prior work on cross-lingual RAG has largely focused on generation and relied on open-domain benchmarks like Wikipedia, which may obscure retrieval challenges due to language imbalances and reliance on pretraining data. This study aims to uncover these challenges in a more realistic, domain-specific setting. Method: The authors used benchmarks derived from real-world corporate datasets to systematically study multilingual retrieval behavior, including all combinations of languages for user queries and supporting documents. They proposed a retrieval strategy enforcing equal retrieval from both languages. Result: The research revealed significant performance drops in cross-lingual RAG when the query and document languages differ, primarily due to the retriever's inability to effectively rank documents across languages. The proposed retrieval strategy led to substantial improvements in performance. Conclusion: The study concludes that retrieval is a critical bottleneck in cross-lingual domain-specific RAG scenarios, and a simple retrieval strategy can address this issue, leading to improved cross-lingual and overall performance. Abstract: Cross-lingual retrieval-augmented generation (RAG) is a critical capability for retrieving and generating answers across languages. Prior work in this context has mostly focused on generation and relied on benchmarks derived from open-domain sources, most notably Wikipedia. In such settings, retrieval challenges often remain hidden due to language imbalances, overlap with pretraining data, and memorized content. To address this gap, we study Arabic-English RAG in a domain-specific setting using benchmarks derived from real-world corporate datasets. Our benchmarks include all combinations of languages for the user query and the supporting document, drawn independently and uniformly at random. This enables a systematic study of multilingual retrieval behavior. 
Our findings reveal that retrieval is a critical bottleneck in cross-lingual domain-specific scenarios, with significant performance drops occurring when the user query and supporting document languages differ. A key insight is that these failures stem primarily from the retriever's difficulty in ranking documents across languages. Finally, we propose a simple retrieval strategy that addresses this source of failure by enforcing equal retrieval from both languages, resulting in substantial improvements in cross-lingual and overall performance. These results highlight meaningful opportunities for improving multilingual retrieval, particularly in practical, real-world RAG applications.
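The "equal retrieval from both languages" strategy in entry [21] can be pictured with a small sketch. This is an illustrative reading of the abstract, not the authors' implementation: the function name `balanced_top_k`, the `(doc_id, language, score)` tuple layout, and the `"ar"`/`"en"` language tags are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple

# Hypothetical record layout: (doc_id, language_tag, retriever_score).
ScoredDoc = Tuple[str, str, float]


def balanced_top_k(
    scored_docs: Sequence[ScoredDoc],
    k: int,
    languages: Sequence[str] = ("ar", "en"),
) -> List[ScoredDoc]:
    """Return up to k documents with an equal quota per language.

    A plain global top-k can be dominated by one language when the
    retriever ranks same-language documents higher. Here each language
    receives k // len(languages) slots (any remainder is left unused).
    """
    quota = k // len(languages)
    picked: List[ScoredDoc] = []
    for lang in languages:
        # Rank documents within a single language only.
        in_lang = sorted(
            (d for d in scored_docs if d[1] == lang),
            key=lambda d: d[2],
            reverse=True,
        )
        picked.extend(in_lang[:quota])
    # Final ordering by score is for presentation only; scores from
    # different languages may not be directly comparable.
    return sorted(picked, key=lambda d: d[2], reverse=True)
```

With six candidates where every English document outscores every Arabic one, `balanced_top_k(docs, k=4)` still returns two documents per language, whereas a global top-4 would contain at most one Arabic document.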

[22] The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs

Jierun Chen,Tiezheng Yu,Haoli Bai,Lewei Yao,Jiannan Wu,Kaican Li,Fei Mi,Chaofan Tao,Lei Zhu,Manyi Zhang,Xiaohui Li,Lu Hou,Lifeng Shang,Qun Liu

Main category: cs.CL

TL;DR: This paper explores the effects of combining long-CoT SFT and RL in VLMs, finding that while each technique has unique strengths, their integration results in trade-offs rather than synergistic improvements.

Details

Motivation: The motivation is to understand how post-training techniques like long-CoT SFT and RL can be effectively combined to improve reasoning capabilities in vision-language models (VLMs), inspired by their synergy in language-only models. Method: The authors conducted a systematic investigation into the roles and interplay of long-CoT supervised fine-tuning (SFT) and reinforcement learning (RL) across multiple multimodal reasoning benchmarks. Result: Long-CoT SFT enhances performance on complex questions through structured reasoning but degrades performance on simpler ones, whereas RL improves generalization and brevity consistently across all difficulty levels. However, combining these techniques leads to trade-offs without additive benefits. Conclusion: The study concludes that while long-CoT SFT and RL individually enhance different aspects of reasoning in VLMs, their combination does not yield additive benefits, indicating a need for more adaptive integration methods. Abstract: Large vision-language models (VLMs) increasingly adopt post-training techniques such as long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL) to elicit sophisticated reasoning. While these methods exhibit synergy in language-only models, their joint effectiveness in VLMs remains uncertain. We present a systematic investigation into the distinct roles and interplay of long-CoT SFT and RL across multiple multimodal reasoning benchmarks. We find that SFT improves performance on difficult questions by in-depth, structured reasoning, but introduces verbosity and degrades performance on simpler ones. In contrast, RL promotes generalization and brevity, yielding consistent improvements across all difficulty levels, though the improvements on the hardest questions are less prominent compared to SFT. 
Surprisingly, combining them through two-staged, interleaved, or progressive training strategies, as well as data mixing and model merging, all fails to produce additive benefits, instead leading to trade-offs in accuracy, reasoning style, and response length. This ``synergy dilemma'' highlights the need for more seamless and adaptive approaches to unlock the full potential of combined post-training techniques for reasoning VLMs.

[23] Single-to-mix Modality Alignment with Multimodal Large Language Model for Document Image Machine Translation

Yupu Liang,Yaping Zhang,Zhiyang Zhang,Yang Zhao,Lu Xiang,Chengqing Zong,Yu Zhou

Main category: cs.CL

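TL;DR: This paper proposes a new method called M4Doc to address the generalization problem in document image machine translation and demonstrates significant improvements in translation quality and cross-domain generalization.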

Details

Motivation: Document image machine translation faces generalization challenges caused by limited training data and the complex interplay between visual and textual information. Method: M4Doc aligns an image-only encoder with the multimodal representations of an MLLM pre-trained on large-scale document image datasets, enabling a lightweight DIMT model to learn crucial visual-textual correlations during training. Result: Comprehensive experiments show substantial improvements in translation quality, especially in cross-domain generalization and challenging document image scenarios. Conclusion: M4Doc is a new single-to-mix modality alignment framework that addresses the generalization challenges of document image machine translation by leveraging multimodal large language models (MLLMs). Abstract: Document Image Machine Translation (DIMT) aims to translate text within document images, facing generalization challenges due to limited training data and the complex interplay between visual and textual information. To address these challenges, we introduce M4Doc, a novel single-to-mix modality alignment framework leveraging Multimodal Large Language Models (MLLMs). M4Doc aligns an image-only encoder with the multimodal representations of an MLLM, pre-trained on large-scale document image datasets. This alignment enables a lightweight DIMT model to learn crucial visual-textual correlations during training. During inference, M4Doc bypasses the MLLM, maintaining computational efficiency while benefiting from its multimodal knowledge. Comprehensive experiments demonstrate substantial improvements in translation quality, especially in cross-domain generalization and challenging document image scenarios.

[24] Bayesian Discrete Diffusion Beats Autoregressive Perplexity

Cooper Doyle

Main category: cs.CL

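TL;DR: The study finds that discrete-diffusion language models have a hidden Bayesian core and proposes an efficient inference-time ensemble that improves performance at no extra training cost.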

Details

Motivation: To reveal the hidden Bayesian core of discrete-diffusion language models and to provide a simple proof of consistency together with finite-sample error bounds. Method: The paper introduces a lightweight inference-time ensemble that averages K mask-and-denoise passes to obtain posterior-aware token probabilities and uncertainty estimates. Result: Monte Carlo marginalization over K independent corruptions converges to the exact posterior at rate O(1/sqrt(K)), giving more accurate posterior estimates. Conclusion: On WikiText-2, the method achieves a test perplexity of 8.8 with K=8, better than the 20.3 of GPT-2 Small, at no extra training cost. Abstract: We reveal a hidden Bayesian core of discrete-diffusion language models by showing that the expected denoiser output under the forward masking distribution recovers the exact posterior over clean tokens. Under minimal assumptions, Monte Carlo marginalization over K independent corruptions converges to this posterior at rate O(1/sqrt(K)), yielding a simple proof of consistency and finite-sample error bounds. Building on this insight, we introduce a lightweight inference-time ensemble that averages K mask-and-denoise passes to obtain posterior-aware token probabilities and uncertainty estimates at no extra training cost. On WikiText-2, our method achieves test perplexity 8.8 with K=8, versus 20.3 for GPT-2 Small, despite using a model of comparable size. Code is available at https://github.com/mercury0100/bayesradd.
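The inference-time ensemble in entry [24] averages K mask-and-denoise passes. A minimal sketch of that averaging loop follows; it is not the code from the linked repository. The names `mc_posterior`, `toy_denoiser`, `MASK`, `VOCAB_SIZE`, and the `mask_rate` parameter are assumptions for illustration, and the toy denoiser stands in for a trained model.

```python
import random
from typing import Callable, List, Sequence

MASK = -1        # hypothetical id for the mask token
VOCAB_SIZE = 3   # toy vocabulary, for illustration only

Denoiser = Callable[[Sequence[int]], List[List[float]]]


def toy_denoiser(corrupted: Sequence[int]) -> List[List[float]]:
    """Stand-in for a trained denoiser: uniform over the vocabulary at
    masked positions, one-hot on the visible token elsewhere."""
    return [
        [1.0 / VOCAB_SIZE] * VOCAB_SIZE
        if tok == MASK
        else [1.0 if v == tok else 0.0 for v in range(VOCAB_SIZE)]
        for tok in corrupted
    ]


def mc_posterior(
    tokens: Sequence[int],
    denoiser: Denoiser,
    K: int,
    mask_rate: float = 0.5,
    seed: int = 0,
) -> List[List[float]]:
    """Average denoiser outputs over K independently masked copies."""
    if K < 1:
        raise ValueError("K must be at least 1")
    rng = random.Random(seed)
    sums = [[0.0] * VOCAB_SIZE for _ in tokens]
    for _ in range(K):
        # Independent corruption: mask each position with prob. mask_rate.
        corrupted = [MASK if rng.random() < mask_rate else t for t in tokens]
        for i, dist in enumerate(denoiser(corrupted)):
            for v, p in enumerate(dist):
                sums[i][v] += p
    # Monte Carlo estimate: the mean of the K per-pass distributions.
    return [[s / K for s in row] for row in sums]
```

Calling `mc_posterior([0, 1, 2, 1], toy_denoiser, K=8)` returns one averaged distribution per position. Because each pass is an independent draw, the error of the average shrinks at the usual Monte Carlo rate, which matches the O(1/sqrt(K)) convergence stated in the abstract.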

[25] Exploring the Limits of Model Compression in LLMs: A Knowledge Distillation Study on QA Tasks

Joyeeta Datta,Niclas Doll,Qusai Ramadan,Zeyd Boukhers

Main category: cs.CL

TL;DR: This paper shows that LLMs can be significantly compressed via Knowledge Distillation while preserving most of their QA performance, especially with one-shot prompting.

Details

Motivation: Large Language Models have high computational demands, which hinders their use in resource-limited settings. This work explores how much LLMs can be compressed using Knowledge Distillation while maintaining performance on Question Answering tasks. Method: The study evaluated student models distilled from Pythia and Qwen2.5 on SQuAD and MLQA benchmarks under zero-shot and one-shot prompting. Result: Student models retained over 90% of teacher models' performance while reducing parameter counts by up to 57.1%. One-shot prompting also provided additional gains compared to zero-shot setups. Conclusion: Knowledge Distillation combined with minimal prompting can produce compact and capable QA systems suitable for resource-constrained environments. Abstract: Large Language Models (LLMs) have demonstrated outstanding performance across a range of NLP tasks; however, their computational demands hinder their deployment in real-world, resource-constrained environments. This work investigates the extent to which LLMs can be compressed using Knowledge Distillation (KD) while maintaining strong performance on Question Answering (QA) tasks. We evaluate student models distilled from the Pythia and Qwen2.5 families on two QA benchmarks, SQuAD and MLQA, under zero-shot and one-shot prompting conditions. Results show that student models retain over 90% of their teacher models' performance while reducing parameter counts by up to 57.1%. Furthermore, one-shot prompting yields additional performance gains over zero-shot setups for both model families. These findings underscore the trade-off between model efficiency and task performance, demonstrating that KD, combined with minimal prompting, can yield compact yet capable QA systems suitable for resource-constrained applications.

[26] FrugalRAG: Learning to retrieve and reason for multi-hop QA

Abhinav Java,Srivathsan Koundinyan,Nagarajan Natarajan,Amit Sharma

Main category: cs.CL

TL;DR: This paper shows that better prompting and fine-tuning can improve the efficiency of retrieval-augmented generation (RAG) without requiring large-scale training.

Details

Motivation: Efficiency in retrieval searches is an overlooked metric in solving complex question answering tasks, despite its importance alongside accuracy and recall. Method: The authors evaluate a standard ReAct pipeline with improved prompts and test supervised and RL-based fine-tuning techniques on RAG benchmarks like HotPotQA. Result: Improved prompting outperforms state-of-the-art methods, and fine-tuning techniques reduce retrieval search costs by nearly half while maintaining performance. Conclusion: Large-scale fine-tuning is not essential for improving RAG metrics, while supervised and RL-based fine-tuning can enhance efficiency in retrieval searches. Abstract: We consider the problem of answering complex questions, given access to a large unstructured document corpus. The de facto approach to solving the problem is to leverage language models that (iteratively) retrieve and reason through the retrieved documents, until the model has sufficient information to generate an answer. Attempts at improving this approach focus on retrieval-augmented generation (RAG) metrics such as accuracy and recall and can be categorized into two types: (a) fine-tuning on large question answering (QA) datasets augmented with chain-of-thought traces, and (b) leveraging RL-based fine-tuning techniques that rely on question-document relevance signals. However, efficiency in the number of retrieval searches is an equally important metric, which has received less attention. In this work, we show that: (1) Large-scale fine-tuning is not needed to improve RAG metrics, contrary to popular claims in recent literature. Specifically, a standard ReAct pipeline with improved prompts can outperform state-of-the-art methods on benchmarks such as HotPotQA. (2) Supervised and RL-based fine-tuning can help RAG from the perspective of frugality, i.e., the latency due to number of searches at inference time. 
For example, we show that we can achieve competitive RAG metrics at nearly half the cost (in terms of number of searches) on popular RAG benchmarks, using the same base model, and at a small training cost (1000 examples).

[27] Lost in Pronunciation: Detecting Chinese Offensive Language Disguised by Phonetic Cloaking Replacement

Haotan Guo,Jianfei He,Jiayuan Ma,Hongbin Na,Zimu Wang,Haiyang Zhang,Qi Chen,Wei Wang,Zijing Shi,Tao Shen,Ling Chen

Main category: cs.CL

TL;DR: This paper introduces a comprehensive taxonomy of Chinese Phonetic Cloaking Replacement (PCR), highlights weaknesses in current toxicity detectors using a new dataset of real-world examples, and proposes an improved Pinyin-based detection method.

Details

Motivation: Phonetic Cloaking Replacement (PCR) poses a significant challenge to Chinese content moderation. Existing evaluations rely heavily on rule-based, synthetic perturbations that overlook real user creativity. This study aims to address this gap. Method: The authors organized PCR into a four-way surface-form taxonomy, compiled a dataset of 500 naturally occurring phonetically cloaked offensive posts, and benchmarked state-of-the-art LLMs. They also revisited a Pinyin-based prompting strategy for mitigation. Result: Benchmarking revealed serious weaknesses in state-of-the-art LLMs, with the best model achieving an F1-score of only 0.672, and zero-shot chain-of-thought prompting further lowering performance. Conclusion: This study provides the first comprehensive taxonomy of Chinese PCR, a realistic benchmark revealing current detectors' limits, and a lightweight mitigation technique advancing research on robust toxicity detection. Abstract: Phonetic Cloaking Replacement (PCR), defined as the deliberate use of homophonic or near-homophonic variants to hide toxic intent, has become a major obstacle to Chinese content moderation. While this problem is well-recognized, existing evaluations predominantly rely on rule-based, synthetic perturbations that ignore the creativity of real users. We organize PCR into a four-way surface-form taxonomy and compile \ours, a dataset of 500 naturally occurring, phonetically cloaked offensive posts gathered from the RedNote platform. Benchmarking state-of-the-art LLMs on this dataset exposes a serious weakness: the best model reaches only an F1-score of 0.672, and zero-shot chain-of-thought prompting pushes performance even lower. Guided by error analysis, we revisit a Pinyin-based prompting strategy that earlier studies judged ineffective and show that it recovers much of the lost accuracy. 
This study offers the first comprehensive taxonomy of Chinese PCR, a realistic benchmark that reveals current detectors' limits, and a lightweight mitigation technique that advances research on robust toxicity detection.

[28] An Automated Length-Aware Quality Metric for Summarization

Andrew D. Foland

Main category: cs.CL

TL;DR: This paper introduces NOIR, an automated metric for evaluating summarization quality based on semantic retention and summary length compression, which correlates with human perception and can be applied to various summarization tasks.

Details

Motivation: To develop an automated alternative for assessing summarization quality without relying on time-consuming human-generated reference summaries, focusing on the recall-compression tradeoff. Method: The paper introduces NOIR, which uses a language model-embedding to measure semantic similarity and evaluates summarization quality based on semantic meaning retention and summary length compression. Result: Experiments show that NOIR effectively captures the tradeoff between token-length and semantic retention and correlates with human perception of summarization quality. Conclusion: NOIR serves as an automated tool for evaluating and enhancing summarization algorithms, prompts, and synthetically-generated summaries by effectively capturing the token-length/semantic retention tradeoff. Abstract: This paper proposes NOrmed Index of Retention (NOIR), a quantitative objective metric for evaluating summarization quality of arbitrary texts that relies on both the retention of semantic meaning and the summary length compression. This gives a measure of how well the recall-compression tradeoff is managed, the most important skill in summarization. Experiments demonstrate that NOIR effectively captures the token-length / semantic retention tradeoff of a summarizer and correlates with human perception of summarization quality. Using a language model-embedding to measure semantic similarity, it provides an automated alternative for assessing summarization quality without relying on time-consuming human-generated reference summaries. The proposed metric can be applied to various summarization tasks, offering an automated tool for evaluating and improving summarization algorithms, summarization prompts, and synthetically-generated summaries.

[29] SAS: Simulated Attention Score

Chuanyang Zheng,Jiankai Sun,Yihang Gao,Yuehao Wang,Peihao Wang,Jing Xiong,Liliang Ren,Hao Cheng,Janardhan Kulkarni,Yelong Shen,Atlas Wang,Mac Schwager,Anderson Schneider,Xiaodong Liu,Jianfeng Gao

Main category: cs.CL

TL;DR: This paper introduces Simulated Attention Score (SAS) and Parameter-Efficient Attention Aggregation (PEAA) to enhance attention mechanisms in Transformers, achieving better performance without increasing model size.

Details

Motivation: The authors observed that the performance of multi-head attention (MHA) improves as the number of attention heads increases, provided the hidden size per head remains sufficiently large. This motivated a method that simulates more heads and a larger hidden size per head with minimal parameter overhead. Method: Simulated Attention Score (SAS) projects low-dimensional head representations into a higher-dimensional space to simulate larger attention heads and feature dimensions, and Parameter-Efficient Attention Aggregation (PEAA) controls parameter cost. Result: Comprehensive experiments demonstrated the effectiveness of SAS, achieving significant improvements over different attention variants across various datasets and tasks. Conclusion: The proposed SAS method, along with the PEAA technique, effectively enhances model performance while maintaining a compact size, outperforming various attention variants. Abstract: The attention mechanism is a core component of the Transformer architecture. Various methods have been developed to compute attention scores, including multi-head attention (MHA), multi-query attention, group-query attention and so on. We further analyze the MHA and observe that its performance improves as the number of attention heads increases, provided the hidden size per head remains sufficiently large. Therefore, increasing both the head count and hidden size per head with minimal parameter overhead can lead to significant performance gains at a low cost. Motivated by this insight, we introduce Simulated Attention Score (SAS), which maintains a compact model size while simulating a larger number of attention heads and hidden feature dimension per head. This is achieved by projecting a low-dimensional head representation into a higher-dimensional space, effectively increasing attention capacity without increasing parameter count. 
Beyond the head representations, we further extend the simulation approach to feature dimension of the key and query embeddings, enhancing expressiveness by mimicking the behavior of a larger model while preserving the original model size. To control the parameter cost, we also propose Parameter-Efficient Attention Aggregation (PEAA). Comprehensive experiments on a variety of datasets and tasks demonstrate the effectiveness of the proposed SAS method, achieving significant improvements over different attention variants.

[30] KeyKnowledgeRAG (K^2RAG): An Enhanced RAG method for improved LLM question-answering capabilities

Hruday Markondapatnaikuni,Basem Suleiman,Abdelkarim Erradi,Shijing Chen

Main category: cs.CL

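TL;DR: This study proposes K2RAG, a new framework that integrates dense and sparse vector search, knowledge graphs, and text summarization to improve knowledge expansion for large language models.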

Details

Motivation: Traditional fine-tuning is highly resource-intensive, and RAG implementations are limited in scalability and answer accuracy, so a new approach to expanding LLM knowledge is needed. Method: K2RAG uses a preprocessing step that summarizes the training data, and it is evaluated on the MultiHopRAG dataset. Result: K2RAG achieved a mean answer similarity score of 0.57 and a third-quartile (Q3) similarity of 0.82. The summarization step reduced the average training time of individual components by 93%, and execution was up to 40% faster than traditional knowledge graph-based RAG systems. K2RAG also scaled better than several of the naive RAG implementations tested, requiring three times less VRAM. Conclusion: K2RAG is a novel framework that combines dense and sparse vector search, knowledge graphs, and text summarization to improve retrieval quality and system efficiency. It performs well in both accuracy and efficiency and shows superior scalability. Abstract: Fine-tuning is an immensely resource-intensive process when retraining Large Language Models (LLMs) to incorporate a larger body of knowledge. Although many fine-tuning techniques have been developed to reduce the time and computational cost involved, the challenge persists as LLMs continue to grow in size and complexity. To address this, a new approach to knowledge expansion in LLMs is needed. Retrieval-Augmented Generation (RAG) offers one such alternative by storing external knowledge in a database and retrieving relevant chunks to support question answering. However, naive implementations of RAG face significant limitations in scalability and answer accuracy. This paper introduces KeyKnowledgeRAG (K2RAG), a novel framework designed to overcome these limitations. Inspired by the divide-and-conquer paradigm, K2RAG integrates dense and sparse vector search, knowledge graphs, and text summarization to improve retrieval quality and system efficiency. The framework also includes a preprocessing step that summarizes the training data, significantly reducing the training time. K2RAG was evaluated using the MultiHopRAG dataset, where the proposed pipeline was trained on the document corpus and tested on a separate evaluation set. Results demonstrated notable improvements over common naive RAG implementations. K2RAG achieved the highest mean answer similarity score of 0.57, and reached the highest third quartile (Q3) similarity of 0.82, indicating better alignment with ground-truth answers. In addition to improved accuracy, the framework proved highly efficient. 
The summarization step reduced the average training time of individual components by 93%, and execution speed was up to 40% faster than traditional knowledge graph-based RAG systems. K2RAG also demonstrated superior scalability, requiring three times less VRAM than several naive RAG implementations tested in this study.

[31] Rethinking the Privacy of Text Embeddings: A Reproducibility Study of "Text Embeddings Reveal (Almost) As Much As Text"

Dominykas Seputis,Yongkang Li,Karsten Langerak,Serghei Mihailov

Main category: cs.CL

TL;DR: This paper reproduces and extends the Vec2Text method, showing that text embeddings can leak sensitive information. While reconstruction works well, especially in ideal conditions, the study identifies limitations and proposes quantization as a practical privacy defense.

Details

Motivation: Recent methods like Vec2Text challenge the assumption that transmitting embeddings is privacy-preserving by showing that original texts can be reconstructed from embeddings. This motivated further verification and extension of these findings, particularly given the opaque nature of high-dimensional embeddings. Method: The authors reproduce the Vec2Text framework and evaluate it by validating original claims and conducting extended experiments, including parameter sensitivity analysis, testing reconstruction of sensitive inputs like passwords, and exploring embedding quantization as a privacy defense. Result: Vec2Text successfully reconstructs text in both in-domain and out-of-domain settings, even recovering password-like sequences. However, its performance is sensitive to input sequence length. Privacy mitigation techniques such as Gaussian noise and quantization are effective, with quantization being simpler and more widely applicable. Conclusion: The study concludes that while Vec2Text can effectively reconstruct text from embeddings under ideal conditions, including sensitive data like passwords, it has limitations such as sensitivity to sequence length. Privacy defenses like Gaussian noise and quantization can mitigate risks, emphasizing the need for caution in using text embeddings and further research into NLP system defenses. Abstract: Text embeddings are fundamental to many natural language processing (NLP) tasks, extensively applied in domains such as recommendation systems and information retrieval (IR). Traditionally, transmitting embeddings instead of raw text has been seen as privacy-preserving. However, recent methods such as Vec2Text challenge this assumption by demonstrating that controlled decoding can successfully reconstruct original texts from black-box embeddings. 
The unexpectedly strong results reported by Vec2Text motivated us to conduct further verification, particularly considering the typically non-intuitive and opaque structure of high-dimensional embedding spaces. In this work, we reproduce the Vec2Text framework and evaluate it from two perspectives: (1) validating the original claims, and (2) extending the study through targeted experiments. First, we successfully replicate the original key results in both in-domain and out-of-domain settings, with only minor discrepancies arising due to missing artifacts, such as model checkpoints and dataset splits. Furthermore, we extend the study by conducting a parameter sensitivity analysis, evaluating the feasibility of reconstructing sensitive inputs (e.g., passwords), and exploring embedding quantization as a lightweight privacy defense. Our results show that Vec2Text is effective under ideal conditions, capable of reconstructing even password-like sequences that lack clear semantics. However, we identify key limitations, including its sensitivity to input sequence length. We also find that Gaussian noise and quantization techniques can mitigate the privacy risks posed by Vec2Text, with quantization offering a simpler and more widely applicable solution. Our findings emphasize the need for caution in using text embeddings and highlight the importance of further research into robust defense mechanisms for NLP systems.
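
The abstract names embedding quantization as a lightweight privacy defense but does not spell out a scheme. As a rough illustration only, the sketch below applies uniform scalar quantization to a single embedding vector. The function name `quantize_embedding`, the bit width, and the per-vector min/max scaling are assumptions made for this example, not the configuration used in the study.

```python
def quantize_embedding(vec, bits=8):
    """Uniformly quantize a vector to 2**bits levels and map it back to floats.

    Hypothetical helper: per-vector min/max scaling is an assumption,
    not the scheme evaluated in the paper.
    """
    levels = (1 << bits) - 1          # number of quantization steps
    lo, hi = min(vec), max(vec)
    if hi == lo:                      # constant vector: nothing to quantize
        return list(vec)
    scale = (hi - lo) / levels        # width of one quantization step
    # Snap each coordinate to the nearest step, then map back to the float range.
    return [lo + round((x - lo) / scale) * scale for x in vec]
```

Fewer bits discard more fine-grained information from the embedding, which is the property a defense of this kind would rely on; how much retrieval quality is lost at a given bit width has to be measured separately.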

[32] Not All Preferences are What You Need for Post-Training: Selective Alignment Strategy for Preference Optimization

Zhijin Dong

Main category: cs.CL

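TL;DR: This paper proposes Selective-DPO, which selectively aligns high-impact tokens to improve preference alignment of large language models while reducing computational cost.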

Details

Motivation: Not all tokens contribute equally to model performance, so reducing computational overhead while improving alignment fidelity is a key challenge. Method: A selective alignment strategy prioritizes high-impact tokens, using token-level log-probability differences between the current policy and a reference model. Result: On benchmarks such as Arena-Hard and MT-Bench, Selective-DPO outperforms standard DPO and distillation-based methods. Conclusion: Selective-DPO performs well at optimizing preference alignment for large language models and underlines the importance of reference model selection and token-level optimization. Abstract: Post-training alignment of large language models (LLMs) is a critical challenge, as not all tokens contribute equally to model performance. This paper introduces a selective alignment strategy that prioritizes high-impact tokens within preference pairs, leveraging token-level log-probability differences between the current policy and a reference model. By focusing on these informative tokens, our approach reduces computational overhead and enhances alignment fidelity. We further explore the role of reference model quality, demonstrating that stronger reference models significantly improve token selection accuracy and overall optimization effectiveness. Comprehensive experiments on benchmarks such as Arena-Hard and MT-Bench validate the superiority of our Selective-DPO method over standard DPO and distillation-based baselines. Our findings highlight the importance of token-level optimization and reference model selection in advancing preference alignment for LLMs. The code is available at https://github.com/Dongzhijin/SDPO.
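
To make the idea of token selection by log-probability difference concrete, here is a minimal sketch. It assumes that "high-impact" means the largest absolute gap between policy and reference log-probabilities and that a fixed fraction of tokens is kept; both choices, and the name `select_high_impact_tokens`, are illustrative assumptions rather than the rule used in the linked repository.

```python
def select_high_impact_tokens(policy_logps, ref_logps, keep_ratio=0.5):
    """Return a 0/1 mask marking the tokens with the largest
    |log p_policy - log p_ref| gap.

    Hypothetical helper: the absolute-difference criterion and the
    fixed keep_ratio are assumptions for illustration.
    """
    if len(policy_logps) != len(ref_logps):
        raise ValueError("log-probability sequences must have equal length")
    diffs = [abs(p - r) for p, r in zip(policy_logps, ref_logps)]
    k = max(1, int(len(diffs) * keep_ratio))   # number of tokens to keep
    # Rank token positions by gap size, largest first.
    ranked = sorted(range(len(diffs)), key=lambda i: diffs[i], reverse=True)
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(len(diffs))]
```

Under these assumptions, a mask like this would multiply the per-token terms of the preference loss so that only the selected positions contribute to the gradient.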

[33] Code-Switching in End-to-End Automatic Speech Recognition: A Systematic Literature Review

Maha Tufail Agro,Atharva Kulkarni,Karima Kadaoui,Zeerak Talat,Hanan Aldarmaki

Main category: cs.CL

TL;DR: This paper presents a systematic review of code-switching in end-to-end automatic speech recognition (ASR) models, documenting current research efforts, datasets, metrics, and challenges while identifying opportunities for future work.

Details

Motivation: Motivated by growing research interest in automatic speech recognition (ASR) and the prevalence of code-switching in many languages, the authors aim to provide a comprehensive overview of current research efforts and resources. Method: The authors conducted a systematic literature review on code-switching in end-to-end ASR models by collecting and manually annotating papers published in peer-reviewed venues. Result: The paper documents the languages considered, datasets used, evaluation metrics, model choices, and performance outcomes in existing studies on code-switching in end-to-end ASR. Conclusion: The paper concludes that there are opportunities and gaps in the research of code-switching in end-to-end ASR models, which can guide future research. Abstract: Motivated by a growing research interest into automatic speech recognition (ASR), and the growing body of work for languages in which code-switching (CS) often occurs, we present a systematic literature review of code-switching in end-to-end ASR models. We collect and manually annotate papers published in peer reviewed venues. We document the languages considered, datasets, metrics, model choices, and performance, and present a discussion of challenges in end-to-end ASR for code-switching. Our analysis thus provides insights on current research efforts and available resources as well as opportunities and gaps to guide future research.

[34] When Large Language Models Meet Law: Dual-Lens Taxonomy, Technical Advances, and Ethical Governance

Peizhang Shao,Linrui Xu,Jinxi Wang,Wei Zhou,Xingyu Wu

Main category: cs.CL

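TL;DR: This paper is a comprehensive survey of large language models applied to the legal domain. It proposes a new taxonomy that combines legal reasoning with professional knowledge systems, summarizes progress in the field, and identifies directions for future research.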

Details

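Motivation: The paper aims to comprehensively review the application of large language models (LLMs) in the legal domain, to address the broad challenges they face, including hallucination, explainability deficits, jurisdictional adaptation difficulties, and ethical asymmetry, and to provide a technical roadmap and conceptual framework. Method: The paper adopts an innovative dual-lens taxonomy that integrates legal reasoning frameworks and professional ontologies to unify historical research and contemporary breakthroughs. It surveys how technical innovations such as sparse attention mechanisms and mixture-of-experts architectures address core challenges in text processing, knowledge integration, and evaluation rigor. Result: Significant progress is documented in task generalization, reasoning formalization, and workflow integration, and in addressing core challenges in text processing, knowledge integration, and evaluation rigor through technical innovations. Key frontiers are identified, including low-resource systems, multimodal evidence integration, and dynamic rebuttal handling. Conclusion: The paper proposes a new taxonomy that maps legal roles to NLP subtasks and computationally implements the Toulmin argumentation framework, offering a technical roadmap for researchers and a conceptual framework for practitioners and laying a foundation for the next generation of legal AI. Abstract: This paper establishes the first comprehensive review of Large Language Models (LLMs) applied within the legal domain. It pioneers an innovative dual lens taxonomy that integrates legal reasoning frameworks and professional ontologies to systematically unify historical research and contemporary breakthroughs. Transformer-based LLMs, which exhibit emergent capabilities such as contextual reasoning and generative argumentation, surmount traditional limitations by dynamically capturing legal semantics and unifying evidence reasoning. Significant progress is documented in task generalization, reasoning formalization, workflow integration, and addressing core challenges in text processing, knowledge integration, and evaluation rigor via technical innovations like sparse attention mechanisms and mixture-of-experts architectures. However, widespread adoption of LLM introduces critical challenges: hallucination, explainability deficits, jurisdictional adaptation difficulties, and ethical asymmetry. This review proposes a novel taxonomy that maps legal roles to NLP subtasks and computationally implements the Toulmin argumentation framework, thus systematizing advances in reasoning, retrieval, prediction, and dispute resolution. It identifies key frontiers including low-resource systems, multimodal evidence integration, and dynamic rebuttal handling. 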
Ultimately, this work provides both a technical roadmap for researchers and a conceptual framework for practitioners navigating the algorithmic future, laying a robust foundation for the next era of legal artificial intelligence. We have created a GitHub repository to index the relevant papers: https://github.com/Kilimajaro/LLMs_Meet_Law.

[35] StreamUni: Achieving Streaming Speech Translation with a Unified Large Speech-Language Model

Shoutao Guo,Xiang Li,Shaolei Zhang,Mengge Liu,Wei Chen,Yang Feng

Main category: cs.CL

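TL;DR: This paper proposes StreamUni, which achieves streaming speech translation (StreamST) through a unified Large Speech-Language Model (LSLM), completing speech segmentation, policy decision, and translation generation without massive policy-specific training.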

Details

Motivation: Existing streaming speech translation methods typically operate on sentence-level speech segments, require collaboration with segmentation models, and are constrained by limited contextual information, which makes effective policies hard to learn. Method: StreamUni introduces speech Chain-of-Thought (CoT) to guide the LSLM in generating multi-stage outputs, thereby accomplishing speech segmentation, policy decision, and translation generation simultaneously. A streaming CoT training method is also proposed to enhance low-latency policy decisions and generation capabilities. Result: Experiments show that the approach achieves state-of-the-art performance on streaming speech translation tasks. Conclusion: StreamUni uses a unified large speech-language model to address the problems of existing streaming speech translation methods, achieving efficient and accurate real-time translation. Abstract: Streaming speech translation (StreamST) requires determining appropriate timing, known as policy, to generate translations while continuously receiving source speech inputs, balancing low latency with high translation quality. However, existing StreamST methods typically operate on sentence-level speech segments, referred to as simultaneous speech translation (SimulST). In practice, they require collaboration with segmentation models to accomplish StreamST, where the truncated speech segments constrain SimulST models to make policy decisions and generate translations based on limited contextual information. Moreover, SimulST models struggle to learn effective policies due to the complexity of speech inputs and cross-lingual generation. To address these challenges, we propose StreamUni, which achieves StreamST through a unified Large Speech-Language Model (LSLM). Specifically, StreamUni incorporates speech Chain-of-Thought (CoT) in guiding the LSLM to generate multi-stage outputs. Leveraging these multi-stage outputs, StreamUni simultaneously accomplishes speech segmentation, policy decision, and translation generation, completing StreamST without requiring massive policy-specific training. Additionally, we propose a streaming CoT training method that enhances low-latency policy decisions and generation capabilities using limited CoT data. Experiments demonstrate that our approach achieves state-of-the-art performance on StreamST tasks.

[36] Bridging Logic and Learning: Decoding Temporal Logic Embeddings via Transformers

Sara Candussio,Gaia Saveri,Gabriele Sarti,Luca Bortolussi

Main category: cs.CL

TL;DR: This paper proposes a Transformer-based decoder model to invert semantic embeddings of STL formulae, achieving valid formula generation and semantic generalization within a few epochs, enabling optimization directly in the semantic space for requirement mining tasks.

Details

Motivation: The motivation is to find an invertible translation of optimal continuous representations of logic formulae into concrete requirements. This allows integration of symbolic knowledge into data-driven learning algorithms and enables continuous learning and optimization in the semantic space of formulae. Method: A Transformer-based decoder-only model was trained to invert semantic embeddings of Signal Temporal Logic (STL) formulae. A small vocabulary was constructed from STL syntax, and the model's performance was evaluated across various levels of training data complexity to assess its ability to capture semantic information and generalize out-of-distribution. Result: The proposed model was able to generate valid STL formulae after only 1 epoch and generalized to the semantics of the logic in about 10 epochs. It decoded embeddings into simpler formulae with reduced length and nesting while maintaining semantic equivalence. The methodology proved effective across varying training data complexities. Conclusion: The study concludes that a Transformer-based decoder-only model can effectively invert semantic embeddings of STL formulae, allowing for the generation of valid and often simpler formulae that remain semantically close to references. The model demonstrates generalization capabilities and is deployed for requirement mining tasks, optimizing directly in the semantic space. Abstract: Continuous representations of logic formulae allow us to integrate symbolic knowledge into data-driven learning algorithms. If such embeddings are semantically consistent, i.e. if similar specifications are mapped into nearby vectors, they enable continuous learning and optimization directly in the semantic space of formulae. However, to translate the optimal continuous representation into a concrete requirement, such embeddings must be invertible. 
We tackle this issue by training a Transformer-based decoder-only model to invert semantic embeddings of Signal Temporal Logic (STL) formulae. STL is a powerful formalism that allows us to describe properties of signals varying over time in an expressive yet concise way. By constructing a small vocabulary from STL syntax, we demonstrate that our proposed model is able to generate valid formulae after only 1 epoch and to generalize to the semantics of the logic in about 10 epochs. Additionally, the model is able to decode a given embedding into formulae that are often simpler in terms of length and nesting while remaining semantically close (or equivalent) to gold references. We show the effectiveness of our methodology across various levels of training formulae complexity to assess the impact of training data on the model's ability to effectively capture the semantic information contained in the embeddings and generalize out-of-distribution. Finally, we deploy our model for solving a requirement mining task, i.e. inferring STL specifications that solve a classification task on trajectories, performing the optimization directly in the semantic space.

[37] Understanding and Controlling Repetition Neurons and Induction Heads in In-Context Learning

Nhi Hoai Doan,Tatsuya Hiraoka,Kentaro Inui

Main category: cs.CL

TL;DR: This paper explores how repetition neurons in large language models affect in-context learning performance, finding that their impact varies by layer depth and identifying ways to reduce repetition without harming ICL effectiveness.

Details

Motivation: The motivation for this research stems from the desire to better understand how large language models recognize input patterns and how this recognition affects their in-context learning abilities, with a focus on reducing repetitive outputs. Method: The authors conducted experiments comparing the effects of repetition neurons and induction heads on in-context learning performance, examining how these elements influence large language models' behavior. Result: The results indicate that repetition neurons have varying impacts on ICL performance based on their layer depth, and effective strategies were identified to reduce repetitive outputs while preserving ICL capabilities. Conclusion: The paper concludes that the impact of repetition neurons on ICL performance is dependent on their depth within the model, and strategies can be employed to minimize repetitive outputs without compromising ICL capabilities. Abstract: This paper investigates the relationship between large language models' (LLMs) ability to recognize repetitive input patterns and their performance on in-context learning (ICL). In contrast to prior work that has primarily focused on attention heads, we examine this relationship from the perspective of skill neurons, specifically repetition neurons. Our experiments reveal that the impact of these neurons on ICL performance varies depending on the depth of the layer in which they reside. By comparing the effects of repetition neurons and induction heads, we further identify strategies for reducing repetitive outputs while maintaining strong ICL capabilities.

[38] On the Effect of Instruction Tuning Loss on Generalization

Anwoy Chatterjee,H S V N S Kowndinya Renduchintala,Sumit Bhatia,Tanmoy Chakraborty

Main category: cs.CL

TL;DR: This paper introduces Weighted Instruction Tuning (WIT), showing that adjusting the weight of prompt and response tokens improves model performance and robustness compared to traditional instruction tuning methods.

Details

Motivation: The motivation stems from the observation that the conventional auto-regressive objective in instruction tuning often overlooks the importance of prompt tokens, potentially leading to suboptimal performance and limited robustness. Method: The authors propose Weighted Instruction Tuning (WIT) and evaluate its effectiveness through extensive experiments on five language models, three fine-tuning datasets, and five evaluation benchmarks. Result: The results show that standard instruction tuning loss often yields suboptimal performance. The best outcomes were achieved with low-to-moderate weights for prompt tokens and moderate-to-high weights for response tokens. Conclusion: The study concludes that differentially weighting prompt and response tokens during instruction tuning can significantly improve model performance and robustness, making it a better alternative to conventional methods. Abstract: Instruction Tuning has emerged as a pivotal post-training paradigm that enables pre-trained language models to better follow user instructions. Despite its significance, little attention has been given to optimizing the loss function used. A fundamental, yet often overlooked, question is whether the conventional auto-regressive objective - where loss is computed only on response tokens, excluding prompt tokens - is truly optimal for instruction tuning. In this work, we systematically investigate the impact of differentially weighting prompt and response tokens in instruction tuning loss, and propose Weighted Instruction Tuning (WIT) as a better alternative to conventional instruction tuning. Through extensive experiments on five language models of different families and scale, three finetuning datasets of different sizes, and five diverse evaluation benchmarks, we show that the standard instruction tuning loss often yields suboptimal performance and limited robustness to input prompt variations. 
We find that a low-to-moderate weight for prompt tokens coupled with a moderate-to-high weight for response tokens yields the best-performing models across settings and also serve as better starting points for the subsequent preference alignment training. These findings highlight the need to reconsider instruction tuning loss and offer actionable insights for developing more robust and generalizable models. Our code is open-sourced at https://github.com/kowndinya-renduchintala/WIT.
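
The weighting idea in the abstract can be sketched as a per-token weighted average of negative log-likelihoods. The function name `weighted_instruction_loss`, the default weights, and the normalization by the sum of weights are assumptions for illustration; the authors' repository may define the loss differently.

```python
def weighted_instruction_loss(token_nll, is_prompt,
                              prompt_weight=0.2, response_weight=1.0):
    """Weighted mean of per-token negative log-likelihoods.

    token_nll : per-token loss values for one sequence
    is_prompt : True where the token belongs to the prompt
    Hypothetical helper: normalizing by the total weight is an assumption.
    """
    if len(token_nll) != len(is_prompt):
        raise ValueError("token_nll and is_prompt must have equal length")
    weights = [prompt_weight if p else response_weight for p in is_prompt]
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("at least one token must have non-zero weight")
    return sum(w * l for w, l in zip(weights, token_nll)) / total_weight
```

Setting `prompt_weight=0.0` and `response_weight=1.0` recovers the conventional objective described in the abstract, where loss is computed only on response tokens.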

[39] Conditional Unigram Tokenization with Parallel Data

Gianluca Vico,Jindřich Libovický

Main category: cs.CL

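TL;DR: This paper proposes conditional unigram tokenization, which uses source-language information to optimize target-language tokenization, aiming to improve semantic consistency in cross-lingual processing. It performs well in language modeling but does not improve machine translation quality.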

Details

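Motivation: To improve on conventional unigram tokenization, the authors propose a method that takes source-language information into account to strengthen cross-lingual processing. Method: A conditional unigram tokenization method is introduced that conditions on source-language tokens from parallel data and learns a target tokenizer by maximizing cross-lingual semantic alignment. Result: The conditional tokenizer maintains statistical properties comparable to standard unigram tokenizers. Evaluated on four language pairs across different language families and resource levels, it shows no improvement in machine translation but consistently reduces perplexity in language modeling. Conclusion: Although the conditional tokenizer reduces perplexity in language modeling, it does not improve machine translation quality. The authors hypothesize that quadratic scaling of conditional probability estimation with vocabulary size creates a data efficiency bottleneck, and suggest that alternative parameterizations may be needed for practical cross-lingual tokenization. Abstract: We introduce conditional unigram tokenization, a novel approach that extends unigram tokenization by conditioning target token probabilities on source-language tokens from parallel data. Given a fixed source tokenizer, our method learns a target tokenizer that maximizes cross-lingual semantic alignment. We evaluate our tokenizer on four language pairs across different families and resource levels, examining intrinsic properties and downstream performance on machine translation and language modeling. While our conditional tokenizer maintains comparable statistical properties to standard unigram tokenizers, results are mixed: we observe no improvements in machine translation quality, but find consistent perplexity reductions in language modeling. We hypothesize that quadratic scaling of conditional probability estimation with respect to the vocabulary size creates a data efficiency bottleneck. Our findings suggest that alternative parameterizations may be necessary for practical cross-lingual tokenization.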

[40] From Ambiguity to Accuracy: The Transformative Effect of Coreference Resolution on Retrieval-Augmented Generation systems

Youngjoon Jang,Seongtae Hong,Junyoung Son,Sungjin Park,Chanjun Park,Heuiseok Lim

Main category: cs.CL

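TL;DR: This study examines the impact of entity coreference in retrieval-augmented generation systems and finds that coreference resolution improves both retrieval and generation performance, with a more pronounced effect on smaller models.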

Details

Motivation: Retrieval-Augmented Generation (RAG) is an important framework in natural language processing, but its effectiveness is often hindered by coreferential complexity in retrieved documents. The paper therefore systematically studies how coreference affects retrieval and generation performance in RAG systems. Method: The impact of entity coreference on RAG systems is studied by comparing different pooling strategies on retrieval tasks and by analyzing how much models of different sizes benefit from disambiguation on question-answering tasks. Result: Coreference resolution improves retrieval effectiveness, smaller models benefit more from disambiguation in question answering, and mean pooling shows stronger context-capturing ability after coreference resolution is applied. Conclusion: Coreference resolution enhances retrieval effectiveness and improves question-answering performance. The study offers a deeper understanding of the challenges posed by coreferential complexity in RAG and guidance for improving retrieval and generation in knowledge-intensive AI applications. Abstract: Retrieval-Augmented Generation (RAG) has emerged as a crucial framework in natural language processing (NLP), improving factual consistency and reducing hallucinations by integrating external document retrieval with large language models (LLMs). However, the effectiveness of RAG is often hindered by coreferential complexity in retrieved documents, introducing ambiguity that disrupts in-context learning. In this study, we systematically investigate how entity coreference affects both document retrieval and generative performance in RAG-based systems, focusing on retrieval relevance, contextual understanding, and overall response quality. We demonstrate that coreference resolution enhances retrieval effectiveness and improves question-answering (QA) performance. Through comparative analysis of different pooling strategies in retrieval tasks, we find that mean pooling demonstrates superior context capturing ability after applying coreference resolution. In QA tasks, we discover that smaller models benefit more from the disambiguation process, likely due to their limited inherent capacity for handling referential ambiguity. With these findings, this study aims to provide a deeper understanding of the challenges posed by coreferential complexity in RAG, providing guidance for improving retrieval and generation in knowledge-intensive AI applications.
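
For readers unfamiliar with the pooling strategy mentioned in the abstract, mean pooling averages the token embeddings of a passage into a single vector while ignoring padding positions. The sketch below shows the general technique in plain Python; the function name `mean_pool` and the list-of-lists input format are assumptions for this example and do not reflect the retrievers used in the study.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors where attention_mask is non-zero.

    token_embeddings : list of equal-length float lists, one per token
    attention_mask   : list of 0/1 flags, 0 for padding tokens
    Hypothetical helper illustrating generic mean pooling.
    """
    dim = len(token_embeddings[0])
    sums = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:                      # skip padding positions
            count += 1
            for j, x in enumerate(vec):
                sums[j] += x
    if count == 0:                    # all padding: return the zero vector
        return sums
    return [s / count for s in sums]
```

Because every unmasked token contributes equally to the result, replacing a pronoun with its resolved entity mention changes the pooled vector directly, which is one plausible reading of why this strategy responds to coreference resolution.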

[41] Alpay Algebra V: Multi-Layered Semantic Games and Transfinite Fixed-Point Simulation

Bugra Kilictas,Faruk Alpay

Main category: cs.CL

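TL;DR: This paper proposes a multi-layered semantic game architecture that extends the self-referential framework of Alpay Algebra, in which game-theoretic reasoning emerges naturally from fixed-point iteration.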

Details

Motivation: To extend the self-referential framework of Alpay Algebra into a multi-layered semantic game architecture, so that the alignment process between AI systems and documents becomes a meta-game containing embedded decision problems. Method: Building on the empathetic embedding concept of Alpay Algebra IV, the paper introduces a nested game-theoretic structure, formalized through a composite operator $\phi(\cdot, \gamma(\cdot))$ where $\phi$ drives the main semantic convergence and $\gamma$ resolves local sub-games. Result: The paper constructs a concrete instance of the "semantic virus" concept: the paper itself functions as a semantic artifact designed to propagate its fixed-point patterns in AI embedding spaces. The authors state that the results apply in practice beyond pure mathematical abstraction. Conclusion: The framework indicates that game-theoretic reasoning emerges naturally from fixed-point iteration rather than being imposed externally. Existence and uniqueness of semantic equilibria are proved, with verification through adaptations of Banach's fixed-point theorem, a $\phi$-topology based on the Kozlov-Maz'ya-Rossmann formula, and categorical consistency tests via the Yoneda lemma. Abstract: This paper extends the self-referential framework of Alpay Algebra into a multi-layered semantic game architecture where transfinite fixed-point convergence encompasses hierarchical sub-games at each iteration level. Building upon Alpay Algebra IV's empathetic embedding concept, we introduce a nested game-theoretic structure where the alignment process between AI systems and documents becomes a meta-game containing embedded decision problems. We formalize this through a composite operator $\phi(\cdot, \gamma(\cdot))$ where $\phi$ drives the main semantic convergence while $\gamma$ resolves local sub-games. The resulting framework demonstrates that game-theoretic reasoning emerges naturally from fixed-point iteration rather than being imposed externally. We prove a Game Theorem establishing existence and uniqueness of semantic equilibria under realistic cognitive simulation assumptions. Our verification suite includes adaptations of Banach's fixed-point theorem to transfinite contexts, a novel $\phi$-topology based on the Kozlov-Maz'ya-Rossmann formula for handling semantic singularities, and categorical consistency tests via the Yoneda lemma. The paper itself functions as a semantic artifact designed to propagate its fixed-point patterns in AI embedding spaces -- a deliberate instantiation of the "semantic virus" concept it theorizes. All results are grounded in category theory, information theory, and realistic AI cognition models, ensuring practical applicability beyond pure mathematical abstraction.

[42] DocCHA: Towards LLM-Augmented Interactive Online diagnosis System

Xinyi Liu,Dachun Sun,Yi R. Fung,Dilek Hakkani-Tür,Tarek Abdelzaher

Main category: cs.CL

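TL;DR: DocCHA is a confidence-aware, modular framework that emulates clinical reasoning to enable more efficient, transparent, and trustworthy diagnostic dialogue systems.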

Details

Motivation: Existing Conversational Health Agents (CHAs) lack adaptive multi-turn reasoning, symptom clarification, and transparent decision-making, which limits their use in real-world clinical diagnosis, where iterative and structured dialogue is needed. Method: The diagnostic process is decomposed into three stages: symptom elicitation, history acquisition, and causal graph construction. Each module uses interpretable confidence scores to guide adaptive questioning, prioritize informative clarifications, and refine weak reasoning links. Result: Evaluated on two real-world Chinese consultation datasets (IMCS21, DX), DocCHA consistently outperforms strong prompting-based LLM baselines (GPT-3.5, GPT-4o, LLaMA-3), with up to 5.18 percent higher diagnostic accuracy and over 30 percent improvement in symptom recall, at the cost of only a modest increase in dialogue turns. Conclusion: Through its modular framework and confidence-aware mechanism, DocCHA enables structured, transparent, and efficient diagnostic conversations, paving the way for trustworthy LLM-powered clinical assistants in multilingual and resource-constrained settings. Abstract: Despite the impressive capabilities of Large Language Models (LLMs), existing Conversational Health Agents (CHAs) remain static and brittle, incapable of adaptive multi-turn reasoning, symptom clarification, or transparent decision-making. This hinders their real-world applicability in clinical diagnosis, where iterative and structured dialogue is essential. We propose DocCHA, a confidence-aware, modular framework that emulates clinical reasoning by decomposing the diagnostic process into three stages: (1) symptom elicitation, (2) history acquisition, and (3) causal graph construction. Each module uses interpretable confidence scores to guide adaptive questioning, prioritize informative clarifications, and refine weak reasoning links. Evaluated on two real-world Chinese consultation datasets (IMCS21, DX), DocCHA consistently outperforms strong prompting-based LLM baselines (GPT-3.5, GPT-4o, LLaMA-3), achieving up to 5.18 percent higher diagnostic accuracy and over 30 percent improvement in symptom recall, with only modest increase in dialogue turns. These results demonstrate the effectiveness of DocCHA in enabling structured, transparent, and efficient diagnostic conversations -- paving the way for trustworthy LLM-powered clinical assistants in multilingual and resource-constrained settings.

[43] Automating MD simulations for Proteins using Large language Models: NAMD-Agent

Achuth Chandrasekhar,Amir Barati Farimani

Main category: cs.CL

TL;DR: This paper introduces an automated pipeline using Gemini 2.0 Flash, Python, and Selenium to streamline the creation of MD simulation inputs via CHARMM GUI, significantly reducing setup time and errors while enabling scalable, hands-free processing of multiple protein systems.

Details

Motivation: Preparing high-quality input files for molecular dynamics (MD) simulations is often time-consuming and error-prone. The motivation for this work is to automate this process using advanced tools like Large Language Models (LLMs) to improve efficiency, accuracy, and scalability. Method: An automated pipeline was developed using Gemini 2.0 Flash for code generation and iterative refinement, Python scripting, and Selenium-based web automation to interact with CHARMM GUI. This system generates NAMD input files by automatically navigating the interface and extracting necessary parameters. Post-processing tools were also integrated to refine outputs. Result: The proposed pipeline successfully reduced setup time, minimized manual errors, and enabled parallel processing of multiple protein systems. It demonstrated the ability to generate accurate NAMD input files and provided a largely hands-free workflow from input generation to post-processing. Conclusion: The study concludes that leveraging LLMs like Gemini 2.0 Flash in combination with web automation and scripting can effectively streamline the preparation of MD simulation inputs, reducing manual effort, minimizing errors, and offering scalability for handling multiple protein systems. Abstract: Molecular dynamics simulations are an essential tool in understanding protein structure, dynamics, and function at the atomic level. However, preparing high quality input files for MD simulations can be a time consuming and error prone process. In this work, we introduce an automated pipeline that leverages Large Language Models (LLMs), specifically Gemini 2.0 Flash, in conjunction with python scripting and Selenium based web automation to streamline the generation of MD input files. The pipeline exploits CHARMM GUI's comprehensive web-based interface for preparing simulation-ready inputs for NAMD. 
By integrating Gemini's code generation and iterative refinement capabilities, simulation scripts are automatically written, executed, and revised to navigate CHARMM GUI, extract appropriate parameters, and produce the required NAMD input files. Post processing is performed using additional software to further refine the simulation outputs, thereby enabling a complete and largely hands free workflow. Our results demonstrate that this approach reduces setup time, minimizes manual errors, and offers a scalable solution for handling multiple protein systems in parallel. This automated framework paves the way for broader application of LLMs in computational structural biology, offering a robust and adaptable platform for future developments in simulation automation.

[44] DTECT: Dynamic Topic Explorer & Context Tracker

Suman Adhya,Debarshi Kumar Sanyal

Main category: cs.CL

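TL;DR: DTECT is a new dynamic topic modeling tool that helps users track and understand how topics change over time through a unified workflow and advanced interpretation features.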

Details

Motivation: Existing dynamic topic modeling techniques sit in fragmented pipelines that lack robust support for interpretation and user-friendly exploration, while the rapid growth of textual data makes uncovering evolving themes and trends a significant challenge. Method: DTECT provides a unified workflow that supports data preprocessing, multiple model architectures, and dedicated evaluation metrics, and it enhances interpretability through LLM-driven automatic topic labeling, trend analysis via temporally salient words, interactive visualizations, and a natural language chat interface. Result: DTECT significantly improves interpretability for users and offers a single, cohesive platform for dynamic topic analysis. Conclusion: DTECT is an open-source, end-to-end system whose unified workflow and enhanced interpretation features let users track and understand topic dynamics more effectively. Abstract: The explosive growth of textual data over time presents a significant challenge in uncovering evolving themes and trends. Existing dynamic topic modeling techniques, while powerful, often exist in fragmented pipelines that lack robust support for interpretation and user-friendly exploration. We introduce DTECT (Dynamic Topic Explorer & Context Tracker), an end-to-end system that bridges the gap between raw textual data and meaningful temporal insights. DTECT provides a unified workflow that supports data preprocessing, multiple model architectures, and dedicated evaluation metrics to analyze the topic quality of temporal topic models. It significantly enhances interpretability by introducing LLM-driven automatic topic labeling, trend analysis via temporally salient words, interactive visualizations with document-level summarization, and a natural language chat interface for intuitive data querying. By integrating these features into a single, cohesive platform, DTECT empowers users to more effectively track and understand thematic dynamics. DTECT is open-source and available at https://github.com/AdhyaSuman/DTECT.

[45] SAGE: A Visual Language Model for Anomaly Detection via Fact Enhancement and Entropy-aware Alignment

Guoxin Zang,Xue Li,Donglin Di,Lanshun Nie,Dechen Zhan,Yang Song,Lei Fan

Main category: cs.CL

TL;DR: This paper introduces SAGE, a Vision-Language Model-based framework designed to improve industrial anomaly detection and reasoning by integrating domain-specific knowledge and aligning model outputs with expert preferences.

Details

Motivation: Vision-Language Models (VLMs) often struggle in industrial anomaly detection and reasoning due to their inability to deliver interpretable explanations and generalize to unseen categories. This limitation stems from the inherently domain-specific nature of anomaly detection, which hinders the applicability of existing VLMs in industrial scenarios requiring precise, structured, and context-aware analysis. Method: The paper proposes SAGE, a Vision-Language Model (VLM)-based framework that enhances anomaly reasoning through Self-Guided Fact Enhancement (SFE) and Entropy-aware Direct Preference Optimization (E-DPO). SFE integrates domain-specific knowledge into visual reasoning via fact extraction and fusion, while E-DPO aligns model outputs with expert preferences using entropy-aware optimization. Additionally, the authors introduce AD-PL, a preference-optimized dataset for industrial anomaly reasoning, and develop Multiscale Logical Evaluation (MLE), a quantitative framework to analyze model logic and consistency. Result: The proposed SAGE framework shows superior performance on industrial anomaly datasets in zero-shot and one-shot settings. Conclusion: SAGE demonstrates superior performance on industrial anomaly datasets under zero-shot and one-shot settings. Abstract: While Vision-Language Models (VLMs) have shown promising progress in general multimodal tasks, they often struggle in industrial anomaly detection and reasoning, particularly in delivering interpretable explanations and generalizing to unseen categories. This limitation stems from the inherently domain-specific nature of anomaly detection, which hinders the applicability of existing VLMs in industrial scenarios that require precise, structured, and context-aware analysis. To address these challenges, we propose SAGE, a VLM-based framework that enhances anomaly reasoning through Self-Guided Fact Enhancement (SFE) and Entropy-aware Direct Preference Optimization (E-DPO). 
SFE integrates domain-specific knowledge into visual reasoning via fact extraction and fusion, while E-DPO aligns model outputs with expert preferences using entropy-aware optimization. Additionally, we introduce AD-PL, a preference-optimized dataset tailored for industrial anomaly reasoning, consisting of 28,415 question-answering instances with expert-ranked responses. To evaluate anomaly reasoning models, we develop Multiscale Logical Evaluation (MLE), a quantitative framework analyzing model logic and consistency. SAGE demonstrates superior performance on industrial anomaly datasets under zero-shot and one-shot settings. The code, model and dataset are available at https://github.com/amoreZgx1n/SAGE.

[46] MIRIX: Multi-Agent Memory System for LLM-Based Agents

Yu Wang,Xi Chen

Main category: cs.CL

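TL;DR: MIRIX is an AI memory system that combines multiple memory types with a multi-agent framework to substantially improve the persistence and accuracy of AI agents' memory.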

Details

Motivation: The memory of existing AI agents is limited by flat, narrowly scoped memory components, which cannot effectively personalize, abstract, or recall user-specific information over the long term. Method: The authors developed MIRIX, a system with six distinct memory types, coupled with a multi-agent framework that dynamically controls and coordinates updates and retrieval. Result: On ScreenshotVQA, MIRIX achieves 35% higher accuracy than the RAG baseline while reducing storage requirements by 99.9%; on LOCOMO, it reaches state-of-the-art performance of 85.4%. Conclusion: With its modular, multi-agent memory design, MIRIX markedly improves the memory capabilities of AI agents and shows strong performance on multiple benchmarks. Abstract: Although memory capabilities of AI agents are gaining increasing attention, existing solutions remain fundamentally limited. Most rely on flat, narrowly scoped memory components, constraining their ability to personalize, abstract, and reliably recall user-specific information over time. To this end, we introduce MIRIX, a modular, multi-agent memory system that redefines the future of AI memory by solving the field's most critical challenge: enabling language models to truly remember. Unlike prior approaches, MIRIX transcends text to embrace rich visual and multimodal experiences, making memory genuinely useful in real-world scenarios. MIRIX consists of six distinct, carefully structured memory types: Core, Episodic, Semantic, Procedural, Resource Memory, and Knowledge Vault, coupled with a multi-agent framework that dynamically controls and coordinates updates and retrieval. This design enables agents to persist, reason over, and accurately retrieve diverse, long-term user data at scale. We validate MIRIX in two demanding settings. First, on ScreenshotVQA, a challenging multimodal benchmark comprising nearly 20,000 high-resolution computer screenshots per sequence, requiring deep contextual understanding and where no existing memory systems can be applied, MIRIX achieves 35% higher accuracy than the RAG baseline while reducing storage requirements by 99.9%. Second, on LOCOMO, a long-form conversation benchmark with single-modal textual input, MIRIX attains state-of-the-art performance of 85.4%, far surpassing existing baselines. These results show that MIRIX sets a new performance standard for memory-augmented LLM agents. 
To allow users to experience our memory system, we provide a packaged application powered by MIRIX. It monitors the screen in real time, builds a personalized memory base, and offers intuitive visualization and secure local storage to ensure privacy.

[47] Why is Your Language Model a Poor Implicit Reward Model?

Noam Razin,Yong Lin,Jiarui Yao,Sanjeev Arora

Main category: cs.CL

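TL;DR: This paper finds that implicit reward models (IM-RMs) generalize worse than explicit reward models (EX-RMs) because IM-RMs rely more heavily on superficial cues, showing that subtle design differences can substantially affect reward model performance.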

Details

Motivation: To study the generalization gap between IM-RMs and EX-RMs in order to understand the implicit biases of different reward model types. Method: Theoretical analysis and experiments investigate the implicit biases of IM-RMs and EX-RMs and the root cause of the generalization gap, and test alternative hypotheses. Result: IM-RMs rely more heavily on superficial token-level cues, so they often generalize worse than EX-RMs under token-level distribution shifts as well as in-distribution. The paper also challenges the common view that IM-RMs struggle in tasks where generation is harder than verification. Conclusion: The generalization gap between IM-RMs and EX-RMs shows that seemingly minor design choices can substantially affect the generalization behavior of reward models. Abstract: Reward models are key to language model post-training and inference pipelines. Conveniently, recent work showed that every language model defines an implicit reward model (IM-RM), without requiring any architectural changes. However, such IM-RMs tend to generalize worse, especially out-of-distribution, compared to explicit reward models (EX-RMs) that apply a dedicated linear head over the hidden representations of a language model. The existence of a generalization gap is puzzling, as EX-RMs and IM-RMs are nearly identical. They can be trained using the same data, loss function, and language model, and differ only in how the reward is computed. Towards a fundamental understanding of the implicit biases underlying different reward model types, we investigate the root cause of this gap. Our main finding, backed by theory and experiments, is that IM-RMs rely more heavily on superficial token-level cues. Consequently, they often generalize worse than EX-RMs under token-level distribution shifts, as well as in-distribution. Furthermore, we provide evidence against alternative hypotheses for the generalization gap. Most notably, we challenge the intuitive claim that IM-RMs struggle in tasks where generation is harder than verification because they can operate both as a verifier and a generator. Taken together, our results highlight that seemingly minor design choices can substantially impact the generalization behavior of reward models.
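The abstract notes that the two reward model types "differ only in how the reward is computed". The sketch below illustrates that difference on toy numbers. It is not the authors' code: the EX-RM is shown as a plain dot product between a hidden vector and head weights, and the IM-RM uses the DPO-style log-probability ratio against a reference model scaled by `beta`, which is an assumption here since the abstract does not spell out the formula. All function names and values are hypothetical.

```python
def explicit_reward(hidden, head_weights):
    """EX-RM sketch: a linear head applied to a hidden representation."""
    return sum(h * w for h, w in zip(hidden, head_weights))


def implicit_reward(token_logprobs, ref_token_logprobs, beta=0.1):
    """IM-RM sketch (assumed DPO-style form):
    beta * (log pi(y|x) - log pi_ref(y|x)), summed over response tokens."""
    log_ratio = sum(token_logprobs) - sum(ref_token_logprobs)
    return beta * log_ratio


# Toy inputs (made up for illustration only).
hidden = [0.5, -1.0, 2.0]          # final hidden state of the response
head = [0.2, 0.1, 0.3]             # weights of the dedicated linear head
policy_lp = [-0.2, -0.5, -0.1]     # per-token log-probs under the model
ref_lp = [-0.4, -0.9, -0.3]        # per-token log-probs under the reference

ex = explicit_reward(hidden, head)
im = implicit_reward(policy_lp, ref_lp, beta=0.1)
print(f"EX-RM reward: {ex:.3f}, IM-RM reward: {im:.3f}")
```

Under this assumed form, the implicit reward is built directly from per-token probabilities, which makes it plausible that it is more sensitive to token-level changes than a head reading a pooled hidden state. That intuition is consistent with the abstract's main finding, though the paper's own analysis is what establishes it.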

[48] Performance and Practical Considerations of Large and Small Language Models in Clinical Decision Support in Rheumatology

Sabine Felde,Rüdiger Buchkremer,Gamal Chehab,Christian Thielscher,Jörg HW Distler,Matthias Schneider,Jutta G. Richter

Main category: cs.CL

TL;DR: Smaller language models with RAG offer efficient, cost-effective support for rheumatology decision-making but still need expert oversight.

Details

Motivation: To identify more energy-efficient and cost-effective models for clinical decision-making in complex fields like rheumatology while maintaining performance. Method: Evaluation of smaller and larger language models in clinical decision-making tasks related to rheumatology, comparing performance, energy use, and deployment efficiency. Result: Smaller language models with retrieval-augmented generation outperformed larger models in diagnostic and therapeutic tasks while being more energy and cost-efficient. Conclusion: Smaller language models combined with retrieval-augmented generation are more efficient and cost-effective for clinical decision-making in rheumatology but require expert oversight. Abstract: Large language models (LLMs) show promise for supporting clinical decision-making in complex fields such as rheumatology. Our evaluation shows that smaller language models (SLMs), combined with retrieval-augmented generation (RAG), achieve higher diagnostic and therapeutic performance than larger models, while requiring substantially less energy and enabling cost-efficient, local deployment. These features are attractive for resource-limited healthcare. However, expert oversight remains essential, as no model consistently reached specialist-level accuracy in rheumatology.

[49] Automating Expert-Level Medical Reasoning Evaluation of Large Language Models

Shuang Zhou,Wenya Xie,Jiaxi Li,Zaifu Zhan,Meijia Song,Han Yang,Cheyenna Espinoza,Lindsay Welton,Xinnie Mai,Yanwei Jin,Zidu Xu,Yuen-Hei Chung,Yiyun Xing,Meng-Han Tsai,Emma Schaffer,Yucheng Shi,Ninghao Liu,Zirui Liu,Rui Zhang

Main category: cs.CL

TL;DR: This paper introduces MedThink-Bench, a new benchmark for assessing the medical reasoning of large language models, along with an evaluation framework called LLM-w-Ref that effectively aligns with expert judgment and shows promise for scalable application.

Details

Motivation: As large language models become increasingly integrated into clinical decision-making, ensuring transparent and trustworthy reasoning is essential. However, existing evaluation strategies of LLMs' medical reasoning capability either suffer from unsatisfactory assessment or poor scalability, and a rigorous benchmark remains lacking. Method: The paper introduces MedThink-Bench, a benchmark designed for rigorous, explainable, and scalable assessment of LLMs' medical reasoning. It also proposes LLM-w-Ref, a novel evaluation framework that leverages fine-grained rationales and LLM-as-a-Judge mechanisms to assess intermediate reasoning with expert-level fidelity while maintaining scalability. Result: Experiments show that LLM-w-Ref exhibits a strong positive correlation with expert judgments. Benchmarking twelve state-of-the-art LLMs, the study finds that smaller models (e.g., MedGemma-27B) can surpass larger proprietary counterparts (e.g., OpenAI-o3). Conclusion: MedThink-Bench offers a foundational tool for evaluating LLMs' medical reasoning, advancing their safe and responsible deployment in clinical practice. Abstract: As large language models (LLMs) become increasingly integrated into clinical decision-making, ensuring transparent and trustworthy reasoning is essential. However, existing evaluation strategies of LLMs' medical reasoning capability either suffer from unsatisfactory assessment or poor scalability, and a rigorous benchmark remains lacking. To address this, we introduce MedThink-Bench, a benchmark designed for rigorous, explainable, and scalable assessment of LLMs' medical reasoning. MedThink-Bench comprises 500 challenging questions across ten medical domains, each annotated with expert-crafted step-by-step rationales. 
Building on this, we propose LLM-w-Ref, a novel evaluation framework that leverages fine-grained rationales and LLM-as-a-Judge mechanisms to assess intermediate reasoning with expert-level fidelity while maintaining scalability. Experiments show that LLM-w-Ref exhibits a strong positive correlation with expert judgments. Benchmarking twelve state-of-the-art LLMs, we find that smaller models (e.g., MedGemma-27B) can surpass larger proprietary counterparts (e.g., OpenAI-o3). Overall, MedThink-Bench offers a foundational tool for evaluating LLMs' medical reasoning, advancing their safe and responsible deployment in clinical practice.

[50] PyVision: Agentic Vision with Dynamic Tooling

Shitian Zhao,Haoquan Zhang,Shaoheng Lin,Ming Li,Qilong Wu,Kaipeng Zhang,Chen Wei

Main category: cs.CL

TL;DR: This paper introduces PyVision, a framework that enables MLLMs to dynamically generate and use task-specific Python-based tools for visual reasoning, leading to improved performance and greater flexibility in problem-solving.

Details

Motivation: Prior approaches in visual reasoning remain limited by predefined workflows and static toolsets. The motivation is to enable more flexible and interpretable problem-solving by allowing models to dynamically create and use tools tailored to the task at hand. Method: The authors developed PyVision, an interactive, multi-turn framework that allows MLLMs to generate and use task-specific Python-based tools dynamically. They also created a taxonomy of these tools and analyzed their usage across various benchmarks. Result: Quantitatively, PyVision boosts GPT-4.1 by +7.8% on V* and Claude-4.0-Sonnet by +31.1% on VLMsAreBlind-mini, demonstrating consistent performance gains through dynamic tooling. Conclusion: PyVision enables MLLMs to autonomously generate, execute and refine Python-based tools for visual reasoning tasks, achieving performance gains on benchmarks such as V* and VLMsAreBlind-mini. This represents a shift towards dynamic tooling that allows models to invent tools for more agentic visual reasoning. Abstract: LLMs are increasingly deployed as agents, systems capable of planning, reasoning, and dynamically calling external tools. However, in visual reasoning, prior approaches largely remain limited by predefined workflows and static toolsets. In this report, we present PyVision, an interactive, multi-turn framework that enables MLLMs to autonomously generate, execute, and refine Python-based tools tailored to the task at hand, unlocking flexible and interpretable problem-solving. We develop a taxonomy of the tools created by PyVision and analyze their usage across a diverse set of benchmarks. Quantitatively, PyVision achieves consistent performance gains, boosting GPT-4.1 by +7.8% on V* and Claude-4.0-Sonnet by +31.1% on VLMsAreBlind-mini. These results point to a broader shift: dynamic tooling allows models not just to use tools, but to invent them, advancing toward more agentic visual reasoning.

cs.CV [Back]

[51] Multi-level Mixture of Experts for Multimodal Entity Linking

Zhiwei Hu,Víctor Gutiérrez-Basulto,Zhiliang Xiang,Ru Li,Jeff Z. Pan

Main category: cs.CV

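TL;DR: This paper proposes a new Multi-level Mixture of Experts (MMoE) model for multimodal entity linking that mitigates mention ambiguity and dynamically selects modal information.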

Details

Motivation: Existing MEL methods do not address mention ambiguity or the need to dynamically distinguish the importance of different parts of modal information. Method: A Multi-level Mixture of Experts (MMoE) mechanism comprising a description-aware mention enhancement module, a multimodal feature extraction module, and two mixture of experts modules. Result: Experiments show that MMoE outperforms state-of-the-art methods. Conclusion: The proposed MMoE model performs well on the MEL task and addresses both mention ambiguity and the dynamic selection of modal content. Abstract: Multimodal Entity Linking (MEL) aims to link ambiguous mentions within multimodal contexts to associated entities in a multimodal knowledge base. Existing approaches to MEL introduce multimodal interaction and fusion mechanisms to bridge the modality gap and enable multi-grained semantic matching. However, they do not address two important problems: (i) mention ambiguity, i.e., the lack of semantic content caused by the brevity and omission of key information in the mention's textual context; (ii) dynamic selection of modal content, i.e., to dynamically distinguish the importance of different parts of modal information. To mitigate these issues, we propose a Multi-level Mixture of Experts (MMoE) model for MEL. MMoE has four components: (i) the description-aware mention enhancement module leverages large language models to identify the WikiData descriptions that best match a mention, considering the mention's textual context; (ii) the multimodal feature extraction module adopts multimodal feature encoders to obtain textual and visual embeddings for both mentions and entities; (iii)-(iv) the intra-level mixture of experts and inter-level mixture of experts modules apply a switch mixture of experts mechanism to dynamically and adaptively select features from relevant regions of information. Extensive experiments demonstrate the outstanding performance of MMoE compared to the state-of-the-art. MMoE's code is available at: https://github.com/zhiweihu1103/MEL-MMoE.

[52] CoPT: Unsupervised Domain Adaptive Segmentation using Domain-Agnostic Text Embeddings

Cristina Mata,Kanchana Ranasinghe,Michael S. Ryoo

Main category: cs.CV

TL;DR: This paper proposes CoPT, a novel method for unsupervised domain adaptation in image segmentation that leverages text embeddings to achieve state-of-the-art performance.

Details

Motivation: The motivation is to improve unsupervised domain adaptation (UDA) for semantic segmentation by leveraging domain-agnostic properties of text, which previous UDA methods have not effectively utilized. Method: The paper introduces a Covariance-based Pixel-Text loss (CoPT) using domain-agnostic text embeddings to learn domain-invariant features. The text embeddings are generated with an LLM Domain Template process involving a frozen CLIP model. Result: The experiments on four benchmarks show that the CoPT method achieves new state-of-the-art results in UDA for segmentation. Conclusion: The paper concludes that the proposed CoPT method achieves state-of-the-art performance in unsupervised domain adaptation for image segmentation. Abstract: Unsupervised domain adaptation (UDA) involves learning class semantics from labeled data within a source domain that generalize to an unseen target domain. UDA methods are particularly impactful for semantic segmentation, where annotations are more difficult to collect than in image classification. Despite recent advances in large-scale vision-language representation learning, UDA methods for segmentation have not taken advantage of the domain-agnostic properties of text. To address this, we present a novel Covariance-based Pixel-Text loss, CoPT, that uses domain-agnostic text embeddings to learn domain-invariant features in an image segmentation encoder. The text embeddings are generated through our LLM Domain Template process, where an LLM is used to generate source and target domain descriptions that are fed to a frozen CLIP model and combined. In experiments on four benchmarks we show that a model trained using CoPT achieves the new state of the art performance on UDA for segmentation. The code can be found at https://github.com/cfmata/CoPT.

Renyang Liu,Guanlin Li,Tianwei Zhang,See-Kiong Ng

Main category: cs.CV

TL;DR: This paper introduces 'Recall,' an adversarial framework that exposes vulnerabilities in current unlearning techniques for image generation models, highlighting the need for improved robustness.

Details

Motivation: To address the lack of exploration into the robustness of unlearning techniques against adversarial inputs in image generation models. Method: The authors introduced 'Recall,' an adversarial framework that optimizes image prompts using a semantically relevant reference image, targeting the multi-modal conditioning of diffusion models. Result: Extensive experiments showed that Recall outperforms existing methods in adversarial effectiveness, computational efficiency, and fidelity to the original prompt. Conclusion: The study concludes that existing unlearning methods are vulnerable to multi-modal adversarial attacks, emphasizing the need for more robust solutions in generative models. Abstract: Recent advances in image generation models (IGMs), particularly diffusion-based architectures such as Stable Diffusion (SD), have markedly enhanced the quality and diversity of AI-generated visual content. However, their generative capability has also raised significant ethical, legal, and societal concerns, including the potential to produce harmful, misleading, or copyright-infringing content. To mitigate these concerns, machine unlearning (MU) emerges as a promising solution by selectively removing undesirable concepts from pretrained models. Nevertheless, the robustness and effectiveness of existing unlearning techniques remain largely unexplored, particularly in the presence of multi-modal adversarial inputs. To bridge this gap, we propose Recall, a novel adversarial framework explicitly designed to compromise the robustness of unlearned IGMs. Unlike existing approaches that predominantly rely on adversarial text prompts, Recall exploits the intrinsic multi-modal conditioning capabilities of diffusion models by efficiently optimizing adversarial image prompts with guidance from a single semantically relevant reference image. 
Extensive experiments across ten state-of-the-art unlearning methods and diverse tasks show that Recall consistently outperforms existing baselines in terms of adversarial effectiveness, computational efficiency, and semantic fidelity with the original textual prompt. These findings reveal critical vulnerabilities in current unlearning mechanisms and underscore the need for more robust solutions to ensure the safety and reliability of generative models. Code and data are publicly available at https://github.com/ryliu68/RECALL.

[54] Explainable Artificial Intelligence in Biomedical Image Analysis: A Comprehensive Survey

Getamesay Haile Dagnaw,Yanming Zhu,Muhammad Hassan Maqsood,Wencheng Yang,Xingshuai Dong,Xuefei Yin,Alan Wee-Chung Liew

Main category: cs.CV

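TL;DR: This paper is a survey of explainable artificial intelligence (XAI) in biomedical image analysis that emphasizes a modality-aware perspective and examines the emerging role of multimodal learning and vision-language models in explainable biomedical AI.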

Details

Motivation: Although several surveys have reviewed XAI techniques, they often lack a modality-aware perspective, overlook recent advances in multimodal and vision-language paradigms, and provide limited practical guidance. Method: The paper systematically categorizes XAI methods and analyzes their underlying principles, strengths, and limitations in biomedical contexts. It also proposes a modality-centered taxonomy that aligns XAI methods with specific imaging types and discusses the use of multimodal learning and vision-language models in the field. Result: The contributions include a summary of widely used evaluation metrics and open-source frameworks, a modality-centered XAI taxonomy, and a discussion of the distinct interpretability challenges across modalities and of future directions. Conclusion: The survey offers a timely and in-depth foundation for advancing interpretable deep learning in biomedical image analysis. Abstract: Explainable artificial intelligence (XAI) has become increasingly important in biomedical image analysis to promote transparency, trust, and clinical adoption of DL models. While several surveys have reviewed XAI techniques, they often lack a modality-aware perspective, overlook recent advances in multimodal and vision-language paradigms, and provide limited practical guidance. This survey addresses this gap through a comprehensive and structured synthesis of XAI methods tailored to biomedical image analysis. We systematically categorize XAI methods, analyzing their underlying principles, strengths, and limitations within biomedical contexts. A modality-centered taxonomy is proposed to align XAI methods with specific imaging types, highlighting the distinct interpretability challenges across modalities. We further examine the emerging role of multimodal learning and vision-language models in explainable biomedical AI, a topic largely underexplored in previous work. Our contributions also include a summary of widely used evaluation metrics and open-source frameworks, along with a critical discussion of persistent challenges and future directions. This survey offers a timely and in-depth foundation for advancing interpretable DL in biomedical image analysis.

[55] Robust Multimodal Large Language Models Against Modality Conflict

Zongmeng Zhang,Wengang Zhou,Jie Zhao,Houqiang Li

Main category: cs.CV

TL;DR: This paper explores how modality conflict in multimodal inputs causes hallucinations in MLLMs and proposes methods to reduce these errors, with reinforcement learning showing the best results.

Details

Motivation: MLLMs often hallucinate in real-world scenarios due to modality conflict, an issue that has not been sufficiently explored from the perspective of inherent input conflicts. Method: The authors constructed a dataset called Multimodal Modality Conflict (MMMC) and proposed three methods—prompt engineering, supervised fine-tuning, and reinforcement learning—to address hallucinations caused by modality conflict. Result: Reinforcement learning was most effective in mitigating hallucinations caused by modality conflict, while supervised fine-tuning showed stable and promising performance. Conclusion: The study highlights the issue of modality conflict leading to hallucinations in MLLMs and evaluates methods to mitigate this, with reinforcement learning showing the best results. Abstract: Despite the impressive capabilities of multimodal large language models (MLLMs) in vision-language tasks, they are prone to hallucinations in real-world scenarios. This paper investigates the hallucination phenomenon in MLLMs from the perspective of modality conflict. Unlike existing works focusing on the conflicts between model responses and inputs, we study the inherent conflicts in inputs from different modalities that place MLLMs in a dilemma and directly lead to hallucinations. We formally define the modality conflict and construct a dataset named Multimodal Modality Conflict (MMMC) to simulate this phenomenon in vision-language tasks. Three methods based on prompt engineering, supervised fine-tuning, and reinforcement learning are proposed to alleviate the hallucination caused by modality conflict. Extensive experiments are conducted on the MMMC dataset to analyze the merits and demerits of these methods. Our results show that the reinforcement learning method achieves the best performance in mitigating the hallucination under modality conflict, while the supervised fine-tuning method shows promising and stable performance. 
Our work sheds light on the unnoticed modality conflict that leads to hallucinations and provides more insights into the robustness of MLLMs.

[56] Aerial Maritime Vessel Detection and Identification

Antonella Barisic Kulas,Frano Petric,Stjepan Bogdan

Main category: cs.CV

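TL;DR: This paper studies autonomous maritime surveillance and target vessel identification in environments where GNSS signals are unavailable; it uses the YOLOv8 model for detection and confirms and localizes the target through feature matching and hue histogram distance analysis.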

Details

Motivation: Autonomous maritime surveillance and target vessel identification without Global Navigation Satellite System (GNSS) signals is critical for applications such as search and rescue and threat detection. Method: The YOLOv8 object detection model detects all vessels in the field of view, and feature matching and hue histogram distance analysis determine whether any detected vessel corresponds to the target. Result: The method was demonstrated in real-world experiments during the MBZIRC2023 competition, integrated into a fully autonomous system with GNSS-denied navigation; the study also evaluates the impact of perspective on detection accuracy and localization precision and compares it with the oracle approach. Conclusion: Combining YOLOv8 object detection with feature matching and hue histogram distance analysis enables effective identification and localization of a target vessel in GNSS-denied environments. Abstract: Autonomous maritime surveillance and target vessel identification in environments where Global Navigation Satellite Systems (GNSS) are not available is critical for a number of applications such as search and rescue and threat detection. When the target vessel is only described by visual cues and its last known position is not available, unmanned aerial vehicles (UAVs) must rely solely on on-board vision to scan a large search area under strict computational constraints. To address this challenge, we leverage the YOLOv8 object detection model to detect all vessels in the field of view. We then apply feature matching and hue histogram distance analysis to determine whether any detected vessel corresponds to the target. When found, we localize the target using simple geometric principles. We demonstrate the proposed method in real-world experiments during the MBZIRC2023 competition, integrated into a fully autonomous system with GNSS-denied navigation. We also evaluate the impact of perspective on detection accuracy and localization precision and compare it with the oracle approach.
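To make the "hue histogram distance" step concrete, here is a minimal sketch of how two image crops could be compared by hue. It is an illustration under stated assumptions, not the authors' pipeline: the abstract does not say which bin count, distance, or threshold is used, so the 12 bins, the Hellinger distance, and the `0.3` threshold are all hypothetical choices, and crops are represented as plain lists of RGB tuples rather than real detector output.

```python
import colorsys
import math


def hue_histogram(pixels, bins=12):
    """Normalized histogram of hue for a list of (r, g, b) pixels in 0..255."""
    counts = [0] * bins
    for r, g, b in pixels:
        h, _, _ = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        counts[min(int(h * bins), bins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts]


def hellinger_distance(p, q):
    """Hellinger distance between two normalized histograms (0 = identical)."""
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))  # Bhattacharyya coefficient
    return math.sqrt(max(0.0, 1.0 - bc))


def matches_target(candidate_pixels, target_pixels, threshold=0.3):
    """Hypothetical decision rule: accept if the hue distance is small."""
    d = hellinger_distance(hue_histogram(candidate_pixels),
                           hue_histogram(target_pixels))
    return d <= threshold


# Toy crops: a red target, a slightly different red vessel, and a blue vessel.
target = [(220, 40, 40)] * 50
red_vessel = [(200, 30, 30)] * 50
blue_vessel = [(30, 30, 200)] * 50

d_red = hellinger_distance(hue_histogram(red_vessel), hue_histogram(target))
d_blue = hellinger_distance(hue_histogram(blue_vessel), hue_histogram(target))
print(matches_target(red_vessel, target), matches_target(blue_vessel, target))
```

Comparing hue rather than raw RGB makes the check less sensitive to brightness changes, which is one plausible reason to pair it with feature matching in an outdoor aerial setting.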

[57] CL-Polyp: A Contrastive Learning-Enhanced Network for Accurate Polyp Segmentation

Desheng Li,Chaoliang Liu,Zhiyong Xiao

Main category: cs.CV

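TL;DR: This paper introduces a new contrastive learning-enhanced polyp segmentation network for more accurate segmentation of polyps from colonoscopy images.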

Details

Motivation: Accurate polyp segmentation from colonoscopy images supports early diagnosis and treatment of colorectal cancer, but existing deep learning methods often require additional labeled data and rely on task similarity, which can limit their generalizability. Method: The authors propose CL-Polyp, a contrastive learning-enhanced polyp segmentation network. It uses contrastive learning to improve the encoder's ability to extract discriminative features by contrasting positive and negative sample pairs derived from polyp images. It also introduces two lightweight and effective modules: the Modified Atrous Spatial Pyramid Pooling (MASPP) module and the Channel Concatenate and Element Add (CA) module. Result: Extensive experiments show that CL-Polyp consistently outperforms state-of-the-art methods on five benchmark datasets. Specifically, it improves the IoU metric by 0.011 and 0.020 on the Kvasir-SEG and CVC-ClinicDB datasets, respectively. Conclusion: CL-Polyp performs well on clinical polyp segmentation tasks, which demonstrates its effectiveness. Abstract: Accurate segmentation of polyps from colonoscopy images is crucial for the early diagnosis and treatment of colorectal cancer. Most existing deep learning-based polyp segmentation methods adopt an Encoder-Decoder architecture, and some utilize multi-task frameworks that incorporate auxiliary tasks such as classification to enhance segmentation performance. However, these approaches often require additional labeled data and rely on task similarity, which can limit their generalizability. To address these challenges, we propose CL-Polyp, a contrastive learning-enhanced polyp segmentation network. Our method leverages contrastive learning to improve the encoder's ability to extract discriminative features by contrasting positive and negative sample pairs derived from polyp images. This self-supervised strategy enhances visual representation without requiring additional annotations. In addition, we introduce two lightweight and effective modules: the Modified Atrous Spatial Pyramid Pooling (MASPP) module for better multi-scale feature fusion, and the Channel Concatenate and Element Add (CA) module to fuse low-level and upsampled features for improved boundary reconstruction. Extensive experiments on five benchmark datasets (Kvasir-SEG, CVC-ClinicDB, CVC-ColonDB, CVC-300, and ETIS) demonstrate that CL-Polyp consistently outperforms state-of-the-art methods. Specifically, it improves the IoU metric by 0.011 and 0.020 on the Kvasir-SEG and CVC-ClinicDB datasets, respectively, validating its effectiveness in clinical polyp segmentation tasks.
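The reported gains are stated in IoU (intersection over union), so a small worked example may help put a change of 0.011 or 0.020 in context. The sketch below computes IoU for binary masks given as nested lists. The masks are made up, and the convention of returning 1.0 when both masks are empty is an assumption, since evaluation scripts differ on that edge case.

```python
def iou(pred, target):
    """Intersection over union of two binary masks (nested lists of 0/1)."""
    inter = 0
    union = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


# Toy 2x3 masks: 2 pixels overlap, 4 pixels are covered by either mask.
pred_mask = [[0, 1, 1],
             [0, 1, 0]]
true_mask = [[0, 1, 0],
             [0, 1, 1]]

score = iou(pred_mask, true_mask)
print(f"IoU = {score:.3f}")
```

Because IoU lies in [0, 1], an improvement of 0.011 corresponds to 1.1 percentage points.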

[58] Interpretable EEG-to-Image Generation with Semantic Prompts

Arshak Rezvani,Ali Akbari,Kosar Sanjar Arani,Maryam Mirian,Emad Arasteh,Martin J. McKeown

Main category: cs.CV

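TL;DR: This study proposes a new method that reconstructs visual experience by decoding EEG signals through multilevel semantic captions, using a large language model and a pretrained diffusion model, and achieves state-of-the-art visual decoding on the EEGCVPR dataset.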

Details

Motivation: Although electroencephalography (EEG) offers high temporal resolution and is accessible, its limited spatial detail hinders image reconstruction directly from EEG signals. This study aims to overcome that limitation and explore a more effective approach to EEG visual decoding. Method: The authors adopt a text-mediated framework: a large language model first generates multilevel semantic captions of the perceived images (from object level to abstract themes), and a Transformer-based EEG encoder is then trained with contrastive learning to align EEG signals with these captions. During inference, caption embeddings retrieved via projection heads condition a pretrained latent diffusion model for image generation. Result: The model achieves state-of-the-art visual decoding performance on the EEGCVPR dataset, and saliency maps and t-SNE projections reveal semantic topography across the scalp. The results also indicate the importance of different semantic levels in EEG-caption associations. Conclusion: Structured semantic mediation enables cognitively aligned visual decoding from EEG, opening new possibilities for neuroscience and interpretable AI. Abstract: Decoding visual experience from brain signals offers exciting possibilities for neuroscience and interpretable AI. While EEG is accessible and temporally precise, its limitations in spatial detail hinder image reconstruction. Our model bypasses direct EEG-to-image generation by aligning EEG signals with multilevel semantic captions -- ranging from object-level to abstract themes -- generated by a large language model. A transformer-based EEG encoder maps brain activity to these captions through contrastive learning. During inference, caption embeddings retrieved via projection heads condition a pretrained latent diffusion model for image generation. This text-mediated framework yields state-of-the-art visual decoding on the EEGCVPR dataset, with interpretable alignment to known neurocognitive pathways. Dominant EEG-caption associations reflected the importance of different semantic levels extracted from perceived images. Saliency maps and t-SNE projections reveal semantic topography across the scalp. Our model demonstrates how structured semantic mediation enables cognitively aligned visual decoding from EEG.
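The abstract says the EEG encoder is aligned with caption embeddings through contrastive learning but does not name the loss. As a hedged illustration only, the sketch below uses an InfoNCE-style objective, which is a common choice for this kind of alignment and is an assumption here. The function `info_nce`, the similarity-matrix input, and the temperature value are hypothetical and not taken from the paper.

```python
import math


def info_nce(sim, temperature=0.07):
    """InfoNCE-style loss over a square similarity matrix.

    sim[i][j] is the similarity between EEG embedding i and caption
    embedding j; matched pairs are assumed to sit on the diagonal.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        logits = [s / temperature for s in sim[i]]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]  # negative log-probability of the matched caption
    return total / n


aligned = info_nce([[1.0, 0.0], [0.0, 1.0]])     # matched pairs score highest
misaligned = info_nce([[0.0, 1.0], [1.0, 0.0]])  # matched pairs score lowest
uninformative = info_nce([[0.0, 0.0], [0.0, 0.0]])
```

The loss is small when each EEG embedding is most similar to its own caption and equals log(n) when the similarities carry no information.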

[59] A Survey on Long-Video Storytelling Generation: Architectures, Consistency, and Cinematic Quality

Mohamed Elmoghany,Ryan Rossi,Seunghyun Yoon,Subhojyoti Mukherjee,Eslam Bakr,Puneet Mathur,Gang Wu,Viet Dac Lai,Nedim Lipka,Ruiyi Zhang,Varun Manjunatha,Chien Nguyen,Daksh Dangi,Abel Salinas,Mohammad Taesiri,Hongjie Chen,Xiaolei Huang,Joe Barrow,Nesreen Ahmed,Hoda Eldardiry,Namyong Park,Yu Wang,Jaemin Cho,Anh Totti Nguyen,Zhengzhong Tu,Thien Nguyen,Dinesh Manocha,Mohamed Elhoseiny,Franck Dernoncourt

Main category: cs.CV

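TL;DR: This study summarizes the shortcomings of existing video generation models and proposes a new taxonomy that can help develop more efficient long-video generation techniques.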

Details

Motivation: Current state-of-the-art video generation models can only produce videos lasting 5-16 seconds and struggle to keep character appearances and scene layouts consistent in longer videos; multi-subject long videos also remain problematic. Method: The authors survey the literature on existing video generation models, construct a new taxonomy, and categorize and compare papers by architectural design and performance characteristics. Result: The survey identifies the key factors for generating long videos with multiple characters, narrative coherence, and high-fidelity detail, and establishes a systematic framework for classifying and comparing methods. Conclusion: The study comprehensively analyzes 32 video generation papers, identifies the key architectural components and training strategies that consistently yield high-quality long videos, and proposes a new taxonomy of methods. Abstract: Despite the significant progress that has been made in video generative models, existing state-of-the-art methods can only produce videos lasting 5-16 seconds, often labeled "long-form videos". Furthermore, videos exceeding 16 seconds struggle to maintain consistent character appearances and scene layouts throughout the narrative. In particular, multi-subject long videos still fail to preserve character consistency and motion coherence. While some methods can generate videos up to 150 seconds long, they often suffer from frame redundancy and low temporal diversity. Recent work has attempted to produce long-form videos featuring multiple characters, narrative coherence, and high-fidelity detail. We comprehensively studied 32 papers on video generation to identify key architectural components and training strategies that consistently yield these qualities. We also construct a comprehensive novel taxonomy of existing methods and present comparative tables that categorize papers by their architectural designs and performance characteristics.

[60] Colors See Colors Ignore: Clothes Changing ReID with Color Disentanglement

Priyank Pathak,Yogesh S. Rawat

Main category: cs.CV

TL;DR: CSCI addresses clothes-changing person re-identification by using color cues to separate identity features from appearance bias, achieving strong results without additional supervision.

Details

Motivation: Existing CC-ReID methods rely on additional models or annotations, which are resource-intensive. CSCI aims to use color as an annotation-free proxy to mitigate appearance bias efficiently. Method: CSCI uses foreground and background color cues to disentangle identity-relevant features from clothing-related appearance bias through S2A self-attention mechanism. Result: CSCI improves the baseline performance by 2.9% on LTCC and 5.0% on PRCC for image-based ReID, and 1.0% on CCVID and 2.5% on MeVID for video-based ReID without extra supervision. Conclusion: CSCI is a lightweight and effective method for addressing appearance bias in CC-ReID by leveraging color information without additional supervision. Abstract: Clothes-Changing Re-Identification (CC-ReID) aims to recognize individuals across different locations and times, irrespective of clothing. Existing methods often rely on additional models or annotations to learn robust, clothing-invariant features, making them resource-intensive. In contrast, we explore the use of color - specifically foreground and background colors - as a lightweight, annotation-free proxy for mitigating appearance bias in ReID models. We propose Colors See, Colors Ignore (CSCI), an RGB-only method that leverages color information directly from raw images or video frames. CSCI efficiently captures color-related appearance bias ('Color See') while disentangling it from identity-relevant ReID features ('Color Ignore'). To achieve this, we introduce S2A self-attention, a novel self-attention to prevent information leak between color and identity cues within the feature space. Our analysis shows a strong correspondence between learned color embeddings and clothing attributes, validating color as an effective proxy when explicit clothing labels are unavailable. We demonstrate the effectiveness of CSCI on both image and video ReID with extensive experiments on four CC-ReID datasets. 
We improve the baseline by Top-1 2.9% on LTCC and 5.0% on PRCC for image-based ReID, and 1.0% on CCVID and 2.5% on MeVID for video-based ReID without relying on additional supervision. Our results highlight the potential of color as a cost-effective solution for addressing appearance bias in CC-ReID. Github: https://github.com/ppriyank/ICCV-CSCI-Person-ReID.

[61] Automated Video Segmentation Machine Learning Pipeline

Johannes Merz,Lucien Fostier

Main category: cs.CV

TL;DR: This paper introduces an automated video segmentation pipeline for VFX production that uses machine learning to generate accurate, temporally stable masks, improving efficiency and reducing manual work.

Details

Motivation: Visual effects (VFX) production often struggles with slow, resource-intensive mask generation, which motivated the development of an automated solution that improves efficiency and consistency. Method: The paper employs machine learning for flexible object detection via text prompts, refined per-frame image segmentation, and robust video tracking to ensure temporal stability. It is deployed using containerization and leverages a structured output format. Result: The pipeline creates temporally consistent instance masks, was quickly adopted by artists, and demonstrates significant improvements in speed and effort reduction. Conclusion: The automated video segmentation pipeline presented in the paper enhances overall VFX production efficiency by significantly reducing manual effort, speeding up the creation of preliminary composites, and providing comprehensive segmentation data. Abstract: Visual effects (VFX) production often struggles with slow, resource-intensive mask generation. This paper presents an automated video segmentation pipeline that creates temporally consistent instance masks. It employs machine learning for: (1) flexible object detection via text prompts, (2) refined per-frame image segmentation and (3) robust video tracking to ensure temporal stability. Deployed using containerization and leveraging a structured output format, the pipeline was quickly adopted by our artists. It significantly reduces manual effort, speeds up the creation of preliminary composites, and provides comprehensive segmentation data, thereby enhancing overall VFX production efficiency.

[62] DisenQ: Disentangling Q-Former for Activity-Biometrics

Shehreen Azad,Yogesh S Rawat

Main category: cs.CV

TL;DR: This paper proposes DisenQ, a language-guided framework for identifying individuals across diverse activities by disentangling biometric features from motion and appearance variations, achieving superior performance on multiple benchmarks.

Details

Motivation: Traditional person identification methods face challenges in activity-biometrics because identity cues become entangled with motion dynamics and appearance variations. Additional visual data often introduces inaccuracies. Method: A multimodal language-guided framework called DisenQ is introduced, which uses structured textual supervision instead of additional visual data to disentangle biometrics, motion, and non-biometrics features through a unified querying transformer. Result: The approach achieves state-of-the-art performance on three activity-based video benchmarks and demonstrates strong generalization with competitive performance on a traditional video-based identification benchmark. Conclusion: The proposed DisenQ framework effectively disentangles biometric features from motion and non-biometric features, leading to accurate identification across diverse activities and real-world scenarios. Abstract: In this work, we address activity-biometrics, which involves identifying individuals across a diverse set of activities. Unlike traditional person identification, this setting introduces additional challenges as identity cues become entangled with motion dynamics and appearance variations, making biometrics feature learning more complex. While additional visual data like pose and/or silhouette help, they often suffer from extraction inaccuracies. To overcome this, we propose a multimodal language-guided framework that replaces reliance on additional visual data with structured textual supervision. At its core, we introduce DisenQ (Disentangling Q-Former), a unified querying transformer that disentangles biometrics, motion, and non-biometrics features by leveraging structured language guidance. This ensures identity cues remain independent of appearance and motion variations, preventing misidentifications. We evaluate our approach on three activity-based video benchmarks, achieving state-of-the-art performance. Additionally, we demonstrate strong generalization to complex real-world scenarios with competitive performance on a traditional video-based identification benchmark, showing the effectiveness of our framework.

[63] LinguaMark: Do Multimodal Models Speak Fairly? A Benchmark-Based Evaluation

Ananya Raval,Aravind Narayanan,Vahid Reza Khazaie,Shaina Raza

Main category: cs.CV

TL;DR: This paper introduces LinguaMark, a multilingual VQA benchmark for evaluating LMMs, revealing that closed-source models perform best overall while some open-source models, particularly Qwen2.5, show strong cross-lingual and cross-social attribute performance.

Details

Motivation: LMMs often exhibit bias and limited linguistic coverage due to training data limitations. There is a need to assess and improve multilingual capabilities in these models. Method: Introduction of LinguaMark, a benchmark for evaluating LMMs on multilingual Visual Question Answering (VQA) tasks. The dataset contains 6,875 image-text pairs across 11 languages and five social attributes, evaluated using Bias, Answer Relevancy, and Faithfulness metrics. Result: Closed-source models (e.g., GPT-4o, Gemini2.5) outperform others overall, while open-source models (e.g., Gemma3, Qwen2.5) show competitive results across social attributes. Qwen2.5 demonstrates strong multilingual performance. Conclusion: Closed-source models generally achieve the highest overall performance on the LinguaMark benchmark, while both closed-source and open-source models perform well across social attributes. Qwen2.5 shows strong multilingual generalization. Abstract: Large Multimodal Models (LMMs) are typically trained on vast corpora of image-text data but are often limited in linguistic coverage, leading to biased and unfair outputs across languages. While prior work has explored multimodal evaluation, less emphasis has been placed on assessing multilingual capabilities. In this work, we introduce LinguaMark, a benchmark designed to evaluate state-of-the-art LMMs on a multilingual Visual Question Answering (VQA) task. Our dataset comprises 6,875 image-text pairs spanning 11 languages and five social attributes. We evaluate models using three key metrics: Bias, Answer Relevancy, and Faithfulness. Our findings reveal that closed-source models generally achieve the highest overall performance. Both closed-source (GPT-4o and Gemini2.5) and open-source models (Gemma3, Qwen2.5) perform competitively across social attributes, and Qwen2.5 demonstrates strong generalization across multiple languages. We release our benchmark and evaluation code to encourage reproducibility and further research.

[64] MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning

Chengfei Wu,Ronald Seoh,Bingxuan Li,Liqiang Zhang,Fengrong Han,Dan Goldwasser

Main category: cs.CV

TL;DR: MagiC is a new benchmark assessing grounded visual reasoning quality in vision-language models, highlighting their limitations and improvement areas.

Details

Motivation: To determine if vision-language models perform genuine grounded visual reasoning rather than relying on superficial patterns. Method: MagiC includes weakly supervised QA examples, human-curated examples, and evaluates models across multiple dimensions with new metrics like MagiScore and StepSense. Result: Evaluation of 15 models from 7B to 70B parameters on reasoning validity, grounding fidelity, self-correction, and robustness. Conclusion: MagiC benchmark reveals key limitations and opportunities in current grounded visual reasoning approaches. Abstract: Recent advances in large vision-language models have led to impressive performance in visual question answering and multimodal reasoning. However, it remains unclear whether these models genuinely perform grounded visual reasoning or rely on superficial patterns and dataset biases. In this work, we introduce MagiC, a comprehensive benchmark designed to evaluate grounded multimodal cognition, assessing not only answer accuracy but also the quality of step-by-step reasoning and its alignment with relevant visual evidence. Our benchmark includes approximately 5,500 weakly supervised QA examples generated from strong model outputs and 900 human-curated examples with fine-grained annotations, including answers, rationales, and bounding box groundings. We evaluate 15 vision-language models ranging from 7B to 70B parameters across four dimensions: final answer correctness, reasoning validity, grounding fidelity, and self-correction ability. MagiC further includes diagnostic settings to probe model robustness under adversarial visual cues and assess their capacity for introspective error correction. We introduce new metrics such as MagiScore and StepSense, and provide comprehensive analyses that reveal key limitations and opportunities in current approaches to grounded visual reasoning.

[65] ADIEE: Automatic Dataset Creation and Scorer for Instruction-Guided Image Editing Evaluation

Sherry X. Chen,Yi Wei,Luowei Zhou,Suren Kumar

Main category: cs.CV

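TL;DR: This paper proposes ADIEE, an automated dataset creation approach, and trains a scoring model for evaluating instruction-guided image editing. The model performs strongly on multiple benchmarks, substantially improves correlation with human ratings, and can be used for automated best-edit selection and model fine-tuning.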

Details

Motivation: Existing open-source vision-language models (VLMs) struggle with alignment when evaluating instruction-guided image editing, while proprietary models lack transparency and cost efficiency. In addition, no public training datasets exist for fine-tuning these models, only small benchmarks. Method: The authors propose ADIEE, generate a large-scale dataset with over 100K samples, and use it to fine-tune a LLaVA-NeXT-8B model so that it decodes a numeric score from a custom token. Result: The proposed scorer outperforms all open-source VLMs and Gemini-Pro 1.5 across all benchmarks, improving score correlation with human ratings on AURORA-Bench by 17.24%, and improving pair-wise comparison accuracy by 7.21% on GenAI-Bench and 9.35% on AURORA-Bench. It can also act as a reward model, raising the MagicBrush model's average evaluation score on ImagenHub by 8.98%. Conclusion: ADIEE provides an effective automated evaluation approach that brings higher correlation and accuracy to instruction-guided image editing evaluation, and it has the potential to serve as a reward model for edit selection and model optimization. Abstract: Recent advances in instruction-guided image editing underscore the need for effective automated evaluation. While Vision-Language Models (VLMs) have been explored as judges, open-source models struggle with alignment, and proprietary models lack transparency and cost efficiency. Additionally, no public training datasets exist to fine-tune open-source VLMs, only small benchmarks with diverse evaluation schemes. To address this, we introduce ADIEE, an automated dataset creation approach which is then used to train a scoring model for instruction-guided image editing evaluation. We generate a large-scale dataset with over 100K samples and use it to fine-tune a LLaVA-NeXT-8B model modified to decode a numeric score from a custom token. The resulting scorer outperforms all open-source VLMs and Gemini-Pro 1.5 across all benchmarks, achieving a 0.0696 (+17.24%) gain in score correlation with human ratings on AURORA-Bench, and improving pair-wise comparison accuracy by 4.03% (+7.21%) on GenAI-Bench and 4.75% (+9.35%) on AURORA-Bench, respectively, compared to the state-of-the-art. The scorer can act as a reward model, enabling automated best edit selection and model fine-tuning. Notably, the proposed scorer can boost MagicBrush model's average evaluation score on ImagenHub from 5.90 to 6.43 (+8.98%).

[66] Scalable and Realistic Virtual Try-on Application for Foundation Makeup with Kubelka-Munk Theory

Hui Pang,Sunil Hadap,Violetta Shevchenko,Rahul Suresh,Amin Banitalebi-Dehkordi

Main category: cs.CV

TL;DR: This paper introduces an efficient method for realistic foundation makeup virtual try-on using augmented reality by approximating Kubelka-Munk theory, achieving better performance than existing approaches.

Details

Motivation: The motivation is to overcome the critical technical challenge of accurate synthesis of foundation-skin tone color blending in AR-based virtual try-on applications, which is essential for enhancing user experience in the beauty industry. Method: A novel method that approximates Kubelka-Munk theory for faster image synthesis was developed, along with a scalable end-to-end framework for realistic foundation makeup VTO based solely on product information from e-commerce sites. Result: The validation using real-world makeup images showed that the proposed framework outperforms other existing techniques in terms of realism and efficiency. Conclusion: The proposed method effectively addresses the challenge of realistic foundation-skin tone color blending in augmented reality VTO applications while maintaining scalability across diverse product ranges. Abstract: Augmented reality is revolutionizing the beauty industry with virtual try-on (VTO) applications, which empower users to try a wide variety of products using their phones without the hassle of physically putting on real products. A critical technical challenge in foundation VTO applications is the accurate synthesis of foundation-skin tone color blending while maintaining the scalability of the method across diverse product ranges. In this work, we propose a novel method to approximate the well-established Kubelka-Munk (KM) theory for faster image synthesis while preserving foundation-skin tone color blending realism. Additionally, we build a scalable end-to-end framework for realistic foundation makeup VTO solely depending on the product information available on e-commerce sites. We validate our method using real-world makeup images, demonstrating that our framework outperforms other techniques.
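For context on the Kubelka-Munk (KM) theory mentioned in this entry, the sketch below shows the textbook single-constant KM relations for an opaque layer: the ratio K/S = (1 - R)^2 / (2R), its inverse, and linear mixing of K/S values. It is not the paper's approximation, which the abstract does not spell out. The `blend` helper, its `coverage` parameter, and the idea of treating foundation over skin as a simple mixture per color channel are simplifying assumptions made here for illustration.

```python
import math


def ks_from_reflectance(r):
    """Kubelka-Munk K/S ratio for an opaque layer with reflectance r in (0, 1]."""
    return (1.0 - r) ** 2 / (2.0 * r)


def reflectance_from_ks(ks):
    """Inverse of the K/S relation: reflectance for a given K/S ratio."""
    return 1.0 + ks - math.sqrt(ks * ks + 2.0 * ks)


def blend(r_skin, r_foundation, coverage):
    """Hypothetical per-channel blend: mix K/S values linearly, then invert.

    coverage = 0 returns the skin reflectance, coverage = 1 the foundation's.
    """
    ks = ((1.0 - coverage) * ks_from_reflectance(r_skin)
          + coverage * ks_from_reflectance(r_foundation))
    return reflectance_from_ks(ks)


# One color channel: darker skin reflectance, lighter foundation reflectance.
mixed = blend(r_skin=0.40, r_foundation=0.60, coverage=0.5)
```

Because the mixing happens in K/S space rather than directly on reflectance, the result is not a plain average of the two inputs, which is the nonlinearity that KM-style blending is meant to capture.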

[67] Entity Re-identification in Visual Storytelling via Contrastive Reinforcement Learning

Daniel A. P. Oliveira,David Martins de Matos

Main category: cs.CV

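TL;DR: This paper proposes a new contrastive reinforcement learning framework that effectively improves cross-frame entity consistency and reasoning in visual storytelling models.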

Details

Motivation: Existing vision-language models perform poorly at keeping entities consistent across frames, leading to inconsistent references and hallucinations, because they lack explicit training on cross-frame entity connections. Method: A contrastive reinforcement learning approach is proposed that combines synthetic negative examples with a dual-component reward function to fine-tune Qwen Storyteller. Result: The model improves markedly on several metrics, including mAP, F1, pronoun grounding accuracy, and story structure. Conclusion: Using contrastive reinforcement learning, the paper improves visual storytelling models' cross-frame entity connection and consistency reasoning. Abstract: Visual storytelling systems, particularly large vision-language models, struggle to maintain character and object identity across frames, often failing to recognize when entities in different images represent the same individuals or objects, leading to inconsistent references and referential hallucinations. This occurs because models lack explicit training on when to establish entity connections across frames. We propose a contrastive reinforcement learning approach that trains models to discriminate between coherent image sequences and stories from unrelated images. We extend the Story Reasoning dataset with synthetic negative examples to teach appropriate entity connection behavior. We employ Direct Preference Optimization with a dual-component reward function that promotes grounding and re-identification of entities in real stories while penalizing incorrect entity connections in synthetic contexts. Using this contrastive framework, we fine-tune Qwen Storyteller (based on Qwen2.5-VL 7B). Evaluation shows improvements in grounding mAP from 0.27 to 0.31 (+14.8%), F1 from 0.35 to 0.41 (+17.1%). Pronoun grounding accuracy improved across all pronoun types except "its", and cross-frame character and object persistence increased across all frame counts, with entities appearing in 5 or more frames advancing from 29.3% to 33.3% (+13.7%). Well-structured stories, containing the chain-of-thought and grounded story, increased from 79.1% to 97.5% (+23.3%).
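The percentages in parentheses in this abstract are relative gains over the baseline, not absolute differences. A small sketch, using only the before/after figures quoted above, makes the arithmetic explicit. The helper name `relative_gain` and the dictionary labels are illustrative.

```python
def relative_gain(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


# (before, after) pairs as quoted in the abstract.
reported = {
    "grounding mAP": (0.27, 0.31),
    "F1": (0.35, 0.41),
    "entities in 5+ frames (%)": (29.3, 33.3),
    "well-structured stories (%)": (79.1, 97.5),
}

gains = {name: round(relative_gain(b, a), 1) for name, (b, a) in reported.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain}%")
```

For example, the share of well-structured stories rises by 18.4 percentage points in absolute terms, which corresponds to the quoted relative gain of about 23.3%.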

[68] PacGDC: Label-Efficient Generalizable Depth Completion with Projection Ambiguity and Consistency

Haotian Wang,Aoran Xiao,Xiaoqin Zhang,Meng Yang,Shijian Lu

Main category: cs.CV

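TL;DR: PacGDC is a label-efficient technique for generalizable depth completion that exploits geometric ambiguities and consistencies in 2D-to-3D projection to synthesize many pseudo geometries, achieving strong generalization in zero-shot and few-shot settings.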

Details

Motivation: Training generalizable depth completion models typically requires large-scale datasets with metric depth labels, which are labor-intensive to collect. The paper therefore proposes a label-efficient method that enhances data diversity with minimal annotation effort. Method: PacGDC builds on novel insights into the inherent ambiguities and consistencies in object shapes and positions during 2D-to-3D projection to synthesize multiple pseudo geometries for the same visual scene. It greatly broadens the available geometries by manipulating the scene scales of depth maps, uses multiple depth foundation models as scale manipulators to provide pseudo labels, and incorporates interpolation and relocation strategies as well as unlabeled images to further increase geometric diversity. Result: Experiments show that PacGDC generalizes remarkably well across multiple benchmarks, excelling across diverse scene semantics/scales and depth sparsity/patterns in both zero-shot and few-shot settings. Conclusion: PacGDC is an effective label-efficient method that improves the generalization of depth completion models to unseen environments at minimal annotation cost. Abstract: Generalizable depth completion enables the acquisition of dense metric depth maps for unseen environments, offering robust perception capabilities for various downstream tasks. However, training such models typically requires large-scale datasets with metric depth labels, which are often labor-intensive to collect. This paper presents PacGDC, a label-efficient technique that enhances data diversity with minimal annotation effort for generalizable depth completion. PacGDC builds on novel insights into inherent ambiguities and consistencies in object shapes and positions during 2D-to-3D projection, allowing the synthesis of numerous pseudo geometries for the same visual scene. This process greatly broadens available geometries by manipulating scene scales of the corresponding depth maps. To leverage this property, we propose a new data synthesis pipeline that uses multiple depth foundation models as scale manipulators. These models robustly provide pseudo depth labels with varied scene scales, affecting both local objects and global layouts, while ensuring projection consistency that supports generalization. To further diversify geometries, we incorporate interpolation and relocation strategies, as well as unlabeled images, extending the data coverage beyond the individual use of foundation models. Extensive experiments show that PacGDC achieves remarkable generalizability across multiple benchmarks, excelling in diverse scene semantics/scales and depth sparsity/patterns under both zero-shot and few-shot settings. Code: https://github.com/Wang-xjtu/PacGDC.

[69] Adaptive Particle-Based Shape Modeling for Anatomical Surface Correspondence

Hong Xu,Shireen Y. Elhabian

Main category: cs.CV

TL;DR: This paper introduces new mechanisms to enhance adaptivity in particle-based shape modeling while ensuring consistent particle configurations, resulting in improved surface representation and correspondence accuracy.

Details

Motivation: To address the lack of self-adaptivity in current particle-based shape modeling approaches, specifically the inability to automatically adjust particle configurations to local geometric features of each surface for accurately representing complex anatomical variability. Method: The paper introduces two mechanisms: (1) a novel neighborhood correspondence loss to enable high adaptivity and (2) a geodesic correspondence algorithm that regularizes optimization to enforce geodesic neighborhood consistency. Result: The proposed approach is evaluated on challenging datasets, with results showing efficacy and scalability while providing a detailed analysis of the adaptivity-correspondence trade-off. Conclusion: The paper concludes that the proposed mechanisms significantly improve surface adaptivity while maintaining consistent particle configurations, offering better performance in surface representation accuracy and correspondence metrics compared to existing methods. Abstract: Particle-based shape modeling (PSM) is a family of approaches that automatically quantifies shape variability across anatomical cohorts by positioning particles (pseudo landmarks) on shape surfaces in a consistent configuration. Recent advances incorporate implicit radial basis function representations as self-supervised signals to better capture the complex geometric properties of anatomical structures. However, these methods still lack self-adaptivity -- that is, the ability to automatically adjust particle configurations to local geometric features of each surface, which is essential for accurately representing complex anatomical variability. This paper introduces two mechanisms to increase surface adaptivity while maintaining consistent particle configurations: (1) a novel neighborhood correspondence loss to enable high adaptivity and (2) a geodesic correspondence algorithm that regularizes optimization to enforce geodesic neighborhood consistency. 
We evaluate the efficacy and scalability of our approach on challenging datasets, providing a detailed analysis of the adaptivity-correspondence trade-off and benchmarking against existing methods on surface representation accuracy and correspondence metrics.

[70] Multi-Scale Attention and Gated Shifting for Fine-Grained Event Spotting in Videos

Hao Xu,Arbind Agrahari Baniya,Sam Wells,Mohamed Reda Bouadjenek,Richard Dazeley,Sunil Aryal

Main category: cs.CV

TL;DR: This paper proposes an improved module called MSAGSM for precise event spotting in sports videos, achieving better performance with minimal added complexity.

Details

Motivation: Existing PES models using modules like GSM or GSF are limited in temporal receptive field and spatial adaptability. This work aims to address these limitations by proposing a more effective temporal-spatial modeling approach. Method: The paper introduces a Multi-Scale Attention Gate Shift Module (MSAGSM) that incorporates multi-scale temporal dilations and multi-head spatial attention to improve temporal modeling and spatial adaptability. It also presents the Table Tennis Australia (TTA) dataset for evaluation. Result: Extensive experiments on five PES benchmarks demonstrate that MSAGSM consistently improves performance with minimal computational overhead. Conclusion: The proposed MSAGSM module enhances the performance of PES models with minimal overhead, establishing new state-of-the-art results. Abstract: Precise Event Spotting (PES) in sports videos requires frame-level recognition of fine-grained actions from single-camera footage. Existing PES models typically incorporate lightweight temporal modules such as Gate Shift Module (GSM) or Gate Shift Fuse (GSF) to enrich 2D CNN feature extractors with temporal context. However, these modules are limited in both temporal receptive field and spatial adaptability. We propose a Multi-Scale Attention Gate Shift Module (MSAGSM) that enhances GSM with multi-scale temporal dilations and multi-head spatial attention, enabling efficient modeling of both short- and long-term dependencies while focusing on salient regions. MSAGSM is a lightweight plug-and-play module that can be easily integrated with various 2D backbones. To further advance the field, we introduce the Table Tennis Australia (TTA) dataset, the first PES benchmark for table tennis, containing over 4800 precisely annotated events. Extensive experiments across five PES benchmarks demonstrate that MSAGSM consistently improves performance with minimal overhead, setting new state-of-the-art results.

[71] KeyRe-ID: Keypoint-Guided Person Re-Identification using Part-Aware Representation in Videos

Jinseong Kim,Junghoon Song,Gyeongseon Baek,Byeongjoon Noh

Main category: cs.CV

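TL;DR: We propose KeyRe-ID, a new video-based person re-identification method that uses human keypoints to enhance spatiotemporal representation learning and demonstrates state-of-the-art performance on the MARS and iLIDS-VID benchmarks.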

Details

Motivation: To improve the accuracy of person re-identification in videos by using human keypoints to enhance spatiotemporal representation learning. Method: A keypoint-guided video-based person re-identification framework named KeyRe-ID is proposed; it consists of a global branch and a local branch, which capture holistic identity semantics and generate fine-grained, part-aware features, respectively. Result: Extensive experiments on the MARS and iLIDS-VID benchmarks show that KeyRe-ID achieves state-of-the-art performance, with 91.73% mAP and 97.32% Rank-1 accuracy on MARS, and 96.00% Rank-1 and 100.0% Rank-5 accuracy on iLIDS-VID. Conclusion: KeyRe-ID achieves state-of-the-art performance for video-based person re-identification, and the code will be made publicly available on GitHub upon publication. Abstract: We propose KeyRe-ID, a keypoint-guided video-based person re-identification framework consisting of global and local branches that leverage human keypoints for enhanced spatiotemporal representation learning. The global branch captures holistic identity semantics through Transformer-based temporal aggregation, while the local branch dynamically segments body regions based on keypoints to generate fine-grained, part-aware features. Extensive experiments on MARS and iLIDS-VID benchmarks demonstrate state-of-the-art performance, achieving 91.73% mAP and 97.32% Rank-1 accuracy on MARS, and 96.00% Rank-1 and 100.0% Rank-5 accuracy on iLIDS-VID. The code for this work will be publicly available on GitHub upon publication.

[72] Behave Your Motion: Habit-preserved Cross-category Animal Motion Transfer

Zhimin Zhang,Bi'an Du,Caoyuan Ma,Zheng Wang,Wei Hu

Main category: cs.CV

TL;DR: This paper proposes a novel framework for cross-category animal motion transfer that preserves species-specific habits using a generative model, a habit-preservation module, and LLM integration.

Details

Motivation: Existing motion transfer methods mainly focus on humans, neglecting the preservation of unique animal behavioral habits, which is crucial for applications in animation and virtual reality. Method: A generative framework with a habit-preservation module and category-specific encoder, enhanced by integrating a large language model (LLM) for unseen species adaptation. Result: The model demonstrated superior performance through extensive experiments on the newly introduced DeformingThings4D-skl dataset. Conclusion: The proposed framework successfully transfers motion across different animal categories while preserving species-specific habitual behaviors, outperforming existing methods. Abstract: Animal motion embodies species-specific behavioral habits, making the transfer of motion across categories a critical yet complex task for applications in animation and virtual reality. Existing motion transfer methods, primarily focused on human motion, emphasize skeletal alignment (motion retargeting) or stylistic consistency (motion style transfer), often neglecting the preservation of distinct habitual behaviors in animals. To bridge this gap, we propose a novel habit-preserved motion transfer framework for cross-category animal motion. Built upon a generative framework, our model introduces a habit-preservation module with category-specific habit encoder, allowing it to learn motion priors that capture distinctive habitual characteristics. Furthermore, we integrate a large language model (LLM) to facilitate the motion transfer to previously unobserved species. To evaluate the effectiveness of our approach, we introduce the DeformingThings4D-skl dataset, a quadruped dataset with skeletal bindings, and conduct extensive experiments and quantitative analyses, which validate the superiority of our proposed model.

[73] Seg-Wild: Interactive Segmentation based on 3D Gaussian Splatting for Unconstrained Image Collections

Yongtang Bao,Chengjie Tang,Yuze Wang,Haojie Li

Main category: cs.CV

TL;DR: This paper proposes Seg-Wild, an interactive segmentation method based on 3D Gaussian Splatting, to address the challenges of segmenting scenes from unconstrained photo collections.

Details

Motivation: Unconstrained photo collections from the Internet suffer from inconsistent lighting and transient occlusions, which makes segmentation difficult. Previous methods cannot handle these challenges effectively. Method: Seg-Wild uses 3D Gaussian Splatting with multi-dimensional feature embeddings for interactive segmentation and introduces the Spiky 3D Gaussian Cutter to smooth abnormal Gaussians. Result: Seg-Wild achieves better segmentation results and reconstruction quality compared to previous methods. Conclusion: Seg-Wild is an effective interactive segmentation method for unconstrained image collections, outperforming previous methods in segmentation and reconstruction quality. Abstract: Reconstructing and segmenting scenes from unconstrained photo collections obtained from the Internet is a novel but challenging task. Unconstrained photo collections are easier to get than well-captured photo collections. These unconstrained images suffer from inconsistent lighting and transient occlusions, which makes segmentation challenging. Previous segmentation methods cannot address transient occlusions or accurately restore the scene's lighting conditions. Therefore, we propose Seg-Wild, an interactive segmentation method based on 3D Gaussian Splatting for unconstrained image collections, suitable for in-the-wild scenes. We integrate multi-dimensional feature embeddings for each 3D Gaussian and calculate the feature similarity between the feature embeddings and the segmentation target to achieve interactive segmentation in the 3D scene. Additionally, we introduce the Spiky 3D Gaussian Cutter (SGC) to smooth abnormal 3D Gaussians. We project the 3D Gaussians onto a 2D plane and calculate the ratio of 3D Gaussians that need to be cut using the SAM mask. We also designed a benchmark to evaluate segmentation quality in in-the-wild scenes. 
Experimental results demonstrate that compared to previous methods, Seg-Wild achieves better segmentation results and reconstruction quality. Our code will be available at https://github.com/Sugar0725/Seg-Wild.
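The abstract describes interactive segmentation as computing feature similarity between each 3D Gaussian's embedding and the segmentation target. A minimal sketch of that selection step follows. The cosine similarity measure, the fixed threshold, and the names `cosine_similarity` and `select_gaussians` are illustrative assumptions, not details confirmed by the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def select_gaussians(embeddings, target, threshold=0.8):
    """Return indices of Gaussians whose embedding is similar to the target.

    embeddings: list of per-Gaussian feature vectors (hypothetical layout).
    target: feature vector of the user-selected segmentation target.
    threshold: assumed cut-off; the paper does not state one.
    """
    return [
        i
        for i, emb in enumerate(embeddings)
        if cosine_similarity(emb, target) >= threshold
    ]


# Toy usage: three Gaussians with 2-D embeddings.
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
selected = select_gaussians(embeddings, target=[1.0, 0.0])
```

In this toy case the first and third Gaussians are selected, since their embeddings point in nearly the same direction as the target.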

[74] EscherNet++: Simultaneous Amodal Completion and Scalable View Synthesis through Masked Fine-Tuning and Enhanced Feed-Forward 3D Reconstruction

Xinan Zhang,Muhammad Zubair Irshad,Anthony Yezzi,Yi-Chang Tsai,Zsolt Kira

Main category: cs.CV

TL;DR: EscherNet++ is a diffusion model, fine-tuned with masking, that synthesizes novel views in a zero-shot manner and performs amodal completion.

Details

Motivation: Existing methods first hallucinate the missing parts of the image and then perform novel view synthesis in separate stages, which causes redundant computation and ignores cross-view dependencies. Method: Masked fine-tuning with input-level and feature-level masking yields an end-to-end model for novel view synthesis and amodal completion. Result: On occluded tasks in 10-input settings, PSNR improves by 3.9 and Volume IoU by 0.28; when paired with feed-forward image-to-mesh models, reconstruction time drops by 95%. Conclusion: EscherNet++ achieves efficient and accurate zero-shot novel view synthesis and 3D reconstruction. Abstract: We propose EscherNet++, a masked fine-tuned diffusion model that can synthesize novel views of objects in a zero-shot manner with amodal completion ability. Existing approaches utilize multiple stages and complex pipelines to first hallucinate missing parts of the image and then perform novel view synthesis, which fail to consider cross-view dependencies and require redundant storage and computing for separate stages. Instead, we apply masked fine-tuning including input-level and feature-level masking to enable an end-to-end model with the improved ability to synthesize novel views and conduct amodal completion. In addition, we empirically integrate our model with other feed-forward image-to-mesh models without extra training and achieve competitive results with reconstruction time decreased by 95%, thanks to its ability to synthesize arbitrary query views. Our method's scalable nature further enhances fast 3D reconstruction. Despite fine-tuning on a smaller dataset and batch size, our method achieves state-of-the-art results, improving PSNR by 3.9 and Volume IoU by 0.28 on occluded tasks in 10-input settings, while also generalizing to real-world occluded reconstruction.

[75] EPIC: Efficient Prompt Interaction for Text-Image Classification

Xinyao Yu,Hao Sun,Zeyu Ling,Ziwei Niu,Zhenjia Bai,Rui Qin,Yen-Wei Chen,Lanfen Lin

Main category: cs.CV

TL;DR: This paper proposes EPIC, a prompt-based strategy for efficient multimodal interaction in large pre-trained models.

Details

Motivation: To reduce the computational cost of fine-tuning large multimodal models. Method: Temporal prompts on intermediate layers with similarity-based interaction. Result: Outperformed other strategies on UPMC-Food101 and SNLI-VE, comparable on MM-IMDB. Conclusion: EPIC is a promising method for aligning modalities in LMMs, offering efficiency and effectiveness. Abstract: In recent years, large-scale pre-trained multimodal models (LMMs) have emerged to integrate the vision and language modalities, achieving considerable success in multimodal tasks such as text-image classification. The growing size of LMMs, however, results in a significant computational cost for fine-tuning these models for downstream tasks. Hence, prompt-based interaction strategies are studied to align modalities more efficiently. In this context, we propose a novel efficient prompt-based multimodal interaction strategy, namely Efficient Prompt Interaction for text-image Classification (EPIC). Specifically, we utilize temporal prompts on intermediate layers, and integrate different modalities with similarity-based prompt interaction, to leverage sufficient information exchange between modalities. Utilizing this approach, our method achieves reduced computational resource consumption and fewer trainable parameters (about 1% of the foundation model) compared to other fine-tuning strategies. Furthermore, it demonstrates superior performance on the UPMC-Food101 and SNLI-VE datasets, while achieving comparable performance on the MM-IMDB dataset.

[76] Corvid: Improving Multimodal Large Language Models Towards Chain-of-Thought Reasoning

Jingjing Jiang,Chao Ma,Xurui Song,Hanwang Zhang,Jun Luo

Main category: cs.CV

TL;DR: This paper introduces Corvid, an enhanced multimodal large language model with improved reasoning capabilities through architectural innovations and a novel training methodology.

Details

Motivation: Leading open-source multimodal large language models (MLLMs) face limitations in complex reasoning tasks, necessitating enhanced chain-of-thought (CoT) capabilities for better decision-making and problem-solving. Method: Corvid incorporates a hybrid vision encoder, GateMixer for cross-modal alignment, and is fine-tuned using the MCoT-Instruct-287K dataset with a two-stage CoT training approach. Additionally, an inference-time scaling strategy helps mitigate over- and under-reasoning. Result: Corvid demonstrates superior performance on reasoning tasks, especially in math and science, compared to existing advanced MLLMs. Conclusion: Corvid outperforms existing o1-like MLLMs and state-of-the-art models with similar parameter scales, particularly in mathematical reasoning and science problem-solving. Abstract: Recent advancements in multimodal large language models (MLLMs) have demonstrated exceptional performance in multimodal perception and understanding. However, leading open-source MLLMs exhibit significant limitations in complex and structured reasoning, particularly in tasks requiring deep reasoning for decision-making and problem-solving. In this work, we present Corvid, an MLLM with enhanced chain-of-thought (CoT) reasoning capabilities. Architecturally, Corvid incorporates a hybrid vision encoder for informative visual representation and a meticulously designed connector (GateMixer) to facilitate cross-modal alignment. To enhance Corvid's CoT reasoning capabilities, we introduce MCoT-Instruct-287K, a high-quality multimodal CoT instruction-following dataset, refined and standardized from diverse public reasoning sources. Leveraging this dataset, we fine-tune Corvid with a two-stage CoT-formatted training approach to progressively enhance its step-by-step reasoning abilities. Furthermore, we propose an effective inference-time scaling strategy that enables Corvid to mitigate over-reasoning and under-reasoning through self-verification. 
Extensive experiments demonstrate that Corvid outperforms existing o1-like MLLMs and state-of-the-art MLLMs with similar parameter scales, with notable strengths in mathematical reasoning and science problem-solving. Project page: https://mm-vl.github.io/corvid.

[77] Towards High-Resolution 3D Anomaly Detection: A Scalable Dataset and Real-Time Framework for Subtle Industrial Defects

Yuqi Cheng,Yihan Sun,Hui Zhang,Weiming Shen,Yunkang Cao

Main category: cs.CV

TL;DR: This paper presents the MiniShift dataset and the Simple3D framework to improve the accuracy and efficiency of high-resolution 3D anomaly detection.

Details

Motivation: Industrial point cloud analysis needs high-resolution spatial data to detect subtle anomalies, but current benchmarks focus mainly on low-resolution inputs, so a new solution is needed to close this gap. Method: A scalable pipeline generates realistic and subtle 3D anomalies, and an efficient framework, Simple3D, combines Multi-scale Neighborhood Descriptors (MSND) and Local Feature Spatial Aggregation (LFSA) to capture intricate geometric details with minimal computational overhead. Result: MiniShift, the first high-resolution 3D anomaly detection dataset, contains 2,577 point clouds, each with 500,000 points and anomalies occupying less than 1% of the total; Simple3D achieves real-time inference above 20 fps and performs strongly on MiniShift and other benchmarks. Conclusion: Simple3D surpasses existing methods in both accuracy and speed, highlighting the pivotal role of high-resolution data and effective feature aggregation in advancing practical 3D anomaly detection. Abstract: In industrial point cloud analysis, detecting subtle anomalies demands high-resolution spatial data, yet prevailing benchmarks emphasize low-resolution inputs. To address this disparity, we propose a scalable pipeline for generating realistic and subtle 3D anomalies. Employing this pipeline, we developed MiniShift, the inaugural high-resolution 3D anomaly detection dataset, encompassing 2,577 point clouds, each with 500,000 points and anomalies occupying less than 1% of the total. We further introduce Simple3D, an efficient framework integrating Multi-scale Neighborhood Descriptors (MSND) and Local Feature Spatial Aggregation (LFSA) to capture intricate geometric details with minimal computational overhead, achieving real-time inference exceeding 20 fps. Extensive evaluations on MiniShift and established benchmarks demonstrate that Simple3D surpasses state-of-the-art methods in both accuracy and speed, highlighting the pivotal role of high-resolution data and effective feature aggregation in advancing practical 3D anomaly detection.

[78] Dual Semantic-Aware Network for Noise Suppressed Ultrasound Video Segmentation

Ling Zhou,Runtian Yuan,Yi Liu,Yuejie Zhang,Rui Feng,Shang Gao

Main category: cs.CV

TL;DR: This paper proposes DSANet, a new framework for enhancing noise robustness in ultrasound video segmentation through mutual semantic awareness between local and global features, leading to improved segmentation accuracy and performance.

Details

Motivation: Ultrasound imaging often introduces substantial noise, which poses challenges for automated lesion or organ segmentation. DSANet was developed to address these limitations. Method: The Dual Semantic-Aware Network (DSANet) includes an Adjacent-Frame Semantic-Aware (AFSA) module and a Local-and-Global Semantic-Aware (LGSA) module to improve multi-level semantic representation and resilience to noise interference. Result: Extensive evaluations on four benchmark datasets demonstrated that DSANet substantially outperformed state-of-the-art methods in segmentation accuracy and achieved higher inference FPS than video-based methods. Conclusion: DSANet is a novel framework that enhances noise robustness in ultrasound video segmentation by fostering mutual semantic awareness between local and global features. It outperforms state-of-the-art methods in segmentation accuracy. Abstract: Ultrasound imaging is a prevalent diagnostic tool known for its simplicity and non-invasiveness. However, its inherent characteristics often introduce substantial noise, posing considerable challenges for automated lesion or organ segmentation in ultrasound video sequences. To address these limitations, we propose the Dual Semantic-Aware Network (DSANet), a novel framework designed to enhance noise robustness in ultrasound video segmentation by fostering mutual semantic awareness between local and global features. Specifically, we introduce an Adjacent-Frame Semantic-Aware (AFSA) module, which constructs a channel-wise similarity matrix to guide feature fusion across adjacent frames, effectively mitigating the impact of random noise without relying on pixel-level relationships. Additionally, we propose a Local-and-Global Semantic-Aware (LGSA) module that reorganizes and fuses temporal unconditional local features, which capture spatial details independently at each frame, with conditional global features that incorporate temporal context from adjacent frames. 
This integration facilitates multi-level semantic representation, significantly improving the model's resilience to noise interference. Extensive evaluations on four benchmark datasets demonstrate that DSANet substantially outperforms state-of-the-art methods in segmentation accuracy. Moreover, since our model avoids pixel-level feature dependencies, it achieves significantly higher inference FPS than video-based methods, and even surpasses some image-based models. Code is available at https://github.com/ZhouL2001/DSANet.

[79] Bluish Veil Detection and Lesion Classification using Custom Deep Learnable Layers with Explainable Artificial Intelligence (XAI)

M. A. Rasel,Sameem Abdul Kareem,Zhenli Kwan,Shin Shen Yong,Unaizah Obaidellah

Main category: cs.CV

TL;DR: This study develops a novel DCNN-XAI approach for detecting the blue-white veil in skin lesions, achieving high accuracy and providing a robust tool for early melanoma diagnosis.

Details

Motivation: Melanoma is a deadly form of skin cancer, and detecting the blue-whitish veil (BWV) is crucial for its early diagnosis. However, there is limited research on BWV detection in dermatological images, which motivates this study to develop more effective detection methods. Method: The method involves converting a non-annotated skin lesion dataset into an annotated one using an imaging algorithm based on color threshold techniques. A Deep Convolutional Neural Network (DCNN) is designed and trained on three individual and combined dermoscopic datasets using custom layers. Additionally, an explainable artificial intelligence (XAI) algorithm is applied to interpret the model's decision-making process. Result: The proposed DCNN model achieved high testing accuracies across different datasets: 85.71% on the augmented PH2 dataset, 95.00% on the augmented ISIC archive dataset, 95.05% on the combined augmented (PH2+ISIC archive) dataset, and 90.00% on the Derm7pt dataset. Conclusion: The study concludes that the proposed DCNN model, combined with an XAI algorithm, significantly enhances the detection of the blue-white veil in skin lesions, outperforming existing models and offering a reliable tool for early melanoma diagnosis. Abstract: Melanoma, one of the deadliest types of skin cancer, accounts for thousands of fatalities globally. The bluish, blue-whitish, or blue-white veil (BWV) is a critical feature for diagnosing melanoma, yet research into detecting BWV in dermatological images is limited. This study utilizes a non-annotated skin lesion dataset, which is converted into an annotated dataset using a proposed imaging algorithm based on color threshold techniques on lesion patches and color palettes. A Deep Convolutional Neural Network (DCNN) is designed and trained separately on three individual and combined dermoscopic datasets, using custom layers instead of standard activation function layers. 
The model is developed to categorize skin lesions based on the presence of BWV. The proposed DCNN demonstrates superior performance compared to conventional BWV detection models across different datasets. The model achieves a testing accuracy of 85.71% on the augmented PH2 dataset, 95.00% on the augmented ISIC archive dataset, 95.05% on the combined augmented (PH2+ISIC archive) dataset, and 90.00% on the Derm7pt dataset. An explainable artificial intelligence (XAI) algorithm is subsequently applied to interpret the DCNN's decision-making process regarding BWV detection. The proposed approach, coupled with XAI, significantly improves the detection of BWV in skin lesions, outperforming existing models and providing a robust tool for early melanoma diagnosis.

Jeonghoon Song,Sunghun Kim,Jaegyun Im,Byeongjoon Noh

Main category: cs.CV

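TL;DR: Objectomaly is a new OoD segmentation framework that addresses the shortcomings of existing methods: imprecise boundaries, inconsistent anomaly scores within objects, and false positives caused by background noise.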

Details

Motivation: Existing mask-based methods suffer from imprecise boundaries, inconsistent anomaly scores within objects, and false positives caused by background noise. Method: Objectomaly consists of three stages: (1) Coarse Anomaly Scoring (CAS) using an existing OoD backbone, (2) Objectness-Aware Score Calibration (OASC), which uses SAM-generated instance masks for object-level score normalization, and (3) Meticulous Boundary Precision (MBP), which applies Laplacian filtering and Gaussian smoothing for contour refinement. Result: Objectomaly achieves state-of-the-art performance on key OoD segmentation benchmarks, including SMIYC AnomalyTrack/ObstacleTrack and RoadAnomaly, improving both pixel-level (AuPRC up to 96.99, FPR95 down to 0.07) and component-level (F1-score up to 83.44) metrics. Conclusion: Objectomaly is an effective OoD segmentation framework that improves on existing methods by incorporating object-level priors. Abstract: Out-of-Distribution (OoD) segmentation is critical for safety-sensitive applications like autonomous driving. However, existing mask-based methods often suffer from boundary imprecision, inconsistent anomaly scores within objects, and false positives from background noise. We propose Objectomaly, an objectness-aware refinement framework that incorporates object-level priors. Objectomaly consists of three stages: (1) Coarse Anomaly Scoring (CAS) using an existing OoD backbone, (2) Objectness-Aware Score Calibration (OASC) leveraging SAM-generated instance masks for object-level score normalization, and (3) Meticulous Boundary Precision (MBP) applying Laplacian filtering and Gaussian smoothing for contour refinement. Objectomaly achieves state-of-the-art performance on key OoD segmentation benchmarks, including SMIYC AnomalyTrack/ObstacleTrack and RoadAnomaly, improving both pixel-level (AuPRC up to 96.99, FPR$_{95}$ down to 0.07) and component-level (F1-score up to 83.44) metrics. Ablation studies and qualitative results on real-world driving videos further validate the robustness and generalizability of our method. Code will be released upon publication.
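The second stage described above, object-level score normalization with instance masks, can be illustrated with a small sketch. It assumes the simplest possible calibration, replacing every pixel score inside a mask with that mask's mean score, which removes inconsistent scores within one object. The paper may use a different normalization, and the names `calibrate_scores`, `scores`, and `masks` are hypothetical.

```python
def calibrate_scores(scores, masks):
    """Average anomaly scores inside each instance mask.

    scores: 2-D list of per-pixel anomaly scores (e.g. from a coarse stage).
    masks: list of sets of (row, col) pixel coordinates, one set per
           instance (a stand-in for SAM-generated masks).
    Returns a new 2-D list; pixels outside every mask keep their score.
    """
    calibrated = [row[:] for row in scores]  # copy so the input is untouched
    for mask in masks:
        if not mask:
            continue
        mean_score = sum(scores[r][c] for r, c in mask) / len(mask)
        for r, c in mask:
            calibrated[r][c] = mean_score
    return calibrated


# Toy usage: one object covering the top row of a 2x2 score map.
scores = [[0.2, 0.8], [0.4, 0.1]]
masks = [{(0, 0), (0, 1)}]
calibrated = calibrate_scores(scores, masks)
```

Here both top-row pixels end up with the same score, while the bottom row is left unchanged.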

Chang-Hwan Son

Main category: cs.CV

TL;DR: This paper proposes a GAN-based face image restoration framework with two dedicated modules (local SFFT and DAFE) to address weather-induced degradations, improving facial structure reconstruction and recognition accuracy in adverse weather conditions.

Details

Motivation: To improve the performance of face recognition systems in adverse weather conditions by addressing weather-induced degradations that distort facial textures and structures. Method: A novel GAN-based blind FIR framework incorporating two components: local Statistical Facial Feature Transformation (SFFT) and Degradation-Agnostic Feature Embedding (DAFE). Result: Experimental results show that the proposed degradation-agnostic SFFT model outperforms existing state-of-the-art FIR methods based on GAN and diffusion models, particularly in suppressing texture distortions and accurately reconstructing facial structures. Conclusion: The proposed GAN-based blind FIR framework with local SFFT and DAFE modules outperforms existing methods in face image restoration under adverse weather conditions. Abstract: With the increasing deployment of intelligent CCTV systems in outdoor environments, there is a growing demand for face recognition systems optimized for challenging weather conditions. Adverse weather significantly degrades image quality, which in turn reduces recognition accuracy. Although recent face image restoration (FIR) models based on generative adversarial networks (GANs) and diffusion models have shown progress, their performance remains limited due to the lack of dedicated modules that explicitly address weather-induced degradations. This leads to distorted facial textures and structures. To address these limitations, we propose a novel GAN-based blind FIR framework that integrates two key components: local Statistical Facial Feature Transformation (SFFT) and Degradation-Agnostic Feature Embedding (DAFE). The local SFFT module enhances facial structure and color fidelity by aligning the local statistical distributions of low-quality (LQ) facial regions with those of high-quality (HQ) counterparts. 
Complementarily, the DAFE module enables robust statistical facial feature extraction under adverse weather conditions by aligning LQ and HQ encoder representations, thereby making the restoration process adaptive to severe weather-induced degradations. Experimental results demonstrate that the proposed degradation-agnostic SFFT model outperforms existing state-of-the-art FIR methods based on GAN and diffusion models, particularly in suppressing texture distortions and accurately reconstructing facial structures. Furthermore, both the SFFT and DAFE modules are empirically validated in enhancing structural fidelity and perceptual quality in face restoration under challenging weather scenarios.

[82] Temporal Unlearnable Examples: Preventing Personal Video Data from Unauthorized Exploitation by Object Tracking

Qiangqiang Wu,Yi Yu,Chenqi Kong,Ziquan Liu,Jia Wan,Haoliang Li,Alex C. Kot,Antoni B. Chan

Main category: cs.CV

TL;DR: This paper introduces a novel framework called Temporal Unlearnable Examples (TUEs) to protect personal video data from unauthorized use in training visual object tracking models, offering high efficiency, effectiveness, and generalizability.

Details

Motivation: The motivation stems from the lack of attention to data-privacy issues in Visual Object Tracking (VOT), particularly regarding unauthorized use of private videos for model training. Method: The authors propose a generative framework for creating Temporal Unlearnable Examples (TUEs) along with a temporal contrastive loss to disrupt the learning process of trackers, ensuring privacy in video data usage. Result: The experiments show that the proposed method achieves state-of-the-art performance in protecting video data privacy with strong transferability and robustness across different VOT models, datasets, and tasks. Conclusion: The paper concludes that their proposed TUEs framework effectively prevents unauthorized exploitation of personal video data in training deep trackers while maintaining scalability and effectiveness across various VOT models and datasets. Abstract: With the rise of social media, vast amounts of user-uploaded videos (e.g., YouTube) are utilized as training data for Visual Object Tracking (VOT). However, the VOT community has largely overlooked video data-privacy issues, as many private videos have been collected and used for training commercial models without authorization. To alleviate these issues, this paper presents the first investigation on preventing personal video data from unauthorized exploitation by deep trackers. Existing methods for preventing unauthorized data use primarily focus on image-based tasks (e.g., image classification); directly applying them to videos reveals several limitations, including inefficiency, limited effectiveness, and poor generalizability. To address these issues, we propose a novel generative framework for generating Temporal Unlearnable Examples (TUEs), whose efficient computation makes it scalable to large-scale video datasets. 
The trackers trained with TUEs rely heavily on unlearnable noises for temporal matching, ignoring the original data structure and thus ensuring training video data-privacy. To enhance the effectiveness of TUEs, we introduce a temporal contrastive loss, which further corrupts the learning of existing trackers when using our TUEs for training. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in video data-privacy protection, with strong transferability across VOT models, datasets, and temporal matching tasks.

Jiaxu Wan,Xu Wang,Mengwei Xie,Xinyuan Chang,Xinran Liu,Zheng Pan,Mu Xu,Ding Yuan

Main category: cs.CV

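TL;DR: This paper presents OMA-MAT, a new framework that uses attention mechanisms for online map association, aiming to improve navigation and path planning for autonomous vehicles.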

Details

Motivation: In current autonomous driving research, the association between online high-definition maps and global standard-definition maps is understudied, which makes real-world use challenging. Method: The Map Association Transformer framework uses path-aware attention and spatial attention mechanisms to capture geometric and topological correspondences. Result: The OMA benchmark is introduced, containing 480k roads and 260k lane paths, together with corresponding evaluation metrics. Conclusion: OMA-MAT is a new framework for online map association that can enhance the path planning capabilities of autonomous vehicles. Abstract: Autonomous vehicles rely on global standard-definition (SD) maps for road-level route planning and online local high-definition (HD) maps for lane-level navigation. However, recent work concentrates on constructing online HD maps, often overlooking the association of global SD maps with online HD maps for hybrid navigation, which makes it challenging to use online HD maps in the real world. Observing this gap in the navigation capability of autonomous vehicles, we introduce Online Map Association (OMA), the first benchmark for the association of hybrid navigation-oriented online maps, which enhances the planning capabilities of autonomous vehicles. Based on existing datasets, OMA contains 480k roads and 260k lane paths and provides corresponding metrics to evaluate model performance. Additionally, we propose a novel framework, named Map Association Transformer, as the baseline method, using path-aware attention and spatial attention mechanisms to enable the understanding of geometric and topological correspondences. The code and dataset can be accessed at https://github.com/WallelWan/OMA-MAT.

[84] Divergence Minimization Preference Optimization for Diffusion Model Alignment

Binxu Li,Minkai Xu,Meihua Dang,Stefano Ermon

Main category: cs.CV

TL;DR: This paper introduces DMPO, a new method for aligning diffusion models with human preferences by minimizing reverse KL divergence, showing superior performance over existing approaches.

Details

Motivation: Existing preference optimization methods are stuck in suboptimal mean-seeking optimization; DMPO aims to improve alignment of diffusion models with desired outputs through a novel, principled approach. Method: Divergence Minimization Preference Optimization (DMPO) minimizes reverse KL divergence to align diffusion models, inspired by divergence minimization rather than traditional RL methods. Result: Diffusion models fine-tuned with DMPO outperform or match existing techniques, specifically surpassing all baselines by at least 64.6% in PickScore across evaluation datasets. Conclusion: DMPO provides a robust and elegant pathway for preference alignment in diffusion models, bridging principled theory with practical performance. Abstract: Diffusion models have achieved remarkable success in generating realistic and versatile images from text prompts. Inspired by the recent advancements of language models, there is an increasing interest in further improving the models by aligning with human preferences. However, we investigate alignment from a divergence minimization perspective and reveal that existing preference optimization methods are typically trapped in suboptimal mean-seeking optimization. In this paper, we introduce Divergence Minimization Preference Optimization (DMPO), a novel and principled method for aligning diffusion models by minimizing reverse KL divergence, which asymptotically enjoys the same optimization direction as original RL. We provide rigorous analysis to justify the effectiveness of DMPO and conduct comprehensive experiments to validate its empirical strength across both human evaluations and automatic metrics. 
Our extensive results show that diffusion models fine-tuned with DMPO can consistently outperform or match existing techniques, specifically outperforming all existing diffusion alignment baselines by at least 64.6% in PickScore across all evaluation datasets, demonstrating the method's superiority in aligning generative behavior with desired outputs. Overall, DMPO unlocks a robust and elegant pathway for preference alignment, bridging principled theory with practical performance in diffusion models.
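The contrast the abstract draws between mean-seeking optimization and reverse KL minimization can be seen in a toy discrete example. The sketch below is not DMPO's objective; it only illustrates the general, well-known difference between forward KL, KL(p || q), which favors covering all of the target's mass, and reverse KL, KL(q || p), which favors concentrating on a mode. All distributions and names are made up for illustration.

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# A bimodal target and two candidate approximations (illustrative numbers).
target = [0.47, 0.02, 0.02, 0.02, 0.47]
broad = [0.2, 0.2, 0.2, 0.2, 0.2]        # spreads mass over everything
peaked = [0.9, 0.04, 0.02, 0.02, 0.02]   # commits to the left mode

# Forward KL: the target comes first.
forward = {"broad": kl(target, broad), "peaked": kl(target, peaked)}
# Reverse KL: the approximation comes first.
reverse = {"broad": kl(broad, target), "peaked": kl(peaked, target)}

best_forward = min(forward, key=forward.get)
best_reverse = min(reverse, key=reverse.get)
```

Under forward KL the broad candidate scores better because the peaked one is heavily penalized for missing the right mode. Under reverse KL the peaked candidate scores better because the broad one places mass where the target has almost none.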

[85] GGMotion: Group Graph Dynamics-Kinematics Networks for Human Motion Prediction

Shuaijin Wan,Huaijiang Sun

Main category: cs.CV

TL;DR: This paper proposes GGMotion, a novel approach to human motion prediction that leverages dynamics and kinematics priors by modeling human topology in groups, resulting in improved physical plausibility and performance on standard benchmarks.

Details

Motivation: Existing methods typically represent human pose as an abstract graph structure, neglecting the intrinsic physical dependencies between joints, which increases learning difficulty and makes the model prone to generating unrealistic motions. Method: The authors proposed GGMotion, a group graph dynamics-kinematics network that models human topology in groups to better leverage dynamics and kinematics priors. They introduced a radial field for the graph network to preserve geometric equivariance in 3D space and used inter-group and intra-group interaction modules to capture joint dependencies at different scales. Joint position features were updated using parallelized dynamics-kinematics propagation combined with equivariant multilayer perceptrons (MLP), and an auxiliary loss was introduced to supervise motion priors during training. Result: Extensive experiments on three standard benchmarks, including Human3.6M, CMU-Mocap, and 3DPW, demonstrated the effectiveness and superiority of GGMotion, achieving a significant performance margin in short-term motion prediction. Conclusion: GGMotion is effective and superior in human motion prediction based on the experiments conducted on three standard benchmarks. Abstract: Human motion is a continuous physical process in 3D space, governed by complex dynamic and kinematic constraints. Existing methods typically represent the human pose as an abstract graph structure, neglecting the intrinsic physical dependencies between joints, which increases learning difficulty and makes the model prone to generating unrealistic motions. In this paper, we propose GGMotion, a group graph dynamics-kinematics network that models human topology in groups to better leverage dynamics and kinematics priors. To preserve the geometric equivariance in 3D space, we propose a novel radial field for the graph network that captures more comprehensive spatio-temporal dependencies by aggregating joint features through spatial and temporal edges. 
Inter-group and intra-group interaction modules are employed to capture the dependencies of joints at different scales. Combined with equivariant multilayer perceptrons (MLP), joint position features are updated in each group through parallelized dynamics-kinematics propagation to improve physical plausibility. Meanwhile, we introduce an auxiliary loss to supervise motion priors during training. Extensive experiments on three standard benchmarks, including Human3.6M, CMU-Mocap, and 3DPW, demonstrate the effectiveness and superiority of our approach, achieving a significant performance margin in short-term motion prediction. The code is available at https://github.com/inkcat520/GGMotion.git.

[86] MUVOD: A Novel Multi-view Video Object Segmentation Dataset and A Benchmark for 3D Segmentation

Bangning Wei,Joshua Maraval,Meriem Outtas,Kidiyo Kpalma,Nicolas Ramin,Lu Zhang

Main category: cs.CV

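TL;DR: This paper introduces the MUVOD dataset, which aims to advance research on 4D object segmentation in dynamic scenes, and provides a corresponding evaluation benchmark and baseline method.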

Details

Motivation: Although methods based on NeRF and 3D Gaussian Splatting have succeeded at 3D object segmentation in static scenes, 4D object segmentation in dynamic scenes has progressed slowly because no sufficiently extensive and accurately labelled multi-view video dataset exists. Method: The MUVOD dataset comprises 17 scenes, each with between 9 and 46 views, and provides 7830 RGB images with corresponding 4D motion segmentation masks. An evaluation metric and a baseline segmentation method are also proposed. Result: MUVOD contains 459 instances in 73 categories and provides a subset of 50 objects under different conditions for a more comprehensive analysis of existing 3D object segmentation methods. Conclusion: MUVOD is a new multi-view video dataset intended to advance 4D object segmentation in dynamic scenes, and it offers a basic benchmark for evaluating multi-view video segmentation methods. Abstract: The application of methods based on Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3D GS) has steadily gained popularity in the field of 3D object segmentation in static scenes. These approaches demonstrate efficacy in a range of 3D scene understanding and editing tasks. Nevertheless, the 4D object segmentation of dynamic scenes remains an underexplored field due to the absence of a sufficiently extensive and accurately labelled multi-view video dataset. In this paper, we present MUVOD, a new multi-view video dataset for training and evaluating object segmentation in reconstructed real-world scenarios. The 17 selected scenes, describing various indoor or outdoor activities, are collected from different sources of datasets originating from various types of camera rigs. Each scene contains a minimum of 9 views and a maximum of 46 views. We provide 7830 RGB images (30 frames per video) with their corresponding segmentation mask in 4D motion, meaning that any object of interest in the scene could be tracked across temporal frames of a given view or across different views belonging to the same camera rig. This dataset, which contains 459 instances of 73 categories, is intended as a basic benchmark for the evaluation of multi-view video segmentation methods. We also present an evaluation metric and a baseline segmentation approach to encourage and evaluate progress in this evolving field. Additionally, we propose a new benchmark for the 3D object segmentation task with a subset of annotated multi-view images selected from our MUVOD dataset. 
This subset contains 50 objects of different conditions in different scenarios, providing a more comprehensive analysis of state-of-the-art 3D object segmentation methods. Our proposed MUVOD dataset is available at https://volumetric-repository.labs.b-com.com/#/muvod.

[87] Spline Deformation Field

Mingyang Song,Yang Zhang,Marko Mihajlovic,Siyu Tang,Markus Gross,Tunç Ozan Aydın

Main category: cs.CV

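TL;DR: This paper proposes a spline-based trajectory representation for modeling the trajectories of dense points. The number of spline knots explicitly determines the degrees of freedom, and a novel low-rank time-variant spatial encoding improves temporal interpolation and motion coherence without relying on linear blend skinning or as-rigid-as-possible constraints.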

Details

Motivation: Existing trajectory modeling methods either enhance the encoding strategies of deformation fields, which yields opaque and hard-to-interpret models, or adopt explicit techniques such as linear blend skinning, which rely on heuristic node initialization. In addition, the potential of implicit representations for interpolating sparse temporal signals remains under-explored. The paper aims to address these problems. Method: The paper proposes a spline-based trajectory representation in which the number of knots explicitly determines the degrees of freedom. To model knot characteristics in both the spatial and temporal domains, it introduces a novel low-rank time-variant spatial encoding that replaces conventional coupled spatiotemporal techniques. Result: The method shows superior temporal interpolation when fitting continuous fields from sparse inputs, achieves dynamic scene reconstruction quality competitive with state-of-the-art methods, and improves motion coherence. Conclusion: The proposed spline-based trajectory representation improves temporal interpolation and motion coherence without relying on linear blend skinning or as-rigid-as-possible constraints, offering a new approach to trajectory modeling. Abstract: Trajectory modeling of dense points usually employs implicit deformation fields, represented as neural networks that map coordinates to relate canonical spatial positions to temporal offsets. However, the inductive biases inherent in neural networks can hinder spatial coherence in ill-posed scenarios. Current methods focus either on enhancing encoding strategies for deformation fields, often resulting in opaque and less intuitive models, or adopt explicit techniques like linear blend skinning, which rely on heuristic-based node initialization. Additionally, the potential of implicit representations for interpolating sparse temporal signals remains under-explored. To address these challenges, we propose a spline-based trajectory representation, where the number of knots explicitly determines the degrees of freedom. This approach enables efficient analytical derivation of velocities, preserving spatial coherence and accelerations, while mitigating temporal fluctuations. To model knot characteristics in both spatial and temporal domains, we introduce a novel low-rank time-variant spatial encoding, replacing conventional coupled spatiotemporal techniques. Our method demonstrates superior performance in temporal interpolation for fitting continuous fields with sparse inputs. Furthermore, it achieves competitive dynamic scene reconstruction quality compared to state-of-the-art methods while enhancing motion coherence without relying on linear blend skinning or as-rigid-as-possible constraints.
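The idea that knots fix the degrees of freedom and that velocities follow analytically can be illustrated with a minimal sketch. It assumes a uniform cubic Hermite spline with finite-difference tangents for one coordinate of one point; the abstract does not name the spline family, so this choice and all function names are hypothetical, not the authors' implementation.

```python
def knot_tangents(knots, dt):
    """Finite-difference tangents at each knot (one-sided at the two ends)."""
    n = len(knots)
    tangents = []
    for i in range(n):
        lo = max(i - 1, 0)
        hi = min(i + 1, n - 1)
        tangents.append((knots[hi] - knots[lo]) / ((hi - lo) * dt))
    return tangents


def hermite_position_velocity(p0, p1, m0, m1, u, dt):
    """Evaluate one cubic Hermite segment and its analytic time derivative.

    p0, p1: knot values; m0, m1: knot tangents; u in [0, 1]; dt: knot spacing.
    """
    pos = ((2 * u**3 - 3 * u**2 + 1) * p0
           + (u**3 - 2 * u**2 + u) * dt * m0
           + (-2 * u**3 + 3 * u**2) * p1
           + (u**3 - u**2) * dt * m1)
    # Differentiate the basis polynomials in u, then divide by dt (chain rule).
    vel = ((6 * u**2 - 6 * u) * p0
           + (3 * u**2 - 4 * u + 1) * dt * m0
           + (-6 * u**2 + 6 * u) * p1
           + (3 * u**2 - 2 * u) * dt * m1) / dt
    return pos, vel


def evaluate_trajectory(knots, dt, t):
    """Position and velocity at time t for knots placed at 0, dt, 2*dt, ..."""
    segment = min(int(t // dt), len(knots) - 2)
    u = (t - segment * dt) / dt
    m = knot_tangents(knots, dt)
    return hermite_position_velocity(
        knots[segment], knots[segment + 1], m[segment], m[segment + 1], u, dt
    )
```

In this sketch the trajectory has exactly `len(knots)` free parameters per coordinate, and no numerical differentiation is needed to obtain the velocity.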

[88] MAPEX: Modality-Aware Pruning of Experts for Remote Sensing Foundation Models

Joelle Hanna,Linus Scheibenreif,Damian Borth

Main category: cs.CV

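TL;DR: This paper proposes MAPEX, a remote sensing foundation model based on mixture-of-modality experts. With a modality-conditioned token routing mechanism and a modality-aware pruning technique, it addresses modality mismatch and model efficiency in remote sensing and shows strong performance on multiple datasets.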

Details

Motivation: Existing remote sensing foundation models tend to focus on specific modalities (such as optical RGB or multispectral data), and their large size makes them hard to fine-tune on the typically small task datasets, so a more flexible and efficient solution is needed. Method: MAPEX is pre-trained on multi-modal remote sensing data with a novel modality-conditioned token routing mechanism, and a modality-aware pruning technique retains only the experts specialized for the task modalities. Result: MAPEX performs strongly on multiple remote sensing datasets compared with fully supervised training and state-of-the-art remote sensing foundation models. Conclusion: MAPEX is a remote sensing foundation model based on mixture-of-modality experts that addresses the mismatch between application modalities and pre-training data, and experiments validate its performance on diverse remote sensing datasets. Abstract: Remote sensing data is commonly used for tasks such as flood mapping, wildfire detection, or land-use studies. For each task, scientists carefully choose appropriate modalities or leverage data from purpose-built instruments. Recent work on remote sensing foundation models pre-trains computer vision models on large amounts of remote sensing data. These large-scale models tend to focus on specific modalities, often optical RGB or multispectral data. For many important applications, this introduces a mismatch between the application modalities and the pre-training data. Moreover, the large size of foundation models makes them expensive and difficult to fine-tune on typically small datasets for each task. We address this mismatch with MAPEX, a remote sensing foundation model based on mixture-of-modality experts. MAPEX is pre-trained on multi-modal remote sensing data with a novel modality-conditioned token routing mechanism that elicits modality-specific experts. To apply the model on a specific task, we propose a modality-aware pruning technique, which only retains experts specialized for the task modalities. This yields efficient modality-specific models while simplifying fine-tuning and deployment for the modalities of interest. We experimentally validate MAPEX on diverse remote sensing datasets and show strong performance compared to fully supervised training and state-of-the-art remote sensing foundation models. Code is available at https://github.com/HSG-AIML/MAPEX.
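The pruning step described in the abstract can be pictured with a toy sketch. It assumes, purely for illustration, that each modality maps to one expert stored in a dictionary and that routing is a hard lookup by modality; MAPEX's actual routing is learned, and every name below is hypothetical.

```python
def prune_experts(experts_by_modality, task_modalities):
    """Keep only the experts for the modalities the downstream task uses."""
    missing = [m for m in task_modalities if m not in experts_by_modality]
    if missing:
        raise KeyError(f"no expert available for modalities: {missing}")
    return {m: experts_by_modality[m] for m in task_modalities}


def route_tokens(tokens, experts):
    """Send each (modality, value) token to the expert for its modality.

    A hard lookup stands in for the learned, modality-conditioned router.
    """
    return [experts[modality](value) for modality, value in tokens]


# Toy "experts": plain functions standing in for expert sub-networks.
all_experts = {
    "optical": lambda x: x + 1,
    "sar": lambda x: x * 2,
    "dem": lambda x: -x,
}
sar_only = prune_experts(all_experts, ["sar"])
```

After pruning, the model carries only the parameters of the retained experts, which is where the efficiency gain for fine-tuning and deployment would come from.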

[89] Beyond the Linear Separability Ceiling

Enrico Vompa,Tanel Tammet,Mohit Vaishnav

Main category: cs.CV

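TL;DR: This study analyzes the "linear reasoning bottleneck" of vision-language models (VLMs) and argues that overcoming it requires targeted alignment rather than simply better representation learning.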

Details

Motivation: Most state-of-the-art vision-language models (VLMs) appear to be limited on abstract reasoning tasks by the linear separability of their visual embeddings, which motivates a study of this "linear reasoning bottleneck". Method: The paper introduces the Linear Separability Ceiling (LSC), the performance of a simple linear classifier on a VLM's visual embeddings, to investigate the bottleneck, and uses postfix tuning as a methodological control. Result: The bottleneck is widespread and stems not from poor perception but from failures in the language model's reasoning pathways. For complex relational tasks that require deeper adaptation, explicitly improving representation quality causes the model to fail on new prompt formats. Conclusion: Robust reasoning depends on targeted alignment, not simply on improved representation learning. Abstract: Most state-of-the-art Visual-Language Models (VLMs) are seemingly limited by the linear separability of their visual embeddings on abstract reasoning tasks. This work investigates this "linear reasoning bottleneck" by introducing the Linear Separability Ceiling (LSC), the performance of a simple linear classifier on a VLM's visual embeddings. We find this bottleneck is widespread and stems not from poor perception, but from failures in the language model's reasoning pathways. We demonstrate this is a solvable alignment issue. The required intervention, however, is task-dependent: activating existing pathways suffices for semantic concepts, while complex relational reasoning requires adapting core model weights. Using postfix tuning as a methodological control, we find strong evidence for powerful, dormant reasoning pathways within VLMs. However, for complex relational tasks requiring deeper adaptation, explicitly improving representation quality causes the model to fail on new prompt formats despite its embeddings remaining well separated. Ultimately, this work provides a new lens for VLM analysis, showing that robust reasoning is a matter of targeted alignment, not simply improved representation learning.
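Since the LSC is defined as the performance of a simple linear classifier on frozen embeddings, a small sketch may help. It assumes a perceptron as the linear classifier and measures accuracy on the training points for brevity; the abstract does not specify the classifier or the evaluation split, so both are assumptions and all names are hypothetical.

```python
def linear_score(weights, bias, embedding):
    """Dot product of the weight vector with one embedding, plus bias."""
    return sum(w * x for w, x in zip(weights, embedding)) + bias


def train_perceptron(embeddings, labels, epochs=100):
    """Fit a linear classifier on fixed embeddings; labels are +1 or -1."""
    weights = [0.0] * len(embeddings[0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(embeddings, labels):
            if y * linear_score(weights, bias, x) <= 0:
                weights = [w + y * xi for w, xi in zip(weights, x)]
                bias += y
                mistakes += 1
        if mistakes == 0:  # converged: the data are linearly separable
            break
    return weights, bias


def linear_accuracy(embeddings, labels, weights, bias):
    """Fraction of embeddings the linear classifier labels correctly."""
    correct = 0
    for x, y in zip(embeddings, labels):
        prediction = 1 if linear_score(weights, bias, x) > 0 else -1
        correct += prediction == y
    return correct / len(labels)
```

On linearly separable embeddings this ceiling reaches 1.0, while on an XOR-like layout no linear classifier can, so a model that beats the ceiling there must be doing something beyond a linear readout.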

[90] Diffusion-Guided Knowledge Distillation for Weakly-Supervised Low-Light Semantic Segmentation

Chunyan Wang,Dong Zhang,Jinhui Tang

Main category: cs.CV

TL;DR: This paper introduces DGKD-WLSS, a novel framework for low-light weakly-supervised semantic segmentation that combines diffusion-guided knowledge distillation and depth-guided feature fusion for improved performance.

Details

Motivation: Existing methods degrade in low-light conditions due to image quality issues and limitations of weak supervision, prompting the need for a more robust solution. Method: The paper proposes DGKD-WLSS, combining Diffusion-Guided Knowledge Distillation (DGKD) and Depth-Guided Feature Fusion (DGF2) to address low-light image challenges. Result: Extensive experiments show that DGKD-WLSS achieves state-of-the-art performance in weakly supervised semantic segmentation under low-light conditions. Conclusion: DGKD-WLSS is an effective framework for weakly-supervised semantic segmentation in low-light environments, achieving state-of-the-art performance. Abstract: Weakly-supervised semantic segmentation aims to assign category labels to each pixel using weak annotations, significantly reducing manual annotation costs. Although existing methods have achieved remarkable progress in well-lit scenarios, their performance significantly degrades in low-light environments due to two fundamental limitations: severe image quality degradation (e.g., low contrast, noise, and color distortion) and the inherent constraints of weak supervision. These factors collectively lead to unreliable class activation maps and semantically ambiguous pseudo-labels, ultimately compromising the model's ability to learn discriminative feature representations. To address these problems, we propose Diffusion-Guided Knowledge Distillation for Weakly-Supervised Low-light Semantic Segmentation (DGKD-WLSS), a novel framework that synergistically combines Diffusion-Guided Knowledge Distillation (DGKD) with Depth-Guided Feature Fusion (DGF2). DGKD aligns normal-light and low-light features via diffusion-based denoising and knowledge distillation, while DGF2 integrates depth maps as illumination-invariant geometric priors to enhance structural feature learning. 
Extensive experiments demonstrate the effectiveness of DGKD-WLSS, which achieves state-of-the-art performance in weakly supervised semantic segmentation tasks under low-light conditions. The source codes have been released at: https://github.com/ChunyanWang1/DGKD-WLSS.

[91] NexViTAD: Few-shot Unsupervised Cross-Domain Defect Detection via Vision Foundation Models and Multi-Task Learning

Tianwei Mu,Feiyu Duan,Bo Zhou,Dan Xue,Manhong Huang

Main category: cs.CV

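TL;DR: This paper proposes NexViTAD, a novel few-shot cross-domain anomaly detection framework that addresses the challenges domain shift poses for industrial anomaly detection.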

Details

Motivation: To address domain-shift challenges in industrial anomaly detection and to improve the performance and adaptability of existing methods. Method: The NexViTAD framework builds on vision foundation models and combines a shared subspace projection mechanism with a multi-task learning module. Result: On the MVTec AD dataset, NexViTAD achieves an AUC of 97.5%, an AP of 70.4%, and a PRO of 95.2%, outperforming other recent models. Conclusion: Through a hierarchical adapter module, a shared subspace projection strategy, and a multi-task decoder architecture, NexViTAD makes significant progress in cross-domain defect detection. Abstract: This paper presents a novel few-shot cross-domain anomaly detection framework, Nexus Vision Transformer for Anomaly Detection (NexViTAD), based on vision foundation models, which effectively addresses domain-shift challenges in industrial anomaly detection through innovative shared subspace projection mechanisms and a multi-task learning (MTL) module. The main innovations include: (1) a hierarchical adapter module that adaptively fuses complementary features from Hiera and DINO-v2 pre-trained models, constructing more robust feature representations; (2) a shared subspace projection strategy that enables effective cross-domain knowledge transfer through bottleneck dimension constraints and skip connection mechanisms; (3) an MTL decoder architecture that supports simultaneous processing of multiple source domains, significantly enhancing model generalization capabilities; (4) an anomaly score inference method based on Sinkhorn-K-means clustering, combined with Gaussian filtering and adaptive threshold processing for precise pixel level. Evaluated on the MVTec AD dataset, NexViTAD delivers state-of-the-art performance with an AUC of 97.5%, AP of 70.4%, and PRO of 95.2% in the target domains, surpassing other recent models, marking a transformative advance in cross-domain defect detection.

[92] HOTA: Hierarchical Overlap-Tiling Aggregation for Large-Area 3D Flood Mapping

Wenfeng Jia,Bin Liang,Yuxi Lu,Attavit Wilaiwongsakul,Muhammad Arif Khan,Lihong Zheng

Main category: cs.CV

TL;DR: This paper presents HOTA, a multi-scale inference strategy combined with SegFormer and a depth estimation module, which successfully improves flood mapping accuracy and produces detailed 3D flood maps for disaster response.

Details

Motivation: Floods are among the most frequent natural hazards causing significant social and economic damage. Timely, large-scale information on flood extent and depth is essential for disaster response; however, existing products often trade spatial detail for coverage or ignore flood depth altogether. Method: The paper introduces HOTA: Hierarchical Overlap-Tiling Aggregation, a plug-and-play, multi-scale inference strategy used in combination with SegFormer and a dual-constraint depth estimation module to form a complete 3D flood-mapping pipeline. Result: A case study on the March 2021 Kempsey (Australia) flood shows that HOTA, when coupled with SegFormer, improves IoU from 73% (U-Net baseline) to 84%. The resulting 3D surface achieves a mean absolute boundary error of less than 0.5 m. Conclusion: HOTA, when coupled with SegFormer, improves IoU from 73% (U-Net baseline) to 84%, producing accurate, large-area 3D flood maps suitable for rapid disaster response. Abstract: Floods are among the most frequent natural hazards and cause significant social and economic damage. Timely, large-scale information on flood extent and depth is essential for disaster response; however, existing products often trade spatial detail for coverage or ignore flood depth altogether. To bridge this gap, this work presents HOTA: Hierarchical Overlap-Tiling Aggregation, a plug-and-play, multi-scale inference strategy. When combined with SegFormer and a dual-constraint depth estimation module, this approach forms a complete 3D flood-mapping pipeline. HOTA applies overlapping tiles of different sizes to multispectral Sentinel-2 images only during inference, enabling the SegFormer model to capture both local features and kilometre-scale inundation without changing the network weights or retraining. 
The subsequent depth module is based on a digital elevation model (DEM) differencing method, which refines the 2D mask and estimates flood depth by enforcing (i) zero depth along the flood boundary and (ii) near-constant flood volume with respect to the DEM. A case study on the March 2021 Kempsey (Australia) flood shows that HOTA, when coupled with SegFormer, improves IoU from 73\% (U-Net baseline) to 84\%. The resulting 3D surface achieves a mean absolute boundary error of less than 0.5 m. These results demonstrate that HOTA can produce accurate, large-area 3D flood maps suitable for rapid disaster response.
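The inference-time tiling described above can be sketched as follows. The sketch assumes that per-pixel probabilities from all overlapping tiles and all tile sizes are simply averaged, and that the segmentation model is available as a callable on a 2D patch; the aggregation rule, the 50% overlap, and every name here are assumptions rather than the authors' implementation.

```python
def tile_starts(length, tile, stride):
    """Start offsets covering [0, length), with the last tile flush to the edge."""
    if tile >= length:
        return [0]
    starts = list(range(0, length - tile + 1, stride))
    if starts[-1] != length - tile:
        starts.append(length - tile)
    return starts


def aggregate_overlapping_tiles(image, predict_tile, tile_sizes, overlap=0.5):
    """Average per-pixel predictions over overlapping tiles of several sizes.

    image: 2D list of pixel values; predict_tile: callable mapping a 2D patch
    to a 2D list of flood probabilities of the same shape (hypothetical model).
    """
    height, width = len(image), len(image[0])
    prob_sum = [[0.0] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]
    for tile in tile_sizes:
        stride = max(1, int(tile * (1 - overlap)))
        for top in tile_starts(height, tile, stride):
            for left in tile_starts(width, tile, stride):
                patch = [row[left:left + tile] for row in image[top:top + tile]]
                probs = predict_tile(patch)
                for dy, prob_row in enumerate(probs):
                    for dx, p in enumerate(prob_row):
                        prob_sum[top + dy][left + dx] += p
                        count[top + dy][left + dx] += 1
    return [
        [prob_sum[y][x] / count[y][x] for x in range(width)]
        for y in range(height)
    ]
```

Because tiling happens only around the model call, the network weights are untouched, which matches the abstract's claim that no retraining is required.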

[93] Stable-Hair v2: Real-World Hair Transfer via Multiple-View Diffusion Model

Kuiyuan Sun,Yuxuan Zhang,Jichao Zhang,Jiaming Liu,Wei Wang,Niculae Sebe,Yao Zhao

Main category: cs.CV

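TL;DR: This paper proposes Stable-Hair v2, which the authors describe as the first work to apply multi-view diffusion models to hair transfer, achieving high-fidelity, view-consistent hairstyle transfer.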

Details

Motivation: Although diffusion-based methods capture diverse and complex hairstyles well, they fall short at generating consistent, high-quality multi-view outputs, which limits their use in real-world applications such as digital humans and virtual avatars. Method: The paper builds a comprehensive multi-view training data generation pipeline, comprising a diffusion-based Bald Converter, a data-augment inpainting model, and a face-finetuned multi-view diffusion model, and combines polar-azimuth embeddings with temporal attention layers in the transfer model. Result: Experiments show that the method accurately transfers detailed and realistic hairstyles to source subjects while producing seamless, consistent results across views, significantly outperforming existing methods. Conclusion: Stable-Hair v2 is a new diffusion-based multi-view hair transfer framework that achieves high-fidelity, view-consistent hair transfer and offers an effective solution for applications such as digital humans and virtual avatars. Abstract: While diffusion-based methods have shown impressive capabilities in capturing diverse and complex hairstyles, their ability to generate consistent and high-quality multi-view outputs -- crucial for real-world applications such as digital humans and virtual avatars -- remains underexplored. In this paper, we propose Stable-Hair v2, a novel diffusion-based multi-view hair transfer framework. To the best of our knowledge, this is the first work to leverage multi-view diffusion models for robust, high-fidelity, and view-consistent hair transfer across multiple perspectives. We introduce a comprehensive multi-view training data generation pipeline comprising a diffusion-based Bald Converter, a data-augment inpainting model, and a face-finetuned multi-view diffusion model to generate high-quality triplet data, including bald images, reference hairstyles, and view-aligned source-bald pairs. Our multi-view hair transfer model integrates polar-azimuth embeddings for pose conditioning and temporal attention layers to ensure smooth transitions between views. To optimize this model, we design a novel multi-stage training strategy consisting of pose-controllable latent IdentityNet training, hair extractor training, and temporal attention training. Extensive experiments demonstrate that our method accurately transfers detailed and realistic hairstyles to source subjects while achieving seamless and consistent results across views, significantly outperforming existing methods and establishing a new benchmark in multi-view hair transfer. Code is publicly available at https://github.com/sunkymepro/StableHairV2.

[94] HiM2SAM: Enhancing SAM2 with Hierarchical Motion Estimation and Memory Optimization towards Long-term Tracking

Ruixiang Chen,Guolei Sun,Yawei Li,Jie Qin,Luca Benini

Main category: cs.CV

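TL;DR: This paper presents an enhanced version of the SAM2 framework for video object tracking. A hierarchical motion estimation strategy and an optimized memory bank design significantly improve tracking under long-term occlusion and appearance change without additional training.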

Details

Motivation: To address challenges in video object tracking such as occlusion, background clutter, and target reappearance, and in particular to make tracking more reliable under long-term occlusion and appearance change. Method: The paper introduces a hierarchical motion estimation strategy that combines lightweight linear prediction with selective non-linear refinement, and optimizes the memory bank by distinguishing long-term from short-term memory frames. Result: Experiments show consistent improvements across model scales. On LaSOT and LaSOText, the large model achieves relative AUC improvements of 9.6% and 7.2% over the original SAM2, and smaller models show even larger relative gains. Conclusion: Without any training, the method improves the SAM2 framework on long-term tracking and reaches state-of-the-art performance. Abstract: This paper presents enhancements to the SAM2 framework for the video object tracking task, addressing challenges such as occlusions, background clutter, and target reappearance. We introduce a hierarchical motion estimation strategy, combining lightweight linear prediction with selective non-linear refinement to improve tracking accuracy without requiring additional training. In addition, we optimize the memory bank by distinguishing long-term and short-term memory frames, enabling more reliable tracking under long-term occlusions and appearance changes. Experimental results show consistent improvements across different model scales. Our method achieves state-of-the-art performance on LaSOT and LaSOText with the large model, achieving 9.6% and 7.2% relative improvements in AUC over the original SAM2, and demonstrates even larger relative gains on smaller models, highlighting the effectiveness of our training-free, low-overhead improvements for boosting long-term tracking performance. The code is available at https://github.com/LouisFinner/HiM2SAM.
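The "lightweight linear prediction" stage can be illustrated with a constant-velocity sketch. It assumes boxes in (x1, y1, x2, y2) form and uses an IoU threshold to decide when the costlier refinement stage would be invoked; the abstract does not describe the trigger, so that rule, the threshold, and all names are hypothetical.

```python
def predict_next_box(prev_box, curr_box):
    """Constant-velocity extrapolation of a box from its last two positions."""
    return tuple(c + (c - p) for p, c in zip(prev_box, curr_box))


def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def needs_refinement(predicted_box, candidate_box, iou_threshold=0.5):
    """Hypothetical trigger: refine only when the candidate mask's box
    disagrees with the linear motion prediction."""
    return box_iou(predicted_box, candidate_box) < iou_threshold
```

The appeal of such a scheme is that the cheap prediction runs on every frame while the expensive step runs only on the frames where motion and appearance cues disagree.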

[95] LOSC: LiDAR Open-voc Segmentation Consolidator

Nermin Samet,Gilles Puy,Renaud Marlet

Main category: cs.CV

TL;DR: LOSC uses vision-language models and a 3D network to improve open-vocabulary segmentation of lidar scans.

Details

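Motivation: Classical back-projection of image semantics yields noisy and sparse point labels, so a more accurate and robust solution is needed. Method: Image semantics are back-projected onto 3D point clouds, the resulting labels are consolidated to enforce spatio-temporal consistency and robustness to image-level augmentations, and a 3D network is then trained on the refined labels. Result: LOSC significantly outperforms existing zero-shot open-vocabulary semantic and panoptic segmentation methods on nuScenes and SemanticKITTI. Conclusion: LOSC is a simple but effective method for open-vocabulary segmentation of lidar scans that outperforms the prior state of the art. Abstract: We study the use of image-based Vision-Language Models (VLMs) for open-vocabulary segmentation of lidar scans in driving settings. Classically, image semantics can be back-projected onto 3D point clouds. Yet, resulting point labels are noisy and sparse. We consolidate these labels to enforce both spatio-temporal consistency and robustness to image-level augmentations. We then train a 3D network based on these refined labels. This simple method, called LOSC, outperforms the SOTA of zero-shot open-vocabulary semantic and panoptic segmentation on both nuScenes and SemanticKITTI, with significant margins.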
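The classical back-projection step, and why its labels come out sparse, can be shown with a minimal sketch. It assumes points already expressed in the camera frame and an undistorted pinhole camera; the majority vote at the end is only a stand-in for the paper's consolidation step, which the abstract does not detail. All names are hypothetical.

```python
from collections import Counter


def backproject_labels(points_cam, label_map, fx, fy, cx, cy):
    """Assign each 3D point the label of the pixel it projects to.

    points_cam: (x, y, z) points in the camera frame, z pointing forward.
    label_map: 2D list of per-pixel labels, indexed as label_map[v][u].
    Points behind the camera or outside the image get None (unlabelled).
    """
    height, width = len(label_map), len(label_map[0])
    labels = []
    for x, y, z in points_cam:
        if z <= 0:
            labels.append(None)
            continue
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if 0 <= u < width and 0 <= v < height:
            labels.append(label_map[v][u])
        else:
            labels.append(None)
    return labels


def majority_label(votes):
    """Stand-in consolidation: most frequent non-None label across views."""
    valid = [v for v in votes if v is not None]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]
```

Any point that no camera sees keeps a `None` label, which is the sparsity the abstract refers to; disagreement between views is one source of the noise.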

[96] SpatialViz-Bench: Automatically Generated Spatial Visualization Reasoning Tasks for MLLMs

Siting Wang,Luoyang Sun,Cheng Deng,Kun Shao,Minnan Pei,Zheng Tian,Haifeng Zhang,Jun Wang

Main category: cs.CV

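TL;DR: This study introduces SpatialViz-Bench, a new multi-modal benchmark for evaluating the spatial visualization ability of multi-modal large language models (MLLMs), and reveals the models' shortcomings on such tasks.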

Details

Motivation: Existing evaluations often rely on IQ tests or math competitions that may overlap with training data, which compromises their reliability. A benchmark dedicated to spatial visualization is therefore needed. Method: The paper introduces SpatialViz-Bench, a comprehensive multi-modal benchmark with 12 tasks and 1,180 automatically generated problems, and evaluates 33 state-of-the-art MLLMs on it. Result: The evaluation reveals wide performance variation, demonstrates the benchmark's strong discriminative power, and finds that models behave in ways that are inconsistent with human intuition. Conclusion: SpatialViz-Bench shows that state-of-the-art MLLMs still have deficiencies in spatial visualization tasks, addressing a significant gap in the field. Abstract: Humans can directly imagine and manipulate visual images in their minds, a capability known as spatial visualization. While multi-modal Large Language Models (MLLMs) support imagination-based reasoning, spatial visualization remains insufficiently evaluated, typically embedded within broader mathematical and logical assessments. Existing evaluations often rely on IQ tests or math competitions that may overlap with training data, compromising assessment reliability. To this end, we introduce SpatialViz-Bench, a comprehensive multi-modal benchmark for spatial visualization with 12 tasks across 4 sub-abilities, comprising 1,180 automatically generated problems. Our evaluation of 33 state-of-the-art MLLMs not only reveals wide performance variations and demonstrates the benchmark's strong discriminative power, but also uncovers counter-intuitive findings: models exhibit unexpected behaviors by showing difficulty perception that misaligns with human intuition, displaying dramatic 2D-to-3D performance cliffs, and defaulting to formula derivation despite spatial tasks requiring visualization alone. SpatialViz-Bench empirically demonstrates that state-of-the-art MLLMs continue to exhibit deficiencies in spatial visualization tasks, thereby addressing a significant lacuna in the field. The benchmark is publicly available.

[97] ViLU: Learning Vision-Language Uncertainties for Failure Prediction

Marc Lafon,Yannis Karmim,Julio Silva-Rodriguez,Paul Couairon,Clément Rambour,Raphaël Fournier-Sniehotta,Ismail Ben Ayed,Jose Dolz,Nicolas Thome

Main category: cs.CV

TL;DR: ViLU is a framework for reliable uncertainty quantification in Vision-Language Models that uses multi-modal representation and binary classification for effective failure prediction.

Details

Motivation: Reliable Uncertainty Quantification (UQ) and failure prediction are open challenges for Vision-Language Models (VLMs). Method: ViLU constructs an uncertainty-aware multi-modal representation by integrating visual embedding, predicted textual embedding, and image-conditioned textual representation via cross-attention. It trains an uncertainty predictor as a binary classifier using weighted binary cross-entropy loss. Result: Extensive experiments show significant gains compared to state-of-the-art failure prediction methods on datasets like ImageNet-1k, CC12M, and LAION-400M. Conclusion: ViLU is a new Vision-Language Uncertainty quantification framework that provides reliable uncertainty quantification and failure prediction for Vision-Language Models. Abstract: Reliable Uncertainty Quantification (UQ) and failure prediction remain open challenges for Vision-Language Models (VLMs). We introduce ViLU, a new Vision-Language Uncertainty quantification framework that contextualizes uncertainty estimates by leveraging all task-relevant textual representations. ViLU constructs an uncertainty-aware multi-modal representation by integrating the visual embedding, the predicted textual embedding, and an image-conditioned textual representation via cross-attention. Unlike traditional UQ methods based on loss prediction, ViLU trains an uncertainty predictor as a binary classifier to distinguish correct from incorrect predictions using a weighted binary cross-entropy loss, making it loss-agnostic. In particular, our proposed approach is well-suited for post-hoc settings, where only vision and text embeddings are available without direct access to the model itself. Extensive experiments on diverse datasets show the significant gains of our method compared to state-of-the-art failure prediction methods. We apply our method to standard classification datasets, such as ImageNet-1k, as well as large-scale image-caption datasets like CC12M and LAION-400M. 
Ablation studies highlight the critical role of our architecture and training in achieving effective uncertainty quantification. Our code is publicly available and can be found here: https://github.com/ykrmm/ViLU.
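The weighted binary cross-entropy objective mentioned in the abstract can be written out as a short sketch. It assumes the positive class is "the VLM's prediction was wrong" and that the weight on that class is set by inverse class frequency; the abstract specifies neither choice, so both are assumptions and the function names are hypothetical.

```python
import math


def balanced_failure_weight(is_failure):
    """Hypothetical weighting: ratio of correct to failed predictions."""
    failures = sum(1 for y in is_failure if y)
    correct = len(is_failure) - failures
    return correct / failures if failures else 1.0


def weighted_bce(failure_probs, is_failure, failure_weight):
    """Mean binary cross-entropy with extra weight on the failure class.

    failure_probs: predicted probability that each VLM prediction is wrong.
    is_failure: 1 if the VLM prediction was actually wrong, else 0.
    """
    eps = 1e-12  # clamp probabilities to keep log() finite
    total = 0.0
    for p, y in zip(failure_probs, is_failure):
        p = min(max(p, eps), 1.0 - eps)
        if y:
            total += -failure_weight * math.log(p)
        else:
            total += -math.log(1.0 - p)
    return total / len(failure_probs)
```

Because the target is only whether the prediction was right or wrong, this objective never needs the task loss of the underlying model, which is what makes the approach loss-agnostic.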

[98] T-GVC: Trajectory-Guided Generative Video Coding at Ultra-Low Bitrates

Zhitao Wang,Hengyu Man,Wenrui Li,Xingtao Wang,Xiaopeng Fan,Debin Zhao

Main category: cs.CV

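TL;DR: This paper proposes T-GVC, a new generative video coding framework that uses trajectory guidance to achieve efficient video compression and precise motion control at ultra-low bitrates.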

Details

Motivation: Existing generative video coding methods are limited by domain specificity or by excessive dependence on high-level text guidance, which fails to capture motion details and leads to unrealistic reconstructions. Method: T-GVC uses a semantic-aware sparse motion sampling pipeline and introduces a training-free latent-space guidance mechanism based on trajectory-aligned loss constraints, bridging low-level motion tracking with high-level semantic understanding. Result: Experiments show that T-GVC outperforms traditional codecs and state-of-the-art end-to-end video compression methods at ultra-low bitrates, and achieves more precise motion control than existing text-guided methods. Conclusion: With its trajectory-guided generative video coding framework, T-GVC outperforms traditional codecs and existing end-to-end video compression methods at ultra-low bitrates while offering better motion control than existing text-guided methods. Abstract: Recent advances in video generation techniques have given rise to an emerging paradigm of generative video coding, aiming to achieve semantically accurate reconstructions in Ultra-Low Bitrate (ULB) scenarios by leveraging strong generative priors. However, most existing methods are limited by domain specificity (e.g., facial or human videos) or an excessive dependence on high-level text guidance, which often fails to capture motion details and results in unrealistic reconstructions. To address these challenges, we propose a Trajectory-Guided Generative Video Coding framework (dubbed T-GVC). T-GVC employs a semantic-aware sparse motion sampling pipeline to effectively bridge low-level motion tracking with high-level semantic understanding by extracting pixel-wise motion as sparse trajectory points based on their semantic importance, not only significantly reducing the bitrate but also preserving critical temporal semantic information. In addition, by incorporating trajectory-aligned loss constraints into diffusion processes, we introduce a training-free latent space guidance mechanism to ensure physically plausible motion patterns without sacrificing the inherent capabilities of generative models. Experimental results demonstrate that our framework outperforms both traditional codecs and state-of-the-art end-to-end video compression methods under ULB conditions. Furthermore, additional experiments confirm that our approach achieves more precise motion control than existing text-guided methods, paving the way for a novel direction of generative video coding guided by geometric motion modeling.

[99] Bridging the gap in FER: addressing age bias in deep learning

F. Xavier Gaya-Morey,Julia Sanchez-Perez,Cristina Manresa-Yee,Jose M. Buades-Rubio

Main category: cs.CV

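TL;DR: This study reveals age-related bias in deep-learning facial expression recognition and proposes bias mitigation strategies (multi-task learning, multi-modal input, and age-weighted loss) that improve recognition accuracy for elderly individuals.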

Details

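Motivation: Deep-learning-based facial expression recognition (FER) systems have achieved impressive results in recent years, but these models often exhibit demographic biases, particularly with respect to age, which can compromise their fairness and reliability. The paper therefore presents a comprehensive study of age-related bias in deep FER models, with a particular focus on the elderly. Method: The study first investigates whether recognition performance differs across age groups, which expressions are most affected, and whether model attention varies with age. Explainable AI (XAI) techniques are then used to identify systematic disparities in expression recognition and attention patterns. Based on these findings, three bias mitigation strategies are proposed: multi-task learning, multi-modal input, and age-weighted loss. Result: Recognition accuracy improves consistently for elderly individuals, particularly on the most error-prone expressions. Saliency heatmap analysis shows that models trained with age-aware strategies attend to facial regions more relevant to each age group, which helps explain the gains. Conclusion: Age-related bias in deep-learning FER exists but can be effectively mitigated with simple training modifications. Multi-task learning, multi-modal input, and age-weighted loss improve recognition accuracy for elderly individuals, and even approximate age labels are valuable for promoting fairness in large-scale affective computing systems. Abstract: Facial Expression Recognition (FER) systems based on deep learning have achieved impressive performance in recent years. However, these models often exhibit demographic biases, particularly with respect to age, which can compromise their fairness and reliability. In this work, we present a comprehensive study of age-related bias in deep FER models, with a particular focus on the elderly population. We first investigate whether recognition performance varies across age groups, which expressions are most affected, and whether model attention differs depending on age. Using Explainable AI (XAI) techniques, we identify systematic disparities in expression recognition and attention patterns, especially for "neutral", "sadness", and "anger" in elderly individuals. Based on these findings, we propose and evaluate three bias mitigation strategies: Multi-task Learning, Multi-modal Input, and Age-weighted Loss. Our models are trained on a large-scale dataset, AffectNet, with automatically estimated age labels and validated on balanced benchmark datasets that include underrepresented age groups. Results show consistent improvements in recognition accuracy for elderly individuals, particularly for the most error-prone expressions. Saliency heatmap analysis reveals that models trained with age-aware strategies attend to more relevant facial regions for each age group, helping to explain the observed improvements. 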
These findings suggest that age-related bias in FER can be effectively mitigated using simple training modifications, and that even approximate demographic labels can be valuable for promoting fairness in large-scale affective computing systems.

[100] MolCLIP: A Molecular-Auxiliary CLIP Framework for Identifying Drug Mechanism of Action Based on Time-Lapsed Mitochondrial Images

Fengqian Pang,Chunyue Lei,Hongfei Zhao,Chenghao Liu,Zhiqiang Xing,Huafeng Wang,Chuyang Ye

Main category: cs.CV

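TL;DR: This paper proposes MolCLIP, which combines microscopic cell videos with drug molecule information. The molecular latent space guides video feature learning and feature aggregation is optimized, yielding marked gains in drug identification and mechanism-of-action recognition.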

Details

Motivation: Existing deep-learning methods focus on the spatial characteristics of cells while overlooking their temporal dynamics, and they do not fully exploit the complementarity between the drug molecule modality and the image modality. Method: MolCLIP uses a molecule-auxiliary CLIP framework to guide video features in learning the distribution of the molecular latent space, and integrates a metric learning strategy to optimize the aggregation of video features. It is validated experimentally on the MitoDataset. Result: On the MitoDataset, MolCLIP improves mAP by 51.2% for drug identification and by 20.5% for mechanism-of-action recognition. Conclusion: MolCLIP is a visual language model that combines microscopic cell videos with the molecule modality. Its molecule-auxiliary CLIP framework and integrated metric learning strategy optimize the learning and aggregation of video features, yielding marked improvements in drug identification and mechanism-of-action recognition. Abstract: Drug Mechanism of Action (MoA) mainly investigates how drug molecules interact with cells, which is crucial for drug discovery and clinical application. Recently, deep learning models have been used to recognize MoA by relying on high-content and fluorescence images of cells exposed to various drugs. However, these methods focus on spatial characteristics while overlooking the temporal dynamics of live cells. Time-lapse imaging is more suitable for observing the cell response to drugs. Additionally, drug molecules can trigger cellular dynamic variations related to specific MoA. This indicates that the drug molecule modality may complement the image counterpart. This paper proposes MolCLIP, the first visual language model to combine microscopic cell video and molecule modalities. MolCLIP designs a molecule-auxiliary CLIP framework to guide video features in learning the distribution of the molecular latent space. Furthermore, we integrate a metric learning strategy with MolCLIP to optimize the aggregation of video features. Experimental results on the MitoDataset demonstrate that MolCLIP achieves improvements of 51.2% and 20.5% in mAP for drug identification and MoA recognition, respectively.

[101] Scaling RL to Long Videos

Yukang Chen,Wei Huang,Baifeng Shi,Qinghao Hu,Hanrong Ye,Ligeng Zhu,Zhijian Liu,Pavlo Molchanov,Jan Kautz,Xiaojuan Qi,Sifei Liu,Hongxu Yin,Yao Lu,Song Han

Main category: cs.CV

TL;DR: This paper introduces LongVILA-R1, a scalable framework for long video reasoning using VLMs enhanced by reinforcement learning and a new high-quality dataset, achieving state-of-the-art performance on long video understanding tasks.

Details

Motivation: To address the challenges of scaling reasoning in vision-language models (VLMs) for long video analysis, enabling better performance on diverse reasoning tasks such as temporal, spatial, and plot reasoning. Method: The paper proposes a full-stack framework incorporating a large-scale dataset (LongVideo-Reason), a two-stage training pipeline with CoT-SFT and RL, and a specialized infrastructure (MR-SP) for efficient long video RL training. Result: LongVILA-R1-7B achieves strong results on long video QA benchmarks like VideoMME, outperforms existing models like Video-R1-7B, matches Gemini-1.5-Pro across multiple reasoning domains, and MR-SP achieves up to 2.1x speedup in training. Conclusion: LongVILA-R1 demonstrates significant performance improvements in long video reasoning tasks and marks progress towards effective vision-language modeling for extended videos. Abstract: We introduce a full-stack framework that scales up reasoning in vision-language models (VLMs) to long videos, leveraging reinforcement learning. We address the unique challenges of long video reasoning by integrating three critical components: (1) a large-scale dataset, LongVideo-Reason, comprising 52K long video QA pairs with high-quality reasoning annotations across diverse domains such as sports, games, and vlogs; (2) a two-stage training pipeline that extends VLMs with chain-of-thought supervised fine-tuning (CoT-SFT) and reinforcement learning (RL); and (3) a training infrastructure for long video RL, named Multi-modal Reinforcement Sequence Parallelism (MR-SP), which incorporates sequence parallelism and a vLLM-based engine tailored for long video, using cached video embeddings for efficient rollout and prefilling. In experiments, LongVILA-R1-7B achieves strong performance on long video QA benchmarks such as VideoMME. 
It also outperforms Video-R1-7B and even matches Gemini-1.5-Pro across temporal reasoning, goal and purpose reasoning, spatial reasoning, and plot reasoning on our LongVideo-Reason-eval benchmark. Notably, our MR-SP system achieves up to 2.1x speedup on long video RL training. LongVILA-R1 demonstrates consistent performance gains as the number of input video frames scales. LongVILA-R1 marks a firm step towards long video reasoning in VLMs. In addition, we release our training system for public availability that supports RL training on various modalities (video, text, and audio), various models (VILA and Qwen series), and even image and video generation models. On a single A100 node (8 GPUs), it supports RL training on hour-long videos (e.g., 3,600 frames / around 256k tokens).

[102] Attend-and-Refine: Interactive keypoint estimation and quantitative cervical vertebrae analysis for bone age assessment

Jinhee Kim,Taesung Kim,Taewoo Kim,Dong-Wook Kim,Byungduk Ahn,Yoon-Ji Kim,In-Seok Song,Jaegul Choo

Main category: cs.CV

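TL;DR: This study develops ARNet, an AI-assisted tool that improves the efficiency and accuracy of growth-potential assessment in pediatric orthodontics and reduces the workload of manual keypoint annotation.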

Details

Motivation: In pediatric orthodontics, accurately estimating growth potential is essential for planning effective treatment. However, annotating keypoints on the cervical vertebrae with conventional methods is time-consuming and labor-intensive, so a more efficient and reliable method is needed to determine the optimal timing of orthodontic interventions. Method: The authors comprehensively analyze cervical vertebral maturation (CVM) features from lateral cephalometric radiographs and introduce ARNet, a deep learning-based interactive model for keypoint annotation. ARNet combines an interaction-guided recalibration network with a morphology-aware loss function. Result: ARNet substantially reduces the manual effort of keypoint annotation and improves efficiency and accuracy; validated on multiple datasets, it shows strong performance and broad applicability in medical imaging. Conclusion: The study provides an effective AI-assisted diagnostic tool for assessing growth potential in pediatric orthodontics, marking an important advance in the field. Abstract: In pediatric orthodontics, accurate estimation of growth potential is essential for developing effective treatment strategies. Our research aims to predict this potential by identifying the growth peak and analyzing cervical vertebra morphology solely through lateral cephalometric radiographs. We accomplish this by comprehensively analyzing cervical vertebral maturation (CVM) features from these radiographs. This methodology provides clinicians with a reliable and efficient tool to determine the optimal timings for orthodontic interventions, ultimately enhancing patient outcomes. A crucial aspect of this approach is the meticulous annotation of keypoints on the cervical vertebrae, a task often challenged by its labor-intensive nature. To mitigate this, we introduce Attend-and-Refine Network (ARNet), a user-interactive, deep learning-based model designed to streamline the annotation process. ARNet features an interaction-guided recalibration network, which adaptively recalibrates image features in response to user feedback, coupled with a morphology-aware loss function that preserves the structural consistency of keypoints. This novel approach substantially reduces manual effort in keypoint identification, thereby enhancing the efficiency and accuracy of the process. Extensively validated across various datasets, ARNet demonstrates remarkable performance and exhibits wide-ranging applicability in medical imaging. In conclusion, our research offers an effective AI-assisted diagnostic tool for assessing growth potential in pediatric orthodontics, marking a significant advancement in the field.

[103] Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology

Haochen Wang,Xiangtai Li,Zilong Huang,Anran Wang,Jiacong Wang,Tao Zhang,Jiani Zheng,Sule Bai,Zijian Kang,Jiashi Feng,Zhuochen Wang,Zhaoxiang Zhang

Main category: cs.CV

TL;DR: This paper introduces TreeBench, a new benchmark for evaluating visual grounded reasoning with traceable evidence and complex reasoning tasks, along with TreeVGR, a training paradigm that improves performance across multiple benchmarks.

Details

Motivation: Current models like OpenAI-o3 excel at visual grounded reasoning but lack holistic benchmarks to evaluate their capabilities, particularly regarding traceable evidence and complex reasoning. This gap motivated the development of TreeBench. Method: The study proposes TreeBench based on three principles: focused visual perception, traceable evidence evaluation, and second-order reasoning. It samples images from SA-1B and uses LMM experts to annotate data. A new training paradigm, TreeVGR, incorporates reinforcement learning for joint supervision of localization and reasoning. Result: TreeBench contains 405 challenging visual question-answering pairs, with even advanced models struggling (e.g., OpenAI-o3 scores 54.87%). Using TreeVGR, improvements are achieved on multiple benchmarks, including V* Bench (+16.8), MME-RealWorld (+12.6), and TreeBench (+13.4). Conclusion: TreeBench serves as a diagnostic benchmark for evaluating visual grounded reasoning, and the TreeVGR training paradigm demonstrates effectiveness in enhancing localization and reasoning capabilities. Abstract: Models like OpenAI-o3 pioneer visual grounded reasoning by dynamically referencing visual regions, just like human "thinking with images". However, no benchmark exists to evaluate these capabilities holistically. To bridge this gap, we propose TreeBench (Traceable Evidence Evaluation Benchmark), a diagnostic benchmark built on three principles: (1) focused visual perception of subtle targets in complex scenes, (2) traceable evidence via bounding box evaluation, and (3) second-order reasoning to test object interactions and spatial hierarchies beyond simple object localization. Prioritizing images with dense objects, we initially sample 1K high-quality images from SA-1B, and incorporate eight LMM experts to manually annotate questions, candidate options, and answers for each image. 
After three stages of quality control, TreeBench consists of 405 challenging visual question-answering pairs. Even the most advanced models struggle with this benchmark: none of them reaches 60% accuracy (e.g., OpenAI-o3 scores only 54.87). Furthermore, we introduce TreeVGR (Traceable Evidence Enhanced Visual Grounded Reasoning), a training paradigm to supervise localization and reasoning jointly with reinforcement learning, enabling accurate localizations and explainable reasoning pathways. Initialized from Qwen2.5-VL-7B, it improves V* Bench (+16.8), MME-RealWorld (+12.6), and TreeBench (+13.4), proving traceability is key to advancing vision-grounded reasoning. The code is available at https://github.com/Haochen-Wang409/TreeVGR.
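As an illustration of the "traceable evidence via bounding box evaluation" principle, the sketch below scores predicted evidence boxes against annotated ones using intersection-over-union (IoU). The abstract does not state the exact localization metric, so IoU, the `(x1, y1, x2, y2)` box layout, and the function names `box_iou` and `mean_best_iou` are assumptions made for this example.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes contribute no intersection area.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_best_iou(predicted, annotated):
    """Average, over annotated boxes, of the best IoU achieved by any prediction.

    Hypothetical aggregate for illustration; not taken from the paper.
    """
    if not annotated or not predicted:
        return 0.0
    return sum(max(box_iou(p, g) for p in predicted) for g in annotated) / len(annotated)
```

A score of this kind rewards a model only when the regions it cites overlap the annotated evidence, which is what makes the reasoning trace checkable.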

[104] Action Unit Enhance Dynamic Facial Expression Recognition

Feng Liu,Lingna Gu,Chen Shi,Xiaolan Fu

Main category: cs.CV

TL;DR: The paper introduces AU-DFER, an enhanced method for dynamic facial expression recognition that integrates AU-expression knowledge and tackles data imbalance, leading to improved model performance.

Details

Motivation: The motivation is to improve the effectiveness of deep learning modeling in Dynamic Facial Expression Recognition (DFER) by incorporating AU-expression knowledge and addressing data label imbalance issues in existing datasets. Method: The authors propose an AU-enhanced Dynamic Facial Expression Recognition (AU-DFER) architecture. They quantify the contribution of Action Units (AUs) to different expressions, design a weight matrix to incorporate prior knowledge, and integrate this knowledge with deep learning through an AU loss function. Result: Experiments on three recent open-source approaches and principal datasets show that the proposed AU-DFER architecture outperforms state-of-the-art methods without additional computation and addresses label imbalance issues effectively. Conclusion: This paper concludes that integrating quantified AU-expression knowledge into DFER models enhances their effectiveness, and addressing data imbalance improves performance. Abstract: Dynamic Facial Expression Recognition (DFER) is a rapidly evolving field of research that focuses on the recognition of time-series facial expressions. While previous research on DFER has concentrated on feature learning from a deep learning perspective, we put forward an AU-enhanced Dynamic Facial Expression Recognition architecture, namely AU-DFER, that incorporates AU-expression knowledge to enhance the effectiveness of deep learning modeling. In particular, the contribution of the Action Units (AUs) to different expressions is quantified, and a weight matrix is designed to incorporate a priori knowledge. Subsequently, the knowledge is integrated with the learning outcomes of a conventional deep learning network through the introduction of AU loss. The design is incorporated into the existing optimal model for dynamic expression recognition for the purpose of validation. 
Experiments are conducted on three recent mainstream open-source approaches to DFER on the principal datasets in this field. The results demonstrate that the proposed architecture outperforms the state-of-the-art (SOTA) methods without the need for additional computation and generally produces improved results. Furthermore, we investigate the potential of AU loss function redesign to address data label imbalance issues in established dynamic expression datasets. To the best of our knowledge, this is the first attempt to integrate quantified AU-expression knowledge into various DFER models. We also devise strategies to tackle label imbalance, or minority class problems. Our findings suggest that employing a diverse strategy of loss function design can enhance the effectiveness of DFER. This underscores the criticality of addressing data imbalance challenges in mainstream datasets within this domain. The source code is available at https://github.com/Cross-Innovation-Lab/AU-DFER.

Shin'ya Yamaguchi,Kosuke Nishida,Daiki Chijiwa

Main category: cs.CV

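TL;DR: This paper proposes RED, a new decoding strategy that combines visual information with the intermediate rationales of chain-of-thought (CoT) reasoning, addressing the problem that large vision-language models ignore rationale content during multi-modal reasoning.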

Details

Motivation: Existing large vision-language models (LVLMs) often ignore the contents of the generated rationales when using chain-of-thought (CoT) reasoning, which degrades reasoning quality. Method: Multi-modal CoT reasoning is re-formulated as KL-constrained reward maximization, and RED is proposed to combine image-conditional and rationale-conditional distributions at decoding time. Result: Experiments show that RED significantly outperforms standard CoT and other decoding methods across multiple benchmarks and LVLMs. Conclusion: RED is an effective way to improve reasoning, increasing the reliability and accuracy of CoT in LVLMs and laying the groundwork for more trustworthy multi-modal systems. Abstract: Large vision-language models (LVLMs) have demonstrated remarkable capabilities by integrating pre-trained vision encoders with large language models (LLMs). Similar to single-modal LLMs, chain-of-thought (CoT) prompting has been adapted for LVLMs to enhance multi-modal reasoning by generating intermediate rationales based on visual and textual inputs. While CoT is assumed to improve grounding and accuracy in LVLMs, our experiments reveal a key challenge: existing LVLMs often ignore the contents of generated rationales in CoT reasoning. To address this, we re-formulate multi-modal CoT reasoning as a KL-constrained reward maximization focused on rationale-conditional log-likelihood. As the optimal solution, we propose rationale-enhanced decoding (RED), a novel plug-and-play inference-time decoding strategy. RED harmonizes visual and rationale information by multiplying distinct image-conditional and rationale-conditional next token distributions. Extensive experiments show that RED consistently and significantly improves reasoning over standard CoT and other decoding methods across multiple benchmarks and LVLMs. Our work offers a practical and effective approach to improve both the faithfulness and accuracy of CoT reasoning in LVLMs, paving the way for more reliable rationale-grounded multi-modal systems.
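The abstract describes RED as multiplying image-conditional and rationale-conditional next-token distributions. A minimal sketch of that product step, done in log space and renormalized, could look as follows. The exponent `weight` on the rationale term and the function name `combine_next_token` are assumptions for illustration; the abstract does not give the exact weighting.

```python
import math


def combine_next_token(p_image, p_rationale, weight=1.0):
    """Multiply two next-token distributions over the same vocabulary and renormalize.

    `weight` is a hypothetical exponent applied to the rationale-conditional term.
    """
    log_scores = [
        math.log(pi) + weight * math.log(pr) if pi > 0 and pr > 0 else -math.inf
        for pi, pr in zip(p_image, p_rationale)
    ]
    top = max(log_scores)
    if top == -math.inf:
        raise ValueError("the two distributions share no token with non-zero probability")
    # Subtract the maximum before exponentiating for numerical stability.
    unnormalized = [math.exp(s - top) for s in log_scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]
```

A token has to be plausible under both conditionals to keep a high combined probability, which is one way to read the claim that the rationale can no longer be ignored at decoding time.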

[106] Tree-Mamba: A Tree-Aware Mamba for Underwater Monocular Depth Estimation

Peixian Zhuang,Yijian Wang,Zhenqi Fu,Hongliang Zhang,Sam Kwong,Chongyi Li

Main category: cs.CV

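TL;DR: This paper proposes Tree-Mamba and the BlueDepth dataset, which significantly improve the accuracy and reliability of underwater monocular depth estimation.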

Details

Motivation: Underwater images are severely degraded by light absorption and scattering, existing datasets contain unreliable depth labels, and conventional Mamba-based methods cannot effectively model the structural features of underwater images. A more flexible method and a high-quality dataset are therefore needed to improve UMDE performance. Method: A tree-aware Mamba method (Tree-Mamba) is proposed, with an adaptive scanning strategy that builds a minimum spanning tree from feature similarity and aggregates multi-scale spatial topological features through top-down and bottom-up traversals. The authors also construct BlueDepth, an underwater benchmark dataset of 38,162 image pairs with reliable depth labels. Result: Tree-Mamba outperforms several leading methods in both qualitative and quantitative evaluations with competitive computational efficiency, and BlueDepth provides a foundation for training deep learning-based UMDE methods. Conclusion: Tree-Mamba performs well on underwater monocular depth estimation and, together with the new BlueDepth dataset, offers an effective solution to underwater image degradation. Abstract: Underwater Monocular Depth Estimation (UMDE) is a critical task that aims to estimate high-precision depth maps from underwater degraded images caused by light absorption and scattering effects in marine environments. Recently, Mamba-based methods have achieved promising performance across various vision tasks; however, they struggle with the UMDE task because their inflexible state scanning strategies fail to model the structural features of underwater images effectively. Meanwhile, existing UMDE datasets usually contain unreliable depth labels, leading to incorrect object-depth relationships between underwater images and their corresponding depth maps. To overcome these limitations, we develop a novel tree-aware Mamba method, dubbed Tree-Mamba, for estimating accurate monocular depth maps from underwater degraded images. Specifically, we propose a tree-aware scanning strategy that adaptively constructs a minimum spanning tree based on feature similarity. The spatial topological features among the tree nodes are then flexibly aggregated through bottom-up and top-down traversals, enabling stronger multi-scale feature representation capabilities. Moreover, we construct an underwater depth estimation benchmark (called BlueDepth), which consists of 38,162 underwater image pairs with reliable depth labels. This benchmark serves as a foundational dataset for training existing deep learning-based UMDE methods to learn accurate object-depth relationships. 
Extensive experiments demonstrate the superiority of the proposed Tree-Mamba over several leading methods in both qualitative results and quantitative evaluations with competitive computational efficiency. Code and dataset will be available at https://wyjgr.github.io/Tree-Mamba.html.
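The tree-aware scanning strategy is described as building a minimum spanning tree from feature similarity. The sketch below shows one way such a tree could be built over a feature map, using Kruskal's algorithm on a 4-connected pixel grid. The grid connectivity, the Euclidean distance as the dissimilarity measure, and the name `grid_mst` are assumptions; the bottom-up and top-down aggregation passes are not shown.

```python
def grid_mst(features):
    """Minimum spanning tree over a 4-connected grid of feature vectors.

    features: H x W grid (list of rows) of equal-length lists of floats.
    Returns the tree as a list of ((r1, c1), (r2, c2)) edges.
    """
    h, w = len(features), len(features[0])

    def dist(a, b):
        # Euclidean distance: a small value means the two features are similar.
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    edges = []
    for r in range(h):
        for c in range(w):
            if c + 1 < w:
                edges.append((dist(features[r][c], features[r][c + 1]), (r, c), (r, c + 1)))
            if r + 1 < h:
                edges.append((dist(features[r][c], features[r + 1][c]), (r, c), (r + 1, c)))
    edges.sort()  # cheapest (most similar) connections first

    # Union-find over grid cells, with path halving in find().
    parent = {(r, c): (r, c) for r in range(h) for c in range(w)}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for _, u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # keep the edge only if it joins two components
            parent[root_u] = root_v
            tree.append((u, v))
    return tree
```

Because the tree keeps the most similar neighbor links, a scan that follows it tends to visit pixels of the same structure together instead of in fixed raster order.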

[107] Motion-Aware Adaptive Pixel Pruning for Efficient Local Motion Deblurring

Wei Shang,Dongwei Ren,Wanying Zhang,Pengfei Zhu,Qinghua Hu,Wangmeng Zuo

Main category: cs.CV

TL;DR: This paper proposes an efficient deblurring method using masked prediction and motion analysis, achieving better performance with reduced computation.

Details

Motivation: Existing deblurring methods struggle with inefficient resource allocation and poor handling of spatially varying blur patterns, necessitating a more effective approach. Method: A trainable mask predictor identifies blurred regions, and structural reparameterization converts 3×3 convolutions into efficient 1×1 convolutions for pixel-level pruning. An intra-frame motion analyzer provides adaptive guidance for blur restoration. Result: Extensive experiments show that the method outperforms state-of-the-art techniques while reducing FLOPs by 49%. Conclusion: The proposed method achieves superior performance on both local and global blur datasets while significantly reducing computational costs compared to state-of-the-art models. Abstract: Local motion blur in digital images originates from the relative motion between dynamic objects and static imaging systems during exposure. Existing deblurring methods face significant challenges in addressing this problem due to their inefficient allocation of computational resources and inadequate handling of spatially varying blur patterns. To overcome these limitations, we first propose a trainable mask predictor that identifies blurred regions in the image. During training, we employ blur masks to exclude sharp regions. For inference optimization, we implement structural reparameterization by converting $3\times 3$ convolutions to computationally efficient $1\times 1$ convolutions, enabling pixel-level pruning of sharp areas to reduce computation. Second, we develop an intra-frame motion analyzer that translates relative pixel displacements into motion trajectories, establishing adaptive guidance for region-specific blur restoration. Our method is trained end-to-end using a combination of reconstruction loss, reblur loss, and mask loss guided by annotated blur masks. 
Extensive experiments demonstrate superior performance over state-of-the-art methods on both local and global blur datasets while reducing FLOPs by 49% compared to SOTA models (e.g., LMD-ViT). The source code is available at https://github.com/shangwei5/M2AENet.
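To make the idea of pixel-level pruning concrete, the sketch below applies a $1\times 1$ (pointwise) convolution only to pixels that a blur mask marks as blurred and copies the remaining pixels unchanged. This is a simplified illustration rather than the authors' network: the threshold, the pass-through behavior for sharp pixels, and the name `pruned_pointwise_conv` are assumptions, and the reparameterization from $3\times 3$ to $1\times 1$ kernels is not reproduced here.

```python
def pruned_pointwise_conv(image, mask, weight, bias, threshold=0.5):
    """Apply a 1x1 convolution only where mask >= threshold.

    image:  H x W grid of C-channel pixels (lists of floats)
    mask:   H x W grid of blur probabilities in [0, 1]
    weight: C x C matrix (output channel rows), bias: length-C list
    Returns (output image, number of pixels actually computed).
    """
    out, computed = [], 0
    for img_row, mask_row in zip(image, mask):
        out_row = []
        for pixel, m in zip(img_row, mask_row):
            if m >= threshold:
                # Blurred pixel: per-pixel linear map across channels.
                out_row.append([
                    sum(w * x for w, x in zip(w_row, pixel)) + b
                    for w_row, b in zip(weight, bias)
                ])
                computed += 1
            else:
                # Sharp pixel: skipped, passed through unchanged.
                out_row.append(list(pixel))
        out.append(out_row)
    return out, computed
```

The returned count makes the saving visible: work scales with the number of blurred pixels rather than with the image size.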

[108] One Object, Multiple Lies: A Benchmark for Cross-task Adversarial Attack on Unified Vision-Language Models

Jiale Zhao,Xinyang Jiang,Junyao Gao,Yuhao Xue,Cairong Zhao

Main category: cs.CV

TL;DR: This paper introduces CrossVLAD and CRAFT, tools designed to evaluate and exploit vulnerabilities in unified vision-language models by enabling more effective cross-task adversarial attacks.

Details

Motivation: The motivation stems from the unique security challenges posed by adversarial inputs in unified vision-language models, which must remain effective across unpredictable task instructions applied to the same content. Method: The researchers developed CrossVLAD, a benchmark dataset for evaluating cross-task adversarial attacks on unified VLMs, and introduced CRAFT, a region-centric attack framework designed to manipulate object classifications across multiple tasks simultaneously. Result: The experiments showed that CRAFT outperformed existing methods in both overall cross-task attack performance and targeted object-change success rates, demonstrating its effectiveness in influencing unified VLMs across diverse tasks. Conclusion: The study concludes that CRAFT is an effective method for adversarially influencing unified vision-language models across various tasks, as demonstrated through extensive experiments on Florence-2 and other VLMs using the CrossVLAD benchmark. Abstract: Unified vision-language models (VLMs) have recently shown remarkable progress, enabling a single model to flexibly address diverse tasks through different instructions within a shared computational architecture. This instruction-based control mechanism creates unique security challenges, as adversarial inputs must remain effective across multiple task instructions that may be unpredictably applied to process the same malicious content. In this paper, we introduce CrossVLAD, a new benchmark dataset carefully curated from MSCOCO with GPT-4-assisted annotations for systematically evaluating cross-task adversarial attacks on unified VLMs. CrossVLAD centers on the object-change objective (consistently manipulating a target object's classification across four downstream tasks) and proposes a novel success rate metric that measures simultaneous misclassification across all tasks, providing a rigorous evaluation of adversarial transferability. 
To tackle this challenge, we present CRAFT (Cross-task Region-based Attack Framework with Token-alignment), an efficient region-centric attack method. Extensive experiments on Florence-2 and other popular unified VLMs demonstrate that our method outperforms existing approaches in both overall cross-task attack performance and targeted object-change success rates, highlighting its effectiveness in adversarially influencing unified VLMs across diverse tasks.
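The success rate metric described above counts an attack as successful only when the target object is misclassified in every task at once. A minimal sketch of such an all-tasks metric is given below; the data layout and the function name `cross_task_success_rate` are assumptions, and the task names used in any call are placeholders since the abstract does not list the four tasks.

```python
def cross_task_success_rate(results):
    """Fraction of adversarial samples that succeed on every task simultaneously.

    results: one dict per sample, mapping a task name to True if the attack
    changed the object's classification in that task.
    """
    if not results:
        return 0.0
    hits = sum(1 for per_task in results if per_task and all(per_task.values()))
    return hits / len(results)
```

Compared with averaging per-task success, this all-or-nothing count is stricter: a perturbation that fools three tasks but not the fourth scores zero for that sample.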

[109] Understanding Dataset Bias in Medical Imaging: A Case Study on Chest X-rays

Ethan Dack,Chengliang Dai

Main category: cs.CV

TL;DR: This paper investigates dataset bias in popular open-source chest X-ray datasets using transformed data and various neural network architectures, aiming to promote more explainable AI research in medical imaging.

Details

Motivation: The motivation stems from the need to understand if dataset bias exists in open-source chest X-ray datasets due to their sensitive nature and widespread usage in AI research for medical imaging. Method: The authors applied simple transformations to the datasets (NIH, CheXpert, MIMIC-CXR, and PadChest) and implemented a range of different network architectures to explore potential biases. Result: The results indicate the presence of dataset bias even in medical imaging datasets, highlighting concerns about whether modern methods are taking shortcuts or focusing on relevant pathology. Conclusion: The study concludes that there is dataset bias in popular open-source chest X-ray datasets, and the implementation of different network architectures aims to encourage more explainable research in medical imaging. Abstract: Recent work has revisited the infamous task "Name That Dataset" and established that non-medical datasets carry an underlying bias, achieving high accuracy on the dataset-origin task. In this work, we revisit the same task applied to popular open-source chest X-ray datasets. Medical images are naturally more difficult to release as open source due to their sensitive nature, which has led to certain open-source datasets being extremely popular for research purposes. By performing the same task, we wish to explore whether dataset bias also exists in these datasets. We apply simple transformations of the datasets to try to identify bias. Given the importance of AI applications in medical imaging, it's vital to establish whether modern methods are taking shortcuts or are focused on the relevant pathology. We implement a range of different network architectures on the datasets: NIH, CheXpert, MIMIC-CXR and PadChest. 
We hope this work will encourage more explainable research being performed in medical imaging and the creation of more open-source datasets in the medical domain. The corresponding code will be released upon acceptance.

[110] RAPS-3D: Efficient interactive segmentation for 3D radiological imaging

Théo Danielou,Daniel Tordjman,Pierre Manceron,Corentin Dancette

Main category: cs.CV

TL;DR: This paper introduces a simplified 3D promptable segmentation method inspired by SegVol that improves efficiency, reduces inference time, and eliminates complexities in handling 3D volumetric medical imaging data.

Details

Motivation: SAM's architecture is limited to 2D images and does not naturally extend to 3D volumetric data like CT or MRI scans. Existing adaptations to 3D involve complex and computationally expensive methods, prompting the need for a more efficient solution. Method: Inspired by SegVol, the authors propose a simplified 3D promptable segmentation method that avoids autoregressive strategies and sliding-window inference, aiming to reduce computational complexity and improve efficiency. Result: The proposed method achieves state-of-the-art performance while significantly reducing inference time and simplifying prompt management compared to existing 3D segmentation approaches. Conclusion: The paper concludes that their simplified 3D promptable segmentation method outperforms existing approaches by reducing inference time and eliminating complexities, while achieving state-of-the-art performance. Abstract: Promptable segmentation, introduced by the Segment Anything Model (SAM), is a promising approach for medical imaging, as it enables clinicians to guide and refine model predictions interactively. However, SAM's architecture is designed for 2D images and does not extend naturally to 3D volumetric data such as CT or MRI scans. Adapting 2D models to 3D typically involves autoregressive strategies, where predictions are propagated slice by slice, resulting in increased inference complexity. Processing large 3D volumes also requires significant computational resources, often leading existing 3D methods to also adopt complex strategies like sliding-window inference to manage memory usage, at the cost of longer inference times and greater implementation complexity. In this paper, we present a simplified 3D promptable segmentation method, inspired by SegVol, designed to reduce inference time and eliminate prompt management complexities associated with sliding windows while achieving state-of-the-art performance.

[111] Energy-Guided Decoding for Object Hallucination Mitigation

Xixi Liu,Ailin Deng,Christopher Zach

Main category: cs.CV

TL;DR: This paper introduces an energy-based decoding approach that effectively reduces object hallucinations in vision-language models, improving accuracy and fairness across multiple benchmarks.

Details

Motivation: Object hallucination in LVLMs poses safety concerns, and existing approaches are limited due to dependency on specific decoding methods, complex visual input modifications, or reliance on external models. Method: An energy-based decoding method was developed to dynamically select hidden states with minimal energy scores, aiming to reduce bias and improve model performance. Result: The method achieved an average accuracy improvement of 4.82% compared to greedy decoding and reduced the yes-ratio gap by 8.81%, showing consistent performance gains across three VQA datasets and three VLMs. Conclusion: The proposed energy-based decoding method effectively mitigates object hallucination in LVLMs by reducing bias in the yes ratio while enhancing performance across multiple benchmarks. Abstract: Mitigating object hallucination in large vision-language models (LVLMs) is critical to their safe deployment. Existing methods either are restricted to specific decoding methods, or demand sophisticated modifications to visual inputs, or rely on knowledge from external models. In this work, we first reveal the phenomenon that VLMs exhibit significant imbalance in the "Yes" ratio (i.e., the fraction of "Yes" answers among the total number of questions) across three different visual question answering (VQA) datasets. Furthermore, we propose an energy-based decoding method, which dynamically selects the hidden states from the layer with minimal energy score. It is simple yet effective in reducing the bias for the yes ratio while boosting performance across three benchmarks (POPE, MME, and MMVP). Our method consistently improves accuracy and F1 score on three VQA datasets across three commonly used VLMs over several baseline methods. The average accuracy improvement is 4.82% compared to greedy decoding. Moreover, the average yes-ratio gap reduction is 8.81%, meaning the proposed method is less biased as shown in Figure 1.
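The layer-selection step can be sketched as follows, assuming the common free-energy definition E = -log sum_i exp(z_i) over a layer's vocabulary logits z. The abstract does not spell out the energy function or how per-layer logits are obtained, so that definition, the input format, and the names `energy` and `select_min_energy_layer` are assumptions for illustration.

```python
import math


def energy(logits):
    """Free energy of a logit vector: the negative log-sum-exp."""
    top = max(logits)
    # Shift by the maximum so the exponentials cannot overflow.
    return -(top + math.log(sum(math.exp(z - top) for z in logits)))


def select_min_energy_layer(per_layer_logits):
    """Return (index of the layer with the lowest energy, list of all energies).

    per_layer_logits: one logit vector per candidate layer, e.g. obtained by
    projecting each layer's hidden state through the output head (assumed).
    """
    energies = [energy(logits) for logits in per_layer_logits]
    best = min(range(len(energies)), key=energies.__getitem__)
    return best, energies
```

Under this definition a lower energy corresponds to larger, more confident logits, so the rule picks the layer whose prediction is most decisive for the current token.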

[112] EEvAct: Early Event-Based Action Recognition with High-Rate Two-Stream Spiking Neural Networks

Michael Neumeier,Jules Lecomte,Nils Kazinski,Soubarna Banik,Bing Li,Axel von Arnim

Main category: cs.CV

TL;DR: This paper introduces a high-rate two-stream spiking neural network (SNN) for early recognition of human activities, achieving better accuracy than previous methods on the THU EACT-50 dataset.

Details

Motivation: Event-based vision sensors have high temporal resolution and low latency, making them ideal for early human activity recognition. However, existing methods limit early prediction capabilities by accumulating events into low-rate frames or space-time voxels. Method: The paper proposes a high-rate two-stream spiking neural network (SNN) approach and benchmarks it using a novel early event-based recognition framework, reporting Top-1 and Top-5 recognition scores. Result: The proposed method achieves a 2% improvement in final accuracy compared to previous approaches on the large-scale THU EACT-50 dataset. Conclusion: The paper concludes that the introduced high-rate two-stream SNN outperforms previous works in final accuracy for early recognition of human activities on the THU EACT-50 dataset. Abstract: Recognizing human activities early is crucial for the safety and responsiveness of human-robot and human-machine interfaces. Due to their high temporal resolution and low latency, event-based vision sensors are a perfect match for this early recognition demand. However, most existing processing approaches accumulate events to low-rate frames or space-time voxels which limits the early prediction capabilities. In contrast, spiking neural networks (SNNs) can process the events at a high-rate for early predictions, but most works still fall short on final accuracy. In this work, we introduce a high-rate two-stream SNN which closes this gap by outperforming previous work by 2% in final accuracy on the large-scale THU EACT-50 dataset. We benchmark the SNNs within a novel early event-based recognition framework by reporting Top-1 and Top-5 recognition scores for growing observation time. Finally, we exemplify the impact of these methods on a real-world task of early action triggering for human motion capture in sports.
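Reporting Top-1 and Top-5 scores for growing observation time amounts to re-evaluating the classifier's outputs at several points in the event stream. The sketch below shows that bookkeeping; the data layout (a mapping from observation time to per-sample class scores) and the function names are assumptions made for this example.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    if not labels:
        return 0.0
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


def accuracy_over_time(scores_by_time, labels, k=1):
    """Top-k accuracy at each observation time.

    scores_by_time: dict mapping elapsed time to a list of per-sample score rows.
    """
    return {t: top_k_accuracy(s, labels, k) for t, s in scores_by_time.items()}
```

Plotting the resulting values against observation time gives an early-recognition curve, where a method that reaches high accuracy sooner is preferable even at equal final accuracy.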

[113] Sparse-Dense Side-Tuner for efficient Video Temporal Grounding

David Pujol-Perich,Sergio Escalera,Albert Clapés

Main category: cs.CV

TL;DR: This paper introduces SDST, a novel anchor-free side-tuning architecture for video temporal grounding that achieves superior performance with fewer parameters.

Details

Motivation: Existing VTG methods rely on frozen backbones, limiting adaptability. Current side-tuning approaches overlook the sparse nature of Moment Retrieval. Method: Sparse-Dense Side-Tuner (SDST), Reference-based Deformable Self-Attention, and integration of InternVideo2 backbone into an ST framework. Result: Highly competitive or SOTA results on QVHighlights, TACoS, and Charades-STA; up to 73% reduction in parameter count compared to existing SOTA methods. Conclusion: The proposed SDST method significantly improves existing ST methods in Video Temporal Grounding, achieving SOTA results while reducing parameter count. Abstract: Video Temporal Grounding (VTG) involves Moment Retrieval (MR) and Highlight Detection (HD) based on textual queries. For this, most methods rely solely on final-layer features of frozen large pre-trained backbones, limiting their adaptability to new domains. While full fine-tuning is often impractical, parameter-efficient fine-tuning -- and particularly side-tuning (ST) -- has emerged as an effective alternative. However, prior ST work approaches this problem from a frame-level refinement perspective, overlooking the inherent sparse nature of MR. To address this, we propose the Sparse-Dense Side-Tuner (SDST), the first anchor-free ST architecture for VTG. We also introduce the Reference-based Deformable Self-Attention, a novel mechanism that enhances the context modeling of the deformable attention -- a key limitation of existing anchor-free methods. Additionally, we present the first effective integration of InternVideo2 backbone into an ST framework, showing its profound implications in performance. Overall, our method significantly improves existing ST methods, achieving highly competitive or SOTA results on QVHighlights, TACoS, and Charades-STA, while reducing the parameter count by up to 73% w.r.t. the existing SOTA methods. The code is publicly accessible at https://github.com/davidpujol/SDST.

Charlie Budd,Silvère Ségaud,Matthew Elliot,Graeme Stasiuk,Yijing Xie,Jonathan Shapey,Tom Vercauteren

Main category: cs.CV

TL;DR: This paper proposes X-RAFT, a modified RAFT optical flow model designed for cross-modal inputs, which significantly improves performance over existing methods.

Details

Motivation: The integration of hyperspectral imaging into fluorescence-guided neurosurgery requires finding dense cross-modal image correspondences between two hyperspectral images taken under different lighting conditions. Method: X-RAFT uses distinct image encoders for each modality pair and is fine-tuned in a self-supervised manner using flow-cycle-consistency on neurosurgical hyperspectral data. Result: X-RAFT achieves a 36.6% error reduction compared to a naive baseline and 27.83% reduction compared to an existing cross-modal optical flow method (CrossRAFT). Conclusion: X-RAFT, a modified RAFT optical flow model for cross-modal inputs, shows significant improvement in reducing error compared to existing methods. Abstract: Integration of hyperspectral imaging into fluorescence-guided neurosurgery has the potential to improve surgical decision making by providing quantitative fluorescence measurements in real-time. Quantitative fluorescence requires paired spectral data in fluorescence (blue light) and reflectance (white light) mode. Blue and white image acquisition needs to be performed sequentially in a potentially dynamic surgical environment. A key component to the fluorescence quantification process is therefore the ability to find dense cross-modal image correspondences between two hyperspectral images taken under these drastically different lighting conditions. We address this challenge with the introduction of X-RAFT, a Recurrent All-Pairs Field Transforms (RAFT) optical flow model modified for cross-modal inputs. We propose using distinct image encoders for each modality pair, and fine-tune these in a self-supervised manner using flow-cycle-consistency on our neurosurgical hyperspectral data. We show an error reduction of 36.6% across our evaluation metrics when comparing to a naive baseline and 27.83% reduction compared to an existing cross-modal optical flow method (CrossRAFT). 
Our code and models will be made publicly available after the review process.
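Flow-cycle-consistency rests on a simple check: following the forward flow from image A to image B and then the backward flow from B to A should return to the starting pixel. The sketch below measures the mean residual of that round trip. It is an illustration rather than the authors' loss: nearest-neighbor sampling, skipping of out-of-bounds targets, the `(dx, dy)` flow layout, and the name `cycle_consistency_error` are all assumptions.

```python
def cycle_consistency_error(flow_ab, flow_ba):
    """Mean round-trip residual between a forward and a backward flow field.

    flow_ab, flow_ba: H x W grids of (dx, dy) displacements in pixels.
    """
    h, w = len(flow_ab), len(flow_ab[0])
    total, count = 0.0, 0
    for y in range(h):
        for x in range(w):
            dx, dy = flow_ab[y][x]
            # Where the forward flow sends this pixel (nearest-neighbor lookup).
            tx, ty = round(x + dx), round(y + dy)
            if 0 <= tx < w and 0 <= ty < h:
                bx, by = flow_ba[ty][tx]
                # A consistent pair satisfies forward + backward = 0.
                total += ((dx + bx) ** 2 + (dy + by) ** 2) ** 0.5
                count += 1
    return total / count if count else 0.0
```

Since the residual needs no ground-truth correspondences, a quantity of this kind can serve as a self-supervised training signal when labeled flow is unavailable, as is the case for the surgical data described above.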

[115] Deep Learning based 3D Volume Correlation for Additive Manufacturing Using High-Resolution Industrial X-ray Computed Tomography

Keerthana Chand,Tobias Fritsch,Bardia Hejazi,Konstantin Poka,Giovanni Bruno

Main category: cs.CV

TL;DR: This paper proposes a deep learning-based Digital Volume Correlation (DVC) method to improve the registration accuracy between computer-aided design (CAD) and X-ray Computed Tomography (XCT) volumes in additive manufacturing, resulting in faster and more accurate deformation estimation.

Details

Motivation: The motivation is to address the challenges of accurate registration between CAD and XCT volumes in additive manufacturing due to the lack of ground truth deformation fields and computational difficulties caused by high-resolution data. This is critical for improving reliability and efficiency in industrial applications. Method: The authors introduced a deep learning-based approach using a dynamic patch-based processing strategy to estimate voxel-wise deformations between CAD and XCT volumes. They also introduced a Binary Difference Map (BDM) as an evaluation metric alongside the Dice Score. Result: The proposed method achieved a 9.2% improvement in the Dice Score and a 9.9% improvement in the voxel match rate compared to classic DVC methods, while reducing computation time from days to minutes. Conclusion: This paper concludes that the proposed deep learning-based DVC method significantly improves registration accuracy and efficiency between CAD and XCT volumes, laying the foundation for generating compensation meshes in closed-loop correlations during additive manufacturing. Abstract: Quality control in additive manufacturing (AM) is vital for industrial applications in areas such as the automotive, medical and aerospace sectors. Geometric inaccuracies caused by shrinkage and deformations can compromise the life and performance of additively manufactured components. Such deviations can be quantified using Digital Volume Correlation (DVC), which compares the computer-aided design (CAD) model with the X-ray Computed Tomography (XCT) geometry of the components produced. However, accurate registration between the two modalities is challenging due to the absence of a ground truth or reference deformation field. In addition, the extremely large data size of high-resolution XCT volumes makes computation difficult. In this work, we present a deep learning-based approach for estimating voxel-wise deformations between CAD and XCT volumes. 
Our method uses a dynamic patch-based processing strategy to handle high-resolution volumes. In addition to the Dice Score, we introduce a Binary Difference Map (BDM) that quantifies voxel-wise mismatches between binarized CAD and XCT volumes to evaluate the accuracy of the registration. Our approach shows a 9.2% improvement in the Dice Score and a 9.9% improvement in the voxel match rate compared to classic DVC methods, while reducing the interaction time from days to minutes. This work sets the foundation for deep learning-based DVC methods to generate compensation meshes that can then be used in closed-loop correlations during the AM production process. Such a system would be of great interest to industries since the manufacturing process will become more reliable and efficient, saving time and material.
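
The two evaluation quantities named above can be sketched as follows. The Dice score is the standard overlap measure. The Binary Difference Map is assumed here to be a voxel-wise XOR of the two binarized volumes, and the voxel match rate is assumed to be the fraction of voxels where that map is zero; the abstract does not give exact definitions, so the function names and these definitions are illustrative only. Volumes are passed as flat sequences of 0/1 values.

```python
def dice_score(a, b):
    """Dice coefficient between two binary volumes given as flat 0/1 sequences."""
    if len(a) != len(b):
        raise ValueError("volumes must have the same number of voxels")
    intersection = sum(1 for x, y in zip(a, b) if x and y)
    total = sum(1 for x in a if x) + sum(1 for y in b if y)
    # Two empty volumes are treated as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def binary_difference_map(a, b):
    """1 where the binarized volumes disagree, 0 where they agree (assumed XOR)."""
    if len(a) != len(b):
        raise ValueError("volumes must have the same number of voxels")
    return [int(bool(x) != bool(y)) for x, y in zip(a, b)]


def voxel_match_rate(a, b):
    """Fraction of voxels on which the two binarized volumes agree (assumed)."""
    bdm = binary_difference_map(a, b)
    if not bdm:
        raise ValueError("volumes must not be empty")
    return 1.0 - sum(bdm) / len(bdm)


# Toy example: 8 voxels, two of which disagree.
cad = [1, 1, 1, 0, 0, 0, 1, 0]
xct = [1, 1, 0, 0, 0, 1, 1, 0]

print(dice_score(cad, xct))             # 0.75
print(binary_difference_map(cad, xct))  # [0, 0, 1, 0, 0, 1, 0, 0]
print(voxel_match_rate(cad, xct))       # 0.75
```

For real XCT volumes the same logic would be applied patch by patch on arrays rather than Python lists.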

[116] SCOOTER: A Human Evaluation Framework for Unrestricted Adversarial Examples

Dren Fazlija,Monty-Maximilian Zühlke,Johanna Schrader,Arkadij Orlov,Clara Stein,Iyiola E. Olatunji,Daniel Kudenko

Main category: cs.CV

TL;DR: SCOOTER is a new open-source framework designed to evaluate unrestricted adversarial examples using human assessments, revealing that current attack methods fail to create visually imperceptible changes and demonstrating the need for better alignment between machine models and human perception.

Details

Motivation: Unrestricted adversarial attacks evade traditional norm-bounded defenses by not adhering to human-perceptibility constraints, making it essential to evaluate their visual authenticity through human studies. Existing work lacks statistically significant insights, necessitating a standardized framework like SCOOTER. Method: The authors introduced SCOOTER, an open-source framework incorporating best-practice guidelines for crowd-study power, compensation, and Likert equivalence bounds. They conducted a large-scale human study involving 346 participants and compared human perception with model predictions, including using GPT-4o as a preliminary detection tool. Additionally, they released software tools and a benchmark dataset for further research. Result: The study found that six different unrestricted adversarial attacks (three color-space and three diffusion-based) failed to generate truly imperceptible images when evaluated by humans. GPT-4o could detect adversarial examples in only four out of six cases. The SCOOTER framework successfully supported this analysis and includes tools, datasets, and a browser-based interface for future evaluations. Conclusion: SCOOTER provides a unified framework for evaluating unrestricted adversarial attacks, emphasizing the importance of human evaluation to determine imperceptibility and highlighting the misalignment between automated systems and human perception. Abstract: Unrestricted adversarial attacks aim to fool computer vision models without being constrained by $\ell_p$-norm bounds to remain imperceptible to humans, for example, by changing an object's color. This allows attackers to circumvent traditional, norm-bounded defense strategies such as adversarial training or certified defense strategies. However, due to their unrestricted nature, there are also no guarantees of norm-based imperceptibility, necessitating human evaluations to verify just how authentic these adversarial examples look. 
While some related work assesses this vital quality of adversarial attacks, none provide statistically significant insights. This issue necessitates a unified framework that supports and streamlines such an assessment for evaluating and comparing unrestricted attacks. To close this gap, we introduce SCOOTER - an open-source, statistically powered framework for evaluating unrestricted adversarial examples. Our contributions are: $(i)$ best-practice guidelines for crowd-study power, compensation, and Likert equivalence bounds to measure imperceptibility; $(ii)$ the first large-scale human vs. model comparison across 346 human participants showing that three color-space attacks and three diffusion-based attacks fail to produce imperceptible images. Furthermore, we found that GPT-4o can serve as a preliminary test for imperceptibility, but it only consistently detects adversarial examples for four out of six tested attacks; $(iii)$ open-source software tools, including a browser-based task template to collect annotations and analysis scripts in Python and R; $(iv)$ an ImageNet-derived benchmark dataset containing 3K real images, 7K adversarial examples, and over 34K human ratings. Our findings demonstrate that automated vision systems do not align with human perception, reinforcing the need for a ground-truth SCOOTER benchmark.

[117] Where are we with calibration under dataset shift in image classification?

Mélanie Roschewitz,Raghav Mehta,Fabio de Sousa Ribeiro,Ben Glocker

Main category: cs.CV

TL;DR: This paper examines calibration techniques under real-world dataset shifts, showing that combining label smoothing with entropy regularization, using post-hoc calibrators with OOD data, and ensembling offer the best results, though improving OOD calibration can hurt ID performance.

Details

Motivation: Calibration under dataset shift is crucial for reliable model deployment in real-world scenarios. This work aims to provide practical guidelines by evaluating how different calibration approaches perform when faced with natural distribution shifts. Method: The authors conducted an extensive evaluation across eight classification tasks and multiple imaging domains. They compared various post-hoc calibration methods and their interactions with in-training strategies like label smoothing, while also analyzing the effects of ensembling and fine-tuning from foundation models. Result: Key findings include: (1) Entropy regularization combined with label smoothing yields the best raw probability calibration. (2) Post-hoc calibrators trained on semantic OOD data are most robust. (3) Advanced calibration methods tailored for shifts do not consistently outperform simpler ones. (4) Out-of-distribution calibration improvements often degrade in-distribution calibration. (5) Ensembling improves calibration robustness, especially when applied before calibration. Conclusion: The study concludes that combining entropy regularization and label smoothing, using post-hoc calibrators with OOD data exposure, and employing ensembling techniques significantly enhance calibration robustness under dataset shift. However, improvements in out-of-distribution calibration often come at the cost of in-distribution performance. Abstract: We conduct an extensive study on the state of calibration under real-world dataset shift for image classification. Our work provides important insights on the choice of post-hoc and in-training calibration techniques, and yields practical guidelines for all practitioners interested in robust calibration under shift. 
We compare various post-hoc calibration methods, and their interactions with common in-training calibration strategies (e.g., label smoothing), across a wide range of natural shifts, on eight different classification tasks across several imaging domains. We find that: (i) simultaneously applying entropy regularisation and label smoothing yields the best calibrated raw probabilities under dataset shift, (ii) post-hoc calibrators exposed to a small amount of semantic out-of-distribution data (unrelated to the task) are most robust under shift, (iii) recent calibration methods specifically aimed at increasing calibration under shifts do not necessarily offer significant improvements over simpler post-hoc calibration methods, (iv) improving calibration under shifts often comes at the cost of worsening in-distribution calibration. Importantly, these findings hold for randomly initialised classifiers, as well as for those finetuned from foundation models, the latter being consistently better calibrated compared to models trained from scratch. Finally, we conduct an in-depth analysis of ensembling effects, finding that (i) applying calibration prior to ensembling (instead of after) is more effective for calibration under shifts, (ii) for ensembles, OOD exposure deteriorates the ID-shifted calibration trade-off, (iii) ensembling remains one of the most effective methods to improve calibration robustness and, combined with finetuning from foundation models, yields best calibration results overall.
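
To make "calibration" concrete, the sketch below computes the Expected Calibration Error (ECE), a widely used summary of the gap between a classifier's confidence and its accuracy. The abstract does not state which calibration metric the authors report, so the choice of ECE, the equal-width binning, and the function name are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins.

    confidences: predicted probability of the chosen class, each in [0, 1].
    correct: booleans, True where the prediction was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("inputs must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("inputs must not be empty")

    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        # Weight each bin's confidence-accuracy gap by its share of samples.
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


confidences = [0.95, 0.95, 0.55, 0.55]
correct = [True, True, True, False]
print(expected_calibration_error(confidences, correct))  # approximately 0.05
```

Evaluating the same function on in-distribution and shifted test sets would expose the trade-off described in finding (iv).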

[118] SURPRISE3D: A Dataset for Spatial Understanding and Reasoning in Complex 3D Scenes

Jiaxin Huang,Ziwen Li,Hanlve Zhang,Runnan Chen,Xiao He,Yandong Guo,Wenping Wang,Tongliang Liu,Mingming Gong

Main category: cs.CV

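TL;DR: This paper introduces S\textsc{urprise}3D, a new 3D vision-language dataset, and the 3D-SRS benchmark suite for evaluating language-guided spatial reasoning segmentation in complex 3D scenes, with the aim of advancing spatially aware AI.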

Details

Motivation: Spatial reasoning, a key capability, remains underexplored in current 3D vision-language research; existing datasets mix semantic cues with spatial context, leading models to rely on superficial shortcuts rather than genuinely understanding spatial relationships. Method: The authors introduce a new dataset, S\textsc{urprise}3D, containing more than 200k vision-language pairs, and design 89k+ human-annotated spatial queries to reduce shortcut bias from object names. Result: Initial benchmarks show that current state-of-the-art 3D visual grounding methods and 3D-LLMs face significant challenges on this dataset, underscoring the importance of the dataset and the 3D-SRS benchmark suite. Conclusion: S\textsc{urprise}3D and the 3D-SRS benchmark suite aim to advance spatially aware AI, paving the way for effective embodied interaction and robotic planning. Abstract: The integration of language and 3D perception is critical for embodied AI and robotic systems to perceive, understand, and interact with the physical world. Spatial reasoning, a key capability for understanding spatial relationships between objects, remains underexplored in current 3D vision-language research. Existing datasets often mix semantic cues (e.g., object name) with spatial context, leading models to rely on superficial shortcuts rather than genuinely interpreting spatial relationships. To address this gap, we introduce S\textsc{urprise}3D, a novel dataset designed to evaluate language-guided spatial reasoning segmentation in complex 3D scenes. S\textsc{urprise}3D consists of more than 200k vision language pairs across 900+ detailed indoor scenes from ScanNet++ v2, including more than 2.8k unique object classes. The dataset contains 89k+ human-annotated spatial queries deliberately crafted without object name, thereby mitigating shortcut biases in spatial understanding. These queries comprehensively cover various spatial reasoning skills, such as relative position, narrative perspective, parametric perspective, and absolute distance reasoning. Initial benchmarks demonstrate significant challenges for current state-of-the-art expert 3D visual grounding methods and 3D-LLMs, underscoring the necessity of our dataset and the accompanying 3D Spatial Reasoning Segmentation (3D-SRS) benchmark suite. S\textsc{urprise}3D and 3D-SRS aim to facilitate advancements in spatially aware AI, paving the way for effective embodied interaction and robotic planning. The code and datasets can be found in https://github.com/liziwennba/SUPRISE.

[119] Robust and Generalizable Heart Rate Estimation via Deep Learning for Remote Photoplethysmography in Complex Scenarios

Kang Cen,Chang-Hong Fu,Hong Hong

Main category: cs.CV

TL;DR: This paper proposes an improved end-to-end network for remote photoplethysmography (rPPG) that accurately extracts heart rate information from facial videos, particularly excelling in complex scenarios.

Details

Motivation: Existing rPPG network models face challenges in accuracy, robustness, and generalization under complex scenarios. This work aims to address these limitations by designing a more effective network architecture. Method: The method uses an end-to-end network with 3D convolutional neural networks, a differential frame fusion module, Temporal Shift Module (TSM) with self-attention mechanisms, and a dynamic hybrid loss function to extract accurate rPPG signals from raw facial videos. Result: After training on the PURE dataset, the model achieved a mean absolute error (MAE) of 7.58 on the MMPD test set, outperforming state-of-the-art models. Conclusion: The proposed end-to-end rPPG extraction network demonstrates superior robustness and generalization capability, achieving better performance than state-of-the-art models on challenging datasets. Abstract: Non-contact remote photoplethysmography (rPPG) technology enables heart rate measurement from facial videos. However, existing network models still face challenges in accuracy, robustness, and generalization capability under complex scenarios. This paper proposes an end-to-end rPPG extraction network that employs 3D convolutional neural networks to reconstruct accurate rPPG signals from raw facial videos. We introduce a differential frame fusion module that integrates differential frames with original frames, enabling frame-level representations to capture blood volume pulse (BVP) variations. Additionally, we incorporate Temporal Shift Module (TSM) with self-attention mechanisms, which effectively enhance rPPG features with minimal computational overhead. Furthermore, we propose a novel dynamic hybrid loss function that provides stronger supervision for the network, effectively mitigating overfitting. 
Comprehensive experiments were conducted on not only the PURE and UBFC-rPPG datasets but also the challenging MMPD dataset under complex scenarios, involving both intra-dataset and cross-dataset evaluations, which demonstrate the superior robustness and generalization capability of our network. Specifically, after training on PURE, our model achieved a mean absolute error (MAE) of 7.58 on the MMPD test set, outperforming the state-of-the-art models.

[120] Visual Instance-aware Prompt Tuning

Xi Xiao,Yunbei Zhang,Xingjian Li,Tianyang Wang,Xiao Wang,Yuxiang Wei,Jihun Hamm,Min Xu

Main category: cs.CV

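TL;DR: This paper proposes ViaPT, a new visual prompt tuning method that generates instance-aware prompts from each individual input and fuses them with dataset-level prompts, addressing the sub-optimal performance of conventional methods on downstream datasets.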

Details

Motivation: Conventional parameter-efficient fine-tuning methods for vision transformers use dataset-level prompts that stay the same across all input instances, leading to sub-optimal performance on downstream datasets. Method: The authors propose Visual Instance-aware Prompt Tuning (ViaPT), which uses Principal Component Analysis (PCA) to retain important prompting information, generates instance-aware prompts from each individual input, and fuses them with dataset-level prompts. Result: Extensive experiments across 34 diverse datasets show that the method consistently outperforms state-of-the-art baselines. Conclusion: ViaPT overcomes the inability of existing methods to capture instance-specific information effectively; by balancing dataset-level and instance-level knowledge, it improves performance while reducing the number of learnable parameters. Abstract: Visual Prompt Tuning (VPT) has emerged as a parameter-efficient fine-tuning paradigm for vision transformers, with conventional approaches utilizing dataset-level prompts that remain the same across all input instances. We observe that this strategy results in sub-optimal performance due to high variance in downstream datasets. To address this challenge, we propose Visual Instance-aware Prompt Tuning (ViaPT), which generates instance-aware prompts based on each individual input and fuses them with dataset-level prompts, leveraging Principal Component Analysis (PCA) to retain important prompting information. Moreover, we reveal that VPT-Deep and VPT-Shallow represent two corner cases based on a conceptual understanding, in which they fail to effectively capture instance-specific information, while random dimension reduction on prompts only yields performance between the two extremes. Instead, ViaPT overcomes these limitations by balancing dataset-level and instance-level knowledge, while reducing the amount of learnable parameters compared to VPT-Deep. Extensive experiments across 34 diverse datasets demonstrate that our method consistently outperforms state-of-the-art baselines, establishing a new paradigm for analyzing and optimizing visual prompts for vision transformers.

[121] Synergistic Prompting for Robust Visual Recognition with Missing Modalities

Zhihui Zhang,Luanyuan Dai,Qika Lin,Yunfeng Diao,Guangyin Jin,Yufei Guo,Jing Zhang,Xiaoshuai Hao

Main category: cs.CV

TL;DR: This paper proposes the Synergistic Prompting (SyP) framework to enhance robustness in visual recognition tasks with missing modalities, combining dynamic and static prompts for improved adaptability and performance.

Details

Motivation: The performance of large-scale multi-modal models degrades significantly in real-world applications due to missing or incomplete modality inputs. Existing prompt-based strategies suffer from static prompts' lack of flexibility and poor performance when critical modalities are missing. Method: The SyP framework introduces two key innovations: (1) a Dynamic Adapter that computes adaptive scaling factors to generate dynamic prompts and (2) a Synergistic Prompting Strategy that combines static and dynamic prompts to balance information across modalities. Result: The proposed SyP framework achieves significant performance improvements over existing approaches across three widely-used visual recognition datasets, demonstrating robustness under diverse missing rates and conditions. Conclusion: The proposed Synergistic Prompting (SyP) framework effectively addresses the issue of robust visual recognition with missing modalities, demonstrating significant performance improvements, adaptability, and reliability under diverse missing conditions. Abstract: Large-scale multi-modal models have demonstrated remarkable performance across various visual recognition tasks by leveraging extensive paired multi-modal training data. However, in real-world applications, the presence of missing or incomplete modality inputs often leads to significant performance degradation. Recent research has focused on prompt-based strategies to tackle this issue; however, existing methods are hindered by two major limitations: (1) static prompts lack the flexibility to adapt to varying missing-data conditions, and (2) basic prompt-tuning methods struggle to ensure reliable performance when critical modalities are missing. To address these challenges, we propose a novel Synergistic Prompting (SyP) framework for robust visual recognition with missing modalities. 
The proposed SyP introduces two key innovations: (I) a Dynamic Adapter, which computes adaptive scaling factors to dynamically generate prompts, replacing static parameters for flexible multi-modal adaptation, and (II) a Synergistic Prompting Strategy, which combines static and dynamic prompts to balance information across modalities, ensuring robust reasoning even when key modalities are missing. The proposed SyP achieves significant performance improvements over existing approaches across three widely-used visual recognition datasets, demonstrating robustness under diverse missing rates and conditions. Extensive experiments and ablation studies validate its effectiveness in handling missing modalities, highlighting its superior adaptability and reliability.

[122] Patient-specific vs Multi-Patient Vision Transformer for Markerless Tumor Motion Forecasting

Gauthier Rotsart de Hertaing,Dani Manjah,Benoit Macq

Main category: cs.CV

TL;DR: This paper presents a markerless forecasting approach for lung tumor motion using Vision Transformers (ViT), showing that while patient-specific models achieve higher precision, multi-patient models offer robust out-of-the-box performance suitable for time-constrained clinical settings.

Details

Motivation: Accurate forecasting of lung tumor motion is essential for precise dose delivery in proton therapy. While current markerless methods mostly rely on deep learning, transformer-based architectures remain unexplored in this domain, despite their proven performance in trajectory forecasting. Method: Digitally reconstructed radiographs (DRRs) derived from planning 4DCT scans of 31 patients were used to train the MP model; a 32nd patient was held out for evaluation. PS models were trained using only the target patient's planning data. Both models used 16 DRRs per input and predicted tumor motion over a 1-second horizon. Performance was assessed using Average Displacement Error (ADE) and Final Displacement Error (FDE), on both planning (T1) and treatment (T2) data. Result: On T1 data, PS models outperformed MP models across all training set sizes, especially with larger datasets (up to 25,000 DRRs, p < 0.05). However, MP models demonstrated stronger robustness to inter-fractional anatomical variability and achieved comparable performance on T2 data without retraining. Conclusion: This is the first study to apply ViT architectures to markerless tumor motion forecasting. While PS models achieve higher precision, MP models offer robust out-of-the-box performance, well-suited for time-constrained clinical settings. Abstract: Background: Accurate forecasting of lung tumor motion is essential for precise dose delivery in proton therapy. While current markerless methods mostly rely on deep learning, transformer-based architectures remain unexplored in this domain, despite their proven performance in trajectory forecasting. Purpose: This work introduces a markerless forecasting approach for lung tumor motion using Vision Transformers (ViT). Two training strategies are evaluated under clinically realistic constraints: a patient-specific (PS) approach that learns individualized motion patterns, and a multi-patient (MP) model designed for generalization. 
The comparison explicitly accounts for the limited number of images that can be generated between planning and treatment sessions. Methods: Digitally reconstructed radiographs (DRRs) derived from planning 4DCT scans of 31 patients were used to train the MP model; a 32nd patient was held out for evaluation. PS models were trained using only the target patient's planning data. Both models used 16 DRRs per input and predicted tumor motion over a 1-second horizon. Performance was assessed using Average Displacement Error (ADE) and Final Displacement Error (FDE), on both planning (T1) and treatment (T2) data. Results: On T1 data, PS models outperformed MP models across all training set sizes, especially with larger datasets (up to 25,000 DRRs, p < 0.05). However, MP models demonstrated stronger robustness to inter-fractional anatomical variability and achieved comparable performance on T2 data without retraining. Conclusions: This is the first study to apply ViT architectures to markerless tumor motion forecasting. While PS models achieve higher precision, MP models offer robust out-of-the-box performance, well-suited for time-constrained clinical settings.
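
The two trajectory metrics used above have standard definitions in the forecasting literature: ADE averages the Euclidean distance between predicted and true positions over every step of the horizon, and FDE is that distance at the final step only. The sketch below shows both; the function name and the toy 2D trajectories are assumptions, and the units are whatever the coordinates are expressed in.

```python
import math


def displacement_errors(predicted, target):
    """Return (ADE, FDE) for two trajectories given as equal-length point lists."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("trajectories must be non-empty and of equal length")
    distances = [math.dist(p, t) for p, t in zip(predicted, target)]
    ade = sum(distances) / len(distances)  # mean error over the whole horizon
    fde = distances[-1]                    # error at the last predicted step
    return ade, fde


# Toy forecast that drifts away from the true tumor position over three steps.
predicted = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
target = [(0.0, 0.0), (3.0, 0.0), (6.0, 0.0)]

ade, fde = displacement_errors(predicted, target)
print(ade, fde)  # 4.0 8.0
```

A forecast whose error grows over time, as in this toy case, shows an FDE well above its ADE, which is why the two are reported together.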

[123] Benchmarking Content-Based Puzzle Solvers on Corrupted Jigsaw Puzzles

Richard Dirauf,Florian Wolz,Dario Zanca,Björn Eskofier

Main category: cs.CV

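TL;DR: This paper studies the robustness of state-of-the-art content-based puzzle solvers under real-world challenges such as missing pieces and erosion, and finds that deep learning models, especially the Positional Diffusion model, perform well after appropriate fine-tuning, pointing to new research directions for the automated reconstruction of real-world artefacts.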

Details

Motivation: Existing content-based puzzle solvers have not been evaluated against challenges that are critical in real-world applications, such as reassembling fragmented artefacts or shredded documents. Method: The authors introduce three types of puzzle corruption (missing pieces, eroded edges, and eroded contents) and evaluate heuristic and deep learning-based solvers under them. Result: Solvers designed for standard puzzles decline rapidly in performance as more pieces are corrupted; deep learning models can significantly improve their robustness through fine-tuning with augmented data; the Positional Diffusion model outperforms its competitors in most experiments. Conclusion: Deep learning models, particularly the Positional Diffusion model, show high robustness to real-world challenges such as missing pieces, eroded edges, and eroded contents, and the authors suggest future research directions for improving the automated reconstruction of real-world artefacts. Abstract: Content-based puzzle solvers have been extensively studied, demonstrating significant progress in computational techniques. However, their evaluation often lacks realistic challenges crucial for real-world applications, such as the reassembly of fragmented artefacts or shredded documents. In this work, we investigate the robustness of State-Of-The-Art content-based puzzle solvers introducing three types of jigsaw puzzle corruptions: missing pieces, eroded edges, and eroded contents. Evaluating both heuristic and deep learning-based solvers, we analyse their ability to handle these corruptions and identify key limitations. Our results show that solvers developed for standard puzzles have a rapid decline in performance if more pieces are corrupted. However, deep learning models can significantly improve their robustness through fine-tuning with augmented data. Notably, the advanced Positional Diffusion model adapts particularly well, outperforming its competitors in most experiments. Based on our findings, we highlight promising research directions for enhancing the automated reconstruction of real-world artefacts.

[124] Rethinking Query-based Transformer for Continual Image Segmentation

Yuchen Zhu,Cheng Shi,Dingyou Wang,Jiajin Tang,Zhengxuan Wei,Yu Wu,Guanbin Li,Sibei Yang

Main category: cs.CV

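TL;DR: This paper proposes SimCIS, a new method for class-incremental/continual image segmentation that addresses problems in existing decoupled frameworks; by directly selecting image features for query assignment and introducing cross-stage consistency and a replay mechanism, it achieves better performance.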

Details

Motivation: Current decoupled frameworks suffer from loss of plasticity and heavy reliance on input data order; the paper proposes SimCIS to address these issues and improve continual learning performance. Method: The authors propose SimCIS, whose core idea is to directly select image features for query assignment, ensuring "perfect alignment" to preserve objectness while allowing queries to select new classes to promote plasticity. In addition, cross-stage consistency in selection and a "visual query"-based replay mechanism are introduced to further combat catastrophic forgetting of categories. Result: Experiments show that SimCIS outperforms existing state-of-the-art methods across multiple image segmentation tasks; the code and models will be made publicly available. Conclusion: SimCIS consistently outperforms existing state-of-the-art methods across segmentation tasks, settings, splits, and input data orders, and effectively mitigates catastrophic forgetting through its cross-stage selection consistency strategy and "visual query"-based replay mechanism. Abstract: Class-incremental/Continual image segmentation (CIS) aims to train an image segmenter in stages, where the set of available categories differs at each stage. To leverage the built-in objectness of query-based transformers, which mitigates catastrophic forgetting of mask proposals, current methods often decouple mask generation from the continual learning process. This study, however, identifies two key issues with decoupled frameworks: loss of plasticity and heavy reliance on input data order. To address these, we conduct an in-depth investigation of the built-in objectness and find that highly aggregated image features provide a shortcut for queries to generate masks through simple feature alignment. Based on this, we propose SimCIS, a simple yet powerful baseline for CIS. Its core idea is to directly select image features for query assignment, ensuring "perfect alignment" to preserve objectness, while simultaneously allowing queries to select new classes to promote plasticity. To further combat catastrophic forgetting of categories, we introduce cross-stage consistency in selection and an innovative "visual query"-based replay mechanism. Experiments demonstrate that SimCIS consistently outperforms state-of-the-art methods across various segmentation tasks, settings, splits, and input data orders. All models and codes will be made publicly available at https://github.com/SooLab/SimCIS.

[125] 3D-ADAM: A Dataset for 3D Anomaly Detection in Advanced Manufacturing

Paul McHard,Florent P. Audonnet,Oliver Summerell,Sebastian Andraos,Paul Henderson,Gerardo Aragon-Camarasa

Main category: cs.CV

TL;DR: The paper introduces 3D-ADAM, a large-scale, high-precision RGB+3D industrial anomaly detection dataset that addresses limitations in existing datasets and aims to accelerate the development of robust defect detection models for real-world manufacturing environments.

Details

Motivation: Surface defects significantly impact manufacturing yield, making accurate defect detection valuable. However, existing datasets are limited in size, quality, and representation of real-world conditions, which hampers the development of effective automated defect detection methods. Method: The authors introduced 3D-ADAM, a dataset containing 14,120 high-resolution scans across 217 unique parts, captured using 4 industrial depth imaging sensors. It includes 27,346 annotated defect instances from 12 categories and 8,110 annotations of machine element features. The dataset was evaluated using state-of-the-art models, and its industrial relevance was validated through an expert labeling survey conducted by industry partners. Result: 3D-ADAM presents significant challenges to current state-of-the-art models, demonstrating its value as a benchmark for developing more robust and industrially relevant 3D Anomaly Detection models. Conclusion: 3D-ADAM is a new, large-scale, high-precision RGB+3D industrial anomaly detection dataset designed to improve the robustness of 3D Anomaly Detection models for real-world manufacturing applications. Abstract: Surface defects are one of the largest contributors to low yield in the manufacturing sector. Accurate and reliable detection of defects during the manufacturing process is therefore of great value across the sector. State-of-the-art approaches to automated defect detection yield impressive performance on current datasets, yet still fall short in real-world manufacturing settings and developing improved methods relies on large datasets representative of real-world scenarios. Unfortunately, high-quality, high-precision RGB+3D industrial anomaly detection datasets are scarce, and typically do not reflect real-world industrial deployment scenarios. To address this, we introduce 3D-ADAM, the first large-scale industry-relevant dataset for high-precision 3D Anomaly Detection. 
3D-ADAM comprises 14,120 high-resolution scans across 217 unique parts, captured using 4 industrial depth imaging sensors. It includes 27,346 annotated defect instances from 12 categories, covering the breadth of industrial surface defects. 3D-ADAM uniquely captures an additional 8,110 annotations of machine element features, spanning the range of relevant mechanical design form factors. Unlike existing datasets, 3D-ADAM is captured in a real industrial environment with variations in part position and orientation, camera positioning, ambient lighting conditions, as well as partial occlusions. Our evaluation of SOTA models across various RGB+3D anomaly detection tasks demonstrates the significant challenge this dataset presents to current approaches. We further validated the industrial relevance and quality of the dataset through an expert labelling survey conducted by industry partners. By providing this challenging benchmark, 3D-ADAM aims to accelerate the development of robust 3D Anomaly Detection models capable of meeting the demands of modern manufacturing environments.

[126] THUNDER: Tile-level Histopathology image UNDERstanding benchmark

Pierre Marza,Leo Fillioux,Sofiène Boutaj,Kunal Mahatha,Christian Desrosiers,Pablo Piantanida,Jose Dolz,Stergios Christodoulidis,Maria Vakalopoulou

Main category: cs.CV

TL;DR: This paper introduces THUNDER, a fast and flexible benchmark for evaluating digital pathology foundation models across diverse tasks, offering insights into model performance, robustness, and uncertainty.

Details

Motivation: The rapid development of foundation models in digital pathology necessitates a reliable benchmark to assess performance, understand differences between methods, and ensure robustness and uncertainty estimation for real-world applicability. Method: The authors developed THUNDER, a dynamic and easy-to-use benchmarking framework, to compare 23 foundation models across 16 datasets with various downstream tasks, feature analyses, and robustness evaluations. Result: A comprehensive comparison of 23 foundation models on multiple tasks was conducted, demonstrating THUNDER's capability to enable fast, insightful, and tile-level evaluation of existing and user-defined models. Conclusion: THUNDER serves as an efficient and comprehensive benchmark for comparing digital pathology foundation models, providing insights into performance, robustness, and uncertainty. Abstract: Progress in a research field can be hard to assess, in particular when many concurrent methods are proposed in a short period of time. This is the case in digital pathology, where many foundation models have been released recently to serve as feature extractors for tile-level images, being used in a variety of downstream tasks, both for tile- and slide-level problems. Benchmarking available methods then becomes paramount to get a clearer view of the research landscape. In particular, in critical domains such as healthcare, a benchmark should not only focus on evaluating downstream performance, but also provide insights about the main differences between methods, and importantly, further consider uncertainty and robustness to ensure a reliable usage of proposed models. For these reasons, we introduce THUNDER, a tile-level benchmark for digital pathology foundation models, allowing for efficient comparison of many models on diverse datasets with a series of downstream tasks, studying their feature spaces and assessing the robustness and uncertainty of predictions informed by their embeddings. 
THUNDER is a fast, easy-to-use, dynamic benchmark that already supports a large variety of state-of-the-art foundation models, as well as local user-defined models, for direct tile-based comparison. In this paper, we provide a comprehensive comparison of 23 foundation models on 16 different datasets covering diverse tasks, feature analysis, and robustness. The code for THUNDER is publicly available at https://github.com/MICS-Lab/thunder.

[127] Single-Step Latent Diffusion for Underwater Image Restoration

Jiayi Wu,Tianfu Wang,Md Abu Bakr Siddique,Md Jahidul Islam,Cornelia Fermuller,Yiannis Aloimonos,Christopher A. Metzler

Main category: cs.CV

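TL;DR: This paper proposes SLURPP, an efficient underwater image restoration method that combines latent diffusion models with scene decomposition to address the efficiency and quality problems of existing methods on complex scenes.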

Details

Motivation: Existing pixel-domain diffusion methods are computationally expensive and prone to unrealistic artifacts on scenes with complex geometry and significant depth variation, so a more efficient and realistic approach is needed. Method: SLURPP combines pretrained latent diffusion models with an explicit scene decomposition and is trained with a physics-based synthetic underwater image generation pipeline. Result: SLURPP achieves state-of-the-art performance on both synthetic and real-world benchmarks, runs over 200 times faster than existing diffusion-based methods, and improves PSNR by about 3 dB on synthetic benchmarks. Conclusion: SLURPP is a new network architecture for underwater image restoration that combines pretrained latent diffusion models with explicit scene decomposition, overcoming the limitations of existing methods on scenes with complex geometry and significant depth variation. Abstract: Underwater image restoration algorithms seek to restore the color, contrast, and appearance of a scene that is imaged underwater. They are a critical tool in applications ranging from marine ecology and aquaculture to underwater construction and archaeology. While existing pixel-domain diffusion-based image restoration approaches are effective at restoring simple scenes with limited depth variation, they are computationally intensive and often generate unrealistic artifacts when applied to scenes with complex geometry and significant depth variation. In this work we overcome these limitations by combining a novel network architecture (SLURPP) with an accurate synthetic data generation pipeline. SLURPP combines pretrained latent diffusion models -- which encode strong priors on the geometry and depth of scenes -- with an explicit scene decomposition -- which allows one to model and account for the effects of light attenuation and backscattering. To train SLURPP we design a physics-based underwater image synthesis pipeline that applies varied and realistic underwater degradation effects to existing terrestrial image datasets. This approach enables the generation of diverse training data with dense medium/degradation annotations. We evaluate our method extensively on both synthetic and real-world benchmarks and demonstrate state-of-the-art performance. Notably, SLURPP is over 200X faster than existing diffusion-based methods while offering ~3 dB improvement in PSNR on synthetic benchmarks. It also offers compelling qualitative improvements on real-world data. Project website: https://tianfwang.github.io/slurpp/.
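The abstract mentions modelling light attenuation and backscattering but does not give the equations. As a rough illustration only, the sketch below uses a commonly cited simplified underwater image formation model, observed = clean * t + backscatter * (1 - t) with t = exp(-beta * depth). The function names and all coefficient values are hypothetical, and this is not claimed to be SLURPP's actual synthesis pipeline.

```python
import math

def degrade_pixel(clean, depth_m, beta, backscatter):
    """Apply a simplified underwater formation model to one RGB pixel.

    Per channel: observed = clean * t + backscatter * (1 - t),
    where t = exp(-beta * depth) is the transmission.
    ASSUMPTION: this textbook-style model stands in for the paper's pipeline.
    """
    observed = []
    for j, b, a in zip(clean, beta, backscatter):
        t = math.exp(-b * depth_m)
        observed.append(j * t + a * (1.0 - t))
    return tuple(observed)

def restore_pixel(observed, depth_m, beta, backscatter):
    """Invert the same model when depth and medium parameters are known."""
    clean = []
    for i, b, a in zip(observed, beta, backscatter):
        t = math.exp(-b * depth_m)
        clean.append((i - a * (1.0 - t)) / t)
    return tuple(clean)

# Hypothetical medium parameters (R, G, B); red is attenuated fastest.
BETA = (0.6, 0.15, 0.1)          # attenuation per metre
BACKSCATTER = (0.05, 0.3, 0.4)   # veiling-light colour

pixel = (0.8, 0.5, 0.3)
murky = degrade_pixel(pixel, depth_m=5.0, beta=BETA, backscatter=BACKSCATTER)
recovered = restore_pixel(murky, depth_m=5.0, beta=BETA, backscatter=BACKSCATTER)
```

Applying such a model to terrestrial images with known depth yields pairs of degraded and clean images, which is the general idea behind synthetic training data with dense medium annotations.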

[128] MIRA: A Novel Framework for Fusing Modalities in Medical RAG

Jinhong Wang,Tajamul Ashraf,Zongyan Han,Jorma Laaksonen,Rao Mohammad Anwer

Main category: cs.CV

TL;DR: This paper proposes MIRA, a framework for improving the factual accuracy of multimodal large language models in medical diagnosis.

Details

Motivation: To address the factually inconsistent responses generated by MLLMs and the key challenges of retrieval-augmented generation. Method: The authors introduce the MIRA framework, which comprises a calibrated Rethinking and Rearrangement module and a medical RAG framework that integrates image embeddings with a medical knowledge base. Result: On public medical VQA and report generation benchmarks, MIRA substantially improves factual accuracy and overall performance. Conclusion: MIRA effectively improves the factual accuracy and overall performance of MLLMs in medical diagnosis, achieving new state-of-the-art results. Abstract: Multimodal Large Language Models (MLLMs) have significantly advanced AI-assisted medical diagnosis, but they often generate factually inconsistent responses that deviate from established medical knowledge. Retrieval-Augmented Generation (RAG) enhances factual accuracy by integrating external sources, but it presents two key challenges. First, insufficient retrieval can miss critical information, whereas excessive retrieval can introduce irrelevant or misleading content, disrupting model output. Second, even when the model initially provides correct answers, over-reliance on retrieved data can lead to factual errors. To address these issues, we introduce the Multimodal Intelligent Retrieval and Augmentation (MIRA) framework, designed to optimize factual accuracy in MLLMs. MIRA consists of two key components: (1) a calibrated Rethinking and Rearrangement module that dynamically adjusts the number of retrieved contexts to manage factual risk, and (2) a medical RAG framework integrating image embeddings and a medical knowledge base with a query-rewrite module for efficient multimodal reasoning. This enables the model to effectively integrate both its inherent knowledge and external references. Our evaluation on publicly available medical VQA and report generation benchmarks demonstrates that MIRA substantially enhances factual accuracy and overall performance, achieving new state-of-the-art results. Code is released at https://github.com/mbzuai-oryx/MIRA.

[129] Hardware-Aware Feature Extraction Quantisation for Real-Time Visual Odometry on FPGA Platforms

Mateusz Wasala,Mateusz Smolarczyk,Michal Danilowicz,Tomasz Kryjak

Main category: cs.CV

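TL;DR: This paper designs and implements an efficient quantised SuperPoint model for feature point extraction on embedded platforms, achieving high-speed visual odometry on an FPGA.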

Details

Motivation: To reduce the model's computational demands while preserving high detection quality, meeting the real-time and resource constraints of navigation systems on autonomous platforms. Method: Building on a quantised SuperPoint convolutional neural network, the authors apply hardware-aware optimisation and implement the model on an AMD/Xilinx Zynq UltraScale+ FPGA SoC platform, using the Brevitas library and the FINN framework for quantisation and optimisation. Result: The implementation processes 640 x 480 pixel images at up to 54 frames per second, outperforming state-of-the-art solutions, and experiments on the TUM dataset show the impact of different quantisation techniques on model accuracy. Conclusion: The paper presents an unsupervised feature point detection and description architecture that can be deployed efficiently on resource-constrained platforms and experimentally validates its performance and efficiency in a visual odometry task. Abstract: Accurate position estimation is essential for modern navigation systems deployed in autonomous platforms, including ground vehicles, marine vessels, and aerial drones. In this context, Visual Simultaneous Localisation and Mapping (VSLAM) - which includes Visual Odometry - relies heavily on the reliable extraction of salient feature points from the visual input data. In this work, we propose an embedded implementation of an unsupervised architecture capable of detecting and describing feature points. It is based on a quantised SuperPoint convolutional neural network. Our objective is to minimise the computational demands of the model while preserving high detection quality, thus facilitating efficient deployment on platforms with limited resources, such as mobile or embedded systems. We implemented the solution on an FPGA System-on-Chip (SoC) platform, specifically the AMD/Xilinx Zynq UltraScale+, where we evaluated the performance of Deep Learning Processing Units (DPUs), and we also used the Brevitas library and the FINN framework to perform model quantisation and hardware-aware optimisation. This allowed us to process 640 x 480 pixel images at up to 54 fps on an FPGA platform, outperforming state-of-the-art solutions in the field. We conducted experiments on the TUM dataset to demonstrate and discuss the impact of different quantisation techniques on the accuracy and performance of the model in a visual odometry task.
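To make the accuracy versus bit-width trade-off concrete, here is a minimal sketch of symmetric uniform quantisation in plain Python. It deliberately does not use Brevitas or FINN, and it is not the scheme the authors selected; `quantise`, `dequantise`, and the sample weights are hypothetical names and values chosen for illustration.

```python
def quantise(values, num_bits=8):
    """Symmetric uniform quantisation of a list of floats to signed integers.

    Returns the integer codes and the scale such that value ~= code * scale.
    ASSUMPTION: per-tensor scaling from the maximum absolute value.
    """
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale

def dequantise(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]

def max_error(values, num_bits):
    """Largest absolute reconstruction error at a given bit width."""
    codes, scale = quantise(values, num_bits)
    restored = dequantise(codes, scale)
    return max(abs(v - r) for v, r in zip(values, restored))

weights = [0.52, -1.0, 0.251, 0.0, 0.774, -0.333]  # hypothetical layer weights
for bits in (8, 4, 2):
    print(bits, "bits -> max error", max_error(weights, bits))
```

Lower bit widths shrink the arithmetic and memory footprint on the FPGA at the cost of a coarser grid, which is why the choice of quantisation technique affects downstream odometry accuracy.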

[130] Not Only Consistency: Enhance Test-Time Adaptation with Spatio-temporal Inconsistency for Remote Physiological Measurement

Xiao Yang,Yuxuan Fan,Can Liu,Houcheng Su,Weichen Guo,Jiyao Wang,Dengbo He

Main category: cs.CV

TL;DR: This paper introduces a novel test-time adaptation framework (CiCi) for remote photoplethysmography (rPPG), leveraging signal consistency and inconsistency properties to enable real-time, self-supervised model adaptation without requiring source data access.

Details

Motivation: Existing domain adaptation methods for deep learning-based rPPG models face challenges in real-world deployment due to privacy concerns and limited adaptability in unseen environments, motivating the need for an efficient test-time adaptation approach. Method: The proposed method, named Consistency-inConsistency-integration (CiCi), integrates expert knowledge through self-supervision by leveraging both time and frequency domain characteristics of rPPG signals. It also includes a gradient dynamic control mechanism to manage conflicts between priors. Result: Extensive experiments on five diverse datasets under the TTA protocol demonstrate that the CiCi framework outperforms existing techniques, achieving state-of-the-art performance in real-time self-supervised adaptation for rPPG tasks. Conclusion: The paper proposes a novel Test-Time Adaptation (TTA) strategy for remote photoplethysmography (rPPG), introducing the CiCi framework that leverages consistency and inconsistency priors for improved real-time adaptation without source data access. Abstract: Remote photoplethysmography (rPPG) has emerged as a promising non-invasive method for monitoring physiological signals using a camera. Although various domain adaptation and generalization methods have been proposed to promote the adaptability of deep learning-based rPPG models in unseen deployment environments, considerations such as privacy concerns and real-time adaptation restrict their application in real-world deployment. Thus, in this work we propose a novel, fully Test-Time Adaptation (TTA) strategy tailored for rPPG tasks. Specifically, based on prior knowledge in physiology and our observations, we note that rPPG signals show not only spatio-temporal consistency in the frequency domain but also significant inconsistency in the time domain. 
Given this, by leveraging both consistency and inconsistency priors, we introduce an innovative expert knowledge-based self-supervised Consistency-inConsistency-integration (CiCi) framework to enhance model adaptation during inference. In addition, our approach incorporates a gradient dynamic control mechanism to mitigate potential conflicts between priors, ensuring stable adaptation across instances. In extensive experiments on five diverse datasets under the TTA protocol, our method consistently outperforms existing techniques, achieving state-of-the-art performance in real-time self-supervised adaptation without accessing source data. The code will be released later.

[131] Towards Continuous Home Cage Monitoring: An Evaluation of Tracking and Identification Strategies for Laboratory Mice

Juan Pablo Oberhauser,Daniel Grzenda

Main category: cs.CV

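TL;DR: This study develops an efficient real-time tracking and identification system for mice that combines multiple object tracking, transformer-based classification, and linear-programming association to overcome the tracking difficulties conventional methods face in densely housed cages.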

Details

Motivation: Providing per-animal metrics is challenging because laboratory mice are housed at high density, look alike, are highly mobile, and interact frequently. Continuous, automated monitoring makes data collection more accurate and improves animal welfare through real-time insights. Method: The study designs a three-part pipeline: (1) a custom multiple object tracker (MouseTracks) that combines appearance and motion cues; (2) a transformer-based ID classifier (Mouseformer); and (3) a linear-programming associator (MouseMap) that assigns final ID predictions to tracklets. Result: The system identifies mice wearing custom ear tags at 30 frames per second with 24/7 cage coverage. The method improves tracking efficiency and reduces ID switches across mouse strains and a variety of environmental conditions. Conclusion: The study presents an efficient real-time identification algorithm for tracking and identifying laboratory mice wearing custom ear tags in digital home cages. Compared with existing methods, the system improves tracking efficiency and lowers the ID switch rate across mouse strains and environmental factors. Abstract: Continuous, automated monitoring of laboratory mice enables more accurate data collection and improves animal welfare through real-time insights. Researchers can achieve a more dynamic and clinically relevant characterization of disease progression and therapeutic effects by integrating behavioral and physiological monitoring in the home cage. However, providing individual mouse metrics is difficult because of their housing density, similar appearances, high mobility, and frequent interactions. To address these challenges, we develop a real-time identification (ID) algorithm that accurately assigns ID predictions to mice wearing custom ear tags in digital home cages monitored by cameras. Our pipeline consists of three parts: (1) a custom multiple object tracker (MouseTracks) that combines appearance and motion cues from mice; (2) a transformer-based ID classifier (Mouseformer); and (3) a tracklet associator linear program to assign final ID predictions to tracklets (MouseMap). Our models assign an animal ID based on custom ear tags at 30 frames per second with 24/7 cage coverage. We show that our custom tracking and ID pipeline improves tracking efficiency and lowers ID switches across mouse strains and various environmental factors compared to current mouse tracking methods.
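The third stage, assigning one ID to each tracklet, can be viewed as a one-to-one assignment problem. The sketch below is a brute-force stand-in for a linear program, workable only for the handful of animals in one cage. The function name, the score matrix, and the "sum of classifier confidences" objective are assumptions for illustration, not the MouseMap formulation.

```python
from itertools import permutations

def assign_ids(scores):
    """Pick one distinct ID per tracklet so that the total score is maximal.

    scores[t][i] is a hypothetical aggregated classifier confidence that
    tracklet t belongs to animal i. Requires len(scores) <= number of IDs.
    Brute force over permutations: fine for a few animals, not for many.
    """
    n_tracklets = len(scores)
    n_ids = len(scores[0])
    best_total, best_perm = float("-inf"), None
    for perm in permutations(range(n_ids), n_tracklets):
        total = sum(scores[t][perm[t]] for t in range(n_tracklets))
        if total > best_total:
            best_total, best_perm = total, perm
    return list(best_perm), best_total

# Tracklets 0 and 1 both look most like animal 0, so a per-tracklet argmax
# would assign the same ID twice; the joint assignment resolves the conflict.
scores = [
    [0.90, 0.80, 0.10],
    [0.85, 0.20, 0.10],
    [0.10, 0.30, 0.60],
]
ids, total = assign_ids(scores)
print(ids)  # [1, 0, 2]
```

A per-frame classifier alone can label two overlapping tracklets with the same animal; solving the assignment jointly enforces uniqueness, which is one plausible way ID switches get reduced.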

[132] TinierHAR: Towards Ultra-Lightweight Deep Learning Models for Efficient Human Activity Recognition on Edge Devices

Sizhen Bian,Mengxi Liu,Vitor Fortes Rey,Daniel Geissler,Paul Lukowicz

Main category: cs.CV

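TL;DR: This paper proposes TinierHAR, an ultra-lightweight deep learning architecture for efficient and accurate human activity recognition on resource-constrained wearable devices.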

Details

Motivation: Human activity recognition on resource-constrained wearable devices requires inference models that balance accuracy with computational efficiency. Method: The paper combines residual depthwise separable convolutions, gated recurrent units (GRUs), and temporal aggregation to design an efficient and accurate model. Result: Evaluated on 14 public HAR datasets, TinierHAR reduces parameters by 2.7x and MACs by 6.4x relative to TinyHAR, and parameters by 43.3x and MACs by 58.6x relative to DeepConvLSTM, while maintaining the average F1-score. Conclusion: TinierHAR is an ultra-lightweight deep learning architecture that offers an efficient solution for human activity recognition on resource-constrained wearable devices, and its open-source materials support future edge-HAR research. Abstract: Human Activity Recognition (HAR) on resource-constrained wearable devices demands inference models that harmonize accuracy with computational efficiency. This paper introduces TinierHAR, an ultra-lightweight deep learning architecture that synergizes residual depthwise separable convolutions, gated recurrent units (GRUs), and temporal aggregation to achieve SOTA efficiency without compromising performance. Evaluated across 14 public HAR datasets, TinierHAR reduces parameters by 2.7x (vs. TinyHAR) and 43.3x (vs. DeepConvLSTM), and MACs by 6.4x and 58.6x, respectively, while maintaining the averaged F1-scores. Beyond quantitative gains, this work provides the first systematic ablation study dissecting the contributions of spatial-temporal components across the proposed TinierHAR, the prior SOTA TinyHAR, and the classical DeepConvLSTM, offering actionable insights for designing efficient HAR systems. Finally, we discuss the findings and suggest principled design guidelines for future efficient HAR. To catalyze edge-HAR research, we open-source all materials in this work for future benchmarking: https://github.com/zhaxidele/TinierHAR
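Part of the parameter saving comes from depthwise separable convolutions, which replace one dense convolution with a per-channel (depthwise) filter followed by a 1x1 (pointwise) mixing step. The arithmetic below counts weights for a single 1D layer, ignoring biases. The layer sizes are hypothetical and are not taken from the TinierHAR architecture.

```python
def standard_conv1d_params(kernel, c_in, c_out):
    """Weights in a dense 1D convolution (bias ignored)."""
    return kernel * c_in * c_out

def separable_conv1d_params(kernel, c_in, c_out):
    """Weights in a depthwise separable 1D convolution (bias ignored).

    Depthwise step: one length-`kernel` filter per input channel.
    Pointwise step: a 1x1 convolution mixing c_in channels into c_out.
    """
    depthwise = kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise

# Hypothetical layer: kernel length 5, 64 input and 64 output channels.
dense = standard_conv1d_params(5, 64, 64)       # 20480
separable = separable_conv1d_params(5, 64, 64)  # 4416
print(dense, separable, round(dense / separable, 2))
```

The ratio approaches the kernel length as the channel count grows, so the saving for one layer is bounded; the larger reductions reported in the abstract reflect the whole architecture rather than this single substitution.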

[133] Martian World Models: Controllable Video Synthesis with Physically Accurate 3D Reconstructions

Longfei Li,Zhiwen Fan,Wenyan Cong,Xinhang Liu,Yuyang Yin,Matt Foutter,Panwang Pan,Chenyu You,Yue Wang,Zhangyang Wang,Yao Zhao,Marco Pavone,Yunchao Wei

Main category: cs.CV

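TL;DR: This paper proposes a solution for generating realistic Martian landscape videos, consisting of the data curation pipeline M3arsSynth and the video generator MarsGen.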

Details

Motivation: Generating realistic Martian landscape videos poses unique challenges because high-quality Martian data is scarce and there is a domain gap between Martian and terrestrial imagery. Method: (1) The Multimodal Mars Synthesis (M3arsSynth) data curation pipeline reconstructs 3D Martian environments from NASA stereo navigation images; (2) the MarsGen video generator synthesizes new videos from an initial image frame, camera trajectories, or text prompts. Result: Experiments show that the approach outperforms video synthesis models trained on terrestrial datasets in visual fidelity and 3D structural consistency. Conclusion: M3arsSynth and MarsGen provide an effective way to generate physically accurate and visually realistic Martian landscape videos. Abstract: Synthesizing realistic Martian landscape videos is crucial for mission rehearsal and robotic simulation. However, this task poses unique challenges due to the scarcity of high-quality Martian data and the significant domain gap between Martian and terrestrial imagery. To address these challenges, we propose a holistic solution composed of two key components: 1) A data curation pipeline, Multimodal Mars Synthesis (M3arsSynth), which reconstructs 3D Martian environments from real stereo navigation images, sourced from NASA's Planetary Data System (PDS), and renders high-fidelity multiview 3D video sequences. 2) A Martian terrain video generator, MarsGen, which synthesizes novel videos that are visually realistic and geometrically consistent with the 3D structure encoded in the data. Our M3arsSynth engine spans a wide range of Martian terrains and acquisition dates, enabling the generation of physically accurate 3D surface models at metric-scale resolution. MarsGen, fine-tuned on M3arsSynth data, synthesizes videos conditioned on an initial image frame and, optionally, camera trajectories or textual prompts, allowing for video generation in novel environments. Experimental results show that our approach outperforms video synthesis models trained on terrestrial datasets, achieving superior visual fidelity and 3D structural consistency.

[134] Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling

Haoyu Wu,Diankun Wu,Tianyu He,Junliang Guo,Yang Ye,Yueqi Duan,Jiang Bian

Main category: cs.CV

TL;DR: This paper proposes Geometry Forcing, a method to enhance video diffusion models by integrating latent 3D representations through alignment objectives, resulting in better visual quality and 3D consistency.

Details

Motivation: Video diffusion models trained on raw data often fail to capture geometric-aware structures, which limits their ability to represent the underlying 3D nature of the physical world. Method: Geometry Forcing introduces Angular Alignment and Scale Alignment objectives to align intermediate representations with features from a pretrained geometric foundation model. Result: Experiments show that Geometry Forcing significantly improves performance in camera view-conditioned and action-conditioned video generation tasks. Conclusion: The proposed Geometry Forcing method effectively enhances video diffusion models by encouraging latent 3D representations, leading to improved visual quality and 3D consistency. Abstract: Videos inherently represent 2D projections of a dynamic 3D world. However, our analysis suggests that video diffusion models trained solely on raw video data often fail to capture meaningful geometric-aware structure in their learned representations. To bridge this gap between video diffusion models and the underlying 3D nature of the physical world, we propose Geometry Forcing, a simple yet effective method that encourages video diffusion models to internalize latent 3D representations. Our key insight is to guide the model's intermediate representations toward geometry-aware structure by aligning them with features from a pretrained geometric foundation model. To this end, we introduce two complementary alignment objectives: Angular Alignment, which enforces directional consistency via cosine similarity, and Scale Alignment, which preserves scale-related information by regressing unnormalized geometric features from normalized diffusion representation. We evaluate Geometry Forcing on both camera view-conditioned and action-conditioned video generation tasks. Experimental results demonstrate that our method substantially improves visual quality and 3D consistency over the baseline methods. Project page: https://GeometryForcing.github.io.
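The two alignment objectives can be illustrated on a single feature vector. In this sketch, `h` stands for an intermediate diffusion feature and `g` for the matching feature from a geometric foundation model. The linear map `weight` is a hypothetical stand-in for whatever prediction head the authors use, and the exact loss forms (1 minus cosine similarity, mean squared error) are assumptions consistent with the abstract's description rather than the paper's definitions.

```python
import math

def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def angular_alignment_loss(h, g):
    """1 - cosine similarity: penalises direction mismatch, ignores scale."""
    hn, gn = l2_normalize(h), l2_normalize(g)
    return 1.0 - sum(a * b for a, b in zip(hn, gn))

def scale_alignment_loss(h, g, weight):
    """MSE between a linear prediction from the normalised diffusion feature
    and the unnormalised geometric feature, so magnitude information in g
    has to be recovered by the (hypothetical) head `weight`."""
    hn = l2_normalize(h)
    pred = [sum(w * x for w, x in zip(row, hn)) for row in weight]
    return sum((p - t) ** 2 for p, t in zip(pred, g)) / len(g)

h = [3.0, 4.0]                         # diffusion feature (toy values)
g = [6.0, 8.0]                         # geometric feature, same direction
identity = [[1.0, 0.0], [0.0, 1.0]]
scaled = [[10.0, 0.0], [0.0, 10.0]]

print(angular_alignment_loss(h, g))          # close to 0: directions agree
print(scale_alignment_loss(h, g, identity))  # large: magnitude not matched
print(scale_alignment_loss(h, g, scaled))    # close to 0: head restores scale
```

The toy values show why the two terms are complementary: the angular term is already satisfied when directions agree, while the scale term still carries signal about feature magnitude.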

[135] OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding

JingLi Lin,Chenming Zhu,Runsen Xu,Xiaohan Mao,Xihui Liu,Tai Wang,Jiangmiao Pang

Main category: cs.CV

TL;DR: OST-Bench is introduced to assess online spatio-temporal reasoning in MLLMs, revealing that these models struggle with dynamic and memory-intensive tasks, necessitating further research into improving embodied perception.

Details

Motivation: Most existing benchmarks evaluate multimodal large language models (MLLMs) under offline settings, which do not reflect the dynamic and incremental nature of real-world embodied perception. This work aims to address this gap through OST-Bench. Method: The researchers introduced OST-Bench, a benchmark for evaluating Online Spatio-Temporal understanding, built using an efficient data collection pipeline with 1.4k scenes and 10k question-answer pairs from ScanNet, Matterport3D, and ARKitScenes. Result: Evaluation of leading MLLMs on OST-Bench revealed significant shortcomings in handling complex spatio-temporal reasoning tasks, with accuracy declining as exploration progresses and memory increases. Conclusion: The study concludes that current MLLMs struggle with complex spatio-temporal reasoning tasks, especially as the exploration horizon extends and memory grows, highlighting key challenges for improvement in online embodied reasoning. Abstract: Recent advances in multimodal large language models (MLLMs) have shown remarkable capabilities in integrating vision and language for complex reasoning. While most existing benchmarks evaluate models under offline settings with a fixed set of pre-recorded inputs, we introduce OST-Bench, a benchmark designed to evaluate Online Spatio-Temporal understanding from the perspective of an agent actively exploring a scene. The Online aspect emphasizes the need to process and reason over incrementally acquired observations, while the Spatio-Temporal component requires integrating current visual inputs with historical memory to support dynamic spatial reasoning. OST-Bench better reflects the challenges of real-world embodied perception. Built on an efficient data collection pipeline, OST-Bench consists of 1.4k scenes and 10k question-answer pairs collected from ScanNet, Matterport3D, and ARKitScenes. 
We evaluate several leading MLLMs on OST-Bench and observe that they fall short on tasks requiring complex spatio-temporal reasoning. Under the online setting, their accuracy declines as the exploration horizon extends and the memory grows. Through further experimental analysis, we identify common error patterns across models and find that both complex clue-based spatial reasoning demands and long-term memory retrieval requirements significantly drop model performance along two separate axes, highlighting the core challenges that must be addressed to improve online embodied reasoning. To foster further research and development in the field, our codes, dataset, and benchmark are available. Our project page is: https://rbler1234.github.io/OSTBench.github.io/

[136] CLIP Won't Learn Object-Attribute Binding from Natural Data and Here is Why

Bijay Gurung,David T. Hoffmann,Thomas Brox

Main category: cs.CV

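TL;DR: This study finds that the binding problem of the contrastive vision-language model CLIP is caused mainly by properties of the data; experiments on synthetic data show that CLIP can achieve near-perfect binding under specific data conditions.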

Details

Motivation: Although contrastive vision-language models such as CLIP are widely used across tasks, they show clear limitations on the image-text binding problem. Prior work tried to fix this by adding hard negatives or modifying the architecture, but did not fully resolve it. This paper takes a data perspective to uncover the key factors affecting CLIP's binding ability. Method: The paper uses a synthetic dataset to systematically analyze how data properties affect CLIP. The experiments evaluate the effect of common natural-data properties (low attribute density, incomplete captions, and saliency bias) on binding performance and test the effect of increasing the batch size and of explicitly generating hard negatives. Result: Common natural-data properties (low attribute density, incomplete captions, and saliency bias) significantly degrade CLIP's binding performance. Increasing the batch size or explicitly generating hard negatives does not effectively improve binding. CLIP achieves near-perfect binding only when the data satisfies specific conditions. Conclusion: The binding problems of contrastive vision-language models such as CLIP stem mainly from properties of the data. Through an empirical study on a synthetic dataset, the authors find that common properties of natural data hurt CLIP's binding performance, and that CLIP learns binding almost perfectly only when the data satisfies specific conditions. Abstract: Contrastive vision-language models like CLIP are used for a large variety of applications, such as zero-shot classification or as a vision encoder for multi-modal models. Despite their popularity, their representations show major limitations. For instance, CLIP models learn bag-of-words representations and, as a consequence, fail to distinguish whether an image is of "a yellow submarine and a blue bus" or "a blue submarine and a yellow bus". Previous attempts to fix this issue added hard negatives during training or modified the architecture, but failed to resolve the problem in its entirety. We suspect that the missing insights to solve the binding problem for CLIP are hidden in the arguably most important part of learning algorithms: the data. In this work, we fill this gap by rigorously identifying the influence of data properties on CLIP's ability to learn binding using a synthetic dataset. We find that common properties of natural data such as low attribute density, incomplete captions, and the saliency bias, a tendency of human captioners to describe the object that is "most salient" to them, have a detrimental effect on binding performance. In contrast to common belief, we find that neither scaling the batch size, i.e., implicitly adding more hard negatives, nor explicitly creating hard negatives enables CLIP to learn reliable binding. Only when the data expresses our identified data properties does CLIP learn almost perfect binding.
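The submarine and bus example from the abstract can be checked directly: the two captions contain exactly the same words, so any representation that only counts words cannot tell them apart, while the attribute-object pairs differ. The helper names below are hypothetical, and the pair extraction assumes each attribute word directly precedes its object, which holds for these two captions but not for captions in general.

```python
from collections import Counter

def bag_of_words(caption):
    """Order-free word counts for a caption."""
    return Counter(caption.lower().split())

def bound_pairs(caption, attributes, objects):
    """(attribute, object) pairs, assuming the attribute directly precedes
    the object it modifies (a simplification for this toy example)."""
    words = caption.lower().split()
    return {(a, o) for a, o in zip(words, words[1:])
            if a in attributes and o in objects}

caption_a = "a yellow submarine and a blue bus"
caption_b = "a blue submarine and a yellow bus"
attributes = {"yellow", "blue"}
objects = {"submarine", "bus"}

print(bag_of_words(caption_a) == bag_of_words(caption_b))  # True
print(bound_pairs(caption_a, attributes, objects)
      == bound_pairs(caption_b, attributes, objects))      # False
```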

[137] Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs

Jeongseok Hyun,Sukjun Hwang,Su Ho Han,Taeoh Kim,Inwoong Lee,Dongyoon Wee,Joon-Young Lee,Seon Joo Kim,Minho Shim

Main category: cs.CV

TL;DR: STTM is a new training-free token merging method for video LLMs that significantly improves efficiency by exploiting local spatial and temporal redundancy.

Details

Motivation: The motivation is to overcome the quadratic computational scaling issue of Video LLMs with token count by reducing redundancy without compromising understanding performance. Method: The method involves transforming each video frame into multi-granular spatial tokens using a quadtree structure and performing directed pairwise merging across the temporal dimension to exploit local redundancy in video data. Result: STTM outperforms existing token reduction methods on six video QA benchmarks, achieving up to 3x speed-up with only minor accuracy drops under reduced token budgets. Conclusion: STTM is an effective training-free approach for spatio-temporal token merging, demonstrating significant speed-up with minimal accuracy drop and enabling KV cache reuse. Abstract: Video large language models (LLMs) achieve strong video understanding by leveraging a large number of spatio-temporal tokens, but suffer from quadratic computational scaling with token count. To address this, we propose a training-free spatio-temporal token merging method, named STTM. Our key insight is to exploit local spatial and temporal redundancy in video data which has been overlooked in prior work. STTM first transforms each frame into multi-granular spatial tokens using a coarse-to-fine search over a quadtree structure, then performs directed pairwise merging across the temporal dimension. This decomposed merging approach outperforms existing token reduction methods across six video QA benchmarks. Notably, STTM achieves a 2$\times$ speed-up with only a 0.5% accuracy drop under a 50% token budget, and a 3$\times$ speed-up with just a 2% drop under a 30% budget. Moreover, STTM is query-agnostic, allowing KV cache reuse across different questions for the same video. The project page is available at https://www.jshyun.me/projects/sttm.
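The spatial half of the method, representing a frame with tokens of mixed granularity via a quadtree, can be sketched on a toy grid. Here each patch is a single scalar rather than a feature vector, and a block is kept as one coarse token when all of its values lie within `threshold` of the block mean. Both choices, along with the function name, are assumptions for illustration; the merging criterion in STTM itself may differ, and the temporal merging step is not shown.

```python
def quadtree_tokens(grid, threshold, r0=0, c0=0, size=None):
    """Coarse-to-fine tokenisation of a square grid (side a power of two).

    Returns a list of (row, col, size, value) tokens. A uniform-enough block
    becomes a single token holding its mean; otherwise it is split into four
    quadrants that are processed recursively.
    """
    if size is None:
        size = len(grid)
    values = [grid[r][c]
              for r in range(r0, r0 + size)
              for c in range(c0, c0 + size)]
    mean = sum(values) / len(values)
    if size == 1 or max(abs(v - mean) for v in values) <= threshold:
        return [(r0, c0, size, mean)]
    half = size // 2
    tokens = []
    for dr in (0, half):
        for dc in (0, half):
            tokens += quadtree_tokens(grid, threshold, r0 + dr, c0 + dc, half)
    return tokens

# Three quadrants are flat and one is detailed, so 16 patches become 7 tokens.
frame = [
    [1.0, 1.0, 2.0, 2.0],
    [1.0, 1.0, 2.0, 2.0],
    [3.0, 3.0, 0.0, 5.0],
    [3.0, 3.0, 9.0, 2.0],
]
tokens = quadtree_tokens(frame, threshold=0.1)
print(len(tokens))  # 7
```

Because the decision depends only on the visual content and not on the question asked, a scheme of this kind stays query-agnostic, which is the property the abstract ties to KV cache reuse.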

[138] Multigranular Evaluation for Brain Visual Decoding

Weihao Xia,Cengiz Oztireli

Main category: cs.CV

TL;DR: This paper proposes BASIC, a new multigranular evaluation framework for brain visual decoding that improves upon existing methods by offering more detailed, structured, and context-aware comparisons.

Details

Motivation: Existing evaluation protocols for brain visual decoding have limitations such as coarse metrics, lack of neuroscientific basis, and poor capture of fine-grained distinctions. This study aims to overcome these issues. Method: The study introduces BASIC, a framework that evaluates brain visual decoding using three levels: structural fidelity, inferential alignment, and contextual coherence. It uses segmentation-based metrics for structure and multimodal language models for semantic comparisons. Result: The proposed framework enables detailed, scalable, and context-rich comparisons between decoded images and ground-truth stimuli, offering improved benchmarking of visual decoding methods. Conclusion: BASIC provides a more discriminative, interpretable, and comprehensive foundation for measuring brain visual decoding methods. Abstract: Existing evaluation protocols for brain visual decoding predominantly rely on coarse metrics that obscure inter-model differences, lack neuroscientific foundation, and fail to capture fine-grained visual distinctions. To address these limitations, we introduce BASIC, a unified, multigranular evaluation framework that jointly quantifies structural fidelity, inferential alignment, and contextual coherence between decoded and ground truth images. For the structural level, we introduce a hierarchical suite of segmentation-based metrics, including foreground, semantic, instance, and component masks, anchored in granularity-aware correspondence across mask structures. For the semantic level, we extract structured scene representations encompassing objects, attributes, and relationships using multimodal large language models, enabling detailed, scalable, and context-rich comparisons with ground-truth stimuli. We benchmark a diverse set of visual decoding methods across multiple stimulus-neuroimaging datasets within this unified evaluation framework. 
Together, these criteria provide a more discriminative, interpretable, and comprehensive foundation for measuring brain visual decoding methods.

[139] Doodle Your Keypoints: Sketch-Based Few-Shot Keypoint Detection

Subhajit Maity,Ayan Kumar Bhunia,Subhadeep Koley,Pinaki Nath Chowdhury,Aneeshan Sain,Yi-Zhe Song

Main category: cs.CV

TL;DR: This paper proposes a framework for few-shot keypoint detection using sketches, addressing challenges in cross-modal embeddings and sketch style variations through a prototypical approach and domain adaptation.

Details

Motivation: Keypoint detection faces challenges in few-shot learning when source data from the same distribution as the query is unavailable. Sketches are explored as a solution to provide a source-free alternative. Method: The framework uses a prototypical setup combined with a grid-based locator and prototypical domain adaptation to overcome challenges in cross-modal embeddings and user-specific sketch styles. Result: Extensive experiments show that the proposed method achieves success in few-shot convergence across novel keypoints and classes. Conclusion: The proposed framework successfully addresses challenges in few-shot learning for keypoint detection by using sketches as a source-free alternative and demonstrates success in convergence across novel keypoints and classes. Abstract: Keypoint detection, integral to modern machine perception, faces challenges in few-shot learning, particularly when source data from the same distribution as the query is unavailable. This gap is addressed by leveraging sketches, a popular form of human expression, providing a source-free alternative. However, challenges arise in mastering cross-modal embeddings and handling user-specific sketch styles. Our proposed framework overcomes these hurdles with a prototypical setup, combined with a grid-based locator and prototypical domain adaptation. We also demonstrate success in few-shot convergence across novel keypoints and classes through extensive experiments.

[140] Single-pass Adaptive Image Tokenization for Minimum Program Search

Shivam Duggal,Sanghyun Byun,William T. Freeman,Antonio Torralba,Phillip Isola

Main category: cs.CV

TL;DR: KARL is a single-pass adaptive tokenizer inspired by Algorithmic Information Theory that efficiently allocates variable-length representations for images, matching the performance of existing methods while offering deeper insights into the relationship between image complexity and Kolmogorov Complexity.

Details

Motivation: Traditional visual representation learning systems use fixed-length representations despite variations in input complexity or familiarity. The research aims to develop a more efficient adaptive tokenization method aligned with Algorithmic Information Theory. Method: KARL, a single-pass adaptive tokenizer inspired by Kolmogorov Complexity principles, predicts the appropriate number of tokens for an image in one forward pass. It uses a training procedure similar to Upside-Down Reinforcement Learning to halt token prediction based on desired reconstruction quality. Result: KARL achieves comparable performance to existing adaptive tokenization methods but does so in a single forward pass, making it more efficient. Scaling laws were analyzed, and a conceptual study showed alignment between predicted image complexity and human intuition. Conclusion: KARL successfully matches the performance of recent adaptive tokenizers while operating in a single pass and provides insights into the relationship between adaptive image tokenization and Algorithmic Information Theory. Abstract: According to Algorithmic Information Theory (AIT) -- Intelligent representations compress data into the shortest possible program that can reconstruct its content, exhibiting low Kolmogorov Complexity (KC). In contrast, most visual representation learning systems use fixed-length representations for all inputs, ignoring variations in complexity or familiarity. Recent adaptive tokenization methods address this by allocating variable-length representations but typically require test-time search over multiple encodings to find the most predictive one. Inspired by Kolmogorov Complexity principles, we propose a single-pass adaptive tokenizer, KARL, which predicts the appropriate number of tokens for an image in a single forward pass, halting once its approximate KC is reached. The token count serves as a proxy for the minimum description length. 
KARL's training procedure closely resembles the Upside-Down Reinforcement Learning paradigm, as it learns to conditionally predict token halting based on a desired reconstruction quality. KARL matches the performance of recent adaptive tokenizers while operating in a single pass. We present scaling laws for KARL, analyzing the role of encoder/decoder size, continuous vs. discrete tokenization and more. Additionally, we offer a conceptual study drawing an analogy between Adaptive Image Tokenization and Algorithmic Information Theory, examining the predicted image complexity (KC) across axes such as structure vs. noise and in- vs. out-of-distribution familiarity -- revealing alignment with human intuition.

[141] MGVQ: Could VQ-VAE Beat VAE? A Generalizable Tokenizer with Multi-group Quantization

Mingkai Jia,Wei Yin,Xiaotao Hu,Jiaxin Guo,Xiaoyang Guo,Qian Zhang,Xiao-Xiao Long,Ping Tan

Main category: cs.CV

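TL;DR: This paper proposes MGVQ, a method that improves the reconstruction quality of VQ-VAEs by augmenting the representation capability of discrete codebooks, and it performs strongly on multiple benchmarks.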

Details

Motivation: Existing methods have not closed the gap in reconstruction quality between VQ-VAEs and VAEs. Method: The latent dimension is retained to preserve encoded features, and a set of sub-codebooks is used for quantization. Result: MGVQ achieves state-of-the-art performance on ImageNet and 8 zero-shot benchmarks, and outperforms SD-VAE on rFID and PSNR. Conclusion: MGVQ delivers strong reconstruction quality, narrows the gap between VQ-VAEs and VAEs, and offers a new approach to preserving fidelity in high-resolution image processing tasks. Abstract: Vector Quantized Variational Autoencoders (VQ-VAEs) are fundamental models that compress continuous visual data into discrete tokens. Existing methods have tried to improve the quantization strategy for better reconstruction quality; however, there still exists a large gap between VQ-VAEs and VAEs. To narrow this gap, we propose MGVQ, a novel method to augment the representation capability of discrete codebooks, facilitating easier optimization for codebooks and minimizing information loss, thereby enhancing reconstruction quality. Specifically, we propose to retain the latent dimension to preserve encoded features and incorporate a set of sub-codebooks for quantization. Furthermore, we construct comprehensive zero-shot benchmarks featuring resolutions of 512p and 2k to evaluate the reconstruction performance of existing methods rigorously. MGVQ achieves state-of-the-art performance on both ImageNet and 8 zero-shot benchmarks across all VQ-VAEs. Notably, compared with SD-VAE, we outperform it on ImageNet significantly, with rFID 0.49 vs. 0.91, and achieve superior PSNR on all zero-shot benchmarks. These results highlight the superiority of MGVQ in reconstruction and pave the way for preserving fidelity in HD image processing tasks. Code will be publicly available at https://github.com/MKJia/MGVQ.
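
The idea of quantizing with a set of sub-codebooks while retaining the latent dimension can be illustrated with a minimal sketch. The abstract does not specify how groups are formed or how codes are selected, so everything here is an assumption: the latent vector is split into equal contiguous groups, each group is snapped to its nearest entry (squared Euclidean distance) in its own sub-codebook, and the function names `nearest` and `multi_group_quantize` are hypothetical.

```python
def nearest(code_book, chunk):
    """Index of the codebook entry closest to chunk (squared Euclidean distance)."""
    def sq_dist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, chunk))
    return min(range(len(code_book)), key=lambda i: sq_dist(code_book[i]))


def multi_group_quantize(latent, sub_codebooks):
    """Quantize each contiguous group of `latent` with its own sub-codebook.

    Assumption (not from the paper): groups are equal-sized, contiguous slices.
    The output has the same length as the input, so the latent dimension is kept.
    """
    if len(latent) % len(sub_codebooks) != 0:
        raise ValueError("latent length must be divisible by the number of groups")
    group_dim = len(latent) // len(sub_codebooks)
    indices, quantized = [], []
    for g, code_book in enumerate(sub_codebooks):
        chunk = latent[g * group_dim:(g + 1) * group_dim]
        idx = nearest(code_book, chunk)
        indices.append(idx)
        quantized.extend(code_book[idx])
    return indices, quantized


# Toy example: a 4-dimensional latent, two groups, two codes per sub-codebook.
sub_codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],  # sub-codebook for group 0
    [[0.0, 1.0], [1.0, 0.0]],  # sub-codebook for group 1
]
latent = [0.9, 0.8, 0.1, 0.7]
indices, quantized = multi_group_quantize(latent, sub_codebooks)
```

With G groups of K codes each, the number of representable vectors is K to the power G, which is one way to read "augmenting the representation capability" of a discrete codebook without growing any single codebook.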

[142] Impact of Pretraining Word Co-occurrence on Compositional Generalization in Multimodal Models

Helen Qu,Sang Michael Xie

Main category: cs.CV

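TL;DR: This paper examines how concept combinations affect the performance of CLIP and large multimodal models (LMMs), finds that word co-occurrence statistics (measured with pointwise mutual information) correlate strongly with zero-shot accuracy, and argues that models need to be improved to strengthen compositional generalization.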

Details

Motivation: CLIP and large multimodal models (LMMs) perform better on concepts that are frequent in their training data, but how concept combinations affect compositional generalization is unclear. This paper therefore investigates how word co-occurrence statistics affect model performance. Method: The study uses pointwise mutual information (PMI) to measure word co-occurrence statistics in the CLIP pretraining dataset, and tests how different concept combinations affect zero-shot accuracy using synthetically generated images and edited natural images. It also evaluates whether CLIP's behavior transfers to LMMs built on CLIP. Result: PMI in the CLIP pretraining data correlates strongly with the zero-shot accuracy of CLIP models trained on LAION-400M (r=0.97), with a 14% accuracy gap between images in the top and bottom 5% of PMI values. Reproducing the effect in natural images yields a correlation of r=0.75. The behavior also transfers to LMMs built on CLIP (TextVQA: r=0.70, VQAv2: r=0.62). Conclusion: The performance of CLIP and CLIP-based LMMs is affected by the concept combinations in the pretraining dataset: even for common concepts, accuracy varies with how concepts are combined in the image. This points to the need to improve compositional generalization in multimodal models without scaling the training data combinatorially. Abstract: CLIP and large multimodal models (LMMs) have better accuracy on examples involving concepts that are highly represented in the training data. However, the role of concept combinations in the training data on compositional generalization is largely unclear -- for instance, how does accuracy vary when a common object appears in an uncommon pairing with another object? In this paper, we investigate how word co-occurrence statistics in the pretraining dataset (a proxy for co-occurrence of visual concepts) impact CLIP/LMM performance. To disentangle the effects of word co-occurrence frequencies from single-word frequencies, we measure co-occurrence with pointwise mutual information (PMI), which normalizes the joint probability of two words co-occurring by the probability of co-occurring independently. Using synthetically generated images with a variety of concept pairs, we show a strong correlation between PMI in the CLIP pretraining data and zero-shot accuracy in CLIP models trained on LAION-400M (r=0.97 and 14% accuracy gap between images in the top and bottom 5% of PMI values), demonstrating that even accuracy on common concepts is affected by the combination of concepts in the image. Leveraging this finding, we reproduce this effect in natural images by editing them to contain pairs with varying PMI, resulting in a correlation of r=0.75. 
Finally, we demonstrate that this behavior in CLIP transfers to LMMs built on top of CLIP (r=0.70 for TextVQA, r=0.62 for VQAv2). Our findings highlight the need for algorithms and architectures that improve compositional generalization in multimodal models without scaling the training data combinatorially. Our code is available at https://github.com/helenqu/multimodal-pretraining-pmi.
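
For readers unfamiliar with PMI, the quantity described in the abstract is PMI(x, y) = log( p(x, y) / (p(x) p(y)) ). The sketch below estimates it from caption-level co-occurrence counts. It is an illustration only: the toy captions, whitespace tokenization, base-2 logarithm, and the function name `pmi_table` are assumptions, not the authors' pipeline (their code is at the repository linked above).

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(captions):
    """PMI (log base 2) for every word pair that co-occurs in at least one caption.

    Probabilities are estimated as the fraction of captions containing the
    word (or both words). Keys are alphabetically sorted (word_a, word_b) tuples.
    """
    n = len(captions)
    word_counts = Counter()
    pair_counts = Counter()
    for caption in captions:
        words = sorted(set(caption.lower().split()))
        word_counts.update(words)
        pair_counts.update(combinations(words, 2))
    table = {}
    for (x, y), c_xy in pair_counts.items():
        p_xy = c_xy / n
        p_x = word_counts[x] / n
        p_y = word_counts[y] / n
        table[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return table


# Toy corpus (hypothetical): "dog" is common, but rarely appears with "couch".
captions = ["dog frisbee", "dog frisbee", "dog couch", "cat couch"]
pmi = pmi_table(captions)
```

In this toy corpus the pair ("couch", "dog") gets a negative PMI even though "dog" is the most frequent word, which is the distinction between pair frequency and single-word frequency that the abstract draws.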

eess.IV [Back]

[143] Semi-supervised learning and integration of multi-sequence MR-images for carotid vessel wall and plaque segmentation

Marie-Christine Pali,Christina Schwaiger,Malik Galijasevic,Valentin K. Ladenhauf,Stephanie Mangesius,Elke R. Gizewski

Main category: eess.IV

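TL;DR: This paper proposes a semi-supervised deep learning method for carotid MRI segmentation that combines multi-sequence data with a modified U-Net architecture to address scarce annotations and complex morphology.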

Details

Motivation: Accurate segmentation of carotid plaques is essential for assessing the risk of atherosclerosis and ischemic stroke, but complex plaque morphology and scarce labeled data make this difficult for conventional methods. Method: The paper proposes a semi-supervised deep learning approach that integrates multi-sequence MRI data to segment the carotid vessel wall and plaque. It consists of a coarse localization model and a fine segmentation model, and introduces a multi-level multi-sequence version of the U-Net architecture together with several fusion strategies. To cope with limited labeled data, it enforces consistency under input transformations. Result: The method was evaluated on MRI data from 52 patients with arteriosclerosis, each with five MRI sequences. Experiments show that it improves segmentation accuracy and highlight the importance of fusion point selection in the U-Net architecture. An expert-based assessment further supports the accuracy of the results. Conclusion: The proposed semi-supervised approach performs well on carotid MRI segmentation, particularly when data are limited; fusion strategies and semi-supervised learning show potential for carotid artery segmentation. Abstract: The analysis of carotid arteries, particularly plaques, in multi-sequence Magnetic Resonance Imaging (MRI) data is crucial for assessing the risk of atherosclerosis and ischemic stroke. In order to evaluate metrics and radiomic features, quantifying the state of atherosclerosis, accurate segmentation is important. However, the complex morphology of plaques and the scarcity of labeled data pose significant challenges. In this work, we address these problems and propose a semi-supervised deep learning-based approach designed to effectively integrate multi-sequence MRI data for the segmentation of carotid artery vessel wall and plaque. The proposed algorithm consists of two networks: a coarse localization model identifies the region of interest guided by some prior knowledge on the position and number of carotid arteries, followed by a fine segmentation model for precise delineation of vessel walls and plaques. To effectively integrate complementary information across different MRI sequences, we investigate different fusion strategies and introduce a multi-level multi-sequence version of U-Net architecture. To address the challenges of limited labeled data and the complexity of carotid artery MRI, we propose a semi-supervised approach that enforces consistency under various input transformations. Our approach is evaluated on 52 patients with arteriosclerosis, each with five MRI sequences. Comprehensive experiments demonstrate the effectiveness of our approach and emphasize the role of fusion point selection in U-Net-based architectures. 
To validate the accuracy of our results, we also include an expert-based assessment of model performance. Our findings highlight the potential of fusion strategies and semi-supervised learning for improving carotid artery segmentation in data-limited MRI applications.

[144] D-CNN and VQ-VAE Autoencoders for Compression and Denoising of Industrial X-ray Computed Tomography Images

Bardia Hejazi,Keerthana Chand,Tobias Fritsch,Giovanni Bruno

Main category: eess.IV

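TL;DR: This study examines deep-learning-based compression of XCT data and its effect on the quality of the recovered data, and proposes an edge-preservation-sensitive quality metric suited to three-dimensional data analysis.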

Details

Motivation: Advances in imaging technology have led to ever-growing data volumes in the imaging sciences, which calls for efficient and reliable data storage solutions. Method: Deep learning autoencoders are used to compress industrial X-ray computed tomography (XCT) data, and a quality metric sensitive to edge preservation is introduced to assess decompressed image quality. Result: XCT data from a sandstone sample were compressed and recovered with two network architectures (D-CNN and VQ-VAE) at different compression rates; the quality of the decoded images depends on which specific features need to be preserved. Conclusion: Different architectures and compression rates affect the specific features that later analysis requires in different ways, and the findings can help scientists decide on data storage and analysis strategies. Abstract: The ever-growing volume of data in imaging sciences stemming from the advancements in imaging technologies necessitates efficient and reliable storage solutions for such large datasets. This study investigates the compression of industrial X-ray computed tomography (XCT) data using deep learning autoencoders and examines how these compression algorithms affect the quality of the recovered data. Two network architectures with different compression rates were used, a deep convolutional neural network (D-CNN) and a vector quantized variational autoencoder (VQ-VAE). The XCT data used was from a sandstone sample with a complex internal pore network. The quality of the decoded images obtained from the two different deep learning architectures with different compression rates was quantified and compared to the original input data. In addition, to improve image decoding quality metrics, we introduced a metric sensitive to edge preservation, which is crucial for three-dimensional data analysis. We showed that different architectures and compression rates are required depending on the specific characteristics needed to be preserved for later analysis. The findings presented here can help scientists determine the requirements and strategies for their data storage and analysis needs.

[145] Compressive Imaging Reconstruction via Tensor Decomposed Multi-Resolution Grid Encoding

Zhenyu Jin,Yisi Luo,Xile Zhao,Deyu Meng

Main category: eess.IV

TL;DR: GridTD improves compressive imaging reconstruction by utilizing an unsupervised continuous representation framework that combines hierarchical modeling with tensor decomposition for efficient and effective image recovery.

Details

Motivation: The motivation stems from the challenge faced by existing unsupervised representations in achieving a balance between representation ability and efficiency in CI reconstruction. Method: Tensor Decomposed multi-resolution Grid encoding (GridTD), an unsupervised continuous representation framework, was proposed for CI reconstruction. It uses a lightweight neural network with a tensor decomposition model learned via multi-resolution hash grid encoding. Result: GridTD effectively and efficiently reconstructs high-dimensional images and outperforms existing methods in diverse CI tasks like video SCI, spectral SCI, and compressive dynamic MRI reconstruction. Conclusion: GridTD is a versatile and state-of-the-art CI reconstruction method that offers superior performance in reconstructing high-dimensional images. Abstract: Compressive imaging (CI) reconstruction, such as snapshot compressive imaging (SCI) and compressive sensing magnetic resonance imaging (MRI), aims to recover high-dimensional images from low-dimensional compressed measurements. This process critically relies on learning an accurate representation of the underlying high-dimensional image. However, existing unsupervised representations may struggle to achieve a desired balance between representation ability and efficiency. To overcome this limitation, we propose Tensor Decomposed multi-resolution Grid encoding (GridTD), an unsupervised continuous representation framework for CI reconstruction. GridTD optimizes a lightweight neural network and the input tensor decomposition model whose parameters are learned via multi-resolution hash grid encoding. It inherently enjoys the hierarchical modeling ability of multi-resolution grid encoding and the compactness of tensor decomposition, enabling effective and efficient reconstruction of high-dimensional images. 
Theoretical analyses for the algorithm's Lipschitz property, generalization error bound, and fixed-point convergence reveal the intrinsic superiority of GridTD as compared with existing continuous representation models. Extensive experiments across diverse CI tasks, including video SCI, spectral SCI, and compressive dynamic MRI reconstruction, consistently demonstrate the superiority of GridTD over existing methods, positioning GridTD as a versatile and state-of-the-art CI reconstruction method.

[146] Breast Ultrasound Tumor Generation via Mask Generator and Text-Guided Network:A Clinically Controllable Framework with Downstream Evaluation

Haoyu Pan,Hongxin Lin,Zetian Feng,Chuxuan Lin,Junyang Mo,Chu Zhang,Zijian Wu,Yi Wang,Qingqing Zheng

Main category: eess.IV

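TL;DR: This paper proposes a generative framework that combines clinical descriptions with structural masks, addressing the shortage of expert-annotated data in breast ultrasound image analysis, and applies it to downstream diagnosis tasks.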

Details

Motivation: Because the scarcity of expert-annotated data limits the development of robust deep learning models, a controllable and clinically useful method is needed to generate personalized breast ultrasound (BUS) images that reflect real-world morphological diversity. Method: A controllable generative framework that combines clinical descriptions with structural masks is proposed for synthesizing BUS images, along with a semantic-curvature mask generator that synthesizes structurally diverse tumor masks guided by clinical priors. Result: Quantitative evaluations on six public BUS datasets show that the synthetic images improve breast cancer diagnosis, and visual Turing tests confirm the realism and utility of the generated images. Conclusion: The framework is shown to be effective in enhancing downstream breast cancer diagnosis tasks, and visual Turing tests confirm the realism of the generated images, indicating its potential for clinical application. Abstract: The development of robust deep learning models for breast ultrasound (BUS) image analysis is significantly constrained by the scarcity of expert-annotated data. To address this limitation, we propose a clinically controllable generative framework for synthesizing BUS images. This framework integrates clinical descriptions with structural masks to generate tumors, enabling fine-grained control over tumor characteristics such as morphology, echogenicity, and shape. Furthermore, we design a semantic-curvature mask generator, which synthesizes structurally diverse tumor masks guided by clinical priors. During inference, synthetic tumor masks serve as input to the generative framework, producing highly personalized synthetic BUS images with tumors that reflect real-world morphological diversity. Quantitative evaluations on six public BUS datasets demonstrate the significant clinical utility of our synthetic images, showing their effectiveness in enhancing downstream breast cancer diagnosis tasks. Furthermore, visual Turing tests conducted by experienced sonographers confirm the realism of the generated images, indicating the framework's potential to support broader clinical applications.

[147] MeD-3D: A Multimodal Deep Learning Framework for Precise Recurrence Prediction in Clear Cell Renal Cell Carcinoma (ccRCC)

Hasaan Maqsood,Saif Ur Rehman Khan

Main category: eess.IV

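TL;DR: This study develops a multimodal deep learning framework that integrates imaging, pathology, clinical, and genomic data to predict recurrence risk in clear cell renal cell carcinoma more accurately and to improve clinical decision-making.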

Details

Motivation: Because clear cell renal cell carcinoma (ccRCC) is complex and heterogeneous at the molecular, pathological, and clinical levels, traditional prognostic models based on a single data modality struggle to predict recurrence accurately; a more comprehensive approach is needed to improve predictive accuracy and support clinical decision-making. Method: A multimodal deep learning framework combining CT, MRI, histopathology whole slide images (WSI), clinical data, and genomic data is proposed. Domain-specific models (CLAM, MeD-3D, and an MLP) extract features, which are integrated through early and late fusion strategies. Result: By integrating multiple data modalities, the framework improves the prediction of ccRCC recurrence and can cope with the missing data that is common in clinical settings. Conclusion: The study concludes that the proposed deep learning framework effectively integrates multiple data modalities, improves the accuracy of ccRCC recurrence prediction, and can handle incomplete data in clinical settings. Abstract: Accurate prediction of recurrence in clear cell renal cell carcinoma (ccRCC) remains a major clinical challenge due to the disease's complex molecular, pathological, and clinical heterogeneity. Traditional prognostic models, which rely on single data modalities such as radiology, histopathology, or genomics, often fail to capture the full spectrum of disease complexity, resulting in suboptimal predictive accuracy. This study aims to overcome these limitations by proposing a deep learning (DL) framework that integrates multimodal data, including CT, MRI, histopathology whole slide images (WSI), clinical data, and genomic profiles, to improve the prediction of ccRCC recurrence and enhance clinical decision-making. The proposed framework utilizes a comprehensive dataset curated from multiple publicly available sources, including TCGA, TCIA, and CPTAC. To process the diverse modalities, domain-specific models are employed: CLAM, a ResNet50-based model, is used for histopathology WSIs, while MeD-3D, a pre-trained 3D-ResNet18 model, processes CT and MRI images. For structured clinical and genomic data, a multi-layer perceptron (MLP) is used. These models are designed to extract deep feature embeddings from each modality, which are then fused through an early and late integration architecture. This fusion strategy enables the model to combine complementary information from multiple sources. Additionally, the framework is designed to handle incomplete data, a common challenge in clinical settings, by enabling inference even when certain modalities are missing.
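
The abstract does not say how inference proceeds when a modality is missing. One common pattern, shown here purely as an assumed illustration and not as the authors' method, is late fusion that averages per-modality predictions over whichever modalities are present. The function name `late_fusion`, the modality keys, and the scores are all hypothetical.

```python
def late_fusion(scores):
    """Average per-modality recurrence scores, skipping missing modalities.

    `scores` maps a modality name to a probability in [0, 1], or to None when
    that modality is unavailable for the patient. This masked average is an
    assumption for illustration; the paper's fusion architecture may differ.
    """
    available = [s for s in scores.values() if s is not None]
    if not available:
        raise ValueError("at least one modality is required for inference")
    return sum(available) / len(available)


# Hypothetical patient with no MRI: the remaining three modalities are fused.
patient = {"ct": 0.75, "mri": None, "wsi": 0.25, "clinical": 0.5}
risk = late_fusion(patient)
```

Skipping absent modalities, rather than imputing a default score for them, keeps a missing scan from pulling the fused prediction toward an arbitrary value.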

[148] ArteryX: Advancing Brain Artery Feature Extraction with Vessel-Fused Networks and a Robust Validation Framework

Abrar Faiyaz,Nhat Hoang,Giovanni Schifitto,Md Nasir Uddin

Main category: eess.IV

TL;DR: ArteryX is a novel semi-supervised toolbox that improves the evaluation of cerebral vasculature by offering accurate, efficient, and standardized quantitative assessments.

Details

Motivation: Current methods for extracting arterial features from MRA face challenges such as user-dependent variability and lack of standardized validations, which the ArteryX toolbox aims to overcome. Method: ArteryX uses a vessel-fused network-based landmarking approach to track and manage tracings, and it integrates an in-vivo like artery simulation framework using predefined ground-truth features. Result: ArteryX processes data at 0.5 mm resolution in 10-15 minutes with minimal user intervention, demonstrating improved sensitivity to vascular changes compared to existing methods. Conclusion: The ArteryX framework is a promising tool for benchmarking feature extraction and integrating into clinical workflows, aiding in the early detection of cerebrovascular pathology. Abstract: Cerebrovascular pathology significantly contributes to cognitive decline and neurological disorders, underscoring the need for advanced tools to assess vascular integrity. Three-dimensional Time-of-Flight Magnetic Resonance Angiography (3D TOF MRA) is widely used to visualize cerebral vasculature, however, clinical evaluations generally focus on major arterial abnormalities, overlooking quantitative metrics critical for understanding subtle vascular changes. Existing methods for extracting structural, geometrical and morphological arterial features from MRA - whether manual or automated - face challenges including user-dependent variability, steep learning curves, and lack of standardized quantitative validations. We propose a novel semi-supervised artery evaluation framework, named ArteryX, a MATLAB-based toolbox that quantifies vascular features with high accuracy and efficiency, achieving processing times ~10-15 minutes per subject at 0.5 mm resolution with minimal user intervention. ArteryX employs a vessel-fused network based landmarking approach to reliably track and manage tracings, effectively addressing the issue of dangling/disconnected vessels. 
Validation on human subjects with cerebral small vessel disease demonstrated its improved sensitivity to subtle vascular changes and better performance than an existing semi-automated method. Importantly, the ArteryX toolbox enables quantitative feature validation by integrating an in-vivo like artery simulation framework utilizing vessel-fused graph nodes and predefined ground-truth features for specific artery types. Thus, the ArteryX framework holds promise for benchmarking feature extraction toolboxes and for seamless integration into clinical workflows, enabling early detection of cerebrovascular pathology and standardized comparisons across patient cohorts to advance understanding of vascular contributions to brain health.

Table of Contents

cs.CL [Back]

[1] Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs

[2] Prompt Perturbations Reveal Human-Like Biases in LLM Survey Responses

[3] SynthTextEval: Synthetic Text Data Generation and Evaluation for High-Stakes Domains

[4] Medical Red Teaming Protocol of Language Models: On the Importance of User Perspectives in Healthcare Settings

[5] The Impact of Background Speech on Interruption Detection in Collaborative Groups

[6] Multi-Agent Retrieval-Augmented Framework for Evidence-Based Counterspeech Against Health Misinformation

[7] GNN-CNN: An Efficient Hybrid Model of Convolutional and Graph Neural Networks for Text Representation

[8] MedReadCtrl: Personalizing medical text generation with readability-controlled instruction learning

[9] SynthEHR-Eviction: Enhancing Eviction SDoH Detection with LLM-Augmented Synthetic EHR Data

[10] Towards Interpretable Time Series Foundation Models

[11] SAND: Boosting LLM Agents with Self-Taught Action Deliberation

[12] RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning

[13] Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models

[14] PLAN-TUNING: Post-Training Language Models to Learn Step-by-Step Planning for Complex Problem Solving

[15] Teaching LLM to Reason: Reinforcement Learning from Algorithmic Problems without Code

[16] Extracting ORR Catalyst Information for Fuel Cell from Scientific Literature

[17] Hallucination Stations: On Some Basic Limitations of Transformer-Based Language Models

[18] Toward Real-World Chinese Psychological Support Dialogues: CPsDD Dataset and a Co-Evolving Multi-Agent System

[19] Triadic Multi-party Voice Activity Projection for Turn-taking in Spoken Dialogue Systems

[20] CEA-LIST at CheckThat! 2025: Evaluating LLMs as Detectors of Bias and Opinion in Text

[21] The Cross-Lingual Cost: Retrieval Biases in RAG over Arabic-English Corpora

[22] The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs

[23] Single-to-mix Modality Alignment with Multimodal Large Language Model for Document Image Machine Translation

[24] Bayesian Discrete Diffusion Beats Autoregressive Perplexity

[25] Exploring the Limits of Model Compression in LLMs: A Knowledge Distillation Study on QA Tasks

[26] FrugalRAG: Learning to retrieve and reason for multi-hop QA

[27] Lost in Pronunciation: Detecting Chinese Offensive Language Disguised by Phonetic Cloaking Replacement

[28] An Automated Length-Aware Quality Metric for Summarization

[29] SAS: Simulated Attention Score

[30] KeyKnowledgeRAG (K^2RAG): An Enhanced RAG method for improved LLM question-answering capabilities

[31] Rethinking the Privacy of Text Embeddings: A Reproducibility Study of "Text Embeddings Reveal (Almost) As Much As Text"

[32] Not All Preferences are What You Need for Post-Training: Selective Alignment Strategy for Preference Optimization

[33] Code-Switching in End-to-End Automatic Speech Recognition: A Systematic Literature Review

[34] When Large Language Models Meet Law: Dual-Lens Taxonomy, Technical Advances, and Ethical Governance

[35] StreamUni: Achieving Streaming Speech Translation with a Unified Large Speech-Language Model

[36] Bridging Logic and Learning: Decoding Temporal Logic Embeddings via Transformers

[37] Understanding and Controlling Repetition Neurons and Induction Heads in In-Context Learning

[38] On the Effect of Instruction Tuning Loss on Generalization

[39] Conditional Unigram Tokenization with Parallel Data

[40] From Ambiguity to Accuracy: The Transformative Effect of Coreference Resolution on Retrieval-Augmented Generation systems

[41] Alpay Algebra V: Multi-Layered Semantic Games and Transfinite Fixed-Point Simulation

[42] DocCHA: Towards LLM-Augmented Interactive Online diagnosis System

[43] Automating MD simulations for Proteins using Large language Models: NAMD-Agent

[44] DTECT: Dynamic Topic Explorer & Context Tracker

[45] SAGE: A Visual Language Model for Anomaly Detection via Fact Enhancement and Entropy-aware Alignment

[46] MIRIX: Multi-Agent Memory System for LLM-Based Agents

[47] Why is Your Language Model a Poor Implicit Reward Model?

[48] Performance and Practical Considerations of Large and Small Language Models in Clinical Decision Support in Rheumatology

[49] Automating Expert-Level Medical Reasoning Evaluation of Large Language Models

[50] PyVision: Agentic Vision with Dynamic Tooling

cs.CV [Back]

[51] Multi-level Mixture of Experts for Multimodal Entity Linking

[52] CoPT: Unsupervised Domain Adaptive Segmentation using Domain-Agnostic Text Embeddings

[53] Image Can Bring Your Memory Back: A Novel Multi-Modal Guided Attack against Image Generation Model Unlearning

[54] Explainable Artificial Intelligence in Biomedical Image Analysis: A Comprehensive Survey

[55] Robust Multimodal Large Language Models Against Modality Conflict

[56] Aerial Maritime Vessel Detection and Identification

[57] CL-Polyp: A Contrastive Learning-Enhanced Network for Accurate Polyp Segmentation

[58] Interpretable EEG-to-Image Generation with Semantic Prompts

[59] A Survey on Long-Video Storytelling Generation: Architectures, Consistency, and Cinematic Quality

[60] Colors See Colors Ignore: Clothes Changing ReID with Color Disentanglement

[61] Automated Video Segmentation Machine Learning Pipeline

[62] DisenQ: Disentangling Q-Former for Activity-Biometrics

[63] LinguaMark: Do Multimodal Models Speak Fairly? A Benchmark-Based Evaluation

[64] MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning

[65] ADIEE: Automatic Dataset Creation and Scorer for Instruction-Guided Image Editing Evaluation

[66] Scalable and Realistic Virtual Try-on Application for Foundation Makeup with Kubelka-Munk Theory

[67] Entity Re-identification in Visual Storytelling via Contrastive Reinforcement Learning

[68] PacGDC: Label-Efficient Generalizable Depth Completion with Projection Ambiguity and Consistency

[69] Adaptive Particle-Based Shape Modeling for Anatomical Surface Correspondence

[70] Multi-Scale Attention and Gated Shifting for Fine-Grained Event Spotting in Videos

[71] KeyRe-ID: Keypoint-Guided Person Re-Identification using Part-Aware Representation in Videos

[72] Behave Your Motion: Habit-preserved Cross-category Animal Motion Transfer

[73] Seg-Wild: Interactive Segmentation based on 3D Gaussian Splatting for Unconstrained Image Collections

[74] EscherNet++: Simultaneous Amodal Completion and Scalable View Synthesis through Masked Fine-Tuning and Enhanced Feed-Forward 3D Reconstruction

[75] EPIC: Efficient Prompt Interaction for Text-Image Classification

[76] Corvid: Improving Multimodal Large Language Models Towards Chain-of-Thought Reasoning

[77] Towards High-Resolution 3D Anomaly Detection: A Scalable Dataset and Real-Time Framework for Subtle Industrial Defects