cs.CL

[1] MemeMind: A Large-Scale Multimodal Dataset with Chain-of-Thought Reasoning for Harmful Meme Detection

Hexiang Gu,Qifan Yu,Saihui Hou,Zhiqin Fang,Huijia Wu,Zhaofeng He

Main category: cs.CL

TL;DR: This paper introduces the MemeMind dataset and the MemeGuard framework, which significantly improve harmful meme detection.

Details Motivation: Automated detection of harmful memes is difficult because of their implicit semantics and complex multimodal interactions, and the lack of a systematic, large-scale, diverse, and highly explainable dataset hinders further progress in the field. Method: The authors propose MemeGuard, a detection framework that integrates multimodal information with reasoning-process modeling, and run extensive experiments on the MemeMind dataset. Result: MemeMind fills critical gaps in current datasets by providing comprehensive labels and explicit reasoning traces; MemeGuard significantly improves models' ability to understand and identify harmful memes. Conclusion: MemeGuard consistently outperforms existing state-of-the-art methods on harmful meme detection. Abstract: The rapid development of social media has intensified the spread of harmful content. Harmful memes, which integrate both images and text, pose significant challenges for automated detection due to their implicit semantics and complex multimodal interactions. Although existing research has made progress in detection accuracy and interpretability, the lack of a systematic, large-scale, diverse, and highly explainable dataset continues to hinder further advancement in this field. To address this gap, we introduce MemeMind, a novel dataset featuring scientifically rigorous standards, large scale, diversity, bilingual support (Chinese and English), and detailed Chain-of-Thought (CoT) annotations. MemeMind fills critical gaps in current datasets by offering comprehensive labeling and explicit reasoning traces, thereby providing a solid foundation for enhancing harmful meme detection. In addition, we propose an innovative detection framework, MemeGuard, which effectively integrates multimodal information with reasoning process modeling, significantly improving models' ability to understand and identify harmful memes. Extensive experiments conducted on the MemeMind dataset demonstrate that MemeGuard consistently outperforms existing state-of-the-art methods in harmful meme detection tasks.

[2] Mirage of Mastery: Memorization Tricks LLMs into Artificially Inflated Self-Knowledge

Sahil Kale,Vijaykant Nadadur

Main category: cs.CL

TL;DR: The study finds that large language models (LLMs) often solve problems through memorization rather than genuine reasoning, especially in science and medicine; this makes their self-knowledge inconsistent and undermines the reliability and explainability of AI.

Details Motivation: Existing studies treat memorization and self-knowledge deficits in LLMs as separate issues, overlooking the link between them that degrades the trustworthiness of LLM responses. This study aims to expose that link and point to directions for improvement. Method: The study uses a novel framework to determine whether LLMs genuinely learn reasoning patterns from training data or merely memorize answers to appear competent. Result: The analysis shows an inconsistency of over 45% in feasibility assessments when LLMs face logically coherent task perturbations, most pronounced in science and medicine. Conclusion: Current LLM training patterns and architectures are flawed, causing marked instability in self-knowledge in domains with highly standardized jargon such as science and medicine; this points to the need for techniques that improve AI explainability and trustworthiness. Abstract: When artificial intelligence mistakes memorization for intelligence, it creates a dangerous mirage of reasoning. Existing studies treat memorization and self-knowledge deficits in LLMs as separate issues and do not recognize an intertwining link that degrades the trustworthiness of LLM responses. In our study, we utilize a novel framework to ascertain if LLMs genuinely learn reasoning patterns from training data or merely memorize them to assume competence across problems of similar complexity focused on STEM domains. Our analysis shows a noteworthy problem in generalization: LLMs draw confidence from memorized solutions to infer a higher self-knowledge about their reasoning ability, which manifests as an over 45% inconsistency in feasibility assessments when faced with self-validated, logically coherent task perturbations. This effect is most pronounced in science and medicine domains, which tend to have maximal standardized jargon and problems, further confirming our approach. Significant wavering within the self-knowledge of LLMs also shows flaws in current architectures and training patterns, highlighting the need for techniques that ensure a balanced, consistent stance on models' perceptions of their own knowledge for maximum AI explainability and trustworthiness. Our code and results are available publicly at https://github.com/knowledge-verse-ai/LLM-Memorization_SK_Eval-.

[3] Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations

Brian Siyuan Zheng,Alisa Liu,Orevaoghene Ahia,Jonathan Hayase,Yejin Choi,Noah A. Smith

Main category: cs.CL

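TL;DR: The study shows that instruction-tuned models retain high performance under non-canonical tokenizations, and that some tokenization schemes can even improve task performance.

Details Motivation: Modern tokenizers use deterministic algorithms to map text to a single "canonical" token sequence, yet the same string can be encoded as many non-canonical tokenizations using the vocabulary. The paper investigates how robust language models are to non-canonically tokenized text unseen during training. Method: The authors evaluate how text encoded with non-canonical tokenizations affects language model performance across 20 benchmarks, analyze different tokenization schemes (such as character-level segmentation and right-aligned digit grouping), and investigate the source of the robustness. Result: Instruction-tuned models retain up to 93.4% of their original performance under randomly sampled tokenizations and 90.8% under character-level tokenization; stronger models tend to be more robust; non-canonical schemes can improve performance on specific tasks (character-level segmentation improves string manipulation and code understanding by up to +14%, and right-aligned digit grouping improves large-number arithmetic by +33%). Conclusion: Robustness to non-canonical tokenization arises in the instruction-tuning phase, and models are less tied to their tokenizer than previously believed. Intervening on tokenization at inference time shows promise for boosting performance. Abstract: Modern tokenizers employ deterministic algorithms to map text into a single "canonical" token sequence, yet the same string can be encoded as many non-canonical tokenizations using the tokenizer vocabulary. In this work, we investigate the robustness of LMs to text encoded with non-canonical tokenizations entirely unseen during training. Surprisingly, when evaluated across 20 benchmarks, we find that instruction-tuned models retain up to 93.4% of their original performance when given a randomly sampled tokenization, and 90.8% with character-level tokenization. We see that overall stronger models tend to be more robust, and robustness diminishes as the tokenization departs farther from the canonical form. Motivated by these results, we then identify settings where non-canonical tokenization schemes can *improve* performance, finding that character-level segmentation improves string manipulation and code understanding tasks by up to +14%, and right-aligned digit grouping enhances large-number arithmetic by +33%. Finally, we investigate the source of this robustness, finding that it arises in the instruction-tuning phase. We show that while both base and post-trained models grasp the semantics of non-canonical tokenizations (perceiving them as containing misspellings), base models try to mimic the imagined mistakes and degenerate into nonsensical output, while post-trained models are committed to fluent responses. Overall, our findings suggest that models are less tied to their tokenizer than previously believed, and demonstrate the promise of intervening on tokenization at inference time to boost performance.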
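
The two segmentation schemes named in the abstract can be illustrated at the string level. The following is a minimal sketch, not the authors' code: the function names are hypothetical, the group size of three is an assumption (the abstract does not state it), and the pieces are plain strings rather than vocabulary token IDs, so the step of mapping each piece to a tokenizer's vocabulary is left out.

```python
def char_level_segments(text: str) -> list[str]:
    """Split text into single-character pieces (character-level segmentation)."""
    return list(text)


def right_aligned_digit_groups(number: str, group_size: int = 3) -> list[str]:
    """Group the digits of a number starting from the right-hand side.

    The group size of 3 is an assumption made for illustration only.
    """
    if not number.isdigit():
        raise ValueError("expected a non-empty string of digits")
    # Any leftover digits form a shorter group at the front (left) of the number.
    first = len(number) % group_size
    groups = [number[:first]] if first else []
    groups += [
        number[i:i + group_size]
        for i in range(first, len(number), group_size)
    ]
    return groups


if __name__ == "__main__":
    print(char_level_segments("cat"))
    print(right_aligned_digit_groups("1234567"))
```

Right alignment keeps the ones, thousands, and millions places in the same position within each group regardless of the number's length, which is one plausible reason such grouping could help arithmetic.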

[4] Quantifying Fairness in LLMs Beyond Tokens: A Semantic and Statistical Perspective

Weijie Xu,Yiwen Wang,Chi Xue,Xiangkun Hu,Xi Fang,Guimin Dong,Chandan K. Reddy

Main category: cs.CL

TL;DR: This paper proposes FiSCo, a new framework for detecting latent bias in the long-form responses of large language models (LLMs) that is more robust and accurate than traditional methods.

Details Motivation: Existing evaluation methods overlook biases in long-form responses and the intrinsic variability of LLM outputs, so a finer-grained semantic analysis is needed to identify these biases reliably. Method: FiSCo decomposes model outputs into semantically distinct claims, applies statistical hypothesis testing to compare inter- and intra-group similarities, formalizes a new group counterfactual fairness definition, and is validated on synthetic and human-annotated datasets. Result: FiSCo is validated across gender, race, and age, and identifies subtle biases more reliably than other evaluation metrics. Conclusion: FiSCo is a new statistical framework for detecting subtle group-level semantic bias in LLMs while reducing the impact of stochastic variability in LLM generations. Abstract: Large Language Models (LLMs) often generate responses with inherent biases, undermining their reliability in real-world applications. Existing evaluation methods often overlook biases in long-form responses and the intrinsic variability of LLM outputs. To address these challenges, we propose FiSCo (Fine-grained Semantic Computation), a novel statistical framework to evaluate group-level fairness in LLMs by detecting subtle semantic differences in long-form responses across demographic groups. Unlike prior work focusing on sentiment or token-level comparisons, FiSCo goes beyond surface-level analysis by operating at the claim level, leveraging entailment checks to assess the consistency of meaning across responses. We decompose model outputs into semantically distinct claims and apply statistical hypothesis testing to compare inter- and intra-group similarities, enabling robust detection of subtle biases. We formalize a new group counterfactual fairness definition and validate FiSCo on both synthetic and human-annotated datasets spanning gender, race, and age. Experiments show that FiSCo more reliably identifies nuanced biases while reducing the impact of stochastic LLM variability, outperforming various evaluation metrics.
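
To make the comparison of inter- and intra-group similarities concrete, here is a rough sketch of one way such a test could be run. The abstract does not say which hypothesis test FiSCo uses, so the one-sided permutation test below is a stand-in chosen for illustration. The similarity scores are assumed to have been computed already (for example from claim-level entailment checks, which are not shown), and the function name is hypothetical.

```python
import random
from statistics import mean


def permutation_p_value(intra, inter, n_permutations=10_000, seed=0):
    """One-sided permutation test: are intra-group similarities higher
    than inter-group similarities by more than chance would explain?

    intra: similarity scores between responses within one demographic group
    inter: similarity scores between responses across two groups
    """
    rng = random.Random(seed)
    observed = mean(intra) - mean(inter)
    pooled = list(intra) + list(inter)
    n_intra = len(intra)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Randomly reassign the scores to the two groups.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_intra]) - mean(pooled[n_intra:])
        if diff >= observed:
            at_least_as_extreme += 1
    # The +1 terms count the observed split itself, avoiding a p-value of 0.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


if __name__ == "__main__":
    intra_scores = [0.90, 0.92, 0.88, 0.91, 0.90, 0.93]
    inter_scores = [0.50, 0.55, 0.52, 0.48, 0.51, 0.53]
    print(permutation_p_value(intra_scores, inter_scores))
```

A small p-value suggests that responses differ more across groups than repeated sampling within a group would explain, which is the sense in which such a test separates bias from ordinary generation randomness.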

[5] Plan for Speed -- Dilated Scheduling for Masked Diffusion Language Models

Omer Luxembourg,Haim Permuter,Eliya Nachmani

Main category: cs.CL

TL;DR: The paper introduces DUS, an efficient inference method for Masked Diffusion Language Models that enables parallel unmasking while maintaining quality, reducing the number of denoiser calls and improving speed.

Details Motivation: Existing samplers for Masked Diffusion Language Models (MDLMs) use heuristics like denoiser confidence or entropy scores to select tokens to unmask, but these methods fail in parallel unmasking scenarios by ignoring pairwise interactions and dependencies. This leaves them no faster at inference than traditional auto-regressive models. Method: The Dilated-scheduled Unmasking Strategy (DUS) partitions sequence positions into dilation-based groups of non-adjacent tokens under a first-order Markov assumption, enabling independent, parallel unmasking steps that respect local context and minimize joint entropy. Result: DUS improves performance on math and code completion benchmarks over existing parallel confidence-based planners without modifying the underlying denoiser, and reduces the number of denoiser calls to O(log B) per generation block compared to O(B) in state-of-the-art diffusion models. Conclusion: DUS offers a lightweight, budget-aware approach to efficient, high-quality text generation, paving the way to unlock the true capabilities of MDLMs. Abstract: Masked diffusion language models (MDLM) have shown strong promise for non-autoregressive text generation, yet existing samplers act as implicit planners, selecting tokens to unmask via denoiser confidence or entropy scores. Such heuristics falter under parallel unmasking - they ignore pairwise interactions between tokens and cannot account for dependencies when unmasking multiple positions at once, limiting their inference speed to that of traditional auto-regressive (AR) models. We introduce the Dilated-scheduled Unmasking Strategy (DUS), an inference-only, planner-model-free method that requires no additional training. DUS leverages a first-order Markov assumption to partition sequence positions into dilation-based groups of non-adjacent tokens, enabling independent, parallel unmasking steps that respect local context and minimize the joint entropy of each iteration step. Unlike semi-AR block approaches (e.g., LLaDA and Dream) that still invoke the denoiser per block, DUS reduces the number of denoiser calls to O(log B) per generation block - yielding substantial speedup over the O(B) run time of state-of-the-art diffusion models, where B is the block size in the semi-AR inference process. In experiments on math (GSM8K) and code completion (HumanEval, MBPP) benchmarks - domains suited to non-ordinal generation - DUS improves scores over parallel confidence-based planners, without modifying the underlying denoiser. DUS offers a lightweight, budget-aware approach to efficient, high-quality text generation, paving the way to unlock the true capabilities of MDLMs.
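
The idea of dilation-based groups of non-adjacent positions can be sketched as a schedule over one block. This is an illustrative assumption, not the paper's exact schedule: the coarse-to-fine halving order, the power-of-two restriction on the block size, and the function name are all hypothetical, and the denoiser call that would fill in each group is omitted.

```python
def dilated_groups(block_size: int) -> list[list[int]]:
    """Partition positions 0..block_size-1 into groups for parallel unmasking.

    Each group is unmasked in one step. Positions inside a group are spaced
    2 * stride apart, so no two of them are adjacent, and every new position
    sits between positions revealed in earlier steps.
    """
    if block_size < 1 or block_size & (block_size - 1):
        raise ValueError("this sketch assumes block_size is a power of two")
    groups = [[0]]  # the first step reveals a single anchor position
    stride = block_size // 2
    while stride >= 1:
        groups.append(list(range(stride, block_size, 2 * stride)))
        stride //= 2
    return groups


if __name__ == "__main__":
    for step, group in enumerate(dilated_groups(8)):
        print(step, group)
```

For a block of size B this produces log2(B) + 1 groups, so one denoiser call per group gives the O(log B) call count described above, in contrast to B calls when positions are revealed one at a time.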

[6] NLPnorth @ TalentCLEF 2025: Comparing Discriminative, Contrastive, and Prompt-Based Methods for Job Title and Skill Matching

Mike Zhang,Rob van der Goot

Main category: cs.CL

TL;DR: This paper presents NLPnorth's submission to the two TalentCLEF 2025 tasks, Multilingual Job Title Matching and Job Title-Based Skill Prediction, and finds by comparing several methods that the largest multilingual language models perform best.

Details Motivation: Matching job titles, and aligning them with skills, matters for downstream tasks such as automatic candidate matching, career path prediction, and job market analysis. Method: The authors compare (fine-tuned) classification-based, (fine-tuned) contrastive-based, and prompting methods on both tasks, and use language-specific titles and descriptions from ESCO as extra data. Result: On Task A, the prompting approach reaches a mean average precision (MAP) of 0.492 on test data; on Task B, the fine-tuned classification-based approach reaches a MAP of 0.290 on test data. The submission ranks 5th on Task A and 3rd on Task B. Conclusion: NLPnorth's submission performs well on both TalentCLEF 2025 tasks, and the largest multilingual language models perform best on both. Abstract: Matching job titles is a highly relevant task in the computational job market domain, as it improves e.g., automatic candidate matching, career path prediction, and job market analysis. Furthermore, aligning job titles to job skills can be considered an extension to this task, with similar relevance for the same downstream tasks. In this report, we outline NLPnorth's submission to TalentCLEF 2025, which includes both of these tasks: Multilingual Job Title Matching, and Job Title-Based Skill Prediction. For both tasks we compare (fine-tuned) classification-based, (fine-tuned) contrastive-based, and prompting methods. We observe that for Task A, our prompting approach performs best with an average of 0.492 mean average precision (MAP) on test data, averaged over English, Spanish, and German. For Task B, we obtain an MAP of 0.290 on test data with our fine-tuned classification-based approach. Additionally, we made use of extra data by pulling all the language-specific titles and corresponding descriptions from ESCO for each job and skill. Overall, we find that the largest multilingual language models perform best for both tasks. Per the provisional results and only counting the unique teams, the ranking on Task A is 5th/20 and for Task B 3rd/14.
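
Since both tasks are scored with mean average precision, a short sketch of the textbook definition may help in reading the numbers. The official TalentCLEF scorer may differ in details such as rank cutoffs or tie handling, and the job titles in the example are made up.

```python
def average_precision(ranked, relevant):
    """Average precision for one query.

    ranked:   items in the order the system returned them (no duplicates)
    relevant: the set of items that are correct for this query
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            # Precision at this rank, counted only where a relevant item appears.
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(rankings, relevants):
    """Mean of the per-query average precision values."""
    scores = [average_precision(r, rel) for r, rel in zip(rankings, relevants)]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Hypothetical query: candidate titles ranked for "software engineer".
    ranked = ["developer", "baker", "programmer", "nurse"]
    relevant = {"developer", "programmer"}
    print(average_precision(ranked, relevant))
```

In the example the relevant titles appear at ranks 1 and 3, giving (1/1 + 2/3) / 2, so a system is rewarded for placing every correct match near the top rather than only the first one.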

[7] MFTCXplain: A Multilingual Benchmark Dataset for Evaluating the Moral Reasoning of LLMs through Hate Speech Multi-hop Explanation

Jackson Trager,Francielle Vargas,Diego Alves,Matteo Guida,Mikel K. Ngueajio,Ameeta Agrawal,Flor Plaza-del-Arco,Yalda Daryanai,Farzan Karimi-Malekabadi

Main category: cs.CL

TL;DR: This paper introduces MFTCXplain, a multilingual benchmark dataset for evaluating the moral reasoning of large language models through Moral Foundation Theory.

Details Motivation: Current evaluation benchmarks have two main shortcomings: they lack annotations that justify moral classifications, and they focus predominantly on English, which limits the assessment of moral reasoning across cultures. Method: The authors introduce a multilingual dataset of 3,000 tweets in Portuguese, Italian, Persian, and English, annotated for hate speech multi-hop explanation using Moral Foundation Theory. Result: LLMs perform well in hate speech detection (F1 up to 0.836) but poorly at predicting moral sentiments (F1 below 0.35), and rationale alignment remains limited, mainly in underrepresented languages. Conclusion: Current LLMs have limited capacity to internalize and reflect human moral reasoning. Abstract: Ensuring the moral reasoning capabilities of Large Language Models (LLMs) is a growing concern as these systems are used in socially sensitive tasks. Nevertheless, current evaluation benchmarks present two major shortcomings: a lack of annotations that justify moral classifications, which limits transparency and interpretability; and a predominant focus on English, which constrains the assessment of moral reasoning across diverse cultural settings. In this paper, we introduce MFTCXplain, a multilingual benchmark dataset for evaluating the moral reasoning of LLMs via hate speech multi-hop explanation using Moral Foundation Theory (MFT). The dataset comprises 3,000 tweets across Portuguese, Italian, Persian, and English, annotated with binary hate speech labels, moral categories, and text span-level rationales. Empirical results highlight a misalignment between LLM outputs and human annotations in moral reasoning tasks. While LLMs perform well in hate speech detection (F1 up to 0.836), their ability to predict moral sentiments is notably weak (F1 < 0.35). Furthermore, rationale alignment remains limited mainly in underrepresented languages. These findings show the limited capacity of current LLMs to internalize and reflect human moral reasoning.

[8] Language Models Might Not Understand You: Evaluating Theory of Mind via Story Prompting

Nathaniel Getachew,Abulhair Saparov

Main category: cs.CL

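TL;DR: This paper introduces StorySim, a new framework for evaluating the theory of mind and world modeling capabilities of large language models; results show that performance varies across tasks and reveal certain biases in models' reasoning.

Details Motivation: To address possible pretraining-data contamination in existing benchmarks and to provide a controllable, flexible tool for evaluating the theory of mind and world modeling capabilities of LLMs. Method: The authors develop StorySim, a framework that uses a highly controllable Storyboard to generate novel, compositional story prompts, allowing precise manipulation of character perspectives and events. They use it to design first- and second-order ToM tasks alongside WM tasks, and run experiments across a suite of state-of-the-art LLMs. Result: Most models perform better on WM tasks than on ToM tasks, and models tend to reason better about humans than about inanimate objects. The framework also uncovered heuristic behavior, such as recency bias and an over-reliance on earlier events in the story. Conclusion: StorySim is a programmable framework for synthetically generating stories to evaluate the theory of mind (ToM) and world modeling (WM) capabilities of LLMs. It shows that models perform worse on ToM tasks than on WM tasks and that they rely on heuristics such as recency bias and an over-reliance on earlier events. Abstract: We introduce StorySim, a programmable framework for synthetically generating stories to evaluate the theory of mind (ToM) and world modeling (WM) capabilities of large language models (LLMs). Unlike prior benchmarks that may suffer from contamination in pretraining data, StorySim produces novel, compositional story prompts anchored by a highly controllable Storyboard, enabling precise manipulation of character perspectives and events. We use this framework to design first- and second-order ToM tasks alongside WM tasks that control for the ability to track and model mental states. Our experiments across a suite of state-of-the-art LLMs reveal that most models perform better on WM tasks than ToM tasks, and that models tend to perform better reasoning with humans compared to inanimate objects. Additionally, our framework enabled us to find evidence of heuristic behavior such as recency bias and an over-reliance on earlier events in the story. All code for generating data and evaluations is freely available.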

[9] Human-Aligned Faithfulness in Toxicity Explanations of LLMs

Ramaravind K. Mothilal,Joanna Roy,Syed Ishtiaque Ahmed,Shion Guha

Main category: cs.CL

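TL;DR: The study proposes a new criterion for evaluating LLMs' toxicity explanations, Human-Aligned Faithfulness (HAF), and finds with automated metrics that current LLMs perform poorly on complex toxicity reasoning.

Details Motivation: Existing explainability methods are hard to apply directly to evaluating free-form toxicity explanations because they over-rely on techniques such as input text perturbation. Current research also centers on toxicity detection and lacks a systematic evaluation of how LLMs reason when justifying a toxicity stance. The study aims to fill this gap with a new criterion that measures how well LLM explanations align with those of a rational human under ideal conditions. Method: The authors propose a criterion called Human-Aligned Faithfulness (HAF) and build six automated metrics based on uncertainty quantification, which evaluate the quality of LLM-generated free-form toxicity explanations with no human involvement. Experiments use three Llama models (up to 70B parameters) and an 8B Ministral model on five toxicity datasets. Result: LLMs generate plausible explanations for simple prompts, but their reasoning degrades markedly when they are asked about the nuanced relations between the complete set of reasons, the individual reasons, and their toxicity stances, yielding inconsistent and nonsensical responses. This indicates that current LLMs remain substantially limited in reasoning about toxicity. Conclusion: Although LLMs can generate plausible toxicity explanations, they handle complex relations and reasoning poorly, producing inconsistent and nonsensical responses. The authors propose a theoretically grounded, multi-dimensional criterion, Human-Aligned Faithfulness (HAF), and develop six automated metrics based on uncertainty quantification to evaluate LLMs' toxicity explanations comprehensively. Abstract: The discourse around toxicity and LLMs in NLP largely revolves around detection tasks. This work shifts the focus to evaluating LLMs' reasoning about toxicity -- from their explanations that justify a stance -- to enhance their trustworthiness in downstream tasks. Despite extensive research on explainability, it is not straightforward to adopt existing methods to evaluate free-form toxicity explanation due to their over-reliance on input text perturbations, among other challenges. To account for these, we propose a novel, theoretically-grounded multi-dimensional criterion, Human-Aligned Faithfulness (HAF), that measures the extent to which LLMs' free-form toxicity explanations align with those of a rational human under ideal conditions. We develop six metrics, based on uncertainty quantification, to comprehensively evaluate HAF of LLMs' toxicity explanations with no human involvement, and highlight how "non-ideal" the explanations are. We conduct several experiments on three Llama models (of size up to 70B) and an 8B Ministral model on five diverse toxicity datasets. Our results show that while LLMs generate plausible explanations to simple prompts, their reasoning about toxicity breaks down when prompted about the nuanced relations between the complete set of reasons, the individual reasons, and their toxicity stances, resulting in inconsistent and nonsensical responses. We open-source our code and LLM-generated explanations at https://github.com/uofthcdslab/HAF.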

[10] Enhanced Hybrid Transducer and Attention Encoder Decoder with Text Data

Yun Tang,Eesung Kim,Vijendra Raj Apsingekar

Main category: cs.CL

TL;DR: This paper proposes J-TAED, a new modeling approach for speech recognition that trains on both speech and text and significantly improves recognition accuracy even when domain-specific speech data is unavailable.

Details Motivation: To alleviate data scarcity across domains in speech recognition and improve recognition accuracy. Method: The authors propose a joint speech and text optimization method for hybrid transducer and attention-based encoder decoder (TAED) modeling, which trains on speech and text inputs together but takes only speech as input during inference. Result: On Librispeech, J-TAED reduces word error rate (WER) by 5.8% to 12.8%; on two out-of-domain datasets (finance and named-entity focused), text-based domain adaptation reduces WER by 15.3% and 17.8%, respectively. Conclusion: J-TAED integrates speech and linguistic information into one model and significantly improves recognition accuracy through text-based domain adaptation. Abstract: A joint speech and text optimization method is proposed for hybrid transducer and attention-based encoder decoder (TAED) modeling to leverage large text corpora and enhance ASR accuracy. The joint TAED (J-TAED) is trained with both speech and text input modalities together, while it only takes speech data as input during inference. The trained model can unify the internal representations from different modalities, and be further extended to text-based domain adaptation. It can effectively alleviate data scarcity for mismatched-domain tasks since no speech data is required. Our experiments show J-TAED successfully integrates speech and linguistic information into one model, and reduces the WER by 5.8% to 12.8% on the Librispeech dataset. The model is also evaluated on two out-of-domain datasets: one finance-focused and the other named-entity-focused. The text-based domain adaptation brings 15.3% and 17.8% WER reduction on those two datasets respectively.

[11] Prompt, Translate, Fine-Tune, Re-Initialize, or Instruction-Tune? Adapting LLMs for In-Context Learning in Low-Resource Languages

Christopher Toukmaji,Jeffrey Flanigan

Main category: cs.CL

TL;DR: This paper studies in-context learning in low-resource languages and compares the effectiveness of several adaptation techniques.

Details Motivation: LLMs are typically trained on high-resource languages, and tasks in low-resource languages underperform, so the authors study how to adapt LLMs cross-lingually to improve in-context learning in low-resource languages. Method: The authors run a comprehensive study spanning five target languages, three base LLMs, and seven downstream tasks, and design a new metric, Valid Output Recall (VOR), to analyze model outputs. Result: Few-shot prompting and translate-test settings tend to heavily outperform gradient-based adaptation methods; analysis with the new metric attributes the degradation of the trained models to catastrophic forgetting. Conclusion: This is among the largest studies to date of in-context learning for low-resource languages, and the authors release all datasets and trained models. Abstract: LLMs are typically trained in high-resource languages, and tasks in lower-resourced languages tend to underperform the higher-resource language counterparts for in-context learning. Despite the large body of work on prompting settings, it is still unclear how LLMs should be adapted cross-lingually specifically for in-context learning in the low-resource target languages. We perform a comprehensive study spanning five diverse target languages, three base LLMs, and seven downstream tasks spanning over 4,100 GPU training hours (9,900+ TFLOPs) across various adaptation techniques: few-shot prompting, translate-test, fine-tuning, embedding re-initialization, and instruction fine-tuning. Our results show that the few-shot prompting and translate-test settings tend to heavily outperform the gradient-based adaptation methods. To better understand this discrepancy, we design a novel metric, Valid Output Recall (VOR), and analyze model outputs to empirically attribute the degradation of these trained models to catastrophic forgetting. To the best of our knowledge, this is the largest study done on in-context learning for low-resource languages with respect to train compute and number of adaptation techniques considered. We make all our datasets and trained models available for public use.

[12] Augmenting Multi-Agent Communication with State Delta Trajectory

Yichen Tang,Weihang Su,Yujia Zhou,Yiqun Liu,Min Zhang,Shaoping Ma,Qingyao Ai

Main category: cs.CL

TL;DR: This paper proposes State Delta Encoding (SDE), a new communication protocol for LLM-based multi-agent systems that improves performance by better capturing and transferring hidden reasoning information.

Details Motivation: Natural language communication in LLM-based multi-agent systems leads to information loss, especially for abstract reasoning. This work aims to reduce such loss. Method: The authors propose a communication protocol that transfers both natural language tokens and the token-wise state transition trajectory from one agent to another, using State Delta Encoding (SDE) to represent the trajectory. Result: Multi-agent systems using SDE achieved SOTA performance compared to other communication protocols, especially in complex reasoning tasks. Conclusion: The proposed State Delta Encoding (SDE) method enhances the communication in LLM-based multi-agent systems, particularly for complex reasoning tasks. Abstract: Multi-agent techniques such as role playing or multi-turn debates have been shown to be effective in improving the performance of large language models (LLMs) in downstream tasks. Despite their differences in workflows, existing LLM-based multi-agent systems mostly use natural language for agent communication. While this is appealing for its simplicity and interpretability, it also introduces inevitable information loss as one model must downsample its continuous state vectors to concrete tokens before transferring them to the other model. Such losses are particularly significant when the information to transfer is not simple facts, but reasoning logics or abstractive thoughts. To tackle this problem, we propose a new communication protocol that transfers both natural language tokens and token-wise state transition trajectory from one agent to another. Particularly, compared to the actual state value, we find that the sequence of state changes in LLMs after generating each token can better reflect the information hidden behind the inference process, so we propose a State Delta Encoding (SDE) method to represent state transition trajectories. The experimental results show that multi-agent systems with SDE achieve SOTA performance compared to other communication protocols, particularly in tasks that involve complex reasoning. This shows the potential of communication augmentation for LLM-based multi-agent systems.
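
The core quantity behind SDE, the change in state after each generated token, can be shown with a few lines of differencing. This is only a sketch of that one step under stated assumptions: the abstract does not say which layer's state is used, how the deltas are encoded, or how the receiving agent consumes them, and the function name and toy vectors below are hypothetical.

```python
def state_deltas(states: list[list[float]]) -> list[list[float]]:
    """Return the element-wise change between consecutive state vectors.

    states[t] stands in for a model's hidden state after generating token t.
    The result has one fewer entry than the input: delta[t] = states[t + 1] - states[t].
    """
    return [
        [curr_value - prev_value for prev_value, curr_value in zip(prev, curr)]
        for prev, curr in zip(states, states[1:])
    ]


if __name__ == "__main__":
    # Toy 2-dimensional states for three generated tokens.
    trajectory = [[0.0, 1.0], [0.5, 1.0], [0.5, 3.0]]
    print(state_deltas(trajectory))
```

Sending the sequence of changes rather than the raw states matches the abstract's observation that state changes reflect the inference process better than the actual state values.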

[13] Personality Prediction from Life Stories using Language Models

Rasiq Hussain,Jerry Ma,Rithik Khandelwal,Joshua Oltmanns,Mehak Gupta

Main category: cs.CL

TL;DR: This paper proposes a hybrid method combining pretrained transformers and RNNs with attention to better predict personality traits from long narrative interviews, showing improvements over existing long-context models.

Details Motivation: Traditional personality assessments rely on structured questionnaires, which may lack depth. Natural Language Processing (NLP) offers richer insights through open-ended text, especially long narrative interviews. However, modeling such long-context data poses challenges, which this study aims to address. Method: The authors used a two-step approach: (1) sliding-window fine-tuning of pretrained language models to extract contextual embeddings, and (2) applying Recurrent Neural Networks (RNNs) with attention mechanisms to model long-range dependencies. They conducted ablation studies and compared their method with state-of-the-art models like LLaMA and Longformer. Result: The proposed hybrid method improved prediction accuracy, efficiency, and interpretability in modeling long narrative interviews for predicting Five-Factor Model (FFM) personality traits compared to existing long-context models. Conclusion: The study concludes that combining language-based features with long-context modeling can significantly improve personality assessment based on life narratives, offering a more accurate, efficient, and interpretable approach compared to existing methods. Abstract: Natural Language Processing (NLP) offers new avenues for personality assessment by leveraging rich, open-ended text, moving beyond traditional questionnaires. In this study, we address the challenge of modeling long narrative interviews, each exceeding 2000 tokens, to predict Five-Factor Model (FFM) personality traits. We propose a two-step approach: first, we extract contextual embeddings using sliding-window fine-tuning of pretrained language models; then, we apply Recurrent Neural Networks (RNNs) with attention mechanisms to integrate long-range dependencies and enhance interpretability. This hybrid method effectively bridges the strengths of pretrained transformers and sequence modeling to handle long-context data. Through ablation studies and comparisons with state-of-the-art long-context models such as LLaMA and Longformer, we demonstrate improvements in prediction accuracy, efficiency, and interpretability. Our results highlight the potential of combining language-based features with long-context modeling to advance personality assessment from life narratives.
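
The first step of the two-step approach depends on cutting each long interview into overlapping windows that fit a pretrained model's input limit. Below is a minimal sketch of that chunking, assuming a window of 512 tokens and a stride of 256; the abstract gives neither value, and the function name is hypothetical. Encoding each window and feeding the resulting embeddings to the RNN are not shown.

```python
def sliding_windows(tokens, window_size=512, stride=256):
    """Split a token sequence into overlapping windows.

    window_size and stride are illustrative defaults, not values from the paper.
    The final window may be shorter than window_size.
    """
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window_size])
        # Stop once the current window reaches the end of the sequence.
        if start + window_size >= len(tokens):
            break
        start += stride
    return windows


if __name__ == "__main__":
    interview = list(range(2000))  # stand-in for 2000 token IDs
    chunks = sliding_windows(interview)
    print(len(chunks), [len(c) for c in chunks])
```

With a stride smaller than the window, neighbouring windows overlap, so text near a window boundary still appears with context on both sides in at least one window.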

[14] What Matters in LLM-generated Data: Diversity and Its Effect on Model Fine-Tuning

Yuchang Zhu,Zhonghua zhen,Qunshu Lin,Haotong Wei,Xiaolong Sun,Zixuan Yu,Minghao Liu,Zibin Zheng,Liang Chen

Main category: cs.CL

TL;DR: This paper examines how the diversity of data generated by large language models (LLMs) affects downstream model performance, finding that moderately diverse generated data improves performance while highly diverse data hurts it.

Details Motivation: To address the model collapse that arises when downstream models are trained on LLM-generated data, and to highlight data diversity as a key factor in data quality. Method: The authors experimentally study how varying levels of diversity in LLM-generated data affect downstream model performance, and how models perform when trained on data that mixes in different proportions of LLM-generated data. Result: With minimal distribution shift, moderately diverse LLM-generated data enhances model performance, whereas highly diverse generated data has a negative impact. Conclusion: Moderately diverse LLM-generated data can improve model performance when labeled data is insufficient, but highly diverse generated data has a negative effect. Abstract: With the remarkable generative capabilities of large language models (LLMs), using LLM-generated data to train downstream models has emerged as a promising approach to mitigate data scarcity in specific domains and reduce time-consuming annotations. However, recent studies have highlighted a critical issue: iterative training on self-generated data results in model collapse, where model performance degrades over time. Despite extensive research on the implications of LLM-generated data, these works often neglect the importance of data diversity, a key factor in data quality. In this work, we aim to understand the implications of the diversity of LLM-generated data on downstream model performance. Specifically, we explore how varying levels of diversity in LLM-generated data affect downstream model performance. Additionally, we investigate the performance of models trained on data that mixes different proportions of LLM-generated data, which we refer to as synthetic data. Our experimental results show that, with minimal distribution shift, moderately diverse LLM-generated data can enhance model performance in scenarios with insufficient labeled data, whereas highly diverse generated data has a negative impact. We hope our empirical findings will offer valuable guidance for future studies on LLMs as data generators.

[15] EmoStage: A Framework for Accurate Empathetic Response Generation via Perspective-Taking and Phase Recognition

Zhiyang Qi,Keiko Takamizo,Mariko Ukiyo,Michimasa Inaba

Main category: cs.CL

TL;DR: The paper proposes EmoStage, a framework for improving AI-driven counseling systems.

Details Motivation: Current AI counseling systems face challenges in understanding clients' psychological states, reliance on high-quality training data, and privacy concerns. Method: The framework introduces perspective-taking and phase recognition, leveraging the inference capabilities of open-source LLMs to generate emotionally resonant responses. Result: Experiments in Japanese and Chinese counseling settings show that EmoStage improves the quality of responses generated by base models and performs competitively with data-driven methods. Conclusion: EmoStage offers an effective approach for AI-driven counseling systems, enhancing empathetic response generation without additional training data. Abstract: The rising demand for mental health care has fueled interest in AI-driven counseling systems. While large language models (LLMs) offer significant potential, current approaches face challenges, including limited understanding of clients' psychological states and counseling stages, reliance on high-quality training data, and privacy concerns associated with commercial deployment. To address these issues, we propose EmoStage, a framework that enhances empathetic response generation by leveraging the inference capabilities of open-source LLMs without additional training data. Our framework introduces perspective-taking to infer clients' psychological states and support needs, enabling the generation of emotionally resonant responses. In addition, phase recognition is incorporated to ensure alignment with the counseling process and to prevent contextually inappropriate or inopportune responses. Experiments conducted in both Japanese and Chinese counseling settings demonstrate that EmoStage improves the quality of responses generated by base models and performs competitively with data-driven methods.

[16] JCAPT: A Joint Modeling Approach for CAPT

Tzu-Hsuan Yang,Yue-Yang He,Berlin Chen

Main category: cs.CL

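TL;DR: This paper proposes a new unified framework that combines the Mamba model, phonological features, and think token strategies, significantly improving CAPT performance on pronunciation assessment and error diagnosis.

Details Motivation: Effective pronunciation feedback is critical in second language learning, and jointly modeling the APA and MDD tasks in CAPT systems can yield mutual benefits. Improving interpretability and fine-grained temporal reasoning is also a key need. Method: The authors use Mamba, a selective state space model (SSM), together with phonological features and think token strategies for joint modeling, and validate the approach on the speechocean762 benchmark. Result: The proposed model outperforms prior methods on speechocean762, particularly on the MDD task. Conclusion: The study proposes a unified Mamba-based framework that, combined with phonological features and think token strategies, improves automatic pronunciation assessment (APA) and mispronunciation detection and diagnosis (MDD) in computer-assisted pronunciation training (CAPT). Abstract: Effective pronunciation feedback is critical in second language (L2) learning, for which computer-assisted pronunciation training (CAPT) systems often encompass two key tasks: automatic pronunciation assessment (APA) and mispronunciation detection and diagnosis (MDD). Recent work has shown that joint modeling of these two tasks can yield mutual benefits. Our unified framework leverages Mamba, a selective state space model (SSM), while integrating phonological features and think token strategies to jointly enhance interpretability and fine-grained temporal reasoning in APA and MDD. To our knowledge, this is the first study to combine phonological attribution, SSM-based modeling, and prompting in CAPT. A series of experiments conducted on the speechocean762 benchmark demonstrate that our model consistently outperforms prior methods, particularly on the MDD task.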

[17] Spotting Out-of-Character Behavior: Atomic-Level Evaluation of Persona Fidelity in Open-Ended Generation

Jisu Shin,Juhyun Oh,Eunsu Kim,Hoyun Song,Alice Oh

Main category: cs.CL

TL;DR: This paper proposes an atomic-level evaluation framework to detect subtle persona inconsistencies in large language models, showing improved precision over existing methods.

Details Motivation: Existing evaluation methods for persona fidelity in LLMs assign single scores to entire responses and struggle to capture subtle deviations in long-form text, necessitating a more granular approach. Method: The authors propose an atomic-level evaluation framework with three key metrics to quantify persona fidelity at a finer granularity, enabling detection of subtle persona misalignments in long-form text generation. Result: The experiments demonstrate that the proposed framework identifies persona inconsistencies overlooked by prior methods and reveals how task structure and persona desirability affect model adaptability. Conclusion: The paper concludes that the proposed atomic-level evaluation framework effectively detects persona inconsistencies in LLMs, offering a more precise assessment of persona fidelity by identifying subtle deviations. Abstract: Ensuring persona fidelity in large language models (LLMs) is essential for maintaining coherent and engaging human-AI interactions. However, LLMs often exhibit Out-of-Character (OOC) behavior, where generated responses deviate from an assigned persona, leading to inconsistencies that affect model reliability. Existing evaluation methods typically assign single scores to entire responses, struggling to capture subtle persona misalignment, particularly in long-form text generation. To address this limitation, we propose an atomic-level evaluation framework that quantifies persona fidelity at a finer granularity. Our three key metrics measure the degree of persona alignment and consistency within and across generations. Our approach enables a more precise and realistic assessment of persona fidelity by identifying subtle deviations that real users would encounter. Through our experiments, we demonstrate that our framework effectively detects persona inconsistencies that prior methods overlook. 
By analyzing persona fidelity across diverse tasks and personality types, we reveal how task structure and persona desirability influence model adaptability, highlighting challenges in maintaining consistent persona expression.

[18] Measuring and Guiding Monosemanticity

Ruben Härle,Felix Friedrich,Manuel Brack,Stephan Wäldchen,Björn Deiseroth,Patrick Schramowski,Kristian Kersting

Main category: cs.CL

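TL;DR: This paper proposes Guided Sparse Autoencoders (G-SAE), which condition latent representations on labeled concepts during training to improve the interpretability and controllability of the internal feature representations of large language models.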

Details Motivation: Existing methods face fundamental challenges in reliably localizing and manipulating feature representations, and Sparse Autoencoders (SAEs), although promising for feature extraction at scale, are limited by incomplete feature isolation and unreliable monosemanticity. Method: Proposes Guided Sparse Autoencoders (G-SAE), a new method that conditions latent representations on labeled concepts during training, and introduces the Feature Monosemanticity Score (FMS) to quantify feature monosemanticity in latent representations. Result: Evaluations on toxicity detection, writing style identification, and privacy attribute recognition show that G-SAE improves the localization and disentanglement of target concepts in the latent space, which in turn improves interpretability, behavior detection, and control. Conclusion: G-SAE not only enhances feature monosemanticity but also enables more effective and fine-grained control, providing actionable guidelines for improving the mechanistic interpretability and control of large language models. Abstract: There is growing interest in leveraging mechanistic interpretability and controllability to better understand and influence the internal dynamics of large language models (LLMs). However, current methods face fundamental challenges in reliably localizing and manipulating feature representations. Sparse Autoencoders (SAEs) have recently emerged as a promising direction for feature extraction at scale, yet they, too, are limited by incomplete feature isolation and unreliable monosemanticity. To systematically quantify these limitations, we introduce Feature Monosemanticity Score (FMS), a novel metric to quantify feature monosemanticity in latent representation. Building on these insights, we propose Guided Sparse Autoencoders (G-SAE), a method that conditions latent representations on labeled concepts during training. We demonstrate that reliable localization and disentanglement of target concepts within the latent space improve interpretability, detection of behavior, and control. Specifically, our evaluations on toxicity detection, writing style identification, and privacy attribute recognition show that G-SAE not only enhances monosemanticity but also enables more effective and fine-grained steering with less quality degradation. Our findings provide actionable guidelines for measuring and advancing mechanistic interpretability and control of LLMs.

[19] Automated Detection of Pre-training Text in Black-box LLMs

Ruihan Hu,Yu-Ming Shang,Jiankun Peng,Wei Luo,Yazhe Wang,Xi Zhang

Main category: cs.CL

TL;DR: VeilProbe is an automated framework for detecting whether a text was part of an LLM's pre-training data in a black-box setting, offering better performance than current approaches.

Details Motivation: Existing methods for detecting LLM pre-training texts rely on hidden information or require manual efforts, making them unsuitable for black-box settings. Method: VeilProbe uses a sequence-to-sequence mapping model to infer latent features and performs token perturbations for distinguishable features. A prototype-based classifier is used to handle limited training samples. Result: Evaluations on three datasets show that VeilProbe performs well and is superior in the black-box setting. Conclusion: VeilProbe effectively detects LLMs' pre-training texts in a black-box setting and outperforms existing methods. Abstract: Detecting whether a given text is a member of the pre-training data of Large Language Models (LLMs) is crucial for ensuring data privacy and copyright protection. Most existing methods rely on the LLM's hidden information (e.g., model parameters or token probabilities), making them ineffective in the black-box setting, where only input and output texts are accessible. Although some methods have been proposed for the black-box setting, they rely on massive manual efforts such as designing complicated questions or instructions. To address these issues, we propose VeilProbe, the first framework for automatically detecting LLMs' pre-training texts in a black-box setting without human intervention. VeilProbe utilizes a sequence-to-sequence mapping model to infer the latent mapping feature between the input text and the corresponding output suffix generated by the LLM. Then it performs the key token perturbations to obtain more distinguishable membership features. Additionally, considering real-world scenarios where the ground-truth training text samples are limited, a prototype-based membership classifier is introduced to alleviate the overfitting issue. Extensive evaluations on three widely used datasets demonstrate that our framework is effective and superior in the black-box setting.
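The prototype-based classifier mentioned above can be illustrated with a generic nearest-prototype sketch: each class is summarised by the mean of its few training vectors, and a new sample takes the label of the closest mean. This is only a minimal illustration under assumptions; the feature vectors, the function names `build_prototypes` and `classify`, and the Euclidean distance are hypothetical choices here, not VeilProbe's actual features or classifier.

```python
import math


def build_prototypes(features, labels):
    """Average the feature vectors of each class into one prototype per class."""
    sums, counts = {}, {}
    for vec, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [a + b for a, b in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [v / counts[label] for v in total] for label, total in sums.items()}


def classify(vec, prototypes):
    """Return the label whose prototype is nearest in Euclidean distance (assumed metric)."""
    return min(prototypes, key=lambda label: math.dist(vec, prototypes[label]))


# Hypothetical 2-D membership features for four labelled samples.
prototypes = build_prototypes(
    [[0, 0], [0, 2], [10, 10], [10, 12]],
    ["non-member", "non-member", "member", "member"],
)
prediction = classify([9, 9], prototypes)
```

Because each class is reduced to a single mean vector, the model has very few free parameters, which is one common reason such classifiers are used when labelled samples are scarce.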

[20] Learning to Disentangle Latent Reasoning Rules with Language VAEs: A Systematic Study

Yingji Zhang,Marco Valentino,Danilo S. Carvalho,André Freitas

Main category: cs.CL

TL;DR: This paper explores embedding reasoning rules into language models via VAEs, showing improved interpretability and controllability with disentangled representations and prior knowledge integration.

Details Motivation: Current LMs rely on memorization rather than rule-based inference; explicit reasoning rules can improve generalization, interpretability, and control. Method: The paper proposes a pipeline using Transformer-based Language VAEs to embed reasoning rules, involving three rule-based tasks, theoretical framework, and architecture. Result: Findings include successful disentanglement of reasoning rules in encoder space, effective prior knowledge injection via Query, and identification of FFN layers as better at preserving rule separation than attention layers. Conclusion: Incorporating reasoning rules into language models using VAEs enhances disentanglement, prior knowledge integration, and performance optimization. Abstract: Incorporating explicit reasoning rules within the latent space of language models (LMs) offers a promising pathway to enhance generalisation, interpretability, and controllability. While current Transformer-based language models have shown strong performance on Natural Language Inference (NLI) tasks, they often rely on memorisation rather than rule-based inference. This work investigates how reasoning rules can be explicitly embedded and memorised within the LMs through Language Variational Autoencoders (VAEs). We propose a complete pipeline for learning reasoning rules within Transformer-based language VAEs. This pipeline encompasses three rule-based reasoning tasks, a supporting theoretical framework, and a practical end-to-end architecture. The experiment illustrates the following findings: Disentangled reasoning: Under explicit signal supervision, reasoning rules - viewed as functional mappings - can be disentangled within the encoder's parametric space. This separation results in distinct clustering of rules in the output feature space. Prior knowledge injection: injecting reasoning information into the Query enables the model to more effectively retrieve the stored value Value from memory based on Key. 
This approach offers a simple method for integrating prior knowledge into decoder-only language models. Performance bottleneck: In mathematical reasoning tasks using Qwen2.5 (0.5B), increasing the sample count does not improve performance beyond a certain point. Moreover, FFN layers are better than attention layers at preserving the separation of reasoning rules in the model's parameters.

[21] Can Large Language Models Capture Human Annotator Disagreements?

Jingwei Ni,Yu Fan,Vilém Zouhar,Donya Rooein,Alexander Hoyle,Mrinmaya Sachan,Markus Leippold,Dirk Hovy,Elliott Ash

Main category: cs.CL

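TL;DR: The study finds that although large language models are often used to reduce human annotation effort, they capture human annotation variation (such as disagreements) poorly, and some performance-boosting methods actually make this worse.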

Details Motivation: Human annotation variation in NLP reflects important information such as task subjectivity and sample ambiguity, yet evaluations of large language models for automatic annotation tend to focus on predicting the majority-voted "ground truth" labels, and their ability to capture annotation variation has been little studied. Method: Extensively evaluates LLMs' ability to predict annotation disagreements without access to repeated human labels, and examines how RLVR-style reasoning affects that prediction. Result: LLMs struggle to model disagreements, a weakness that majority-label-based evaluations can overlook; moreover, RLVR-style reasoning, which generally boosts LLM performance, degrades performance in disagreement prediction. Conclusion: When large language models are used for annotation, their ability to model disagreements needs particular attention, because evaluations based only on majority labels can miss these problems. Abstract: Human annotation variation (i.e., annotation disagreements) is common in NLP and often reflects important information such as task subjectivity and sample ambiguity. While Large Language Models (LLMs) are increasingly used for automatic annotation to reduce human effort, their evaluation often focuses on predicting the majority-voted "ground truth" labels. It is still unclear, however, whether these models also capture informative human annotation variation. Our work addresses this gap by extensively evaluating LLMs' ability to predict annotation disagreements without access to repeated human labels. Our results show that LLMs struggle with modeling disagreements, which can be overlooked by majority label-based evaluations. Notably, while RLVR-style (Reinforcement learning with verifiable rewards) reasoning generally boosts LLM performance, it degrades performance in disagreement prediction. Our findings highlight the critical need for evaluating and improving LLM annotators in disagreement modeling. Code and data at https://github.com/EdisonNi-hku/Disagreement_Prediction.
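To make the contrast between a majority label and annotation disagreement concrete, the sketch below computes, for one item, the empirical label distribution and its Shannon entropy from repeated human labels. The function names and the choice of entropy as the disagreement measure are assumptions for illustration, not the evaluation protocol used in the paper.

```python
import math
from collections import Counter


def label_distribution(labels):
    """Empirical distribution over labels given by several annotators for one item."""
    if not labels:
        raise ValueError("at least one label is required")
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}


def disagreement_entropy(labels):
    """Shannon entropy in bits: 0 when all annotators agree, larger when they split."""
    dist = label_distribution(labels)
    return -sum(p * math.log2(p) for p in dist.values())


def majority_label(labels):
    """The majority-voted label, which discards the spread measured above."""
    return Counter(labels).most_common(1)[0][0]


# Two hypothetical items with the same majority label but different disagreement.
unanimous = ["toxic", "toxic", "toxic", "toxic"]
contested = ["toxic", "toxic", "toxic", "fine", "fine"]
```

Both example items share the majority label `toxic`, yet only the second has non-zero entropy, which is the kind of information a majority-label evaluation cannot see.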

[22] MuBench: Assessment of Multilingual Capabilities of Large Language Models Across 61 Languages

Wenhan Han,Yifan Zhang,Zhixun Chen,Binbin Liu,Haobin Lin,Bingni Zhang,Taifeng Wang,Mykola Pechenizkiy,Meng Fang,Yin Zheng

Main category: cs.CL

TL;DR: This paper introduces MuBench, a comprehensive benchmark for evaluating multilingual large language models across 61 languages. It highlights gaps in language coverage and proposes a new metric called Multilingual Consistency to analyze performance bottlenecks.

Details Motivation: The motivation is to address the lack of comprehensive and aligned evaluation datasets for multilingual large language models, which leads to fragmented assessments of their capabilities across languages. Method: The authors introduce MuBench, a benchmark covering 61 languages to evaluate multilingual LLMs. They assess several state-of-the-art models and propose Multilingual Consistency as a complementary metric to accuracy. Additionally, they pretrain models with varying language ratios and parallel data proportions to study cross-lingual transfer dynamics. Result: The results indicate discrepancies in language coverage claims versus actual performance, especially for low-resource languages compared to English. Using MuBench, the authors demonstrate how Multilingual Consistency can be used to identify performance issues and guide improvements in multilingual LLMs. Conclusion: The paper concludes that there are notable gaps between claimed and actual language coverage in multilingual LLMs, particularly between English and low-resource languages. The authors also show the effectiveness of Multilingual Consistency as a metric for analyzing performance bottlenecks. Abstract: Multilingual large language models (LLMs) are advancing rapidly, with new models frequently claiming support for an increasing number of languages. However, existing evaluation datasets are limited and lack cross-lingual alignment, leaving assessments of multilingual capabilities fragmented in both language and skill coverage. To address this, we introduce MuBench, a benchmark covering 61 languages and evaluating a broad range of capabilities. We evaluate several state-of-the-art multilingual LLMs and find notable gaps between claimed and actual language coverage, particularly a persistent performance disparity between English and low-resource languages. 
Leveraging MuBench's alignment, we propose Multilingual Consistency (MLC) as a complementary metric to accuracy for analyzing performance bottlenecks and guiding model improvement. Finally, we pretrain a suite of 1.2B-parameter models on English and Chinese with 500B tokens, varying language ratios and parallel data proportions to investigate cross-lingual transfer dynamics.

[23] Commonsense Generation and Evaluation for Dialogue Systems using Large Language Models

Marcos Estecha-Garitagoitia,Chen Zhang,Mario Rodríguez-Cantelar,Luis Fernando D'Haro

Main category: cs.CL

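TL;DR: This paper studies how to use large language models (LLMs) for dialogue data augmentation based on commonsense relations, and proposes corresponding methods for automatic generation and evaluation.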

Details Motivation: To explore turn-level data augmentation for dialogue systems based on different types of commonsense relationships, along with the automatic evaluation of the generated synthetic turns. Method: Proposes a Chain-of-Thought (CoT)-inspired approach that uses the extended knowledge and zero-shot capabilities of pretrained large language models (LLMs) to generate dialogue data augmentation conditioned on commonsense attributes, with automatic quality checks performed through instruction-based prompts. Result: 200 partial dialogues were randomly selected from 5 well-known dialogue datasets, and alternative responses were generated conditioned on different event commonsense attributes; an evaluation framework inspired by the ACCENT metric is also proposed to automatically assess the quality of the generated data. Conclusion: Preliminary results suggest that the approach effectively harnesses LLMs' capabilities for commonsense reasoning and evaluation in dialogue systems. Abstract: This paper provides preliminary results on exploring the task of performing turn-level data augmentation for dialogue systems based on different types of commonsense relationships, and the automatic evaluation of the generated synthetic turns. The proposed methodology takes advantage of the extended knowledge and zero-shot capabilities of pretrained Large Language Models (LLMs) to follow instructions, understand contextual information, and their commonsense reasoning capabilities. The approach draws inspiration from methodologies like Chain-of-Thought (CoT), applied more explicitly to the task of prompt-based generation for dialogue-based data augmentation conditioned on commonsense attributes, and the automatic evaluation of the generated dialogues. To assess the effectiveness of the proposed approach, we first extracted 200 randomly selected partial dialogues from 5 different well-known dialogue datasets and generated alternative responses conditioned on different event commonsense attributes. This novel dataset allows us to measure the proficiency of LLMs in generating contextually relevant commonsense knowledge, particularly up to 12 different specific ATOMIC [10] database relations. Secondly, we propose an evaluation framework to automatically detect the quality of the generated dataset inspired by the ACCENT [26] metric, which offers a nuanced approach to assess event commonsense. However, our method does not follow ACCENT's complex event-relation tuple extraction process. 
Instead, we propose an instruction-based prompt for each commonsense attribute and use state-of-the-art LLMs to automatically detect the original attributes used when creating each augmented turn in the previous step. Preliminary results suggest that our approach effectively harnesses LLMs' capabilities for commonsense reasoning and evaluation in dialogue systems.

[24] Dialogic Pedagogy for Large Language Models: Aligning Conversational AI with Proven Theories of Learning

Russell Beale

Main category: cs.CL

TL;DR: This paper reviews how large language models (LLMs) are being used in education, particularly in higher education, and examines how they align with established pedagogical theories like Vygotsky’s scaffolding, the Socratic method, and Laurillard’s conversational framework. It identifies gaps in applying these theories to LLMs and proposes strategies—such as improved prompting and retrieval-augmented generation—to make AI-driven dialogue more educationally effective.

Details Motivation: The motivation for this study stems from the rapid transformation of education by LLMs and the need to ensure their integration aligns with established pedagogical theories. The authors seek to understand how LLM-based conversational agents can support or challenge traditional educational approaches and provide practical strategies for improvement. Method: The authors synthesize existing literature on LLMs in education alongside theories of conversational and dialogic pedagogy, including Vygotsky's sociocultural learning theory, the Socratic method, and Laurillard's conversational framework. They examine how prompting strategies and retrieval-augmented generation (RAG) can align LLM behaviors with these theories. Result: The paper maps educational theories to LLM capabilities, identifying both alignments and gaps. It highlights how LLMs can enable personalized and adaptive learning but also points out challenges such as the tendency of LLMs to provide direct answers rather than foster co-constructed knowledge and the limitations of non-human expertise in tutoring. Conclusion: The paper concludes that while LLMs have the potential to support established learning principles through personalized, adaptive dialogue, there are notable gaps in applying prior pedagogical theories to LLMs. The authors propose strategies to align LLM interactions more closely with sound pedagogy and aim to bridge the gap between educational theory and AI-driven conversational learning. Abstract: Large Language Models (LLMs) are rapidly transforming education by enabling rich conversational learning experiences. This article provides a comprehensive review of how LLM-based conversational agents are being used in higher education, with extensions to secondary and lifelong learning contexts. 
We synthesize existing literature on LLMs in education and theories of conversational and dialogic pedagogy - including Vygotsky's sociocultural learning (scaffolding and the Zone of Proximal Development), the Socratic method, and Laurillard's conversational framework - and examine how prompting strategies and retrieval-augmented generation (RAG) can align LLM behaviors with these pedagogical theories, and how they can support personalized, adaptive learning. We map educational theories to LLM capabilities, highlighting where LLM-driven dialogue supports established learning principles and where it challenges or falls short of traditional pedagogical assumptions. Notable gaps in applying prior theories to LLMs are identified, such as the models' tendency to provide direct answers instead of fostering co-construction of knowledge, and the need to account for the constant availability and broad but non-human expertise of LLM tutors. In response, we propose practical strategies to better align LLM interactions with sound pedagogy - for example, designing prompts that encourage Socratic questioning, scaffolded guidance, and student reflection, as well as integrating retrieval mechanisms to ensure accuracy and contextual relevance. Our aim is to bridge the gap between educational theory and the emerging practice of AI-driven conversational learning, offering insights and tools for making LLM-based dialogues more educationally productive and theory-aligned.

[25] Is Long-to-Short a Free Lunch? Investigating Inconsistency and Reasoning Efficiency in LRMs

Shu Yang,Junchao Wu,Xuansheng Wu,Derek Wong,Ninhao Liu,Di Wang

Main category: cs.CL

TL;DR: Efficient reasoning methods improve token-level efficiency in Large Reasoning Models but risk increased behavioral inconsistencies, as shown by the $ICBENCH$ benchmark analysis.

Details Motivation: To determine whether efficient reasoning strategies introduce behavioral inconsistencies and reduce model robustness by compressing reasoning steps. Method: The study introduces a benchmark called $ICBENCH$ to evaluate inconsistency in Large Reasoning Models across three dimensions: inconsistency across task settings, between training objectives and learned behavior, and between internal reasoning and self-explanations. Result: The application of $ICBENCH$ revealed that while larger models are generally more consistent, all models display behaviors like self-disagreement and post-hoc rationalization. Efficient reasoning strategies were found to consistently increase all types of inconsistency. Conclusion: Efficient reasoning strategies enhance token-level efficiency but may increase behavioral inconsistencies and risks of models evading supervision. Abstract: Large Reasoning Models (LRMs) have achieved remarkable performance on complex tasks by engaging in extended reasoning before producing final answers, yet this strength introduces the risk of overthinking, where excessive token generation occurs even for simple tasks. While recent work in efficient reasoning seeks to reduce reasoning length while preserving accuracy, it remains unclear whether such optimization is truly a free lunch. Drawing on the intuition that compressing reasoning may reduce the robustness of model responses and lead models to omit key reasoning steps, we investigate whether efficient reasoning strategies introduce behavioral inconsistencies. To systematically assess this, we introduce $ICBENCH$, a benchmark designed to measure inconsistency in LRMs across three dimensions: inconsistency across task settings (ITS), inconsistency between training objectives and learned behavior (TR-LB), and inconsistency between internal reasoning and self-explanations (IR-SE). 
Applying $ICBENCH$ to a range of open-source LRMs, we find that while larger models generally exhibit greater consistency than smaller ones, they all display widespread "scheming" behaviors, including self-disagreement, post-hoc rationalization, and the withholding of reasoning cues. Crucially, our results demonstrate that efficient reasoning strategies such as No-Thinking and Simple Token-Budget consistently increase all three defined types of inconsistency. These findings suggest that although efficient reasoning enhances token-level efficiency, further investigation is imperative to ascertain whether it concurrently introduces the risk of models evading effective supervision.

[26] AnTKV: Anchor Token-Aware Sub-Bit Vector Quantization for KV Cache in Large Language Models

Zeyu Li,Chuanfu Xiao,Yang Wang,Xiang Liu,Zhenheng Tang,Baotong Lu,Mao Yang,Xinyu Chen,Xiaowen Chu

Main category: cs.CL

TL;DR: AnTKV reduces memory usage and improves efficiency in large language models by prioritizing precision for critical tokens during quantization.

Details Motivation: Ultra-low-bit quantization of KV cache in LLMs causes performance degradation, requiring a solution to minimize accuracy loss. Method: Proposed Anchor Score (AnS) to measure token sensitivity and developed AnTKV using Vector Quantization and a Triton kernel for fast Anchor Token selection. Result: AnTKV allows LLaMA-3-8B to handle up to 840K tokens with improved decoding throughput and achieves lower perplexity on Mistral-7B at 1-bit and 0.375-bit quantization. Conclusion: AnTKV enables efficient KV cache compression while maintaining accuracy, outperforming existing methods in ultra-low-bit quantization settings. Abstract: Quantization has emerged as an effective and lightweight solution to reduce the memory footprint of the KV cache in Large Language Models (LLMs). Nevertheless, minimizing the performance degradation caused by ultra-low-bit KV cache quantization remains a significant challenge. We observe that quantizing the KV cache of different tokens has varying impacts on the quality of attention outputs. To systematically investigate this phenomenon, we perform forward error propagation analysis on attention and propose the Anchor Score (AnS) that quantifies the sensitivity of each token's KV cache to quantization-induced error. Our analysis reveals significant disparities in AnS across tokens, suggesting that preserving a small subset with full precision (FP16) of high-AnS tokens can greatly mitigate accuracy loss in aggressive quantization scenarios. Based on this insight, we introduce AnTKV, a novel framework that leverages Anchor Token-aware Vector Quantization to compress the KV cache. Furthermore, to support efficient deployment, we design and develop a triton kernel that is fully compatible with FlashAttention, enabling fast online Anchor Token selection. 
AnTKV enables LLaMA-3-8B to handle context lengths up to 840K tokens on a single 80GB A100 GPU, while achieving up to 3.5x higher decoding throughput compared to the FP16 baseline. Our experiment results demonstrate that AnTKV matches or outperforms prior works such as KIVI, SKVQ, KVQuant, and CQ under 4-bit settings. More importantly, AnTKV achieves significantly lower perplexity under ultra-low-bit quantization on Mistral-7B, with only 6.32 at 1-bit and 8.87 at 0.375-bit, compared to the FP16 baseline of 4.73.
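The core idea of keeping a few sensitive tokens at full precision while compressing the rest can be sketched in a few lines. This is a simplified illustration under stated assumptions: the per-token scores are supplied as inputs (standing in for the Anchor Score, whose formula is not given here), and a plain uniform scalar quantizer replaces the paper's vector quantization. The names `quantize_uniform` and `compress_kv` are hypothetical.

```python
def quantize_uniform(vec, bits):
    """Snap each value to one of 2**bits evenly spaced levels across the vector's range."""
    lo, hi = min(vec), max(vec)
    if hi == lo:
        return list(vec)  # constant vector: nothing to quantize
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((x - lo) / step) * step for x in vec]


def compress_kv(cache, scores, num_anchors, bits=2):
    """Keep the num_anchors highest-scoring tokens untouched; quantize all others.

    cache  : list of per-token vectors (a stand-in for one layer's keys or values)
    scores : one sensitivity score per token (assumed given, not computed here)
    """
    if len(cache) != len(scores):
        raise ValueError("cache and scores must have the same length")
    ranked = sorted(range(len(cache)), key=lambda i: scores[i], reverse=True)
    anchors = set(ranked[:num_anchors])
    compressed = [
        list(vec) if i in anchors else quantize_uniform(vec, bits)
        for i, vec in enumerate(cache)
    ]
    return compressed, anchors


# Hypothetical two-token cache; the second token has the higher sensitivity score.
cache = [[0.0, 0.3, 1.0], [0.1, 0.2, 0.9]]
compressed, anchors = compress_kv(cache, scores=[0.1, 5.0], num_anchors=1, bits=1)
```

In this toy run the second token is preserved exactly, while the first is reduced to two levels, which mirrors the trade-off the abstract describes: a small full-precision subset absorbs most of the accuracy loss from aggressive quantization.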

[27] heiDS at ArchEHR-QA 2025: From Fixed-k to Query-dependent-k for Retrieval Augmented Generation

Ashish Chouhan,Michael Gertz

Main category: cs.CL

TL;DR: This paper introduces the heiDS approach for the ArchEHR-QA 2025 task, utilizing a RAG framework with new retrieval strategies like autocut* and elbow, showing improved factual and relevant answer generation over traditional fixed-k methods.

Details Motivation: The motivation is to improve the accuracy and relevance of answers generated from clinical evidence by employing more adaptive retrieval strategies within a RAG framework. Method: A pipeline using a retrieval augmented generation (RAG) framework is designed to generate attributed answers from EHRs. The study explores ranked list truncation (RLT) retrieval strategies and attribution approaches, comparing existing methods with two new methods: autocut* and elbow. Result: The experimental results indicate that the proposed query-dependent-k retrieval strategy enhances the production of factual and relevant answers compared to a fixed-k approach. Conclusion: The paper concludes that the query-dependent-k retrieval strategy, including autocut* and elbow methods, outperforms the fixed top-k RLT retrieval strategy in generating factual and relevant answers. Abstract: This paper presents the approach of our team called heiDS for the ArchEHR-QA 2025 shared task. A pipeline using a retrieval augmented generation (RAG) framework is designed to generate answers that are attributed to clinical evidence from the electronic health records (EHRs) of patients in response to patient-specific questions. We explored various components of a RAG framework, focusing on ranked list truncation (RLT) retrieval strategies and attribution approaches. Instead of using a fixed top-k RLT retrieval strategy, we employ a query-dependent-k retrieval strategy, including the existing surprise and autocut methods and two new methods proposed in this work, autocut* and elbow. The experimental results show the benefits of our strategy in producing factual and relevant answers when compared to a fixed-$k$.
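A query-dependent k can be illustrated with a simple largest-gap rule: instead of always keeping the top k passages, cut the ranked list at the biggest drop between consecutive retrieval scores. This is a generic stand-in written for illustration; it is not the paper's autocut* or elbow method, and the function name `largest_gap_cutoff` and the `min_k` parameter are assumptions.

```python
def largest_gap_cutoff(scores, min_k=1):
    """Choose how many ranked items to keep by cutting at the largest score drop.

    scores must be sorted in descending order. Returns k so that scores[:k] is kept.
    If no drop is found (for example, all scores are equal), every item is kept.
    """
    n = len(scores)
    if n <= min_k:
        return n
    best_k, best_drop = n, 0.0
    for k in range(min_k, n):
        drop = scores[k - 1] - scores[k]
        if drop > best_drop:
            best_drop, best_k = drop, k
    return best_k


# Hypothetical similarity scores for two different queries.
clear_cluster = [0.90, 0.88, 0.85, 0.30, 0.28]
single_hit = [0.90, 0.20, 0.19]
k_for_cluster = largest_gap_cutoff(clear_cluster)
k_for_single = largest_gap_cutoff(single_hit)
```

The two example queries end up with different cutoffs, whereas a fixed top-k would either pass weak passages to the generator for the second query or drop relevant ones for the first.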

[28] Automatic Posology Structuration : What role for LLMs?

Natalia Bobkova,Laura Zanella-Calzada,Anyes Tafoughalt,Raphaël Teboul,François Plesse,Félix Gaschi

Main category: cs.CL

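TL;DR: This study explores how large language models can convert free-text posology into a structured format, and proposes a hybrid pipeline that combines the strengths of NERL and LLMs.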

Details Motivation: Posology instructions in French prescriptions are often ambiguous, irregular, or colloquial, which limits the effectiveness of classic machine learning pipelines. Method: Compares prompt-based methods and fine-tuned LLMs against a system based on Named Entity Recognition and Linking (NERL), and proposes a hybrid pipeline that selects outputs based on confidence scores. Result: Prompting improves performance, but only fine-tuned LLMs match the baseline's accuracy; error analysis shows that NERL offers structural precision while LLMs better handle semantic nuances. Conclusion: The hybrid approach outperforms NERL or an LLM used alone in structuration accuracy, and it has advantages in computational cost and latency. Abstract: Automatically structuring posology instructions is essential for improving medication safety and enabling clinical decision support. In French prescriptions, these instructions are often ambiguous, irregular, or colloquial, limiting the effectiveness of classic ML pipelines. We explore the use of Large Language Models (LLMs) to convert free-text posologies into structured formats, comparing prompt-based methods and fine-tuning against a "pre-LLM" system based on Named Entity Recognition and Linking (NERL). Our results show that while prompting improves performance, only fine-tuned LLMs match the accuracy of the baseline. Through error analysis, we observe complementary strengths: NERL offers structural precision, while LLMs better handle semantic nuances. Based on this, we propose a hybrid pipeline that routes low-confidence cases from NERL (<0.8) to the LLM, selecting outputs based on confidence scores. This strategy achieves 91% structuration accuracy while minimizing latency and compute. Our results show that this hybrid approach improves structuration accuracy while limiting computational cost, offering a scalable solution for real-world clinical use.
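The routing rule in the abstract (send NERL cases with confidence below 0.8 to the LLM, then pick an output by confidence) can be sketched as follows. Only the 0.8 threshold comes from the abstract; the function name `hybrid_structure`, the `(output, confidence)` return convention, and the tie-breaking rule of keeping whichever system reports the higher confidence are assumptions made for this illustration.

```python
def hybrid_structure(text, nerl, llm, threshold=0.8):
    """Route one posology string through NERL first and fall back to the LLM if needed.

    nerl and llm are callables returning (structured_output, confidence).
    Returns (structured_output, name_of_system_used).
    """
    nerl_out, nerl_conf = nerl(text)
    if nerl_conf >= threshold:
        # Confident NERL result: the more expensive LLM is never called.
        return nerl_out, "nerl"
    llm_out, llm_conf = llm(text)
    # Assumed selection rule: keep the output with the higher confidence.
    if llm_conf >= nerl_conf:
        return llm_out, "llm"
    return nerl_out, "nerl"


# Hypothetical stand-ins for the two systems.
def toy_nerl(text):
    return {"dose": None}, 0.4


def toy_llm(text):
    return {"dose": 1, "unit": "tablet"}, 0.9


result, source = hybrid_structure("1 cp le matin", toy_nerl, toy_llm)
```

Calling the LLM only on low-confidence inputs is what keeps latency and compute down: the cheaper system handles every case it is already sure about.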

[29] KnowMap: Efficient Knowledge-Driven Task Adaptation for LLMs

Kelin Fu,Kaigui Bian

Main category: cs.CL

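TL;DR: KnowMap is a new method that dynamically builds a knowledge base and uses a small knowledge-embedding model to strengthen LLM task adaptation, effectively improving LLMs' reasoning ability and task performance.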

Details Motivation: LLMs show significant capabilities in open-world agent tasks, but their reliance on static pre-trained knowledge makes it hard to adapt quickly to new, specialized tasks. Traditional fine-tuning is costly and data-intensive, and may lead to "catastrophic forgetting". Method: KnowMap dynamically constructs a knowledge base from environmental and experiential data and fine-tunes a small knowledge-embedding model to supply a larger LLM with valuable task-specific knowledge. Result: On the ScienceWorld benchmark, KnowMap improves the performance of the gpt-4-turbo model by 17.71%. Conclusion: KnowMap offers an efficient and effective way to strengthen LLM task adaptation and highlights the importance of integrating environmental and experiential knowledge for improving LLMs' reasoning capabilities. Abstract: While Large Language Models (LLMs) possess significant capabilities in open-world agent tasks, they also face challenges in rapidly adapting to new, specialized tasks due to their reliance on static pre-trained knowledge. Traditional methods such as fine-tuning are often costly, data-intensive, and may lead to "catastrophic forgetting." Therefore, we present KnowMap, a novel approach that dynamically constructs a knowledge base from environmental and experiential data. KnowMap fine-tunes a small knowledge-embedding model to equip a larger LLM with valuable task-specific knowledge. Our experiments on the ScienceWorld benchmark demonstrate a 17.71% improvement in the performance of the gpt-4-turbo model. KnowMap not only provides an efficient and effective means for LLM task adaptation, but also highlights how integrating environmental and experiential knowledge can enhance LLMs' reasoning capabilities.

[30] Health Sentinel: An AI Pipeline For Real-time Disease Outbreak Detection

Devesh Pant,Rishi Raj Grandhe,Vipin Samaria,Mukul Paul,Sudhir Kumar,Saransh Khanna,Jatin Agrawal,Jushaan Singh Kalra,Akhil VSSG,Satish V Khalikar,Vipin Garg,Himanshu Chauhan,Pranay Verma,Neha Khandelwal,Soma S Dhavala,Minesh Mathew

Main category: cs.CL

TL;DR: Health Sentinel is an automated system using ML and non-ML techniques to detect potential disease outbreaks from online news, successfully identifying thousands of health events in India since April 2022.

Details Motivation: Traditional indicator-based surveillance systems face challenges, and manual screening of online media articles is impractical due to the sheer volume. This necessitates an automated system for timely detection of disease outbreaks. Method: The proposed method is Health Sentinel, a multi-stage information extraction pipeline that combines ML and non-ML methods to extract structured information about disease outbreaks from online articles. Result: From April 2022 till date, Health Sentinel has processed over 300 million news articles and identified over 95,000 unique health events across India, of which over 3,500 were shortlisted by NCDC experts as potential outbreaks. Conclusion: Health Sentinel has proven to be an effective tool in identifying potential disease outbreaks by processing a large volume of online news articles and providing structured event information to health authorities for timely intervention. Abstract: Early detection of disease outbreaks is crucial to ensure timely intervention by the health authorities. Due to the challenges associated with traditional indicator-based surveillance, monitoring informal sources such as online media has become increasingly popular. However, owing to the number of online articles getting published every day, manual screening of the articles is impractical. To address this, we propose Health Sentinel. It is a multi-stage information extraction pipeline that uses a combination of ML and non-ML methods to extract events (structured information concerning disease outbreaks or other unusual health events) from online articles. The extracted events are made available to the Media Scanning and Verification Cell (MSVC) at the National Centre for Disease Control (NCDC), Delhi for analysis, interpretation and further dissemination to local agencies for timely intervention. 
From April 2022 till date, Health Sentinel has processed over 300 million news articles and identified over 95,000 unique health events across India of which over 3,500 events were shortlisted by the public health experts at NCDC as potential outbreaks.

[31] RCStat: A Statistical Framework for using Relative Contextualization in Transformers

Debabrata Mahapatra,Shubham Agarwal,Apoorv Saxena,Subrata Mitra

Main category: cs.CL

TL;DR: The paper introduces RCStat, a framework that improves token importance analysis in transformers by leveraging raw attention logits, achieving superior performance in compression and explanation tasks.

Details Motivation: Prior methods using Softmax-normalized attention weights obscure the richer structure of pre-Softmax query-key logits, limiting insights into input-token importance in auto-regressive transformers. Method: RCStat utilizes raw attention logits through Relative Contextualization (RC), which measures the contextual alignment of token segments and derives an efficient upper bound for RC. Result: RCStat achieves significant empirical gains across question answering, summarization, and attribution benchmarks, providing state-of-the-art performance in compression and attribution tasks. Conclusion: RCStat provides a statistical framework for analyzing contextual alignment between token segments, leading to effective key-value compression and high-fidelity attribution without model retraining. Abstract: Prior work on input-token importance in auto-regressive transformers has relied on Softmax-normalized attention weights, which obscure the richer structure of pre-Softmax query-key logits. We introduce RCStat, a statistical framework that harnesses raw attention logits via Relative Contextualization (RC), a random variable measuring contextual alignment between token segments, and derive an efficient upper bound for RC. We demonstrate two applications: (i) Key-Value compression, where RC-based thresholds drive adaptive key-value eviction for substantial cache reduction with minimal quality loss; and (ii) Attribution, where RC yields higher-fidelity token-, sentence-, and chunk-level explanations than post-Softmax methods. Across question answering, summarization, and attribution benchmarks, RCStat achieves significant empirical gains, delivering state-of-the-art compression and attribution performance without any model retraining.
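
One reason post-Softmax weights can hide structure in the raw logits is that Softmax is invariant to adding a constant to every logit. The sketch below illustrates only that property; it is not the RC statistic from the paper, and the two logit rows are hypothetical values chosen for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical query-key logit rows that differ only by a constant offset of 10.
# The first row has uniformly low scores, the second uniformly high scores.
weak_alignment = [-4.0, -3.0, -2.0]
strong_alignment = [6.0, 7.0, 8.0]

w_weak = softmax(weak_alignment)
w_strong = softmax(strong_alignment)

# Both rows give the same attention weights: the absolute level of the
# logits is lost once Softmax normalization has been applied.
print(w_weak)
print(w_strong)
```

The two printed rows are identical, so any analysis that starts from the normalized weights cannot tell these two cases apart, while one that works on the logits can.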

[32] Has Machine Translation Evaluation Achieved Human Parity? The Human Reference and the Limits of Progress

Lorenzo Proietti,Stefano Perrella,Roberto Navigli

Main category: cs.CL

TL;DR: Automatic metrics in Machine Translation are approaching or even surpassing human performance in evaluations, prompting discussions on how to measure further improvements.

Details Motivation: To understand the upper bound of metric performance in Machine Translation and evaluate whether automatic metrics can rival human judgment. Method: The authors incorporated human baselines in MT meta-evaluation to assess the capabilities of automatic metrics and their agreement with human judgments. Result: State-of-the-art automatic metrics often ranked as high as or higher than human annotators, suggesting a new benchmark for MT evaluation. Conclusion: The study concludes that automatic metrics are reaching human parity in MT evaluation, raising questions about the future measurement of progress in this field. Abstract: In Machine Translation (MT) evaluation, metric performance is assessed based on agreement with human judgments. In recent years, automatic metrics have demonstrated increasingly high levels of agreement with humans. To gain a clearer understanding of metric performance and establish an upper bound, we incorporate human baselines in the MT meta-evaluation, that is, the assessment of MT metrics' capabilities. Our results show that human annotators are not consistently superior to automatic metrics, with state-of-the-art metrics often ranking on par with or higher than human baselines. Despite these findings suggesting human parity, we discuss several reasons for caution. Finally, we explore the broader implications of our results for the research field, asking: Can we still reliably measure improvements in MT evaluation? With this work, we aim to shed light on the limits of our ability to measure progress in the field, fostering discussion on an issue that we believe is crucial to the entire MT evaluation community.

[33] ECCoT: A Framework for Enhancing Effective Cognition via Chain of Thought in Large Language Model

Zhenke Duan,Jiqun Pan,Jiani Tu,Xiaoyi Wang,Yanqing Wang

Main category: cs.CL

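TL;DR: ECCoT is a new method for evaluating and refining the reasoning chains of large language models; it combines topic-aware generation with causal reasoning alignment to improve interpretability and reliability.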

Details Motivation: Although large language models have made significant progress in natural language processing, they often lack transparency and produce unreliable outputs, so their interpretability and reliability need to be improved. Method: ECCoT integrates the Markov Random Field-Embedded Topic Model (MRF-ETM) for topic-aware chain-of-thought generation and Causal Sentence-BERT (CSBert) for causal reasoning alignment. Result: By filtering ineffective chains using structured ordering statistics, ECCoT improves interpretability, reduces biases, and enhances the trustworthiness of LLM-based decision-making. Conclusion: ECCoT is an end-to-end cognitive chain-of-thought validation framework that improves the interpretability of large language models, reduces biases, and enhances the trustworthiness of LLM-based decision-making. Abstract: In the era of large-scale artificial intelligence, Large Language Models (LLMs) have made significant strides in natural language processing. However, they often lack transparency and generate unreliable outputs, raising concerns about their interpretability. To address this, the Chain of Thought (CoT) prompting method structures reasoning into step-by-step deductions. Yet, not all reasoning chains are valid, and errors can lead to unreliable conclusions. We propose ECCoT, an End-to-End Cognitive Chain of Thought Validation Framework, to evaluate and refine reasoning chains in LLMs. ECCoT integrates the Markov Random Field-Embedded Topic Model (MRF-ETM) for topic-aware CoT generation and Causal Sentence-BERT (CSBert) for causal reasoning alignment. By filtering ineffective chains using structured ordering statistics, ECCoT improves interpretability, reduces biases, and enhances the trustworthiness of LLM-based decision-making. Key contributions include the introduction of ECCoT, MRF-ETM for topic-driven CoT generation, and CSBert for causal reasoning enhancement. Code is released at: https://github.com/erwinmsmith/ECCoT.git.

[34] Social Hatred: Efficient Multimodal Detection of Hatemongers

Tom Marzea,Abraham Israeli,Oren Tsur

Main category: cs.CL

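TL;DR: This paper proposes a multimodal approach that combines users' texts, activity, and network to detect hate-mongers more accurately, and demonstrates its effectiveness across platforms.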

Details Motivation: Automatic detection of online hate speech is crucial for detoxifying online discourse, and focusing on the user level is as important as focusing on individual utterances, though more challenging. Method: A multimodal aggregative approach that combines users' potentially hateful texts, user activity, and the user network. Result: Experiments on three unique datasets (Twitter, Gab, and Parler) show that combining a user's texts with their social context significantly improves the detection of hateful users. Conclusion: The paper proposes a multimodal aggregative approach for detecting hate-mongering users, which outperforms traditional text- and graph-based methods across different platforms and on large datasets. Abstract: Automatic detection of online hate speech serves as a crucial step in the detoxification of the online discourse. Moreover, accurate classification can promote a better understanding of the proliferation of hate as a social phenomenon. While most prior work focuses on the detection of hateful utterances, we argue that focusing on the user level is as important, albeit challenging. In this paper we consider a multimodal aggregative approach for the detection of hate-mongers, taking into account the potentially hateful texts, user activity, and the user network. Evaluating our method on three unique datasets, X (Twitter), Gab, and Parler, we show that processing a user's texts in her social context significantly improves the detection of hate-mongers, compared to previously used text and graph-based methods. We offer a comprehensive set of results obtained in different experimental settings as well as qualitative analysis of illustrative cases. Our method can be used to improve the classification of coded messages, dog-whistling, and racial gas-lighting, as well as to inform intervention measures. Moreover, we demonstrate that our multimodal approach performs well across very different content platforms and over large datasets and networks.

[35] Correcting Hallucinations in News Summaries: Exploration of Self-Correcting LLM Methods with External Knowledge

Juraj Vladika,Ihsan Soydemir,Florian Matthes

Main category: cs.CL

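TL;DR: This paper studies how self-correcting methods can reduce hallucinations by large language models in news summarization, finding that search engine snippets and few-shot prompts improve results and that the evaluation metric aligns closely with human judgment.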

Details Motivation: Although large language models excel at generating coherent text, they are prone to hallucinations, that is, factually inaccurate statements. Self-correcting methods have been explored for encyclopedic generation but less so for domains such as news summarization. Method: The researchers apply two state-of-the-art self-correcting systems to correct erroneous summaries using evidence from three search engines, and analyze the results. Result: The study reveals the practical benefits of search engine snippets and few-shot prompts, as well as high alignment between G-Eval and human evaluation, indicating that self-correcting systems are effective at reducing hallucinations. Conclusion: Applying self-correcting methods can effectively reduce hallucinations by large language models in news summarization; search engine snippets and few-shot prompts improve system performance, and G-Eval aligns closely with human evaluation. Abstract: While large language models (LLMs) have shown remarkable capabilities to generate coherent text, they suffer from the issue of hallucinations -- factually inaccurate statements. Among numerous approaches to tackle hallucinations, especially promising are the self-correcting methods. They leverage the multi-turn nature of LLMs to iteratively generate verification questions inquiring additional evidence, answer them with internal or external knowledge, and use that to refine the original response with the new corrections. These methods have been explored for encyclopedic generation, but less so for domains like news summarization. In this work, we investigate two state-of-the-art self-correcting systems by applying them to correct hallucinated summaries using evidence from three search engines. We analyze the results and provide insights into systems' performance, revealing interesting practical findings on the benefits of search engine snippets and few-shot prompts, as well as high alignment of G-Eval and human evaluation.

[36] Tailored Conversations beyond LLMs: A RL-Based Dialogue Manager

Lucie Galland,Catherine Pelachaud,Florian Pecune

Main category: cs.CL

TL;DR: A new framework combining LLMs and RL effectively enhances adaptability, efficiency, and personalization in goal-oriented open-ended dialogues, particularly showing promise in fostering behavior change.

Details Motivation: To enhance the adaptability and efficiency of dialogue systems in creating personalized, goal-oriented conversations with limited data. Method: A novel framework using hierarchical reinforcement learning and meta-learning was developed to manage structured dialogue phases and adapt to diverse user profiles. Result: The proposed dialogue manager outperformed a state-of-the-art LLM baseline in terms of reward when applied to Motivational Interviews. Conclusion: The integration of large language models with an RL-based dialogue manager improves adaptability, efficiency, and personalization in open-ended dialogues aimed at behavior change. Abstract: In this work, we propose a novel framework that integrates large language models (LLMs) with an RL-based dialogue manager for open-ended dialogue with a specific goal. By leveraging hierarchical reinforcement learning to model the structured phases of dialogue and employing meta-learning to enhance adaptability across diverse user profiles, our approach enhances adaptability and efficiency, enabling the system to learn from limited data, transition fluidly between dialogue phases, and personalize responses to heterogeneous patient needs. We apply our framework to Motivational Interviews, aiming to foster behavior change, and demonstrate that the proposed dialogue manager outperforms a state-of-the-art LLM baseline in terms of reward, showing a potential benefit of conditioning LLMs to create open-ended dialogue systems with specific goals.

[37] Breaking Barriers: Do Reinforcement Post Training Gains Transfer To Unseen Domains?

Chuxuan Hu,Yuxuan Zhu,Antony Kellermann,Caleb Biddulph,Suppakit Waiwitlikhit,Jason Benn,Daniel Kang

Main category: cs.CL

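TL;DR: Reinforcement post training (RPT) shows promise in improving the reasoning abilities of large language models, but its gains may not carry over to new domains with different reasoning patterns.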

Details Motivation: To understand the generalizability of RPT, since prior work evaluates RPT models only on the same domains as their fine-tuning data. Method: Two studies are conducted: an observational study compares open-weight RPT models against their base models across multiple domains; an interventional study fine-tunes with RPT on a single domain and evaluates performance across multiple domains. Result: Both studies reach the same conclusion: the gains from RPT can vanish on domains with different reasoning patterns. Conclusion: Although RPT brings substantial gains on tasks similar to the fine-tuning data, its gains generalize inconsistently and can vanish on domains with different reasoning patterns. Abstract: Reinforcement post training (RPT) has recently shown promise in improving the reasoning abilities of large language models (LLMs). However, it remains unclear how well these improvements generalize to new domains, as prior work evaluates RPT models on data from the same domains used for fine-tuning. To understand the generalizability of RPT, we conduct two studies. (1) Observational: We compare a wide range of open-weight RPT models against their corresponding base models across multiple domains, including both seen and unseen domains in their fine-tuning data. (2) Interventional: We fine-tune LLMs with RPT on single domains and evaluate their performance across multiple domains. Both studies converge on the same conclusion that, although RPT brings substantial gains on tasks similar to the fine-tuning data, the gains generalize inconsistently and can vanish on domains with different reasoning patterns.

[38] Evaluating Rare Disease Diagnostic Performance in Symptom Checkers: A Synthetic Vignette Simulation Approach

Takashi Nishibayashi,Seiji Kanazawa,Kumpei Yamada

Main category: cs.CL

TL;DR: This study proposes a novel method using synthetic vignettes generated from the Human Phenotype Ontology to evaluate the impact of Symptom Checker algorithm updates on rare diseases, enabling efficient and low-cost pre-deployment testing.

Details Motivation: Evaluating diagnostic performance changes for rare diseases after Symptom Checker algorithm updates is challenging due to the difficulty in acquiring sufficient evaluation data and the high cost of manually creating clinical vignettes. A reliable and cost-effective evaluation method is needed. Method: A Synthetic Vignette Simulation Approach was developed using disease-phenotype annotations from the Human Phenotype Ontology (HPO) to generate synthetic vignettes. These vignettes were used to simulate Symptom Checker interviews and estimate the impact of algorithm updates on real-world diagnostic performance. The method's effectiveness was evaluated by comparing estimated values with actual metric changes using the R-squared coefficient. Result: For diseases with frequency information in HPO (n=5), the R² for recall@8 change was 0.831 (p=0.031), and for precision@8 change, it was 0.78 (p=0.047), indicating accurate prediction of post-deployment performance. However, large prediction errors occurred for diseases without frequency information (n=3). The manual effort required for mapping HPO phenotypes to SC symptoms was about 2 hours per disease. Conclusion: The proposed method enables pre-deployment evaluation of Symptom Checker algorithm changes for individual rare diseases using a publicly available, expert-created knowledge base. This approach allows developers to efficiently improve diagnostic performance and potentially enhance support for early diagnosis. Abstract: Background: Symptom Checkers (SCs) provide users with personalized medical information. To prevent performance degradation from algorithm updates, SC developers must evaluate diagnostic performance changes for individual diseases before deployment. However, acquiring sufficient evaluation data for rare diseases is difficult, and manually creating numerous clinical vignettes is costly and impractical. 
Objective: This study proposes and validates a novel Synthetic Vignette Simulation Approach to evaluate diagnostic performance changes for individual rare diseases following SC algorithm updates. Methods: We used disease-phenotype annotations from the Human Phenotype Ontology (HPO), a knowledge database for rare diseases, to generate synthetic vignettes. With these, we simulated SC interviews to estimate the impact of algorithm updates on real-world diagnostic performance. The method's effectiveness was evaluated retrospectively by comparing estimated values with actual metric changes using the R² (R-squared) coefficient. Results: The experiment included eight past SC algorithm updates. For updates on diseases with frequency information in HPO (n=5), the R² for recall@8 change was 0.831 (p=0.031), and for precision@8 change, it was 0.78 (p=0.047), indicating the method can predict post-deployment performance. In contrast, large prediction errors occurred for diseases without frequency information (n=3), highlighting its importance. The manual effort to map HPO phenotypes to SC symptoms was approximately 2 hours per disease. Conclusions: Our method enables pre-deployment evaluation of SC algorithm changes for individual rare diseases using a publicly available, expert-created knowledge base. This transparent and low-cost approach allows developers to efficiently improve diagnostic performance for rare diseases, potentially enhancing support for early diagnosis.
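
The comparison of estimated and actual metric changes can be sketched as follows. This is a minimal illustration that assumes R² is taken as the squared Pearson correlation between the two series, which is one common reading; the abstract does not state the exact definition used, and the numbers below are hypothetical placeholders rather than data from the study.

```python
import math

def pearson_r_squared(estimated, actual):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(estimated)
    mean_e = sum(estimated) / n
    mean_a = sum(actual) / n
    cov = sum((e - mean_e) * (a - mean_a) for e, a in zip(estimated, actual))
    var_e = sum((e - mean_e) ** 2 for e in estimated)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    r = cov / math.sqrt(var_e * var_a)
    return r * r

# Hypothetical per-update changes in a metric such as recall@8:
# simulated (pre-deployment) estimates versus observed post-deployment changes.
estimated_change = [1.0, 2.0, 3.0, 4.0]
actual_change = [1.0, 3.0, 2.0, 4.0]

print(pearson_r_squared(estimated_change, actual_change))
```

A value close to 1 means the simulated changes track the observed ones closely; with only a handful of updates per group, as in the study (n=5 and n=3), such a coefficient is sensitive to individual points.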

[39] Arabic Dialect Classification using RNNs, Transformers, and Large Language Models: A Comparative Analysis

Omar A. Essameldin,Ali O. Elbeih,Wael H. Gomaa,Wael F. Elsersy

Main category: cs.CL

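TL;DR: This study analyzes 18 Arabic dialects and finds that the MARBERTv2 model performs best on the classification task, showing its potential in areas such as chatbots and social media monitoring.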

Details Motivation: Arabic is among the most popular languages in the world, with many dialects spoken across 22 countries, so identifying these dialects has important practical value. Method: Arabic tweets from the QADI dataset are processed and tested using RNN models, Transformer models, and large language models (LLMs) via prompt engineering. Result: MARBERTv2 performs best on Arabic dialect classification, and the paper also identifies the main linguistic issues in Arabic dialect identification. Conclusion: The paper concludes that MARBERTv2 performs best at classifying 18 Arabic dialects, with 65% accuracy and a 64% F1-score. Abstract: The Arabic language is among the most popular languages in the world with a huge variety of dialects spoken in 22 countries. In this study, we address the problem of classifying 18 Arabic dialects of the QADI dataset of Arabic tweets. RNN models, Transformer models, and large language models (LLMs) via prompt engineering are created and tested. Among these, MARBERTv2 performed best with 65% accuracy and 64% F1-score. Through the use of state-of-the-art preprocessing techniques and the latest NLP models, this paper identifies the most significant linguistic issues in Arabic dialect identification. The results corroborate applications like personalized chatbots that respond in users' dialects, social media monitoring, and greater accessibility for Arabic communities.

[40] Accurate, fast, cheap: Choose three. Replacing Multi-Head-Attention with Bidirectional Recurrent Attention for Long-Form ASR

Martin Ratajczak,Jean-Philippe Robichaud,Jennifer Drexler Fox

Main category: cs.CL

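TL;DR: This paper proposes an efficient approach to long-form speech recognition that uses RA layers and Direction Dropout to significantly increase throughput while maintaining accuracy.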

Details Motivation: ASR models based on multi-head attention (MHA) are ill-suited to long-form ASR because of their quadratic complexity in sequence length. Method: The paper studies linear-complexity recurrent attention (RA) layers, presents a limited-context attention (LCA) baseline, and develops a long-form training paradigm and the Direction Dropout regularization method. Result: Bidirectional RA layers match the accuracy of MHA in both short- and long-form applications, and achieve better accuracy than LCA with 44% higher throughput. Conclusion: RA layers deliver higher throughput than LCA in long-form ASR, and Direction Dropout enables finer-grained control of the accuracy/throughput trade-off as well as an alternating directions decoding mode. Abstract: Long-form speech recognition is an application area of increasing research focus. ASR models based on multi-head attention (MHA) are ill-suited to long-form ASR because of their quadratic complexity in sequence length. We build on recent work that has investigated linear complexity recurrent attention (RA) layers for ASR. We find that bidirectional RA layers can match the accuracy of MHA for both short- and long-form applications. We present a strong limited-context attention (LCA) baseline, and show that RA layers are just as accurate while being more efficient. We develop a long-form training paradigm which further improves RA performance, leading to better accuracy than LCA with 44% higher throughput. We also present Direction Dropout, a novel regularization method that improves accuracy, provides fine-grained control of the accuracy/throughput trade-off of bidirectional RA, and enables a new alternating directions decoding mode with even higher throughput.

[41] SRFT: A Single-Stage Method with Supervised and Reinforcement Fine-Tuning for Reasoning

Yuqian Fu,Tinghong Chen,Jiajun Chai,Xihuai Wang,Songjun Tu,Guojun Yin,Wei Lin,Qichao Zhang,Yuanheng Zhu,Dongbin Zhao

Main category: cs.CL

TL;DR: This paper proposes SRFT, a unified approach combining Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) for optimizing large language models, achieving significant improvements in accuracy on reasoning and out-of-distribution tasks.

Details Motivation: The optimal integration of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) remains a fundamental challenge in improving large language models (LLMs). Method: Analysis of token distributions, learning dynamics, and integration mechanisms from entropy-based perspectives; proposal of SRFT, which combines SFT and RL through entropy-aware weighting mechanisms. Result: SRFT achieves 59.1% average accuracy, outperforming zero-RL methods by 9.0% on five mathematical reasoning benchmarks and 10.9% on three out-of-distribution benchmarks. Conclusion: SRFT successfully integrates SFT and RL into a single-stage method, improving the performance of LLMs on mathematical reasoning and out-of-distribution benchmarks. Abstract: Large language models (LLMs) have achieved remarkable progress in reasoning tasks, yet the optimal integration of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) remains a fundamental challenge. Through comprehensive analysis of token distributions, learning dynamics, and integration mechanisms from entropy-based perspectives, we reveal key differences between these paradigms: SFT induces coarse-grained global changes to LLM policy distributions, while RL performs fine-grained selective optimizations, with entropy serving as a critical indicator of training effectiveness. Building on these observations, we propose Supervised Reinforcement Fine-Tuning (SRFT), a single-stage method that unifies both fine-tuning paradigms through entropy-aware weighting mechanisms. Our approach simultaneously applies SFT and RL to directly optimize the LLM using demonstrations and self-exploration rollouts rather than through two-stage sequential methods. Extensive experiments show that SRFT achieves 59.1% average accuracy, outperforming zero-RL methods by 9.0% on five mathematical reasoning benchmarks and 10.9% on three out-of-distribution benchmarks.

[42] Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study

Yuqi Zhu,Yi Zhong,Jintian Zhang,Ziheng Zhang,Shuofei Qiao,Yujie Luo,Lun Du,Da Zheng,Huajun Chen,Ningyu Zhang

Main category: cs.CL

TL;DR: This paper explores ways to improve data analysis abilities in open-source large language models, revealing insights about the importance of strategic planning, interaction design, and data quality.

Details Motivation: Open-source large language models (LLMs) face significant limitations in reasoning-intensive data analysis tasks. This work aims to investigate strategies for enhancing their data analysis capabilities. Method: A seed dataset of diverse and realistic scenarios was curated to evaluate models across three dimensions: data understanding, code generation, and strategic planning. Result: Three key findings were revealed: (1) Strategic planning quality is the primary determinant of model performance; (2) Interaction design and task complexity significantly influence reasoning capabilities; (3) Data quality has a greater impact than diversity on achieving optimal performance. Conclusion: The study concludes that strategic planning quality is the main factor affecting model performance, and a data synthesis methodology developed based on findings significantly improves analytical reasoning capabilities of open-source LLMs. Abstract: Large Language Models (LLMs) hold promise in automating data analysis tasks, yet open-source models face significant limitations in these kinds of reasoning-intensive scenarios. In this work, we investigate strategies to enhance the data analysis capabilities of open-source LLMs. By curating a seed dataset of diverse, realistic scenarios, we evaluate models across three dimensions: data understanding, code generation, and strategic planning. Our analysis reveals three key findings: (1) Strategic planning quality serves as the primary determinant of model performance; (2) Interaction design and task complexity significantly influence reasoning capabilities; (3) Data quality demonstrates a greater impact than diversity in achieving optimal performance. We leverage these insights to develop a data synthesis methodology, demonstrating significant improvements in open-source LLMs' analytical reasoning capabilities.

[43] How Effectively Can BERT Models Interpret Context and Detect Bengali Communal Violent Text?

Abdullah Khondoker,Enam Ahmed Taufik,Md. Iftekhar Islam Tashik,S M Ishtiak Mahmud,Farig Sadeque

Main category: cs.CL

TL;DR: This study aims to improve the accuracy of detecting text that incites communal violence, focusing on Bengali text.

Details Motivation: The spread of online hate speech has led to communal violence, intensifying conflicts between religious, ethnic, and social groups and threatening social harmony. Yet research in this area remains insufficient. Method: The study introduces a fine-tuned BanglaBERT model, expands the dataset by adding 1,794 instances, and develops a better-performing ensemble model. LIME is also applied to interpret the model's decisions. Result: The fine-tuned BanglaBERT model achieves a macro F1 score of 0.60, while the improved ensemble model achieves a macro F1 score of 0.63. Qualitative analysis finds that the models struggle with context understanding, which sometimes leads to misclassifications. Conclusion: The paper concludes that NLP and interpretability tools show promise for reducing online communal violence and provide a foundation for future research. Abstract: The spread of cyber hatred has led to communal violence, fueling aggression and conflicts between various religious, ethnic, and social groups, posing a significant threat to social harmony. Despite its critical importance, the classification of communal violent text remains an underexplored area in existing research. This study aims to enhance the accuracy of detecting text that incites communal violence, focusing specifically on Bengali textual data sourced from social media platforms. We introduce a fine-tuned BanglaBERT model tailored for this task, achieving a macro F1 score of 0.60. To address the issue of data imbalance, our dataset was expanded by adding 1,794 instances, which facilitated the development and evaluation of a fine-tuned ensemble model. This ensemble model demonstrated an improved performance, achieving a macro F1 score of 0.63, thus highlighting its effectiveness in this domain. In addition to quantitative performance metrics, qualitative analysis revealed instances where the models struggled with context understanding, leading to occasional misclassifications, even when predictions were made with high confidence. Through analyzing the cosine similarity between words, we identified certain limitations in the pre-trained BanglaBERT models, particularly in their ability to distinguish between closely related communal and non-communal terms. To further interpret the model's decisions, we applied LIME, which helped to uncover specific areas where the model struggled in understanding context, contributing to errors in classification. 
These findings highlight the promise of NLP and interpretability tools in reducing online communal violence. Our work contributes to the growing body of research in communal violence detection and offers a foundation for future studies aiming to refine these techniques for better accuracy and societal impact.
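
The macro F1 score reported above averages per-class F1 with equal weight per class, which matters when the data are imbalanced. A minimal sketch of that computation follows; the class names and the toy predictions are hypothetical and are not taken from the paper's dataset.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Hypothetical imbalanced toy data: a model that always predicts the majority class.
y_true = ["non-communal", "non-communal", "non-communal", "communal"]
y_pred = ["non-communal", "non-communal", "non-communal", "non-communal"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, macro_f1(y_true, y_pred))
```

On this toy data the accuracy is 0.75 while the macro F1 is 3/7, because the minority class contributes an F1 of zero. This is why macro F1 is a stricter summary than accuracy for imbalanced detection tasks.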

[44] MAM: Modular Multi-Agent Framework for Multi-Modal Medical Diagnosis via Role-Specialized Collaboration

Yucheng Zhou,Lingran Song,Jianbing Shen

Main category: cs.CL

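TL;DR: This paper proposes MAM, a modular multi-agent framework for multimodal medical diagnosis that improves diagnostic performance and model flexibility through role specialization.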

Details Motivation: Existing unified multimodal medical LLMs are limited in knowledge update costs, comprehensiveness, and flexibility, so a more efficient approach is needed. Method: A Modular Multi-Agent Framework (MAM) is proposed that decomposes the medical diagnostic process into specialized roles, each implemented by an LLM-based agent. Result: Experiments show that MAM outperforms modality-specific LLMs across a range of public multimodal medical datasets, with performance improvements of 18% to 365%. Conclusion: The MAM framework effectively improves multimodal medical diagnosis and offers a modular, collaborative design. Abstract: Recent advancements in medical Large Language Models (LLMs) have showcased their powerful reasoning and diagnostic capabilities. Despite their success, current unified multimodal medical LLMs face limitations in knowledge update costs, comprehensiveness, and flexibility. To address these challenges, we introduce the Modular Multi-Agent Framework for Multi-Modal Medical Diagnosis (MAM). Inspired by our empirical findings highlighting the benefits of role assignment and diagnostic discernment in LLMs, MAM decomposes the medical diagnostic process into specialized roles: a General Practitioner, Specialist Team, Radiologist, Medical Assistant, and Director, each embodied by an LLM-based agent. This modular and collaborative framework enables efficient knowledge updates and leverages existing medical LLMs and knowledge bases. Extensive experimental evaluations conducted on a wide range of publicly accessible multimodal medical datasets, incorporating text, image, audio, and video modalities, demonstrate that MAM consistently surpasses the performance of modality-specific LLMs. Notably, MAM achieves significant performance improvements ranging from 18% to 365% compared to baseline models. Our code is released at https://github.com/yczhou001/MAM.

cs.CV [Back]

[45] Correspondence-Free Multiview Point Cloud Registration via Depth-Guided Joint Optimisation

Yiran Zhou,Yingyu Wang,Shoudong Huang,Liang Zhao

Main category: cs.CV

TL;DR: This paper proposes a new correspondence-free multiview point cloud registration technique that bypasses traditional challenges by using depth maps and joint optimization, leading to improved performance in difficult environments.

Details Motivation: Multiview point cloud registration is essential for building globally consistent 3D models, but traditional methods struggle to find globally optimal solutions in complex environments due to difficulties in feature extraction and data association. Method: The method involves representing the global map as a depth map and using raw depth information to create a non-linear least squares optimization problem that jointly estimates poses of point clouds and the global map. This avoids explicit feature extraction and data association. Result: The proposed method achieves better accuracy than existing approaches on real-world datasets, particularly in environments where feature extraction and data association are difficult. Conclusion: The paper concludes that their proposed correspondence-free multiview point cloud registration method outperforms existing state-of-the-art approaches, especially in challenging environments. Abstract: Multiview point cloud registration is a fundamental task for constructing globally consistent 3D models. Existing approaches typically rely on feature extraction and data association across multiple point clouds; however, with these processes it is challenging to obtain a globally optimal solution in complex environments. In this paper, we introduce a novel correspondence-free multiview point cloud registration method. Specifically, we represent the global map as a depth map and leverage raw depth information to formulate a non-linear least squares optimisation that jointly estimates poses of point clouds and the global map. Unlike traditional feature-based bundle adjustment methods, which rely on explicit feature extraction and data association, our method bypasses these challenges by associating multi-frame point clouds with a global depth map through their corresponding poses. This data association is implicitly incorporated and dynamically refined during the optimisation process. 
Extensive evaluations on real-world datasets demonstrate that our method outperforms state-of-the-art approaches in accuracy, particularly in challenging environments where feature extraction and data association are difficult.

[46] Connecting Vision and Emissions: A Behavioural AI Approach to Carbon Estimation in Road Design

Ammar K Al Mhdawi,Nonso Nnamoko,Safanah Mudheher Raafat,M. K. S. Al-Mhdawi,Amjad J Humaidi

Main category: cs.CV

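TL;DR: This paper proposes an enhanced YOLOv8 framework combined with deep OCR for real-time vehicle detection, license plate recognition, and carbon emission estimation, with potential for large-scale use in smart transportation systems.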

Details Motivation: Because YOLOv8 lacks built-in capacity for fine-grained recognition tasks such as reading license plates or determining vehicle attributes, a hybrid pipeline is needed to improve accuracy and practicality. Method: The YOLOv8 architecture is enhanced to detect, segment, and track vehicles, and a CNN-based OCR module is added for license plate recognition and vehicle classification; plate information is validated through a real-time API that cross-references an external database. Result: The YOLOv8 detector achieves an mAP@0.5 of approximately 71% for bounding boxes and 70% for segmentation masks; character-level OCR accuracy reaches up to 99%. Conclusion: The paper proposes an enhanced YOLOv8 framework that combines real-time object detection with deep OCR to estimate vehicle carbon emissions in urban environments, and shows that practical deployment in smart transportation systems is feasible. Abstract: We present an enhanced YOLOv8 real-time vehicle detection and classification framework for estimating carbon emissions in urban environments. The system enhances the YOLOv8 architecture to detect, segment, and track vehicles from live traffic video streams. Once a vehicle is localized, a dedicated deep learning-based identification module is employed to recognize license plates and classify vehicle types. Since YOLOv8 lacks the built-in capacity for fine-grained recognition tasks such as reading license plates or determining vehicle attributes beyond class labels, our framework incorporates a hybrid pipeline where each detected vehicle is tracked and its bounding box is cropped and passed to a deep Optical Character Recognition (OCR) module. This OCR system, composed of multiple convolutional neural network (CNN) layers, is trained specifically for character-level detection and license plate decoding under varied conditions such as motion blur, occlusion, and diverse font styles. Additionally, the recognized plate information is validated using a real-time API that cross-references an external vehicle registration database to ensure accurate classification and emission estimation. This multi-stage approach enables precise, automated calculation of per-vehicle carbon emissions. Extensive evaluation was conducted using a diverse vehicle dataset enriched with segmentation masks and annotated license plates. The YOLOv8 detector achieved a mean Average Precision (mAP@0.5) of approximately 71% for bounding boxes and 70% for segmentation masks. Character-level OCR accuracy reached up to 99% with the best-performing CNN model. 
These results affirm the feasibility of combining real time object detection with deep OCR for practical deployment in smart transportation systems, offering a scalable solution for automated, vehicle specific carbon emission monitoring.
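
The "@0.5" in mAP@0.5 refers to the intersection-over-union (IoU) threshold at which a predicted box counts as matching a ground-truth box. The following is a minimal sketch of that matching criterion only, not the paper's evaluation code; the function names `box_iou` and `is_match` and the `(x1, y1, x2, y2)` box layout are assumptions made for illustration.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2). Hypothetical helper."""
    # Corners of the overlap rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred, gt, threshold=0.5):
    """A prediction matches a ground-truth box when IoU >= threshold."""
    return box_iou(pred, gt) >= threshold
```

Average precision is then computed from the ranked list of matched and unmatched predictions per class, and mAP averages that value over classes.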

[47] Interpretable and Granular Video-Based Quantification of Motor Characteristics from the Finger Tapping Test in Parkinson Disease

Tahereh Zarrat Ehsan,Michael Tangermann,Yağmur Güçlütürk,Bastiaan R. Bloem,Luc J. W. Evers

Main category: cs.CV

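TL;DR: This study develops a video-based method for quantifying motor characteristics in Parkinson disease, improving the accuracy and interpretability of finger-tapping test scoring.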

Details Motivation: The conventional finger-tapping test is subjective, prone to inter- and intra-rater variability, and offers no insight into individual motor characteristics, so a more objective and granular quantification method is needed. Method: A granular computer vision-based quantification method is proposed, and principal component analysis with varimax rotation is used to verify that the video-based features correspond to four motor deficits. The features are also used to train machine learning classifiers that estimate the MDS-UPDRS finger-tapping score. Result: The proposed method predicts MDS-UPDRS scores more accurately than existing approaches and identifies more granular distinctions within the sequence effect and hesitation-halts deficits. Conclusion: The framework offers a practical solution for the objective assessment of PD motor characteristics; future work is needed to assess its responsiveness to symptomatic treatment and disease progression. Abstract: Accurately quantifying motor characteristics in Parkinson disease (PD) is crucial for monitoring disease progression and optimizing treatment strategies. The finger-tapping test is a standard motor assessment. Clinicians visually evaluate a patient's tapping performance and assign an overall severity score based on tapping amplitude, speed, and irregularity. However, this subjective evaluation is prone to inter- and intra-rater variability, and does not offer insights into individual motor characteristics captured during this test. This paper introduces a granular computer vision-based method for quantifying PD motor characteristics from video recordings. Four sets of clinically relevant features are proposed to characterize hypokinesia, bradykinesia, sequence effect, and hesitation-halts. We evaluate our approach on video recordings and clinical evaluations of 74 PD patients from the Personalized Parkinson Project. Principal component analysis with varimax rotation shows that the video-based features corresponded to the four deficits. Additionally, video-based analysis has allowed us to identify further granular distinctions within sequence effect and hesitation-halts deficits. In the following, we have used these features to train machine learning classifiers to estimate the Movement Disorder Society Unified Parkinson Disease Rating Scale (MDS-UPDRS) finger-tapping score. Compared to state-of-the-art approaches, our method achieves a higher accuracy in MDS-UPDRS score prediction, while still providing an interpretable quantification of individual finger-tapping motor characteristics. 
In summary, the proposed framework provides a practical solution for the objective assessment of PD motor characteristics, that can potentially be applied in both clinical and remote settings. Future work is needed to assess its responsiveness to symptomatic treatment and disease progression.
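
For readers unfamiliar with the analysis step, principal component analysis with varimax rotation can be sketched as follows. This is a generic illustration on synthetic random data, not the authors' pipeline or features: the matrix shape (74 rows, 8 columns), the choice of four components, and the names `pca_loadings` and `varimax` are assumptions. The rotation uses the standard SVD-based varimax iteration.

```python
import numpy as np


def pca_loadings(X, k):
    """Loadings of the first k principal components of X (rows are samples)."""
    Xc = X - X.mean(axis=0)
    _, s, vt = np.linalg.svd(Xc, full_matrices=False)
    # Scale each component direction by its standard deviation.
    return vt[:k].T * (s[:k] / np.sqrt(len(X) - 1))


def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonal rotation that maximizes the variance of squared loadings."""
    p, k = loadings.shape
    R = np.eye(k)
    prev = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        grad = loadings.T @ (L**3 - L @ np.diag((L**2).sum(axis=0)) / p)
        u, s, vt = np.linalg.svd(grad)
        R = u @ vt  # nearest orthogonal matrix to the gradient
        crit = s.sum()
        if prev > 0 and crit < prev * (1 + tol):
            break
        prev = crit
    return loadings @ R, R


# Synthetic stand-in data: 74 samples, 8 hypothetical video features.
rng = np.random.default_rng(0)
X = rng.normal(size=(74, 8))
raw = pca_loadings(X, k=4)
rotated, R = varimax(raw)
```

Because the rotation is orthogonal, each feature's communality (its row sum of squared loadings) is unchanged; only the distribution of loadings across components shifts, which is what makes the rotated components easier to interpret.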

[48] Reinforcement Learning-Based Dynamic Grouping for Tubular Structure Tracking

Chong Di,Shuwang Zhou,Da Chen,Jean-Marie Mirebeau,Minglei Shu,Laurent D. Cohen

Main category: cs.CV

TL;DR: This paper proposes a reinforcement learning-based tracking method that addresses challenges in tracking tubular structures by dynamically exploring a segment graph without requiring a precomputed structure or extensive prior knowledge.

Details Motivation: Tracking tubular structures like blood vessels and roads is challenging due to complex morphologies and environmental variations. Existing methods, particularly segment-wise models, face issues of computational inefficiency and reliance on prior structural knowledge. Method: The method formulates segment-wise tracking as a Markov Decision Process (MDP) using Q-Learning to dynamically explore a graph of segments, compute edge weights on-demand, and adaptively expand the search space. Result: Experimental results show that the proposed method significantly outperforms state-of-the-art point-wise and segment-wise approaches on typical tubular structure datasets, proving robustness to incomplete initial information and reducing computational costs. Conclusion: The proposed reinforcement learning framework based on Q-Learning for segment-wise tracking effectively handles complex topologies and maintains global path coherence, outperforming existing point-wise and segment-wise approaches. Abstract: The computation of minimal paths for the applications in tracking tubular structures such as blood vessels and roads is challenged by complex morphologies and environmental variations. Existing approaches can be roughly categorized into two research lines: the point-wise based models and the segment-wise based models. Although segment-wise approaches have obtained promising results in many scenarios, they often suffer from computational inefficiency and heavily rely on a prescribed prior to fit the target elongated shapes. We propose a novel framework that casts segment-wise tracking as a Markov Decision Process (MDP), enabling a reinforcement learning approach. Our method leverages Q-Learning to dynamically explore a graph of segments, computing edge weights on-demand and adaptively expanding the search space. This strategy avoids the high cost of a pre-computed graph and proves robust to incomplete initial information. 
Experimental results on typical tubular structure datasets demonstrate that our method significantly outperforms state-of-the-art point-wise and segment-wise approaches. The proposed method effectively handles complex topologies and maintains global path coherence without depending on extensive prior structural knowledge.
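
The idea of casting path search over a segment graph as an MDP solved with Q-Learning can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the authors' algorithm: the four-node graph, the cost table `_TOY_COST`, and all hyperparameters are invented, and `edge_cost` only mimics on-demand weight computation by caching a lookup on first use.

```python
import random
from functools import lru_cache

# Hypothetical segment graph: nodes are segments, edges are allowed links.
NEIGHBORS = {0: [1, 2], 1: [3], 2: [3], 3: []}
_TOY_COST = {(0, 1): 1.0, (1, 3): 1.0, (0, 2): 4.0, (2, 3): 1.0}
GOAL = 3


@lru_cache(maxsize=None)
def edge_cost(u, v):
    """Stand-in for an expensive weight, evaluated only when first needed."""
    return _TOY_COST[(u, v)]


def q_learning(episodes=500, alpha=0.5, gamma=1.0, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    Q = {}  # (state, next_segment) -> estimated return
    for _ in range(episodes):
        s = 0
        while s != GOAL:
            acts = NEIGHBORS[s]
            if rng.random() < epsilon:
                a = rng.choice(acts)  # explore
            else:
                a = max(acts, key=lambda n: Q.get((s, n), 0.0))  # exploit
            reward = -edge_cost(s, a)
            future = max((Q.get((a, n), 0.0) for n in NEIGHBORS[a]), default=0.0)
            old = Q.get((s, a), 0.0)
            Q[(s, a)] = old + alpha * (reward + gamma * future - old)
            s = a
    return Q


def greedy_path(Q, start=0):
    path = [start]
    while path[-1] != GOAL:
        s = path[-1]
        path.append(max(NEIGHBORS[s], key=lambda n: Q.get((s, n), 0.0)))
    return path


Q = q_learning()
path = greedy_path(Q)
```

In this toy graph the greedy policy settles on the cheaper route through segment 1. Only edges the agent actually traverses have their cost evaluated, which is the property the abstract credits with avoiding a pre-computed graph.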

[49] Bird's-eye view safety monitoring for the construction top under the tower crane

Yanke Wang,Yu Hin Ng,Haobo Liang,Ching-Wei Chang,Hao Chen

Main category: cs.CV

TL;DR: This paper proposes an AI-based automated safety monitoring system for tower cranes using camera and LiDAR data fusion to protect workers and prevent collisions, demonstrating its effectiveness through on-site implementation.

Details Motivation: With the increasing automation and intelligence in tower crane operations, there is an urgent need to prioritize safety through advanced technologies like AI, particularly for protecting human workers near the construction top and preventing collisions. Method: The study integrated camera and LiDAR data into a software pipeline for 3D data fusion, enabling localization of humans and Modular Integrated Constructions (MiCs), and implemented state-of-the-art methods to improve accuracy and effectiveness. The system was tested on-site with visualization tools. Result: The system successfully fused 3D data to accurately monitor and protect human workers on the construction site while preventing crane collisions, demonstrated through on-site visualization and analysis. Conclusion: The proposed AI-based safety monitoring system effectively enhances tower crane operation safety by utilizing 3D data fusion from camera and LiDAR to localize humans and MiCs, offering real-time alerts to operators. Abstract: The tower crane is involving more automated and intelligent operation procedure, and importantly, the application of automation technologies to the safety issues is imperative ahead of the utilization of any other advances. Among diverse risk management tasks on site, it is essential to protect the human workers on the workspace between the tower crane and constructed building top area (construction top) from the bird's-eye view, especially with Modular Integrated Construction (MiC) lifted. Also, the camera and Light Detection And Ranging (LiDAR) can capture abundant 3D information on site, which is however yet made the best use. Considering the safety protection for humans and tower cranes, we present an AI-based fully automated safety monitoring system for tower crane lifting from the bird's-eye view, surveilling to shield the human workers on the construction top and avoid cranes' collision by alarming the crane operator. 
The system achieved a 3D data fusion for localization of humans and MiCs by integrating the captured information from camera and LiDAR. The state-of-the-art methods were explored and implemented into our proposed software pipeline coupled with the hardware and display systems. Furthermore, we conducted an analysis of the components in the pipeline to verify the accuracy and effectiveness of the involved methods. The display and visualization on the real site proved that our system can serve as a valuable safety monitoring toolkit on site.

[50] Damba-ST: Domain-Adaptive Mamba for Efficient Urban Spatio-Temporal Prediction

Rui An,Yifeng Zhang,Ziran Liang,Wenqi Fan,Yuxuan Liang,Xuequn Shang,Qing Li

Main category: cs.CV

TL;DR: This paper proposes Damba-ST, a domain-adaptive Mamba-based model for efficient urban spatio-temporal prediction that overcomes the limitations of existing models, achieving strong performance and zero-shot generalization.

Details Motivation: Existing Transformer-based urban spatio-temporal models suffer from high computational complexity and memory overhead, limiting scalability. Although Mamba offers linear time complexity, its direct application leads to performance degradation due to spatio-temporal heterogeneity and negative transfer. Method: The authors propose Damba-ST, which incorporates a domain-adaptive state space model and three distinct Domain Adapters. The state space model partitions latent representations into shared and domain-specific subspaces, while the Domain Adapters help bridge disparate domain distributions and align cross-domain commonalities. Result: Damba-ST achieves state-of-the-art performance on urban spatio-temporal prediction tasks and demonstrates strong zero-shot generalization across diverse regions without requiring extensive retraining or fine-tuning. Conclusion: Damba-ST is a novel domain-adaptive Mamba-based model that addresses the limitations of existing spatio-temporal models by maintaining linear complexity and significantly improving adaptability to heterogeneous domains, thereby enabling efficient urban spatio-temporal prediction with strong zero-shot generalization. Abstract: Training urban spatio-temporal foundation models that generalize well across diverse regions and cities is critical for deploying urban services in unseen or data-scarce regions. Recent studies have typically focused on fusing cross-domain spatio-temporal data to train unified Transformer-based models. However, these models suffer from quadratic computational complexity and high memory overhead, limiting their scalability and practical deployment. Inspired by the efficiency of Mamba, a state space model with linear time complexity, we explore its potential for efficient urban spatio-temporal prediction. However, directly applying Mamba as a spatio-temporal backbone leads to negative transfer and severe performance degradation. 
This is primarily due to spatio-temporal heterogeneity and the recursive mechanism of Mamba's hidden state updates, which limit cross-domain generalization. To overcome these challenges, we propose Damba-ST, a novel domain-adaptive Mamba-based model for efficient urban spatio-temporal prediction. Damba-ST retains Mamba's linear complexity advantage while significantly enhancing its adaptability to heterogeneous domains. Specifically, we introduce two core innovations: (1) a domain-adaptive state space model that partitions the latent representation space into a shared subspace for learning cross-domain commonalities and independent, domain-specific subspaces for capturing intra-domain discriminative features; (2) three distinct Domain Adapters, which serve as domain-aware proxies to bridge disparate domain distributions and facilitate the alignment of cross-domain commonalities. Extensive experiments demonstrate the generalization and efficiency of Damba-ST. It achieves state-of-the-art performance on prediction tasks and demonstrates strong zero-shot generalization, enabling seamless deployment in new urban environments without extensive retraining or fine-tuning.

[51] From Pixels and Words to Waves: A Unified Framework for Spectral Dictionary vLLMs

Andrew Kiruluta,Priscilla Burity

Main category: cs.CV

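TL;DR: This work proposes SDict-VLM, which replaces convolutions and self-attention with a spectral dictionary token mixer, achieving strong vision-language modeling performance at substantially lower computational cost.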

Details Motivation: State-of-the-art vision-language models rely on computationally intensive convolutions and self-attention for multimodal fusion, which limits their use in resource-constrained settings. The paper therefore aims to develop a more efficient, lighter-weight alternative. Method: A spectral dictionary token mixer is introduced that represents each image patch or wordpiece as a sparse combination of learnable frequency atoms, replacing conventional convolutions and self-attention. Result: SDict-VLM achieves BLEU-4 of 39.2, CIDEr of 127.5, and SPICE of 27.0 on MS-COCO captioning and 50.3% accuracy on VQAv2, closing approximately 85% of the performance gap to BLIP-2 while using 60% fewer parameters, 2.3 times less peak GPU memory, and 2.2 times faster inference than PaLI-3. Conclusion: SDict-VLM is a new vision-language model that eliminates both convolutions and self-attention, matching mid-scale transformer baselines while offering greater computational efficiency and interpretability. Abstract: Vision-language models (VLMs) unify computer vision and natural language processing in a single architecture capable of interpreting and describing images. Most state-of-the-art systems rely on two computationally intensive components: convolutions in the vision encoder and quadratic self-attention for multimodal fusion. This work removes both by introducing a spectral dictionary token mixer, which represents each image patch or wordpiece as a sparse combination of learnable frequency atoms. Our 1.1B-parameter prototype, SDict-VLM, achieves BLEU-4 of 39.2, CIDEr of 127.5, and SPICE of 27.0 on MS-COCO captioning, along with 50.3 percent accuracy on VQAv2. These results close approximately 85 percent of the performance gap to BLIP-2 while using 60 percent fewer parameters, 2.3 times less peak GPU memory, and 2.2 times faster inference than PaLI-3. To our knowledge, this is the first VLM to eliminate both convolutions and self-attention while matching mid-scale transformer baselines. In addition to its O(L log L) complexity, the shared frequency dictionary enables transparent cross-modal alignment and offers a tunable trade-off between accuracy and compute, paving the way for efficient and interpretable VLMs.

[52] DiffRIS: Enhancing Referring Remote Sensing Image Segmentation with Pre-trained Text-to-Image Diffusion Models

Zhe Dong,Yuzhe Sun,Tianzhu Liu,Yanfeng Gu

Main category: cs.CV

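TL;DR: The paper proposes DiffRIS to address the challenges that complex object characteristics pose for referring remote sensing image segmentation; a context perception adapter and a progressive cross-modal reasoning decoder yield better cross-modal alignment and more precise segmentation.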

Details Motivation: Existing RRSIS methods face significant challenges in processing aerial imagery because of complex object characteristics such as scale variations, diverse orientations, and the semantic ambiguities inherent to the overhead perspective. Method: The DiffRIS framework is proposed, comprising a context perception adapter (CP-adapter) and a progressive cross-modal reasoning decoder (PCMRD), which dynamically refine linguistic features and achieve fine-grained semantic alignment through multi-scale feature interaction. Result: Comprehensive experiments on three benchmark datasets show that DiffRIS consistently outperforms existing methods across all standard metrics, establishing a new state of the art for RRSIS tasks. Conclusion: By harnessing the semantic understanding capabilities of pre-trained text-to-image diffusion models, DiffRIS provides a new framework for enhanced cross-modal alignment in RRSIS tasks. Abstract: Referring remote sensing image segmentation (RRSIS) enables the precise delineation of regions within remote sensing imagery through natural language descriptions, serving critical applications in disaster response, urban development, and environmental monitoring. Despite recent advances, current approaches face significant challenges in processing aerial imagery due to complex object characteristics including scale variations, diverse orientations, and semantic ambiguities inherent to the overhead perspective. To address these limitations, we propose DiffRIS, a novel framework that harnesses the semantic understanding capabilities of pre-trained text-to-image diffusion models for enhanced cross-modal alignment in RRSIS tasks. Our framework introduces two key innovations: a context perception adapter (CP-adapter) that dynamically refines linguistic features through global context modeling and object-aware reasoning, and a progressive cross-modal reasoning decoder (PCMRD) that iteratively aligns textual descriptions with visual regions for precise segmentation. The CP-adapter bridges the domain gap between general vision-language understanding and remote sensing applications, while PCMRD enables fine-grained semantic alignment through multi-scale feature interaction. Comprehensive experiments on three benchmark datasets (RRSIS-D, RefSegRS, and RISBench) demonstrate that DiffRIS consistently outperforms existing methods across all standard metrics, establishing a new state-of-the-art for RRSIS tasks. 
The significant performance improvements validate the effectiveness of leveraging pre-trained diffusion models for remote sensing applications through our proposed adaptive framework.

[53] GLIMPSE: Gradient-Layer Importance Mapping for Prompted Visual Saliency Explanation for Generative LVLMs

Guanxi Shen

Main category: cs.CV

TL;DR: GLIMPSE is a new interpretability framework that visualizes key image regions and textual elements used by LVLMs in open-ended visual question answering, offering improved understanding of model behavior.

Details Motivation: Understanding where LVLMs focus their attention during free-form response generation is crucial for diagnosing hallucinations, exposing biases, and ensuring transparency. Method: The method combines gradient-weighted attention, adaptive layer propagation, and weighted token aggregation to generate attribution heat maps for visual and textual saliency in open-ended VQA. Result: GLIMPSE outperforms previous interpretability methods in human-alignment and enables detailed insights into multimodal reasoning dynamics, including attention misalignment, hallucination, and bias. Conclusion: GLIMPSE provides a lightweight and model-agnostic framework for interpreting LVLMs' cross-modal reasoning, enabling fine-grained analysis of attention alignment, hallucination, and bias. Abstract: Recent advances in large vision language models (LVLMs) have unlocked unprecedented capabilities in generating coherent responses from visual inputs. However, interpreting where LVLMs direct their visual attention while generating free-form textual responses remains a significant challenge, yet is essential for understanding model behavior, diagnosing hallucination, exposing bias and ensuring transparency. We introduce GLIMPSE (Gradient-Layer Importance Mapping for Prompted Visual Saliency Explanation), a lightweight, model-agnostic framework for visualizing the salient image regions that LVLMs rely upon during open-ended visual question answering (VQA), while concurrently revealing the multimodal textual saliency. GLIMPSE fuses gradient-weighted attention, adaptive layer propagation, and weighted token aggregation to produce holistic response-level attribution heat maps for interpreting cross-modal reasoning, outperforming prior interpretability methods in human-alignment. 
We demonstrate an analytic explainable AI (XAI) approach using GLIMPSE to uncover fine-grained insights into LVLM cross-modal attribution, trace token-level reasoning dynamics, and analyze systematic human-attention misalignment, hallucination, and bias.

[54] Diffusion Transformer-to-Mamba Distillation for High-Resolution Image Generation

Yuan Yao,Yicong Hong,Difan Liu,Long Mai,Feng Liu,Jiebo Luo

Main category: cs.CV

TL;DR: This paper proposes T2MD, a distillation-based training method that transitions from self-attention-based transformers to efficient Mamba models, enabling high-resolution image generation with reduced cost.

Details Motivation: The quadratic computational complexity of self-attention in DiT causes high costs in high-resolution image generation, and while Mamba offers linear complexity, its direct training is challenging. Method: A hybrid diffusion self-attention and Mamba model is introduced with layer-level teacher forcing and feature-based knowledge distillation to overcome training difficulties in Mamba models. Result: T2MD achieves low overhead but high-quality text-to-image generation, extending to 2048×2048 resolution via lightweight adaptation and fine-tuning. Conclusion: The proposed T2MD method successfully enables efficient training of Mamba models for high-resolution image generation, demonstrating feasibility and quality. Abstract: The quadratic computational complexity of self-attention in diffusion transformers (DiT) introduces substantial computational costs in high-resolution image generation. While the linear-complexity Mamba model emerges as a potential alternative, direct Mamba training remains empirically challenging. To address this issue, this paper introduces diffusion transformer-to-mamba distillation (T2MD), forming an efficient training pipeline that facilitates the transition from the self-attention-based transformer to the linear complexity state-space model Mamba. We establish a diffusion self-attention and Mamba hybrid model that simultaneously achieves efficiency and global dependencies. With the proposed layer-level teacher forcing and feature-based knowledge distillation, T2MD alleviates the training difficulty and high cost of a state space model from scratch. Starting from the distilled 512$\times$512 resolution base model, we push the generation towards 2048$\times$2048 images via lightweight adaptation and high-resolution fine-tuning. Experiments demonstrate that our training path leads to low overhead but high-quality text-to-image generation. 
Importantly, our results also justify the feasibility of using sequential and causal Mamba models for generating non-causal visual output, suggesting the potential for future exploration.

[55] Orthogonal Projection Subspace to Aggregate Online Prior-knowledge for Continual Test-time Adaptation

Jinlong Li,Dong Zhao,Qi Zang,Zequn Jie,Lin Ma,Nicu Sebe

Main category: cs.CV

TL;DR: This paper proposes OoPk, a novel method for Continual Test Time Adaptation that balances efficient model adaptation and strong performance by preserving knowledge integrity and enhancing domain adaptability.

Details Motivation: Existing CTTA methods struggle to balance performance and adaptation efficiency, especially in semantic segmentation tasks. Method: OoPk projects a tuning subspace orthogonally to preserve pre-trained knowledge and employs an online prior-knowledge aggregation strategy with image masking for domain adaptability. Result: Extensive experiments show that OoPk outperforms previous CTTA methods on continual TTA benchmarks in semantic segmentation. Conclusion: OoPk effectively addresses catastrophic forgetting and error accumulation in CTTA, achieving competitive performance across various benchmarks. Abstract: Continual Test Time Adaptation (CTTA) is a task that requires a source pre-trained model to continually adapt to new scenarios with changing target distributions. Existing CTTA methods primarily focus on mitigating the challenges of catastrophic forgetting and error accumulation. Though there have been emerging methods based on forgetting adaptation with parameter-efficient fine-tuning, they still struggle to balance competitive performance and efficient model adaptation, particularly in complex tasks like semantic segmentation. In this paper, to tackle the above issues, we propose a novel pipeline, Orthogonal Projection Subspace to aggregate online Prior-knowledge, dubbed OoPk. Specifically, we first project a tuning subspace orthogonally which allows the model to adapt to new domains while preserving the knowledge integrity of the pre-trained source model to alleviate catastrophic forgetting. Then, we elaborate an online prior-knowledge aggregation strategy that employs an aggressive yet efficient image masking strategy to mimic potential target dynamism, enhancing the student model's domain adaptability. This further gradually ameliorates the teacher model's knowledge, ensuring high-quality pseudo labels and reducing error accumulation. 
We demonstrate our method with extensive experiments that surpass previous CTTA methods and achieve competitive performances across various continual TTA benchmarks in semantic segmentation tasks.
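
The general idea of restricting updates to a subspace orthogonal to protected directions can be sketched in a few lines. The abstract does not specify how OoPk constructs or applies its tuning subspace, so the following is a generic linear-algebra illustration only; the random "protected" directions, the dimensions, and the names `orthonormal_basis` and `project_out` are assumptions.

```python
import numpy as np


def orthonormal_basis(M):
    """Orthonormal basis for the column space of M, via QR decomposition."""
    q, _ = np.linalg.qr(M)
    return q


def project_out(update, basis):
    """Remove from `update` its component inside span(basis)."""
    return update - basis @ (basis.T @ update)


rng = np.random.default_rng(0)
# Hypothetical directions that encode knowledge we want to leave untouched.
protected = orthonormal_basis(rng.normal(size=(16, 4)))
raw_update = rng.normal(size=16)
safe_update = project_out(raw_update, protected)
```

After projection, `safe_update` has no component along any protected direction, so applying it cannot change the model's response along those directions. That is the intuition behind using orthogonality to limit catastrophic forgetting.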

[56] LEGATO: Large-scale End-to-end Generalizable Approach to Typeset OMR

Guang Yang,Victoria Ebert,Nazif Tamer,Luiza Pozzobon,Noah A. Smith

Main category: cs.CV

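TL;DR: Legato is a new end-to-end transformer model for optical music recognition that recognizes full-page and multi-page typeset scores and generates human-readable ABC notation, achieving state-of-the-art performance.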

Details Motivation: Optical music recognition (OMR) lacks both a standardized evaluation and models that can handle complete scores, so a new approach is needed to improve recognition accuracy and applicability. Method: A pretrained vision encoder is combined with an ABC decoder trained on a dataset of more than 214K images to form the end-to-end transformer model Legato. Result: Legato generalizes well across various typeset scores and achieves state-of-the-art performance on a range of datasets. The authors also comprehensively compare the model against the previous state of the art. Conclusion: Legato is the first large-scale pretrained OMR model capable of recognizing full-page or multi-page scores and generating ABC notation; it generalizes well and reaches state-of-the-art performance on multiple datasets. Abstract: We propose Legato, a new end-to-end transformer model for optical music recognition (OMR). Legato is the first large-scale pretrained OMR model capable of recognizing full-page or multi-page typeset music scores and the first to generate documents in ABC notation, a concise, human-readable format for symbolic music. Bringing together a pretrained vision encoder with an ABC decoder trained on a dataset of more than 214K images, our model exhibits the strong ability to generalize across various typeset scores. We conduct experiments on a range of datasets and demonstrate that our model achieves state-of-the-art performance. Given the lack of a standardized evaluation for end-to-end OMR, we comprehensively compare our model against the previous state of the art using a diverse set of metrics.

[57] HAWAII: Hierarchical Visual Knowledge Transfer for Efficient Vision-Language Models

Yimu Wang,Mozhgan Nasr Azadani,Sean Sedwards,Krzysztof Czarnecki

Main category: cs.CV

TL;DR: HAWAII improves vision-language model performance by distilling knowledge from multiple visual experts into a single vision encoder efficiently, leveraging LoRA adapters and advanced distillation methods.

Details Motivation: Improving the visual understanding of vision-language models is essential, but using multiple pretrained visual experts often incurs high computational costs. HAWAII aims to address this challenge by minimizing overhead while preserving complementary strengths. Method: HAWAII uses teacher-specific Low-Rank Adaptation (LoRA) adapters with a router for knowledge distillation from multiple visual experts into a single vision encoder. It incorporates fine-grained and coarse-grained distillation techniques to effectively transfer knowledge. Result: Extensive experiments show that HAWAII outperforms popular open-source vision-language models in various tasks, highlighting its efficiency and effectiveness in knowledge distillation. Conclusion: HAWAII demonstrates superiority over popular open-source vision-language models in various tasks, achieving enhanced performance with minimal computational overhead. Abstract: Improving the visual understanding ability of vision-language models (VLMs) is crucial for enhancing their performance across various tasks. While using multiple pretrained visual experts has shown great promise, it often incurs significant computational costs during training and inference. To address this challenge, we propose HAWAII, a novel framework that distills knowledge from multiple visual experts into a single vision encoder, enabling it to inherit the complementary strengths of several experts with minimal computational overhead. To mitigate conflicts among different teachers and switch between different teacher-specific knowledge, instead of using a fixed set of adapters for multiple teachers, we propose to use teacher-specific Low-Rank Adaptation (LoRA) adapters with a corresponding router. Each adapter is aligned with a specific teacher, avoiding noisy guidance during distillation. To enable efficient knowledge distillation, we propose fine-grained and coarse-grained distillation. 
At the fine-grained level, token importance scores are employed to emphasize the most informative tokens from each teacher adaptively. At the coarse-grained level, we summarize the knowledge from multiple teachers and transfer it to the student using a set of general-knowledge LoRA adapters with a router. Extensive experiments on various vision-language tasks demonstrate the superiority of HAWAII, compared to the popular open-source VLMs.
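
A layer that combines a frozen weight with several Low-Rank Adaptation (LoRA) adapters selected by a router can be sketched as below. This is a generic illustration and not HAWAII's implementation: the softmax router, the dimensions, the random weights, and every name in the block are assumptions. The zero initialization of the up-projection `B` follows common LoRA practice, so the adapted layer starts out identical to the frozen one.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, n_teachers = 8, 6, 2, 3

W = rng.normal(size=(d_out, d_in))                    # frozen base weight
A = rng.normal(size=(n_teachers, rank, d_in)) * 0.01  # per-teacher down-projections
B = np.zeros((n_teachers, d_out, rank))               # per-teacher up-projections
W_router = rng.normal(size=(n_teachers, d_in))        # hypothetical router weights


def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def forward(x):
    """Base output plus a router-weighted sum of low-rank adapter outputs."""
    gates = softmax(W_router @ x)
    delta = sum(g * (B[i] @ (A[i] @ x)) for i, g in enumerate(gates))
    return W @ x + delta, gates


x = rng.normal(size=d_in)
y, gates = forward(x)
```

Each adapter adds only `rank * (d_in + d_out)` trainable parameters, which is why several teacher-specific adapters can sit alongside one shared encoder at low cost.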

[58] Reading Smiles: Proxy Bias in Foundation Models for Facial Emotion Recognition

Iosif Tsangko,Andreas Triantafyllopoulos,Adem Abdelmoula,Adria Mallol-Ragolta,Bjoern W. Schuller

Main category: cs.CV

TL;DR: This paper explores how Vision Language Models (VLMs), particularly GPT-4o, infer emotions by analyzing visual cues such as eyebrow position, revealing both promising psychological grounding and potential risks like bias and shortcut learning in affective computing applications.

Details Motivation: The motivation is to investigate whether Vision Language Models (VLMs) rely on psychologically grounded visual cues or superficial patterns when inferring emotions, especially in light of their growing use in sensitive domains like mental health and education. Method: The study benchmarks various scale VLMs on a teeth-annotated subset of the AffectNet dataset and conducts structured introspection on the best-performing model (GPT-4o) to understand its reliance on specific facial attributes for affective reasoning. Result: Performance shifts were observed depending on the presence of visible teeth in the dataset. Introspection of GPT-4o revealed that it relies heavily on facial attributes like eyebrow position for affective reasoning, showing internal consistency in valence-arousal predictions. Conclusion: The paper concludes that while VLMs like GPT-4o show emergent and internally consistent affective reasoning based on psychologically grounded cues such as eyebrow position, there are risks related to shortcut learning, bias, and fairness, particularly in sensitive domains. Abstract: Foundation Models (FMs) are rapidly transforming Affective Computing (AC), with Vision Language Models (VLMs) now capable of recognising emotions in zero shot settings. This paper probes a critical but underexplored question: what visual cues do these models rely on to infer affect, and are these cues psychologically grounded or superficially learnt? We benchmark varying scale VLMs on a teeth annotated subset of AffectNet dataset and find consistent performance shifts depending on the presence of visible teeth. Through structured introspection of the best-performing model, i.e., GPT-4o, we show that facial attributes like eyebrow position drive much of its affective reasoning, revealing a high degree of internal consistency in its valence-arousal predictions. 
These patterns highlight the emergent nature of FMs behaviour, but also reveal risks: shortcut learning, bias, and fairness issues especially in sensitive domains like mental health and education.

[59] RareSpot: Spotting Small and Rare Wildlife in Aerial Imagery with Multi-Scale Consistency and Context-Aware Augmentation

Bowen Zhang,Jesse T. Boulerice,Nikhil Kuniyil,Charvi Mendiratta,Satish Kumar,Hila Shamon,B. S. Manjunath

Main category: cs.CV

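TL;DR: This study introduces RareSpot, a new framework for automatically detecting small and rare wildlife in aerial imagery, which substantially outperforms existing methods.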

Details Motivation: Prairie dogs are ecologically important as a keystone species, but their small size, sparse distribution, and subtle visual features undermine existing detection approaches. Method: RareSpot, a robust detection framework integrating multi-scale consistency learning and context-aware augmentation, is proposed. Result: The method achieves state-of-the-art performance on an expert-annotated prairie dog drone imagery benchmark, improving detection accuracy by over 35% compared to baseline methods. Conclusion: RareSpot supports critical ecological monitoring and establishes a new foundation for detecting small, rare species in complex aerial scenes. Abstract: Automated detection of small and rare wildlife in aerial imagery is crucial for effective conservation, yet remains a significant technical challenge. Prairie dogs exemplify this issue: their ecological importance as keystone species contrasts sharply with their elusive presence--marked by small size, sparse distribution, and subtle visual features--which undermines existing detection approaches. To address these challenges, we propose RareSpot, a robust detection framework integrating multi-scale consistency learning and context-aware augmentation. Our multi-scale consistency approach leverages structured alignment across feature pyramids, enhancing fine-grained object representation and mitigating scale-related feature loss. Complementarily, context-aware augmentation strategically synthesizes challenging training instances by embedding difficult-to-detect samples into realistic environmental contexts, significantly boosting model precision and recall. Evaluated on an expert-annotated prairie dog drone imagery benchmark, our method achieves state-of-the-art performance, improving detection accuracy by over 35% compared to baseline methods. Importantly, it generalizes effectively across additional wildlife datasets, demonstrating broad applicability. The RareSpot benchmark and approach not only support critical ecological monitoring but also establish a new foundation for detecting small, rare species in complex aerial scenes.

[60] Inverse-and-Edit: Effective and Fast Image Editing by Cycle Consistency Models

Ilia Beletskii,Andrey Kuznetsov,Aibek Alanov

Main category: cs.CV

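TL;DR: This paper proposes an efficient image editing framework that uses consistency models for high-quality image inversion, enabling precise editing in just four steps while outperforming existing methods in both quality and efficiency.

Details Motivation: Existing diffusion models achieve impressive image editing results but are computationally intensive because of their iterative nature. Distilled diffusion models enable faster inference, but their editing capabilities are limited by poor inversion quality. High-fidelity inversion and reconstruction are essential for precise image editing. Method: Introduces a cycle-consistency optimization strategy that significantly improves reconstruction accuracy and enables a controllable trade-off between editability and content preservation. Result: Delivers an efficient image editing method that completes high-quality edits in just four steps and shows strong performance across various tasks and datasets. Conclusion: The work proposes a new framework based on consistency models that, by enhancing image inversion, achieves high-quality image editing in just four steps. The method also reaches state-of-the-art performance across multiple image editing tasks and datasets while being more efficient. Abstract: Recent advances in image editing with diffusion models have achieved impressive results, offering fine-grained control over the generation process. However, these methods are computationally intensive because of their iterative nature. While distilled diffusion models enable faster inference, their editing capabilities remain limited, primarily because of poor inversion quality. High-fidelity inversion and reconstruction are essential for precise image editing, as they preserve the structural and semantic integrity of the source image. In this work, we propose a novel framework that enhances image inversion using consistency models, enabling high-quality editing in just four steps. Our method introduces a cycle-consistency optimization strategy that significantly improves reconstruction accuracy and enables a controllable trade-off between editability and content preservation. We achieve state-of-the-art performance across various image editing tasks and datasets, demonstrating that our method matches or surpasses full-step diffusion models while being substantially more efficient. The code of our method is available on GitHub at https://github.com/ControlGenAI/Inverse-and-Edit.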

[61] PrITTI: Primitive-based Generation of Controllable and Editable 3D Semantic Scenes

Christina Ourania Tze,Daniel Dauner,Yiyi Liao,Dzmitry Tsishkou,Andreas Geiger

Main category: cs.CV

TL;DR: PrITTI uses 3D primitives and a diffusion model to generate high-quality, editable 3D semantic scenes, outperforming traditional voxel-based methods.

Details Motivation: Voxel-based representations are memory-intensive, bound by fixed resolutions, and hard to edit, whereas primitives are compact and easy to manipulate, making them well suited to generating 3D semantic scenes. Method: Adopts a hybrid representation that models ground surfaces in a rasterized format while encoding objects as vectorized 3D primitives, and introduces a stable Cholesky-based parameterization to resolve orientation ambiguities. Result: Experiments on the KITTI-360 dataset show that PrITTI outperforms a voxel-based baseline in generation quality, reduces memory requirements by up to $3\times$, and supports instance-level object manipulation and a range of downstream applications. Conclusion: PrITTI is a diffusion-based framework that uses primitives as its main building blocks to generate editable, controllable 3D semantic scene layouts. Abstract: Large-scale 3D semantic scene generation has predominantly relied on voxel-based representations, which are memory-intensive, bound by fixed resolutions, and challenging to edit. In contrast, primitives represent semantic entities using compact, coarse 3D structures that are easy to manipulate and compose, making them an ideal representation for this task. In this paper, we introduce PrITTI, a latent diffusion-based framework that leverages primitives as the main foundational elements for generating compositional, controllable, and editable 3D semantic scene layouts. Our method adopts a hybrid representation, modeling ground surfaces in a rasterized format while encoding objects as vectorized 3D primitives. This decomposition is also reflected in a structured latent representation that enables flexible scene manipulation of ground and object components. To overcome the orientation ambiguities in conventional encoding methods, we introduce a stable Cholesky-based parameterization that jointly encodes object size and orientation. Experiments on the KITTI-360 dataset show that PrITTI outperforms a voxel-based baseline in generation quality, while reducing memory requirements by up to $3\times$. In addition, PrITTI enables direct instance-level manipulation of objects in the scene and supports a range of downstream applications, including scene inpainting, outpainting, and photo-realistic street-view synthesis.
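
The idea of jointly encoding size and orientation through a Cholesky factor can be illustrated with a small 2D sketch. The abstract does not give the exact formulation, so the following is only one plausible reading and not the PrITTI implementation: it assumes a box with side scales `sx`, `sy` and yaw `theta` is summarized by the matrix Σ = R·diag(sx², sy²)·Rᵀ, and that the lower-triangular Cholesky factor of Σ is used as the encoding. The function name `cholesky_params` is hypothetical.

```python
import math

def cholesky_params(sx, sy, theta):
    """Encode a 2D box (scales sx, sy > 0, yaw theta) as the three entries
    of the Cholesky factor L of Sigma = R diag(sx^2, sy^2) R^T.

    Assumption for illustration only; the paper's parameterization may differ.
    """
    c, s = math.cos(theta), math.sin(theta)
    # Entries of the symmetric positive-definite matrix Sigma.
    a = c * c * sx * sx + s * s * sy * sy   # Sigma[0][0]
    b = c * s * (sx * sx - sy * sy)         # Sigma[0][1] == Sigma[1][0]
    d = s * s * sx * sx + c * c * sy * sy   # Sigma[1][1]
    # Closed-form 2x2 Cholesky factorization, Sigma = L L^T.
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(d - l21 * l21)
    return l11, l21, l22

# A box and the same box rotated by 180 degrees describe the same shape,
# and under this encoding they map to (numerically) the same parameters.
p = cholesky_params(4.0, 2.0, 0.3)
q = cholesky_params(4.0, 2.0, 0.3 + math.pi)
print(p, q)
```

Under this assumption, the encoding depends on the angle only through products such as cos², sin², and cos·sin, so yaw values that differ by a half turn no longer produce different targets.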

[62] Lightweight RGB-T Tracking with Mobile Vision Transformers

Mahdi Falaki,Maria A. Amer

Main category: cs.CV

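TL;DR: This paper proposes an efficient multimodal object tracking method: a lightweight model built on the MobileViT architecture with a progressive fusion framework, which substantially increases inference speed and reduces parameter count while preserving accuracy.

Details Motivation: Single-modality object tracking (e.g., RGB-only) performs poorly in low illumination or adverse weather, and existing Vision Transformer-based multimodal trackers perform well but are computationally expensive because of their large model sizes. The authors therefore aim to design a more efficient, lightweight multimodal tracking method. Method: The paper introduces a progressive fusion framework that uses separable attention to jointly learn intra-modal and inter-modal interactions between the template and search regions. It adopts the lightweight Mobile Vision Transformer architecture to reduce computational cost and improve inference efficiency. Result: Compared with state-of-the-art efficient multimodal trackers, the method achieves comparable accuracy with a significantly lower parameter count (fewer than 4 million) and the fastest GPU inference speed, 122 frames per second. Conclusion: The paper proposes a new lightweight RGB-T tracking algorithm based on MobileViT, the first application of MobileViT to RGB-T and multimodal tracking. The model maintains high accuracy while significantly reducing the parameter count and achieving faster inference, offering a new direction for future research. Abstract: Single-modality object tracking (e.g., RGB-only) encounters difficulties in challenging imaging conditions, such as low illumination and adverse weather conditions. To solve this, multimodal tracking (e.g., RGB-T models) aims to leverage complementary data such as thermal infrared features. While recent Vision Transformer-based multimodal trackers achieve strong performance, they are often computationally expensive due to large model sizes. In this work, we propose a novel lightweight RGB-T tracking algorithm based on Mobile Vision Transformers (MobileViT). Our tracker introduces a progressive fusion framework that jointly learns intra-modal and inter-modal interactions between the template and search regions using separable attention. This design produces effective feature representations that support more accurate target localization while achieving a small model size and fast inference speed. Compared to state-of-the-art efficient multimodal trackers, our model achieves comparable accuracy while offering significantly lower parameter counts (less than 4 million) and the fastest GPU inference speed of 122 frames per second. This paper is the first to propose a tracker using Mobile Vision Transformers for RGB-T tracking and multimodal tracking at large. Tracker code and model weights will be made publicly available upon acceptance.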

[63] PRISM: Perceptual Recognition for Identifying Standout Moments in Human-Centric Keyframe Extraction

Mert Can Cakmak,Nitin Agarwal,Diwash Poudel

Main category: cs.CV

TL;DR: This paper proposes PRISM, a lightweight keyframe extraction framework based on perceptual color difference metrics that is training-free and computationally efficient.

Details Motivation: Detecting the most impactful or "standout" moments in video content is crucial for content moderation, summarization, and forensic analysis. Method: PRISM operates in the CIELAB color space and uses perceptual color difference metrics to identify frames that align with human visual sensitivity. Result: PRISM achieves strong accuracy and fidelity while maintaining high compression ratios, and is well suited to real-time and resource-constrained environments. Conclusion: PRISM is an effective keyframe extraction tool for analyzing and moderating harmful or politically sensitive media on online platforms. Abstract: Online videos play a central role in shaping political discourse and amplifying cyber social threats such as misinformation, propaganda, and radicalization. Detecting the most impactful or "standout" moments in video content is crucial for content moderation, summarization, and forensic analysis. In this paper, we introduce PRISM (Perceptual Recognition for Identifying Standout Moments), a lightweight and perceptually-aligned framework for keyframe extraction. PRISM operates in the CIELAB color space and uses perceptual color difference metrics to identify frames that align with human visual sensitivity. Unlike deep learning-based approaches, PRISM is interpretable, training-free, and computationally efficient, making it well suited for real-time and resource-constrained environments. We evaluate PRISM on four benchmark datasets: BBC, TVSum, SumMe, and ClipShots, and demonstrate that it achieves strong accuracy and fidelity while maintaining high compression ratios. These results highlight PRISM's effectiveness in both structured and unstructured video content, and its potential as a scalable tool for analyzing and moderating harmful or politically sensitive media in online platforms.
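
A training-free, color-difference-based keyframe selector of the kind described above can be sketched in a few lines. This is an illustrative approximation rather than PRISM itself: the abstract does not say which color difference formula or frame statistic is used, so the sketch assumes the CIE76 ΔE (Euclidean distance in CIELAB) between per-frame mean colors and a hypothetical threshold of 10. Frames are plain lists of 8-bit sRGB pixels, and the function names are made up for the example.

```python
import math

def srgb_to_lab(rgb):
    """Convert one 8-bit sRGB pixel to CIELAB (D65 reference white)."""
    def linearize(c):
        c = c / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linearize(c) for c in rgb)
    # Linear sRGB -> CIE XYZ.
    x = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b

    def f(t):
        delta = 6 / 29
        return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29

    fx, fy, fz = f(x / 0.95047), f(y / 1.0), f(z / 1.08883)
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))

def mean_lab(frame):
    """Average CIELAB color of a frame given as a list of RGB pixels."""
    labs = [srgb_to_lab(p) for p in frame]
    return tuple(sum(lab[i] for lab in labs) / len(labs) for i in range(3))

def select_keyframes(frames, threshold=10.0):
    """Keep frame 0, then every frame whose mean color differs from the
    last kept keyframe by more than `threshold` (CIE76 Delta E)."""
    if not frames:
        return []
    keep = [0]
    reference = mean_lab(frames[0])
    for i in range(1, len(frames)):
        current = mean_lab(frames[i])
        if math.dist(reference, current) > threshold:
            keep.append(i)
            reference = current
    return keep

# Two near-identical red frames followed by two identical blue frames.
frames = [[(255, 0, 0)] * 4, [(254, 0, 0)] * 4, [(0, 0, 255)] * 4, [(0, 0, 255)] * 4]
print(select_keyframes(frames))
```

Comparing against the last kept keyframe instead of the immediately preceding frame is a design choice of this sketch: it lets a slow drift in color eventually trigger a new keyframe.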

[64] MOSCARD -- Causal Reasoning and De-confounding for Multimodal Opportunistic Screening of Cardiovascular Adverse Events

Jialu Pi,Juan Maria Farina,Rimita Lahiri,Jiwoong Jeong,Archana Gurudu,Hyung-Bok Park,Chieh-Ju Chao,Chadi Ayoub,Reza Arsanjani,Imon Banerjee

Main category: cs.CV

TL;DR: This paper proposes MOSCARD, a predictive modeling framework integrating chest X-rays (CXR) and electrocardiogram (ECG) data through multimodal causal reasoning to improve MACE risk assessment, enabling early interventions for better patient outcomes.

Details Motivation: Major Adverse Cardiovascular Events (MACE) are the leading cause of global mortality. Conventional models using clinical scores, CT measurements, or biomarkers have limitations due to sampling bias and single modality constraints, prompting the need for improved risk assessment methods. Method: MOSCARD uses multimodal causal reasoning with co-attention to align CXR and ECG data, integrating a dual back-propagation graph to mitigate bias and confounders in risk estimation. Result: The MOSCARD model outperformed single-modality and state-of-the-art foundational models on internal shift data from an emergency department and external MIMIC datasets, achieving AUC scores of 0.75, 0.83, and 0.71 respectively. Conclusion: The proposed MOSCARD framework enables cost-effective opportunistic screening for MACE risk assessment, supporting early intervention to improve patient outcomes and reduce disparities. Abstract: Major Adverse Cardiovascular Events (MACE) remain the leading cause of mortality globally, as reported in the Global Disease Burden Study 2021. Opportunistic screening leverages data collected from routine health check-ups and multimodal data can play a key role to identify at-risk individuals. Chest X-rays (CXR) provide insights into chronic conditions contributing to major adverse cardiovascular events (MACE), while 12-lead electrocardiogram (ECG) directly assesses cardiac electrical activity and structural abnormalities. Integrating CXR and ECG could offer a more comprehensive risk assessment than conventional models, which rely on clinical scores, computed tomography (CT) measurements, or biomarkers, which may be limited by sampling bias and single modality constraints. We propose a novel predictive modeling framework - MOSCARD, multimodal causal reasoning with co-attention to align two distinct modalities and simultaneously mitigate bias and confounders in opportunistic risk estimation. 
Primary technical contributions are - (i) multimodal alignment of CXR with ECG guidance; (ii) integration of causal reasoning; (iii) dual back-propagation graph for de-confounding. Evaluated on internal, shift data from emergency department (ED) and external MIMIC datasets, our model outperformed single modality and state-of-the-art foundational models - AUC: 0.75, 0.83, 0.71 respectively. Proposed cost-effective opportunistic screening enables early intervention, improving patient outcomes and reducing disparities.

[65] OpenWildlife: Open-Vocabulary Multi-Species Wildlife Detector for Geographically-Diverse Aerial Imagery

Muhammed Patel,Javier Noa Turnes,Jayden Hsiao,Linlin Xu,David Clausi

Main category: cs.CV

TL;DR: This paper proposes OpenWildlife, an open-vocabulary wildlife detector that identifies species in terrestrial and marine environments from natural language inputs, and introduces an efficient search algorithm that captures most species with modest computational resources.

Details Motivation: Existing automated methods perform well in specific settings but often struggle to generalize across species and environments because of limited taxonomic coverage and rigid model architectures. Method: OW leverages language-aware embeddings and a novel adaptation of the Grounding-DINO framework, together with an efficient search algorithm that combines k-nearest neighbors and breadth-first search to locate areas where species are likely to be found. Result: Trained on 15 datasets, OW achieves up to 0.981 mAP50 with fine-tuning and 0.597 mAP50 on seven datasets featuring novel species; its search algorithm captures over 95% of species while exploring only 33% of the available images. Conclusion: As an open-vocabulary wildlife detector, OW offers a flexible, cost-effective solution for global biodiversity assessments. Abstract: We introduce OpenWildlife (OW), an open-vocabulary wildlife detector designed for multi-species identification in diverse aerial imagery. While existing automated methods perform well in specific settings, they often struggle to generalize across different species and environments due to limited taxonomic coverage and rigid model architectures. In contrast, OW leverages language-aware embeddings and a novel adaptation of the Grounding-DINO framework, enabling it to identify species specified through natural language inputs across both terrestrial and marine environments. Trained on 15 datasets, OW outperforms most existing methods, achieving up to 0.981 mAP50 with fine-tuning and 0.597 mAP50 on seven datasets featuring novel species. Additionally, we introduce an efficient search algorithm that combines k-nearest neighbors and breadth-first search to prioritize areas where social species are likely to be found. This approach captures over 95% of species while exploring only 33% of the available images. To support reproducibility, we publicly release our source code and dataset splits, establishing OW as a flexible, cost-effective solution for global biodiversity assessments.
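
One way a search that combines k-nearest neighbors with breadth-first search could work is sketched below. This is an assumption-laden illustration, not the released OW algorithm: it assumes each image has a 2D ground position, that a hypothetical callable `has_animals(i)` stands in for running the detector on image `i`, and that the search only expands to the neighbors of images where something was found.

```python
import math
from collections import deque

def knn_graph(coords, k):
    """For each image position, list the indices of its k nearest neighbors."""
    neighbours = []
    for i, p in enumerate(coords):
        others = sorted(
            (j for j in range(len(coords)) if j != i),
            key=lambda j: math.dist(p, coords[j]),
        )
        neighbours.append(others[:k])
    return neighbours

def bfs_search(coords, has_animals, seeds, k=2):
    """Breadth-first exploration over the k-NN graph.

    Images are inspected in queue order; only images with detections
    enqueue their neighbors, so empty regions are abandoned early.
    Returns (images with detections, set of all inspected images).
    """
    graph = knn_graph(coords, k)
    visited = set()
    hits = []
    queue = deque(seeds)
    while queue:
        i = queue.popleft()
        if i in visited:
            continue
        visited.add(i)
        if has_animals(i):
            hits.append(i)
            queue.extend(j for j in graph[i] if j not in visited)
    return hits, visited

# Six images along a transect; animals cluster in the first three.
coords = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
occupied = {0, 1, 2}
hits, inspected = bfs_search(coords, lambda i: i in occupied, seeds=[0, 3])
print(hits, sorted(inspected))
```

In this toy run the whole cluster is found while two of the six images are never inspected, which mirrors the intuition that social species make neighboring images worth checking first.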

[66] Ancient Script Image Recognition and Processing: A Review

Xiaolei Diao,Rite Bo,Yanling Xiao,Lida Shi,Zhihan Zhou,Hao Xu,Chuntao Li,Xiongfeng Tang,Massimo Poesio,Cédric M. John,Daqian Shi

Main category: cs.CV

TL;DR: This paper surveys methods for ancient script image recognition, discusses their challenges and solutions, and outlines future research directions.

Details Motivation: Automating ancient script image recognition is important for large-scale interpretation and for advancing research in archaeology and digital humanities. Method: Categorizes existing studies, analyzes their respective recognition methods, and systematically examines the challenges unique to ancient scripts and their solutions. Result: Provides a comprehensive review of ancient script image recognition methods, covering approaches for different script types, shared strategies, and unique challenges with their solutions. Conclusion: The paper summarizes existing methods for ancient script image recognition and proposes future research directions. Abstract: Ancient scripts, e.g., Egyptian hieroglyphs, Oracle Bone Inscriptions, and Ancient Greek inscriptions, serve as vital carriers of human civilization, embedding invaluable historical and cultural information. Automating ancient script image recognition has gained importance, enabling large-scale interpretation and advancing research in archaeology and digital humanities. With the rise of deep learning, this field has progressed rapidly, with numerous script-specific datasets and models proposed. While these scripts vary widely, spanning phonographic systems with limited glyphs to logographic systems with thousands of complex symbols, they share common challenges and methodological overlaps. Moreover, ancient scripts face unique challenges, including imbalanced data distribution and image degradation, which have driven the development of various dedicated methods. This survey provides a comprehensive review of ancient script image recognition methods. We begin by categorizing existing studies based on script types and analyzing respective recognition methods, highlighting both their differences and shared strategies. We then focus on challenges unique to ancient scripts, systematically examining their impact and reviewing recent solutions, including few-shot learning and noise-robust techniques. Finally, we summarize current limitations and outline promising future directions. Our goal is to offer a structured, forward-looking perspective to support ongoing advancements in the recognition, interpretation, and decipherment of ancient scripts.

[67] MedErr-CT: A Visual Question Answering Benchmark for Identifying and Correcting Errors in CT Reports

Sunggu Kyung,Hyungbin Park,Jinyoung Seo,Jimin Sung,Jihyun Kim,Dongyeong Kim,Wooyoung Jo,Yoojin Nam,Sangah Park,Taehee Kwon,Sang Min Lee,Namkug Kim

Main category: cs.CV

TL;DR: This paper proposes MedErr-CT, a benchmark for evaluating medical MLLMs' ability to identify and correct CT report errors, aiming to enhance the reliability and clinical applicability of MLLMs.

Details Motivation: The increasing demand for CT examinations has raised concerns about diagnostic errors. Existing medical VQA benchmarks lack clinical relevance and fail to assess expert-level knowledge. Method: The paper introduces MedErr-CT, a novel benchmark organized into three task levels: classification, detection, and correction. It evaluates state-of-the-art 3D medical MLLMs across six error categories. Result: The evaluation of advanced 3D medical MLLMs using MedErr-CT revealed substantial variation in their capabilities across different error types. Conclusion: MedErr-CT contributes to developing more reliable MLLMs, helping reduce diagnostic errors and improve accuracy in clinical practice. Abstract: Computed Tomography (CT) plays a crucial role in clinical diagnosis, but the growing demand for CT examinations has raised concerns about diagnostic errors. While Multimodal Large Language Models (MLLMs) demonstrate promising comprehension of medical knowledge, their tendency to produce inaccurate information highlights the need for rigorous validation. However, existing medical visual question answering (VQA) benchmarks primarily focus on simple visual recognition tasks, lacking clinical relevance and failing to assess expert-level knowledge. We introduce MedErr-CT, a novel benchmark for evaluating medical MLLMs' ability to identify and correct errors in CT reports through a VQA framework. The benchmark includes six error categories - four vision-centric errors (Omission, Insertion, Direction, Size) and two lexical error types (Unit, Typo) - and is organized into three task levels: classification, detection, and correction. Using this benchmark, we quantitatively assess the performance of state-of-the-art 3D medical MLLMs, revealing substantial variation in their capabilities across different error types. 
Our benchmark contributes to the development of more reliable and clinically applicable MLLMs, ultimately helping reduce diagnostic errors and improve accuracy in clinical practice. The code and datasets are available at https://github.com/babbu3682/MedErr-CT.

[68] Video-XL-2: Towards Very Long-Video Understanding Through Task-Aware KV Sparsification

Minghao Qin,Xiangrui Liu,Zhengyang Liang,Yan Shu,Huaying Yuan,Juenjie Zhou,Shitao Xiao,Bo Zhao,Zheng Liu

Main category: cs.CV

TL;DR: To address the high memory and computational costs of existing models in long-video understanding, this paper proposes Video-XL-2, a new multimodal large language model that achieves efficient long-video understanding and state-of-the-art performance through task-aware KV sparsification.

Details Motivation: Existing multimodal large language models face high memory and computational costs when processing long video inputs, so a more efficient approach is needed to achieve both strong performance and efficient long-video understanding. Method: Video-XL-2 uses two key steps, chunk-based pre-filling and bi-level key-value decoding, to improve memory efficiency and enhance the model's ability to capture fine-grained information. Result: Video-XL-2 achieves state-of-the-art performance on various long-video understanding benchmarks, can process over 10,000 frames on a single NVIDIA A100 (80GB) GPU, and handles thousands of frames in just a few seconds. Conclusion: Video-XL-2 is a new MLLM based on task-aware KV sparsification that achieves state-of-the-art performance in long-video understanding with high efficiency. Abstract: Multi-modal large language models (MLLMs) have made significant progress in video understanding over the past few years. However, processing long video inputs remains a major challenge due to high memory and computational costs. This makes it difficult for current models to achieve both strong performance and high efficiency in long video understanding. To address this challenge, we propose Video-XL-2, a novel MLLM that delivers superior cost-effectiveness for long-video understanding based on task-aware KV sparsification. The proposed framework operates with two key steps: chunk-based pre-filling and bi-level key-value decoding. Chunk-based pre-filling divides the visual token sequence into chunks, applying full attention within each chunk and sparse attention across chunks. This significantly reduces computational and memory overhead. During decoding, bi-level key-value decoding selectively reloads either dense or sparse key-values for each chunk based on its relevance to the task. This approach further improves memory efficiency and enhances the model's ability to capture fine-grained information. Video-XL-2 achieves state-of-the-art performance on various long video understanding benchmarks, outperforming existing open-source lightweight models. It also demonstrates exceptional efficiency, capable of processing over 10,000 frames on a single NVIDIA A100 (80GB) GPU and thousands of frames in just a few seconds.
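
The attention pattern behind chunk-based pre-filling, full attention inside a chunk and sparse attention across chunks, can be pictured as a boolean mask. The sketch below is only a toy model of that pattern: the abstract does not say how the sparse cross-chunk tokens are chosen, so it assumes causal attention and a hypothetical fixed stride (`keep_every`) that decides which tokens of earlier chunks stay visible.

```python
def chunk_sparse_mask(num_tokens, chunk_size, keep_every):
    """Build mask[q][k] == True when query token q may attend to key token k.

    Rules assumed for illustration:
      * causal: a token only sees itself and earlier tokens;
      * dense within a chunk: all earlier tokens of the same chunk are visible;
      * sparse across chunks: from earlier chunks, only every
        `keep_every`-th token (by offset inside its chunk) is visible.
    """
    mask = [[False] * num_tokens for _ in range(num_tokens)]
    for q in range(num_tokens):
        for k in range(q + 1):
            same_chunk = q // chunk_size == k // chunk_size
            kept_token = (k % chunk_size) % keep_every == 0
            mask[q][k] = same_chunk or kept_token
    return mask

mask = chunk_sparse_mask(num_tokens=8, chunk_size=4, keep_every=2)
for row in mask:
    print("".join("x" if visible else "." for visible in row))

# Number of attended (query, key) pairs versus a dense causal mask.
sparse_pairs = sum(sum(row) for row in mask)
dense_pairs = 8 * 9 // 2
print(sparse_pairs, dense_pairs)
```

With longer sequences and more chunks, the cross-chunk part of the mask grows with the number of kept tokens instead of the full sequence length, which is where the memory and compute savings would come from under these assumptions.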

[69] MSR-Align: Policy-Grounded Multimodal Alignment for Safety-Aware Reasoning in Vision-Language Models

Yinan Xia,Yilei Jiang,Yingshui Tan,Xiaoyong Zhu,Xiangyu Yue,Bo Zheng

Main category: cs.CV

TL;DR: This paper proposes MSR-Align, a multimodal safety reasoning dataset that addresses the shortcomings of existing safety alignment methods and datasets in order to improve the safety of vision-language models on multimodal inputs.

Details Motivation: Existing safety alignment approaches are designed mainly for unimodal language models and cannot address the complex, nuanced threats posed by multimodal inputs; in addition, current safety datasets lack the fine-grained, policy-grounded reasoning needed to robustly align reasoning-capable vision-language models. Method: Develops the MSR-Align dataset, which supports fine-grained, deliberative reasoning over standardized safety policies, with a data generation pipeline that emphasizes multimodal diversity, policy-grounded reasoning, and rigorous quality filtering using strong multimodal judges. Result: Vision-language models fine-tuned on MSR-Align show substantially improved robustness against both textual and vision-language jailbreak attacks while preserving or enhancing general reasoning performance. Conclusion: MSR-Align is a high-quality multimodal safety reasoning dataset that provides a scalable and effective foundation for improving the safety alignment of reasoning-capable vision-language models. Abstract: Vision-Language Models (VLMs) have achieved remarkable progress in multimodal reasoning tasks through enhanced chain-of-thought capabilities. However, this advancement also introduces novel safety risks, as these models become increasingly vulnerable to harmful multimodal prompts that can trigger unethical or unsafe behaviors. Existing safety alignment approaches, primarily designed for unimodal language models, fall short in addressing the complex and nuanced threats posed by multimodal inputs. Moreover, current safety datasets lack the fine-grained, policy-grounded reasoning required to robustly align reasoning-capable VLMs. In this work, we introduce MSR-Align, a high-quality Multimodal Safety Reasoning dataset tailored to bridge this gap. MSR-Align supports fine-grained, deliberative reasoning over standardized safety policies across both vision and text modalities. Our data generation pipeline emphasizes multimodal diversity, policy-grounded reasoning, and rigorous quality filtering using strong multimodal judges. Extensive experiments demonstrate that fine-tuning VLMs on MSR-Align substantially improves robustness against both textual and vision-language jailbreak attacks, while preserving or enhancing general reasoning performance. MSR-Align provides a scalable and effective foundation for advancing the safety alignment of reasoning-capable VLMs. Our dataset is made publicly available at https://huggingface.co/datasets/Leigest/MSR-Align.

[70] Automated Image Recognition Framework

Quang-Binh Nguyen,Trong-Vu Hoang,Ngoc-Do Tran,Tam V. Nguyen,Minh-Triet Tran,Trung-Nghia Le

Main category: cs.CV

TL;DR: This paper proposes the Automated Image Recognition (AIR) framework, which uses generative AI to synthesize high-quality, pre-annotated datasets and automatically trains deep learning models, significantly improving efficiency in image recognition tasks.

Details Motivation: The motivation stems from the significant time and resource challenges involved in gathering and annotating data for specific tasks, especially for novel or sensitive subjects lacking relevant datasets. This work aims to automate and streamline the dataset creation and model training process. Method: The AIR framework includes two main data synthesis processes: AIR-Gen for generating datasets tailored to user specifications and AIR-Aug for enhancing existing datasets. It also features an automated prompt engineering module using large language models and a distribution adjustment algorithm to improve dataset quality. Result: Comprehensive experiments demonstrated the efficacy of the AIR framework in generating high-quality datasets and training deep learning models with strong image recognition performance. The framework proved particularly beneficial when limited data was available for specific tasks. Conclusion: The proposed AIR framework effectively generates high-quality, pre-annotated datasets using generative AI, enabling efficient training of deep learning models with robust image recognition performance. The system is well-received by users, as evidenced by a 4.4/5.0 score in a user study. Abstract: While the efficacy of deep learning models heavily relies on data, gathering and annotating data for specific tasks, particularly when addressing novel or sensitive subjects lacking relevant datasets, poses significant time and resource challenges. In response to this, we propose a novel Automated Image Recognition (AIR) framework that harnesses the power of generative AI. AIR empowers end-users to synthesize high-quality, pre-annotated datasets, eliminating the necessity for manual labeling. It also automatically trains deep learning models on the generated datasets with robust image recognition performance. Our framework includes two main data synthesis processes, AIR-Gen and AIR-Aug. 
The AIR-Gen enables end-users to seamlessly generate datasets tailored to their specifications. To improve image quality, we introduce a novel automated prompt engineering module that leverages the capabilities of large language models. We also introduce a distribution adjustment algorithm to eliminate duplicates and outliers, enhancing the robustness and reliability of generated datasets. On the other hand, the AIR-Aug enhances a given dataset, thereby improving the performance of deep classifier models. AIR-Aug is particularly beneficial when users have limited data for specific tasks. Through comprehensive experiments, we demonstrated the efficacy of our generated data in training deep learning models and showcased the system's potential to provide image recognition models for a wide range of objects. We also conducted a user study that achieved an impressive score of 4.4 out of 5.0, underscoring the AI community's positive perception of AIR.

[71] 3D-SSM: A Novel 3D Selective Scan Module for Remote Sensing Change Detection

Rui Huang,Jincheng Zeng,Sen Gao,Yan Xing

Main category: cs.CV

TL;DR: The paper proposes a new method for remote sensing change detection that improves upon existing Mamba-based approaches by capturing long-range dependencies between image channels.

Details Motivation: Existing Mamba-based approaches in remote sensing change detection are limited by their inability to capture long-range dependencies between image channels effectively, which restricts their feature representation capabilities. Method: The method involves a 3D selective scan module (3D-SSM), a spatiotemporal interaction module (SIM), and a multi-branch feature extraction module (MBFEM). Result: In extensive experiments on five benchmark datasets, the proposed method performs favourably against state-of-the-art change detection methods. Conclusion: The 3D-SSM, together with the SIM and MBFEM built on it, offers a competitive approach to remote sensing change detection. Abstract: Existing Mamba-based approaches in remote sensing change detection have enhanced scanning models, yet remain limited by their inability to capture long-range dependencies between image channels effectively, which restricts their feature representation capabilities. To address this limitation, we propose a 3D selective scan module (3D-SSM) that captures global information from both the spatial plane and channel perspectives, enabling a more comprehensive understanding of the data. Based on the 3D-SSM, we present two key components: a spatiotemporal interaction module (SIM) and a multi-branch feature extraction module (MBFEM). The SIM facilitates bi-temporal feature integration by enabling interactions between global and local features across images from different time points, thereby enhancing the detection of subtle changes. Meanwhile, the MBFEM combines features from the frequency domain, spatial domain, and 3D-SSM to provide a rich representation of contextual information within the image. Our proposed method demonstrates favourable performance compared to state-of-the-art change detection methods on five benchmark datasets through extensive experiments. Code is available at https://github.com/VerdantMist/3D-SSM

[72] Self-Paced Collaborative and Adversarial Network for Unsupervised Domain Adaptation

Weichen Zhang,Dong Xu,Wanli Ouyang,Wen Li

Main category: cs.CV

TL;DR: This paper introduces a novel unsupervised domain adaptation technique named Collaborative and Adversarial Network (CAN) and its enhanced version Self-Paced CAN (SPCAN), achieving excellent results on multiple benchmark datasets by combining domain-collaborative and domain-adversarial learning strategies.

Details Motivation: The motivation behind this research is to develop an effective unsupervised domain adaptation method that learns both domain-specific and domain-invariant feature representations to reduce domain distribution mismatch and improve performance across different domains. Method: The paper proposes a new unsupervised domain adaptation approach called Collaborative and Adversarial Network (CAN), which uses domain-collaborative and domain-adversarial learning strategies. Additionally, Self-Paced CAN (SPCAN) is introduced to enhance discriminability in the target domain using a self-paced learning strategy. Result: Comprehensive experiments on benchmark datasets such as Office-31, ImageCLEF-DA, VISDA-2017, UCF101-10, and HMDB51-10 show that the proposed approaches achieve state-of-the-art performance in object and video action recognition tasks. Conclusion: The paper concludes that the proposed CAN and SPCAN approaches achieve state-of-the-art performance on various benchmark datasets for object and video action recognition tasks, demonstrating their effectiveness for unsupervised domain adaptation. Abstract: This paper proposes a new unsupervised domain adaptation approach called Collaborative and Adversarial Network (CAN), which uses the domain-collaborative and domain-adversarial learning strategy for training the neural network. The domain-collaborative learning aims to learn domain-specific feature representation to preserve the discriminability for the target domain, while the domain adversarial learning aims to learn domain-invariant feature representation to reduce the domain distribution mismatch between the source and target domains. We show that these two learning strategies can be uniformly formulated as domain classifier learning with positive or negative weights on the losses. 
We then design a collaborative and adversarial training scheme, which automatically learns domain-specific representations from lower blocks in CNNs through collaborative learning and domain-invariant representations from higher blocks through adversarial learning. Moreover, to further enhance the discriminability in the target domain, we propose Self-Paced CAN (SPCAN), which progressively selects pseudo-labeled target samples for re-training the classifiers. We employ a self-paced learning strategy to select pseudo-labeled target samples in an easy-to-hard fashion. Comprehensive experiments on different benchmark datasets, Office-31, ImageCLEF-DA, and VISDA-2017 for the object recognition task, and UCF101-10 and HMDB51-10 for the video action recognition task, show our newly proposed approaches achieve the state-of-the-art performance, which clearly demonstrates the effectiveness of our proposed approaches for unsupervised domain adaptation.
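
The easy-to-hard selection of pseudo-labeled target samples can be sketched as a confidence-ranked schedule. This is a simplified illustration, not the SPCAN procedure: it assumes "easy" means high maximum class probability and that the share of selected samples simply grows each round. The function name and the fractions used are hypothetical.

```python
def select_pseudo_labels(probs, fraction):
    """Pick the most confident `fraction` of target samples.

    probs:    list of per-sample class-probability lists.
    fraction: share of samples to keep this round, between 0 and 1.
    Returns a list of (sample_index, pseudo_label), most confident first.
    """
    scored = []
    for index, p in enumerate(probs):
        confidence = max(p)
        scored.append((confidence, index, p.index(confidence)))
    # Highest confidence first: these are treated as the "easy" samples.
    scored.sort(key=lambda item: -item[0])
    count = int(len(probs) * fraction)
    return [(index, label) for _, index, label in scored[:count]]

# Classifier outputs on four unlabeled target samples (two classes).
probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.55, 0.45]]

# Self-paced schedule: start with the easiest half, then admit everything.
for fraction in (0.5, 1.0):
    print(fraction, select_pseudo_labels(probs, fraction))
```

In a full training loop the classifier would be re-trained on the selected pairs after each round and the probabilities recomputed before the next, larger selection.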

[73] AirV2X: Unified Air-Ground Vehicle-to-Everything Collaboration

Xiangbo Gao,Yuheng Wu,Xuewen Luo,Keshu Wu,Xinghao Chen,Yuping Wang,Chenxi Liu,Yang Zhou,Zhengzhong Tu

Main category: cs.CV

TL;DR: This paper introduces AirV2X-Perception, a UAV-based dataset for autonomous driving, enabling more flexible and affordable vehicle-to-drone communication solutions.

Details Motivation: Traditional V2X systems are limited by high costs and 'uncovered danger zones' in rural areas. UAVs offer a dynamic, low-cost solution with improved perception capabilities. Method: A large-scale dataset was created using UAVs to simulate driving scenarios across various environments, aiming to replace or complement traditional RSUs. Result: AirV2X-Perception dataset includes 6.73 hours of drone-assisted driving data across urban, suburban, and rural settings under diverse conditions. Conclusion: The AirV2X-Perception dataset contributes to the development and evaluation of V2D algorithms, offering a flexible, cost-effective alternative for enhancing autonomous driving systems. Abstract: While multi-vehicular collaborative driving demonstrates clear advantages over single-vehicle autonomy, traditional infrastructure-based V2X systems remain constrained by substantial deployment costs and the creation of "uncovered danger zones" in rural and suburban areas. We present AirV2X-Perception, a large-scale dataset that leverages Unmanned Aerial Vehicles (UAVs) as a flexible alternative or complement to fixed Road-Side Units (RSUs). Drones offer unique advantages over ground-based perception: complementary bird's-eye-views that reduce occlusions, dynamic positioning capabilities that enable hovering, patrolling, and escorting navigation rules, and significantly lower deployment costs compared to fixed infrastructure. Our dataset comprises 6.73 hours of drone-assisted driving scenarios across urban, suburban, and rural environments with varied weather and lighting conditions. The AirV2X-Perception dataset facilitates the development and standardized evaluation of Vehicle-to-Drone (V2D) algorithms, addressing a critical gap in the rapidly expanding field of aerial-assisted autonomous driving systems. The dataset and development kits are open-sourced at https://github.com/taco-group/AirV2X-Perception.

[74] Da Yu: Towards USV-Based Image Captioning for Waterway Surveillance and Scene Understanding

Runwei Guan,Ningwei Ouyang,Tianhao Xu,Shaofeng Liang,Wei Dai,Yafeng Sun,Shang Gao,Songning Lai,Shanliang Yao,Xuming Hu,Ryan Wen Liu,Yutao Yue,Hui Xiong

Main category: cs.CV

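TL;DR: This paper introduces WaterCaption, the first captioning dataset for waterway environments, and proposes Da Yu, an efficient multimodal large language model whose Nano Transformer Adaptor (NTA) enables high-quality long-text generation.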

Details Motivation: Because waterway environments are complex, existing perception datasets and models cannot achieve global semantic understanding of waterways, which limits large-scale monitoring and structured log generation. Method: Proposes Da Yu, an edge-deployable multimodal large language model for USVs, together with a new vision-to-language projector, the Nano Transformer Adaptor (NTA). Result: Builds WaterCaption, a dataset of 20.2k image-text pairs with a vocabulary size of 1.8 million; Da Yu outperforms existing methods on multiple benchmarks. Conclusion: Da Yu achieves an optimal balance between performance and efficiency, surpassing state-of-the-art models on WaterCaption and several other captioning benchmarks. Abstract: Automated waterway environment perception is crucial for enabling unmanned surface vessels (USVs) to understand their surroundings and make informed decisions. Most existing waterway perception models primarily focus on instance-level object perception paradigms (e.g., detection, segmentation). However, due to the complexity of waterway environments, current perception datasets and models fail to achieve global semantic understanding of waterways, limiting large-scale monitoring and structured log generation. With the advancement of vision-language models (VLMs), we leverage image captioning to introduce WaterCaption, the first captioning dataset specifically designed for waterway environments. WaterCaption focuses on fine-grained, multi-region long-text descriptions, providing a new research direction for visual geo-understanding and spatial scene cognition. Specifically, it includes 20.2k image-text pairs with a vocabulary size of 1.8 million. Additionally, we propose Da Yu, an edge-deployable multi-modal large language model for USVs, where we propose a novel vision-to-language projector called Nano Transformer Adaptor (NTA). NTA effectively balances computational efficiency with the capacity for both global and fine-grained local modeling of visual features, thereby significantly enhancing the model's ability to generate long-form textual outputs. Da Yu achieves an optimal balance between performance and efficiency, surpassing state-of-the-art models on WaterCaption and several other captioning benchmarks.

[75] HoliGS: Holistic Gaussian Splatting for Embodied View Synthesis

Xiaoyuan Wang,Yizhou Zhao,Botao Ye,Xiaojun Shan,Weijie Lyu,Lu Qi,Kelvin C. K. Chan,Yinxiao Li,Ming-Hsuan Yang

Main category: cs.CV

TL;DR: Proposes HoliGS, a deformable Gaussian splatting framework for embodied view synthesis from long monocular RGB videos.

Details Motivation: Existing 4D Gaussian splatting and dynamic NeRF pipelines struggle with training overhead on minute-long captures. Method: Decomposes the scene into a static background plus time-varying objects, each represented through global rigid transformations, skeleton-driven articulation, and subtle non-rigid deformations via an invertible neural flow. Result: Experiments show superior reconstruction quality on challenging datasets and significantly reduced training and rendering time compared with state-of-the-art monocular deformable NeRFs. Conclusion: HoliGS offers a practical and scalable solution for EVS in real-world scenarios. Abstract: We propose HoliGS, a novel deformable Gaussian splatting framework that addresses embodied view synthesis from long monocular RGB videos. Unlike prior 4D Gaussian splatting and dynamic NeRF pipelines, which struggle with training overhead in minute-long captures, our method leverages invertible Gaussian Splatting deformation networks to reconstruct large-scale, dynamic environments accurately. Specifically, we decompose each scene into a static background plus time-varying objects, each represented by learned Gaussian primitives undergoing global rigid transformations, skeleton-driven articulation, and subtle non-rigid deformations via an invertible neural flow. This hierarchical warping strategy enables robust free-viewpoint novel-view rendering from various embodied camera trajectories by attaching Gaussians to a complete canonical foreground shape (e.g., egocentric or third-person follow), which may involve substantial viewpoint changes and interactions between multiple actors. Our experiments demonstrate that HoliGS achieves superior reconstruction quality on challenging datasets while significantly reducing both training and rendering time compared to state-of-the-art monocular deformable NeRFs. These results highlight a practical and scalable solution for EVS in real-world scenarios. The source code will be released.

[76] Open-Vocabulary Camouflaged Object Segmentation with Cascaded Vision Language Models

Kai Zhao,Wubang Yuan,Zheng Wang,Guanyi Li,Xiaoqiang Zhu,Deng-ping Fan,Dan Zeng

Main category: cs.CV

TL;DR: This paper introduces a novel VLM-guided framework that improves open-vocabulary camouflaged object segmentation and classification by integrating VLM semantics with the Segment Anything Model.

Details Motivation: Existing methods face challenges due to domain gaps between full-image training and cropped-region inference, and generic segmentation models are ineffective for camouflaged objects with subtle boundaries. Method: A cascaded framework combining the Segment Anything Model (SAM) and Vision Language Models (VLMs), using VLM-derived features as prompts to guide segmentation and a soft spatial prior via the alpha channel for classification. Result: Extensive experiments show the method outperforms previous approaches on both OVCOS and conventional camouflaged object segmentation benchmarks. Conclusion: The proposed VLM-guided cascaded framework demonstrates superior performance in OVCOS by effectively leveraging SAM and VLM for accurate segmentation and classification of camouflaged objects. Abstract: Open-Vocabulary Camouflaged Object Segmentation (OVCOS) seeks to segment and classify camouflaged objects from arbitrary categories, presenting unique challenges due to visual ambiguity and unseen categories. Recent approaches typically adopt a two-stage paradigm: first segmenting objects, then classifying the segmented regions using Vision Language Models (VLMs). However, these methods (1) suffer from a domain gap caused by the mismatch between VLMs' full-image training and cropped-region inference, and (2) depend on generic segmentation models optimized for well-delineated objects, making them less effective for camouflaged objects. Without explicit guidance, generic segmentation models often overlook subtle boundaries, leading to imprecise segmentation. In this paper, we introduce a novel VLM-guided cascaded framework to address these issues in OVCOS. For segmentation, we leverage the Segment Anything Model (SAM), guided by the VLM. Our framework uses VLM-derived features as explicit prompts to SAM, effectively directing attention to camouflaged regions and significantly improving localization accuracy. For classification, we avoid the domain gap introduced by hard cropping. Instead, we treat the segmentation output as a soft spatial prior via the alpha channel, which retains the full image context while providing precise spatial guidance, leading to more accurate and context-aware classification of camouflaged objects. The same VLM is shared across both segmentation and classification to ensure efficiency and semantic consistency. Extensive experiments on both OVCOS and conventional camouflaged object segmentation benchmarks demonstrate the clear superiority of our method, highlighting the effectiveness of leveraging rich VLM semantics for both segmentation and classification of camouflaged objects.
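
The contrast between hard cropping and an alpha-channel prior can be made concrete with a minimal NumPy sketch. This is an illustration only, not the authors' implementation: the function names `hard_crop` and `soft_alpha_prior` are hypothetical, and the abstract does not say how the VLM consumes the fourth channel.

```python
import numpy as np

def hard_crop(image, mask, threshold=0.5):
    """Baseline: keep only the mask's bounding box, discarding scene context.

    Assumes the mask has at least one pixel above the threshold.
    """
    ys, xs = np.nonzero(mask > threshold)
    return image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]

def soft_alpha_prior(image, mask):
    """Keep the full image and append the soft mask as a fourth (alpha) channel."""
    alpha = np.clip(mask, 0.0, 1.0)[..., None]  # shape (H, W, 1)
    return np.concatenate([image, alpha], axis=-1)  # shape (H, W, 4)

# Toy example: an 8x8 RGB image with a soft mask over a 3x4 region.
image = np.random.default_rng(0).random((8, 8, 3))
mask = np.zeros((8, 8))
mask[2:5, 3:7] = 0.9

cropped = hard_crop(image, mask)      # context outside the box is lost
rgba = soft_alpha_prior(image, mask)  # full context kept, mask rides along
```

The cropped input changes size with every object, while the RGBA input keeps the original resolution and leaves the classifier free to weigh the surroundings.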

[77] Airway Skill Assessment with Spatiotemporal Attention Mechanisms Using Human Gaze

Jean-Paul Ainam,Rahul,Lora Cavuoto,Matthew Hackett,Jack Norfleet,Suvranu De

Main category: cs.CV

TL;DR: A machine learning-based system using human gaze data and video recordings is proposed for assessing airway management skills, specifically endotracheal intubation, demonstrating improved accuracy and efficiency over conventional methods.

Details Motivation: Traditional subjective evaluation methods often fail to accurately assess competency in real-world emergency medicine scenarios, necessitating a more reliable and objective approach. Method: An autoencoder network extracts features from videos while an attention module generates attention from visual masks derived from human gaze data to classify successful and unsuccessful ETI procedures. Result: The proposed method demonstrated improvements in prediction accuracy, sensitivity, and trustworthiness compared to traditional approaches. Conclusion: The integration of human gaze data into machine learning models provides a robust and objective assessment tool for clinical skills, particularly in high-stress environments. Abstract: Airway management skills are critical in emergency medicine and are typically assessed through subjective evaluation, often failing to gauge competency in real-world scenarios. This paper proposes a machine learning-based approach for assessing airway skills, specifically endotracheal intubation (ETI), using human gaze data and video recordings. The proposed system leverages an attention mechanism guided by the human gaze to enhance the recognition of successful and unsuccessful ETI procedures. Visual masks were created from gaze points to guide the model in focusing on task-relevant areas, reducing irrelevant features. An autoencoder network extracts features from the videos, while an attention module generates attention from the visual masks, and a classifier outputs a classification score. This method, the first to use human gaze for ETI, demonstrates improved accuracy and efficiency over traditional methods. The integration of human gaze data not only enhances model performance but also offers a robust, objective assessment tool for clinical skills, particularly in high-stress environments such as military settings. The results show improvements in prediction accuracy, sensitivity, and trustworthiness, highlighting the potential for this approach to improve clinical training and patient outcomes in emergency medicine.

[78] Capturing Fine-Grained Alignments Improves 3D Affordance Detection

Junsei Tokumitsu,Yuiga Wada

Main category: cs.CV

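TL;DR: This paper introduces a new method that uses a pretrained language model for fine-grained alignment, achieving better affordance detection in 3D point clouds.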

Details Motivation: Existing methods show limited performance on standard benchmarks because they rely on simple cosine similarity between point cloud and text embeddings, which lacks the expressiveness needed for fine-grained reasoning. Method: An Affordance Query Module (AQM) that leverages a pretrained language model to efficiently capture fine-grained alignment between point clouds and text. Result: Experiments show that the method outperforms existing approaches in accuracy and mean Intersection over Union on the 3D AffordanceNet dataset. Conclusion: The paper proposes LM-AD, a new method for affordance detection in point clouds, whose Affordance Query Module (AQM) effectively captures fine-grained alignment between point clouds and text. Abstract: In this work, we address the challenge of affordance detection in 3D point clouds, a task that requires effectively capturing fine-grained alignments between point clouds and text. Existing methods often struggle to model such alignments, resulting in limited performance on standard benchmarks. A key limitation of these approaches is their reliance on simple cosine similarity between point cloud and text embeddings, which lacks the expressiveness needed for fine-grained reasoning. To address this limitation, we propose LM-AD, a novel method for affordance detection in 3D point clouds. Moreover, we introduce the Affordance Query Module (AQM), which efficiently captures fine-grained alignment between point clouds and text by leveraging a pretrained language model. We demonstrated that our method outperformed existing approaches in terms of accuracy and mean Intersection over Union on the 3D AffordanceNet dataset.

[79] Progressive Modality Cooperation for Multi-Modality Domain Adaptation

Weichen Zhang,Dong Xu,Jing Zhang,Wanli Ouyang

Main category: cs.CV

TL;DR: This paper proposes the PMC framework for multi-modality domain adaptation, which leverages multiple modalities to improve knowledge transfer between domains, especially when some modalities are missing in the target domain.

Details Motivation: To improve knowledge transfer in multi-modality domain adaptation by utilizing multiple modalities and addressing challenges like missing modalities and domain distribution mismatches. Method: Progressive Modality Cooperation (PMC) framework and PMC with Privileged Information (PMC-PI) method incorporating a Multi-modality Data Generation (MMG) network using adversarial learning and weighted pseudo semantics. Result: Extensive experiments showed that the PMC framework under both MMDA and MMDA-PI settings is effective for various cross-domain visual recognition tasks on multiple datasets. Conclusion: The proposed PMC and PMC-PI methods effectively enhance multi-modality domain adaptation by leveraging modality-specific and modality-integrated information, as well as generating missing modalities in the target domain. Abstract: In this work, we propose a new generic multi-modality domain adaptation framework called Progressive Modality Cooperation (PMC) to transfer the knowledge learned from the source domain to the target domain by exploiting multiple modality clues (e.g., RGB and depth) under the multi-modality domain adaptation (MMDA) and the more general multi-modality domain adaptation using privileged information (MMDA-PI) settings. Under the MMDA setting, the samples in both domains have all the modalities. In two newly proposed modules of our PMC, the multiple modalities are cooperated for selecting the reliable pseudo-labeled target samples, which captures the modality-specific information and modality-integrated information, respectively. Under the MMDA-PI setting, some modalities are missing in the target domain. Hence, to better exploit the multi-modality data in the source domain, we further propose the PMC with privileged information (PMC-PI) method by proposing a new multi-modality data generation (MMG) network. MMG generates the missing modalities in the target domain based on the source domain data by considering both domain distribution mismatch and semantics preservation, which are respectively achieved by using adversarial learning and conditioning on weighted pseudo semantics. Extensive experiments on three image datasets and eight video datasets for various multi-modality cross-domain visual recognition tasks under both MMDA and MMDA-PI settings clearly demonstrate the effectiveness of our proposed PMC framework.

[80] Continual Retinal Vision-Language Pre-training upon Incremental Imaging Modalities

Yuang Yao,Ruiqi Wu,Yi Zhou,Tao Zhou

Main category: cs.CV

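TL;DR: RetCoP is a new continual vision-language pre-training framework that integrates multi-modal retinal images, using a rehearsal strategy and information distillation to reduce forgetting and improve generalization.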

Details Motivation: Traditional models focus on single-modal tasks and ignore the complementarity of fundus modalities, while existing retinal foundation models remain modality-specific and cannot handle incremental data integration in dynamic environments. Method: Proposes RetCoP, a continual vision-language pre-training framework that uses a rehearsal strategy and off-diagonal information distillation to mitigate catastrophic forgetting. Result: Experiments show that RetCoP outperforms all compared methods, with the best generalization and the lowest forgetting rate. Conclusion: RetCoP effectively addresses the integration of multi-modal retinal images in dynamic environments, achieving better generalization than existing methods and the lowest forgetting rate. Abstract: Traditional fundus image analysis models focus on single-modal tasks, ignoring fundus modality complementarity, which limits their versatility. Recently, retinal foundation models have emerged, but most still remain modality-specific. Integrating multiple fundus imaging modalities into a single foundation model is valuable. However, in dynamic environments, data from different modalities often arrive incrementally, necessitating continual pre-training. To address this, we propose RetCoP, the first continual vision-language pre-training framework in the fundus domain, which incrementally integrates image and text features from different imaging modalities into a single unified foundation model. To mitigate catastrophic forgetting in continual pre-training, we introduce a rehearsal strategy utilizing representative image-text pairs and an off-diagonal information distillation approach. The former allows the model to revisit knowledge from previous stages, while the latter explicitly preserves the alignment between image and text representations. Experiments show that RetCoP outperforms all the compared methods, achieving the best generalization and lowest forgetting rate. The code can be found at https://github.com/Yuang-Yao/RetCoP.

[81] Memory-Augmented Incomplete Multimodal Survival Prediction via Cross-Slide and Gene-Attentive Hypergraph Learning

Mingcheng Qu,Guang Yang,Donglin Di,Yue Gao,Tonghua Su,Yang Song,Lei Fan

Main category: cs.CV

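TL;DR: This paper proposes a new multimodal fusion framework for cancer survival prediction that uses hypergraph learning and a memory mechanism to improve prediction performance and handle imbalanced and missing modalities.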

Details Motivation: Existing cancer survival prediction methods mainly integrate FFPE slides with genomic data and neglect other preservation slides such as FF slides; the high-resolution spatial nature of pathology data also makes cross-modality fusion difficult and causes modality imbalance. In addition, these methods usually require complete data modalities, which limits their clinical applicability. Method: Introduces a hypergraph-learning approach combined with a memory mechanism that stores previously learned paired pathology-genomic features and dynamically compensates for incomplete modalities. Result: On five TCGA datasets, the model outperforms advanced methods by over 2.3% in C-Index; under incomplete-modality scenarios it surpasses pathology-only (3.3%) and gene-only (7.9%) models. Conclusion: The study proposes a new multimodal survival prediction framework that uses hypergraph learning and a memory mechanism to integrate multi-WSI information and cross-modality interactions between pathology and genomic data while addressing modality imbalance. Abstract: Multimodal pathology-genomic analysis is critical for cancer survival prediction. However, existing approaches predominantly integrate formalin-fixed paraffin-embedded (FFPE) slides with genomic data, while neglecting the availability of other preservation slides, such as Fresh Frozen (FF) slides. Moreover, as the high-resolution spatial nature of pathology data tends to dominate the cross-modality fusion process, it hinders effective multimodal fusion and leads to modality imbalance challenges between pathology and genomics. These methods also typically require complete data modalities, limiting their clinical applicability with incomplete modalities, such as missing either pathology or genomic data. In this paper, we propose a multimodal survival prediction framework that leverages hypergraph learning to effectively integrate multi-WSI information and cross-modality interactions between pathology slides and genomics data while addressing modality imbalance. In addition, we introduce a memory mechanism that stores previously learned paired pathology-genomic features and dynamically compensates for incomplete modalities. Experiments on five TCGA datasets demonstrate that our model outperforms advanced methods by over 2.3% in C-Index. Under incomplete modality scenarios, our approach surpasses pathology-only (3.3%) and gene-only models (7.9%). Code: https://github.com/MCPathology/M2Surv
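
For readers unfamiliar with the C-Index quoted above, the following is a minimal sketch of Harrell's concordance index in plain Python. It assumes the common convention that a higher predicted risk should correspond to a shorter survival time; the function name and the toy data are illustrative and are not taken from the paper or its code.

```python
from itertools import combinations

def concordance_index(times, events, risks):
    """Harrell's C-index: share of comparable pairs that the risk scores rank correctly."""
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` is the subject with the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # Comparable only if the times differ and the earlier subject had an event
        # (a censored earlier subject tells us nothing about the ordering).
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0
        elif risks[a] == risks[b]:
            concordant += 0.5  # ties in predicted risk count as half
    return concordant / comparable if comparable else float("nan")

# Toy cohort: the third subject is censored (event flag 0).
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
risks = [0.9, 0.7, 0.8, 0.1]
c_index = concordance_index(times, events, risks)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives a sense of scale for gains of a few percentage points.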

[82] Comparative Performance of Finetuned ImageNet Pre-trained Models for Electronic Component Classification

Yidi Shao,Longfei Zhou,Fangshuo Tang,Xinyi Shi,Dalang Chen,Shengtao Xia

Main category: cs.CV

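TL;DR: This study evaluates twelve ImageNet pre-trained models on electronic component classification and finds that all achieve good results, with MobileNet-V2 performing best.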

Details Motivation: Electronic component classification and detection are crucial in manufacturing, significantly reducing labor costs and promoting technological and industrial development. Method: Compares the performance of twelve ImageNet pre-trained models in classifying electronic components. Result: All tested models delivered respectable accuracies. MobileNet-V2 recorded the highest accuracy at 99.95%, while EfficientNet-B0 had the lowest at 92.26%. Conclusion: ImageNet pre-trained models offer substantial benefits in image classification tasks, and the results confirm the practical applicability of these methods in the electronics manufacturing sector. Abstract: Electronic component classification and detection are crucial in manufacturing industries, significantly reducing labor costs and promoting technological and industrial development. Pre-trained models, especially those trained on ImageNet, are highly effective in image classification, allowing researchers to achieve excellent results even with limited data. This paper compares the performance of twelve ImageNet pre-trained models in classifying electronic components. Our findings show that all models tested delivered respectable accuracies. MobileNet-V2 recorded the highest at 99.95%, while EfficientNet-B0 had the lowest at 92.26%. These results underscore the substantial benefits of using ImageNet pre-trained models in image classification tasks and confirm the practical applicability of these methods in the electronics manufacturing sector.

[83] Segment Any 3D-Part in a Scene from a Sentence

Hongyu Wu,Pengwan Yang,Yuki M. Asano,Cees G. M. Snoek

Main category: cs.CV

TL;DR: This paper introduces the 3D-PU dataset and OpenPart3D framework to enable effective part-level segmentation in 3D scenes, overcoming data and methodological challenges.

Details Motivation: The motivation stems from the limitations of existing datasets and methods in achieving part-level 3D scene understanding due to expensive data acquisition and annotation processes. Method: The authors introduce the 3D-PU dataset, created using a cost-effective synthetic method with fine-grained part annotations. They also propose OpenPart3D, a 3D-input-only framework designed to address challenges in part-level segmentation. Result: The experiments show that the proposed approach outperforms previous methods in part-level 3D scene understanding tasks, particularly in terms of generalization across various 3D scene datasets. Conclusion: The paper concludes that the proposed OpenPart3D framework and the introduced 3D-PU dataset significantly advance part-level 3D scene understanding, demonstrating strong performance and generalization capabilities in open-vocabulary tasks. Abstract: This paper aims to achieve the segmentation of any 3D part in a scene based on natural language descriptions, extending beyond traditional object-level 3D scene understanding and addressing both data and methodological challenges. Due to the expensive acquisition and annotation burden, existing datasets and methods are predominantly limited to object-level comprehension. To overcome the limitations of data and annotation availability, we introduce the 3D-PU dataset, the first large-scale 3D dataset with dense part annotations, created through an innovative and cost-effective method for constructing synthetic 3D scenes with fine-grained part-level annotations, paving the way for advanced 3D-part scene understanding. On the methodological side, we propose OpenPart3D, a 3D-input-only framework to effectively tackle the challenges of part-level segmentation. Extensive experiments demonstrate the superiority of our approach in open-vocabulary 3D scene understanding tasks at the part level, with strong generalization capabilities across various 3D scene datasets.

[84] Trajectory Prediction in Dynamic Object Tracking: A Critical Study

Zhongping Dong,Liming Chen,Mohand Tahar Kechadi

Main category: cs.CV

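TL;DR: This paper surveys technical progress in dynamic object tracking and trajectory prediction, discusses their effectiveness and challenges across application domains, and proposes directions for future improvement.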

Details Motivation: Dynamic object tracking and trajectory prediction are widely used in domains such as automotive, security, healthcare, and industrial automation, but challenges remain in generalization, computational efficiency, and ethics. Method: Analyzes feature-based, segmentation-based, estimation-based, and learning-based methods, and evaluates their effectiveness, deployment, and limitations in real-world scenarios. Result: The study finds that these technologies contribute significantly to safety and efficiency, but data dependency, generalization, and privacy protection still need to be addressed. Conclusion: The study summarizes current progress in dynamic object tracking and trajectory prediction, emphasizes their importance across multiple domains, and proposes future research directions such as multimodal data integration, semantic information fusion, and ethical and privacy-preserving frameworks. Abstract: This study provides a detailed analysis of current advancements in dynamic object tracking (DOT) and trajectory prediction (TP) methodologies, including their applications and challenges. It covers various approaches, such as feature-based, segmentation-based, estimation-based, and learning-based methods, evaluating their effectiveness, deployment, and limitations in real-world scenarios. The study highlights the significant impact of these technologies in automotive and autonomous vehicles, surveillance and security, healthcare, and industrial automation, contributing to safety and efficiency. Despite the progress, challenges such as improved generalization, computational efficiency, reduced data dependency, and ethical considerations still exist. The study suggests future research directions to address these challenges, emphasizing the importance of multimodal data integration, semantic information fusion, and developing context-aware systems, along with ethical and privacy-preserving frameworks.

[85] Image Segmentation using Chan-Vese Active Contours

Pranav Shenoy K. P

Main category: cs.CV

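TL;DR: This paper presents a comprehensive derivation and implementation of the Chan-Vese active contour model, which is based on the Mumford-Shah variational framework and evolves contours using regional intensity differences rather than image gradients, making it suitable for segmenting noisy images or images with weak boundaries.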

Details Motivation: Because the Chan-Vese active contour model segments images using regional intensity differences rather than image gradients, it has a clear advantage on noisy images and images with weak boundaries, which motivates a full derivation and implementation of the model. Method: The paper applies the Mumford-Shah variational framework and derives the level set formulation mathematically, treating each energy term in detail using the divergence theorem and curve evolution theory. The algorithm is implemented in Python with finite difference methods, with particular attention to numerical stability through an upwind entropy scheme and curvature-based regularization. Result: Experiments on medical and synthetic images show accurate segmentation, strong robustness to noise, and better performance than classical edge-based methods. Conclusion: The study confirms the suitability of the Chan-Vese model for complex segmentation tasks and highlights its potential in real-world applications. Abstract: This paper presents a comprehensive derivation and implementation of the Chan-Vese active contour model for image segmentation. The model, derived from the Mumford-Shah variational framework, evolves contours based on regional intensity differences rather than image gradients, making it highly effective for segmenting noisy images or images with weak boundaries. We provide a rigorous mathematical derivation of the level set formulation, including detailed treatment of each energy term using the divergence theorem and curve evolution theory. The resulting algorithm is implemented in Python using finite difference methods with special care to numerical stability, including an upwind entropy scheme and curvature-based regularization. Experimental results on medical and synthetic images demonstrate accurate segmentation, robustness to noise, and superior performance compared to classical edge-based methods. This study confirms the suitability of the Chan-Vese model for complex segmentation tasks and highlights its potential for use in real-world imaging applications.
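
The region-driven evolution described above can be illustrated with a heavily simplified NumPy sketch. It is not the paper's implementation: it assumes a two-phase model with unit region weights, replaces the smoothed Dirac delta factor with 1 so that all level sets are updated, and uses central differences rather than the upwind entropy scheme the abstract mentions. The name `chan_vese_sketch` and the parameters `mu`, `dt`, and `n_iter` are illustrative choices.

```python
import numpy as np

def curvature(phi, eps=1e-8):
    """Curvature div(grad(phi) / |grad(phi)|) via central finite differences."""
    gy, gx = np.gradient(phi)
    norm = np.sqrt(gx ** 2 + gy ** 2) + eps
    nyy, _ = np.gradient(gy / norm)  # d/dy of the y-component of the unit normal
    _, nxx = np.gradient(gx / norm)  # d/dx of the x-component of the unit normal
    return nxx + nyy

def chan_vese_sketch(image, n_iter=200, mu=0.1, dt=0.5):
    """Simplified two-phase Chan-Vese style evolution; returns the inside mask."""
    image = image.astype(float)
    h, w = image.shape
    yy, xx = np.mgrid[:h, :w]
    # Initial level set: positive inside a centred circle, negative outside.
    phi = min(h, w) / 3.0 - np.sqrt((yy - h / 2.0) ** 2 + (xx - w / 2.0) ** 2)
    for _ in range(n_iter):
        inside = phi > 0
        # Region means c1 (inside) and c2 (outside) of the current contour.
        c1 = image[inside].mean() if inside.any() else 0.0
        c2 = image[~inside].mean() if (~inside).any() else 0.0
        # Curvature regularisation plus the two region-fitting terms.
        force = mu * curvature(phi) - (image - c1) ** 2 + (image - c2) ** 2
        phi = phi + dt * force
    return phi > 0

# Synthetic test image: a bright square on a dark background with Gaussian noise.
rng = np.random.default_rng(0)
truth = np.zeros((64, 64), dtype=bool)
truth[20:44, 20:44] = True
noisy = truth.astype(float) + rng.normal(0.0, 0.1, truth.shape)
mask = chan_vese_sketch(noisy)
```

Because the update depends only on how well each pixel matches the two region means, no edge detector is involved, which is why the approach tolerates noise and weak boundaries.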

[86] Training-Free Motion Customization for Distilled Video Generators with Adaptive Test-Time Distillation

Jintao Rong,Xin Xie,Xinyi Yu,Linlin Ou,Xinyu Zhang,Chunhua Shen,Dong Gong

Main category: cs.CV

TL;DR: Proposes MotionEcho, a new framework that enables motion customization in distilled video generation models while preserving efficiency.

Details Motivation: Existing training-free methods fail to generalize to distilled models because of the accelerated generative process and large denoising steps. Method: Uses high-quality, slow teacher models to guide the inference of fast student models through endpoint prediction and interpolation, and dynamically allocates computation according to guidance needs. Result: Extensive experiments show that the method significantly improves motion fidelity and generation quality while preserving high efficiency. Conclusion: MotionEcho is a training-free test-time distillation framework that enables motion customization through diffusion teacher forcing. Abstract: Distilled video generation models offer fast and efficient synthesis but struggle with motion customization when guided by reference videos, especially under training-free settings. Existing training-free methods, originally designed for standard diffusion models, fail to generalize due to the accelerated generative process and large denoising steps in distilled models. To address this, we propose MotionEcho, a novel training-free test-time distillation framework that enables motion customization by leveraging diffusion teacher forcing. Our approach uses high-quality, slow teacher models to guide the inference of fast student models through endpoint prediction and interpolation. To maintain efficiency, we dynamically allocate computation across timesteps according to guidance needs. Extensive experiments across various distilled video generation models and benchmark datasets demonstrate that our method significantly improves motion fidelity and generation quality while preserving high efficiency. Project page: https://euminds.github.io/motionecho/

[87] Online camera-pose-free stereo endoscopic tissue deformation recovery with tissue-invariant vision-biomechanics consistency

Jiahe Chen,Naoki Tomii,Ichiro Sakuma,Etsuko Kobayashi

Main category: cs.CV

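TL;DR: This study proposes a new online method for modeling tissue geometry and deformation that addresses several problems of prior work, and validates its accuracy and stability on multiple datasets.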

Details Motivation: Previous research suffers from problems caused by camera motion, occlusion, large tissue deformation, lack of tissue-specific biomechanical priors, and reliance on offline processing. Method: The method models tissue geometry as a 3D point and derivative map and tissue deformation as a 3D displacement and local deformation map, and introduces the concept of a canonical map to optimize tissue geometry and deformation online. Result: The 3D reconstruction accuracy in non-occluded and occluded areas reaches 0.37±0.27 mm and 0.39±0.21 mm in surface distance, respectively, and the method can estimate surface strain distribution as an extra modality for mechanical analysis. Conclusion: The method stably models tissue geometry and deformation, achieves high 3D reconstruction accuracy in both non-occluded and occluded areas, and can also estimate surface strain distribution as an extra modality for mechanical analysis. Abstract: Tissue deformation recovery based on stereo endoscopic images is crucial for tool-tissue interaction analysis and benefits surgical navigation and autonomous soft tissue manipulation. Previous research suffers from problems arising from camera motion, occlusion, large tissue deformation, lack of tissue-specific biomechanical priors, and reliance on offline processing. Unlike previous studies where the tissue geometry and deformation are represented by 3D points and displacements, the proposed method models tissue geometry as the 3D point and derivative map and tissue deformation as the 3D displacement and local deformation map. For a single surface point, 6 parameters are used to describe its rigid motion and 3 parameters for its local deformation. The method is formulated under the camera-centric setting, where all motions are regarded as the scene motion with respect to the camera. Inter-frame alignment is realized by optimizing the inter-frame deformation, making it unnecessary to estimate camera pose. The concept of the canonical map is introduced to optimize tissue geometry and deformation in an online approach. Quantitative and qualitative experiments were conducted using in vivo and ex vivo laparoscopic datasets. With the inputs of depth and optical flow, the method stably models tissue geometry and deformation even when the tissue is partially occluded or moving outside the field of view. Results show that the 3D reconstruction accuracy in the non-occluded and occluded areas reaches 0.37±0.27 mm and 0.39±0.21 mm in terms of surface distance, respectively. The method can also estimate surface strain distribution during various manipulations as an extra modality for mechanical-based analysis.

[88] Emergence of Text Readability in Vision Language Models

Jaeyoo Park,Sanghyuk Chun,Wonjae Kim,Sangdoo Yun,Bohyung Han

Main category: cs.CV

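TL;DR: This study examines how vision-language models acquire the ability to recognize text in images during training, finds that this ability develops later than semantic understanding, and suggests that tailored training strategies are needed to accelerate it.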

Details Motivation: To understand how vision-language models develop the ability to recognize textual content in images during training, and to explore why this ability emerges late. Method: Analyzes how the ability to recognize textual content in images develops over the course of training Vision-Language Models (VLMs). Result: The ability to read textual information in an image (text readability) emerges abruptly after substantial training iterations, whereas semantic content understanding develops gradually from the early stages of training. The ability to match images with rendered text develops even more slowly, indicating a deeper need for semantic integration. Conclusion: The study highlights the need for tailored training strategies to accelerate text comprehension in VLMs, laying the groundwork for future research on optimizing multimodal learning. Abstract: We investigate how the ability to recognize textual content within images emerges during the training of Vision-Language Models (VLMs). Our analysis reveals a critical phenomenon: the ability to read textual information in a given image (text readability) emerges abruptly after substantial training iterations, in contrast to semantic content understanding which develops gradually from the early stages of training. This delayed emergence may reflect how contrastive learning tends to initially prioritize general semantic understanding, with text-specific symbolic processing developing later. Interestingly, the ability to match images with rendered text develops even slower, indicating a deeper need for semantic integration. These findings highlight the need for tailored training strategies to accelerate robust text comprehension in VLMs, laying the groundwork for future research on optimizing multimodal learning.

[89] Generate the Forest before the Trees -- A Hierarchical Diffusion model for Climate Downscaling

Declan J. Curran,Sanaa Hobeichi,Hira Saleem,Hao Xue,Flora D. Salim

Main category: cs.CV

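TL;DR: HDD is an efficient diffusion model for climate downscaling that reduces the computational burden compared with traditional methods and can be readily applied to a variety of CMIP6 models.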

Details Motivation: Existing downscaling methods for generating high-resolution climate data are computationally expensive, and AI-based models likewise demand heavy computation, which creates a need for more lightweight models. Method: The HDD (Hierarchical Diffusion Downscaling) model adds a coarse-to-fine hierarchical sampling process to the existing diffusion framework. The hierarchy is built with a simple downsampling scheme and used to generate high-resolution data progressively. Result: HDD achieves competitive accuracy on ERA5 reanalysis datasets and CMIP6 models while running on up to half as many pixels, which substantially lowers the computational load. A single model trained at 0.25° resolution also transfers directly to multiple CMIP6 models with coarser resolution. Conclusion: HDD is a computationally efficient method for probabilistic climate downscaling that can support affordable high-resolution climate projections and large-ensemble simulations. Abstract: Downscaling is essential for generating the high-resolution climate data needed for local planning, but traditional methods remain computationally demanding. Recent years have seen impressive results from AI downscaling models, particularly diffusion models, which have attracted attention due to their ability to generate ensembles and overcome the smoothing problem common in other AI methods. However, these models typically remain computationally intensive. We introduce a Hierarchical Diffusion Downscaling (HDD) model, which introduces an easily-extensible hierarchical sampling process to the diffusion framework. A coarse-to-fine hierarchy is imposed via a simple downsampling scheme. HDD achieves competitive accuracy on ERA5 reanalysis datasets and CMIP6 models, significantly reducing computational load by running on up to half as many pixels with competitive results. Additionally, a single model trained at 0.25° resolution transfers seamlessly across multiple CMIP6 models with much coarser resolution. HDD thus offers a lightweight alternative for probabilistic climate downscaling, facilitating affordable large-ensemble high-resolution climate projections. See a full code implementation at: https://github.com/HDD-Hierarchical-Diffusion-Downscaling/HDD-Hierarchical-Diffusion-Downscaling.

[90] A Global-Local Cross-Attention Network for Ultra-high Resolution Remote Sensing Image Semantic Segmentation

Chen Yi,Shan LianLei

Main category: cs.CV

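TL;DR: This paper proposes GLCANet, a lightweight semantic segmentation framework designed for ultra-high resolution remote sensing imagery. It fuses global semantics and local details through a dual-stream architecture and outperforms existing methods in both computational efficiency and segmentation accuracy.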

Details Motivation: With the rapid development of ultra-high resolution remote sensing technology, demand for accurate and efficient semantic segmentation keeps growing, yet existing methods face challenges in computational efficiency and multi-scale feature fusion. Method: The paper proposes GLCANet, which adopts a dual-stream architecture combining a self-attention mechanism and a masked cross-attention mechanism to efficiently fuse global semantics with local details while reducing GPU usage. Result: Experiments show that GLCANet outperforms current state-of-the-art methods in both accuracy and computational efficiency and can effectively process high-resolution images with a small memory footprint. Conclusion: GLCANet offers a promising solution for ultra-high resolution remote sensing imagery, addressing the shortcomings of existing methods in efficiency and accuracy. Abstract: With the rapid development of ultra-high resolution (UHR) remote sensing technology, the demand for accurate and efficient semantic segmentation has increased significantly. However, existing methods face challenges in computational efficiency and multi-scale feature fusion. To address these issues, we propose GLCANet (Global-Local Cross-Attention Network), a lightweight segmentation framework designed for UHR remote sensing imagery. GLCANet employs a dual-stream architecture to efficiently fuse global semantics and local details while minimizing GPU usage. A self-attention mechanism enhances long-range dependencies, refines global features, and preserves local details for better semantic consistency. A masked cross-attention mechanism also adaptively fuses global-local features, selectively enhancing fine-grained details while exploiting global context to improve segmentation accuracy. Experimental results show that GLCANet outperforms state-of-the-art methods regarding accuracy and computational efficiency. The model effectively processes large, high-resolution images with a small memory footprint, providing a promising solution for real-world remote sensing applications.

[91] EvDetMAV: Generalized MAV Detection from Moving Event Cameras

Yin Zhang,Zian Ning,Xiaoyu Zhang,Shiliang Guo,Peidong Liu,Shiyu Zhao

Main category: cs.CV

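TL;DR: This paper proposes a training-free, event-camera-based MAV detection method that exploits propeller features and introduces the first event-based MAV dataset, significantly improving detection performance.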

Details Motivation: Existing MAV detection methods rely on appearance features in RGB images, but the wide diversity of target appearance makes generalized detection difficult. Event cameras can capture the distinctive signature of high-speed rotating propellers, which opens a new avenue for detection. Method: The method comprises three modules that extract the salient and spatio-temporal features of the propellers while filtering out noise from background objects and camera motion. The authors also introduce a new event-based MAV dataset. Result: On the proposed testing dataset, the method achieves a precision of 83.0% (+30.3%) and a recall of 81.5% (+36.4%), and it handles challenging scenarios. Conclusion: The paper proposes an event-camera-based method for detecting micro aerial vehicles (MAVs) that exploits the spatio-temporal features of rotating propellers and, without training, significantly outperforms existing techniques. Abstract: Existing micro aerial vehicle (MAV) detection methods mainly rely on the target's appearance features in RGB images, whose diversity makes it difficult to achieve generalized MAV detection. We notice that different types of MAVs share the same distinctive features in event streams due to their high-speed rotating propellers, which are hard to see in RGB images. This paper studies how to detect different types of MAVs from an event camera by fully exploiting the features of propellers in the original event stream. The proposed method consists of three modules to extract the salient and spatio-temporal features of the propellers while filtering out noise from background objects and camera motion. Since there are no existing event-based MAV datasets, we introduce a novel MAV dataset for the community. This is the first event-based MAV dataset comprising multiple scenarios and different types of MAVs. Without training, our method significantly outperforms state-of-the-art methods and can deal with challenging scenarios, achieving a precision rate of 83.0% (+30.3%) and a recall rate of 81.5% (+36.4%) on the proposed testing dataset. The dataset and code are available at: https://github.com/WindyLab/EvDetMAV.

[92] Mem4Nav: Boosting Vision-and-Language Navigation in Urban Environments with a Hierarchical Spatial-Cognition Long-Short Memory System

Lixuan He,Haoyu Dong,Zhenxing Chen,Yangcheng Yu,Jie Feng,Yong Li

Main category: cs.CV

TL;DR: Mem4Nav is a new hierarchical memory system that significantly improves the performance of Vision-and-Language Navigation agents by enhancing their spatial cognition and memory retention.

Details Motivation: Existing approaches to VLN suffer from limitations in memory integration or spatial reasoning capabilities, prompting the need for a more robust solution like Mem4Nav. Method: Mem4Nav utilizes a hierarchical spatial-cognition memory system combining a sparse octree and a semantic topology graph, embedded through a reversible Transformer for efficient memory management. Result: Mem4Nav achieves 7-13 percentage points improvement in Task Completion, notable SPD reduction, and over 10 percentage points gain in nDTW scores on Touchdown and Map2Seq datasets. Conclusion: Mem4Nav proves effective in enhancing VLN performance across multiple backbones, demonstrating significant improvements in task completion and spatial reasoning. Abstract: Vision-and-Language Navigation (VLN) in large-scale urban environments requires embodied agents to ground linguistic instructions in complex scenes and recall relevant experiences over extended time horizons. Prior modular pipelines offer interpretability but lack unified memory, while end-to-end (M)LLM agents excel at fusing vision and language yet remain constrained by fixed context windows and implicit spatial reasoning. We introduce Mem4Nav, a hierarchical spatial-cognition long-short memory system that can augment any VLN backbone. Mem4Nav fuses a sparse octree for fine-grained voxel indexing with a semantic topology graph for high-level landmark connectivity, storing both in trainable memory tokens embedded via a reversible Transformer. Long-term memory (LTM) compresses and retains historical observations at both octree and graph nodes, while short-term memory (STM) caches recent multimodal entries in relative coordinates for real-time obstacle avoidance and local planning. At each step, STM retrieval sharply prunes dynamic context, and, when deeper history is needed, LTM tokens are decoded losslessly to reconstruct past embeddings.
Evaluated on Touchdown and Map2Seq across three backbones (modular, state-of-the-art VLN with prompt-based LLM, and state-of-the-art VLN with strided-attention MLLM), Mem4Nav yields 7-13 pp gains in Task Completion, sufficient SPD reduction, and >10 pp nDTW improvement. Ablations confirm the indispensability of both the hierarchical map and dual memory modules. Our codes are open-sourced via https://github.com/tsinghua-fib-lab/Mem4Nav.
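The nDTW score reported above is commonly defined as exp(-DTW(R, Q) / (|R| · d_th)), where R is the reference path, Q the agent's path, and d_th a success-distance threshold. The sketch below illustrates that standard definition on 2-D waypoints; the function names `dtw` and `ndtw` and the parameter `success_threshold` are illustrative choices, and this is not taken from the Mem4Nav code base, whose evaluation details may differ.

```python
import math


def dtw(reference, query):
    """Dynamic time warping cost between two paths of (x, y) waypoints."""
    n, m = len(reference), len(query)
    inf = float("inf")
    # cost[i][j] = best alignment cost of reference[:i] with query[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = math.dist(reference[i - 1], query[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # advance along the reference only
                cost[i][j - 1],      # advance along the query only
                cost[i - 1][j - 1],  # advance along both
            )
    return cost[n][m]


def ndtw(reference, query, success_threshold):
    """Normalized DTW in (0, 1]; 1.0 means the paths align exactly."""
    return math.exp(-dtw(reference, query) / (len(reference) * success_threshold))
```

A higher value means the agent's trajectory follows the reference more closely, so a gain of more than 10 pp in nDTW reflects better path fidelity rather than only reaching the goal.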

[93] AMF-MedIT: An Efficient Align-Modulation-Fusion Framework for Medical Image-Tabular Data

Congjing Yu,Jing Ye,Yang Liu,Xiaodong Zhang,Zhiyong Zhang

Main category: cs.CV

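TL;DR: This paper proposes AMF-MedIT, a framework for fusing medical image and tabular data. Its AMF module and FT-Mamba encoder address cross-modal discrepancies and noise, achieving a good balance between multimodal performance and data efficiency.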

Details Motivation: Effective fusion of medical image and tabular data remains challenging because of cross-modal discrepancies in feature dimensions and modality contributions, as well as noise in high-dimensional tabular inputs. Method: The paper proposes the AMF-MedIT framework, comprising the Adaptive Modulation and Fusion (AMF) module and the FT-Mamba tabular encoder, to address the fusion of medical image and tabular data. Result: Experiments show that AMF-MedIT strikes a superior balance between multimodal performance and data efficiency and handles incomplete tabular data effectively. In addition, FT-Mamba performs well at extracting tabular features and guiding the attention patterns of the image encoder. Conclusion: AMF-MedIT achieves a superior balance between multimodal performance and data efficiency and shows strong adaptability to incomplete tabular data. Interpretability analysis also highlights FT-Mamba's ability to extract distinct tabular features and guide the image encoder toward more accurate and flexible attention patterns. Abstract: Multimodal medical analysis combining image and tabular data has gained increasing attention. However, effective fusion remains challenging due to cross-modal discrepancies in feature dimensions and modality contributions, as well as the noise from high-dimensional tabular inputs. To address these problems, we present AMF-MedIT, an efficient Align-Modulation-Fusion framework for medical image and tabular data integration, particularly under data-scarce conditions. To harmonize dimension discrepancies and dynamically adjust modality contributions, we propose the Adaptive Modulation and Fusion (AMF) module, a novel modulation-based fusion paradigm with a streamlined architecture. We first derive the modulation objectives and introduce a modality confidence ratio, enabling the incorporation of prior knowledge into the fusion process. Then, the feature masks, density and leakage losses are proposed to achieve the modulation objectives. Additionally, we introduce FT-Mamba, a powerful tabular encoder leveraging a selective mechanism to handle noisy medical tabular data efficiently. Furthermore, interpretability studies are conducted to explore how different tabular encoders supervise the imaging modality during contrastive pretraining for the first time. Extensive experiments demonstrate that AMF-MedIT achieves a superior balance between multimodal performance and data efficiency while showing strong adaptability to incomplete tabular data.
Interpretability analysis also highlights FT-Mamba's capabilities in extracting distinct tabular features and guiding the image encoder toward more accurate and flexible attention patterns.

[94] Sampling Matters in Explanations: Towards Trustworthy Attribution Analysis Building Block in Visual Models through Maximizing Explanation Certainty

Róisín Luo,James McDermott,Colm O'Riordan

Main category: cs.CV

TL;DR: This paper proposes a semi-optimal sampling approach for attribution analysis by suppressing input features, achieving better alignment with natural image distribution and improving explanation certainty.

Details Motivation: Prior works using noise-added samples often result in low explanation certainty due to sample distribution misalignment. To build trustworthy attribution analysis, addressing this misalignment problem is essential. Method: A semi-optimal sampling approach is introduced, which suppresses features from inputs. This method aims to align the sample distribution with the natural image distribution, improving explanation certainty. Theoretical analysis and experimental evaluation on ImageNet are conducted. Result: The proposed semi-optimal sampling approach effectively improves explanation certainty and provides more satisfactory explanations compared to state-of-the-art baselines across all experimental models. Conclusion: The paper concludes that suppressing features from inputs, as opposed to adding extra information, leads to better alignment of sample distribution with natural image distribution, enhancing explanation certainty in attribution analysis. Abstract: Image attribution analysis seeks to highlight the feature representations learned by visual models such that the highlighted feature maps can reflect the pixel-wise importance of inputs. Gradient integration is a building block in the attribution analysis by integrating the gradients from multiple derived samples to highlight the semantic features relevant to inferences. Such a building block often combines with other information from visual models such as activation or attention maps to form ultimate explanations. Yet, our theoretical analysis demonstrates that the extent to the alignment of the sample distribution in gradient integration with respect to natural image distribution gives a lower bound of explanation certainty. Prior works add noise into images as samples and the noise distributions can lead to low explanation certainty. Counter-intuitively, our experiment shows that extra information can saturate neural networks. 
To this end, building trustworthy attribution analysis needs to settle the sample distribution misalignment problem. Instead of adding extra information into input images, we present a semi-optimal sampling approach by suppressing features from inputs. The sample distribution by suppressing features is approximately identical to the distribution of natural images. Our extensive quantitative evaluation on large scale dataset ImageNet affirms that our approach is effective and able to yield more satisfactory explanations against state-of-the-art baselines throughout all experimental models.

[95] Recurrent Visual Feature Extraction and Stereo Attentions for CT Report Generation

Yuanhe Tian,Lei Mao,Yan Song

Main category: cs.CV

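TL;DR: This paper proposes a large language model (LLM) based method for CT report generation (CTRG) that performs hierarchical feature modeling through recurrent visual feature extraction and stereo attentions.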

Details Motivation: Existing solutions typically use general 2D or 3D image processing techniques to extract features from a CT volume, but they do not explicitly account for the transformations among CT slices, nor do they effectively integrate multi-level image features, particularly those containing specific organ lesions, to guide CT report generation (CTRG). Method: The method uses a vision Transformer to recurrently process each slice in a CT volume and applies a set of attentions over the encoded slices from different perspectives to selectively obtain important visual information and align it with textual features, so as to better instruct a large language model (LLM) for CTRG. Result: Experimental results and further analysis on the M3D-Cap dataset show that the method outperforms strong baseline models and achieves state-of-the-art results. Conclusion: The results and analysis on the M3D-Cap dataset demonstrate the method's effectiveness: it outperforms strong baselines and reaches state-of-the-art performance. Abstract: Generating reports for computed tomography (CT) images is a challenging task, while similar to existing studies for medical image report generation, yet has its unique characteristics, such as spatial encoding of multiple images, alignment between image volume and texts, etc. Existing solutions typically use general 2D or 3D image processing techniques to extract features from a CT volume, where they firstly compress the volume and then divide the compressed CT slices into patches for visual encoding. These approaches do not explicitly account for the transformations among CT slices, nor do they effectively integrate multi-level image features, particularly those containing specific organ lesions, to instruct CT report generation (CTRG). In considering the strong correlation among consecutive slices in CT scans, in this paper, we propose a large language model (LLM) based CTRG method with recurrent visual feature extraction and stereo attentions for hierarchical feature modeling. Specifically, we use a vision Transformer to recurrently process each slice in a CT volume, and employ a set of attentions over the encoded slices from different perspectives to selectively obtain important visual information and align them with textual features, so as to better instruct an LLM for CTRG. Experiment results and further analysis on the benchmark M3D-Cap dataset show that our method outperforms strong baseline models and achieves state-of-the-art results, demonstrating its validity and effectiveness.

[96] Deblurring in the Wild: A Real-World Dataset from Smartphone High-Speed Videos

Mahdi Mohd Hossain Noki,Syed Mumtahin Mahmud,Prothito Shovon Majumder,Abdul Mohaimen Al Radi,Md. Haider Ali,Md. Mosaddek Khan

Main category: cs.CV

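TL;DR: This paper introduces the largest real-world image deblurring dataset, built from smartphone slow-motion videos, intended as a new benchmark to drive progress in deblurring models.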

Details Motivation: To provide a larger and more diverse dataset for better evaluating and advancing image deblurring techniques. Method: The authors build the largest real-world image deblurring dataset from smartphone slow-motion videos, using 240 frames captured over one second to simulate realistic long-exposure blur. Result: The dataset contains over 42,000 high-resolution blur-sharp image pairs, making it approximately 10 times larger than widely used datasets with 8 times as many distinct scenes, covering indoor and outdoor environments. Conclusion: The paper presents a challenging new benchmark to facilitate robust and generalizable deblurring models. Abstract: We introduce the largest real-world image deblurring dataset constructed from smartphone slow-motion videos. Using 240 frames captured over one second, we simulate realistic long-exposure blur by averaging frames to produce blurry images, while using the temporally centered frame as the sharp reference. Our dataset contains over 42,000 high-resolution blur-sharp image pairs, making it approximately 10 times larger than widely used datasets, with 8 times the amount of different scenes, including indoor and outdoor environments, with varying object and camera motions. We benchmark multiple state-of-the-art (SOTA) deblurring models on our dataset and observe significant performance degradation, highlighting the complexity and diversity of our benchmark. Our dataset serves as a challenging new benchmark to facilitate robust and generalizable deblurring models.
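The pair-construction step in the abstract (average a burst of frames for the blurry image, keep the temporally centered frame as the sharp reference) can be sketched as follows. This is a minimal illustration on grayscale frames stored as nested lists, not the authors' pipeline: the function name `synthesize_pair` is hypothetical, and picking index `n // 2` as the "centered" frame is an assumption, since an even frame count such as 240 has no unique middle frame.

```python
def synthesize_pair(frames):
    """Build a (blurry, sharp) pair from a burst of equally sized frames.

    frames: list of 2-D lists of pixel intensities (one list per frame).
    The blurry image is the per-pixel mean over all frames; the sharp
    reference is the frame at index len(frames) // 2 (assumed center).
    """
    if not frames:
        raise ValueError("need at least one frame")
    n = len(frames)
    height, width = len(frames[0]), len(frames[0][0])
    blurry = [
        [sum(frame[y][x] for frame in frames) / n for x in range(width)]
        for y in range(height)
    ]
    sharp = frames[n // 2]
    return blurry, sharp
```

Averaging a bright point that moves one pixel per frame smears its energy evenly along the motion path, which is the long-exposure effect the dataset simulates.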

[97] ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing

Long Xing,Qidong Huang,Xiaoyi Dong,Pan Zhang,Yuhang Zang,Yuhang Cao,Jinsong Li,Shuangrui Ding,Weiming Zhang,Nenghai Yu,Jiaqi Wang,Feng Wu,Dahua Lin

Main category: cs.CV

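TL;DR: This paper introduces ScaleCap, a scalable image captioning method that targets the multimodal and linguistic biases of existing models. Using heuristic question answering and contrastive sentence rating, it generates more accurate, balanced, and informative captions.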

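Details Motivation: The key challenges of high-quality image captioning lie in the inherent biases of LVLMs: multimodal bias leads to imbalanced descriptive granularity, and linguistic bias leads to hallucinated descriptions of non-existent objects. To address these issues, the authors propose a scalable debiased captioning strategy. Method: The paper proposes ScaleCap, a scalable caption generation strategy with two components: heuristic question answering, which generates content-specific questions based on the image and answers them to progressively inject relevant information, and contrastive sentence rating, which uses sentence-level offline contrastive decoding to identify and eliminate hallucinations caused by linguistic bias. Result: As inference cost increases, ScaleCap raises more heuristic questions to progressively capture additional visual details, generating captions that are more accurate, balanced, and informative. Experiments show that annotating 450K images with ScaleCap and using them for LVLM pretraining yields performance gains across 11 widely used benchmarks, and the generated captions show high richness and fidelity. Conclusion: ScaleCap effectively addresses multimodal and linguistic biases in image captioning and provides a scalable debiased captioning strategy. With its heuristic question answering and contrastive sentence rating components, it continuously enriches and calibrates the caption as the inference budget grows, producing more accurate, balanced, and informative captions. Abstract: This paper presents ScaleCap, an inference-time scalable image captioning strategy that generates comprehensive and detailed image captions. The key challenges of high-quality image captioning lie in the inherent biases of LVLMs: multimodal bias resulting in imbalanced descriptive granularity, offering detailed accounts of some elements while merely skimming over others; linguistic bias leading to hallucinated descriptions of non-existent objects. To address these issues, we propose a scalable debiased captioning strategy, which continuously enriches and calibrates the caption with increased inference budget. Specifically, we propose two novel components: heuristic question answering and contrastive sentence rating. The former generates content-specific questions based on the image and answers them to progressively inject relevant information into the caption. The latter employs sentence-level offline contrastive decoding to effectively identify and eliminate hallucinations caused by linguistic biases. With increased inference cost, more heuristic questions are raised by ScaleCap to progressively capture additional visual details, generating captions that are more accurate, balanced, and informative. Extensive modality alignment experiments demonstrate the effectiveness of ScaleCap. Annotating 450K images with ScaleCap and using them for LVLM pretraining leads to consistent performance gains across 11 widely used benchmarks.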
Furthermore, ScaleCap showcases superb richness and fidelity of generated captions with two additional tasks: replacing images with captions in VQA task, and reconstructing images from captions to assess semantic coverage. Code is available at https://github.com/Cooperx521/ScaleCap.

[98] Stylized Structural Patterns for Improved Neural Network Pre-training

Farnood Salehi,Vandit Sharma,Amirhossein Askari Farsangi,Tunç Ozan Aydın

Main category: cs.CV

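TL;DR: This paper proposes a new approach to training computer vision models on synthetic data. By improving the neural fractal formulation and introducing reverse stylization, it significantly improves model performance and narrows the domain gap to models trained on real images.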

Details Motivation: Modern deep learning models in computer vision require large datasets of real images, which are difficult to curate and raise privacy and legal concerns that limit their commercial use. Prior work shows that synthetic data has its own problems: models trained on it often underperform. Method: First, an improved neural fractal formulation is used to generate a new class of synthetic data; second, reverse stylization transfers visual features onto the synthetic datasets. Result: Compared with existing synthetic datasets, the method significantly reduces the distributional gap. It yields an 11% reduction in FID when pretraining the EDM2 diffusion model, a 20% decrease in autoencoder reconstruction error, and an improvement of over 10% in ImageNet-100 accuracy for a ViT-S classifier. Conclusion: The two-step approach to synthetic data is effective at narrowing the domain gap to real images and opens up exciting possibilities for training practical models when sufficiently large real training sets are not available. Abstract: Modern deep learning models in computer vision require large datasets of real images, which are difficult to curate and pose privacy and legal concerns, limiting their commercial use. Recent works suggest synthetic data as an alternative, yet models trained with it often underperform. This paper proposes a two-step approach to bridge this gap. First, we propose an improved neural fractal formulation through which we introduce a new class of synthetic data. Second, we propose reverse stylization, a technique that transfers visual features from a small, license-free set of real images onto synthetic datasets, enhancing their effectiveness. We analyze the domain gap between our synthetic datasets and real images using Kernel Inception Distance (KID) and show that our method achieves a significantly lower distributional gap compared to existing synthetic datasets. Furthermore, our experiments across different tasks demonstrate the practical impact of this reduced gap. We show that pretraining the EDM2 diffusion model on our synthetic dataset leads to an 11% reduction in FID during image generation, compared to models trained on existing synthetic datasets, and a 20% decrease in autoencoder reconstruction error, indicating improved performance in data representation. Furthermore, a ViT-S model trained for classification on this synthetic data achieves over a 10% improvement in ImageNet-100 accuracy. Our work opens up exciting possibilities for training practical models when sufficiently large real training sets are not available.

[99] Surgery-R1: Advancing Surgical-VQLA with Reasoning Multimodal Large Language Model via Reinforcement Learning

Pengfei Hao,Shuaibo Li,Hongqiu Wang,Zhizhuo Kou,Junhang Zhang,Guang Yang,Lei Zhu

Main category: cs.CV

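TL;DR: This paper proposes Surgery-R1, a reasoning multimodal large language model for visual question answering in surgical scenes. A newly built dataset and a two-stage fine-tuning mechanism significantly improve the model's performance and reliability.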

Details Motivation: Existing Surgical-VQLA models lack deep reasoning capabilities and interpretability in surgical scenes, which limits their reliability and potential for development in clinical applications. Method: The authors design a two-stage fine-tuning mechanism and build the Surgery-R1-54k dataset containing 54,000 paired samples, using supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT) to strengthen the model. They also design a Multimodal Coherence reward mechanism to improve RFT. Result: Surgery-R1 outperforms other state-of-the-art models and widely used MLLMs on the Surgical-VQLA task, validating its reasoning capabilities and the effectiveness of the proposed approach. Conclusion: Surgery-R1 effectively improves reasoning ability and interpretability for visual question answering in surgical scenes and outperforms existing state-of-the-art models. Abstract: In recent years, significant progress has been made in the field of surgical scene understanding, particularly in the task of Visual Question Localized-Answering in robotic surgery (Surgical-VQLA). However, existing Surgical-VQLA models lack deep reasoning capabilities and interpretability in surgical scenes, which limits their reliability and potential for development in clinical applications. To address this issue, inspired by the development of Reasoning Multimodal Large Language Models (MLLMs), we first build the Surgery-R1-54k dataset, including paired data for Visual-QA, Grounding-QA, and Chain-of-Thought (CoT). Then, we propose the first Reasoning MLLM for Surgical-VQLA (Surgery-R1). In our Surgery-R1, we design a two-stage fine-tuning mechanism to enable the basic MLLM with complex reasoning abilities by utilizing supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). Furthermore, for an efficient and high-quality rule-based reward system in our RFT, we design a Multimodal Coherence reward mechanism to mitigate positional illusions that may arise in surgical scenarios. Experiment results demonstrate that Surgery-R1 outperforms other existing state-of-the-art (SOTA) models in the Surgical-VQLA task and widely-used MLLMs, while also validating its reasoning capabilities and the effectiveness of our approach. The code and dataset will be organized in https://github.com/FiFi-HAO467/Surgery-R1.

[100] USIS16K: High-Quality Dataset for Underwater Salient Instance Segmentation

Lin Hong,Xin Wang,Yihao Li,Xia Wang

Main category: cs.CV

TL;DR: This paper introduces USIS16K, a large-scale dataset for underwater salient instance segmentation, addressing challenges posed by the dynamic nature of underwater environments and lack of data.

Details Motivation: The motivation is to address the challenges of underwater salient instance segmentation due to the inaccessibility of underwater environments and lack of annotated datasets. Method: The method involves creating a comprehensive dataset with high-resolution underwater images annotated with instance-level salient object masks. Result: The result is the creation of the USIS16K dataset, which includes 16,151 images covering 158 categories of underwater objects. Conclusion: The paper concludes by introducing USIS16K, a large-scale dataset for underwater salient instance segmentation and providing benchmark evaluations to advance research in this domain. Abstract: Inspired by the biological visual system that selectively allocates attention to efficiently identify salient objects or regions, underwater salient instance segmentation (USIS) aims to jointly address the problems of where to look (saliency prediction) and what is there (instance segmentation) in underwater scenarios. However, USIS remains an underexplored challenge due to the inaccessibility and dynamic nature of underwater environments, as well as the scarcity of large-scale, high-quality annotated datasets. In this paper, we introduce USIS16K, a large-scale dataset comprising 16,151 high-resolution underwater images collected from diverse environmental settings and covering 158 categories of underwater objects. Each image is annotated with high-quality instance-level salient object masks, representing a significant advance in terms of diversity, complexity, and scalability. Furthermore, we provide benchmark evaluations on underwater object detection and USIS tasks using USIS16K. To facilitate future research in this domain, the dataset and benchmark models are publicly available.

[101] HMSViT: A Hierarchical Masked Self-Supervised Vision Transformer for Corneal Nerve Segmentation and Diabetic Neuropathy Diagnosis

Xin Zhang,Liangxiu Han,Yue Shi,Yanlin Zheng,Alam Uazman,Maryam Ferdousi,Rayaz Malik

Main category: cs.CV

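TL;DR: This paper proposes HMSViT, a new model for non-invasive diagnosis of diabetic peripheral neuropathy. It outperforms existing methods in segmentation and diagnostic accuracy while reducing reliance on labelled data.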

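Details Motivation: Existing automated methods for detecting diabetic peripheral neuropathy (DPN) fall short in feature extraction efficiency, depend on handcrafted priors, and are constrained by limited data. A more efficient, automated method is needed for early non-invasive diagnosis. Method: The paper proposes a Hierarchical Masked Self-Supervised Vision Transformer (HMSViT) for corneal nerve segmentation and DPN diagnosis. It adopts a pooling-based hierarchical structure with dual attention mechanisms and introduces a block-masked self-supervised learning framework to reduce reliance on labelled data and improve feature robustness. A multi-scale decoder is used for segmentation and classification. Result: Experiments on clinical CCM datasets show that HMSViT reaches 61.34% mIoU for nerve segmentation and 70.40% diagnostic accuracy, exceeding existing hierarchical models such as the Swin Transformer and HiViT by up to 6.39% in segmentation accuracy with fewer parameters. Ablation studies show that combining block-masked SSL with hierarchical multi-scale feature extraction substantially outperforms conventional supervised training. Conclusion: By combining a hierarchical masked self-supervised learning framework with multi-scale feature extraction, HMSViT achieves strong performance in nerve segmentation and DPN diagnosis, reduces reliance on labelled data, and shows potential for scalable deployment in real-world diagnostic applications. Abstract: Diabetic Peripheral Neuropathy (DPN) affects nearly half of diabetes patients, requiring early detection. Corneal Confocal Microscopy (CCM) enables non-invasive diagnosis, but automated methods suffer from inefficient feature extraction, reliance on handcrafted priors, and data limitations. We propose HMSViT, a novel Hierarchical Masked Self-Supervised Vision Transformer (HMSViT) designed for corneal nerve segmentation and DPN diagnosis. Unlike existing methods, HMSViT employs pooling-based hierarchical and dual attention mechanisms with absolute positional encoding, enabling efficient multi-scale feature extraction by capturing fine-grained local details in early layers and integrating global context in deeper layers, all at a lower computational cost. A block-masked self-supervised learning framework is designed for the HMSViT that reduces reliance on labelled data, enhancing feature robustness, while a multi-scale decoder is used for segmentation and classification by fusing hierarchical features. Experiments on clinical CCM datasets showed HMSViT achieves state-of-the-art performance, with 61.34% mIoU for nerve segmentation and 70.40% diagnostic accuracy, outperforming leading hierarchical models like the Swin Transformer and HiViT by margins of up to 6.39% in segmentation accuracy while using fewer parameters.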
Detailed ablation studies further reveal that integrating block-masked SSL with hierarchical multi-scale feature extraction substantially enhances performance compared to conventional supervised training. Overall, these comprehensive experiments confirm that HMSViT delivers excellent, robust, and clinically viable results, demonstrating its potential for scalable deployment in real-world diagnostic applications.
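For readers unfamiliar with the mIoU figure quoted above, the following sketch shows the usual per-class intersection-over-union averaged over classes, computed on flattened label sequences. It is a generic illustration only: the function name `mean_iou` is hypothetical, skipping classes absent from both prediction and ground truth is one common convention, and the exact evaluation protocol used for HMSViT is not stated in the abstract.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes.

    pred, target: equal-length sequences of integer class labels
    (for example, a segmentation mask flattened to one dimension).
    Classes that appear in neither sequence are skipped (assumed convention).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for cls in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:
            ious.append(intersection / union)
    if not ious:
        raise ValueError("no class present in pred or target")
    return sum(ious) / len(ious)
```

In a thin-structure task such as nerve segmentation, the foreground class covers few pixels, so small boundary errors lower its IoU sharply, which helps explain why mIoU values in this setting are far below typical pixel accuracy.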

[102] SceneCrafter: Controllable Multi-View Driving Scene Editing

Zehao Zhu,Yuliang Zou,Chiyu Max Jiang,Bo Sun,Vincent Casser,Xiukun Huang,Jiahao Wang,Zhenpei Yang,Ruiqi Gao,Leonidas Guibas,Mingxing Tan,Dragomir Anguelov

Main category: cs.CV

TL;DR: This paper presents SceneCrafter, a versatile editor for realistic 3D-consistent manipulation of driving scenes captured from multiple cameras. It addresses the challenges of cross-camera 3D consistency, learning 'empty street' priors, and obtaining paired image tuples while preserving consistent layout and geometry.

Details Motivation: Simulation is crucial for developing and evaluating autonomous vehicle systems. While generative models can synthesize highly realistic images, purely synthetically generated scenes are not grounded in reality and have difficulty inspiring confidence in the relevance of its outcomes. Editing models leverage source scenes from real driving logs, enabling the simulation of different traffic layouts, behaviors, and operating conditions, but present fresh sets of challenges in driving simulation such as cross-camera 3D consistency, learning 'empty street' priors, and obtaining paired image tuples while preserving consistent layout and geometry. Method: The paper proposes SceneCrafter, which builds on recent advancements in multi-view diffusion models. It uses a fully controllable framework that scales seamlessly to multi-modality conditions like weather, time of day, agent boxes, and high-definition maps. The paper also proposes a novel framework on top of Prompt-to-Prompt to generate geometrically consistent synthetic paired data with global edits and introduces an alpha-blending framework to synthesize data with local edits. Result: SceneCrafter demonstrates powerful editing capabilities and achieves state-of-the-art realism, controllability, 3D consistency, and scene editing quality compared to existing baselines. Conclusion: SceneCrafter is a versatile editor that demonstrates powerful editing capabilities and achieves state-of-the-art realism, controllability, 3D consistency, and scene editing quality compared to existing baselines. Abstract: Simulation is crucial for developing and evaluating autonomous vehicle (AV) systems. Recent literature builds on a new generation of generative models to synthesize highly realistic images for full-stack simulation. However, purely synthetically generated scenes are not grounded in reality and have difficulty in inspiring confidence in the relevance of its outcomes. 
Editing models, on the other hand, leverage source scenes from real driving logs, and enable the simulation of different traffic layouts, behaviors, and operating conditions such as weather and time of day. While image editing is an established topic in computer vision, it presents fresh sets of challenges in driving simulation: (1) the need for cross-camera 3D consistency, (2) learning "empty street" priors from driving data with foreground occlusions, and (3) obtaining paired image tuples of varied editing conditions while preserving consistent layout and geometry. To address these challenges, we propose SceneCrafter, a versatile editor for realistic 3D-consistent manipulation of driving scenes captured from multiple cameras. We build on recent advancements in multi-view diffusion models, using a fully controllable framework that scales seamlessly to multi-modality conditions like weather, time of day, agent boxes and high-definition maps. To generate paired data for supervising the editing model, we propose a novel framework on top of Prompt-to-Prompt to generate geometrically consistent synthetic paired data with global edits. We also introduce an alpha-blending framework to synthesize data with local edits, leveraging a model trained on empty street priors through novel masked training and multi-view repaint paradigm. SceneCrafter demonstrates powerful editing capabilities and achieves state-of-the-art realism, controllability, 3D consistency, and scene editing quality compared to existing baselines.

[103] Visual hallucination detection in large vision-language models via evidential conflict

Tao Huang,Zhekun Liu,Rui Wang,Yang Zhang,Liping Jing

Main category: cs.CV

TL;DR: This paper introduces a new benchmark (PRE-HAL) and a Dempster-Shafer theory-based method to detect visual hallucinations in Large Vision-Language Models, showing significant performance improvements.

Details Motivation: To address the reliability gap caused by visual hallucinations in LVLMs, particularly in safety-critical AI applications. Method: Development of the PRE-HAL dataset and application of Dempster-Shafer theory for uncertainty estimation to detect visual hallucinations in LVLMs. Result: The proposed method showed improved performance over baseline metrics with average AUROC improvements of 4%, 10%, and 7% across three LVLMs. Conclusion: The study concludes that visual hallucinations in LVLMs can be effectively detected using the proposed DST-based method, which outperforms existing uncertainty metrics. Abstract: Despite the remarkable multimodal capabilities of Large Vision-Language Models (LVLMs), discrepancies often occur between visual inputs and textual outputs--a phenomenon we term visual hallucination. This critical reliability gap poses substantial risks in safety-critical Artificial Intelligence (AI) applications, necessitating a comprehensive evaluation benchmark and effective detection methods. Firstly, we observe that existing visual-centric hallucination benchmarks mainly assess LVLMs from a perception perspective, overlooking hallucinations arising from advanced reasoning capabilities. We develop the Perception-Reasoning Evaluation Hallucination (PRE-HAL) dataset, which enables the systematic evaluation of both perception and reasoning capabilities of LVLMs across multiple visual semantics, such as instances, scenes, and relations. Comprehensive evaluation with this new benchmark exposed more visual vulnerabilities, particularly in the more challenging task of relation reasoning. To address this issue, we propose, to the best of our knowledge, the first Dempster-Shafer theory (DST)-based visual hallucination detection method for LVLMs through uncertainty estimation. This method aims to efficiently capture the degree of conflict in high-level features at the model inference phase. 
Specifically, our approach employs simple mass functions to mitigate the computational complexity of evidence combination on power sets. We conduct an extensive evaluation of state-of-the-art LVLMs, LLaVA-v1.5, mPLUG-Owl2 and mPLUG-Owl3, with the new PRE-HAL benchmark. Experimental results indicate that our method outperforms five baseline uncertainty metrics, achieving average AUROC improvements of 4%, 10%, and 7% across three LVLMs. Our code is available at https://github.com/HT86159/Evidential-Conflict.
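
The evidence-combination step mentioned in the abstract can be illustrated with a textbook version of Dempster's rule. The following is a minimal sketch, not the authors' implementation: the two-element frame, the helper names `simple_mass` and `combine`, and the support values are assumptions chosen for illustration. A simple mass function puts some support on one focal set and the remainder on the whole frame, which keeps the combination cheap; the returned conflict term is the quantity a conflict-based uncertainty score would build on.

```python
from itertools import product


def simple_mass(focal, support, frame):
    """Simple mass function: `support` on one focal set, the rest on the frame.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    focal, frame = frozenset(focal), frozenset(frame)
    if focal == frame:
        return {frame: 1.0}
    return {focal: support, frame: 1.0 - support}


def combine(m1, m2):
    """Dempster's rule of combination for mass functions {frozenset: mass}.

    Returns the normalized combined masses and the conflict K, i.e. the
    total mass assigned to pairs of focal sets with empty intersection.
    """
    joint = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            joint[inter] = joint.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; rule is undefined")
    fused = {k: v / (1.0 - conflict) for k, v in joint.items()}
    return fused, conflict


# Two pieces of evidence that disagree (labels are illustrative assumptions).
frame = {"consistent", "hallucinated"}
m1 = simple_mass({"consistent"}, 0.8, frame)
m2 = simple_mass({"hallucinated"}, 0.6, frame)
fused, conflict = combine(m1, m2)
print(conflict, fused)
```

A higher conflict value means the two sources support incompatible hypotheses more strongly; how the paper maps high-level features to mass functions is not described in the abstract and is not modeled here.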

[104] ReMAR-DS: Recalibrated Feature Learning for Metal Artifact Reduction and CT Domain Transformation

Mubashara Rehman,Niki Martinel,Michele Avanzo,Riccardo Spizzo,Christian Micheloni

Main category: cs.CV

TL;DR: ReMAR-DS is a deep learning framework that improves kVCT image quality by reducing metal artifacts and mimicking MVCT, enhancing radiotherapy planning and reducing patient radiation exposure.

Details Motivation: Artifacts in kVCT imaging degrade image quality and impact clinical decisions. Traditional MAR methods have limitations in balancing artifact reduction and anatomical preservation. Method: ReMAR-DS uses an encoder-decoder architecture with enhanced feature recalibration to focus on artifact regions and preserve anatomical structures during reconstruction. Result: The approach achieves high-quality MVCT-like reconstructions, validated through qualitative and quantitative evaluations, effectively reducing artifacts while preserving important anatomical details. Conclusion: The proposed ReMAR-DS framework successfully reduces metal artifacts and transforms kVCT to MVCT, improving radiotherapy planning and reducing the need for high-dose MVCT scans. Abstract: Artifacts in kilo-Voltage CT (kVCT) imaging degrade image quality, impacting clinical decisions. We propose a deep learning framework for metal artifact reduction (MAR) and domain transformation from kVCT to Mega-Voltage CT (MVCT). The proposed framework, ReMAR-DS, utilizes an encoder-decoder architecture with enhanced feature recalibration, effectively reducing artifacts while preserving anatomical structures. This ensures that only relevant information is utilized in the reconstruction process. By infusing recalibrated features from the encoder block, the model focuses on relevant spatial regions (e.g., areas with artifacts) and highlights key features across channels (e.g., anatomical structures), leading to improved reconstruction of artifact-corrupted regions. Unlike traditional MAR methods, our approach bridges the gap between high-resolution kVCT and artifact-resistant MVCT, enhancing radiotherapy planning. It produces high-quality MVCT-like reconstructions, validated through qualitative and quantitative evaluations. Clinically, this enables oncologists to rely on kVCT alone, reducing repeated high-dose MVCT scans and lowering radiation exposure for cancer patients.

[105] Identifying Physically Realizable Triggers for Backdoored Face Recognition Networks

Ankita Raj,Ambar Pal,Chetan Arora

Main category: cs.CV

TL;DR: The paper proposes a new method to detect and identify physically realizable backdoor triggers in face recognition systems, outperforming a naive baseline.

Details Motivation: Because backdoored face recognition systems may be deployed in high-security applications, a method is needed to detect and identify these hidden attack patterns and keep such systems secure. Method: The paper studies the impact of backdoor attacks on face recognition systems and develops a technique to detect and identify the trigger patterns used in these attacks. Result: The authors demonstrate the effectiveness of their method, which identifies the trigger (e.g., green sunglasses or a red hat) with a top-5 accuracy of 74%, compared with 56% for the baseline. Conclusion: The paper proposes a new detection method that can effectively identify FR networks compromised by physically realizable triggers and identifies the trigger with a top-5 accuracy of 74%. Abstract: Backdoor attacks embed a hidden functionality into deep neural networks, causing the network to display anomalous behavior when activated by a predetermined pattern in the input (the trigger), while behaving well otherwise on public test data. Recent works have shown that backdoored face recognition (FR) systems can respond to natural-looking triggers like a particular pair of sunglasses. Such attacks pose a serious threat to the applicability of FR systems in high-security applications. We propose a novel technique to (1) detect whether an FR network is compromised with a natural, physically realizable trigger, and (2) identify such triggers given a compromised network. We demonstrate the effectiveness of our methods with a compromised FR network, where we are able to identify the trigger (e.g., green sunglasses or red hat) with a top-5 accuracy of 74%, whereas a naive brute force baseline achieves 56% accuracy.

[106] General Methods Make Great Domain-specific Foundation Models: A Case-study on Fetal Ultrasound

Jakob Ambsdorf,Asbjørn Munk,Sebastian Llambias,Anders Nymark Christensen,Kamil Mikolaj,Randall Balestriero,Martin Tolsgaard,Aasa Feragen,Mads Nielsen

Main category: cs.CV

TL;DR: This paper compares pretraining a custom foundation model on a large unlabeled medical dataset with transfer learning from existing generalist models, and finds that custom pretraining performs better without requiring complex methodological innovation.

Details Motivation: Researchers must decide whether to pretrain a custom foundation model on medical data or to use transfer learning directly, and want to understand how well current computer vision methods adapt to the medical domain. Method: Using the well-established DINOv2 method, a foundation model is trained on a regional dataset of 2 million fetal ultrasound images and compared against a range of pretrained models and baselines. Result: The model performs strongly on three fetal ultrasound datasets from different countries, covering classification, segmentation, and few-shot tasks; the results indicate that pretraining on custom data beats natural-image pretraining even with smaller models, and needs neither complex tuning nor methodological innovation. Conclusion: The paper argues that, under common computational resource constraints, developers of domain-specific foundation models should avoid overemphasizing methodological innovation and instead prioritize well-tuned existing methods. Abstract: With access to large-scale, unlabeled medical datasets, researchers are confronted with two questions: Should they attempt to pretrain a custom foundation model on this medical data, or use transfer-learning from an existing generalist model? And, if a custom model is pretrained, are novel methods required? In this paper we explore these questions by conducting a case-study, in which we train a foundation model on a large regional fetal ultrasound dataset of 2M images. By selecting the well-established DINOv2 method for pretraining, we achieve state-of-the-art results on three fetal ultrasound datasets, covering data from different countries, classification, segmentation, and few-shot tasks. We compare against a series of models pretrained on natural images, ultrasound images, and supervised baselines. Our results demonstrate two key insights: (i) Pretraining on custom data is worth it, even if smaller models are trained on less data, as scaling in natural image pretraining does not translate to ultrasound performance. (ii) Well-tuned methods from computer vision are making it feasible to train custom foundation models for a given medical domain, requiring no hyperparameter tuning and little methodological adaptation. Given these findings, we argue that a bias towards methodological innovation should be avoided when developing domain specific foundation models under common computational resource constraints.

[107] MambaOutRS: A Hybrid CNN-Fourier Architecture for Remote Sensing Image Classification

Minjong Cheon,Changbae Mun

Main category: cs.CV

TL;DR: This paper proposes MambaOutRS, a new hybrid convolutional architecture for remote sensing image classification that combines gated convolutions with frequency-domain global context capture to achieve efficient, high-performance classification.

Details Motivation: To overcome the efficiency loss of existing state space models on 2D visual data, the paper re-evaluates whether recurrent state space models are necessary. Method: Introduces MambaOutRS, a hybrid convolutional architecture built on stacked Gated CNN blocks for local feature extraction, together with a novel Fourier Filter Gate module that operates in the frequency domain to capture global context efficiently. Result: MambaOutRS achieves state-of-the-art performance on several challenging remote sensing datasets; the MambaOutRS-t variant (24.0M parameters) reaches F1-scores of 98.41% on UC Merced and 95.99% on AID, clearly outperforming existing baselines. Conclusion: MambaOutRS effectively replaces the complexity of recurrent state space models and offers a compelling, efficient paradigm for high-performance deep learning models in remote sensing and other vision domains, especially where computational efficiency is critical. Abstract: Recent advances in deep learning for vision tasks have seen the rise of State Space Models (SSMs) like Mamba, celebrated for their linear scalability. However, their adaptation to 2D visual data often necessitates complex modifications that may diminish efficiency. In this paper, we introduce MambaOutRS, a novel hybrid convolutional architecture for remote sensing image classification that re-evaluates the necessity of recurrent SSMs. MambaOutRS builds upon stacked Gated CNN blocks for local feature extraction and introduces a novel Fourier Filter Gate (FFG) module that operates in the frequency domain to capture global contextual information efficiently. Our architecture employs a four-stage hierarchical design and was extensively evaluated on challenging remote sensing datasets: UC Merced, AID, NWPU-RESISC45, and EuroSAT. MambaOutRS consistently achieved state-of-the-art (SOTA) performance across these benchmarks. Notably, our MambaOutRS-t variant (24.0M parameters) attained the highest F1-scores of 98.41% on UC Merced and 95.99% on AID, significantly outperforming existing baselines, including larger transformer models and Mamba-based architectures, despite using considerably fewer parameters. An ablation study conclusively demonstrates the critical role of the Fourier Filter Gate in enhancing the model's ability to capture global spatial patterns, leading to robust and accurate classification. 
These results strongly suggest that the complexities of recurrent SSMs can be effectively superseded by a judicious combination of gated convolutions for spatial mixing and frequency-based gates for spectral global context. Thus, MambaOutRS provides a compelling and efficient paradigm for developing high-performance deep learning models in remote sensing and other vision domains, particularly where computational efficiency is paramount.

[108] SMARTIES: Spectrum-Aware Multi-Sensor Auto-Encoder for Remote Sensing Images

Gencer Sumbul,Chang Xu,Emanuele Dalsasso,Devis Tuia

Main category: cs.CV

TL;DR: SMARTIES is a new generic foundation model that flexibly handles data from many remote sensing sensors without per-sensor optimization.

Details Motivation: Conventional deep learning models are usually limited to a single sensor or a fixed combination of sensors, and adapting them to different sensors requires architectural changes and retraining, which limits scalability and generalization in multi-sensor remote sensing. A unified model that accepts diverse sensor inputs and modulates its feature representations accordingly is therefore needed. Method: SMARTIES trains a single, unified transformer model with cross-sensor token mixup and projects data from heterogeneous sensors into a shared spectrum-aware space, enabling the use of arbitrary band combinations. Result: SMARTIES outperforms previous models that rely on sensor-specific pretraining on both single- and multi-modal tasks across diverse remote sensing sensors. Conclusion: SMARTIES is a generic and versatile foundation model that removes sensor-specific effort and improves scalability and generalization, making remote sensing data processing more flexible and efficient. Abstract: From optical sensors to microwave radars, leveraging the complementary strengths of remote sensing (RS) sensors is crucial for achieving dense spatio-temporal monitoring of our planet. In contrast, recent deep learning models, whether task-specific or foundational, are often specific to single sensors or to fixed combinations: adapting such models to different sensory inputs requires both architectural changes and re-training, limiting scalability and generalization across multiple RS sensors. On the contrary, a single model able to modulate its feature representations to accept diverse sensors as input would pave the way to agile and flexible multi-sensor RS data processing. To address this, we introduce SMARTIES, a generic and versatile foundation model lifting sensor-specific/dependent efforts and enabling scalability and generalization to diverse RS sensors: SMARTIES projects data from heterogeneous sensors into a shared spectrum-aware space, enabling the use of arbitrary combinations of bands both for training and inference. To obtain sensor-agnostic representations, we train a single, unified transformer model reconstructing masked multi-sensor data with cross-sensor token mixup. On both single- and multi-modal tasks across diverse sensors, SMARTIES outperforms previous models that rely on sensor-specific pretraining. Our code and pretrained models are available at https://gsumbul.github.io/SMARTIES.

[109] Vision Transformer-Based Time-Series Image Reconstruction for Cloud-Filling Applications

Lujun Li,Yiqun Wang,Radu State

Main category: cs.CV

TL;DR: This paper proposes a Time-series ViT framework to improve early season crop mapping by reconstructing MSI data in cloud-covered areas using temporal coherence of MSI and complementary SAR data.

Details Motivation: Cloud cover in multispectral imagery (MSI) causes missing or corrupted spectral information, which poses challenges for early season crop mapping. Synthetic aperture radar (SAR) data is not affected by cloud interference but lacks sufficient spectral detail for precise crop mapping. Method: A novel framework, Time-series MSI Image Reconstruction using Vision Transformer (ViT), was developed to reconstruct MSI data in cloud-covered regions by utilizing the attention mechanism to integrate temporal coherence of MSI and complementary information from SAR. Result: Comprehensive experiments using rigorous reconstruction evaluation metrics showed that the Time-series ViT framework significantly outperforms baselines that use non-time-series MSI and SAR or time-series MSI without SAR. Conclusion: The proposed Time-series ViT framework effectively enhances MSI image reconstruction in cloud-covered regions by leveraging the temporal coherence of MSI and the complementary information from SAR. Abstract: Cloud cover in multispectral imagery (MSI) poses significant challenges for early season crop mapping, as it leads to missing or corrupted spectral information. Synthetic aperture radar (SAR) data, which is not affected by cloud interference, offers a complementary solution, but lacks sufficient spectral detail for precise crop mapping. To address this, we propose a novel framework, Time-series MSI Image Reconstruction using Vision Transformer (ViT), to reconstruct MSI data in cloud-covered regions by leveraging the temporal coherence of MSI and the complementary information from SAR through the attention mechanism. Comprehensive experiments, using rigorous reconstruction evaluation metrics, demonstrate that the Time-series ViT framework significantly outperforms baselines that use non-time-series MSI and SAR or time-series MSI without SAR, effectively enhancing MSI image reconstruction in cloud-covered regions.

[110] Implementing blind navigation through multi-modal sensing and gait guidance

Feifan Yan,Tianle Zeng,Meixi He

Main category: cs.CV

TL;DR: This paper presents a new wearable guiding device for blind users that uses gait analysis and multimodal sensing to improve navigation guidance.

Details Motivation: The global visually impaired population has surpassed 220 million, and traditional aids such as guide canes and guide dogs have shortcomings, so more effective solutions are needed. Method: Proposes a gait-based guiding system combined with multimodal sensing for environmental perception. Result: Experiments show that the device performs well in blind guidance and outperforms a standard guide cane. Conclusion: The paper presents an innovative wearable blind guiding device that, through gait analysis and multimodal sensing, outperforms a traditional guide cane in both indoor and outdoor experiments. Abstract: By the year 2023, the global population of individuals with impaired vision has surpassed 220 million. People with impaired vision find it difficult to find paths or avoid obstacles, and must rely on auxiliary tools for help. Although traditional aids such as guide canes and guide dogs exist, they still have some shortcomings. In this paper, we present our wearable blind guiding device, which performs navigation guidance through our proposed Gait-based Guiding System. Our device innovatively integrates gait phase analysis for walking guidance, and in terms of environmental perception, we use multimodal sensing to acquire diverse environment information. We conducted both indoor and outdoor experiments and compared the device with a standard guide cane. The results show superior performance of our device in blind guidance.

[111] Self-Supervised Multimodal NeRF for Autonomous Driving

Gaurav Sharma,Ravi Kothari,Josef Schmid

Main category: cs.CV

TL;DR: Proposes NVSF, a self-supervised multimodal NeRF framework for efficient novel view synthesis in autonomous driving scenes that needs no 3D labels and performs strongly.

Details Motivation: To remove the need for 3D labels in existing dynamic NeRF methods and to improve novel view synthesis of static and dynamic objects in autonomous driving scenes. Method: Combines LiDAR and camera data, uses an implicit neural representation of space- and time-varying scenes, and introduces heuristic image pixel sampling and a Double Gradient mask to improve training efficiency and preserve local features. Result: Experiments on the KITTI-360 dataset show that the framework outperforms baseline models in both the LiDAR and camera domains. Conclusion: The paper proposes NVSF, a NeRF-based self-supervised framework for multimodal novel view synthesis in autonomous driving scenes. Abstract: In this paper, we propose a Neural Radiance Fields (NeRF) based framework, referred to as Novel View Synthesis Framework (NVSF). It jointly learns the implicit neural representation of space and time-varying scene for both LiDAR and Camera. We test this on a real-world autonomous driving scenario containing both static and dynamic scenes. Compared to existing multimodal dynamic NeRFs, our framework is self-supervised, thus eliminating the need for 3D labels. For efficient training and faster convergence, we introduce heuristic-based image pixel sampling to focus on pixels with rich information. To preserve the local features of LiDAR points, a Double Gradient based mask is employed. Extensive experiments on the KITTI-360 dataset show that, compared to the baseline models, our framework reports the best performance in both the LiDAR and Camera domains. Code of the model is available at https://github.com/gaurav00700/Selfsupervised-NVSF

[112] VideoPCDNet: Video Parsing and Prediction with Phase Correlation Networks

Noel José Rodrigues Vicente,Enrique Lehner,Angel Villar-Corrales,Jan Nogga,Sven Behnke

Main category: cs.CV

TL;DR: VideoPCDNet is an object-centric video decomposition and prediction framework based on frequency-domain analysis that performs well on unsupervised tasks.

Details Motivation: To improve the accuracy and interpretability of video understanding and prediction in dynamic environments. Method: Uses frequency-domain phase correlation to recursively parse videos into objects, and models object motion by combining frequency-domain operations with lightweight learned modules. Result: Achieves unsupervised object tracking and future frame prediction while learning interpretable object and motion representations. Conclusion: VideoPCDNet performs object-centric decomposition and prediction of video content and outperforms other unsupervised models on several synthetic datasets. Abstract: Understanding and predicting video content is essential for planning and reasoning in dynamic environments. Despite advancements, unsupervised learning of object representations and dynamics remains challenging. We present VideoPCDNet, an unsupervised framework for object-centric video decomposition and prediction. Our model uses frequency-domain phase correlation techniques to recursively parse videos into object components, which are represented as transformed versions of learned object prototypes, enabling accurate and interpretable tracking. By explicitly modeling object motion through a combination of frequency domain operations and lightweight learned modules, VideoPCDNet enables accurate unsupervised object tracking and prediction of future video frames. In our experiments, we demonstrate that VideoPCDNet outperforms multiple object-centric baseline models for unsupervised tracking and prediction on several synthetic datasets, while learning interpretable object and motion representations.
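
Phase correlation, the frequency-domain primitive the abstract refers to, can be shown on a toy 1D signal. This is a minimal sketch under stated assumptions, not VideoPCDNet itself: it uses a naive pure-Python DFT, a circular shift, and hypothetical function names (`dft`, `phase_correlation_shift`), whereas the paper works on video frames with learned object prototypes. The idea is that a translation only changes the phase of the spectrum, so the inverse transform of the normalized cross-power spectrum peaks at the shift.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (illustration only)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                for t in range(n))
        out.append(s / n if inverse else s)
    return out


def phase_correlation_shift(ref, moved):
    """Estimate the circular shift d such that moved[t] == ref[t - d]."""
    spec_ref, spec_moved = dft(ref), dft(moved)
    cross = []
    for f, g in zip(spec_ref, spec_moved):
        p = g * f.conjugate()
        mag = abs(p)
        # Keep only the phase; drop bins with no energy to avoid dividing by 0.
        cross.append(p / mag if mag > 1e-12 else 0j)
    corr = [c.real for c in dft(cross, inverse=True)]
    # The correlation surface is (ideally) a single peak at the shift.
    return max(range(len(corr)), key=corr.__getitem__)


ref = [0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0]
moved = ref[-3:] + ref[:-3]  # same pattern, circularly shifted right by 3
print(phase_correlation_shift(ref, moved))
```

In practice one would use an FFT on 2D frames; the naive transform here only keeps the example dependency-free.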

[113] HOIverse: A Synthetic Scene Graph Dataset With Human Object Interactions

Mrunmai Vivek Phatak,Julian Lorenz,Nico Hörmann,Jörg Hähner,Rainer Lienhart

Main category: cs.CV

TL;DR: This paper presents HOIverse, a new synthetic dataset for scene understanding that combines scene graphs and human-object interactions, offering detailed relationship data to advance research in this field.

Details Motivation: Current research lacks reliable datasets for scene understanding in indoor environments involving humans. Scene Graphs are used to create structured representations for better visual understanding. Method: The authors introduce HOIverse, a synthetic dataset combining scene graphs and human-object interactions with detailed parametric relations. They benchmark this dataset on state-of-the-art scene graph generation models. Result: HOIverse provides accurate and dense relationship ground truths between humans and surrounding objects, including RGB images, segmentation masks, depth images, and human keypoints. Conclusion: The paper concludes that the HOIverse dataset can significantly accelerate research in scene understanding involving people by providing accurate and dense relationship ground truths. Abstract: When humans and robotic agents coexist in an environment, scene understanding becomes crucial for the agents to carry out various downstream tasks like navigation and planning. Hence, an agent must be capable of localizing and identifying actions performed by the human. Current research lacks reliable datasets for performing scene understanding within indoor environments where humans are also a part of the scene. Scene Graphs enable us to generate a structured representation of a scene or an image to perform visual scene understanding. To tackle this, we present HOIverse, a synthetic dataset at the intersection of scene graph and human-object interaction, consisting of accurate and dense relationship ground truths between humans and surrounding objects along with corresponding RGB images, segmentation masks, depth images and human keypoints. We compute parametric relations between various pairs of objects and human-object pairs, resulting in accurate and unambiguous relation definitions. In addition, we benchmark our dataset on state-of-the-art scene graph generation models to predict parametric relations and human-object interactions. 
Through this dataset, we aim to accelerate research in the field of scene understanding involving people.

[114] PEVLM: Parallel Encoding for Vision-Language Models

Letian Kang,Shixian Luo,Yiqiang Li,Xiaoyang Yu,Shenxuan Zhou,Yong Wu

Main category: cs.CV

TL;DR: This paper proposes PEVLM, a new parallel encoding strategy for vision-language models that substantially improves the efficiency and accuracy of long video understanding without model finetuning.

Details Motivation: Applying existing vision-language models to long video understanding is constrained by the quadratic complexity of standard attention. Method: Proposes PEVLM, a parallel encoding strategy that processes the input in blocks, preserves full-attention positional embeddings, and aligns attention weights to mimic full-attention distributions. Result: Experiments show that PEVLM improves accuracy by up to 8.37% over existing inference-efficient methods, speeds up attention computation by up to 7.47x, and reduces end-to-end latency by 40%. Under strict latency constraints, accuracy rises from 23.26% to 61.03%. Conclusion: PEVLM is an effective approach to low-latency, long-context video understanding and is well suited to real-world applications such as autonomous driving. Abstract: Vision-Language Models (VLMs) have demonstrated strong performance in video-language tasks, yet their application to long video understanding remains constrained by the quadratic complexity of standard attention mechanisms. In this paper, we propose PEVLM, a parallel encoding strategy specifically designed to improve the prefill efficiency of VLMs without requiring model finetuning. PEVLM partitions the input into block-wise segments with a shared sink, preserves full-attention positional embeddings, and aligns attention weights to mimic full-attention distributions. This design reduces attention computation from $O((T \times N)^2)$ to $O(T \times N)$ while maintaining high accuracy. Extensive experiments on the LongVideoBench benchmark show that PEVLM achieves up to 8.37% accuracy improvement over existing inference-efficient methods and delivers up to 7.47x speedup in attention computation and 40% reduction in end-to-end latency. Under strict latency constraints, PEVLM significantly outperforms baselines, raising accuracy from 23.26% to 61.03%. These results highlight PEVLM's effectiveness for low-latency, long-context video understanding, making it well-suited for real-world applications such as autonomous driving.
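
The complexity claim can be made concrete by counting query-key pairs under a block-wise mask with a shared sink. The sketch below is an illustration under assumptions, not PEVLM's implementation: the token layout (sink tokens first, then T blocks of N tokens), the rule that each token attends to the sink plus its own block, and all function names are hypothetical, and the positional-embedding and attention-weight alignment steps are not modeled. With the block length held fixed, the block-wise count grows linearly in the number of blocks, while full attention grows quadratically.

```python
def full_attention_pairs(num_blocks, block_len):
    """Query-key pairs when every token attends to every token."""
    total = num_blocks * block_len
    return total * total


def blockwise_attention_pairs(num_blocks, block_len, sink_len):
    """Pairs when each block token attends to its own block plus the sink."""
    return num_blocks * block_len * (block_len + sink_len)


def block_mask(num_blocks, block_len, sink_len):
    """Boolean mask over [sink tokens | block tokens]; True means 'may attend'."""
    total = sink_len + num_blocks * block_len
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if k < sink_len:
                mask[q][k] = True  # every token sees the shared sink
            elif q >= sink_len:
                same_block = (q - sink_len) // block_len == (k - sink_len) // block_len
                mask[q][k] = same_block
    return mask


# Doubling the number of blocks doubles the block-wise cost
# but quadruples the full-attention cost.
for blocks in (8, 16, 32):
    print(blocks,
          full_attention_pairs(blocks, 64),
          blockwise_attention_pairs(blocks, 64, 4))
```

Because the blocks do not attend to one another, they can be encoded in parallel, which is the source of the prefill speedup the abstract reports.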

[115] Video Compression for Spatiotemporal Earth System Data

Oscar J. Pellicer-Valero,Cesar Aybar,Gustau Camps Valls

Main category: cs.CV

TL;DR: xarrayvideo is a Python library that compresses Earth observation datasets as videos, achieving high compression ratios while preserving data quality, demonstrated through real-world examples and applications in machine learning.

Details Motivation: Large-scale Earth system datasets have characteristics similar to videos, and their spatial, temporal, and spectral redundancies can be exploited by video compression techniques to achieve high compression ratios while maintaining fidelity. Method: xarrayvideo compresses multichannel spatiotemporal datasets by encoding them as videos leveraging standard video codecs through ffmpeg. Result: The library achieves compression ratios up to 250x with high fidelity across four real-world datasets. PSNRs achieved range from 40.60 dB to 65.91 dB depending on dataset and compression rate. Compressed versions of DeepExtremeCubes and DynamicEarthNet are redistributed in TACO format via HuggingFace at significantly reduced sizes without compromising quality, and no performance loss is observed in downstream deep learning tasks. Conclusion: xarrayvideo presents an efficient solution for handling the rapidly growing size of Earth observation datasets, making advanced compression techniques accessible and practical to the Earth science community. Abstract: Large-scale Earth system datasets, from high-resolution remote sensing imagery to spatiotemporal climate model outputs, exhibit characteristics analogous to those of standard videos. Their inherent spatial, temporal, and spectral redundancies can thus be readily exploited by established video compression techniques. Here, we present xarrayvideo, a Python library for compressing multichannel spatiotemporal datasets by encoding them as videos. Our approach achieves compression ratios of up to 250x while maintaining high fidelity by leveraging standard, well-optimized video codecs through ffmpeg. 
We demonstrate the library's effectiveness on four real-world multichannel spatiotemporal datasets: DynamicEarthNet (very high resolution Planet images), DeepExtremeCubes (high resolution Sentinel-2 images), ERA5 (weather reanalysis data), and the SimpleS2 dataset (high resolution multichannel Sentinel-2 images), achieving Peak Signal-to-Noise Ratios (PSNRs) of 55.86, 40.60, 46.58, and 43.23 dB at 0.1 bits per pixel per band (bpppb) and 65.91, 54.28, 62.90, and 55.04 dB at 1 bpppb. We are redistributing two of these datasets, DeepExtremeCubes (2.3 Tb) and DynamicEarthNet (525 Gb), in the machine-learning-ready and cloud-ready TACO format through HuggingFace at significantly reduced sizes (270 Gb and 8.5 Gb, respectively) without compromising quality (PSNR 55.77-56.65 and 60.15). No performance loss is observed when the compressed versions of these datasets are used in their respective deep learning-based downstream tasks (next step reflectance prediction and landcover segmentation). In conclusion, xarrayvideo presents an efficient solution for handling the rapidly growing size of Earth observation datasets, making advanced compression techniques accessible and practical to the Earth science community. The library is available for use at https://github.com/IPL-UV/xarrayvideo
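
The two quantities reported above, PSNR and compression ratio, follow from short formulas. The sketch below is a generic illustration and does not use the xarrayvideo API: the function names are hypothetical, the toy signal is made up, and the only figures taken from the abstract are the redistributed DynamicEarthNet sizes (525 Gb before, 8.5 Gb after).

```python
import math


def psnr(original, decoded, max_value):
    """Peak Signal-to-Noise Ratio in dB: 10 * log10(MAX^2 / MSE)."""
    if len(original) != len(decoded) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(original, decoded)) / len(original)
    if mse == 0:
        return math.inf  # lossless reconstruction
    return 10.0 * math.log10(max_value ** 2 / mse)


def compression_ratio(original_size, compressed_size):
    """How many times smaller the compressed data is (same units for both)."""
    return original_size / compressed_size


# Toy reconstruction with a uniform error of 0.1 on a [0, 1] signal.
print(psnr([0.0, 1.0], [0.1, 0.9], max_value=1.0))

# Sizes quoted in the abstract for the redistributed DynamicEarthNet dataset.
print(round(compression_ratio(525, 8.5), 1))
```

Each additional 20 dB of PSNR corresponds to a tenfold reduction in root-mean-square error relative to the signal range, which helps when reading the 40 to 66 dB figures above.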

[116] SAM2-SGP: Enhancing SAM2 for Medical Image Segmentation via Support-Set Guided Prompting

Yang Xing,Jiong Wu,Yuheng Bu,Kuang Gong

Main category: cs.CV

TL;DR: This paper proposes SAM2-SGP, an automated prompting framework for medical image segmentation that improves performance by generating pseudo-masks and mitigating domain shift.

Details Motivation: Existing vision foundation models like SAM2 rely on manual prompts and suffer from domain shift, limiting their effectiveness in medical image segmentation. Method: SAM2-SGP uses a Pseudo-mask Generation module, a Pseudo-mask Attention module, and a low-rank adaptation strategy to automate prompting and mitigate domain shift. Result: SAM2-SGP outperformed state-of-the-art models such as nnUNet and SwinUNet, as well as foundation models like SAM2 and MedSAM2, across multiple medical imaging modalities. Conclusion: The proposed SAM2-SGP framework effectively eliminates the need for manual prompts in medical image segmentation and significantly improves performance over existing models. Abstract: Although new vision foundation models such as Segment Anything Model 2 (SAM2) have significantly enhanced zero-shot image segmentation capabilities, reliance on human-provided prompts poses significant challenges in adapting SAM2 to medical image segmentation tasks. Moreover, SAM2's performance in medical image segmentation was limited by the domain shift issue, since it was originally trained on natural images and videos. To address these challenges, we proposed SAM2 with support-set guided prompting (SAM2-SGP), a framework that eliminated the need for manual prompts. The proposed model leveraged the memory mechanism of SAM2 to generate pseudo-masks using image-mask pairs from a support set via a Pseudo-mask Generation (PMG) module. We further introduced a novel Pseudo-mask Attention (PMA) module, which used these pseudo-masks to automatically generate bounding boxes and enhance localized feature extraction by guiding attention to relevant areas. Furthermore, a low-rank adaptation (LoRA) strategy was adopted to mitigate the domain shift issue. 
The proposed framework was evaluated on both 2D and 3D datasets across multiple medical imaging modalities, including fundus photography, X-ray, computed tomography (CT), magnetic resonance imaging (MRI), positron emission tomography (PET), and ultrasound. The results demonstrated a significant performance improvement over state-of-the-art models, such as nnUNet and SwinUNet, as well as foundation models, such as SAM2 and MedSAM2, underscoring the effectiveness of the proposed approach. Our code is publicly available at https://github.com/astlian9/SAM_Support.
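
One step the abstract describes, turning a pseudo-mask into a bounding-box prompt, can be sketched generically. This is an assumption-laden illustration rather than the PMA module: the function name `mask_to_bbox`, the list-of-lists mask format, and the `(x_min, y_min, x_max, y_max)` convention are choices made here, and the attention guidance and LoRA components are not modeled.

```python
def mask_to_bbox(mask):
    """Tight box (x_min, y_min, x_max, y_max) around the foreground of a
    binary mask given as a list of rows; returns None for an empty mask."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    width = len(mask[0])
    cols = [c for c in range(width) if any(row[c] for row in mask)]
    return (min(cols), min(rows), max(cols), max(rows))


pseudo_mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
print(mask_to_bbox(pseudo_mask))
```

A box derived this way can stand in for a manually drawn prompt; returning `None` for an empty mask leaves it to the caller to decide how to handle images where the support set yields no foreground.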

[117] Genome-Anchored Foundation Model Embeddings Improve Molecular Prediction from Histology Images

Cheng Jin,Fengtao Zhou,Yunfang Yu,Jiabo Ma,Yihui Wang,Yingxue Xu,Huajun Zhou,Hao Jiang,Luyang Luo,Luhui Mao,Zifan He,Xiuming Zhang,Jing Zhang,Ronald Chan,Herui Yao,Hao Chen

Main category: cs.CV

TL;DR: This study proposes PathLUPI, which integrates transcriptomic information during training to enable effective molecular prediction from whole-slide images (WSIs) without requiring genomic data at inference.

Details Motivation: Precision oncology needs accurate molecular insights, but obtaining them directly from genomics is costly and time-consuming for broad clinical use. Current deep learning methods still struggle to predict complex molecular features and patient prognosis directly from routine whole-slide images (WSIs). Method: Introduces PathLUPI, which uses transcriptomic information during training to extract genome-anchored histological embeddings. Result: Evaluated on 49 oncology tasks with 11,257 cases across 20 cohorts, PathLUPI outperforms conventional methods, reaching AUC ≥ 0.80 in 14 biomarker prediction and molecular subtyping tasks and C-index ≥ 0.70 in survival cohorts of 5 major cancer types. Conclusion: By using transcriptomic information during training and only WSIs at inference, PathLUPI enables effective molecular prediction and offers a new strategy for addressing the limitations of existing models. Abstract: Precision oncology requires accurate molecular insights, yet obtaining these directly from genomics is costly and time-consuming for broad clinical use. Predicting complex molecular features and patient prognosis directly from routine whole-slide images (WSI) remains a major challenge for current deep learning methods. Here we introduce PathLUPI, which uses transcriptomic privileged information during training to extract genome-anchored histological embeddings, enabling effective molecular prediction using only WSIs at inference. Through extensive evaluation across 49 molecular oncology tasks using 11,257 cases among 20 cohorts, PathLUPI demonstrated superior performance compared to conventional methods trained solely on WSIs. Crucially, it achieves AUC $\geq$ 0.80 in 14 of the biomarker prediction and molecular subtyping tasks and C-index $\geq$ 0.70 in survival cohorts of 5 major cancer types. Moreover, PathLUPI embeddings reveal distinct cellular morphological signatures associated with specific genotypes and related biological pathways within WSIs. By effectively encoding molecular context to refine WSI representations, PathLUPI overcomes a key limitation of existing models and offers a novel strategy to bridge molecular insights with routine pathology workflows for wider clinical application.
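
For readers unfamiliar with the survival metric quoted above, the concordance index can be computed as follows. This is a minimal sketch assuming Harrell's estimator; the abstract does not say which C-index variant the authors use, and the function name and toy data are illustrative.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index.

    A pair (i, j) is comparable when subject i had an observed event and a
    shorter follow-up time than subject j. The pair is concordant when the
    model assigns i the higher risk; ties in risk count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored subjects cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort: follow-up time, event indicator (1 = event, 0 = censored),
# and a predicted risk score per patient.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.5, 0.1]
print(concordance_index(times, events, risks))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the reported threshold of 0.70 indicates that the predicted risks order most comparable patient pairs correctly.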

[118] Semantic Scene Graph for Ultrasound Image Explanation and Scanning Guidance

Xuesong Li,Dianye Huang,Yameng Zhang,Nassir Navab,Zhongliang Jiang

Main category: cs.CV

TL;DR: This study introduces an ultrasound scene graph (SG) to explain image content and provide scanning guidance, helping non-experts interpret ultrasound images and scan in a more standardized way.

Details Motivation: Differences in imaging and acquisition parameters cause significant visual variability in medical ultrasound, and interpretation and scanning guidance tools for non-expert users are currently lacking. Method: The ultrasound SG is first computed with a transformer-based one-stage method that needs no explicit object detection; the user query is then used to refine the SG representation through large language models (LLMs) and produce an easily understood explanation; the SG is also explored for guiding ultrasound scanning. Result: Validated on images of the neck region (including the carotid and thyroid) from five volunteers, the method shows potential for improving the interpretability and usability of ultrasound images. Conclusion: By improving ordinary users' ability to interpret and acquire ultrasound images, the method could help make ultrasound more widely accessible. Abstract: Understanding medical ultrasound imaging remains a long-standing challenge due to significant visual variability caused by differences in imaging and acquisition parameters. Recent advancements in large language models (LLMs) have been used to automatically generate terminology-rich summaries orientated to clinicians with sufficient physiological knowledge. Nevertheless, the increasing demand for improved ultrasound interpretability and basic scanning guidance among non-expert users, e.g., in point-of-care settings, has not yet been explored. In this study, we first introduce the scene graph (SG) for ultrasound images to explain image content to ordinary users and provide guidance for ultrasound scanning. The ultrasound SG is first computed using a transformer-based one-stage method, eliminating the need for explicit object detection. To generate a graspable image explanation for ordinary users, the user query is then used to further refine the abstract SG representation through LLMs. Additionally, the predicted SG is explored for its potential in guiding ultrasound scanning toward missing anatomies within the current imaging view, assisting ordinary users in achieving more standardized and complete anatomical exploration. The effectiveness of this SG-based image explanation and scanning guidance has been validated on images from the left and right neck regions, including the carotid and thyroid, across five volunteers. The results demonstrate the potential of the method to maximally democratize ultrasound by enhancing its interpretability and usability for ordinary users.

[119] UltraAD: Fine-Grained Ultrasound Anomaly Classification via Few-Shot CLIP Adaptation

Yue Zhou,Yuan Bi,Wenjuan Tong,Wei Wang,Nassir Navab,Zhongliang Jiang

Main category: cs.CV

TL;DR: UltraAD proposes a vision-language model for generalized anomaly localization and fine-grained classification in breast ultrasound images, outperforming state-of-the-art methods.

Details Motivation: Precise anomaly detection in medical images is crucial for clinical decision-making. However, existing unsupervised or semi-supervised methods lack fine-grained differentiation and face challenges due to domain gaps caused by variations in ultrasound imaging devices and parameters. Method: UltraAD leverages few-shot ultrasound examples using a vision-language model (VLM) approach. It fuses image-level tokens with learnable text embeddings for improved localization, then integrates this feature with patch-level tokens to refine local representations. For classification, it builds a memory bank of few-shot samples and text descriptions while freezing text embeddings during training to better align with medical data. Result: UltraAD was evaluated on three breast US datasets and demonstrated superior performance over state-of-the-art methods in both lesion localization and fine-grained medical classification. Conclusion: UltraAD effectively addresses the challenges of precise and fine-grained anomaly detection in ultrasound images, showing strong performance across multiple datasets. Abstract: Precise anomaly detection in medical images is critical for clinical decision-making. While recent unsupervised or semi-supervised anomaly detection methods trained on large-scale normal data show promising results, they lack fine-grained differentiation, such as benign vs. malignant tumors. Additionally, ultrasound (US) imaging is highly sensitive to devices and acquisition parameter variations, creating significant domain gaps in the resulting US images. To address these challenges, we propose UltraAD, a vision-language model (VLM)-based approach that leverages few-shot US examples for generalized anomaly localization and fine-grained classification. To enhance localization performance, the image-level token of query visual prototypes is first fused with learnable text embeddings. 
This image-informed prompt feature is then further integrated with patch-level tokens, refining local representations for improved accuracy. For fine-grained classification, a memory bank is constructed from few-shot image samples and corresponding text descriptions that capture anatomical and abnormality-specific features. During training, the stored text embeddings remain frozen, while image features are adapted to better align with medical data. UltraAD has been extensively evaluated on three breast US datasets, outperforming state-of-the-art methods in both lesion localization and fine-grained medical classification. The code will be released upon acceptance.

[120] Systematic Comparison of Projection Methods for Monocular 3D Human Pose Estimation on Fisheye Images

Stephanie Käs,Sven Peter,Henrik Thillmann,Anton Burenko,David Benjamin Adrian,Dennis Mack,Timm Linder,Bastian Leibe

Main category: cs.CV

TL;DR: This paper investigates how different fisheye camera models affect 3D human pose estimation accuracy, showing that advanced models improve results and proposing a novel dataset for evaluation.

Details Motivation: Fisheye cameras provide a wider field of view (FOV) than standard pinhole cameras, making them useful for human-robot interaction and automotive applications. However, accurately detecting human poses in fisheye images is challenging due to curved distortions, which necessitates systematic evaluation of existing undistortion methods for wide FOV poses. Method: The research evaluates the impact of pinhole, equidistant, and double sphere camera models, as well as cylindrical projection methods, on 3D human pose estimation accuracy. A heuristic is proposed for selecting the appropriate projection model based on the detection bounding box. Result: In close-up scenarios, pinhole projection was found inadequate. The usage of advanced fisheye models such as the double sphere model significantly improves 3D human pose estimation accuracy. The study also introduces the FISHnCHIPS dataset with 3D human skeleton annotations in fisheye images. Conclusion: The study concludes that advanced fisheye models like the double sphere model significantly enhance 3D human pose estimation accuracy compared to traditional methods, and the optimal projection method varies with the FOV covered by the human pose. Abstract: Fisheye cameras offer robots the ability to capture human movements across a wider field of view (FOV) than standard pinhole cameras, making them particularly useful for applications in human-robot interaction and automotive contexts. However, accurately detecting human poses in fisheye images is challenging due to the curved distortions inherent to fisheye optics. While various methods for undistorting fisheye images have been proposed, their effectiveness and limitations for poses that cover a wide FOV has not been systematically evaluated in the context of absolute human pose estimation from monocular fisheye images. 
To address this gap, we evaluate the impact of pinhole, equidistant and double sphere camera models, as well as cylindrical projection methods, on 3D human pose estimation accuracy. We find that in close-up scenarios, pinhole projection is inadequate, and the optimal projection method varies with the FOV covered by the human pose. The usage of advanced fisheye models like the double sphere model significantly enhances 3D human pose estimation accuracy. We propose a heuristic for selecting the appropriate projection model based on the detection bounding box to enhance prediction quality. Additionally, we introduce and evaluate on our novel dataset FISHnCHIPS, which features 3D human skeleton annotations in fisheye images, including images from unconventional angles, such as extreme close-ups, ground-mounted cameras, and wide-FOV poses, available at: https://www.vision.rwth-aachen.de/fishnchips
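
To make the difference between two of the projection models named above concrete, here is a minimal sketch (not the paper's code) that compares the image radius at which a ray lands under the pinhole model (r = f·tan θ) and the equidistant fisheye model (r = f·θ). The focal length `f` and the sample angles are illustrative assumptions; the double sphere model and cylindrical projection are not covered here.

```python
import numpy as np

def pinhole_radius(theta, f):
    """Image radius for a ray at angle theta (radians) from the optical axis, pinhole model."""
    return f * np.tan(theta)

def equidistant_radius(theta, f):
    """Equidistant fisheye model: the image radius grows linearly with the ray angle."""
    return f * theta

f = 300.0  # hypothetical focal length in pixels
for deg in (10, 45, 80):
    theta = np.deg2rad(deg)
    print(deg, pinhole_radius(theta, f), equidistant_radius(theta, f))
```

Near the optical axis the two radii are almost equal, but the pinhole radius diverges as θ approaches 90 degrees, which is consistent with the abstract's finding that pinhole projection is inadequate for close-up, wide-FOV poses.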

[121] CoCo4D: Comprehensive and Complex 4D Scene Generation

Junwei Zhou,Xueting Li,Lu Qi,Ming-Hsuan Yang

Main category: cs.CV

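TL;DR: CoCo4D is a new framework for generating detailed dynamic 4D scenes from text prompts, effectively achieving multi-view consistency and an immersive experience.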

Details Motivation: Existing 4D synthesis methods are limited in multi-view consistency and scene immersion, so a new framework is proposed to address these problems. Method: The 4D scene is split into foreground and background, which are handled separately; a video diffusion model generates an initial motion sequence, and a progressive outpainting scheme is used to synthesize the dynamic foreground and the background. Result: Experiments show that CoCo4D performs strongly in 4D scene generation, demonstrating both efficiency and effectiveness. Conclusion: CoCo4D generates detailed dynamic 4D scenes and performs well compared with existing methods. Abstract: Existing 4D synthesis methods primarily focus on object-level generation or dynamic scene synthesis with limited novel views, restricting their ability to generate multi-view consistent and immersive dynamic 4D scenes. To address these constraints, we propose a framework (dubbed as CoCo4D) for generating detailed dynamic 4D scenes from text prompts, with the option to include images. Our method leverages the crucial observation that articulated motion typically characterizes foreground objects, whereas background alterations are less pronounced. Consequently, CoCo4D divides 4D scene synthesis into two responsibilities: modeling the dynamic foreground and creating the evolving background, both directed by a reference motion sequence. Given a text prompt and an optional reference image, CoCo4D first generates an initial motion sequence utilizing video diffusion models. This motion sequence then guides the synthesis of both the dynamic foreground object and the background using a novel progressive outpainting scheme. To ensure seamless integration of the moving foreground object within the dynamic background, CoCo4D optimizes a parametric trajectory for the foreground, resulting in realistic and coherent blending. Extensive experiments show that CoCo4D achieves comparable or superior performance in 4D scene generation compared to existing methods, demonstrating its effectiveness and efficiency. More results are presented on our website https://colezwhy.github.io/coco4d/.

[122] One Prototype Is Enough: Single-Prototype Activation for Interpretable Image Classification

Yitao Peng,Lianghua He,Die Hu

Main category: cs.CV

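TL;DR: ProtoSolo is a new deep neural architecture that uses single-prototype activation, feature-map-based comparison, and a non-prototype projection learning strategy to significantly improve the interpretability of image classification while maintaining high performance.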

Details Motivation: To address the problem that existing prototype networks need multiple prototypes to decide jointly, which makes explanations complex, while improving both interpretability and classification performance. Method: ProtoSolo classifies using a single prototype activation and introduces a feature-map-based similarity comparison method and a non-prototype projection learning strategy. Result: Experiments on the CUB-200-2011 and Stanford Cars datasets show that ProtoSolo performs strongly on classification and significantly reduces the cognitive complexity of explanations. Conclusion: ProtoSolo demonstrates excellent classification performance and reaches the best level in cognitive complexity of explanations among state-of-the-art interpretable methods. Abstract: In this paper, we propose ProtoSolo, a novel deep neural architecture for interpretable image classification inspired by prototypical networks such as ProtoPNet. Existing prototype networks usually rely on the collaborative decision-making of multiple prototypes to achieve the classification and interpretation of a single category. In contrast, ProtoSolo only requires the activation of a single prototype to complete the classification. This allows the network to explain each category decision by only providing the features that are most similar to the prototype of that category, significantly reducing the cognitive complexity of the explanation. Secondly, we propose a feature-based comparison method, which uses feature map instead of full-channel feature vector as the object of similarity comparison and prototype learning. This design enables ProtoSolo to utilize richer global information for classification while relying on a single prototype activation. In addition, we propose a non-prototype projection learning strategy, which preserves the information association between the prototype and the training image patches while avoiding the sharp change of the network structure caused by the projection operation, thus avoiding its negative impact on the classification performance. Experiments on the CUB-200-2011 and Stanford Cars datasets show that ProtoSolo achieves superior performance in classification tasks and reaches the best level in terms of cognitive complexity of explanations compared to state-of-the-art interpretable methods. The code is available at https://github.com/pyt19/ProtoSolo.

[123] Bind-Your-Avatar: Multi-Talking-Character Video Generation with Dynamic 3D-mask-based Embedding Router

Yubo Huang,Weiqiang Wang,Sirui Zhao,Tong Xu,Lin Liu,Enhong Chen

Main category: cs.CV

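TL;DR: This paper proposes a video generation method for multiple characters conversing in the same scene, addressing audio-to-character correspondence and the lack of suitable datasets, and demonstrates superior performance.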

Details Motivation: Current research on audio-driven avatar generation focuses mainly on single-character scenarios, and unified conversation video generation with multiple characters sharing the same spatial environment has received little study. Method: An MM-DiT-based framework is proposed, comprising a fine-grained Embedding Router, a 3D-mask embedding router, and corresponding loss functions and optimization strategies; the first dataset and a benchmark for multi-character conversation video generation are also constructed. Result: The method achieves fine-grained control of multiple characters in the same scene and manages audio-to-character correspondence; the dataset and open-source tools are released, and experiments show performance superior to multiple state-of-the-art methods. Conclusion: With its new framework and dataset, Bind-Your-Avatar achieves notable results in multi-character conversation video generation and outperforms existing methods. Abstract: Recent years have witnessed remarkable advances in audio-driven talking head generation. However, existing approaches predominantly focus on single-character scenarios. While some methods can create separate conversation videos between two individuals, the critical challenge of generating unified conversation videos with multiple physically co-present characters sharing the same spatial environment remains largely unaddressed. This setting presents two key challenges: audio-to-character correspondence control and the lack of suitable datasets featuring multi-character talking videos within the same scene. To address these challenges, we introduce Bind-Your-Avatar, an MM-DiT-based model specifically designed for multi-talking-character video generation in the same scene. Specifically, we propose (1) A novel framework incorporating a fine-grained Embedding Router that binds 'who' and 'speak what' together to address the audio-to-character correspondence control. (2) Two methods for implementing a 3D-mask embedding router that enables frame-wise, fine-grained control of individual characters, with distinct loss functions based on observed geometric priors and a mask refinement strategy to enhance the accuracy and temporal smoothness of the predicted masks. (3) The first dataset, to the best of our knowledge, specifically constructed for multi-talking-character video generation, and accompanied by an open-source data processing pipeline, and (4) A benchmark for the dual-talking-characters video generation, with extensive experiments demonstrating superior performance over multiple state-of-the-art methods.

[124] SimpleGVR: A Simple Baseline for Latent-Cascaded Video Super-Resolution

Liangbin Xie,Yu Li,Shian Du,Menghan Xia,Xintao Wang,Fanghua Yu,Ziyan Chen,Pengfei Wan,Jiantao Zhou,Chao Dong

Main category: cs.CV

TL;DR: This paper explores key design principles for cascaded video super-resolution models, introducing novel strategies for alignment, training, and efficiency, ultimately delivering superior performance in high-resolution video generation.

Details Motivation: As user expectations shift toward higher-resolution outputs, latent computation alone becomes inadequate. Cascaded video super-resolution offers a promising solution, but current design principles for these models remain underexplored. Method: The study proposes two degradation strategies to generate training pairs, analyzes key aspects of VSR model behavior such as timestep sampling and noise augmentation effects, and introduces an interleaving temporal unit with sparse local attention for efficiency. Result: Extensive experiments demonstrate that the proposed framework outperforms existing methods, with ablation studies confirming the effectiveness of each design choice in achieving efficient and high-quality video super-resolution. Conclusion: This paper concludes that cascaded video super-resolution models can be designed effectively by aligning them with the base model's output characteristics, leveraging architectural and training innovations for efficient high-resolution video generation. Abstract: Latent diffusion models have emerged as a leading paradigm for efficient video generation. However, as user expectations shift toward higher-resolution outputs, relying solely on latent computation becomes inadequate. A promising approach involves decoupling the process into two stages: semantic content generation and detail synthesis. The former employs a computationally intensive base model at lower resolutions, while the latter leverages a lightweight cascaded video super-resolution (VSR) model to achieve high-resolution output. In this work, we focus on studying key design principles for latter cascaded VSR models, which are underexplored currently. First, we propose two degradation strategies to generate training pairs that better mimic the output characteristics of the base model, ensuring alignment between the VSR model and its upstream generator. 
Second, we provide critical insights into VSR model behavior through systematic analysis of (1) timestep sampling strategies, (2) noise augmentation effects on low-resolution (LR) inputs. These findings directly inform our architectural and training innovations. Finally, we introduce interleaving temporal unit and sparse local attention to achieve efficient training and inference, drastically reducing computational overhead. Extensive experiments demonstrate the superiority of our framework over existing methods, with ablation studies confirming the efficacy of each design choice. Our work establishes a simple yet effective baseline for cascaded video super-resolution generation, offering practical insights to guide future advancements in efficient cascaded synthesis systems.

[125] Improving Progressive Generation with Decomposable Flow Matching

Moayed Haji-Ali,Willi Menapace,Ivan Skorokhodov,Arpit Sahni,Sergey Tulyakov,Vicente Ordonez,Aliaksandr Siarohin

Main category: cs.CV

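TL;DR: This paper proposes Decomposable Flow Matching (DFM), a new method for progressive generation of visual media. It applies flow matching independently at multiple scales, significantly improving generation quality while keeping the architecture simple and requiring minimal changes to the training pipeline.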

Details Motivation: Existing multi-stage architectures increase the complexity of the overall approach and require custom diffusion formulations, decomposition-dependent stage transitions, additional samplers, and so on; a new approach is needed to simplify the process. Method: DFM achieves progressive generation by applying flow matching independently at each level of a user-defined multi-scale representation (such as a Laplacian pyramid). Result: Experiments show that, under the same training compute, DFM improves FDD scores on ImageNet-1k 512px by 35.2% over the base architecture and by 26.4% over the best-performing baseline. DFM also speeds up convergence to the training distribution when fine-tuning large models such as FLUX. Conclusion: DFM is a simple and effective framework for progressive generation of visual media that improves generation quality without adding complexity. Abstract: Generating high-dimensional visual modalities is a computationally intensive task. A common solution is progressive generation, where the outputs are synthesized in a coarse-to-fine spectral autoregressive manner. While diffusion models benefit from the coarse-to-fine nature of denoising, explicit multi-stage architectures are rarely adopted. These architectures have increased the complexity of the overall approach, introducing the need for a custom diffusion formulation, decomposition-dependent stage transitions, ad-hoc samplers, or a model cascade. Our contribution, Decomposable Flow Matching (DFM), is a simple and effective framework for the progressive generation of visual media. DFM applies Flow Matching independently at each level of a user-defined multi-scale representation (such as Laplacian pyramid). As shown by our experiments, our approach improves visual quality for both images and videos, featuring superior results compared to prior multistage frameworks. On Imagenet-1k 512px, DFM achieves 35.2% improvements in FDD scores over the base architecture and 26.4% over the best-performing baseline, under the same training compute. When applied to finetuning of large models, such as FLUX, DFM shows faster convergence speed to the training distribution. Crucially, all these advantages are achieved with a single model, architectural simplicity, and minimal modifications to existing training pipelines.
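
The abstract names the Laplacian pyramid as an example of a multi-scale representation. The following is a minimal sketch of such a decomposition, assuming a simplified variant with 2x2 average pooling and nearest-neighbour upsampling (the classic pyramid uses Gaussian filtering, and DFM's actual decomposition may differ). It only shows how an image splits into per-level residual bands that sum back to the original; it does not implement flow matching.

```python
import numpy as np

def downsample(x):
    """Halve resolution by averaging non-overlapping 2x2 blocks."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(x):
    """Double resolution by nearest-neighbour repetition."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def laplacian_pyramid(x, levels):
    """Return [finest residual, ..., coarsest low-pass image]."""
    bands = []
    cur = x
    for _ in range(levels - 1):
        low = downsample(cur)
        bands.append(cur - upsample(low))  # detail lost by downsampling
        cur = low
    bands.append(cur)
    return bands

def reconstruct(bands):
    """Invert the decomposition, going from coarse to fine."""
    cur = bands[-1]
    for band in reversed(bands[:-1]):
        cur = upsample(cur) + band
    return cur

image = np.random.default_rng(0).random((8, 8))
bands = laplacian_pyramid(image, levels=3)
print([b.shape for b in bands])
```

Each band could then be treated as a separate generation target, coarse level first, which is the progressive structure the abstract describes.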

[126] GenHSI: Controllable Generation of Human-Scene Interaction Videos

Zekun Li,Rui Zhou,Rahul Sajnani,Xiaoyan Cong,Daniel Ritchie,Srinath Sridhar

Main category: cs.CV

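TL;DR: This paper proposes GenHSI, a training-free method for controllable long video generation that uses a staged strategy with off-the-shelf diffusion models to generate high-quality human-scene interaction videos.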

Details Motivation: Existing large-scale pre-trained video diffusion models face several problems when generating long movie-like videos with rich human-scene interaction, including unrealistic human-scene interaction, lack of identity consistency, and high training cost. Method: Long video generation is divided into three stages (script writing, pre-visualization, and animation), with off-the-shelf diffusion models used for rendering. Result: Experiments show that GenHSI can generate, from a single scene image, long videos that contain multiple character actions with a consistent camera pose, while effectively preserving scene content and character identity. Conclusion: GenHSI achieves training-free controllable long video generation, addressing the limitations of existing methods in human-scene interaction, identity preservation, and training cost. Abstract: Large-scale pre-trained video diffusion models have exhibited remarkable capabilities in diverse video generation. However, existing solutions face several challenges in using these models to generate long movie-like videos with rich human-object interactions that include unrealistic human-scene interaction, lack of subject identity preservation, and require expensive training. We propose GenHSI, a training-free method for controllable generation of long human-scene interaction videos (HSI). Taking inspiration from movie animation, our key insight is to overcome the limitations of previous work by subdividing the long video generation task into three stages: (1) script writing, (2) pre-visualization, and (3) animation. Given an image of a scene, a user description, and multiple images of a person, we use these three stages to generate long-videos that preserve human-identity and provide rich human-scene interactions. Script writing converts complex human tasks into simple atomic tasks that are used in the pre-visualization stage to generate 3D keyframes (storyboards). These 3D keyframes are rendered and animated by off-the-shelf video diffusion models for consistent long video generation with rich contacts in a 3D-aware manner. A key advantage of our work is that we alleviate the need for scanned, accurate scenes and create 3D keyframes from single-view images. We are the first to generate a long video sequence with a consistent camera pose that contains arbitrary numbers of character actions without training.
Experiments demonstrate that our method can generate long videos that effectively preserve scene content and character identity with plausible human-scene interaction from a single image scene. Visit our project homepage https://kunkun0w0.github.io/project/GenHSI/ for more information.

[127] Active View Selector: Fast and Accurate Active View Selection with Cross Reference Image Quality Assessment

Zirui Wang,Yash Bhalgat,Ruining Li,Victor Adrian Prisacariu

Main category: cs.CV

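TL;DR: This paper proposes a new approach to active view selection based on 2D image quality assessment, achieving substantial quantitative and qualitative improvements in novel view synthesis and 3D reconstruction.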

Details Motivation: Existing methods require specialized designs for different 3D representations and involve complex modelling in 3D space, so the authors seek a method that is agnostic to the 3D representation. Method: Active view selection is reframed as a 2D image quality assessment task; a model is trained to predict SSIM in a multi-view setup and is used to guide view selection. Result: The proposed cross-reference IQA framework achieves substantial quantitative and qualitative improvements on standard benchmarks and runs 14-33 times faster than previous methods. Conclusion: The paper successfully turns view selection into a 2D image quality assessment task and shows its advantages in efficiency and effectiveness. Abstract: We tackle active view selection in novel view synthesis and 3D reconstruction. Existing methods like FisheRF and ActiveNeRF select the next best view by minimizing uncertainty or maximizing information gain in 3D, but they require specialized designs for different 3D representations and involve complex modelling in 3D space. Instead, we reframe this as a 2D image quality assessment (IQA) task, selecting views where current renderings have the lowest quality. Since ground-truth images for candidate views are unavailable, full-reference metrics like PSNR and SSIM are inapplicable, while no-reference metrics, such as MUSIQ and MANIQA, lack the essential multi-view context. Inspired by a recent cross-referencing quality framework CrossScore, we train a model to predict SSIM within a multi-view setup and use it to guide view selection. Our cross-reference IQA framework achieves substantial quantitative and qualitative improvements across standard benchmarks, while being agnostic to 3D representations, and runs 14-33 times faster than previous methods.
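
The selection rule stated in the abstract (pick the view whose current rendering has the lowest quality) can be sketched as an argmin over predicted quality scores. This is a hedged illustration only: `select_next_view` is a hypothetical helper, and the per-pixel score maps stand in for the output of the trained SSIM predictor, which is not reproduced here.

```python
import numpy as np

def select_next_view(predicted_ssim_maps):
    """Return the index of the candidate view with the lowest mean predicted SSIM.

    predicted_ssim_maps: list of 2D arrays, one per candidate view, holding
    per-pixel quality predictions (assumed to come from some trained predictor).
    """
    scores = np.array([m.mean() for m in predicted_ssim_maps])
    return int(np.argmin(scores)), scores

# Hypothetical predictions for three candidate views.
candidates = [np.full((4, 4), 0.9), np.full((4, 4), 0.4), np.full((4, 4), 0.7)]
best, scores = select_next_view(candidates)
print(best, scores)
```

Because the rule only consumes 2D score maps, it does not depend on how the scene is represented in 3D, which matches the representation-agnostic claim in the abstract.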

[128] A Comparative Study of NAFNet Baselines for Image Restoration

Vladislav Esaulov,M. Moein Esfahani

Main category: cs.CV

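TL;DR: This study examines NAFNet, a simple and efficient deep learning baseline for image restoration, and validates the effectiveness of its components.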

Details Motivation: To identify a simple and efficient deep learning baseline for image restoration. Method: Ablation experiments on denoising and deblurring with CIFAR10 images compare how performance changes when components are replaced or removed. Result: Quantitative results show how each modification affects restoration performance, illustrated with examples. Conclusion: The NAFNet design is supported: SimpleGate and simplified attention yield better results than conventional activations and attention, while LayerNorm is important for stable training. Abstract: We study NAFNet (Nonlinear Activation Free Network), a simple and efficient deep learning baseline for image restoration. By using CIFAR10 images corrupted with noise and blur, we conduct an ablation study of NAFNet's core components. Our baseline model implements SimpleGate activation, Simplified Channel Activation (SCA), and LayerNormalization. We compare this baseline to different variants that replace or remove components. Quantitative results (PSNR, SSIM) and examples illustrate how each modification affects restoration performance. Our findings support the NAFNet design: the SimpleGate and simplified attention mechanisms yield better results than conventional activations and attention, while LayerNorm proves to be important for stable training. We conclude with recommendations for model design, discuss potential improvements, and future work.
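
The abstract mentions SimpleGate without defining it. As a rough sketch, assuming the formulation from the original NAFNet work (split the feature map in half along the channel axis and multiply the halves element-wise), it can be written in a few lines of NumPy. The `(N, C, H, W)` layout is an assumption of this example, not something stated in the digest.

```python
import numpy as np

def simple_gate(x):
    """Split channels into two halves and multiply them element-wise.

    x: array of shape (N, C, H, W) with an even number of channels C.
    Returns an array of shape (N, C // 2, H, W).
    """
    a, b = np.split(x, 2, axis=1)
    return a * b

x = np.arange(8, dtype=float).reshape(1, 4, 1, 2)
print(simple_gate(x).shape)
```

The gate halves the channel count and introduces a multiplicative interaction without any explicit nonlinear activation function, which is the design the ablation compares against conventional activations.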

[129] Unified Vision-Language-Action Model

Yuqi Wang,Xinghang Li,Wenxuan Wang,Junbo Zhang,Yingyan Li,Yuntao Chen,Xinlong Wang,Zhaoxiang Zhang

Main category: cs.CV

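TL;DR: This paper presents UniVLA, a unified multimodal VLA model that autoregressively models vision, language, and action signals and incorporates world modeling, significantly improving results on long-horizon tasks.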

Details Motivation: Existing approaches rely mainly on the general comprehension capabilities of vision-language models and overlook the rich temporal and causal structure embedded in visual observations. Method: UniVLA, a unified multimodal VLA model, autoregressively models vision, language, and action signals as discrete token sequences and incorporates world modeling during post-training. Result: UniVLA sets new state-of-the-art results on several simulation benchmarks; for example, it achieves a 95.5% average success rate on the LIBERO benchmark, surpassing pi0-FAST's 85.5%. Conclusion: UniVLA sets new state-of-the-art results on several simulation benchmarks and demonstrates broad applicability to real-world tasks. Abstract: Vision-language-action models (VLAs) have garnered significant attention for their potential in advancing robotic manipulation. However, previous approaches predominantly rely on the general comprehension capabilities of vision-language models (VLMs) to generate action signals, often overlooking the rich temporal and causal structure embedded in visual observations. In this paper, we present UniVLA, a unified and native multimodal VLA model that autoregressively models vision, language, and action signals as discrete token sequences. This formulation enables flexible multimodal tasks learning, particularly from large-scale video data. By incorporating world modeling during post-training, UniVLA captures causal dynamics from videos, facilitating effective transfer to downstream policy learning--especially for long-horizon tasks. Our approach sets new state-of-the-art results across several widely used simulation benchmarks, including CALVIN, LIBERO, and Simplenv-Bridge, significantly surpassing previous methods. For example, UniVLA achieves 95.5% average success rate on LIBERO benchmark, surpassing pi0-FAST's 85.5%. We further demonstrate its broad applicability on real-world ALOHA manipulation and autonomous driving.

[130] AnimaX: Animating the Inanimate in 3D with Joint Video-Pose Diffusion Models

Zehuan Huang,Haoran Feng,Yangtian Sun,Yuanchen Guo,Yanpei Cao,Lu Sheng

Main category: cs.CV

TL;DR: AnimaX is a 3D animation framework that transfers video-based motion knowledge to generate controllable and diverse 3D animations efficiently.

Details Motivation: Traditional motion synthesis methods are limited by fixed skeletal topologies or require costly high-dimensional optimization. AnimaX aims to bridge the motion priors of video diffusion models with controllable skeleton-based animation for more flexible and efficient 3D motion generation. Method: AnimaX represents 3D motion using multi-view, multi-frame 2D pose maps, enabling joint video-pose diffusion conditioned on template renderings and textual prompts. It uses shared positional encodings and modality-aware embeddings to align video and pose sequences temporally and spatially. Resulting pose sequences are converted into mesh animations via inverse kinematics. Result: AnimaX achieves state-of-the-art results on VBench in generalization, motion fidelity, and efficiency, trained on a dataset of 160,000 rigged sequences. Conclusion: AnimaX offers a scalable and efficient solution for 3D animation by effectively transferring video-based motion knowledge to support diverse articulated meshes with arbitrary skeletons. Abstract: We present AnimaX, a feed-forward 3D animation framework that bridges the motion priors of video diffusion models with the controllable structure of skeleton-based animation. Traditional motion synthesis methods are either restricted to fixed skeletal topologies or require costly optimization in high-dimensional deformation spaces. In contrast, AnimaX effectively transfers video-based motion knowledge to the 3D domain, supporting diverse articulated meshes with arbitrary skeletons. Our method represents 3D motion as multi-view, multi-frame 2D pose maps, and enables joint video-pose diffusion conditioned on template renderings and a textual motion prompt. We introduce shared positional encodings and modality-aware embeddings to ensure spatial-temporal alignment between video and pose sequences, effectively transferring video priors to motion generation task. 
The resulting multi-view pose sequences are triangulated into 3D joint positions and converted into mesh animation via inverse kinematics. Trained on a newly curated dataset of 160,000 rigged sequences, AnimaX achieves state-of-the-art results on VBench in generalization, motion fidelity, and efficiency, offering a scalable solution for category-agnostic 3D animation. Project page: https://anima-x.github.io/.

[131] Radial Attention: $O(n\log n)$ Sparse Attention with Energy Decay for Long Video Generation

Xingyang Li,Muyang Li,Tianle Cai,Haocheng Xi,Shuo Yang,Yujun Lin,Lvmin Zhang,Songlin Yang,Jinbo Hu,Kelly Peng,Maneesh Agrawala,Ion Stoica,Kurt Keutzer,Song Han

Main category: cs.CV

TL;DR: This paper introduces Radial Attention, an efficient attention mechanism for video diffusion models that significantly reduces computational costs while maintaining video quality, enabling longer video generation and faster inference.

Details Motivation: The motivation stems from the prohibitively high computational costs of training and inference in video diffusion models due to the additional temporal dimension. The authors aim to address this challenge by leveraging the observed Spatiotemporal Energy Decay phenomenon. Method: The authors propose Radial Attention, which uses a static attention mask to reduce computational complexity by focusing on spatially nearby tokens and shrinking the attention window with increasing temporal distance. They also employ LoRA-based fine-tuning for extending pre-trained model capabilities efficiently. Result: Radial Attention achieves up to a 1.9× speedup over dense attention, enables video generation up to 4× longer with minimal tuning, reduces training costs by up to 4.4× compared to direct fine-tuning, and accelerates inference by up to 3.7×. Conclusion: Radial Attention is a scalable sparse attention mechanism that effectively reduces computational costs in video diffusion models while maintaining video quality, allowing for longer video generation and efficient fine-tuning. Abstract: Recent advances in diffusion models have enabled high-quality video generation, but the additional temporal dimension significantly increases computational costs, making training and inference on long videos prohibitively expensive. In this paper, we identify a phenomenon we term Spatiotemporal Energy Decay in video diffusion models: post-softmax attention scores diminish as spatial and temporal distance between tokens increase, akin to the physical decay of signal or waves over space and time in nature. Motivated by this, we propose Radial Attention, a scalable sparse attention mechanism with $O(n \log n)$ complexity that translates energy decay into exponentially decaying compute density, which is significantly more efficient than standard $O(n^2)$ dense attention and more expressive than linear attention. 
Specifically, Radial Attention employs a simple, static attention mask where each token attends to spatially nearby tokens, with the attention window size shrinking with temporal distance. Moreover, it allows pre-trained video diffusion models to extend their generation length with efficient LoRA-based fine-tuning. Extensive experiments show that Radial Attention maintains video quality across Wan2.1-14B, HunyuanVideo, and Mochi 1, achieving up to a 1.9$\times$ speedup over the original dense attention. With minimal tuning, it enables video generation up to 4$\times$ longer while reducing training costs by up to 4.4$\times$ compared to direct fine-tuning and accelerating inference by up to 3.7$\times$ compared to dense attention inference.
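
The static mask described above (attend to spatially nearby tokens, with a window that shrinks as temporal distance grows) can be illustrated with a toy construction. This is not the paper's mask: the 1D spatial layout, the halving of the window per frame of temporal distance, and the names `radial_like_mask` and `base_window` are all assumptions made for illustration.

```python
import numpy as np

def radial_like_mask(num_frames, tokens_per_frame, base_window):
    """Boolean mask of shape (n, n) with n = num_frames * tokens_per_frame.

    Token i may attend to token j when their spatial distance is at most
    base_window // 2**dt, where dt is their distance in frames (assumed rule).
    """
    t = np.repeat(np.arange(num_frames), tokens_per_frame)  # frame index of each token
    s = np.tile(np.arange(tokens_per_frame), num_frames)    # spatial index of each token
    dt = np.abs(t[:, None] - t[None, :])
    ds = np.abs(s[:, None] - s[None, :])
    window = base_window // (2 ** dt)  # window shrinks exponentially with temporal distance
    return ds <= window

mask = radial_like_mask(num_frames=4, tokens_per_frame=8, base_window=4)
print(mask.shape, mask.mean())
```

In this toy version, tokens in the same frame see a window of `base_window` positions on each side, while tokens three frames apart only see the same spatial position, so the fraction of attended pairs drops quickly with temporal distance.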

eess.IV [Back]

[132] Assessing Risk of Stealing Proprietary Models for Medical Imaging Tasks

Ankita Raj,Harsh Swaika,Deepankar Varma,Chetan Arora

Main category: eess.IV

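TL;DR: This paper studies model stealing attacks on medical imaging models in the black-box setting and proposes QueryWise to improve attack effectiveness under limited query budgets.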

Details Motivation: Despite the success of deep learning in medical imaging, model weights are usually hidden to protect intellectual property. These models nonetheless remain vulnerable to model stealing attacks, even when the adversary lacks access to the training data and has a limited query budget. Method: A two-step model stealing approach called QueryWise is proposed, which uses unlabeled data from a proxy distribution to train the thief model without additional queries. It is evaluated on two medical imaging models, for gallbladder cancer and COVID-19 classification. Result: The study shows that, even with limited query budgets, adversaries can effectively carry out model stealing attacks using publicly available datasets. Experiments confirm the effectiveness of QueryWise. Conclusion: Medical imaging models still face a serious threat from model stealing attacks even when their intellectual property is protected. With QueryWise, adversaries can mount effective attacks despite limited query budgets. The paper demonstrates the feasibility of such attacks using public datasets and proposes a way to strengthen model stealing capabilities. Abstract: The success of deep learning in medical imaging applications has led several companies to deploy proprietary models in diagnostic workflows, offering monetized services. Even though model weights are hidden to protect the intellectual property of the service provider, these models are exposed to model stealing (MS) attacks, where adversaries can clone the model's functionality by querying it with a proxy dataset and training a thief model on the acquired predictions. While extensively studied on general vision tasks, the susceptibility of medical imaging models to MS attacks remains inadequately explored. This paper investigates the vulnerability of black-box medical imaging models to MS attacks under realistic conditions where the adversary lacks access to the victim model's training data and operates with limited query budgets. We demonstrate that adversaries can effectively execute MS attacks by using publicly available datasets. To further enhance MS capabilities with limited query budgets, we propose a two-step model stealing approach termed QueryWise. This method capitalizes on unlabeled data obtained from a proxy distribution to train the thief model without incurring additional queries. Evaluation on two medical imaging models for Gallbladder Cancer and COVID-19 classification substantiates the effectiveness of the proposed attack. The source code is available at https://github.com/rajankita/QueryWise.