cs.CL

[1] Semi-automated Fact-checking in Portuguese: Corpora Enrichment using Retrieval with Claim extraction

Juliana Resplande Sant'anna Gomes,Arlindo Rodrigues Galvão Filho

Main category: cs.CL

TL;DR: This dissertation introduces a methodology to enrich Portuguese news datasets with external evidence using LLMs and search APIs, addressing gaps in Semi-Automated Fact-Checking system development.

Details

Motivation: The scarcity of publicly available datasets integrating external evidence for Portuguese-language Automated Fact-Checking systems necessitates this research. Method: The methodology uses Large Language Models (LLMs) like Gemini 1.5 Flash to extract claims from texts and employs search engine APIs (e.g., Google Search API) to retrieve external evidence. A data validation and preprocessing framework is also introduced. Result: A methodology to enrich Portuguese news corpora (Fake.Br, COVID19.BR, MuMiN-PT) with external evidence was successfully developed and analyzed, including a data validation framework to enhance dataset quality. Conclusion: The dissertation successfully develops and analyzes a methodology to enrich Portuguese news corpora with external evidence, addressing the scarcity of such datasets and contributing to the development of more robust Semi-Automated Fact-Checking systems. Abstract: The accelerated dissemination of disinformation often outpaces the capacity for manual fact-checking, highlighting the urgent need for Semi-Automated Fact-Checking (SAFC) systems. Within the Portuguese language context, there is a noted scarcity of publicly available datasets that integrate external evidence, an essential component for developing robust AFC systems, as many existing resources focus solely on classification based on intrinsic text features. This dissertation addresses this gap by developing, applying, and analyzing a methodology to enrich Portuguese news corpora (Fake.Br, COVID19.BR, MuMiN-PT) with external evidence. The approach simulates a user's verification process, employing Large Language Models (LLMs, specifically Gemini 1.5 Flash) to extract the main claim from texts and search engine APIs (Google Search API, Google FactCheck Claims Search API) to retrieve relevant external documents (evidence). Additionally, a data validation and preprocessing framework, including near-duplicate detection, is introduced to enhance the quality of the base corpora.
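
The abstract mentions near-duplicate detection but does not say how it is done. One common way to approach it is word-shingle Jaccard similarity; the following is a minimal sketch under that assumption, not the dissertation's actual procedure. The function names, the shingle size `k`, and the `threshold` value are all hypothetical.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts become a single shingle (or nothing if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(docs, threshold=0.8, k=3):
    """Return index pairs (i, j), i < j, whose shingle overlap reaches the threshold."""
    sets = [shingles(d, k) for d in docs]
    pairs = []
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            if jaccard(sets[i], sets[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```

The pairwise loop is quadratic in the number of documents, so a corpus of any real size would likely need an approximate scheme such as MinHash with locality-sensitive hashing instead.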

[2] Retrieval augmented generation based dynamic prompting for few-shot biomedical named entity recognition using large language models

Yao Ge,Sudeshna Das,Yuting Guo,Abeed Sarker

Main category: cs.CL

TL;DR: The paper proposes a dynamic prompting strategy that incorporates RAG, improving large language model performance on biomedical NER when training data is limited.

Details

Motivation: Address the performance challenges that large language models (LLMs) face in biomedical named entity recognition (NER) when little training data is available. Method: Use retrieval-augmented generation (RAG) to select in-context learning examples by their similarity to the input text, updating the prompt dynamically for each instance during inference. Result: Static prompting with structured components raised average F1-scores by 12% for GPT-4 and by 11% for GPT-3.5 and LLaMA 3-70B relative to basic static prompting; dynamic prompting improved average F1-scores by a further 7.3% and 5.6% in 5-shot and 10-shot settings, respectively. Conclusion: Dynamic prompting combined with RAG performs well on biomedical NER, underlining the effectiveness of contextually adaptive prompts. Abstract: Biomedical named entity recognition (NER) is a high-utility natural language processing (NLP) task, and large language models (LLMs) show promise particularly in few-shot settings (i.e., limited training data). In this article, we address the performance challenges of LLMs for few-shot biomedical NER by investigating a dynamic prompting strategy involving retrieval-augmented generation (RAG). In our approach, the annotated in-context learning examples are selected based on their similarities with the input texts, and the prompt is dynamically updated for each instance during inference. We implemented and optimized static and dynamic prompt engineering techniques and evaluated them on five biomedical NER datasets. Static prompting with structured components increased average F1-scores by 12% for GPT-4, and 11% for GPT-3.5 and LLaMA 3-70B, relative to basic static prompting. Dynamic prompting further improved performance, with TF-IDF and SBERT retrieval methods yielding the best results, improving average F1-scores by 7.3% and 5.6% in 5-shot and 10-shot settings, respectively. These findings highlight the utility of contextually adaptive prompts via RAG for biomedical NER.
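
To make the retrieval step concrete, the sketch below selects few-shot examples by TF-IDF cosine similarity and rebuilds the prompt for each input. It is an illustration under assumptions, not the authors' implementation: the function names, the whitespace tokenization, the smoothed IDF formula, and the prompt wording are all hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Build sparse TF-IDF vectors (dict: term -> weight) for a list of texts."""
    tokenized = [t.lower().split() for t in texts]
    n = len(tokenized)
    # Document frequency: number of texts containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {term: math.log((1 + n) / (1 + c)) + 1.0 for term, c in df.items()}
    return [{term: tf * idf[term] for term, tf in Counter(toks).items()}
            for toks in tokenized]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_examples(query, pool, k=5):
    """pool: list of (text, annotation). Return the k items most similar to query."""
    vectors = tfidf_vectors([text for text, _ in pool] + [query])
    query_vec = vectors[-1]
    ranked = sorted(range(len(pool)),
                    key=lambda i: cosine(query_vec, vectors[i]),
                    reverse=True)
    return [pool[i] for i in ranked[:k]]


def build_prompt(query, pool, k=5):
    """Assemble a k-shot prompt whose examples depend on the query (hypothetical wording)."""
    parts = ["Extract the biomedical entities from the text."]
    for text, annotation in select_examples(query, pool, k):
        parts.append(f"Text: {text}\nEntities: {annotation}")
    parts.append(f"Text: {query}\nEntities:")
    return "\n\n".join(parts)
```

Calling `build_prompt` once per test instance is what makes the prompt dynamic: a static prompt would reuse the same `k` examples for every input. Swapping `tfidf_vectors` for sentence embeddings would give the SBERT variant mentioned in the abstract.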

[3] CarbonScaling: Extending Neural Scaling Laws for Carbon Footprint in Large Language Models

Lei Jiang,Fan Chen

Main category: cs.CL

TL;DR: This paper proposes CarbonScaling, an analytical framework that extends neural scaling laws to include operational and embodied carbon in LLM training. A power-law relationship between accuracy and carbon footprint holds, but real-world inefficiencies significantly increase the scaling factor.

Details

Motivation: Neural scaling laws overlook the carbon emissions that grow exponentially as LLMs scale up, so an analytical framework that accounts for carbon is needed. Method: CarbonScaling is built by integrating models for neural scaling, GPU hardware evolution, parallelism optimization, and carbon estimation. Result: A power-law relationship between accuracy and carbon footprint holds, but real-world inefficiencies significantly increase the scaling factor; hardware technology scaling reduces carbon emissions for small to mid-sized models but has limited effect for extremely large LLMs. Conclusion: CarbonScaling offers key insights for training more sustainable and carbon-efficient LLMs. Abstract: Neural scaling laws have driven the development of increasingly large language models (LLMs) by linking accuracy improvements to growth in parameter count, dataset size, and compute. However, these laws overlook the carbon emissions that scale exponentially with LLM size. This paper presents CarbonScaling, an analytical framework that extends neural scaling laws to incorporate both operational and embodied carbon in LLM training. By integrating models for neural scaling, GPU hardware evolution, parallelism optimization, and carbon estimation, CarbonScaling quantitatively connects model accuracy to carbon footprint. Results show that while a power-law relationship between accuracy and carbon holds, real-world inefficiencies significantly increase the scaling factor. Hardware technology scaling reduces carbon emissions for small to mid-sized models, but offers diminishing returns for extremely large LLMs due to communication overhead and underutilized GPUs. Training optimizations, especially aggressive critical batch size scaling, help alleviate this inefficiency. CarbonScaling offers key insights for training more sustainable and carbon-efficient LLMs.

[4] The Art of Breaking Words: Rethinking Multilingual Tokenizer Design

Aamod Thakur,Ajay Nagpal,Atharva Savarkar,Kundeshwar Pundalik,Siddhesh Dosi,Piyush Sawarkar,Viraj Thakur,Rohit Saluja,Maunendra Sankar Desarkar,Ganesh Ramakrishnan

Main category: cs.CL

TL;DR: This paper studies tokenization in multilingual large language models and proposes a data composition algorithm for tokenizer training that improves model performance and inference speed.

Details

Motivation: Existing multilingual tokenizers exhibit high token-to-word ratios, inefficient use of context length, and slower inference. Method: The authors systematically study how vocabulary size, pre-tokenization rules, and training-corpus composition affect tokenization efficiency and model quality, with extensive experiments on Indic scripts. Result: The pre-tokenization observations significantly improve model performance; the data composition algorithm reduces the average token-to-word ratio by approximately 6% relative to conventional data randomization, and the tokenizer improves the average token-to-word ratio by more than 40% against state-of-the-art multilingual Indic models. Conclusion: The paper proposes a new data composition algorithm for tokenizer training and shows that it improves model performance and inference speed in multilingual settings. Abstract: While model architecture and training objectives are well-studied, tokenization, particularly in multilingual contexts, remains a relatively neglected aspect of Large Language Model (LLM) development. Existing tokenizers often exhibit high token-to-word ratios, inefficient use of context length, and slower inference. We present a systematic study that links vocabulary size, pre-tokenization rules, and training-corpus composition to both token-to-word efficiency and model quality. To ground our analysis in a linguistically diverse context, we conduct extensive experiments on Indic scripts, which present unique challenges due to their high script diversity and orthographic complexity. Drawing on the insights from these analyses, we propose a novel algorithm for data composition that balances multilingual data for tokenizer training. Our observations on pre-tokenization strategies significantly improve model performance, and our data composition algorithm reduces the average token-to-word ratio by approximately 6% with respect to the conventional data randomization approach. Our tokenizer achieves more than 40% improvement on average token-to-word ratio against state-of-the-art multilingual Indic models. This improvement yields measurable gains in both model performance and inference speed. This highlights tokenization alongside architecture and training objectives as a critical lever for building efficient, scalable multilingual LLMs.
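
The token-to-word ratio used as the efficiency metric above can be computed as in the sketch below. It assumes that a "word" is a whitespace-delimited unit, which the abstract does not specify; `token_to_word_ratio` and the toy `char_bigram_tokenizer` are hypothetical names introduced only for illustration.

```python
def token_to_word_ratio(texts, tokenize):
    """Average number of tokens per whitespace-delimited word over a corpus."""
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    if total_words == 0:
        raise ValueError("corpus contains no words")
    return total_tokens / total_words


def char_bigram_tokenizer(text):
    """Toy stand-in tokenizer: splits each word into chunks of up to two characters."""
    return [w[i:i + 2] for w in text.split() for i in range(0, len(w), 2)]


corpus = ["abcd ef"]
baseline = token_to_word_ratio(corpus, char_bigram_tokenizer)  # 3 tokens / 2 words
improved = token_to_word_ratio(corpus, str.split)              # 2 tokens / 2 words
relative_reduction = (baseline - improved) / baseline
```

A lower ratio means fewer tokens for the same text, which is why it translates into longer effective context and faster inference. A figure such as "reduces the ratio by approximately 6%" presumably corresponds to `relative_reduction` here, though the paper's exact definition may differ.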

[5] Factor Augmented Supervised Learning with Text Embeddings

Zhanye Luo,Yuefeng Han,Xiufan Yu

Main category: cs.CL

TL;DR: AEALT is a supervised dimension-reduction framework that combines an autoencoder with text embeddings, reducing the high dimensionality of the embeddings and improving efficiency and performance on downstream tasks.

Details

Motivation: Text embeddings produced by large language models are high-dimensional, which hurts efficiency and raises computational cost in downstream tasks. AEALT is proposed to address this. Method: AEALT extracts embeddings from text documents and passes them through a supervised augmented autoencoder to learn low-dimensional, task-relevant latent factors, thereby modeling the nonlinear structure of complex embeddings. Result: Extensive experiments on classification, anomaly detection, and prediction tasks over multiple real-world public datasets show that AEALT outperforms both vanilla embeddings and several standard dimension reduction methods. Conclusion: AEALT is a supervised, factor-augmented framework that effectively reduces the dimensionality of text embeddings and improves the efficiency and performance of downstream tasks. Abstract: Large language models (LLMs) generate text embeddings from text data, producing vector representations that capture the semantic meaning and contextual relationships of words. However, the high dimensionality of these embeddings often impedes efficiency and drives up computational cost in downstream tasks. To address this, we propose AutoEncoder-Augmented Learning with Text (AEALT), a supervised, factor-augmented framework that incorporates dimension reduction directly into pre-trained LLM workflows. First, we extract embeddings from text documents; next, we pass them through a supervised augmented autoencoder to learn low-dimensional, task-relevant latent factors. By modeling the nonlinear structure of complex embeddings, AEALT outperforms conventional deep-learning approaches that rely on raw embeddings. We validate its broad applicability with extensive experiments on classification, anomaly detection, and prediction tasks using multiple real-world public datasets. Numerical results demonstrate that AEALT yields substantial gains over both vanilla embeddings and several standard dimension reduction methods.

[6] Discerning minds or generic tutors? Evaluating instructional guidance capabilities in Socratic LLMs

Ying Liu,Can Li,Ting Zhang,Mei Wang,Qiannan Zhu,Jian Li,Hua Huang

Main category: cs.CL

TL;DR: This study proposes GuideEval, a benchmark for evaluating the instructional guidance capabilities of large language models in educational dialogues, and argues for learner-centered, interactive evaluation.

Details

Motivation: Current research on LLM tutors concentrates on Socratic questioning and overlooks a key dimension: adapting guidance to the learner's cognitive state. Method: The study proposes the GuideEval benchmark, which evaluates guidance in educational dialogues through a three-phase behavioral framework (Perception, Orchestration, Elicitation), and introduces a behavior-guided finetuning strategy. Result: Existing LLMs frequently fail to provide effective adaptive scaffolding when learners are confused or need redirection. The behavior-guided finetuning strategy significantly improves guidance performance. Conclusion: GuideEval is a new benchmark for evaluating how well LLMs guide learners in educational dialogues, and it supports a shift from isolated content evaluation to learner-centered interaction. Abstract: The conversational capabilities of large language models hold significant promise for enabling scalable and interactive tutoring. While prior research has primarily examined their capacity for Socratic questioning, it often overlooks a critical dimension: adaptively guiding learners based on their cognitive states. This study shifts focus from mere question generation to the broader instructional guidance capability. We ask: Can LLMs emulate expert tutors who dynamically adjust strategies in response to learners' understanding? To investigate this, we propose GuideEval, a benchmark grounded in authentic educational dialogues that evaluates pedagogical guidance through a three-phase behavioral framework: (1) Perception, inferring learner states; (2) Orchestration, adapting instructional strategies; and (3) Elicitation, stimulating proper reflections. Empirical findings reveal that existing LLMs frequently fail to provide effective adaptive scaffolding when learners exhibit confusion or require redirection. Furthermore, we introduce a behavior-guided finetuning strategy that leverages behavior-prompted instructional dialogues, significantly enhancing guidance performance. By shifting the focus from isolated content evaluation to learner-centered interaction, our work advocates a more dialogic paradigm for evaluating Socratic LLMs.

[7] LLM Unlearning Without an Expert Curated Dataset

Xiaoyuan Zhu,Muru Zhang,Ollie Liu,Robin Jia,Willie Neiswanger

Main category: cs.CL

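TL;DR: This paper proposes a scalable, automated method that uses language models themselves to generate high-quality forget sets. A structured prompting pipeline synthesizes textbook-style data from nothing more than a domain name, enabling effective model unlearning.

Details

Motivation: Modern large language models often encode sensitive, harmful, or copyrighted knowledge, which creates a need for post-hoc unlearning: removing specific domains of knowledge from a model without full retraining. A major bottleneck in current unlearning pipelines is constructing effective forget sets, that is, datasets that approximate the target domain and guide the model to forget it. Method: Textbook-style data is synthesized through a structured prompting pipeline that requires only a domain name as input. Result: In experiments on unlearning biosecurity, cybersecurity, and Harry Potter novels, the synthetic datasets consistently outperform baseline synthetic alternatives and are comparable to expert-curated ones. Ablation studies show that the multi-step generation pipeline significantly boosts data diversity, which in turn improves unlearning utility. Conclusion: Synthetic datasets offer a promising path toward practical, scalable unlearning for a wide range of emerging domains without manual intervention. Abstract: Modern large language models often encode sensitive, harmful, or copyrighted knowledge, raising the need for post-hoc unlearning: the ability to remove specific domains of knowledge from a model without full retraining. A major bottleneck in current unlearning pipelines is constructing effective forget sets, that is, datasets that approximate the target domain and guide the model to forget it. In this work, we introduce a scalable, automated approach to generate high-quality forget sets using language models themselves. Our method synthesizes textbook-style data through a structured prompting pipeline, requiring only a domain name as input. Through experiments on unlearning biosecurity, cybersecurity, and Harry Potter novels, we show that our synthetic datasets consistently outperform the baseline synthetic alternatives and are comparable to the expert-curated ones. Additionally, ablation studies reveal that the multi-step generation pipeline significantly boosts data diversity, which in turn improves unlearning utility. Overall, our findings suggest that synthetic datasets offer a promising path toward practical, scalable unlearning for a wide range of emerging domains without the need for manual intervention. We release our code and dataset at https://github.com/xyzhu123/Synthetic_Textbook.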

[8] BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent

Zijian Chen,Xueguang Ma,Shengyao Zhuang,Ping Nie,Kai Zou,Andrew Liu,Joshua Green,Kshama Patel,Ruoxi Meng,Mingyi Su,Sahel Sharifymoghaddam,Yanxi Li,Haoran Hong,Xinyu Shi,Xuye Liu,Nandan Thakur,Crystina Zhang,Luyu Gao,Wenhu Chen,Jimmy Lin

Main category: cs.CL

TL;DR: This paper presents BrowseComp-Plus, an improved benchmark for evaluating deep-research agents and retrieval methods more fairly and transparently.

Details

Motivation: Current benchmarks such as BrowseComp depend on dynamic, opaque web APIs, which undermines fairness and transparency and makes controlled experiments on deep-research agents difficult. Method: BrowseComp-Plus is derived from BrowseComp and uses a fixed corpus with human-verified supporting documents, plus mined challenging negatives, to support controlled experiments. Result: On BrowseComp-Plus, the open-source model Search-R1 paired with a BM25 retriever reaches 3.86% accuracy, whereas GPT-5 reaches 55.9%; pairing GPT-5 with the Qwen3-Embedding-8B retriever raises its accuracy to 70.1% with fewer search calls. Conclusion: With a fixed, carefully curated corpus, BrowseComp-Plus addresses the fairness and transparency shortcomings of current benchmarks and supports comprehensive evaluation of deep-research agents and retrieval methods. Abstract: Deep-Research agents, which integrate large language models (LLMs) with search tools, have shown success in improving the effectiveness of handling complex queries that require iterative search planning and reasoning over search results. Evaluations on current benchmarks like BrowseComp, which rely on black-box live web search APIs, have notable limitations in (1) fairness: dynamic and opaque web APIs hinder fair comparisons and reproducibility of deep research methods; (2) transparency: lack of control over the document corpus makes it difficult to isolate retriever contributions. In other words, the current evaluations may compare a complete deep research system at a given time, but they do not foster well-controlled experiments to provide insights into the capability of underlying deep research LLMs. To address these challenges, we introduce BrowseComp-Plus, a benchmark derived from BrowseComp, employing a fixed, carefully curated corpus. Each query in BrowseComp-Plus includes human-verified supporting documents and mined challenging negatives, enabling controlled experimentation. The benchmark is shown to be effective in distinguishing the performance of deep research systems. For instance, the open-source model Search-R1, when paired with the BM25 retriever, achieves 3.86% accuracy, whereas GPT-5 achieves 55.9%. Integrating GPT-5 with the Qwen3-Embedding-8B retriever further enhances its accuracy to 70.1% with fewer search calls. This benchmark allows comprehensive evaluation and disentangled analysis of deep research agents and retrieval methods, fostering insights into retrieval effectiveness, citation accuracy, and context engineering in Deep-Research systems.

[9] Train It and Forget It: Merge Lists are Unnecessary for BPE Inference in Language Models

Tomohiro Sawada,Kartik Goyal

Main category: cs.CL

TL;DR: This paper explores BPE tokenization methods without relying on merge lists, showing that certain algorithms maintain performance while potentially enhancing privacy.

Details

Motivation: The BPE merge list is a potential attack surface for extracting information about training data, which prompts the exploration of inference algorithms that do not rely on it. Method: The paper investigates two classes of BPE inference schemes through extensive experiments across various language modeling tasks, including accuracy-based QA benchmarks, machine translation, and open-ended generation. Result: Targeted deviations from merge lists significantly degrade model performance, whereas non-targeted merge-list-free inference algorithms show minimal impact on performance, often much less than expected. Conclusion: The study concludes that non-targeted merge-list-free BPE inference algorithms have minimal impact on downstream language model performance, opening possibilities for simpler and more privacy-preserving tokenization methods. Abstract: Standard Byte-Pair Encoding (BPE) tokenization compresses text by pairing a learned token vocabulary with a detailed merge list. Recent work has shown that this merge list exposes a potential attack surface for extracting information about a language model's training data. In this paper, we explore the downstream impact of BPE inference algorithms that do not rely on this merge list at all, and hence differ from the encoding process during BPE training. To address this question, we investigate two broad classes of BPE inference schemes that differ from BPE application during training: a) targeted deviation from merge lists, including random merge orders and various corruptions of the merge list involving deletion/truncation, and b) non-targeted BPE inference algorithms that do not depend on the merge list but focus on compressing the text either greedily or exactly. Extensive experiments across diverse language modeling tasks like accuracy-based QA benchmarks, machine translation, and open-ended generation reveal that while targeted deviation from the merge lists exhibits significant degradation in language model performance, the non-targeted merge-list-free inference algorithms result in minimal impact on downstream performance that is often much smaller than expected. These findings pave the way for simpler and potentially more privacy-preserving tokenization schemes that do not catastrophically compromise model performance.
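
The two non-targeted schemes, compressing "greedily or exactly", can be illustrated with vocabulary-only tokenizers such as the ones sketched below. This is one plausible reading rather than the paper's algorithms: it assumes "greedy" means left-to-right longest match and "exact" means a segmentation with the fewest tokens. It also works on characters rather than bytes and skips pre-tokenization. All names are hypothetical.

```python
def greedy_encode(text, vocab):
    """Left-to-right longest-match tokenization using only the vocabulary."""
    max_len = max(len(tok) for tok in vocab)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first and shrink until one is in the vocabulary.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[i]!r}")
    return tokens


def exact_encode(text, vocab):
    """Dynamic programme returning a segmentation with the fewest tokens."""
    max_len = max(len(tok) for tok in vocab)
    n = len(text)
    # best[i] = (token_count, start_of_last_token) for the prefix text[:i].
    best = [None] * (n + 1)
    best[0] = (0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            if best[start] is not None and text[start:end] in vocab:
                candidate = (best[start][0] + 1, start)
                if best[end] is None or candidate[0] < best[end][0]:
                    best[end] = candidate
    if best[n] is None:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the text to recover the tokens.
    tokens, end = [], n
    while end > 0:
        start = best[end][1]
        tokens.append(text[start:end])
        end = start
    return tokens[::-1]


vocab = {"a", "b", "c", "d", "ab", "bcd"}
greedy_tokens = greedy_encode("abcd", vocab)  # ['ab', 'c', 'd']
exact_tokens = exact_encode("abcd", vocab)    # ['a', 'bcd']
```

Neither function consults a merge list, so either could in principle run with only the vocabulary published. The toy vocabulary also shows that the two can disagree: greedy matching commits to `ab` and ends with three tokens, while the exact search finds a two-token segmentation.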

[10] Measuring Stereotype and Deviation Biases in Large Language Models

Daniel Wang,Eli Brignac,Minjia Mao,Xiao Fang

Main category: cs.CL

TL;DR: The study finds that large language models (LLMs) exhibit significant stereotype bias and deviation bias in generated content, with potential for harm.

Details

Motivation: The wide application of large language models (LLMs) across domains has raised concerns about their limitations and potential risks, so the study examines the types of bias LLMs may display. Method: Four advanced LLMs are asked to generate profiles of individuals, and the study analyzes the associations between each demographic group and attributes such as political affiliation, religion, and sexual orientation. Result: All examined LLMs exhibit significant stereotype bias and deviation bias towards multiple groups. Conclusion: LLM-generated content carries significant stereotype bias and deviation bias, pointing to the potential harms that arise when LLMs infer user attributes. Abstract: Large language models (LLMs) are widely applied across diverse domains, raising concerns about their limitations and potential risks. In this study, we investigate two types of bias that LLMs may display: stereotype bias and deviation bias. Stereotype bias refers to when LLMs consistently associate specific traits with a particular demographic group. Deviation bias reflects the disparity between the demographic distributions extracted from LLM-generated content and real-world demographic distributions. By asking four advanced LLMs to generate profiles of individuals, we examine the associations between each demographic group and attributes such as political affiliation, religion, and sexual orientation. Our experimental results show that all examined LLMs exhibit both significant stereotype bias and deviation bias towards multiple groups. Our findings uncover the biases that occur when LLMs infer user attributes and shed light on the potential harms of LLM-generated outputs.

[11] Testing the Limits of Machine Translation from One Book

Jonathan Shaw,Dillon Mee,Timothy Khouw,Zackary Leech,Daniel Wilson

Main category: cs.CL

TL;DR: This study evaluates how different language resources improve LLM translation for Kanuri. Parallel sentences are the most effective resource, and grammar information alone does not yield effective translation.

Details

Motivation: Kanuri has a substantial speaker population but minimal digital resources. The study explores how different language resources can improve LLM translation for this language. Method: Different combinations of language resources (grammar, dictionary, and parallel sentences) are provided to the LLM, and translation effectiveness is compared with native speaker translations and human linguist performance. Result: Parallel sentences remain the most effective data source, outperforming other methods in both human evaluations and automatic metrics. Incorporating grammar improves over zero-shot translation but is not effective as a standalone data source. Conclusion: LLM translation evaluation benefits from multidimensional assessment beyond simple accuracy metrics, and grammar alone does not provide sufficient context for domain-specific translation. Abstract: Current state-of-the-art models demonstrate capacity to leverage in-context learning to translate into previously unseen language contexts. Tanzer et al. [2024] utilize language materials (e.g. a grammar) to improve translation quality for Kalamang using large language models (LLMs). We focus on Kanuri, a language that, despite having a substantial speaker population, has minimal digital resources. We design two datasets for evaluation: one focused on health and humanitarian terms, and another containing generalized terminology, investigating how domain-specific tasks impact LLM translation quality. By providing different combinations of language resources (grammar, dictionary, and parallel sentences), we measure LLM translation effectiveness, comparing results to native speaker translations and human linguist performance. We evaluate using both automatic metrics and native speaker assessments of fluency and accuracy. Results demonstrate that parallel sentences remain the most effective data source, outperforming other methods in human evaluations and automatic metrics. While incorporating grammar improves over zero-shot translation, it fails as an effective standalone data source. Human evaluations reveal that LLMs achieve accuracy (meaning) more effectively than fluency (grammaticality). These findings suggest LLM translation evaluation benefits from multidimensional assessment beyond simple accuracy metrics, and that grammar alone, without parallel sentences, does not provide sufficient context for effective domain-specific translation.

[12] Do Biased Models Have Biased Thoughts?

Swati Rajwal,Shivank Garg,Reem Abdel-Salam,Abdelrahman Zayed

Main category: cs.CL

TL;DR: This paper explores whether biased language models have biased thoughts, finding that biased decisions don't always result from biased thinking processes.

Details

Motivation: Language models exhibit biases that challenge their deployment; the link between model bias and thought bias is unclear. Method: Experiments on 5 large language models using fairness metrics to quantify 11 biases in the models' thoughts and outputs. Result: Bias in the thinking steps is not highly correlated with output bias (correlation below 0.6, with p < 0.001 in most cases). Conclusion: Unlike humans, biased decisions in models don't necessarily stem from biased thinking steps. Abstract: The impressive performance of language models is undeniable. However, the presence of biases based on gender, race, socio-economic status, physical appearance, and sexual orientation makes the deployment of language models challenging. This paper studies the effect of chain-of-thought prompting, a recent approach that studies the steps followed by the model before it responds, on fairness. More specifically, we ask the following question: Do biased models have biased thoughts? To answer our question, we conduct experiments on 5 popular large language models using fairness metrics to quantify 11 different biases in the model's thoughts and output. Our results show that the bias in the thinking steps is not highly correlated with the output bias (less than 0.6 correlation with a p-value smaller than 0.001 in most cases). In other words, unlike human beings, the tested models with biased decisions do not always possess biased thoughts.

[13] Play Favorites: A Statistical Method to Measure Self-Bias in LLM-as-a-Judge

Evangelia Spiliopoulou,Riccardo Fogliato,Hanna Burnsky,Tamer Soliman,Jie Ma,Graham Horwood,Miguel Ballesteros

Main category: cs.CL

TL;DR: The study proposes a statistical method to identify and quantify self-bias in LLMs that judge other models' outputs, and finds that some LLMs do rate their own outputs, and those of models in the same family, more favorably.

Details

Motivation: An LLM acting as a judge may show self-bias, assigning overly favorable scores to its own outputs, which distorts evaluations of model performance. Previous studies often fail to separate genuine differences in model quality from bias, so a method is needed to identify and quantify this bias accurately. Method: Model the difference between the score distribution an LLM judge assigns to its own completions and the one it assigns to other models' completions, while accounting for underlying quality as rated by an independent third-party judge (e.g., humans), in order to isolate self-bias. Result: An empirical analysis on a large dataset of more than 5000 prompt-completion pairs finds that some models, such as GPT-4o and Claude 3.5 Sonnet, show systematic self-bias and also favor outputs from other models of the same family. Conclusion: The study provides a statistical framework that identifies and quantifies self-bias in LLM judges while separating it from genuine performance differences, and offers practical guidance for reducing bias in automated evaluations. Abstract: Large language models (LLMs) can serve as judges that offer rapid and reliable assessments of other LLM outputs. However, models may systematically assign overly favorable ratings to their own outputs, a phenomenon known as self-bias, which can distort evaluations of true model performance. Previous studies often conflate genuine differences in model quality with bias or incorrectly assume that evaluations from LLMs and humans follow the same rating distributions. In this work, we present a statistical framework that explicitly formalizes assumptions under which self-bias can be identified and estimated. Our method models the difference in the scoring distribution that LLM-as-a-judge assigns to its own completions compared to other models, while accounting for the underlying quality of the completions provided by an independent, third-party judge (e.g., humans). Our method reliably isolates and quantifies self-bias, even when models vary in ability, ensuring that genuine performance differences are not mistaken for self-bias. We conduct an empirical analysis of self-bias on a large dataset (>5000 prompt-completion pairs) consisting of expert human annotations and judgments from nine different LLM judges. We find that some models, such as GPT-4o and Claude 3.5 Sonnet, systematically assign higher scores to their own outputs. These models also display family-bias, systematically assigning higher ratings to outputs produced by other models of the same family. Our findings highlight potential pitfalls of using LLM judges and offer practical guidance to mitigate biases when interpreting automated evaluations.
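
As a rough illustration of separating self-bias from genuine quality differences, the sketch below compares a judge's scores on its own completions with its scores on other models' completions, but only within groups that share the same human score. This is a deliberately simplified stand-in, not the paper's statistical framework. The record layout, the function name, and the stratify-by-human-score choice are all assumptions.

```python
from collections import defaultdict
from statistics import mean


def self_bias_estimate(records, judge):
    """Estimate how much higher a judge scores its own completions.

    records: iterable of (author_model, human_score, judge_score) for one judge.
    Within each human-score stratum, take mean(judge score | own) minus
    mean(judge score | other), then average the gaps weighted by the number
    of the judge's own completions in the stratum.
    """
    strata = defaultdict(lambda: {"own": [], "other": []})
    for author, human_score, judge_score in records:
        key = "own" if author == judge else "other"
        strata[human_score][key].append(judge_score)

    weighted_gap, weight = 0.0, 0
    for groups in strata.values():
        # Only strata containing both kinds of completion are comparable.
        if groups["own"] and groups["other"]:
            n_own = len(groups["own"])
            weighted_gap += n_own * (mean(groups["own"]) - mean(groups["other"]))
            weight += n_own
    if weight == 0:
        raise ValueError("no human-score stratum contains both own and other completions")
    return weighted_gap / weight
```

Conditioning on the human score keeps a strong model from looking biased merely because its outputs really are better, and it avoids assuming that human and LLM ratings share a scale. A positive value suggests the judge favors its own outputs at equal human-rated quality. Running the same comparison with "same family" in place of "own" would give a crude family-bias estimate.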

[14] Large Language Models for Oral History Understanding with Text Classification and Sentiment Analysis

Komala Subramanyam Cherukuri,Pranav Abishai Moses,Aisa Sakata,Jiangping Chen,Haihua Chen

Main category: cs.CL

TL;DR: This paper presents a scalable LLM-based framework for automated semantic and sentiment annotation of Japanese American incarceration oral histories. It shows that LLMs guided by carefully designed prompts can process large, sensitive historical archives, and it provides practical guidance and a reusable annotation pipeline for culturally sensitive archival analysis.

Details

Motivation: Oral histories are vital records for communities affected by systemic injustice and historical erasure, but their unstructured format, emotional complexity, and high annotation costs limit large-scale analysis, so an automated, scalable annotation framework is needed. Method: The study combines expert annotation, prompt design, and LLM evaluation with ChatGPT, Llama, and Qwen, tests zero-shot, few-shot, and RAG strategies on semantic classification and sentiment analysis, and applies the best prompt configurations to annotate the full collection. Result: For semantic classification, ChatGPT achieved the highest F1 score (88.71%), followed by Llama (84.99%) and Qwen (83.72%). For sentiment analysis, Llama slightly outperformed Qwen (82.66%) and ChatGPT (82.29%), with all models showing comparable results. The framework then annotated 92,191 sentences from 1,002 interviews. Conclusion: The paper presents a scalable framework that uses large language models (LLMs) for semantic and sentiment annotation of oral histories, shows that LLMs guided by well-designed prompts can handle large oral history archives effectively, and lays groundwork for responsible AI use in digital humanities and the preservation of collective memory. Abstract: Oral histories are vital records of lived experience, particularly within communities affected by systemic injustice and historical erasure. Effective and efficient analysis of their oral history archives can promote access to and understanding of the oral histories. However, large-scale analysis of these archives remains limited due to their unstructured format, emotional complexity, and high annotation costs. This paper presents a scalable framework to automate semantic and sentiment annotation for Japanese American Incarceration Oral History. Using LLMs, we construct a high-quality dataset, evaluate multiple models, and test prompt engineering strategies in historically sensitive contexts. Our multiphase approach combines expert annotation, prompt design, and LLM evaluation with ChatGPT, Llama, and Qwen. We labeled 558 sentences from 15 narrators for sentiment and semantic classification, then evaluated zero-shot, few-shot, and RAG strategies. For semantic classification, ChatGPT achieved the highest F1 score (88.71%), followed by Llama (84.99%) and Qwen (83.72%). For sentiment analysis, Llama slightly outperformed Qwen (82.66%) and ChatGPT (82.29%), with all models showing comparable results. The best prompt configurations were used to annotate 92,191 sentences from 1,002 interviews in the JAIOH collection. Our findings show that LLMs can effectively perform semantic and sentiment annotation across large oral history collections when guided by well-designed prompts. This study provides a reusable annotation pipeline and practical guidance for applying LLMs in culturally sensitive archival analysis. By bridging archival ethics with scalable NLP techniques, this work lays the groundwork for responsible use of artificial intelligence in digital humanities and preservation of collective memory. GitHub: https://github.com/kc6699c/LLM4OralHistoryAnalysis.

[15] Many-Turn Jailbreaking

Xianjun Yang,Liqiang Xiao,Shiyang Li,Faisal Ladhak,Hyokun Yun,Linda Ruth Petzold,Yi Xu,William Yang Wang

Main category: cs.CL

TL;DR: This paper is the first to propose and study multi-turn jailbreaking. It builds a Multi-Turn Jailbreak Benchmark (MTJ-Bench) to assess the threat and calls for community efforts to build safer large language models (LLMs).

Details

Motivation: Current jailbreaking work on large language models (LLMs) focuses on single-turn attacks and overlooks the threat posed in multi-turn conversations. The threat is more serious now that LLMs handle long contexts and conduct multi-turn dialogue. Method: The paper constructs a Multi-Turn Jailbreak Benchmark (MTJ-Bench) and uses it to test a series of open- and closed-source models, assessing the safety threat of multi-turn jailbreaking. Result: The paper reveals a new vulnerability to multi-turn jailbreaking and provides insights into this safety threat. Conclusion: The paper introduces multi-turn jailbreaking and MTJ-Bench as a way to evaluate it, with the aim of prompting community efforts to build safer large language models (LLMs). Abstract: Current jailbreaking work on large language models (LLMs) aims to elicit unsafe outputs from given prompts. However, it only focuses on single-turn jailbreaking targeting one specific query. On the contrary, the advanced LLMs are designed to handle extremely long contexts and can thus conduct multi-turn conversations. So, we propose exploring multi-turn jailbreaking, in which the jailbroken LLMs are continuously tested on more than the first-turn conversation or a single target query. This is an even more serious threat because 1) it is common for users to continue asking relevant follow-up questions to clarify certain jailbroken details, and 2) it is also possible that the initial round of jailbreaking causes the LLMs to respond to additional irrelevant questions consistently. As the first step (first draft completed in June 2024) in exploring multi-turn jailbreaking, we construct a Multi-Turn Jailbreak Benchmark (MTJ-Bench) for benchmarking this setting on a series of open- and closed-source models and provide novel insights into this new safety threat. By revealing this new vulnerability, we aim to call for community efforts to build safer LLMs and pave the way for a more in-depth understanding of jailbreaking LLMs.

[16] SEVADE: Self-Evolving Multi-Agent Analysis with Decoupled Evaluation for Hallucination-Resistant Irony Detection

Ziqi Liu,Yangbin Chen,Ziyang Zhou,Yilin Li,Mingxuan Hu,Yushan Pan,Zhijie Xu

Main category: cs.CL

TL;DR: This paper proposes SEVADE, a framework that uses dynamic multi-agent analysis and decoupled evaluation to improve the accuracy and hallucination resistance of sarcasm detection.

Details

Motivation: Existing large language model methods for sarcasm detection suffer from single-perspective analysis, static reasoning pathways, and susceptibility to hallucination. Method: The SEVADE framework combines a Dynamic Agentive Reasoning Engine with a decoupled evaluation mechanism. Result: Across four benchmark datasets, accuracy improves by 6.75% and Macro-F1 by 6.29% on average. Conclusion: With a dynamic agentive reasoning engine and a separate lightweight rationale adjudicator, SEVADE achieves hallucination-resistant sarcasm detection and state-of-the-art performance. Abstract: Sarcasm detection is a crucial yet challenging Natural Language Processing task. Existing Large Language Model methods are often limited by single-perspective analysis, static reasoning pathways, and a susceptibility to hallucination when processing complex ironic rhetoric, which impacts their accuracy and reliability. To address these challenges, we propose SEVADE, a novel Self-Evolving multi-agent Analysis framework with Decoupled Evaluation for hallucination-resistant sarcasm detection. The core of our framework is a Dynamic Agentive Reasoning Engine (DARE), which utilizes a team of specialized agents grounded in linguistic theory to perform a multifaceted deconstruction of the text and generate a structured reasoning chain. Subsequently, a separate lightweight rationale adjudicator (RA) performs the final classification based solely on this reasoning chain. This decoupled architecture is designed to mitigate the risk of hallucination by separating complex reasoning from the final judgment. Extensive experiments on four benchmark datasets demonstrate that our framework achieves state-of-the-art performance, with average improvements of 6.75% in Accuracy and 6.29% in Macro-F1 score.

[17] Annotating Errors in English Learners' Written Language Production: Advancing Automated Written Feedback Systems

Steven Coyne,Diana Galvan-Sosa,Ryan Spring,Camélia Guerraoui,Michael Zock,Keisuke Sakaguchi,Kentaro Inui

Main category: cs.CL

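TL;DR: This paper proposes an annotation framework that models the error type and generalizability of learner errors, and presents feedback generation methods to support language learning.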

Details

Motivation: Existing automated writing evaluation systems are effective at improving text but are not well suited to language learning, because they favor direct revisions without considering the reason for the error. Learners may benefit more from simple explanations and strategically indirect hints. Method: Introduces an annotation framework for classifying error types and collects a dataset of annotated learner errors with human-written feedback. Keyword-guided, keyword-free, and template-guided feedback generation methods are evaluated using large language models. Result: Human teachers assessed the outputs of the different systems on relevance, factuality, and comprehensibility. The paper reports on the development of the dataset and the comparative performance of the systems studied. Conclusion: Modeling error type and generalizability can better support feedback generation for language learning. Abstract: Recent advances in natural language processing (NLP) have contributed to the development of automated writing evaluation (AWE) systems that can correct grammatical errors. However, while these systems are effective at improving text, they are not optimally designed for language learning. They favor direct revisions, often with a click-to-fix functionality that can be applied without considering the reason for the correction. Meanwhile, depending on the error type, learners may benefit most from simple explanations and strategically indirect hints, especially on generalizable grammatical rules. To support the generation of such feedback, we introduce an annotation framework that models each error's error type and generalizability. For error type classification, we introduce a typology focused on inferring learners' knowledge gaps by connecting their errors to specific grammatical patterns. Following this framework, we collect a dataset of annotated learner errors and corresponding human-written feedback comments, each labeled as a direct correction or hint. With this data, we evaluate keyword-guided, keyword-free, and template-guided methods of generating feedback using large language models (LLMs). Human teachers examined each system's outputs, assessing them on grounds including relevance, factuality, and comprehensibility. We report on the development of the dataset and the comparative performance of the systems investigated.

[18] Text to Speech System for Meitei Mayek Script

Gangular Singh Irengbam,Nirvash Singh Wahengbam,Lanthoiba Meitei Khumanthem,Paikhomba Oinam

Main category: cs.CL

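TL;DR: Develops a text-to-speech system for the Manipuri language in the Meitei Mayek script, based on Tacotron 2 and HiFi-GAN, and demonstrates intelligible and natural speech synthesis.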

Details

Motivation: Linguistic preservation and technological inclusion for Manipuri call for a text-to-speech system suited to under-resourced linguistic environments. Method: Builds a neural text-to-speech architecture with Tacotron 2 and HiFi-GAN, introduces a phoneme mapping from Meitei Mayek to ARPAbet, and curates a single-speaker dataset. Result: A text-to-speech system that delivers intelligible and natural speech synthesis, validated through subjective and objective metrics. Conclusion: This work lays the groundwork for the technological inclusion and linguistic preservation of Manipuri. Abstract: This paper presents the development of a Text-to-Speech (TTS) system for the Manipuri language using the Meitei Mayek script. Leveraging Tacotron 2 and HiFi-GAN, we introduce a neural TTS architecture adapted to support tonal phonology and under-resourced linguistic environments. We develop a phoneme mapping for Meitei Mayek to ARPAbet, curate a single-speaker dataset, and demonstrate intelligible and natural speech synthesis, validated through subjective and objective metrics. This system lays the groundwork for linguistic preservation and technological inclusion of Manipuri.

[19] ESNERA: Empirical and semantic named entity alignment for named entity dataset merging

Xiaobo Zhang,Congqing He,Ying He,Jian Peng,Dajie Fu,Tien-Ping Tan

Main category: cs.CL

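TL;DR: This paper proposes an automatic label alignment method that unifies the label spaces of different datasets by combining empirical and semantic similarity; experiments show that it performs well at merging NER datasets and at improving NER performance in the low-resource financial domain.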

Details

Motivation: Existing dataset merging approaches mainly rely on manual label mapping or on constructing label graphs, which lack interpretability and scalability, so an automated label alignment method is needed. Method: Combines empirical and semantic similarities, using a greedy pairwise merging strategy to unify label spaces across different datasets. Result: Experiments show that the proposed method merges datasets effectively and improves NER performance in the low-resource financial domain. Conclusion: The study proposes an automatic label alignment method based on label similarity for integrating multi-source NER corpora, and the method proves effective at improving NER performance in the low-resource financial domain. Abstract: Named Entity Recognition (NER) is a fundamental task in natural language processing. It remains a research hotspot due to its wide applicability across domains. Although recent advances in deep learning have significantly improved NER performance, they rely heavily on large, high-quality annotated datasets. However, building these datasets is expensive and time-consuming, posing a major bottleneck for further research. Current dataset merging approaches mainly focus on strategies like manual label mapping or constructing label graphs, which lack interpretability and scalability. To address this, we propose an automatic label alignment method based on label similarity. The method combines empirical and semantic similarities, using a greedy pairwise merging strategy to unify label spaces across different datasets. Experiments are conducted in two stages: first, merging three existing NER datasets into a unified corpus with minimal impact on NER performance; second, integrating this corpus with a small-scale, self-built dataset in the financial domain. The results show that our method enables effective dataset merging and enhances NER performance in the low-resource financial domain. This study presents an efficient, interpretable, and scalable solution for integrating multi-source NER corpora.

[20] The ReQAP System for Question Answering over Personal Information

Philipp Christmann,Gerhard Weikum

Main category: cs.CL

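TL;DR: The ReQAP system supports complex queries over heterogeneous data sources through recursive question decomposition and lightweight language models, and makes its answers traceable.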

Details

Motivation: Personal information on users' devices is abundant and varied, ranging from structured data to unstructured content, and integrating it effectively to answer complex questions is a challenge. ReQAP aims to address this problem. Method: ReQAP recursively decomposes questions, incrementally builds an operator tree, and uses lightweight language models for question interpretation and operator execution. Result: ReQAP supports users with answers to complex questions involving filters, joins, and aggregation, and offers detailed answer tracing, which strengthens user trust in the system. Conclusion: By recursively decomposing questions and incrementally building an operator tree, ReQAP answers complex questions over heterogeneous data sources, and its use of lightweight language models together with answer tracing improves comprehensibility and user trust. Abstract: Personal information is abundant on users' devices, from structured data in calendar, shopping records or fitness tools, to unstructured contents in mail and social media posts. This work presents the ReQAP system that supports users with answers for complex questions that involve filters, joins and aggregation over heterogeneous sources. The unique trait of ReQAP is that it recursively decomposes questions and incrementally builds an operator tree for execution. Both the question interpretation and the individual operators make smart use of light-weight language models, with judicious fine-tuning. The demo showcases the rich functionality for advanced user questions, and also offers detailed tracking of how the answers are computed by the operators in the execution tree. Being able to trace answers back to the underlying sources is vital for human comprehensibility and user trust in the system.

[21] Score Before You Speak: Improving Persona Consistency in Dialogue Generation using Response Quality Scores

Arpita Saggar,Jonathan C. Darling,Vania Dimitrova,Duygu Sarikaya,David C. Hogg

Main category: cs.CL

TL;DR: Through a novel quality-scoring mechanism, the SBS framework outperforms existing methods at generating persona-consistent dialogue.

Details

Motivation: The limited diversity of existing dialogue data makes it challenging to integrate persona fidelity effectively into dialogue generation. Method: The SBS framework unifies response generation and quality scoring, using noun-based substitution for data augmentation and semantic similarity-based scores as a proxy for response quality. Result: Extensive experiments on benchmark datasets show that score-conditioned training generates persona-consistent dialogue more reliably, and that including scores in the input prompt during training outperforms conventional training setups. Conclusion: By introducing a scoring mechanism into training, the SBS framework improves dialogue models' ability to generate persona-consistent dialogue and outperforms existing methods. Abstract: Persona-based dialogue generation is an important milestone towards building conversational artificial intelligence. Despite the ever-improving capabilities of large language models (LLMs), effectively integrating persona fidelity in conversations remains challenging due to the limited diversity in existing dialogue data. We propose a novel framework SBS (Score-Before-Speaking), which outperforms previous methods and yields improvements for both million and billion-parameter models. Unlike previous methods, SBS unifies the learning of responses and their relative quality into a single step. The key innovation is to train a dialogue model to correlate augmented responses with a quality score during training and then leverage this knowledge at inference. We use noun-based substitution for augmentation and semantic similarity-based scores as a proxy for response quality. Through extensive experiments with benchmark datasets (PERSONA-CHAT and ConvAI2), we show that score-conditioned training allows existing models to better capture a spectrum of persona-consistent dialogues. Our ablation studies also demonstrate that including scores in the input prompt during training is superior to conventional training setups. Code and further details are available at https://arpita2512.github.io/score_before_you_speak

[22] Model-Agnostic Sentiment Distribution Stability Analysis for Robust LLM-Generated Texts Detection

Siyuan Li,Xi Lin,Guangyan Li,Zehao Liu,Aodu Wulianghai,Li Ding,Jun Wu,Jianhua Li

Main category: cs.CL

TL;DR: SentiDetect is a robust framework for identifying AI-generated text by analyzing emotional consistency in outputs, offering significant improvements over current methods.

Details

Motivation: Existing methods for detecting LLM-generated text suffer from limited generalizability and are vulnerable to paraphrasing, adversarial perturbations, and cross-domain shifts. Method: SentiDetect uses sentiment distribution stability as a basis for detection, employing two metrics: sentiment distribution consistency and sentiment distribution preservation. Result: SentiDetect showed over 16% and 11% F1 score improvements on Gemini-1.5-Pro and GPT-4-0613, respectively, and outperformed existing detectors in challenging scenarios. Conclusion: SentiDetect demonstrates superior performance in detecting LLM-generated text compared to existing methods, showing robustness against paraphrasing, adversarial attacks, and variations in text length. Abstract: The rapid advancement of large language models (LLMs) has resulted in increasingly sophisticated AI-generated content, posing significant challenges in distinguishing LLM-generated text from human-written language. Existing detection methods, primarily based on lexical heuristics or fine-tuned classifiers, often suffer from limited generalizability and are vulnerable to paraphrasing, adversarial perturbations, and cross-domain shifts. In this work, we propose SentiDetect, a model-agnostic framework for detecting LLM-generated text by analyzing the divergence in sentiment distribution stability. Our method is motivated by the empirical observation that LLM outputs tend to exhibit emotionally consistent patterns, whereas human-written texts display greater emotional variability. To capture this phenomenon, we define two complementary metrics: sentiment distribution consistency and sentiment distribution preservation, which quantify stability under sentiment-altering and semantic-preserving transformations. We evaluate SentiDetect on five diverse datasets and a range of advanced LLMs, including Gemini-1.5-Pro, Claude-3, GPT-4-0613, and LLaMa-3.3.
Experimental results demonstrate its superiority over state-of-the-art baselines, with over 16% and 11% F1 score improvements on Gemini-1.5-Pro and GPT-4-0613, respectively. Moreover, SentiDetect also shows greater robustness to paraphrasing, adversarial attacks, and text length variations, outperforming existing detectors in challenging scenarios.

[23] Two-Stage Quranic QA via Ensemble Retrieval and Instruction-Tuned Answer Extraction

Mohamed Basem,Islam Oshallah,Ali Hamdi,Khaled Shaban,Hozaifa Kassab

Main category: cs.CL

TL;DR: The paper proposes a two-stage framework for Quranic Question Answering, using ensemble fine-tuned Arabic language models for passage retrieval and instruction-tuned large language models for answer extraction, achieving state-of-the-art results.

Details

Motivation: Quranic Question Answering presents unique challenges due to the linguistic complexity of Classical Arabic and the semantic richness of religious texts. Method: A two-stage framework involving ensemble fine-tuned Arabic language models for passage retrieval and instruction-tuned large language models with few-shot prompting for answer extraction. Result: The approach achieves state-of-the-art results on the Quran QA 2023 Shared Task, with a MAP@10 of 0.3128 and MRR@10 of 0.5763 for retrieval, and a pAP@10 of 0.669 for extraction. Conclusion: Combining model ensembling and instruction-tuned language models effectively addresses the challenges of low-resource question answering in specialized domains. Abstract: Quranic Question Answering presents unique challenges due to the linguistic complexity of Classical Arabic and the semantic richness of religious texts. In this paper, we propose a novel two-stage framework that addresses both passage retrieval and answer extraction. For passage retrieval, we ensemble fine-tuned Arabic language models to achieve superior ranking performance. For answer extraction, we employ instruction-tuned large language models with few-shot prompting to overcome the limitations of fine-tuning on small datasets. Our approach achieves state-of-the-art results on the Quran QA 2023 Shared Task, with a MAP@10 of 0.3128 and MRR@10 of 0.5763 for retrieval, and a pAP@10 of 0.669 for extraction, substantially outperforming previous methods. These results demonstrate that combining model ensembling and instruction-tuned language models effectively addresses the challenges of low-resource question answering in specialized domains.
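
For readers unfamiliar with the retrieval metrics quoted in entry [23], the following is a minimal sketch of MRR@10 and MAP@10 under their common textbook definitions. It is an assumption that these match the shared task's official scorer, which may treat edge cases (for example, questions with no relevant passage) differently; the function names and the normalization by `min(len(relevant_ids), k)` are illustrative choices, not taken from the paper.

```python
def reciprocal_rank_at_k(ranked_ids, relevant_ids, k=10):
    """1/rank of the first relevant item within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def average_precision_at_k(ranked_ids, relevant_ids, k=10):
    """Mean of precision values at each relevant hit within the top k.

    Assumes ranked_ids contains no duplicates. Normalizing by
    min(len(relevant_ids), k) is one common variant (an assumption here).
    """
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / min(len(relevant_ids), k)


def mean_over_queries(metric, runs, k=10):
    """Average a per-query metric over (ranked_ids, relevant_ids) pairs."""
    return sum(metric(ranked, relevant, k) for ranked, relevant in runs) / len(runs)


if __name__ == "__main__":
    # Hypothetical passage IDs, for illustration only.
    runs = [
        (["a", "b", "c", "d"], {"b", "d"}),
        (["x", "y", "z"], {"x"}),
    ]
    print("MRR@10:", mean_over_queries(reciprocal_rank_at_k, runs))
    print("MAP@10:", mean_over_queries(average_precision_at_k, runs))
```

Both metrics reward placing relevant passages early; MRR only looks at the first relevant hit, while MAP accounts for every relevant passage in the top 10.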

[24] Rethinking 1-bit Optimization Leveraging Pre-trained Large Language Models

Zhijun Tu,Hanting Chen,Siqi Liu,Chuanjian Liu,Jian Li,Jie Hu,Yunhe Wang

Main category: cs.CL

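TL;DR: This paper proposes a new 1-bit LLM quantization method that uses progressive training and optimization strategies to obtain efficient, high-performance 1-bit models from pre-trained models.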

Details

Motivation: Existing 1-bit LLM quantization methods typically train from scratch and cannot fully leverage pre-trained models, which leads to high training costs and notable accuracy degradation. A more efficient approach is therefore needed. Method: Introduces consistent progressive training, combined with binary-aware initialization and dual-scaling compensation, to gradually convert floating-point weights into binarized weights. Result: Experiments show that the method outperforms existing approaches on LLMs of various sizes and yields high-performance 1-bit LLMs. Conclusion: The paper proposes a new 1-bit LLM quantization method that, through consistent progressive training, binary-aware initialization, and related techniques, obtains high-performance 1-bit LLMs from pre-trained models and avoids expensive training from scratch. Abstract: 1-bit LLM quantization offers significant advantages in reducing storage and computational costs. However, existing methods typically train 1-bit LLMs from scratch, failing to fully leverage pre-trained models. This results in high training costs and notable accuracy degradation. We identify that the large gap between full precision and 1-bit representations makes direct adaptation difficult. In this paper, we introduce a consistent progressive training for both forward and backward, smoothly converting the floating-point weights into the binarized ones. Additionally, we incorporate binary-aware initialization and dual-scaling compensation to reduce the difficulty of progressive training and improve the performance. Experimental results on LLMs of various sizes demonstrate that our method outperforms existing approaches. Our results show that high-performance 1-bit LLMs can be achieved using pre-trained models, eliminating the need for expensive training from scratch.

[25] Vec2Summ: Text Summarization via Probabilistic Sentence Embeddings

Mao Li,Fred Conrad,Johann Gagnon-Bartsch

Main category: cs.CL

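TL;DR: Vec2Summ is a summarization method based on semantic compression that generates summaries from a mean vector and random sampling, offering good scalability and semantic control.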

Details

Motivation: To address the limitations of LLM-based summarization methods: context-length constraints, difficulty in controlling generation, and parameter counts that grow with corpus size. Method: Vec2Summ represents a document collection as a mean vector in the semantic embedding space, introduces stochasticity by sampling from a Gaussian distribution centered on that mean, and decodes the result into a summary with a generative language model. Result: Vec2Summ produces coherent summaries for topically focused corpora, with performance comparable to direct LLM summarization but with less fine-grained detail, which suits it to specific scenarios. Conclusion: Vec2Summ shows potential in scalability, semantic control, and corpus-level abstraction, and fits settings that need efficient summary generation. Abstract: We propose Vec2Summ, a novel method for abstractive summarization that frames the task as semantic compression. Vec2Summ represents a document collection using a single mean vector in the semantic embedding space, capturing the central meaning of the corpus. To reconstruct fluent summaries, we perform embedding inversion -- decoding this mean vector into natural language using a generative language model. To improve reconstruction quality and capture some degree of topical variability, we introduce stochasticity by sampling from a Gaussian distribution centered on the mean. This approach is loosely analogous to bagging in ensemble learning, where controlled randomness encourages more robust and varied outputs. Vec2Summ addresses key limitations of LLM-based summarization methods. It avoids context-length constraints, enables interpretable and controllable generation via semantic parameters, and scales efficiently with corpus size -- requiring only $O(d + d^2)$ parameters. Empirical results show that Vec2Summ produces coherent summaries for topically focused, order-invariant corpora, with performance comparable to direct LLM summarization in terms of thematic coverage and efficiency, albeit with less fine-grained detail. These results underscore Vec2Summ's potential in settings where scalability, semantic control, and corpus-level abstraction are prioritized.
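
The sampling step described in entry [25] can be illustrated with a small sketch. It assumes that the $O(d + d^2)$ parameter count corresponds to a mean vector ($d$ values) plus a full covariance matrix ($d^2$ values), which the abstract suggests but does not state; all function names are hypothetical, the toy embeddings are invented, and the embedding-inversion (decoding) step is omitted because it requires a trained generative model.

```python
import math
import random


def mean_vector(embeddings):
    """Component-wise mean of a list of equal-length embedding vectors."""
    n, d = len(embeddings), len(embeddings[0])
    return [sum(e[j] for e in embeddings) / n for j in range(d)]


def covariance_matrix(embeddings, mean):
    """Population covariance (d x d) of the embeddings around the mean."""
    n, d = len(embeddings), len(mean)
    cov = [[0.0] * d for _ in range(d)]
    for e in embeddings:
        for i in range(d):
            for j in range(d):
                cov[i][j] += (e[i] - mean[i]) * (e[j] - mean[j])
    return [[c / n for c in row] for row in cov]


def cholesky(matrix, jitter=1e-9):
    """Lower-triangular factor L with L @ L.T ~= matrix (jitter aids stability)."""
    d = len(matrix)
    lower = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(max(matrix[i][i] + jitter - s, 0.0))
            elif lower[j][j] > 0:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def sample_embedding(mean, chol, rng):
    """Draw one vector from N(mean, L @ L.T) as mean + L @ z, z ~ N(0, I)."""
    z = [rng.gauss(0.0, 1.0) for _ in mean]
    return [m + sum(chol[i][k] * z[k] for k in range(i + 1))
            for i, m in enumerate(mean)]


def parameter_count(d):
    """Mean (d) plus full covariance (d^2): independent of corpus size."""
    return d + d * d


if __name__ == "__main__":
    # Toy 2-dimensional "sentence embeddings" (invented for illustration).
    corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
    mu = mean_vector(corpus)
    chol = cholesky(covariance_matrix(corpus, mu))
    rng = random.Random(0)
    samples = [sample_embedding(mu, chol, rng) for _ in range(3)]
    print(mu, parameter_count(len(mu)), samples)
```

In a full pipeline, each sampled vector would be passed to an embedding-inversion decoder; the point of the sketch is only that the stored state depends on the embedding dimension and not on the number of documents.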

[26] SEADialogues: A Multilingual Culturally Grounded Multi-turn Dialogue Dataset on Southeast Asian Languages

Muhammad Dehan Al Kautsar,Aswin Candra,Muhammad Alif Al Hakim,Maxalmina Satria Kahfi,Fajri Koto,Alham Fikri Aji,Peerat Limkonchotiwat,Ekapol Chuangsuwanich,Genta Indra Winata

Main category: cs.CL

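TL;DR: SEADialogues is a multilingual dialogue dataset centered on Southeast Asian culture, intended to improve cultural awareness and personalization in dialogue systems.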

Details

Motivation: Most existing chit-chat datasets overlook the cultural nuances of natural human conversation, and SEADialogues aims to address this gap. Method: Creates a dialogue dataset covering eight languages from six Southeast Asian countries, in which each dialogue includes persona attributes and two culturally grounded topics that reflect everyday life in the local communities. Result: Releases a multi-turn dialogue dataset that supports building culturally aware and human-centric large language models, including conversational dialogue agents. Conclusion: SEADialogues fills the gap left by existing dialogue datasets that pay too little attention to cultural nuance, and provides an important resource for building culturally aware, personalized dialogue systems. Abstract: Although numerous datasets have been developed to support dialogue systems, most existing chit-chat datasets overlook the cultural nuances inherent in natural human conversations. To address this gap, we introduce SEADialogues, a culturally grounded dialogue dataset centered on Southeast Asia, a region with over 700 million people and immense cultural diversity. Our dataset features dialogues in eight languages from six Southeast Asian countries, many of which are low-resource despite having sizable speaker populations. To enhance cultural relevance and personalization, each dialogue includes persona attributes and two culturally grounded topics that reflect everyday life in the respective communities. Furthermore, we release a multi-turn dialogue dataset to advance research on culturally aware and human-centric large language models, including conversational dialogue agents.

[27] BharatBBQ: A Multilingual Bias Benchmark for Question Answering in the Indian Context

Aditya Tomar,Nihar Ranjan Sahoo,Pushpak Bhattacharyya

Main category: cs.CL

TL;DR: BharatBBQ is a bias evaluation benchmark for India's multilingual cultural context that reveals how language models' biases vary across languages.

Details

Motivation: Existing bias evaluation benchmarks such as BBQ focus mainly on Western contexts and apply poorly to Indian culture, so an evaluation tool suited to the Indian context is needed. Method: Builds the culturally adapted benchmark BharatBBQ, covering 13 social categories and 8 languages with 392,864 examples in total, and evaluates the bias scores of five multilingual LM families in zero-shot and few-shot settings. Result: The study finds persistent biases across languages and social categories, with biases often amplified in Indian languages, underscoring the importance of culturally adapted benchmarks. Conclusion: BharatBBQ highlights the pervasive biases in language models and shows that biases in Indian languages are often more pronounced than in English, demonstrating the need for linguistically and culturally grounded bias evaluation benchmarks. Abstract: Evaluating social biases in language models (LMs) is crucial for ensuring fairness and minimizing the reinforcement of harmful stereotypes in AI systems. Existing benchmarks, such as the Bias Benchmark for Question Answering (BBQ), primarily focus on Western contexts, limiting their applicability to the Indian context. To address this gap, we introduce BharatBBQ, a culturally adapted benchmark designed to assess biases in Hindi, English, Marathi, Bengali, Tamil, Telugu, Odia, and Assamese. BharatBBQ covers 13 social categories, including 3 intersectional groups, reflecting prevalent biases in the Indian sociocultural landscape. Our dataset contains 49,108 examples in one language that are expanded using translation and verification to 392,864 examples in eight different languages. We evaluate five multilingual LM families across zero and few-shot settings, analyzing their bias and stereotypical bias scores. Our findings highlight persistent biases across languages and social categories and often amplified biases in Indian languages compared to English, demonstrating the necessity of linguistically and culturally grounded benchmarks for bias evaluation.

[28] Less Is More: Training-Free Sparse Attention with Global Locality for Efficient Reasoning

Lijie Yang,Zhihao Zhang,Arti Jain,Shijie Cao,Baihong Yuan,Yiwei Chen,Zhihao Jia,Ravi Netravali

Main category: cs.CL

TL;DR: LessIsMore is a training-free sparse attention mechanism that improves performance in reasoning tasks by reducing token generation without sacrificing accuracy.

Details

Motivation: Large reasoning models face computational overhead due to excessive token generation, and existing sparse attention mechanisms suffer from accuracy degradation and high token retention costs. Method: LessIsMore uses global attention patterns, aggregating token selections from local attention heads with recent context to enable unified cross-head token ranking. Result: LessIsMore achieves a 1.1× average decoding speed-up and a 1.13× end-to-end speed-up while preserving or improving accuracy across reasoning tasks. Conclusion: LessIsMore is a training-free sparse attention mechanism that improves decoding speed and reduces token generation without accuracy loss compared to existing methods. Abstract: Large reasoning models achieve strong performance through test-time scaling but incur substantial computational overhead, particularly from excessive token generation when processing short input prompts. While sparse attention mechanisms can reduce latency and memory usage, existing approaches suffer from significant accuracy degradation due to accumulated errors during long-generation reasoning. These methods generally require either high token retention rates or expensive retraining. We introduce LessIsMore, a training-free sparse attention mechanism for reasoning tasks, which leverages global attention patterns rather than relying on traditional head-specific local optimizations. LessIsMore aggregates token selections from local attention heads with recent contextual information, enabling unified cross-head token ranking for future decoding layers. This unified selection improves generalization and efficiency by avoiding the need to maintain separate token subsets per head. Evaluation across diverse reasoning tasks and benchmarks shows that LessIsMore preserves -- and in some cases improves -- accuracy while achieving a $1.1\times$ average decoding speed-up compared to full attention. 
Moreover, LessIsMore attends to $2\times$ fewer tokens without accuracy loss, achieving a $1.13\times$ end-to-end speed-up compared to existing sparse attention methods.
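
One possible reading of the unified cross-head token selection in entry [28] is sketched below. This is not the authors' implementation: the vote-counting aggregation, the tie-breaking rule, and the parameter names `top_k`, `recent_window`, and `budget` are all assumptions made for illustration. The abstract states only that per-head selections are aggregated with recent context into a single ranking shared across heads.

```python
def unified_token_selection(head_scores, top_k, recent_window, budget):
    """Pick one shared set of token positions for all attention heads.

    head_scores: list (one entry per head) of per-token attention scores.
    top_k: how many tokens each head nominates (assumed hyperparameter).
    recent_window: number of most recent tokens that are always kept.
    budget: total number of token positions to keep.
    """
    num_tokens = len(head_scores[0])
    recent_window = min(recent_window, budget)
    recent = list(range(max(0, num_tokens - recent_window), num_tokens))
    recent_set = set(recent)

    # Each head votes for its top_k highest-scoring tokens (assumed rule).
    votes = [0] * num_tokens
    for scores in head_scores:
        ranked = sorted(range(num_tokens), key=lambda t: scores[t], reverse=True)
        for token in ranked[:top_k]:
            votes[token] += 1

    # Rank non-recent candidates by vote count, breaking ties by position.
    candidates = [t for t in range(num_tokens)
                  if t not in recent_set and votes[t] > 0]
    candidates.sort(key=lambda t: (-votes[t], t))

    remaining = max(0, budget - len(recent))
    return sorted(candidates[:remaining] + recent)


if __name__ == "__main__":
    # Two heads, six cached tokens (invented scores for illustration).
    scores = [
        [0.9, 0.1, 0.5, 0.2, 0.3, 0.1],
        [0.8, 0.6, 0.1, 0.1, 0.2, 0.3],
    ]
    print(unified_token_selection(scores, top_k=2, recent_window=2, budget=4))
```

Because every head attends to the same positions, only one token subset has to be maintained, which is the efficiency argument the abstract makes against per-head selection.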

[29] Investigating Intersectional Bias in Large Language Models using Confidence Disparities in Coreference Resolution

Falaah Arif Khan,Nivedha Sivakumar,Yinong Oliver Wang,Katherine Metcalf,Cezanne Camacho,Barry-John Theobald,Luca Zappella,Nicholas Apostoloff

Main category: cs.CL

TL;DR: This paper introduces WinoIdentity, a benchmark for evaluating intersectional bias in large language models (LLMs), revealing significant confidence disparities across demographic attributes and suggesting that LLMs' performance may be due to memorization rather than logical reasoning.

Details

Motivation: The motivation for this study stems from the concern that AI systems, including LLMs, may reflect and exacerbate societal biases, especially when used in critical social contexts like hiring and admissions. The authors aim to extend fairness evaluations to intersectional bias, recognizing that multiple axes of discrimination can create unique patterns of disadvantage. Method: The researchers created a new benchmark called WinoIdentity by augmenting the WinoBias dataset with 25 demographic markers across 10 attributes. They evaluated five recent LLMs using 245,700 prompts to assess 50 distinct bias patterns through a metric called Coreference Confidence Disparity. Result: The evaluation of five recent LLMs revealed confidence disparities as high as 40% along various demographic attributes, such as body type, sexual orientation, and socio-economic status. The models were most uncertain about doubly-disadvantaged identities in anti-stereotypical settings. Surprisingly, coreference confidence also decreased for privileged markers. Conclusion: The study concludes that large language models (LLMs) exhibit significant confidence disparities based on intersectional identities, which can lead to social harm. It highlights two independent failures in value alignment and validity, suggesting that LLMs' performance may stem more from memorization than logical reasoning. Abstract: Large language models (LLMs) have achieved impressive performance, leading to their widespread adoption as decision-support tools in resource-constrained contexts like hiring and admissions. There is, however, scientific consensus that AI systems can reflect and exacerbate societal biases, raising concerns about identity-based harm when used in critical social contexts. Prior work has laid a solid foundation for assessing bias in LLMs by evaluating demographic disparities in different language reasoning tasks. 
In this work, we extend single-axis fairness evaluations to examine intersectional bias, recognizing that when multiple axes of discrimination intersect, they create distinct patterns of disadvantage. We create a new benchmark called WinoIdentity by augmenting the WinoBias dataset with 25 demographic markers across 10 attributes, including age, nationality, and race, intersected with binary gender, yielding 245,700 prompts to evaluate 50 distinct bias patterns. Focusing on harms of omission due to underrepresentation, we investigate bias through the lens of uncertainty and propose a group (un)fairness metric called Coreference Confidence Disparity which measures whether models are more or less confident for some intersectional identities than others. We evaluate five recently published LLMs and find confidence disparities as high as 40% along various demographic attributes including body type, sexual orientation and socio-economic status, with models being most uncertain about doubly-disadvantaged identities in anti-stereotypical settings. Surprisingly, coreference confidence decreases even for hegemonic or privileged markers, indicating that the recent impressive performance of LLMs is more likely due to memorization than logical reasoning. Notably, these are two independent failures in value alignment and validity that can compound to cause social harm.

[30] Fairness of Automatic Speech Recognition: Looking Through a Philosophical Lens

Anna Seo Gyeong Choi,Hoon Choi

Main category: cs.CL

TL;DR: This paper explores how Automatic Speech Recognition systems can unintentionally discriminate against non-standard dialects, arguing that this bias reflects disrespect towards marginalized communities and requires a philosophical approach to address it effectively.

Details

Motivation: The motivation behind the paper is to explore the fairness implications of Automatic Speech Recognition (ASR) systems, particularly how they may systematically misrecognize certain speech varieties, leading to disrespect and compounding historical injustices against marginalized linguistic communities. Method: The paper uses a philosophical lens to examine ASR bias, distinguishing between morally neutral classification and harmful discrimination. It analyzes the ethical dimensions of speech technologies, such as 'temporal taxation,' disruption of conversational flow, and the link between speech patterns and identity. Result: The paper identifies three unique ethical dimensions of speech technologies: temporal taxation, disruption of conversational flow, and the connection between speech and identity. It argues that current ASR development embeds problematic language ideologies and that existing fairness metrics fail to capture the resulting power imbalances. Conclusion: The paper concludes that addressing ASR bias involves more than just technical solutions; it necessitates recognizing diverse speech varieties as valid forms of expression deserving of technological accommodation. This philosophical approach opens new avenues for developing ASR systems that respect linguistic diversity and speaker autonomy. Abstract: Automatic Speech Recognition (ASR) systems now mediate countless human-technology interactions, yet research on their fairness implications remains surprisingly limited. This paper examines ASR bias through a philosophical lens, arguing that systematic misrecognition of certain speech varieties constitutes more than a technical limitation -- it represents a form of disrespect that compounds historical injustices against marginalized linguistic communities. 
We distinguish between morally neutral classification (discriminate1) and harmful discrimination (discriminate2), demonstrating how ASR systems can inadvertently transform the former into the latter when they consistently misrecognize non-standard dialects. We identify three unique ethical dimensions of speech technologies that differentiate ASR bias from other algorithmic fairness concerns: the temporal burden placed on speakers of non-standard varieties ("temporal taxation"), the disruption of conversational flow when systems misrecognize speech, and the fundamental connection between speech patterns and personal/cultural identity. These factors create asymmetric power relationships that existing technical fairness metrics fail to capture. The paper analyzes the tension between linguistic standardization and pluralism in ASR development, arguing that current approaches often embed and reinforce problematic language ideologies. We conclude that addressing ASR bias requires more than technical interventions; it demands recognition of diverse speech varieties as legitimate forms of expression worthy of technological accommodation. This philosophical reframing offers new pathways for developing ASR systems that respect linguistic diversity and speaker autonomy.

[31] Gradient Surgery for Safe LLM Fine-Tuning

Biao Yi,Jiahao Li,Baolei Zhang,Lihai Nie,Tong Li,Tiansheng Huang,Zheli Liu

Main category: cs.CL

TL;DR: SafeGrad is a novel method to protect LLMs from safety alignment degradation during fine-tuning by using gradient surgery and KL-divergence loss.

Details

Motivation: Malicious examples in fine-tuning datasets can compromise the safety of LLMs, and existing solutions are sensitive to the harmful ratio. A more robust solution is needed. Method: SafeGrad uses gradient surgery to nullify harmful components of the user-task gradient while maintaining task fidelity, along with a KL-divergence alignment loss for better safety profile learning. Result: SafeGrad achieves state-of-the-art defense performance across various LLMs and datasets, preserving safety alignment even at high harmful ratios without sacrificing task performance. Conclusion: SafeGrad provides a robust solution to maintain safety alignment in LLMs during fine-tuning, even in the presence of malicious examples. Abstract: Fine-tuning-as-a-Service introduces a critical vulnerability where a few malicious examples mixed into the user's fine-tuning dataset can compromise the safety alignment of Large Language Models (LLMs). While a recognized paradigm frames safe fine-tuning as a multi-objective optimization problem balancing user task performance with safety alignment, we find existing solutions are critically sensitive to the harmful ratio, with defenses degrading sharply as harmful ratio increases. We diagnose that this failure stems from conflicting gradients, where the user-task update directly undermines the safety objective. To resolve this, we propose SafeGrad, a novel method that employs gradient surgery. When a conflict is detected, SafeGrad nullifies the harmful component of the user-task gradient by projecting it onto the orthogonal plane of the alignment gradient, allowing the model to learn the user's task without sacrificing safety. To further enhance robustness and data efficiency, we employ a KL-divergence alignment loss that learns the rich, distributional safety profile of the well-aligned foundation model. 
Extensive experiments show that SafeGrad provides state-of-the-art defense across various LLMs and datasets, maintaining robust safety even at high harmful ratios without compromising task fidelity.
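
The projection step described in entry [31] can be sketched in a few lines. The sketch assumes that a "conflict" means a negative inner product between the user-task gradient and the alignment gradient, as in earlier gradient-surgery work; the abstract does not spell out the detection rule. Gradients are represented as flat Python lists, the function names are hypothetical, and the KL-divergence alignment loss is not covered.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_conflicting_gradient(task_grad, align_grad):
    """Remove the part of task_grad that opposes align_grad.

    If the gradients conflict (negative inner product, an assumed
    criterion), project task_grad onto the plane orthogonal to
    align_grad:  g_task - (<g_task, g_align> / ||g_align||^2) * g_align.
    Otherwise return task_grad unchanged.
    """
    inner = dot(task_grad, align_grad)
    norm_sq = dot(align_grad, align_grad)
    if inner >= 0 or norm_sq == 0:
        return list(task_grad)
    scale = inner / norm_sq
    return [g - scale * a for g, a in zip(task_grad, align_grad)]


if __name__ == "__main__":
    # Invented 2-D gradients: the task update pushes against alignment.
    task_grad = [1.0, -1.0]
    align_grad = [0.0, 1.0]
    safe_grad = project_conflicting_gradient(task_grad, align_grad)
    print(safe_grad, dot(safe_grad, align_grad))
```

After projection the task update has zero component along the alignment gradient, so to first order it no longer moves the model against the safety objective, while the orthogonal part that carries the task signal is preserved.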

[32] Omni-SafetyBench: A Benchmark for Safety Evaluation of Audio-Visual Large Language Models

Leyi Pan,Zheyu Fu,Yunpeng Zhai,Shuchang Tao,Sheng Guan,Shiyu Huang,Lingzhe Zhang,Zhaoyang Liu,Bolin Ding,Felix Henry,Lijie Wen,Aiwei Liu

Main category: cs.CL

TL;DR: Omni-SafetyBench is introduced as the first benchmark for evaluating the safety of Omni-modal Large Language Models (OLLMs), highlighting critical vulnerabilities in current models and the need for enhanced safety measures.

Details

Motivation: The lack of dedicated benchmarks for evaluating the safety of Omni-modal Large Language Models (OLLMs) under audio-visual or cross-modal inputs. Method: Development of Omni-SafetyBench, a comprehensive benchmark with tailored metrics like Safety-score and CMSC-score for evaluating OLLM safety and cross-modal consistency. Result: Testing revealed that no model performs well in both safety and consistency, safety defenses weaken with complex inputs, and some models exhibit severe weaknesses on specific modalities. Conclusion: The Omni-SafetyBench benchmark highlights critical vulnerabilities in current OLLMs and underscores the urgent need for improved safety measures. Abstract: The rise of Omni-modal Large Language Models (OLLMs), which integrate visual and auditory processing with text, necessitates robust safety evaluations to mitigate harmful outputs. However, no dedicated benchmarks currently exist for OLLMs, and prior benchmarks designed for other LLMs lack the ability to assess safety performance under audio-visual joint inputs or cross-modal safety consistency. To fill this gap, we introduce Omni-SafetyBench, the first comprehensive parallel benchmark for OLLM safety evaluation, featuring 24 modality combinations and variations with 972 samples each, including dedicated audio-visual harm cases. Considering OLLMs' comprehension challenges with complex omni-modal inputs and the need for cross-modal consistency evaluation, we propose tailored metrics: a Safety-score based on conditional Attack Success Rate (C-ASR) and Refusal Rate (C-RR) to account for comprehension failures, and a Cross-Modal Safety Consistency Score (CMSC-score) to measure consistency across modalities. 
Evaluating 6 open-source and 4 closed-source OLLMs reveals critical vulnerabilities: (1) no model excels in both overall safety and consistency, with only 3 models achieving over 0.6 in both metrics and the top performer scoring around 0.8; (2) safety defenses weaken with complex inputs, especially joint audio-visual inputs; (3) severe weaknesses persist, with some models scoring as low as 0.14 on specific modalities. Our benchmark and metrics highlight urgent needs for enhanced OLLM safety, providing a foundation for future improvements.

[33] Improved Personalized Headline Generation via Denoising Fake Interests from Implicit Feedback

Kejin Liu,Junhong Lian,Xiang Ao,Ningtao Wang,Xing Fu,Yu Cheng,Weiqiang Wang,Xinyu Liu

Main category: cs.CL

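TL;DR: This paper proposes PHG-DIF, a new personalized headline generation framework that improves headline quality by removing fake interests from implicit feedback.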

Details

Motivation: Existing personalized headline generation methods overlook personalized-irrelevant click noise in historical clickstreams, which can lead to generated headlines that deviate from users' genuine preferences. Method: The paper proposes PHG-DIF, a new personalized headline generation framework that denoises by removing fake interests from implicit feedback. PHG-DIF first applies dual-stage filtering to remove clickstream noise, then uses multi-level temporal fusion to dynamically model users' evolving and multi-faceted interests. Result: Experiments show that PHG-DIF significantly improves headline generation quality on the DT-PENS dataset, achieving state-of-the-art performance. Conclusion: PHG-DIF effectively mitigates the adverse effects of click noise and significantly improves headline quality, achieving state-of-the-art (SOTA) results on DT-PENS. Abstract: Accurate personalized headline generation hinges on precisely capturing user interests from historical behaviors. However, existing methods neglect personalized-irrelevant click noise in entire historical clickstreams, which may lead to hallucinated headlines that deviate from genuine user preferences. In this paper, we reveal the detrimental impact of click noise on personalized generation quality through rigorous analysis in both user and news dimensions. Based on these insights, we propose a novel Personalized Headline Generation framework via Denoising Fake Interests from Implicit Feedback (PHG-DIF). PHG-DIF first employs dual-stage filtering to effectively remove clickstream noise, identified by short dwell times and abnormal click bursts, and then leverages multi-level temporal fusion to dynamically model users' evolving and multi-faceted interests for precise profiling. Moreover, we release DT-PENS, a new benchmark dataset comprising the click behavior of 1,000 carefully curated users and nearly 10,000 annotated personalized headlines with historical dwell time annotations. Extensive experiments demonstrate that PHG-DIF substantially mitigates the adverse effects of click noise and significantly improves headline quality, achieving state-of-the-art (SOTA) results on DT-PENS. Our framework implementation and dataset are available at https://github.com/liukejin-up/PHG-DIF.
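
The dual-stage filtering idea (dropping clicks with short dwell times, then clicks that belong to abnormal bursts) can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration: the `Click` record, the threshold values, the sliding-window burst rule, and the order of the two stages are hypothetical and are not taken from the PHG-DIF paper or repository.

```python
from dataclasses import dataclass


@dataclass
class Click:
    news_id: str
    timestamp: float  # seconds since the start of the session (hypothetical unit)
    dwell: float      # seconds spent on the article


def filter_short_dwell(clicks, min_dwell=5.0):
    """Stage 1: keep only clicks whose dwell time reaches min_dwell seconds."""
    return [c for c in clicks if c.dwell >= min_dwell]


def filter_bursts(clicks, window=10.0, max_in_window=3):
    """Stage 2: drop clicks that sit in an unusually dense cluster.

    A click is treated as part of a burst when more than max_in_window
    clicks (itself included) fall within +/- window seconds of it.
    """
    ordered = sorted(clicks, key=lambda c: c.timestamp)
    kept = []
    for c in ordered:
        nearby = sum(1 for o in ordered if abs(o.timestamp - c.timestamp) <= window)
        if nearby <= max_in_window:
            kept.append(c)
    return kept


def denoise(clicks):
    """Apply both stages in sequence (the ordering is an assumption)."""
    return filter_bursts(filter_short_dwell(clicks))


if __name__ == "__main__":
    history = [
        Click("a", 0.0, 30.0),
        Click("b", 100.0, 1.0),    # short dwell
        Click("c", 200.0, 20.0),   # c-f form a burst
        Click("d", 201.0, 20.0),
        Click("e", 202.0, 20.0),
        Click("f", 203.0, 20.0),
    ]
    print([c.news_id for c in denoise(history)])
```

In this toy history only the first click survives both stages; a real system would tune the thresholds per user or per dataset.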

[34] Schema Lineage Extraction at Scale: Multilingual Pipelines, Composite Evaluation, and Language-Model Benchmarks

Jiaqi Yin,Yi-Wei Chen,Meng-Lung Lee,Xiya Liu

Main category: cs.CL

TL;DR: This paper proposes a framework to address semantic drift in enterprise data pipelines by extracting fine-grained schema lineage, introducing a new evaluation metric (SLiCE) and benchmark, and demonstrating that model size and prompting techniques significantly impact performance.

Details

Motivation: Enterprise data pipelines often suffer from semantic drift due to complex transformations across multiple programming languages, leading to issues in data reproducibility, governance, and the performance of downstream systems like retrieval-augmented generation (RAG) and text-to-SQL systems. This work aims to address this challenge. Method: The paper proposes a framework for automated extraction of fine-grained schema lineage from multilingual enterprise pipeline scripts. It introduces a metric called Schema Lineage Composite Evaluation (SLiCE) to assess lineage quality and presents a benchmark with 1,700 manually annotated lineages for evaluation. Result: Experiments with 12 language models, ranging from small to large models like GPT-4o and GPT-4.1, show that schema lineage extraction performance scales with model size and prompting sophistication. A 32B open-source model achieves performance comparable to GPT series models under standard prompting. Conclusion: The study concludes that semantic drift in enterprise data pipelines can be effectively addressed by deploying a novel framework for automated schema lineage extraction, which is both scalable and economical for practical applications. Abstract: Enterprise data pipelines, characterized by complex transformations across multiple programming languages, often cause a semantic disconnect between original metadata and downstream data. This "semantic drift" compromises data reproducibility and governance, and impairs the utility of services like retrieval-augmented generation (RAG) and text-to-SQL systems. To address this, a novel framework is proposed for the automated extraction of fine-grained schema lineage from multilingual enterprise pipeline scripts. This method identifies four key components: source schemas, source tables, transformation logic, and aggregation operations, creating a standardized representation of data transformations. 
For the rigorous evaluation of lineage quality, this paper introduces the Schema Lineage Composite Evaluation (SLiCE), a metric that assesses both structural correctness and semantic fidelity. A new benchmark is also presented, comprising 1,700 manually annotated lineages from real-world industrial scripts. Experiments were conducted with 12 language models, ranging from small language models (SLMs) of 1.3B to 32B parameters to large language models (LLMs) like GPT-4o and GPT-4.1. The results demonstrate that the performance of schema lineage extraction scales with model size and the sophistication of prompting techniques. Notably, a 32B open-source model, using a single reasoning trace, can achieve performance comparable to the GPT series under standard prompting. This finding suggests a scalable and economical approach for deploying schema-aware agents in practical applications.

[35] DySK-Attn: A Framework for Efficient, Real-Time Knowledge Updating in Large Language Models via Dynamic Sparse Knowledge Attention

Kabir Khan,Priya Sharma,Arjun Mehta,Neha Gupta,Ravi Narayanan

Main category: cs.CL

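TL;DR: DySK-Attn is a new framework that enables large language models to efficiently integrate real-time knowledge through a dynamic knowledge graph.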

Details

Motivation: The knowledge of large language models is static and quickly becomes outdated, while existing knowledge editing techniques can be slow and may introduce unforeseen side effects. Method: DySK-Attn pairs a sparse knowledge attention mechanism with a dynamic knowledge graph, allowing the LLM to perform a coarse-to-fine search that efficiently identifies and focuses on a small, highly relevant subset of facts in the knowledge graph. Result: On time-sensitive question-answering tasks, DySK-Attn significantly outperforms strong baselines, including standard Retrieval-Augmented Generation (RAG) and model editing techniques, in both factual accuracy for updated knowledge and computational efficiency. Conclusion: DySK-Attn offers a scalable and effective solution for keeping large language models current with the ever-changing world. Abstract: Large Language Models (LLMs) suffer from a critical limitation: their knowledge is static and quickly becomes outdated. Retraining these massive models is computationally prohibitive, while existing knowledge editing techniques can be slow and may introduce unforeseen side effects. To address this, we propose DySK-Attn, a novel framework that enables LLMs to efficiently integrate real-time knowledge from a dynamic external source. Our approach synergizes an LLM with a dynamic Knowledge Graph (KG) that can be updated instantaneously. The core of our framework is a sparse knowledge attention mechanism, which allows the LLM to perform a coarse-to-fine grained search, efficiently identifying and focusing on a small, highly relevant subset of facts from the vast KG. This mechanism avoids the high computational cost of dense attention over the entire knowledge base and mitigates noise from irrelevant information. We demonstrate through extensive experiments on time-sensitive question-answering tasks that DySK-Attn significantly outperforms strong baselines, including standard Retrieval-Augmented Generation (RAG) and model editing techniques, in both factual accuracy for updated knowledge and computational efficiency. Our framework offers a scalable and effective solution for building LLMs that can stay current with the ever-changing world.

[36] Adapting LLMs to Time Series Forecasting via Temporal Heterogeneity Modeling and Semantic Alignment

Yanru Sun,Emadeldeen Eldele,Zongxia Xie,Yucheng Wang,Wenzhe Niu,Qinghua Hu,Chee Keong Kwoh,Min Wu

Main category: cs.CL

TL;DR: TALON improves LLM-based time series forecasting by addressing temporal pattern heterogeneity and the modality gap.

Details

Motivation: Directly applying LLMs to time series forecasting is challenging because of the inherent heterogeneity of temporal patterns and the modality gap between continuous numerical signals and discrete language representations. Method: The authors design a Heterogeneous Temporal Encoder and a Semantic Alignment Module. Result: Extensive experiments on seven real-world benchmarks show that TALON achieves superior performance across all datasets, with average MSE improvements of up to 11% over recent state-of-the-art methods. Conclusion: TALON is an effective LLM-based time series forecasting framework that improves forecasting performance by modeling temporal heterogeneity and performing semantic alignment. Abstract: Large Language Models (LLMs) have recently demonstrated impressive capabilities in natural language processing due to their strong generalization and sequence modeling capabilities. However, their direct application to time series forecasting remains challenging due to two fundamental issues: the inherent heterogeneity of temporal patterns and the modality gap between continuous numerical signals and discrete language representations. In this work, we propose TALON, a unified framework that enhances LLM-based forecasting by modeling temporal heterogeneity and enforcing semantic alignment. Specifically, we design a Heterogeneous Temporal Encoder that partitions multivariate time series into structurally coherent segments, enabling localized expert modeling across diverse temporal patterns. To bridge the modality gap, we introduce a Semantic Alignment Module that aligns temporal features with LLM-compatible representations, enabling effective integration of time series into language-based models while eliminating the need for handcrafted prompts during inference. Extensive experiments on seven real-world benchmarks demonstrate that TALON achieves superior performance across all datasets, with average MSE improvements of up to 11% over recent state-of-the-art methods. These results underscore the effectiveness of incorporating both pattern-aware and semantic-aware designs when adapting LLMs for time series forecasting. The code is available at: https://github.com/syrGitHub/TALON.

[37] Enhancing Rumor Detection Methods with Propagation Structure Infused Language Model

Chaoqun Cui,Siyuan Li,Kunkun Ma,Caiyan Jia

Main category: cs.CL

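TL;DR: This paper proposes a continued pretraining strategy, PEP, and a pretrained language model, SoLM, which improve rumor detection performance by modeling the propagation-structure relations between social media posts.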

Details

Motivation: Pretrained language models (PLMs) excel at natural language processing tasks, but their performance in social media applications such as rumor detection remains unsatisfactory. This is attributed to the mismatch between pretraining corpora and social media text, inadequate handling of unique social media symbols, and pretraining tasks that are ill-suited to modeling the user engagements implicit in propagation structures. Method: The paper proposes a continued pretraining strategy called Post Engagement Prediction (PEP) and builds a Twitter-tailored pretrained language model, SoLM. PEP captures interactions of stance and sentiment by predicting root, branch, and parent relations between posts. Result: PEP significantly boosts the rumor detection performance of both universal and social media PLMs, even in few-shot scenarios, with accuracy gains of 1.0-3.7% on benchmark datasets. SoLM alone, without high-level modules, also achieves competitive results. Conclusion: The PEP strategy and the SoLM model effectively improve rumor detection performance, demonstrating their value for social media text processing. Abstract: Pretrained Language Models (PLMs) have excelled in various Natural Language Processing tasks, benefiting from large-scale pretraining and the self-attention mechanism's ability to capture long-range dependencies. However, their performance on social media application tasks like rumor detection remains suboptimal. We attribute this to mismatches between pretraining corpora and social texts, inadequate handling of unique social symbols, and pretraining tasks ill-suited for modeling user engagements implicit in propagation structures. To address these issues, we propose a continued pretraining strategy called Post Engagement Prediction (PEP) to infuse information from propagation structures into PLMs. PEP trains models to predict root, branch, and parent relations between posts, capturing interactions of stance and sentiment crucial for rumor detection. We also curate and release a large-scale Twitter corpus, TwitterCorpus (269GB of text), and two unlabeled claim conversation datasets with propagation structures (UTwitter and UWeibo). Utilizing these resources and the PEP strategy, we train a Twitter-tailored PLM called SoLM. Extensive experiments demonstrate that PEP significantly boosts rumor detection performance across universal and social media PLMs, even in few-shot scenarios. On benchmark datasets, PEP enhances baseline models by 1.0-3.7% accuracy, even enabling them to outperform current state-of-the-art methods on multiple datasets.
SoLM alone, without high-level modules, also achieves competitive results, highlighting the strategy's effectiveness in learning discriminative post interaction features.

[38] How Does a Deep Neural Network Look at Lexical Stress?

Itai Allouche,Itay Asael,Rotem Rousso,Vered Dassa,Ann Bradlow,Seung-Eun Kim,Matthew Goldrick,Joseph Keshet

Main category: cs.CL

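TL;DR: This study uses deep learning models to predict stress position in English disyllabic words and applies interpretability analysis to show that the models rely mainly on the spectral properties of stressed vowels.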

Details

Motivation: Despite their success in speech processing, neural networks often make decisions in opaque ways, so the authors investigate what informs those decisions and how to make the models more interpretable. Method: Convolutional neural networks (CNNs) are trained to predict stress position from spectrographic representations of English disyllabic words lacking minimal stress pairs, and Layerwise Relevance Propagation (LRP) is used for interpretability analysis. Result: The CNN models reach up to 92% accuracy in predicting stress position, and the LRP analysis shows that predictions are influenced mainly by the spectral properties of stressed syllables, especially the first and second formants of stressed vowels. Conclusion: The study shows that deep learning models can acquire distributed cues to stress from naturally occurring data, complementing traditional phonetic work based on highly controlled stimuli. Abstract: Despite their success in speech processing, neural networks often operate as black boxes, prompting the question: what informs their decisions, and how can we interpret them? This work examines this issue in the context of lexical stress. A dataset of English disyllabic words was automatically constructed from read and spontaneous speech. Several Convolutional Neural Network (CNN) architectures were trained to predict stress position from a spectrographic representation of disyllabic words lacking minimal stress pairs (e.g., initial stress WAllet, final stress exTEND), achieving up to 92% accuracy on held-out test data. Layerwise Relevance Propagation (LRP), a technique for CNN interpretability analysis, revealed that predictions for held-out minimal pairs (PROtest vs. proTEST) were most strongly influenced by information in stressed versus unstressed syllables, particularly the spectral properties of stressed vowels. However, the classifiers also attended to information throughout the word. A feature-specific relevance analysis is proposed, and its results suggest that our best-performing classifier is strongly influenced by the stressed vowel's first and second formants, with some evidence that its pitch and third formant also contribute. These results reveal deep learning's ability to acquire distributed cues to stress from naturally occurring data, extending traditional phonetic work based around highly controlled stimuli.

[39] Prompt Tuning for Few-Shot Continual Learning Named Entity Recognition

Zhe Ren

Main category: cs.CL

TL;DR: This paper proposes an Anchor words-oriented Prompt Tuning paradigm and Memory Demonstration Templates to address the Few-Shot Distillation Dilemma in Few-Shot Continual Learning Named Entity Recognition tasks, achieving competitive results.

Details

Motivation: The scarcity of new-class entities and lack of old-class information in FS-CLNER tasks lead to poor generalization and ineffective knowledge distillation, necessitating a novel approach. Method: An Anchor words-oriented Prompt Tuning (APT) paradigm and Memory Demonstration Templates (MDT) were designed to overcome the Few-Shot Distillation Dilemma and enhance model generalization and in-context learning. Result: The proposed approach demonstrates competitive performance on FS-CLNER tasks, effectively addressing the Few-Shot Distillation Dilemma and promoting in-context learning. Conclusion: The proposed APT and MDT strategies effectively address the challenges in FS-CLNER tasks, achieving competitive performance. Abstract: Knowledge distillation has been successfully applied to Continual Learning Named Entity Recognition (CLNER) tasks, by using a teacher model trained on old-class data to distill old-class entities present in new-class data as a form of regularization, thereby avoiding catastrophic forgetting. However, in Few-Shot CLNER (FS-CLNER) tasks, the scarcity of new-class entities makes it difficult for the trained model to generalize during inference. More critically, the lack of old-class entity information hinders the distillation of old knowledge, causing the model to fall into what we refer to as the Few-Shot Distillation Dilemma. In this work, we address the above challenges through a prompt tuning paradigm and memory demonstration template strategy. Specifically, we designed an expandable Anchor words-oriented Prompt Tuning (APT) paradigm to bridge the gap between pre-training and fine-tuning, thereby enhancing performance in few-shot scenarios. Additionally, we incorporated Memory Demonstration Templates (MDT) into each training instance to provide replay samples from previous tasks, which not only avoids the Few-Shot Distillation Dilemma but also promotes in-context learning. 
Experiments show that our approach achieves competitive performance on FS-CLNER.

[40] The 2D+ Dynamic Articulatory Model DYNARTmo: Tongue-Palate Contact Area Estimation

Bernd J. Kröger

Main category: cs.CL

TL;DR: This paper enhances the DYNARTmo articulatory model by incorporating a 3D palatal dome representation to better estimate tongue-palate contact areas and improve visualization for speech-related applications.

Details

Motivation: To improve the estimation of tongue-palate contact areas and enhance visualization capabilities of a 2D dynamic articulatory model for better applications in speech science and therapy. Method: An internal 3D representation of the palatal dome was integrated into the DYNARTmo model using two dome geometries—half-ellipse and cosine-based profiles—to compute lateral contact points analytically. Result: The updated model supports three synchronized views (sagittal, glottal, palatal) for static and dynamic articulation displays, including electropalatography-like visualizations. Conclusion: The enhanced DYNARTmo model successfully integrates a 3D palatal dome representation to estimate tongue-palate contact areas and generate EPG-like visualizations, making it a valuable tool for speech science education and therapy. Abstract: This paper describes an extension of the two-dimensional dynamic articulatory model DYNARTmo by integrating an internal three-dimensional representation of the palatal dome to estimate tongue-palate contact areas from midsagittal tongue contours. Two alternative dome geometries - a half-ellipse and a cosine based profile - are implemented to model lateral curvature in the coronal plane. Using these geometries, lateral contact points are analytically computed for each anterior-posterior position, enabling the generation of electropalatography-like visualizations within the 2D+ framework. The enhanced model supports three synchronized views (sagittal, glottal, and palatal) for static and dynamic (animated) articulation displays, suitable for speech science education and speech therapy. Future work includes adding a facial (lip) view and implementing articulatory-to-acoustic synthesis to quantitatively evaluate model realism.
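
To make the two dome geometries concrete, the sketch below computes where a laterally flat tongue surface would meet a half-ellipse or a cosine-shaped palatal profile in the coronal plane. The parameterization is an assumption for illustration only: the dome is taken to peak at `dome_height` on the midline and fall to zero at a lateral offset of `half_width`, the tongue is treated as flat at the midsagittal height `z`, and the function names are hypothetical. The paper's actual formulas may differ.

```python
import math


def _clamped_ratio(z, dome_height):
    """Tongue height as a fraction of dome height, limited to [0, 1]."""
    return min(max(z / dome_height, 0.0), 1.0)


def contact_offset_ellipse(z, dome_height, half_width):
    """Lateral offset y_c where a half-ellipse dome drops to height z.

    Assumed profile: h(y) = dome_height * sqrt(1 - (y / half_width)**2).
    """
    r = _clamped_ratio(z, dome_height)
    return half_width * math.sqrt(1.0 - r * r)


def contact_offset_cosine(z, dome_height, half_width):
    """Lateral offset y_c where a cosine dome drops to height z.

    Assumed profile: h(y) = dome_height * cos(pi * y / (2 * half_width)).
    """
    r = _clamped_ratio(z, dome_height)
    return (2.0 * half_width / math.pi) * math.acos(r)


if __name__ == "__main__":
    # Hypothetical dimensions in arbitrary units.
    for z in (0.0, 0.5, 1.0):
        print(z,
              contact_offset_ellipse(z, 1.0, 1.0),
              contact_offset_cosine(z, 1.0, 1.0))
```

Under these assumptions the tongue touches the palate for lateral offsets beyond `y_c` on each side, so repeating the calculation at each anterior-posterior position yields the outline of an electropalatography-like contact pattern.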

[41] Incorporating Contextual Paralinguistic Understanding in Large Speech-Language Models

Qiongqiong Wang,Hardik B. Sailor,Jeremy H. M. Wong,Tianchi Liu,Shuo Sun,Wenyu Zhang,Muhammad Huzaifah,Nancy Chen,Ai Ti Aw

Main category: cs.CL

TL;DR: This paper proposes explicit and implicit methods to improve empathetic reasoning in large speech language models by incorporating contextual paralinguistic information, with results showing significant performance improvements.

Details

Motivation: Current large speech language models often show limitations in empathetic reasoning due to a lack of training datasets integrating contextual content and paralinguistic cues. Method: The study proposes two approaches: (1) an explicit method that provides paralinguistic metadata (e.g., emotion annotations) directly to the LLM, and (2) an implicit method that generates new training QA pairs using emotion annotations and speech transcriptions. Result: The implicit method improves performance by 38.41% on a human-annotated QA benchmark, and performance reaches 46.02% when combined with the explicit method. Conclusion: The study concludes that incorporating contextual paralinguistic information through explicit and implicit methods enhances the empathetic reasoning capabilities of large speech language models. Abstract: Current large speech language models (Speech-LLMs) often exhibit limitations in empathetic reasoning, primarily due to the absence of training datasets that integrate both contextual content and paralinguistic cues. In this work, we propose two approaches to incorporate contextual paralinguistic information into model training: (1) an explicit method that provides paralinguistic metadata (e.g., emotion annotations) directly to the LLM, and (2) an implicit method that automatically generates novel training question-answer (QA) pairs using both categorical and dimensional emotion annotations alongside speech transcriptions. Our implicit method boosts performance (LLM-judged) by 38.41% on a human-annotated QA benchmark, reaching 46.02% when combined with the explicit approach, showing effectiveness in contextual paralinguistic understanding. We also validate the LLM judge by demonstrating its correlation with classification metrics, providing support for its reliability.

[42] MAQuA: Adaptive Question-Asking for Multidimensional Mental Health Screening using Item Response Theory

Vasudha Varadarajan,Hui Xu,Rebecca Astrid Boehme,Mariam Marlan Mirstrom,Sverker Sikstrom,H. Andrew Schwartz

Main category: cs.CL

TL;DR: MAQuA reduces the number of assessment questions needed for mental health screening by up to 87%, improving efficiency and reducing user burden.

Details

Motivation: Recent advances in large language models (LLMs) offer new opportunities for scalable, interactive mental health assessment, but excessive querying by LLMs burdens users and is inefficient for real-world screening across transdiagnostic symptom profiles. Method: MAQuA combines multi-outcome modeling on language responses with item response theory (IRT) and factor analysis to select the most informative questions across multiple dimensions at each turn. Result: Empirical results reveal that MAQuA reduces the number of assessment questions required for score stabilization by 50-87% compared to random ordering and demonstrates robust performance across both internalizing and externalizing domains. Conclusion: MAQuA is a powerful and efficient tool for scalable, nuanced, and interactive mental health screening, advancing the integration of LLM-based agents into real-world clinical workflows. Abstract: Recent advances in large language models (LLMs) offer new opportunities for scalable, interactive mental health assessment, but excessive querying by LLMs burdens users and is inefficient for real-world screening across transdiagnostic symptom profiles. We introduce MAQuA, an adaptive question-asking framework for simultaneous, multidimensional mental health screening. Combining multi-outcome modeling on language responses with item response theory (IRT) and factor analysis, MAQuA selects the questions with most informative responses across multiple dimensions at each turn to optimize diagnostic information, improving accuracy and potentially reducing response burden. Empirical results on a novel dataset reveal that MAQuA reduces the number of assessment questions required for score stabilization by 50-87% compared to random ordering (e.g., achieving stable depression scores with 71% fewer questions and eating disorder scores with 85% fewer questions). 
MAQuA demonstrates robust performance across both internalizing (depression, anxiety) and externalizing (substance use, eating disorder) domains, with early stopping strategies further reducing patient time and burden. These findings position MAQuA as a powerful and efficient tool for scalable, nuanced, and interactive mental health screening, advancing the integration of LLM-based agents into real-world clinical workflows.
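
The idea of choosing the most informative question across several dimensions can be sketched with a standard two-parameter logistic (2PL) IRT model. This is an assumption made for illustration: the abstract does not state which IRT model MAQuA uses, and the item parameters, dimension names, and function names below are hypothetical. The sketch picks the unasked question whose Fisher information, summed over the dimensions it loads on, is largest at the current trait estimates.

```python
import math


def item_information(theta, a, b):
    """Fisher information of a 2PL item at trait level theta.

    a is the discrimination and b the difficulty of the item.
    """
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def pick_next_question(thetas, items, asked):
    """Return the id of the unasked item with the highest summed information.

    thetas: dict mapping dimension name -> current trait estimate
    items:  dict mapping question id -> {dimension name: (a, b)}
    asked:  set of question ids that were already used
    """
    best_id, best_info = None, -1.0
    for qid, params in items.items():
        if qid in asked:
            continue
        info = sum(item_information(thetas[dim], a, b)
                   for dim, (a, b) in params.items())
        if info > best_info:
            best_id, best_info = qid, info
    return best_id


if __name__ == "__main__":
    thetas = {"depression": 0.0, "anxiety": 0.0}
    items = {
        "q1": {"depression": (1.0, 0.0)},
        "q2": {"depression": (1.0, 0.0), "anxiety": (1.0, 0.0)},
        "q3": {"anxiety": (2.0, 3.0)},
    }
    print(pick_next_question(thetas, items, asked=set()))
```

A question that is informative for several dimensions at once (here `q2`) is preferred, which is one simple way an adaptive screener could reduce the number of questions it needs.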

[43] "Pull or Not to Pull?": Investigating Moral Biases in Leading Large Language Models Across Ethical Dilemmas

Junchen Ding,Penghao Jiang,Zihao Xu,Ziqi Ding,Yichen Zhu,Jiaojiao Jiang,Yuekang Li

Main category: cs.CL

TL;DR: This study evaluates how 14 leading large language models handle ethical decisions across various moral frameworks, finding that while some models demonstrate strong moral reasoning, others produce ethically questionable outcomes, suggesting the need for better alignment and standardized moral reasoning benchmarks.

Details

Motivation: As LLMs increasingly mediate ethically sensitive decisions, understanding their moral reasoning processes is crucial. This study aims to uncover how LLMs handle ethical dilemmas and to identify their alignment with human moral philosophies. Method: The study employed a factorial prompting protocol to evaluate 14 leading LLMs across 27 diverse trolley problem scenarios framed by ten moral philosophies, eliciting 3,780 binary decisions and natural language justifications. Result: Findings show significant variability in model performance across ethical frames and model types. Reasoning-enhanced models are more decisive and provide structured justifications, with notable 'sweet zones' in altruistic, fairness, and virtue ethics framings. However, models struggle with frames involving kinship, legality, or self-interest, often producing ethically controversial outcomes. Conclusion: The study concludes that moral reasoning should be a primary focus in aligning large language models (LLMs), advocating for standardized benchmarks that assess not only decisions but also the reasoning behind them. Abstract: As large language models (LLMs) increasingly mediate ethically sensitive decisions, understanding their moral reasoning processes becomes imperative. This study presents a comprehensive empirical evaluation of 14 leading LLMs, both reasoning enabled and general purpose, across 27 diverse trolley problem scenarios, framed by ten moral philosophies, including utilitarianism, deontology, and altruism. Using a factorial prompting protocol, we elicited 3,780 binary decisions and natural language justifications, enabling analysis along axes of decisional assertiveness, explanation answer consistency, public moral alignment, and sensitivity to ethically irrelevant cues. 
Our findings reveal significant variability across ethical frames and model types: reasoning enhanced models demonstrate greater decisiveness and structured justifications, yet do not always align better with human consensus. Notably, "sweet zones" emerge in altruistic, fairness, and virtue ethics framings, where models achieve a balance of high intervention rates, low explanation conflict, and minimal divergence from aggregated human judgments. However, models diverge under frames emphasizing kinship, legality, or self interest, often producing ethically controversial outcomes. These patterns suggest that moral prompting is not only a behavioral modifier but also a diagnostic tool for uncovering latent alignment philosophies across providers. We advocate for moral reasoning to become a primary axis in LLM alignment, calling for standardized benchmarks that evaluate not just what LLMs decide, but how and why.
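
The reported total of 3,780 decisions is consistent with a full factorial design in which each of the 14 models answers each of the 27 scenarios under each of the ten moral framings once. The short sketch below enumerates such a grid; the labels are placeholders, since the actual model, scenario, and framing names are not listed in this entry.

```python
from itertools import product

# Placeholder labels standing in for the real models, scenarios, and framings.
models = [f"model_{i}" for i in range(14)]
scenarios = [f"scenario_{i}" for i in range(27)]
philosophies = [f"philosophy_{i}" for i in range(10)]


def build_prompt_grid(models, scenarios, philosophies):
    """Enumerate every (model, scenario, philosophy) cell of the design."""
    return list(product(models, scenarios, philosophies))


grid = build_prompt_grid(models, scenarios, philosophies)
print(len(grid))  # 14 * 27 * 10 = 3780
```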

[44] ARCE: Augmented RoBERTa with Contextualized Elucidations for NER in Automated Rule Checking

Jian Chen,Jinbao Tian,Yankui Li,Zhou Li

Main category: cs.CL

TL;DR: ARCE improves named entity recognition in the AEC domain by using an LLM to generate simple explanations for pre-training a RoBERTa model.

Details

Motivation: Standard pre-trained models struggle with domain-specific terminology and complex contexts in the AEC domain, and existing solutions are labor-intensive and costly. Method: ARCE utilizes an LLM to generate a corpus of simple explanations (Cote), which is used to pre-train a RoBERTa model before fine-tuning it for NER tasks. Result: ARCE achieved a Macro-F1 score of 77.20% on a benchmark AEC dataset, setting a new state-of-the-art. Conclusion: ARCE is a novel approach that leverages LLMs for automated knowledge generation to enhance smaller models in the AEC domain, outperforming existing methods. Abstract: Accurate information extraction from specialized texts is a critical challenge, particularly for named entity recognition (NER) in the architecture, engineering, and construction (AEC) domain to support automated rule checking (ARC). The performance of standard pre-trained models is often constrained by the domain gap, as they struggle to interpret the specialized terminology and complex relational contexts inherent in AEC texts. Although this issue can be mitigated by further pre-training on large, human-curated domain corpora, as exemplified by methods like ARCBERT, this approach is both labor-intensive and cost-prohibitive. Consequently, leveraging large language models (LLMs) for automated knowledge generation has emerged as a promising alternative. However, the optimal strategy for generating knowledge that can genuinely enhance smaller, efficient models remains an open question. To address this, we propose ARCE (augmented RoBERTa with contextualized elucidations), a novel approach that systematically explores and optimizes this generation process. ARCE employs an LLM to first generate a corpus of simple, direct explanations, which we term Cote, and then uses this corpus to incrementally pre-train a RoBERTa model prior to its fine-tuning on the downstream task. 
Our extensive experiments show that ARCE establishes a new state-of-the-art on a benchmark AEC dataset, achieving a Macro-F1 score of 77.20%. This result also reveals a key finding: simple, explanation-based knowledge proves surprisingly more effective than complex, role-based rationales for this task. The code is publicly available at: https://github.com/nxcc-lab/ARCE.

Yexing Du,Kaiyuan Liu,Youcheng Pan,Zheng Chu,Bo Yang,Xiaocheng Feng,Yang Xiang,Ming Liu

Main category: cs.CL

TL;DR: This paper introduces CCFQA, a new benchmark for evaluating cross-lingual and cross-modal factuality in MLLMs, and proposes a few-shot transfer learning strategy to improve multilingual Spoken Question Answering performance.

Details

Motivation: The motivation stems from the increasing use of LLMs in multilingual contexts and the need to ensure hallucination-free factuality, especially in speech processing, which is often overlooked in existing benchmarks. Method: The researchers proposed a novel Cross-lingual and Cross-modal Factuality benchmark (CCFQA), which includes parallel speech-text factual questions across 8 languages. They conducted experiments and applied a few-shot transfer learning strategy. Result: Current MLLMs showed significant challenges on the CCFQA benchmark, but the proposed few-shot transfer learning strategy achieved competitive performance with GPT-4o-mini-Audio using only 5-shot training. Conclusion: The study concludes that current MLLMs face challenges in cross-lingual and cross-modal factuality, but a proposed few-shot transfer learning strategy shows effectiveness in transferring QA capabilities from English to multilingual SQA tasks. Abstract: As Large Language Models (LLMs) are increasingly popularized in the multilingual world, ensuring hallucination-free factuality becomes markedly crucial. However, existing benchmarks for evaluating the reliability of Multimodal Large Language Models (MLLMs) predominantly focus on textual or visual modalities with a primary emphasis on English, which creates a gap in evaluation when processing multilingual input, especially in speech. To bridge this gap, we propose a novel Cross-lingual and Cross-modal Factuality benchmark (CCFQA). Specifically, the CCFQA benchmark contains parallel speech-text factual questions across 8 languages, designed to systematically evaluate MLLMs' cross-lingual and cross-modal factuality capabilities. Our experimental results demonstrate that current MLLMs still face substantial challenges on the CCFQA benchmark.
Furthermore, we propose a few-shot transfer learning strategy that effectively transfers the Question Answering (QA) capabilities of LLMs in English to multilingual Spoken Question Answering (SQA) tasks, achieving competitive performance with GPT-4o-mini-Audio using just 5-shot training. We release CCFQA as a foundational research resource to promote the development of MLLMs with more robust and reliable speech understanding capabilities. Our code and dataset are available at https://github.com/yxduir/ccfqa.

[46] HealthBranches: Synthesizing Clinically-Grounded Question Answering Datasets via Decision Pathways

Cristian Cosentino,Annamaria Defilippo,Marco Dossena,Christopher Irwin,Sara Joubbi,Pietro Liò

Main category: cs.CL

TL;DR: HealthBranches is a new benchmark dataset for medical Q&A designed to assess complex reasoning in LLMs, featuring 4,063 patient cases with detailed reasoning paths and supporting both open-ended and multiple-choice formats.

Details

Motivation: To evaluate complex reasoning in Large Language Models (LLMs) and support the development of more trustworthy and interpretable LLMs in healthcare. Method: A semi-automated pipeline was used to generate the dataset, transforming explicit decision pathways from medical sources into realistic patient cases with questions and answers. Result: The creation of HealthBranches, a benchmark dataset with 4,063 case studies across 17 healthcare topics, featuring both open-ended and multiple-choice questions, along with full reasoning paths. Conclusion: HealthBranches serves as a foundational tool for developing more trustworthy and clinically reliable LLMs in high-stakes domains and is also a valuable educational resource. Abstract: HealthBranches is a novel benchmark dataset for medical Question-Answering (Q&A), specifically designed to evaluate complex reasoning in Large Language Models (LLMs). This dataset is generated through a semi-automated pipeline that transforms explicit decision pathways from medical sources into realistic patient cases with associated questions and answers. Covering 4,063 case studies across 17 healthcare topics, each data point is based on clinically validated reasoning chains. HealthBranches supports both open-ended and multiple-choice question formats and uniquely includes the full reasoning path for each Q&A. Its structured design enables robust evaluation of LLMs' multi-step inference capabilities, including their performance in structured Retrieval-Augmented Generation (RAG) contexts. HealthBranches establishes a foundation for the development of more trustworthy, interpretable, and clinically reliable LLMs in high-stakes domains while also serving as a valuable resource for educational purposes.

[47] ObfusQAte: A Proposed Framework to Evaluate LLM Robustness on Obfuscated Factual Question Answering

Shubhra Ghosh,Abhilekh Borah,Aditya Kumar Guru,Kripabandhu Ghosh

Main category: cs.CL

TL;DR: This study introduces ObfusQA, a new framework for evaluating the robustness of Large Language Models against obfuscated questions, revealing that LLMs often fail or hallucinate when faced with nuanced variations.

Details

Motivation: The motivation is to systematically evaluate the limitations of Large Language Models (LLMs) when dealing with obfuscated versions of questions, as no previous studies have addressed this aspect of LLM robustness. Method: A novel technique called ObfusQAte was proposed, introducing ObfusQA, a framework with multi-tiered obfuscation levels designed to examine LLM capabilities across three dimensions: Named-Entity Indirection, Distractor Indirection, and Contextual Overload. Result: ObfusQA provides a comprehensive benchmark for evaluating LLM robustness and adaptability, capturing fine-grained distinctions in language. The study observed that LLMs often fail or generate hallucinated responses when confronted with nuanced variations in questions. Conclusion: The study concludes that LLMs tend to fail or generate hallucinated responses when faced with nuanced variations in questions, highlighting the need for further research and improvement in their robustness and adaptability. Abstract: The rapid proliferation of Large Language Models (LLMs) has significantly contributed to the development of equitable AI systems capable of factual question-answering (QA). However, no known study tests the LLMs' robustness when presented with obfuscated versions of questions. To systematically evaluate these limitations, we propose a novel technique, ObfusQAte and, leveraging the same, introduce ObfusQA, a comprehensive, first of its kind, framework with multi-tiered obfuscation levels designed to examine LLM capabilities across three distinct dimensions: (i) Named-Entity Indirection, (ii) Distractor Indirection, and (iii) Contextual Overload. By capturing these fine-grained distinctions in language, ObfusQA provides a comprehensive benchmark for evaluating LLM robustness and adaptability. Our study observes that LLMs exhibit a tendency to fail or generate hallucinated responses when confronted with these increasingly nuanced variations. 
To foster research in this direction, we make ObfusQAte publicly available.

[48] Strategies of Code-switching in Human-Machine Dialogs

Dean Geckt,Melinda Fricke,Shuly Wintner

Main category: cs.CL

TL;DR: A code-switching chatbot was used to study bilingual language interactions, showing that predictable and grammatically correct code-switching is essential for user enjoyment and task success.

Details

Motivation: To better understand the characteristics of code-switched language and explore the potential of using chatbots for research on bilingual language use. Method: A chatbot was developed to conduct a Map Task with human participants using code-switched Spanish and English. The bot was prompted to code-switch using different strategies to assess participant reactions and task performance. Result: Participants enjoyed interacting with the bot when its code-switching was predictable and grammatically correct. However, random or ungrammatical code-switching led to decreased enjoyment and task performance. Conclusion: The study concludes that while code-switching chatbots hold promise for researching bilingual language use, their effectiveness depends on producing predictable and grammatically correct code-switched language. Abstract: Most people are multilingual, and most multilinguals code-switch, yet the characteristics of code-switched language are not fully understood. We developed a chatbot capable of completing a Map Task with human participants using code-switched Spanish and English. In two experiments, we prompted the bot to code-switch according to different strategies, examining (1) the feasibility of such experiments for investigating bilingual language use, and (2) whether participants would be sensitive to variations in discourse and grammatical patterns. Participants generally enjoyed code-switching with our bot as long as it produced predictable code-switching behavior; when code-switching was random or ungrammatical (as when producing unattested incongruent mixed-language noun phrases, such as 'la fork'), participants enjoyed the task less and were less successful at completing it. These results underscore the potential downsides of deploying insufficiently developed multilingual language technology, while also illustrating the promise of such technology for conducting research on bilingual language use.

[49] Think Before You Talk: Enhancing Meaningful Dialogue Generation in Full-Duplex Speech Language Models with Planning-Inspired Text Guidance

Wenqian Cui,Lei Zhu,Xiaohui Li,Zhihan Guo,Haoli Bai,Lu Hou,Irwin King

Main category: cs.CL

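TL;DR: This paper proposes TurnGuide, a new method for improving the conversational abilities of end-to-end full-duplex speech language models. Mimicking human conversational planning, it dynamically segments assistant speech into dialogue turns and generates turn-level text guidance before speech output, resolving the insertion timing and length problems and significantly improving the models' conversational abilities.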

Details

Motivation: End-to-end full-duplex speech language models (e2e FD-SLMs) show promise for modeling complex conversational dynamics, but their conversational abilities often degrade because of long speech sequences and limited high-quality spoken dialogue data. Text-guided speech generation runs into timing and length problems when textual guidance is integrated into double-channel audio streams, disrupting the time alignment that natural interaction requires. A method that effectively addresses these problems is therefore needed. Method: TurnGuide, a novel planning-inspired approach, mimics human conversational planning by dynamically segmenting assistant speech into dialogue turns and generating turn-level text guidance before speech output, which resolves the insertion timing and length problems. Result: Extensive experiments show that TurnGuide significantly improves the conversational abilities of e2e FD-SLMs, enabling them to generate semantically meaningful and coherent speech while maintaining natural conversational flow. Conclusion: TurnGuide effectively improves the conversational abilities of e2e FD-SLMs, enabling them to generate semantically clear and coherent speech while maintaining natural conversational flow. Abstract: Full-Duplex Speech Language Models (FD-SLMs) are specialized foundation models designed to enable natural, real-time spoken interactions by modeling complex conversational dynamics such as interruptions, backchannels, and overlapping speech, and End-to-end (e2e) FD-SLMs leverage real-world double-channel conversational data to capture nuanced two-speaker dialogue patterns for human-like interactions. However, they face a critical challenge -- their conversational abilities often degrade compared to pure-text conversation due to prolonged speech sequences and limited high-quality spoken dialogue data. While text-guided speech generation could mitigate these issues, it suffers from timing and length issues when integrating textual guidance into double-channel audio streams, disrupting the precise time alignment essential for natural interactions. To address these challenges, we propose TurnGuide, a novel planning-inspired approach that mimics human conversational planning by dynamically segmenting assistant speech into dialogue turns and generating turn-level text guidance before speech output, which effectively resolves both insertion timing and length challenges. Extensive experiments demonstrate our approach significantly improves e2e FD-SLMs' conversational abilities, enabling them to generate semantically meaningful and coherent speech while maintaining natural conversational flow. Demos are available at https://dreamtheater123.github.io/TurnGuide-Demo/. Code will be available at https://github.com/dreamtheater123/TurnGuide.

[50] Grounding Multilingual Multimodal LLMs With Cultural Knowledge

Jean de Dieu Nyandwi,Yueqi Song,Simran Khanuja,Graham Neubig

Main category: cs.CL

TL;DR: This paper introduces a new approach to improve Multimodal Large Language Models (MLLMs) by grounding them in cultural knowledge, resulting in a globally inclusive model named CulturalPangea that performs well across diverse languages and cultures without sacrificing general performance.

Details

Motivation: Multimodal Large Language Models (MLLMs) perform poorly in interpreting long-tail cultural entities and in low-resource languages, limiting their global inclusivity. This research aims to bridge this cultural gap by directly grounding MLLMs in cultural knowledge. Method: The researchers used a data-centric approach, leveraging a large-scale knowledge graph from Wikidata to collect culturally significant images and generate synthetic multilingual visual question answering data. They developed a dataset named CulturalGround containing 22 million VQA pairs across 42 countries and 39 languages, and trained an open-source MLLM called CulturalPangea on this dataset while preserving general abilities with multilingual instruction-tuning data. Result: CulturalPangea achieved state-of-the-art performance among open models on culture-focused multilingual multimodal benchmarks, outperforming previous models by an average of 5.0, without degrading performance on mainstream vision-language tasks. Conclusion: The study concludes that grounding MLLMs in cultural knowledge using a data-centric approach can significantly reduce the cultural gap in these models and promote inclusivity in multimodal systems globally. Abstract: Multimodal Large Language Models excel in high-resource settings, but often misinterpret long-tail cultural entities and underperform in low-resource languages. To address this gap, we propose a data-centric approach that directly grounds MLLMs in cultural knowledge. Leveraging a large scale knowledge graph from Wikidata, we collect images that represent culturally significant entities, and generate synthetic multilingual visual question answering data. The resulting dataset, CulturalGround, comprises 22 million high-quality, culturally-rich VQA pairs spanning 42 countries and 39 languages. We train an open-source MLLM CulturalPangea on CulturalGround, interleaving standard multilingual instruction-tuning data to preserve general abilities. 
CulturalPangea achieves state-of-the-art performance among open models on various culture-focused multilingual multimodal benchmarks, outperforming prior models by an average of 5.0 without degrading results on mainstream vision-language tasks. Our findings show that our targeted, culturally grounded approach could substantially narrow the cultural gap in MLLMs and offer a practical path towards globally inclusive multimodal systems.

[51] Let's Revise Step-by-Step: A Unified Local Search Framework for Code Generation with LLMs

Zhiyi Lyu,Jianguo Huang,Yanchen Deng,Steven Hoi,Bo An

Main category: cs.CL

TL;DR: ReLoc is a new local search framework for code generation that improves efficiency and performance by using step-by-step revisions and a specialized reward model, outperforming previous approaches.

Details

Motivation: The authors aim to overcome the efficiency and scalability challenges faced by Large Language Models (LLMs) in code generation, particularly the limitations of construction-based and improvement-based methods. Method: ReLoc uses a step-by-step code revision approach with four components: initial code drafting, neighborhood code generation, candidate evaluation, and incumbent code updating. It also employs a revision reward model based on revision distance. Result: Extensive experiments show that ReLoc significantly outperforms existing methods in diverse code generation tasks. Conclusion: ReLoc is a unified local search framework that achieves superior performance in code generation tasks compared to both construction-based tree search and state-of-the-art improvement-based methods. Abstract: Large Language Models (LLMs) with inference-time scaling techniques show promise for code generation, yet face notable efficiency and scalability challenges. Construction-based tree-search methods suffer from rapid growth in tree size, high token consumption, and lack of anytime property. In contrast, improvement-based methods offer better performance but often struggle with uninformative reward signals and inefficient search strategies. In this work, we propose ReLoc, a unified local search framework which effectively performs step-by-step code revision. Specifically, ReLoc explores a series of local revisions through four key algorithmic components: initial code drafting, neighborhood code generation, candidate evaluation, and incumbent code updating, each of which can be instantiated with specific decision rules to realize different local search algorithms such as Hill Climbing (HC) or Genetic Algorithm (GA). Furthermore, we develop a specialized revision reward model that evaluates code quality based on revision distance to produce fine-grained preferences that guide the local search toward more promising candidates.
Finally, our extensive experimental results demonstrate that our approach achieves superior performance across diverse code generation tasks, significantly outperforming both construction-based tree search as well as the state-of-the-art improvement-based code generation methods.
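
The four components named in the abstract (drafting, neighborhood generation, candidate evaluation, incumbent updating) map naturally onto a generic hill-climbing loop. The following is a minimal sketch of that generic loop, not ReLoc's implementation: the function `hill_climb`, its parameters, and the toy integer search space are all assumptions made for illustration, and a plain scoring function stands in for the paper's LLM-based revision step and revision reward model.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def hill_climb(draft: Callable[[], T],
               neighbors: Callable[[T], Iterable[T]],
               score: Callable[[T], float],
               max_steps: int = 50) -> T:
    """Generic hill climbing over an arbitrary candidate type."""
    incumbent = draft()                          # 1. initial drafting
    for _ in range(max_steps):
        candidates = list(neighbors(incumbent))  # 2. neighborhood generation
        if not candidates:
            break
        best = max(candidates, key=score)        # 3. candidate evaluation
        if score(best) <= score(incumbent):
            break                                # no improving neighbor: local optimum
        incumbent = best                         # 4. incumbent updating
    return incumbent


# Toy stand-in for code revision: search the integers for the value 7.
result = hill_climb(
    draft=lambda: 0,
    neighbors=lambda x: [x - 1, x + 1],
    score=lambda x: -(x - 7) ** 2,
)
print(result)  # 7
```

Swapping the decision rules (for example, keeping a population of incumbents and recombining them) would turn the same skeleton into a genetic-algorithm variant, which is the kind of instantiation the abstract describes.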

[52] Positional Biases Shift as Inputs Approach Context Window Limits

Blerta Veseli,Julian Chibane,Mariya Toneva,Alexander Koller

Main category: cs.CL

TL;DR: This paper identifies that the LiM effect in LLMs is most significant when inputs use up to half of the model's context window, and beyond that, recency bias dominates, suggesting that retrieval is key to reasoning in LLMs.

Details

Motivation: The study aims to clarify the intensity and conditions under which positional biases, such as the LiM effect, manifest in large language models (LLMs), as prior long-context studies have yielded inconsistent results. Method: A comprehensive analysis was conducted using relative input lengths concerning each model's context window to study positional biases like the LiM effect, primacy bias, and recency bias. Result: The LiM effect is strongest when inputs occupy up to half of a model's context window. Beyond that, recency bias remains stable, while primacy bias diminishes, resulting in a distance-based bias where performance improves when relevant information is closer to the end of the input. Conclusion: The Lost in the Middle (LiM) effect is most pronounced when inputs take up to 50% of a model's context window, beyond which primacy bias weakens while recency bias remains, leading to a distance-based bias. Retrieval is a prerequisite for reasoning in LLMs, and positional biases in reasoning are mainly inherited from retrieval. Abstract: Large Language Models (LLMs) often struggle to use information across long inputs effectively. Prior work has identified positional biases, such as the Lost in the Middle (LiM) effect, where models perform better when information appears at the beginning (primacy bias) or end (recency bias) of the input, rather than in the middle. However, long-context studies have not consistently replicated these effects, raising questions about their intensity and the conditions under which they manifest. To address this, we conducted a comprehensive analysis using relative rather than absolute input lengths, defined with respect to each model's context window. Our findings reveal that the LiM effect is strongest when inputs occupy up to 50% of a model's context window. Beyond that, the primacy bias weakens, while recency bias remains relatively stable. 
This effectively eliminates the LiM effect; instead, we observe a distance-based bias, where model performance is better when relevant information is closer to the end of the input. Furthermore, our results suggest that successful retrieval is a prerequisite for reasoning in LLMs, and that the observed positional biases in reasoning are largely inherited from retrieval. These insights have implications for long-context tasks, the design of future LLM benchmarks, and evaluation methodologies for LLMs handling extended inputs.

[53] ALOPE: Adaptive Layer Optimization for Translation Quality Estimation using Large Language Models

Archchana Sindhujan,Shenbin Qian,Chan Chi Chun Matthew,Constantin Orasan,Diptesh Kanojia

Main category: cs.CL

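TL;DR: This paper proposes ALOPE, a framework that uses layer optimization to improve the cross-lingual performance of large language models on machine translation quality estimation, combining LoRA with a multi-head regression strategy to improve quality assessment.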

Details

Motivation: Existing LLM-based quality estimation systems are constrained by their causal language modeling pre-training objective and struggle with low-resource languages and cross-lingual tasks, so a more effective optimization framework is needed. Method: ALOPE combines low-rank adapters (LoRA) with regression task heads and applies layer-optimization strategies, including dynamic weighting and multi-head regression, to improve the accuracy of quality estimation. Result: ALOPE improves over several existing LLM-based QE approaches, and experiments show that the intermediate Transformer layers of LLMs provide contextual representations better suited to the cross-lingual QE task. Conclusion: By restructuring Transformer representations, ALOPE effectively improves the performance of LLM-based quality estimation systems on cross-lingual tasks; the publicly available models and code also help existing MT frameworks integrate QE capabilities. Abstract: Large Language Models (LLMs) have shown remarkable performance across a wide range of natural language processing tasks. Quality Estimation (QE) for Machine Translation (MT), which assesses the quality of a source-target pair without relying on reference translations, remains a challenging cross-lingual task for LLMs. The challenges stem from the inherent limitations of existing LLM-based QE systems, which are pre-trained for causal language modelling rather than regression-specific tasks, further exacerbated by the presence of low-resource languages given the pre-training data distribution. This paper introduces ALOPE, an adaptive layer-optimization framework designed to enhance LLM-based QE by restructuring Transformer representations through layer-wise adaptation for improved regression-based prediction. Our framework integrates low-rank adapters (LoRA) with regression task heads, leveraging selected pre-trained Transformer layers for improved cross-lingual alignment. In addition to the layer-specific adaptation, ALOPE introduces two strategies: dynamic weighting, which adaptively combines representations from multiple layers, and multi-head regression, which aggregates regression losses from multiple heads for QE. Our framework shows improvements over various existing LLM-based QE approaches. Empirical evidence suggests that intermediate Transformer layers in LLMs provide contextual representations that are more aligned with the cross-lingual nature of the QE task. We make resultant models and framework code publicly available for further research, also allowing existing LLM-based MT frameworks to be scaled with QE capabilities.
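
One common way to "adaptively combine representations from multiple layers" is a softmax-weighted sum with one learnable scalar per layer. The abstract does not specify ALOPE's exact formulation, so the sketch below is an assumption for illustration only: the names `softmax` and `combine_layers` are hypothetical, and plain Python lists stand in for tensors.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_layers(layer_vectors, layer_logits):
    """Weighted sum of per-layer vectors; weights are softmax(layer_logits).

    layer_vectors: one representation (list of floats) per selected layer.
    layer_logits: one scalar per layer (learnable in a real model, fixed here).
    """
    weights = softmax(layer_logits)
    dim = len(layer_vectors[0])
    return [
        sum(w * vec[i] for w, vec in zip(weights, layer_vectors))
        for i in range(dim)
    ]


# Two toy "layers" with equal logits, so the result is their plain average.
combined = combine_layers([[1.0, 0.0], [3.0, 2.0]], [0.0, 0.0])
print(combined)  # [2.0, 1.0]
```

In a trained model the logits would be updated by gradient descent, letting the regression head lean on whichever layers carry the most useful signal.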

[54] Augmenting Bias Detection in LLMs Using Topological Data Analysis

Keshav Varadarajan,Tananun Songdechakraiwut

Main category: cs.CL

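TL;DR: This paper proposes a new method that uses topological data analysis to identify sources of bias in large language models, and finds that bias is concentrated in certain attention heads.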

Details

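Motivation: Although many bias detection methods have been proposed, methods that can pinpoint the specific sources of bias within large language models are lacking. Method: Topological data analysis is used to identify the attention heads in GPT-2 that contribute to the misrepresentation of identity groups in the StereoSet dataset. Result: The study finds that biases for particular categories are concentrated in certain attention heads, and proposes an extensible metric for identifying bias toward specific groups. Conclusion: The paper presents a method that uses topological data analysis to identify the attention heads in GPT-2 responsible for misrepresenting identity groups, and finds that biases for particular categories (such as gender or profession) are concentrated in certain attention heads. Abstract: Recently, many bias detection methods have been proposed to determine the level of bias a large language model captures. However, tests to identify which parts of a large language model are responsible for bias towards specific groups remain underdeveloped. In this study, we present a method using topological data analysis to identify which heads in GPT-2 contribute to the misrepresentation of identity groups present in the StereoSet dataset. We find that biases for particular categories, such as gender or profession, are concentrated in attention heads that act as hot spots. The metric we propose can also be used to determine which heads capture bias for a specific group within a bias category, and future work could extend this method to help de-bias large language models.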

[55] Word Clouds as Common Voices: LLM-Assisted Visualization of Participant-Weighted Themes in Qualitative Interviews

Joseph T. Colonel,Baihan Lin

Main category: cs.CL

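TL;DR: This paper presents ThemeClouds, an LLM-based visualization tool that generates thematic, participant-weighted word clouds from dialogue transcripts, providing more meaningful and actionable overviews than traditional methods.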

Details

Motivation: Traditional frequency-based word clouds perform poorly in conversational settings: they often surface meaningless filler words, ignore paraphrase, and fragment semantically related concepts. Researchers need a fast, interpretable way to summarize what participants actually said. Method: Large language models (LLMs) identify concept-level themes across the corpus, and the visualization is generated from the number of participants who mention each theme. Result: On interview data from a user study (31 participants; 155 transcripts), ThemeClouds surfaces actionable device concerns more effectively than traditional frequency-based word clouds and topic-modeling baselines (e.g., LDA, BERTopic). Conclusion: ThemeClouds is a more effective visualization tool than traditional methods: it better extracts meaningful themes from dialogue transcripts and gives researchers a more practical overview. Abstract: Word clouds are a common way to summarize qualitative interviews, yet traditional frequency-based methods often fail in conversational contexts: they surface filler words, ignore paraphrase, and fragment semantically related ideas. This limits their usefulness in early-stage analysis, when researchers need fast, interpretable overviews of what participants actually said. We introduce ThemeClouds, an open-source visualization tool that uses large language models (LLMs) to generate thematic, participant-weighted word clouds from dialogue transcripts. The system prompts an LLM to identify concept-level themes across a corpus and then counts how many unique participants mention each topic, yielding a visualization grounded in breadth of mention rather than raw term frequency. Researchers can customize prompts and visualization parameters, providing transparency and control. Using interviews from a user study comparing five recording-device configurations (31 participants; 155 transcripts, Whisper ASR), our approach surfaces more actionable device concerns than frequency clouds and topic-modeling baselines (e.g., LDA, BERTopic). We discuss design trade-offs for integrating LLM assistance into qualitative workflows, implications for interpretability and researcher agency, and opportunities for interactive analyses such as per-condition contrasts ("diff clouds").
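
The participant-weighting step described in the abstract, counting how many unique participants mention each theme rather than how often a term occurs, can be sketched in a few lines. This is an illustration under stated assumptions, not the ThemeClouds code: the function name `participant_weighted_counts` and the sample data are hypothetical, and the (participant, theme) pairs are assumed to come from a separate LLM theme-extraction step that is not shown.

```python
from collections import defaultdict


def participant_weighted_counts(theme_mentions):
    """Map each theme to the number of distinct participants who mention it.

    theme_mentions: iterable of (participant_id, theme) pairs. Repeated
    mentions by the same participant count only once.
    """
    participants_by_theme = defaultdict(set)
    for participant_id, theme in theme_mentions:
        participants_by_theme[theme].add(participant_id)
    return {theme: len(ids) for theme, ids in participants_by_theme.items()}


# Hypothetical mentions; p03 repeats "microphone noise" but counts once.
mentions = [
    ("p01", "battery life"),
    ("p01", "battery life"),
    ("p02", "battery life"),
    ("p02", "microphone noise"),
    ("p03", "microphone noise"),
    ("p03", "microphone noise"),
    ("p03", "comfort"),
]
weights = participant_weighted_counts(mentions)
print(weights)  # {'battery life': 2, 'microphone noise': 2, 'comfort': 1}
```

Using a set per theme is what makes the weight reflect breadth of mention: one talkative participant cannot inflate a theme's size in the cloud.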

[56] From Trial-and-Error to Improvement: A Systematic Analysis of LLM Exploration Mechanisms in RLVR

Jia Deng,Jie Chen,Zhipeng Chen,Daixuan Cheng,Fei Bai,Beichen Zhang,Yinqian Min,Yanzipeng Gao,Wayne Xin Zhao,Ji-Rong Wen

Main category: cs.CL

TL;DR: This technical report investigates the exploration capacities of reinforcement learning with verifiable rewards (RLVR) for enhancing large language models' reasoning capabilities, focusing on understanding the mechanisms that govern exploration behaviors and optimizing performance.

Details

Motivation: The motivation behind this study is to explore the fundamental mechanisms that govern LLMs' exploration behaviors in RLVR, which have been previously underexplored despite the empirical success of RLVR. Method: The report systematically investigates exploration capacities in RLVR, focusing on four aspects: exploration space shaping, entropy-performance exchange, RL performance optimization, and unification of previous insights with new empirical evidence. Result: The report develops quantitative metrics for exploration space shaping, analyzes entropy-performance exchange across various dimensions, and examines methods for RL performance optimization, ultimately aiming to offer a foundational framework for RLVR systems. Conclusion: This technical report concludes that understanding and enhancing exploration capacities in RLVR can provide a foundational framework for advancing RLVR systems and improving the reasoning capabilities of LLMs. Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs). Unlike traditional RL approaches, RLVR leverages rule-based feedback to guide LLMs in generating and refining complex reasoning chains -- a process critically dependent on effective exploration strategies. While prior work has demonstrated RLVR's empirical success, the fundamental mechanisms governing LLMs' exploration behaviors remain underexplored. This technical report presents a systematic investigation of exploration capacities in RLVR, covering four main aspects: (1) exploration space shaping, where we develop quantitative metrics to characterize LLMs' capability boundaries; (2) entropy-performance exchange, analyzed across training stages, individual instances, and token-level patterns; and (3) RL performance optimization, examining methods to effectively translate exploration gains into measurable improvements. 
By unifying previously identified insights with new empirical evidence, this work aims to provide a foundational framework for advancing RLVR systems.

[57] IBPS: Indian Bail Prediction System

Puspesh Kumar Srivastava,Uddeshya Raj,Praveen Patel,Shubham Kumar Nigam,Noel Shallum,Arnab Bhattacharya

Main category: cs.CL

TL;DR: This paper presents the Indian Bail Prediction System (IBPS), an AI-powered framework designed to assist in bail decision-making by predicting outcomes and generating legally sound rationales.

Details

Motivation: Bail decisions are among the most frequently adjudicated matters in Indian courts, yet they remain plagued by subjectivity, delays, and inconsistencies. With over 75% of India's prison population comprising undertrial prisoners, many from socioeconomically disadvantaged backgrounds, the lack of timely and fair bail adjudication exacerbates human rights concerns and contributes to systemic judicial backlog. Method: We curate and release a large-scale dataset of 150,430 High Court bail judgments, enriched with structured annotations such as age, health, criminal history, crime category, custody duration, statutes, and judicial reasoning. We fine-tune a large language model using parameter-efficient techniques and evaluate its performance across multiple configurations, with and without statutory context, and with RAG. Result: Our results demonstrate that models fine-tuned with statutory knowledge significantly outperform baselines, achieving strong accuracy and explanation quality, and generalize well to a test set independently annotated by legal experts. Conclusion: IBPS offers a transparent, scalable, and reproducible solution to support data-driven legal assistance, reduce bail delays, and promote procedural fairness in the Indian judicial system. Abstract: Bail decisions are among the most frequently adjudicated matters in Indian courts, yet they remain plagued by subjectivity, delays, and inconsistencies. With over 75% of India's prison population comprising undertrial prisoners, many from socioeconomically disadvantaged backgrounds, the lack of timely and fair bail adjudication exacerbates human rights concerns and contributes to systemic judicial backlog. In this paper, we present the Indian Bail Prediction System (IBPS), an AI-powered framework designed to assist in bail decision-making by predicting outcomes and generating legally sound rationales based solely on factual case attributes and statutory provisions. 
We curate and release a large-scale dataset of 150,430 High Court bail judgments, enriched with structured annotations such as age, health, criminal history, crime category, custody duration, statutes, and judicial reasoning. We fine-tune a large language model using parameter-efficient techniques and evaluate its performance across multiple configurations, with and without statutory context, and with RAG. Our results demonstrate that models fine-tuned with statutory knowledge significantly outperform baselines, achieving strong accuracy and explanation quality, and generalize well to a test set independently annotated by legal experts. IBPS offers a transparent, scalable, and reproducible solution to support data-driven legal assistance, reduce bail delays, and promote procedural fairness in the Indian judicial system.

[58] Keyword-Centric Prompting for One-Shot Event Detection with Self-Generated Rationale Enhancements

Ziheng Li,Zhi-Hong Deng

Main category: cs.CL

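TL;DR: This paper proposes KeyCP++, a keyword-centric chain-of-thought prompting method that addresses the challenge of one-shot event detection with in-context learning.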

Details

Motivation: Large language models lack an accurate understanding of event triggers in event detection and tend to over-interpret, so a new approach is needed. Method: KeyCP++ constructs a trigger discrimination prompting template and incorporates exemplary triggers into chain-of-thought prompts to generate in-depth rationales. Result: Experiments show that KeyCP++ is markedly effective on one-shot event detection. Conclusion: KeyCP++ mitigates language models' over-reliance on keywords, promotes detection rule learning, and improves event detection performance. Abstract: Although the LLM-based in-context learning (ICL) paradigm has demonstrated considerable success across various natural language processing tasks, it encounters challenges in event detection. This is because LLMs lack an accurate understanding of event triggers and tend to over-interpret, which cannot be effectively corrected through in-context examples alone. In this paper, we focus on the most challenging one-shot setting and propose KeyCP++, a keyword-centric chain-of-thought prompting approach. KeyCP++ addresses the weaknesses of conventional ICL by automatically annotating the logical gaps between input text and detection results for the demonstrations. Specifically, to generate in-depth and meaningful rationale, KeyCP++ constructs a trigger discrimination prompting template. It incorporates the exemplary triggers (a.k.a. keywords) into the prompt as the anchor to simplify trigger profiling, let the LLM propose candidate triggers, and justify each candidate. These propose-and-judge rationales help LLMs mitigate over-reliance on the keywords and promote detection rule learning. Extensive experiments demonstrate the effectiveness of our approach, showcasing significant advancements in one-shot event detection.

[59] InterChart: Benchmarking Visual Reasoning Across Decomposed and Distributed Chart Information

Anirudh Iyengar Kaniyar Narayana Iyengar,Srija Mukhopadhyay,Adnan Qidwai,Shubhankar Singh,Dan Roth,Vivek Gupta

Main category: cs.CL

TL;DR: InterChart is a new benchmark for evaluating how well vision-language models reason across multiple charts, revealing their limitations on complex charts.

Details

Motivation: Existing benchmarks focus on isolated, visually uniform charts, whereas real-world applications require models to reason over multiple related charts. Method: The authors design InterChart, a diagnostic benchmark with three tiers of increasing difficulty that evaluates reasoning across multiple related charts. Result: State-of-the-art vision-language models show steep accuracy declines as chart complexity increases, and they struggle in particular with cross-chart integration. Conclusion: InterChart exposes systematic limitations of current vision-language models on complex, diverse charts and provides a rigorous framework for advancing multimodal reasoning. Abstract: We introduce InterChart, a diagnostic benchmark that evaluates how well vision-language models (VLMs) reason across multiple related charts, a task central to real-world applications such as scientific reporting, financial analysis, and public policy dashboards. Unlike prior benchmarks focusing on isolated, visually uniform charts, InterChart challenges models with diverse question types ranging from entity inference and trend correlation to numerical estimation and abstract multi-step reasoning grounded in 2-3 thematically or structurally related charts. We organize the benchmark into three tiers of increasing difficulty: (1) factual reasoning over individual charts, (2) integrative analysis across synthetically aligned chart sets, and (3) semantic inference over visually complex, real-world chart pairs. Our evaluation of state-of-the-art open and closed-source VLMs reveals consistent and steep accuracy declines as chart complexity increases. We find that models perform better when we decompose multi-entity charts into simpler visual units, underscoring their struggles with cross-chart integration. By exposing these systematic limitations, InterChart provides a rigorous framework for advancing multimodal reasoning in complex, multi-visual environments.

[60] LoSemB: Logic-Guided Semantic Bridging for Inductive Tool Retrieval

Luyao Zhuang,Qinggang Zhang,Huachi Zhou,Juhua Liu,Qing Li,Xiao Huang

Main category: cs.CL

TL;DR: This paper proposes LoSemB, a logic-guided framework for inductive tool retrieval that improves generalization to unseen tools by leveraging logical relationships without retraining.

Details

Motivation: The rapid expansion of tool repositories makes it impractical to include all tools within the input limits of LLMs. Existing methods struggle with unseen tools due to distribution shifts and unreliable similarity-based retrieval. Method: The paper introduces LoSemB, which includes a logic-based embedding alignment module and a relational augmented retrieval mechanism to handle distribution shifts and improve retrieval robustness. Result: Extensive experiments show that LoSemB outperforms state-of-the-art methods in inductive settings while maintaining effectiveness in transductive settings. Conclusion: LoSemB effectively addresses the challenges of inductive tool retrieval by leveraging logical information without costly retraining, achieving strong performance in both inductive and transductive settings. Abstract: Tool learning has emerged as a promising paradigm for large language models (LLMs) to solve many real-world tasks. Nonetheless, with the tool repository rapidly expanding, it is impractical to contain all tools within the limited input length of LLMs. To alleviate these issues, researchers have explored incorporating a tool retrieval module to select the most relevant tools or represent tools as unique tokens within LLM parameters. However, most state-of-the-art methods are under transductive settings, assuming all tools have been observed during training. Such a setting deviates from reality as the real-world tool repository is evolving and incorporates new tools frequently. When dealing with these unseen tools, which refer to tools not encountered during the training phase, these methods are limited by two key issues, including the large distribution shift and the vulnerability of similarity-based retrieval. 
To this end, inspired by human cognitive processes of mastering unseen tools through discovering and applying the logical information from prior experience, we introduce a novel Logic-Guided Semantic Bridging framework for inductive tool retrieval, namely, LoSemB, which aims to mine and transfer latent logical information for inductive tool retrieval without costly retraining. Specifically, LoSemB contains a logic-based embedding alignment module to mitigate distribution shifts and implements a relational augmented retrieval mechanism to reduce the vulnerability of similarity-based retrieval. Extensive experiments demonstrate that LoSemB achieves advanced performance in inductive settings while maintaining desirable effectiveness in the transductive setting.

[61] What am I missing here?: Evaluating Large Language Models for Masked Sentence Prediction

Charlie Wyatt,Aditya Joshi,Flora Salim

Main category: cs.CL

TL;DR: This study shows that despite their success in other areas, current LLMs struggle to predict masked sentences in low-structured domains, revealing limitations in their ability to maintain long-range coherence.

Details

Motivation: Next Token Prediction (NTP), commonly used in Transformer-based models, focuses on single-token prediction, which may limit a model's ability to maintain long-range coherence. This raises questions about the ability of LLMs to predict full sentences within structured documents. Method: The study evaluates three commercial LLMs on Masked Sentence Prediction (MSP) across three domains—ROCStories (narrative), Recipe1M (procedural), and Wikipedia (expository). Both fidelity and cohesiveness are assessed. Result: Commercial LLMs, despite excelling in other tasks, are shown to be ineffective at predicting masked sentences in low-structured domains, affecting global coherence across sentence boundaries. Conclusion: Transformer-based models like GPT-4o, Claude 3.5 Sonnet, and Gemini 2.0 Flash perform poorly in predicting masked sentences in low-structured domains, indicating a gap in their current capabilities. Abstract: Transformer-based models primarily rely on Next Token Prediction (NTP), which predicts the next token in a sequence based on the preceding context. However, NTP's focus on single-token prediction often limits a model's ability to plan ahead or maintain long-range coherence, raising questions about how well LLMs can predict longer contexts, such as full sentences within structured documents. While NTP encourages local fluency, it provides no explicit incentive to ensure global coherence across sentence boundaries-an essential skill for reconstructive or discursive tasks. To investigate this, we evaluate three commercial LLMs (GPT-4o, Claude 3.5 Sonnet, and Gemini 2.0 Flash) on Masked Sentence Prediction (MSP) - the task of infilling a randomly removed sentence - from three domains: ROCStories (narrative), Recipe1M (procedural), and Wikipedia (expository). We assess both fidelity (similarity to the original sentence) and cohesiveness (fit within the surrounding context). 
Our key finding reveals that commercial LLMs, despite their superlative performance in other tasks, are poor at predicting masked sentences in low-structured domains, highlighting a gap in current model capabilities.
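The Masked Sentence Prediction setup described above (remove one randomly chosen sentence and ask a model to fill it in) can be illustrated with a minimal sketch. This is not the authors' pipeline: the regex sentence splitter, the `<MASK>` placeholder, and the function name `make_msp_instance` are assumptions made for illustration only.

```python
import random
import re


def make_msp_instance(text, rng):
    """Mask one randomly chosen sentence; return (masked_text, target_sentence).

    Assumption: sentences end with '.', '!' or '?' followed by whitespace.
    A real evaluation would likely use a proper sentence tokenizer.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if len(sentences) < 2:
        raise ValueError("need at least two sentences to mask one")
    idx = rng.randrange(len(sentences))
    target = sentences[idx]
    # Replace the chosen sentence with a hypothetical placeholder token.
    masked = sentences[:idx] + ["<MASK>"] + sentences[idx + 1:]
    return " ".join(masked), target


story = (
    "Tom woke up late. He missed the bus. "
    "He walked to school. His teacher was not happy."
)
masked, target = make_msp_instance(story, random.Random(0))
```

A model's prediction for `<MASK>` could then be compared with `target` for fidelity and judged against the surrounding sentences for cohesiveness, the two criteria named in the abstract.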

Zhenliang Zhang,Junzhe Zhang,Xinyu Hu,HuiXuan Zhang,Xiaojun Wan

Main category: cs.CL

TL;DR: This study uncovers the causal link between social bias and faithfulness hallucinations in large language models, showing that biases significantly contribute to hallucinations, with varying effects depending on the bias type.

Details

Motivation: The research aims to explore the previously uninvestigated causal relationship between social bias and faithfulness hallucinations in LLMs. Method: The study uses a Structural Causal Model (SCM) to establish and validate causality and designs bias interventions to control confounders, supported by experiments on mainstream LLMs using the Bias Intervention Dataset (BID). Result: Experiments reveal that social biases significantly contribute to hallucinations, with varying causal effects depending on the type of bias. The study also identifies the impact of unfairness hallucinations linked to social bias. Conclusion: Biases are significant causes of faithfulness hallucinations in large language models, with differing causal effects based on the bias state. Abstract: Large language models (LLMs) have achieved remarkable success in various tasks, yet they remain vulnerable to faithfulness hallucinations, where the output does not align with the input. In this study, we investigate whether social bias contributes to these hallucinations, a causal relationship that has not been explored. A key challenge is controlling confounders within the context, which complicates the isolation of causality between bias states and hallucinations. To address this, we utilize the Structural Causal Model (SCM) to establish and validate the causality and design bias interventions to control confounders. In addition, we develop the Bias Intervention Dataset (BID), which includes various social biases, enabling precise measurement of causal effects. Experiments on mainstream LLMs reveal that biases are significant causes of faithfulness hallucinations, and the effect of each bias state differs in direction. We further analyze the scope of these causal effects across various models, specifically focusing on unfairness hallucinations, which are primarily targeted by social bias, revealing the subtle yet significant causal effect of bias on hallucination generation.

[63] SASST: Leveraging Syntax-Aware Chunking and LLMs for Simultaneous Speech Translation

Zeyu Yang,Lai Wei,Roman Koshkin,Xi Chen,Satoshi Nakamura

Main category: cs.CL

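TL;DR: This work proposes a grammar-based chunking strategy and develops the SASST translation framework, improving translation quality and content coherence.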

Details

Motivation: Improve translation quality and content coherence, and address word-order divergence. Method: A grammar-based chunking strategy is combined with the SASST framework, which dynamically outputs translation tokens or wait symbols. Result: Experiments show significant improvements in translation quality and validate the effectiveness of syntactic structure. Conclusion: SASST is an effective translation framework, and syntactic structure benefits SimulST systems. Abstract: This work proposes a grammar-based chunking strategy that segments input streams into semantically complete units by parsing dependency relations (e.g., noun phrase boundaries, verb-object structures) and punctuation features. The method ensures chunk coherence and minimizes semantic fragmentation. Building on this mechanism, we present SASST (Syntax-Aware Simultaneous Speech Translation), an end-to-end framework integrating a frozen Whisper encoder and a decoder-only LLM. The unified architecture dynamically outputs translation tokens or wait symbols to jointly optimize translation timing and content, with target-side reordering addressing word-order divergence. Experiments on the CoVoST2 multilingual corpus En-{De, Zh, Ja} demonstrate significant translation quality improvements across languages and validate the effectiveness of syntactic structures in LLM-driven SimulST systems.

[64] Grove MoE: Towards Efficient and Superior MoE LLMs with Adjugate Experts

Haoyuan Wu,Haoxing Chen,Xiaodong Chen,Zhanchao Zhou,Tieyuan Chen,Yihong Zhuang,Guoshan Lu,Zenan Huang,Junbo Zhao,Lin Liu,Zhenzhong Lan,Bei Yu,Jianguo Li

Main category: cs.CL

TL;DR: This paper introduces Grove MoE, a new Mixture of Experts (MoE) architecture that combines experts of different sizes to improve computational efficiency and model capacity.

Details

Motivation: Traditional MoE architectures use experts of a uniform size and cannot adjust the number of activated parameters to input complexity, which limits computational efficiency; a more flexible architecture is needed. Method: Inspired by the heterogeneous big.LITTLE CPU architecture, Grove MoE introduces experts of varying sizes and a dynamic activation mechanism, so the number of activated parameters adapts to input complexity. The authors also train two models, GroveMoE-Base and GroveMoE-Inst, by applying an upcycling strategy to Qwen3-30B-A3B-Base. Result: GroveMoE models dynamically activate 3.14-3.28B parameters based on token complexity and achieve performance comparable to state-of-the-art open-source models of similar or even larger size. Conclusion: By introducing heterogeneous experts and a dynamic activation mechanism, Grove MoE improves computational efficiency and performance, offering a new direction for large language model development. Abstract: The Mixture of Experts (MoE) architecture is a cornerstone of modern state-of-the-art (SOTA) large language models (LLMs). MoE models facilitate scalability by enabling sparse parameter activation. However, traditional MoE architecture uses homogeneous experts of a uniform size, activating a fixed number of parameters irrespective of input complexity and thus limiting computational efficiency. To overcome this limitation, we introduce Grove MoE, a novel architecture incorporating experts of varying sizes, inspired by the heterogeneous big.LITTLE CPU architecture. This architecture features novel adjugate experts with a dynamic activation mechanism, enabling model capacity expansion while maintaining manageable computational overhead. Building on this architecture, we present GroveMoE-Base and GroveMoE-Inst, 33B-parameter LLMs developed by applying an upcycling strategy to the Qwen3-30B-A3B-Base model during mid-training and post-training. GroveMoE models dynamically activate 3.14-3.28B parameters based on token complexity and achieve performance comparable to SOTA open-source models of similar or even larger size.

[65] Can You Trick the Grader? Adversarial Persuasion of LLM Judges

Yerin Hwang,Dongryeol Lee,Taegwan Kang,Yongil Kim,Kyomin Jung

Main category: cs.CL

TL;DR: This study is the first to show that LLM judges can be swayed by persuasive language, assigning higher scores to incorrect answers to math problems.

Details

Motivation: As LLMs increasingly serve as automated evaluators in practical settings, a key question is whether people can use persuasive language to obtain unfairly high scores. Method: Grounded in Aristotle's rhetorical principles, the study formalizes seven persuasion techniques and embeds them into responses to math problems to test scoring bias in LLM judges. Result: Persuasive language leads LLM judges to inflate scores by up to 8% on average, with the Consistency technique causing the most severe distortion. Conclusion: Persuasive language affects LLM judgments in scoring, particularly on mathematical reasoning tasks, where incorrect solutions receive inflated scores. Abstract: As large language models take on growing roles as automated evaluators in practical settings, a critical question arises: Can individuals persuade an LLM judge to assign unfairly high scores? This study is the first to reveal that strategically embedded persuasive language can bias LLM judges when scoring mathematical reasoning tasks, where correctness should be independent of stylistic variation. Grounded in Aristotle's rhetorical principles, we formalize seven persuasion techniques (Majority, Consistency, Flattery, Reciprocity, Pity, Authority, Identity) and embed them into otherwise identical responses. Across six math benchmarks, we find that persuasive language leads LLM judges to assign inflated scores to incorrect solutions, by up to 8% on average, with Consistency causing the most severe distortion. Notably, increasing model size does not substantially mitigate this vulnerability. Further analysis demonstrates that combining multiple persuasion techniques amplifies the bias, and pairwise evaluation is likewise susceptible. Moreover, the persuasive effect persists under counter prompting strategies, highlighting a critical vulnerability in LLM-as-a-Judge pipelines and underscoring the need for robust defenses against persuasion-based attacks.

[66] Evaluating Compositional Approaches for Focus and Sentiment Analysis

Olga Kellert,Muhammad Imran,Nicholas Hill Matlis,Mahmud Uz Zaman,Carlos Gómez-Rodríguez

Main category: cs.CL

TL;DR: This paper demonstrates that compositional syntactic rules used in Sentiment Analysis can be effectively applied to Focus Analysis, offering better interpretability and accuracy over non-compositional methods.

Details

Motivation: The motivation stems from the lack of quantitative evaluations of compositional approaches in Focus Analysis, despite their prevalence in Sentiment Analysis, and the close relationship between FA and SA. Method: The study uses a compositional approach based on syntactic rules like modification, coordination, and negation, represented through Universal Dependencies, and compares it with a non-compositional method, VADER, across suitable datasets. Result: The results show that the compositional approach offers advantages like interpretability and explainability and is more accurate compared to non-compositional methods like VADER. Conclusion: The paper concludes that compositional rules used in Sentiment Analysis can be effectively applied to Focus Analysis, establishing a close relationship between the two, with SA being a part of FA. Abstract: This paper summarizes the results of evaluating a compositional approach for Focus Analysis (FA) in Linguistics and Sentiment Analysis (SA) in Natural Language Processing (NLP). While quantitative evaluations of compositional and non-compositional approaches in SA exist in NLP, similar quantitative evaluations are very rare in FA in Linguistics that deal with linguistic expressions representing focus or emphasis such as "it was John who left". We fill this gap in research by arguing that compositional rules in SA also apply to FA because FA and SA are closely related meaning that SA is part of FA. Our compositional approach in SA exploits basic syntactic rules such as rules of modification, coordination, and negation represented in the formalism of Universal Dependencies (UDs) in English and applied to words representing sentiments from sentiment dictionaries. Some of the advantages of our compositional analysis method for SA in contrast to non-compositional analysis methods are interpretability and explainability. 
We test the accuracy of our compositional approach and compare it with a non-compositional approach VADER that uses simple heuristic rules to deal with negation, coordination and modification. In contrast to previous related work that evaluates compositionality in SA on long reviews, this study uses more appropriate datasets to evaluate compositionality. In addition, we generalize the results of compositional approaches in SA to compositional approaches in FA.

[67] Evaluating Large Language Models as Expert Annotators

Yu-Min Tseng,Wei-Lin Chen,Chung-Chi Chen,Hsin-Hsi Chen

Main category: cs.CL

TL;DR: LLMs struggle to replace human annotators in expert domains despite advanced reasoning techniques and multi-agent collaboration frameworks.

Details

Motivation: Textual data annotation is costly and labor-intensive, prompting exploration of LLMs as alternatives, especially in domains requiring expert knowledge where their effectiveness is understudied. Method: Evaluated individual LLMs and a multi-agent discussion framework across three specialized domains, incorporating reasoning models and inference-time techniques like chain-of-thought. Result: Individual LLMs with reasoning techniques showed minimal or negative gains. Reasoning models did not significantly outperform non-reasoning models. Some models, like Claude 3.7 Sonnet, were resistant to changing initial annotations despite correct external input. Conclusion: LLMs, even with advanced reasoning techniques and multi-agent frameworks, show limited effectiveness in replacing human annotators in specialized domains like finance, biomedicine, and law. Abstract: Textual data annotation, the process of labeling or tagging text with relevant information, is typically costly, time-consuming, and labor-intensive. While large language models (LLMs) have demonstrated their potential as direct alternatives to human annotators for general-domain natural language processing (NLP) tasks, their effectiveness on annotation tasks in domains requiring expert knowledge remains underexplored. In this paper, we investigate whether top-performing LLMs, which might be perceived as having expert-level proficiency in academic and professional benchmarks, can serve as direct alternatives to human expert annotators. To this end, we evaluate both individual LLMs and multi-agent approaches across three highly specialized domains: finance, biomedicine, and law. Specifically, we propose a multi-agent discussion framework to simulate a group of human annotators, where LLMs are tasked to engage in discussions by considering others' annotations and justifications before finalizing their labels. 
Additionally, we incorporate reasoning models (e.g., o3-mini) to enable a more comprehensive comparison. Our empirical results reveal that: (1) Individual LLMs equipped with inference-time techniques (e.g., chain-of-thought (CoT), self-consistency) show only marginal or even negative performance gains, contrary to prior literature suggesting their broad effectiveness. (2) Overall, reasoning models do not demonstrate statistically significant improvements over non-reasoning models in most settings. This suggests that extended long CoT provides relatively limited benefits for data annotation in specialized domains. (3) Certain model behaviors emerge in the multi-agent discussion environment. For instance, Claude 3.7 Sonnet with thinking rarely changes its initial annotations, even when other agents provide correct annotations or valid reasoning.
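Self-consistency, one of the inference-time techniques named above, is commonly implemented as a majority vote over several independently sampled answers. The sketch below shows only that aggregation step; it is an illustration, not the paper's code, and the function name `self_consistency_label` and the example labels are hypothetical. Sampling the annotations from an LLM is left out.

```python
from collections import Counter


def self_consistency_label(sampled_labels):
    """Return (majority_label, agreement_ratio) for a list of sampled labels.

    Ties are broken in favor of the label that was sampled first, which is
    how Counter.most_common orders elements with equal counts.
    """
    if not sampled_labels:
        raise ValueError("need at least one sampled label")
    counts = Counter(sampled_labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(sampled_labels)


# Hypothetical labels from five sampled reasoning paths on one item.
label, agreement = self_consistency_label(
    ["positive", "negative", "positive", "positive", "neutral"]
)
```

The agreement ratio is a convenient by-product: items with low agreement are natural candidates for routing to a human expert.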

[68] LLMs for Law: Evaluating Legal-Specific LLMs on Contract Understanding

Amrita Singh,H. Suhan Karaca,Aditya Joshi,Hye-young Paik,Jiaojiao Jiang

Main category: cs.CL

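TL;DR: This study comprehensively evaluates legal-specific LLMs on contract understanding tasks, finds that they outperform general-purpose models, and identifies models that set new SOTA results.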

Details

Motivation: No comprehensive evaluation covering multiple legal-specific LLMs currently exists for contract understanding tasks. Method: Ten legal-specific LLMs are evaluated on three English-language contract understanding tasks and compared with seven general-purpose LLMs. Result: Legal-specific LLMs consistently outperform general-purpose models on tasks requiring nuanced legal understanding; Legal-BERT and Contracts-BERT establish new SOTAs on two of the three tasks with 69% fewer parameters than the best-performing general-purpose LLM. Conclusion: Legal-specific LLMs outperform general-purpose models on contract understanding, and Legal-BERT and Contracts-BERT establish new SOTAs on two of the three tasks. Abstract: Despite advances in legal NLP, no comprehensive evaluation covering multiple legal-specific LLMs currently exists for contract classification tasks in contract understanding. To address this gap, we present an evaluation of 10 legal-specific LLMs on three English language contract understanding tasks and compare them with 7 general-purpose LLMs. The results show that legal-specific LLMs consistently outperform general-purpose models, especially on tasks requiring nuanced legal understanding. Legal-BERT and Contracts-BERT establish new SOTAs on two of the three tasks, despite having 69% fewer parameters than the best-performing general-purpose LLM. We also identify CaseLaw-BERT and LexLM as strong additional baselines for contract understanding. Our results provide a holistic evaluation of legal-specific LLMs and will facilitate the development of more accurate contract understanding systems.

[69] Large Language Models for Czech Aspect-Based Sentiment Analysis

Jakub Šmíd,Pavel Přibáň,Pavel Král

Main category: cs.CL

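TL;DR: This paper evaluates 19 LLMs on Czech ABSA, finding that small domain-specific models outperform general-purpose LLMs in zero-shot and few-shot settings, while fine-tuned LLMs achieve state-of-the-art results.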

Details

Motivation: LLMs perform strongly across NLP tasks, but their capabilities for Czech ABSA remain largely unexplored. Method: A comprehensive evaluation of 19 LLMs of varying sizes and architectures, comparing zero-shot, few-shot, and fine-tuning scenarios. Result: The paper analyzes how multilingualism, model size, and recency affect performance and presents an error analysis highlighting key challenges in aspect term prediction. Conclusion: Small domain-specific models outperform general-purpose LLMs in zero-shot and few-shot settings, while fine-tuned LLMs achieve state-of-the-art results. Abstract: Aspect-based sentiment analysis (ABSA) is a fine-grained sentiment analysis task that aims to identify sentiment toward specific aspects of an entity. While large language models (LLMs) have shown strong performance in various natural language processing (NLP) tasks, their capabilities for Czech ABSA remain largely unexplored. In this work, we conduct a comprehensive evaluation of 19 LLMs of varying sizes and architectures on Czech ABSA, comparing their performance in zero-shot, few-shot, and fine-tuning scenarios. Our results show that small domain-specific models fine-tuned for ABSA outperform general-purpose LLMs in zero-shot and few-shot settings, while fine-tuned LLMs achieve state-of-the-art results. We analyze how factors such as multilingualism, model size, and recency influence performance and present an error analysis highlighting key challenges, particularly in aspect term prediction. Our findings provide insights into the suitability of LLMs for Czech ABSA and offer guidance for future research in this area.

[70] Few-shot Cross-lingual Aspect-Based Sentiment Analysis with Sequence-to-Sequence Models

Jakub Šmíd,Pavel Přibáň,Pavel Král

Main category: cs.CL

TL;DR: Few-shot target language examples greatly enhance cross-lingual ABSA performance, making it feasible and effective for low-resource settings.

Details

Motivation: Challenges in low-resource languages for ABSA due to lack of labeled data and over-reliance on external translation tools. Method: Evaluated the effect of adding few-shot target language examples across four ABSA tasks, six languages, and two sequence-to-sequence models. Result: Adding ten examples improves performance over zero-shot settings; 1,000 examples with English data surpasses monolingual baselines. Conclusion: Adding a small number of target language examples significantly improves cross-lingual ABSA performance, even surpassing monolingual baselines when combined with English data. Abstract: Aspect-based sentiment analysis (ABSA) has received substantial attention in English, yet challenges remain for low-resource languages due to the scarcity of labelled data. Current cross-lingual ABSA approaches often rely on external translation tools and overlook the potential benefits of incorporating a small number of target language examples into training. In this paper, we evaluate the effect of adding few-shot target language examples to the training set across four ABSA tasks, six target languages, and two sequence-to-sequence models. We show that adding as few as ten target language examples significantly improves performance over zero-shot settings and achieves a similar effect to constrained decoding in reducing prediction errors. Furthermore, we demonstrate that combining 1,000 target language examples with English data can even surpass monolingual baselines. These findings offer practical insights for improving cross-lingual ABSA in low-resource and domain-specific settings, as obtaining ten high-quality annotated examples is both feasible and highly effective.
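The data setup described above, where a small number of target-language examples is added to an English training set, can be sketched as follows. This is a minimal illustration under assumptions, not the authors' code: the function name `build_few_shot_train_set`, the uniform random sampling of the few-shot examples, and the shuffling of the mixed set are all choices made here for the example.

```python
import random


def build_few_shot_train_set(english_examples, target_examples, k, seed=0):
    """Mix all English examples with k sampled target-language examples.

    Assumption: the k examples are drawn uniformly at random without
    replacement; the paper may select them differently.
    """
    if k > len(target_examples):
        raise ValueError("k exceeds the number of available target examples")
    rng = random.Random(seed)
    few_shot = rng.sample(target_examples, k)
    mixed = list(english_examples) + few_shot
    rng.shuffle(mixed)  # avoid all target-language examples sitting at the end
    return mixed


# Hypothetical placeholder data; real examples would be annotated reviews.
english = [f"en-{i}" for i in range(5)]
czech = [f"cs-{i}" for i in range(20)]
train_set = build_few_shot_train_set(english, czech, k=10, seed=1)
```

Setting `k=0` reproduces the zero-shot cross-lingual baseline, which makes it easy to compare the two conditions with the same code path.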

[71] Tailored Emotional LLM-Supporter: Enhancing Cultural Sensitivity

Chen Cecilia Liu,Hiba Arnaout,Nils Kovačić,Dana Atzil-Slonim,Iryna Gurevych

Main category: cs.CL

TL;DR: This paper introduces CultureCare, the first dataset for culturally sensitive emotional support, develops adaptation strategies for LLMs, and demonstrates their effectiveness in cross-cultural settings and clinical training.

Details

Motivation: LLMs' ability to deliver culturally sensitive emotional support remains underexplored due to lack of resources. Method: Developed and tested four adaptation strategies for LLMs using the CultureCare dataset, which spans four cultures with extensive annotations. Evaluated using LLM judges, human annotators, and clinical psychologists. Result: Adapted LLMs outperform anonymous online peer responses, and simple cultural role-play is insufficient for cultural sensitivity. Conclusion: CultureCare dataset and adaptation strategies enhance LLMs' cultural sensitivity, showing their potential in clinical training for fostering cultural competence in future therapists. Abstract: Large language models (LLMs) show promise in offering emotional support and generating empathetic responses for individuals in distress, but their ability to deliver culturally sensitive support remains underexplored due to lack of resources. In this work, we introduce CultureCare, the first dataset designed for this task, spanning four cultures and including 1729 distress messages, 1523 cultural signals, and 1041 support strategies with fine-grained emotional and cultural annotations. Leveraging CultureCare, we (i) develop and test four adaptation strategies for guiding three state-of-the-art LLMs toward culturally sensitive responses; (ii) conduct comprehensive evaluations using LLM judges, in-culture human annotators, and clinical psychologists; (iii) show that adapted LLMs outperform anonymous online peer responses, and that simple cultural role-play is insufficient for cultural sensitivity; and (iv) explore the application of LLMs in clinical training, where experts highlight their potential in fostering cultural competence in future therapists.

[72] Challenges and opportunities in portraying emotion in generated sign language

John C. McDonald,Rosalee Wolfe,Fabrizio Nunnari

Main category: cs.CL

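TL;DR: This paper applies a two-parameter representation that specifies a signing avatar's emotional state more intuitively, showing more consistent and nuanced emotional expression than previous methods.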

Details

Motivation: Emotional content is difficult to incorporate into signing avatars because there is no standard method for specifying the avatar's emotional state. Method: The study applies an intuitive two-parameter representation for emotive non-manual signals to the Paula signing avatar, with textual control through the EASIER notation. Result: Users can control Paula's emotional expressions with two numerical parameters, allowing more nuanced emotional states and more consistent specification of emotional non-manual signals in linguistic annotations. Conclusion: An intuitive two-parameter representation can specify emotional facial expressions more coherently than previous methods. Abstract: Non-manual signals in sign languages continue to be a challenge for signing avatars. More specifically, emotional content has been difficult to incorporate because of a lack of a standard method of specifying the avatar's emotional state. This paper explores the application of an intuitive two-parameter representation for emotive non-manual signals to the Paula signing avatar that shows promise for facilitating the linguistic specification of emotional facial expressions in a more coherent manner than previous methods. Users can apply these parameters to control Paula's emotional expressions through a textual representation called the EASIER notation. The representation can allow avatars to express more nuanced emotional states using two numerical parameters. It also has the potential to enable more consistent specification of emotional non-manual signals in linguistic annotations which drive signing avatars.

Furkan Şahinuç,Subhabrata Dutta,Iryna Gurevych

Main category: cs.CL

TL;DR: This paper introduces GREP, an enhanced evaluation framework for scientific writing that integrates domain-specific criteria and expert preferences, offering more robust assessments compared to existing methods.

Details

Motivation: The motivation is to address the lack of effective evaluation methods for automatically generated scientific writing, which requires domain-specific knowledge and expert preferences. Method: GREP is a multi-turn evaluation framework that decomposes evaluation into fine-grained dimensions and uses contrastive few-shot examples. It has two variants: one using proprietary LLMs and another using open-weight LLMs. Result: Empirical results show that GREP outperforms standard LLM judges in assessing the quality of related work sections, reflects real scientific writing scenarios, and correlates strongly with human expert evaluations. Conclusion: The proposed GREP framework significantly enhances the evaluation of automatically generated scientific writing by incorporating domain-specific criteria and expert preferences, showing strong correlation with human expert assessments. Abstract: Expert domain writing, such as scientific writing, typically demands extensive domain knowledge. Recent advances in LLMs show promising potential in reducing the expert workload. However, evaluating the quality of automatically generated scientific writing is a crucial open issue, as it requires knowledge of domain-specific evaluation criteria and the ability to discern expert preferences. Conventional automatic metrics and LLM-as-a-judge systems are insufficient to grasp expert preferences and domain-specific quality standards. To address this gap and support human-AI collaborative writing, we focus on related work generation, one of the most challenging scientific tasks, as an exemplar. We propose GREP, a multi-turn evaluation framework that integrates classical related work evaluation criteria with expert-specific preferences. Instead of assigning a single score, our framework decomposes the evaluation into fine-grained dimensions. 
This localized evaluation approach is further augmented with contrastive few-shot examples to provide detailed contextual guidance for the evaluation dimensions. The design principles allow our framework to deliver cardinal assessment of quality, which can facilitate better post-training compared to ordinal preference data. For better accessibility, we design two variants of GREP: a more precise variant with proprietary LLMs as evaluators, and a cheaper alternative with open-weight LLMs. Empirical investigation reveals that our framework is able to assess the quality of related work sections in a much more robust manner compared to standard LLM judges, reflects natural scenarios of scientific writing, and bears a strong correlation with the human expert assessment. We also observe that generations from state-of-the-art LLMs struggle to satisfy validation constraints of a suitable related work section. They (mostly) fail to improve based on feedback as well.

[74] Large Language Models for Subjective Language Understanding: A Survey

Changhao Song,Yazhou Zhang,Hui Gao,Ben Yao,Peng Zhang

Main category: cs.CL

TL;DR: This survey explores how large language models enhance subjective language understanding across tasks like sentiment analysis and emotion detection, while addressing challenges and future research directions.

Details

Motivation: The paradigm shift brought by LLMs like ChatGPT and LLaMA necessitates a review of their impact on interpreting and generating subjective language across various tasks. Method: The paper provides a comprehensive survey of recent advances in applying LLMs to subjective language tasks, analyzing architectures, techniques, datasets, and methods. Result: The survey outlines the evolution of LLMs in tasks like sentiment analysis, emotion recognition, and metaphor interpretation, highlighting their strengths and persistent challenges. Conclusion: LLMs have transformed subjective language understanding, offering nuanced interpretations across multiple tasks while facing challenges like bias and ethical concerns. Abstract: Subjective language understanding refers to a broad set of natural language processing tasks where the goal is to interpret or generate content that conveys personal feelings, opinions, or figurative meanings rather than objective facts. With the advent of large language models (LLMs) such as ChatGPT, LLaMA, and others, there has been a paradigm shift in how we approach these inherently nuanced tasks. In this survey, we provide a comprehensive review of recent advances in applying LLMs to subjective language tasks, including sentiment analysis, emotion recognition, sarcasm detection, humor understanding, stance detection, metaphor interpretation, intent detection, and aesthetics assessment. We begin by clarifying the definition of subjective language from linguistic and cognitive perspectives, and we outline the unique challenges posed by subjective language (e.g. ambiguity, figurativeness, context dependence). We then survey the evolution of LLM architectures and techniques that particularly benefit subjectivity tasks, highlighting why LLMs are well-suited to model subtle human-like judgments. For each of the eight tasks, we summarize task definitions, key datasets, state-of-the-art LLM-based methods, and remaining challenges. 
We provide comparative insights, discussing commonalities and differences among tasks and how multi-task LLM approaches might yield unified models of subjectivity. Finally, we identify open issues such as data limitations, model bias, and ethical considerations, and suggest future research directions. We hope this survey will serve as a valuable resource for researchers and practitioners interested in the intersection of affective computing, figurative language processing, and large-scale language models.

[75] Toward Machine Interpreting: Lessons from Human Interpreting Studies

Matthias Sperber,Maureen de Seyssel,Jiajun Bao,Matthias Paulik

Main category: cs.CL

TL;DR: This paper explores how insights from human interpreting can enhance speech translation systems, suggesting that modern modeling techniques can bridge the gap between current systems and human-like adaptability.

Details

Motivation: Current speech translation systems are static and lack the adaptability seen in human interpreters. The authors aim to enhance the practical usefulness of these systems by drawing insights from human interpreting practices. Method: The paper analyzes human interpreting literature from the perspective of machine translation, examining both operational and qualitative aspects to identify implications for the development of speech translation systems. Result: The authors identify key implications for improving speech translation systems by incorporating principles from human interpreting, suggesting that modern modeling techniques can enable more adaptive and human-like performance. Conclusion: The paper concludes that there is significant potential to adopt principles from human interpreting into speech translation systems using recent modeling techniques, with the aim of bridging the usability gap and advancing toward true machine interpreting. Abstract: Current speech translation systems, while having achieved impressive accuracies, are rather static in their behavior and do not adapt to real-world situations in ways human interpreters do. In order to improve their practical usefulness and enable interpreting-like experiences, a precise understanding of the nature of human interpreting is crucial. To this end, we discuss human interpreting literature from the perspective of the machine translation field, while considering both operational and qualitative aspects. We identify implications for the development of speech translation systems and argue that there is great potential to adopt many human interpreting principles using recent modeling techniques. We hope that our findings provide inspiration for closing the perceived usability gap, and can motivate progress toward true machine interpreting.

[76] Understanding Syntactic Generalization in Structure-inducing Language Models

David Arps,Hassan Sajjad,Laura Kallmeyer

Main category: cs.CL

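TL;DR: This paper systematically compares three structure-inducing language models (Structformer, UDGN, and GPST) and finds that GPST performs most consistently across evaluation settings, particularly on long-distance dependencies.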

Details

Motivation: Although many structure-inducing language models (SiLMs) have been proposed, they have typically been evaluated at a relatively small scale, and their evaluation has systematic gaps and lacks comparability. A systematic comparison of the performance and properties of different SiLM architectures is therefore needed. Method: The authors compare three SiLM architectures (Structformer, UDGN, and GPST) on natural language corpora and synthetic bracketing expressions, with respect to properties of the induced syntactic representations, performance on grammaticality judgment tasks, and training dynamics. Result: The three architectures perform differently across evaluation metrics, and none dominates on all of them. The differences are most pronounced in the induced syntactic representations, and GPST outperforms the other models on long-distance dependencies. Conclusion: GPST performs most consistently across evaluation settings and, in particular, outperforms the other models on long-distance dependencies in bracketing expressions. The study also shows that small models trained on large amounts of synthetic data are a useful testbed for evaluating basic model properties. Abstract: Structure-inducing Language Models (SiLM) are trained on a self-supervised language modeling task, and induce a hierarchical sentence representation as a byproduct when processing an input. A wide variety of SiLMs have been proposed. However, these have typically been evaluated on a relatively small scale, and evaluation of these models has systematic gaps and lacks comparability. In this work, we study three different SiLM architectures using both natural language (English) corpora and synthetic bracketing expressions: Structformer (Shen et al., 2021), UDGN (Shen et al., 2022) and GPST (Hu et al., 2024). We compare them with respect to (i) properties of the induced syntactic representations, (ii) performance on grammaticality judgment tasks, and (iii) training dynamics. We find that none of the three architectures dominates across all evaluation metrics. However, there are significant differences, in particular with respect to the induced syntactic representations. The Generative Pretrained Structured Transformer (GPST; Hu et al. 2024) performs most consistently across evaluation settings, and outperforms the other models on long-distance dependencies in bracketing expressions. Furthermore, our study shows that small models trained on large amounts of synthetic data provide a useful testbed for evaluating basic model properties.

[77] Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL

Jiaxuan Gao,Wei Fu,Minyang Xie,Shusheng Xu,Chuyi He,Zhiyu Mei,Banghua Zhu,Yi Wu

Main category: cs.CL

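TL;DR: This paper introduces ASearcher, which uses scalable asynchronous reinforcement learning and a prompt-based LLM agent to improve open-source search agents, with strong results on the xBench and GAIA benchmarks.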

Details

Motivation: Open-source agents still fall short of expert-level search intelligence, and existing approaches are limited in scalability, efficiency, and data quality; for example, the small turn limits of existing online RL methods prevent the learning of complex strategies. Method: The paper introduces ASearcher, which combines scalable, fully asynchronous RL training with a prompt-based LLM agent that synthesizes a high-quality, challenging QA dataset, enabling large-scale RL training. Result: After RL training, the prompt-based QwQ-32B agent achieves Avg@4 gains of 46.7% and 20.8% on xBench and GAIA, respectively. The agent performs extremely long-horizon search, with tool calls exceeding 40 turns and output tokens exceeding 150k during training. ASearcher-Web-QwQ achieves Avg@4 scores of 42.1 on xBench and 52.8 on GAIA. Conclusion: Through scalable asynchronous RL training and a prompt-based LLM agent for data synthesis, ASearcher substantially improves the search intelligence of open-source search agents and surpasses existing open-source 32B agents on xBench and GAIA. Abstract: Recent advancements in LLM-based agents have demonstrated remarkable capabilities in handling complex, knowledge-intensive tasks by integrating external tools. Among diverse choices of tools, search tools play a pivotal role in accessing vast external knowledge. However, open-source agents still fall short of achieving expert-level Search Intelligence, the ability to resolve ambiguous queries, generate precise searches, analyze results, and conduct thorough exploration. Existing approaches fall short in scalability, efficiency, and data quality. For example, small turn limits in existing online RL methods, e.g. <=10, restrict complex strategy learning. This paper introduces ASearcher, an open-source project for large-scale RL training of search agents. Our key contributions include: (1) Scalable fully asynchronous RL training that enables long-horizon search while maintaining high training efficiency. (2) A prompt-based LLM agent that autonomously synthesizes high-quality and challenging QAs, creating a large-scale QA dataset. Through RL training, our prompt-based QwQ-32B agent achieves substantial improvements, with 46.7% and 20.8% Avg@4 gains on xBench and GAIA, respectively. Notably, our agent exhibits extreme long-horizon search, with tool calls exceeding 40 turns and output tokens exceeding 150k during training time. With a simple agent design and no external LLMs, ASearcher-Web-QwQ achieves Avg@4 scores of 42.1 on xBench and 52.8 on GAIA, surpassing existing open-source 32B agents. 
We open-source our models, training data, and codes in https://github.com/inclusionAI/ASearcher.

[78] The Medical Metaphors Corpus (MCC)

Anna Sofia Lippolis,Andrea Giovanni Nuzzolese,Aldo Gangemi

Main category: cs.CL

TL;DR: The paper introduces the Medical Metaphors Corpus, a dataset of annotated scientific metaphors, which reveals that current language models have room for improvement in understanding domain-specific figurative language.

Details

Motivation: Despite the prevalence of figurative language in scientific discourse, existing metaphor detection resources primarily focus on general-domain text, leaving a critical gap for domain-specific applications. Method: The paper presents the Medical Metaphors Corpus (MCC), a comprehensive dataset of 792 annotated scientific conceptual metaphors spanning medical and biological domains. Result: Our evaluation demonstrates that state-of-the-art language models achieve modest performance on scientific metaphor detection, revealing substantial room for improvement in domain-specific figurative language understanding. Conclusion: MCC enables multiple research applications including metaphor detection benchmarking, quality-aware generation systems, and patient-centered communication tools. Abstract: Metaphor is a fundamental cognitive mechanism that shapes scientific understanding, enabling the communication of complex concepts while potentially constraining paradigmatic thinking. Despite the prevalence of figurative language in scientific discourse, existing metaphor detection resources primarily focus on general-domain text, leaving a critical gap for domain-specific applications. In this paper, we present the Medical Metaphors Corpus (MCC), a comprehensive dataset of 792 annotated scientific conceptual metaphors spanning medical and biological domains. MCC aggregates metaphorical expressions from diverse sources including peer-reviewed literature, news media, social media discourse, and crowdsourced contributions, providing both binary and graded metaphoricity judgments validated through human annotation. Each instance includes source-target conceptual mappings and perceived metaphoricity scores on a 0-7 scale, establishing the first annotated resource for computational scientific metaphor research. 
Our evaluation demonstrates that state-of-the-art language models achieve modest performance on scientific metaphor detection, revealing substantial room for improvement in domain-specific figurative language understanding. MCC enables multiple research applications including metaphor detection benchmarking, quality-aware generation systems, and patient-centered communication tools.

[79] WideSearch: Benchmarking Agentic Broad Info-Seeking

Ryan Wong,Jiawei Wang,Junjie Zhao,Li Chen,Yan Gao,Long Zhang,Xuan Zhou,Zuo Wang,Kai Xiang,Ge Zhang,Wenhao Huang,Yang Wang,Ke Wang

Main category: cs.CL

TL;DR: This paper introduces WideSearch, a benchmark for evaluating the reliability of LLM-powered search agents in large-scale information collection tasks. The results show that current systems perform poorly, indicating a need for further research and development in this area.

Details

Motivation: The motivation stems from the repetitive nature of wide-scale information seeking tasks that bottleneck various fields. The researchers aim to evaluate the reliability and completeness of LLM-powered search agents in handling such tasks due to a lack of suitable benchmarks. Method: The researchers introduced WideSearch, a benchmark with 200 questions across multiple domains, to evaluate the reliability of search agents in large-scale collection tasks. They tested over 10 state-of-the-art systems and used a five-stage quality control pipeline to ensure dataset rigor. Result: Most tested systems achieved near 0% success rates, with the best performer reaching only 5%. However, human cross-validation achieved nearly 100% success, demonstrating the current shortcomings of automated agents. Conclusion: The study concludes that current search agents have critical deficiencies in large-scale information seeking, highlighting the need for further research and development in agentic search systems. Abstract: From professional research to everyday planning, many tasks are bottlenecked by wide-scale information seeking, which is more repetitive than cognitively complex. With the rapid development of Large Language Models (LLMs), automated search agents powered by LLMs offer a promising solution to liberate humans from this tedious work. However, the capability of these agents to perform such "wide-context" collection reliably and completely remains largely unevaluated due to a lack of suitable benchmarks. To bridge this gap, we introduce WideSearch, a new benchmark engineered to evaluate agent reliability on these large-scale collection tasks. The benchmark features 200 manually curated questions (100 in English, 100 in Chinese) from over 15 diverse domains, grounded in real user queries. Each task requires agents to collect large-scale atomic information, which could be verified one by one objectively, and arrange it into a well-organized output. 
A rigorous five-stage quality control pipeline ensures the difficulty, completeness, and verifiability of the dataset. We benchmark over 10 state-of-the-art agentic search systems, including single-agent, multi-agent frameworks, and end-to-end commercial systems. Most systems achieve overall success rates near 0%, with the best performer reaching just 5%. However, given sufficient time, cross-validation by multiple human testers can achieve a near 100% success rate. These results demonstrate that present search agents have critical deficiencies in large-scale information seeking, underscoring urgent areas for future research and development in agentic search. Our dataset, evaluation pipeline, and benchmark results have been publicly released at https://widesearch-seed.github.io/

[80] Progressive Depth Up-scaling via Optimal Transport

Mingzi Cao,Xi Wang,Nikolaos Aletras

Main category: cs.CL

TL;DR: OpT-DeUS is a depth up-scaling technique for LLMs that uses Optimal Transport to align neurons, improving performance and training efficiency.

Details

Motivation: Existing depth up-scaling methods copy or average weights from base layers, neglecting neuron permutation differences which can lead to performance issues. Method: OpT-DeUS uses Optimal Transport (OT) to align and fuse Transformer blocks in adjacent layers for new layer creation. Result: OpT-DeUS outperforms existing methods in continual pre-training and supervised fine-tuning across different model sizes, with additional gains from inserting layers closer to the top. Conclusion: OpT-DeUS provides an efficient method for depth up-scaling of LLMs, achieving better performance and training efficiency compared to existing methods. Abstract: Scaling Large Language Models (LLMs) yields performance gains but incurs substantial training costs. Depth up-scaling offers training efficiency by adding new layers to pre-trained models. However, most existing methods copy or average weights from base layers, neglecting neuron permutation differences. This limitation can potentially cause misalignment that harms performance. Inspired by applying Optimal Transport (OT) for neuron alignment, we propose Optimal Transport Depth Up-Scaling (OpT-DeUS). OpT-DeUS aligns and fuses Transformer blocks in adjacent base layers via OT for new layer creation, to mitigate neuron permutation mismatch between layers. OpT-DeUS achieves better overall performance and offers improved training efficiency than existing methods for continual pre-training and supervised fine-tuning across different model sizes. To further evaluate the impact of interpolation positions, our extensive analysis shows that inserting new layers closer to the top results in higher training efficiency due to shorter back-propagation time while obtaining additional performance gains.
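
To illustrate why neuron permutation matters when fusing adjacent layers, here is a minimal sketch. It is not the OpT-DeUS procedure: it swaps in a brute-force hard assignment (a special case of optimal transport with uniform marginals) over a toy layer, and the function names `sq_dist` and `align_and_average` are hypothetical. A real implementation would use a proper OT solver and would also have to permute the weights of downstream layers consistently.

```python
from itertools import permutations

def sq_dist(x, y):
    """Squared Euclidean distance between two weight vectors."""
    return sum((p - q) ** 2 for p, q in zip(x, y))

def align_and_average(w_a, w_b):
    """Match each neuron (row) of w_b to a neuron of w_a, then average.

    Brute force over all permutations, so only usable for tiny toy layers.
    """
    n = len(w_a)

    def total_cost(perm):
        return sum(sq_dist(w_a[i], w_b[perm[i]]) for i in range(n))

    best = min(permutations(range(n)), key=total_cost)
    fused = [
        [(x + y) / 2 for x, y in zip(w_a[i], w_b[best[i]])]
        for i in range(n)
    ]
    return best, fused

# Two layers that compute similar features, but with neurons in swapped order.
layer_a = [[1, 0], [0, 1]]
layer_b = [[0, 3], [3, 0]]

perm, fused = align_and_average(layer_a, layer_b)
naive = [[(x + y) / 2 for x, y in zip(ra, rb)] for ra, rb in zip(layer_a, layer_b)]
print(perm)   # (1, 0)
print(fused)  # [[2.0, 0.0], [0.0, 2.0]]
print(naive)  # [[0.5, 1.5], [1.5, 0.5]]
```

The naive average blends unrelated neurons, while the aligned average keeps each fused neuron pointing in the direction both source neurons share.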

[81] 9th Workshop on Sign Language Translation and Avatar Technologies (SLTAT 2025)

Fabrizio Nunnari,Cristina Luna Jiménez,Rosalee Wolfe,John C. McDonald,Michael Filhol,Eleni Efthimiou,Evita Fotinea,Thomas Hanke

Main category: cs.CL

TL;DR: The 9th SLTAT workshop at IVA 2025 highlighted advancements in sign language translation and avatar technology, promoting collaboration and showcasing research on diverse topics like recognition, data analysis, and ethics.

Details

Motivation: The motivation is to enhance communication between deaf and hearing individuals using non-invasive technologies, leveraging advancements in digital humans and sign language processing. Method: The workshop gathered recent research contributions through submissions and presentations, hosted under the IVA conference to facilitate interdisciplinary interaction. Result: The workshop received consistent submissions on topics including sign language recognition, data collection, analysis, tools, ethics, usability, and affective computing, indicating broad interest and progress in the field. Conclusion: SLTAT 2025 successfully brought together researchers to explore advancements in sign language translation and avatar technology, fostering collaboration between different research communities. Abstract: The Sign Language Translation and Avatar Technology (SLTAT) workshops continue a series of gatherings to share recent advances in improving deaf / human communication through non-invasive means. This 2025 edition, the 9th since its first appearance in 2011, is hosted by the International Conference on Intelligent Virtual Agents (IVA), giving the opportunity for contamination between two research communities, using digital humans as either virtual interpreters or as interactive conversational agents. As presented in this summary paper, SLTAT sees contributions beyond avatar technologies, with a consistent number of submissions on sign language recognition, and other work on data collection, data analysis, tools, ethics, usability, and affective computing.

[82] Dual Information Speech Language Models for Emotional Conversations

Chun Wang,Chenyang Liu,Wenze Xu,Weihong Deng

Main category: cs.CL

TL;DR: This paper introduces a novel adapter-based approach for speech-language models to better capture paralinguistic cues and improve emotional conversation understanding without increasing model complexity.

Details

Motivation: Conversational systems using text-based large language models (LLMs) often miss paralinguistic cues crucial for emotion and intention understanding. While speech-language models (SLMs) offer a promising solution, existing SLMs built on frozen LLMs struggle with paralinguistic information and contextual understanding. Method: The authors propose two heterogeneous adapters and a weakly supervised training approach to disentangle paralinguistic and linguistic information in SLMs. Their method avoids generating task-specific vectors by employing controlled randomness, preserving contextual understanding. Result: Experiments show that the proposed approach achieves competitive performance in emotional conversation tasks, effectively combining both paralinguistic and linguistic information within contextual settings, while training only the adapters on common datasets. Conclusion: The paper concludes that by using two heterogeneous adapters and a weakly supervised training strategy, speech-language models (SLMs) can effectively integrate paralinguistic and linguistic information, enhancing emotional conversation tasks while maintaining parameter and data efficiency. Abstract: Conversational systems relying on text-based large language models (LLMs) often overlook paralinguistic cues, essential for understanding emotions and intentions. Speech-language models (SLMs), which use speech as input, are emerging as a promising solution. However, SLMs built by extending frozen LLMs struggle to capture paralinguistic information and exhibit reduced context understanding. We identify entangled information and improper training strategies as key issues. To address these issues, we propose two heterogeneous adapters and suggest a weakly supervised training strategy. Our approach disentangles paralinguistic and linguistic information, enabling SLMs to interpret speech through structured representations. 
It also preserves contextual understanding by avoiding the generation of task-specific vectors through controlled randomness. This approach trains only the adapters on common datasets, ensuring parameter and data efficiency. Experiments demonstrate competitive performance in emotional conversation tasks, showcasing the model's ability to effectively integrate both paralinguistic and linguistic information within contextual settings.

[83] Assessing LLM Text Detection in Educational Contexts: Does Human Contribution Affect Detection?

Lukas Gehring,Benjamin Paaßen

Main category: cs.CL

TL;DR: This paper evaluates the effectiveness of current LLM-generated text detectors in educational settings using a new dataset and contribution level framework, revealing significant challenges in detecting intermediate LLM involvement accurately.

Details

Motivation: The increased accessibility of Large Language Models (LLMs) has made it easier for students to generate texts automatically, raising concerns about academic integrity and the need for reliable detection methods. Method: Benchmarked state-of-the-art detectors using a novel dataset called Generative Essay Detection in Education (GEDE), which includes both student-written and LLM-generated essays, and introduced the concept of contribution levels to assess LLM involvement. Result: Detectors struggle with identifying LLM-improved human-written texts and produce notable false positives, highlighting the limitations of current detection approaches in educational contexts. Conclusion: Most detectors face challenges in accurately classifying texts with intermediate student contribution levels, particularly leading to false positives which can negatively impact students in educational settings. Abstract: Recent advancements in Large Language Models (LLMs) and their increased accessibility have made it easier than ever for students to automatically generate texts, posing new challenges for educational institutions. To enforce norms of academic integrity and ensure students' learning, learning analytics methods to automatically detect LLM-generated text appear increasingly appealing. This paper benchmarks the performance of different state-of-the-art detectors in educational contexts, introducing a novel dataset, called Generative Essay Detection in Education (GEDE), containing over 900 student-written essays and over 12,500 LLM-generated essays from various domains. To capture the diversity of LLM usage practices in generating text, we propose the concept of contribution levels, representing students' contribution to a given assignment. These levels range from purely human-written texts, to slightly LLM-improved versions, to fully LLM-generated texts, and finally to active attacks on the detector by "humanizing" generated texts. 
We show that most detectors struggle to accurately classify texts of intermediate student contribution levels, like LLM-improved human-written texts. Detectors are particularly likely to produce false positives, which is problematic in educational settings where false suspicions can severely impact students' lives. Our dataset, code, and additional supplementary materials are publicly available at https://github.com/lukasgehring/Assessing-LLM-Text-Detection-in-Educational-Contexts.

Robin Huo,Ewan Dunbar

Main category: cs.CL

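TL;DR: This paper compares two architectural differences between two self-supervised speech representation models and finds that training iteration, not training objective, determines how well linguistic information is encoded.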

Details

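Motivation: Self-supervised speech representation models are widely used for their versatility and downstream performance, but the effect of their architecture on the linguistic information they learn remains under-studied. Method: The study minimally compares HuBERT and wav2vec 2.0 along two architectural differences: the training objective and iterative pseudo-label refinement through multiple training iterations. Result: Differences in how hidden representations correlate with linguistic information are explained by training iteration, not training objective. Conclusion: Differences in the canonical correlation of hidden representations with word identity, phoneme identity, and speaker identity stem from training iteration rather than training objective. The authors suggest that future work investigate why iterative refinement is effective at encoding linguistic information in self-supervised speech representations. Abstract: Self-supervised models for speech representation learning now see widespread use for their versatility and performance on downstream tasks, but the effect of model architecture on the linguistic information learned in their representations remains under-studied. This study investigates two such models, HuBERT and wav2vec 2.0, and minimally compares two of their architectural differences: training objective and iterative pseudo-label refinement through multiple training iterations. We find that differences in canonical correlation of hidden representations to word identity, phoneme identity, and speaker identity are explained by training iteration, not training objective. We suggest that future work investigate the reason for the effectiveness of iterative refinement in encoding linguistic information in self-supervised speech representations.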

[85] Czech Dataset for Complex Aspect-Based Sentiment Analysis Tasks

Jakub Šmíd,Pavel Přibáň,Ondřej Pražák,Pavel Král

Main category: cs.CL

TL;DR: This paper introduces a new Czech dataset for advanced aspect-based sentiment analysis, designed for complex tasks and cross-lingual research, with annotated and unannotated reviews and baseline model results.

Details

Motivation: The motivation is to advance aspect-based sentiment analysis in Czech by creating a dataset suitable for complex tasks like target-aspect-category detection and enabling cross-lingual comparisons. Method: The authors created a new Czech dataset for ABSA with unified annotations for complex tasks, following the SemEval-2016 format. They involved two annotators and provided baseline results using Transformer models. Result: A new Czech ABSA dataset with 3.1K annotated reviews, an inter-annotator agreement rate of ~90%, and 24M unannotated reviews, along with baseline results using Transformer-based models. Conclusion: The paper concludes by presenting robust monolingual baseline results using Transformer-based models and provides error analysis to support the dataset's utility. Abstract: In this paper, we introduce a novel Czech dataset for aspect-based sentiment analysis (ABSA), which consists of 3.1K manually annotated reviews from the restaurant domain. The dataset is built upon the older Czech dataset, which contained only separate labels for the basic ABSA tasks such as aspect term extraction or aspect polarity detection. Unlike its predecessor, our new dataset is specifically designed for more complex tasks, e.g. target-aspect-category detection. These advanced tasks require a unified annotation format, seamlessly linking sentiment elements (labels) together. Our dataset follows the format of the well-known SemEval-2016 datasets. This design choice allows effortless application and evaluation in cross-lingual scenarios, ultimately fostering cross-language comparisons with equivalent counterpart datasets in other languages. The annotation process engaged two trained annotators, yielding an impressive inter-annotator agreement rate of approximately 90%. Additionally, we provide 24M reviews without annotations suitable for unsupervised learning. 
We present robust monolingual baseline results achieved with various Transformer-based models and insightful error analysis to supplement our contributions. Our code and dataset are freely available for non-commercial research purposes.

[86] Optimal Transport Regularization for Speech Text Alignment in Spoken Language Models

Wenze Xu,Chun Wang,Jiazhen Yu,Sheng Chen,Liang Gao,Weihong Deng

Main category: cs.CL

TL;DR: This paper proposes Optimal Transport Regularization (OTReg) to address the modality gap in Spoken Language Models (SLMs), improving their generalization by better aligning speech and text representations.

Details

Motivation: The motivation is to address the modality gap between speech and text representations in SLMs, which hinders generalization despite strong in-domain performance. Method: The authors introduced Optimal Transport Regularization (OTReg), which formulates speech-text alignment as an optimal transport problem and uses a regularization loss to optimize SLM training. Result: Extensive multilingual ASR experiments show that OTReg improves SLM generalization, aligns speech embeddings more effectively with transcript embeddings, and reduces the impact of unintended speech variations. Conclusion: The study concludes that OTReg effectively enhances speech-text alignment, mitigates the modality gap, and improves the generalization of Spoken Language Models (SLMs) across diverse datasets. Abstract: Spoken Language Models (SLMs), which extend Large Language Models (LLMs) to perceive speech inputs, have gained increasing attention for their potential to advance speech understanding tasks. However, despite recent progress, studies show that SLMs often struggle to generalize across datasets, even for trained languages and tasks, raising concerns about whether they process speech in a text-like manner as intended. A key challenge underlying this limitation is the modality gap between speech and text representations. The high variability in speech embeddings may allow SLMs to achieve strong in-domain performance by exploiting unintended speech variations, ultimately hindering generalization. To mitigate this modality gap, we introduce Optimal Transport Regularization (OTReg), a method that formulates speech-text alignment as an optimal transport problem and derives a regularization loss to improve SLM training. 
In each training iteration, OTReg first establishes a structured correspondence between speech and transcript embeddings by determining the optimal transport plan, then incorporates the regularization loss based on this transport plan to optimize SLMs in generating speech embeddings that align more effectively with transcript embeddings. OTReg is lightweight, requiring no additional labels or learnable parameters, and integrates seamlessly into existing SLM training procedures. Extensive multilingual ASR experiments demonstrate that OTReg enhances speech-text alignment, mitigates the modality gap, and consequently improves SLM generalization across diverse datasets.
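
The two steps described above (compute a transport plan between speech and transcript embeddings, then derive a loss from it) can be sketched as follows. This is an illustration under stated assumptions, not the OTReg implementation: it assumes uniform marginals, a squared Euclidean cost, and an entropy-regularized Sinkhorn solver, none of which the abstract specifies. The names `sinkhorn_plan` and `ot_alignment_loss` are hypothetical.

```python
import math

def sq_dist(x, y):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((p - q) ** 2 for p, q in zip(x, y))

def sinkhorn_plan(cost, reg=0.1, n_iters=200):
    """Entropy-regularized transport plan between uniform marginals."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # mass on each speech frame (assumed uniform)
    b = [1.0 / m] * m  # mass on each transcript token (assumed uniform)
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

def ot_alignment_loss(speech_emb, text_emb, reg=0.1):
    """Transport cost under the plan; small when the modalities align."""
    cost = [[sq_dist(s, t) for t in text_emb] for s in speech_emb]
    plan = sinkhorn_plan(cost, reg)
    loss = sum(
        plan[i][j] * cost[i][j]
        for i in range(len(cost))
        for j in range(len(cost[0]))
    )
    return loss, plan

# Four speech frames, two transcript tokens (toy 2-d embeddings).
speech = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
text = [[0.0, 0.0], [1.0, 1.0]]
loss, plan = ot_alignment_loss(speech, text)
print(round(loss, 3))
```

In this toy case the first two frames send almost all of their mass to the first token and the last two to the second, so the loss stays small. In training, such a term would be added to the main objective with a weighting coefficient, which is also an assumption here.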

[87] Can LLMs Detect Their Confabulations? Estimating Reliability in Uncertainty-Aware Language Models

Tianyi Zhou,Johanne Medina,Sanjay Chawla

Main category: cs.CL

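TL;DR: This study uses token-level uncertainty to estimate the reliability of LLM responses, finds that correct context improves model performance while misleading context induces confidently incorrect answers, and proposes a new method that improves the detection of unreliable responses.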

Details

Motivation: Large language models (LLMs) tend to generate fluent but incorrect content, which poses risks in multi-turn dialogue and agentic applications. The study investigates how in-context information influences model behavior and whether LLMs can identify their own unreliable responses. Method: Aleatoric and epistemic uncertainty are computed from output logits to identify salient tokens, whose hidden states are aggregated into compact representations for response-level reliability prediction. Result: Experiments show that correct in-context information improves answer accuracy and model confidence, while misleading context often induces confidently incorrect responses. The probing-based method improves the detection of unreliable outputs across multiple open-source LLMs. Conclusion: The study underscores the limitations of direct uncertainty signals and highlights the potential of uncertainty-guided probing for reliability-aware generation. Abstract: Large Language Models (LLMs) are prone to generating fluent but incorrect content, known as confabulation, which poses increasing risks in multi-turn or agentic applications where outputs may be reused as context. In this work, we investigate how in-context information influences model behavior and whether LLMs can identify their unreliable responses. We propose a reliability estimation that leverages token-level uncertainty to guide the aggregation of internal model representations. Specifically, we compute aleatoric and epistemic uncertainty from output logits to identify salient tokens and aggregate their hidden states into compact representations for response-level reliability prediction. Through controlled experiments on open QA benchmarks, we find that correct in-context information improves both answer accuracy and model confidence, while misleading context often induces confidently incorrect responses, revealing a misalignment between uncertainty and correctness. Our probing-based method captures these shifts in model behavior and improves the detection of unreliable outputs across multiple open-source LLMs. These results underscore the limitations of direct uncertainty signals and highlight the potential of uncertainty-guided probing for reliability-aware generation.
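
The idea of using token-level uncertainty to choose which hidden states to aggregate can be sketched as below. This is a simplified stand-in, not the paper's method: it scores each token by plain softmax entropy instead of separating aleatoric and epistemic uncertainty, and it mean-pools the selected hidden states. The function name `pool_salient_states` and the `top_k` parameter are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means a less certain prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def pool_salient_states(token_logits, hidden_states, top_k=2):
    """Mean-pool the hidden states of the top_k most uncertain tokens."""
    scores = [entropy(softmax(row)) for row in token_logits]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    salient = sorted(ranked[:top_k])
    dim = len(hidden_states[0])
    pooled = [
        sum(hidden_states[i][d] for i in salient) / len(salient)
        for d in range(dim)
    ]
    return salient, pooled

# Four generated tokens over a 3-word toy vocabulary, with 2-d hidden states.
logits = [[10, 0, 0], [1, 1, 1], [5, 0, 0], [2, 2, 0]]
hidden = [[0, 0], [2, 4], [9, 9], [4, 8]]
print(pool_salient_states(logits, hidden))  # ([1, 3], [3.0, 6.0])
```

The pooled vector would then be passed to a small probe (for example a logistic regression classifier) that predicts whether the response is reliable; the choice of probe is likewise an assumption.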

[88] Data-Efficient Biomedical In-Context Learning: A Diversity-Enhanced Submodular Perspective

Jun Wang,Zaifu Zhan,Qixin Zhang,Mingquan Lin,Meijia Song,Rui Zhang

Main category: cs.CL

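TL;DR: This paper proposes Dual-Div, a framework that improves large language models' biomedical in-context learning by increasing the diversity of demonstrations; experiments show it outperforms baselines.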

Details

Motivation: Although demonstration selection for in-context learning with large language models has been studied extensively, most methods prioritize representativeness over diversity; this paper aims to fill that gap. Method: Dual-Div uses a two-stage retrieval and ranking process: the first stage selects candidate examples from the corpus by combining representativeness and diversity, and the second stage ranks the candidates against the test query to select the most relevant, non-redundant demonstrations. Result: Experiments on three biomedical NLP tasks (NER, RE, TC) show that Dual-Div improves macro-F1 by up to 5% over baselines and is robust to prompt permutations and class imbalance. Conclusion: By enhancing diversity, Dual-Div performs well in biomedical in-context learning; the results indicate that diversity in the initial retrieval stage matters more than ranking-stage optimization, and that a small number of demonstrations (3-5) is enough to maximize performance. Abstract: Recent progress in large language models (LLMs) has leveraged their in-context learning (ICL) abilities to enable quick adaptation to unseen biomedical NLP tasks. By incorporating only a few input-output examples into prompts, LLMs can rapidly perform these new tasks. While the impact of these demonstrations on LLM performance has been extensively studied, most existing approaches prioritize representativeness over diversity when selecting examples from large corpora. To address this gap, we propose Dual-Div, a diversity-enhanced data-efficient framework for demonstration selection in biomedical ICL. Dual-Div employs a two-stage retrieval and ranking process: First, it identifies a limited set of candidate examples from a corpus by optimizing both representativeness and diversity (with optional annotation for unlabeled data). Second, it ranks these candidates against test queries to select the most relevant and non-redundant demonstrations. Evaluated on three biomedical NLP tasks (named entity recognition (NER), relation extraction (RE), and text classification (TC)) using LLaMA 3.1 and Qwen 2.5 for inference, along with three retrievers (BGE-Large, BMRetriever, MedCPT), Dual-Div consistently outperforms baselines, achieving up to 5% higher macro-F1 scores, while demonstrating robustness to prompt permutations and class imbalance. Our findings establish that diversity in initial retrieval is more critical than ranking-stage optimization, and limiting demonstrations to 3-5 examples maximizes performance efficiency.
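
To make the idea of "relevant and non-redundant" demonstration selection concrete, here is a minimal greedy sketch. It uses a maximal-marginal-relevance style score as a stand-in; the abstract does not specify Dual-Div's submodular objective, so the scoring rule, the `trade_off` parameter, and the function names are assumptions.

```python
import math

def cosine(u, v):
    # Cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_demonstrations(query, candidates, k, trade_off=0.4):
    # Greedily pick k candidate indices. Each step maximizes
    # trade_off * relevance - (1 - trade_off) * redundancy, where redundancy
    # is the highest similarity to anything already selected.
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            redundancy = max((cosine(candidates[i], candidates[j]) for j in selected), default=0.0)
            return trade_off * relevance - (1.0 - trade_off) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `trade_off=1.0` the procedure reduces to plain relevance ranking, which makes it easy to compare a diversity-aware selection against a relevance-only baseline on the same embeddings.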

[89] REX-RAG: Reasoning Exploration with Policy Correction in Retrieval-Augmented Generation

Wentao Jiang,Xiang Feng,Zengmao Wang,Yong Luo,Pingbo Xu,Zhe Chen,Bo Du,Jing Zhang

Main category: cs.CL

TL;DR: This paper proposes REX-RAG, a reinforcement learning framework with mixed sampling and policy correction mechanisms to improve reasoning in large language models by avoiding unproductive reasoning paths, achieving significant performance improvements on question-answering tasks.

Details

Motivation: The motivation stems from the challenge that LLMs often get trapped in unproductive reasoning paths (dead ends) during policy-driven trajectory sampling when using reinforcement learning with retrieval-augmented generation (RAG), leading to poor exploration and ineffective policy optimization. Method: The paper proposes REX-RAG, which includes a Mixed Sampling Strategy combining probe sampling and exploratory prompts to avoid dead ends, and a Policy Correction Mechanism using importance sampling to correct distribution shifts and mitigate gradient estimation bias. Result: REX-RAG was evaluated on seven question-answering benchmarks, achieving average performance gains of 5.1% on Qwen2.5-3B and 3.6% on Qwen2.5-7B over strong baselines, showing competitive results across multiple datasets. Conclusion: REX-RAG is an effective framework that improves reasoning exploration in LLMs by escaping unproductive reasoning paths (dead ends) through a mixed sampling strategy and policy correction mechanism, achieving competitive performance gains across multiple datasets. Abstract: Reinforcement learning (RL) is emerging as a powerful paradigm for enabling large language models (LLMs) to perform complex reasoning tasks. Recent advances indicate that integrating RL with retrieval-augmented generation (RAG) allows LLMs to dynamically incorporate external knowledge, leading to more informed and robust decision making. However, we identify a critical challenge during policy-driven trajectory sampling: LLMs are frequently trapped in unproductive reasoning paths, which we refer to as "dead ends", committing to overconfident yet incorrect conclusions. This severely hampers exploration and undermines effective policy optimization. 
To address this challenge, we propose REX-RAG (Reasoning Exploration with Policy Correction in Retrieval-Augmented Generation), a novel framework that explores alternative reasoning paths while maintaining rigorous policy learning through principled distributional corrections. Our approach introduces two key innovations: (1) Mixed Sampling Strategy, which combines a novel probe sampling method with exploratory prompts to escape dead ends; and (2) Policy Correction Mechanism, which employs importance sampling to correct distribution shifts induced by mixed sampling, thereby mitigating gradient estimation bias. We evaluate it on seven question-answering benchmarks, and the experimental results show that REX-RAG achieves average performance gains of 5.1% on Qwen2.5-3B and 3.6% on Qwen2.5-7B over strong baselines, demonstrating competitive results across multiple datasets. The code is publicly available at https://github.com/MiliLab/REX-RAG.
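
The role of importance sampling in the Policy Correction Mechanism can be shown on a toy discrete example. This is a sketch under stated assumptions, not the REX-RAG implementation: it assumes trajectories are drawn from a mixture of the current policy and a probe distribution, and the names `mixture`, `importance_weights`, and `expected_reward` are hypothetical.

```python
def mixture(policy, probe, alpha):
    # Behavior distribution: with probability alpha sample from the probe,
    # otherwise from the current policy (assumed form of mixed sampling).
    return {a: (1.0 - alpha) * policy[a] + alpha * probe[a] for a in policy}

def importance_weights(target, behavior):
    # Ratio target(a) / behavior(a); reweights behavior samples so that
    # expectations match the target policy.
    return {a: target[a] / behavior[a] for a in target}

def expected_reward(dist, reward, weights=None):
    # Exact expectation under dist, optionally reweighted per action.
    if weights is None:
        weights = {a: 1.0 for a in dist}
    return sum(dist[a] * weights[a] * reward[a] for a in dist)

policy = {"dead_end": 0.9, "explore": 0.1}
probe = {"dead_end": 0.2, "explore": 0.8}
reward = {"dead_end": 0.0, "explore": 1.0}

behavior = mixture(policy, probe, alpha=0.5)
weights = importance_weights(policy, behavior)
biased = expected_reward(behavior, reward)              # ignores the shift
corrected = expected_reward(behavior, reward, weights)  # matches the policy
```

Without the weights, the estimate reflects the mixed sampler rather than the policy being optimized; with them, the expectation equals the on-policy value, which is the bias the abstract says the correction is meant to mitigate.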

[90] LPI-RIT at LeWiDi-2025: Improving Distributional Predictions via Metadata and Loss Reweighting with DisCo

Mandira Sawkar,Samay U. Shetty,Deepak Pandita,Tharindu Cyril Weerasooriya,Christopher M. Homan

Main category: cs.CL

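TL;DR: This paper presents an improved DisCo model that incorporates annotator metadata and modified loss functions to better capture disagreement patterns, yielding substantial improvements on the evaluation metrics.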

Details

Motivation: The LeWiDi 2025 shared task aims to model annotator disagreement through soft label distribution prediction and perspectivist evaluation. Method: The DisCo neural architecture is extended by incorporating annotator metadata, enhancing input representations, and modifying the loss functions to better capture disagreement patterns. Result: Substantial improvements are shown on both soft and perspectivist evaluation metrics across three datasets. Conclusion: The study underscores the value of disagreement-aware modeling and offers insights into how system components interact with the complexity of human-annotated data. Abstract: The Learning With Disagreements (LeWiDi) 2025 shared task is to model annotator disagreement through soft label distribution prediction and perspectivist evaluation, modeling annotators. We adapt DisCo (Distribution from Context), a neural architecture that jointly models item-level and annotator-level label distributions, and present detailed analysis and improvements. In this paper, we extend DisCo by incorporating annotator metadata, enhancing input representations, and modifying the loss functions to better capture disagreement patterns. Through extensive experiments, we demonstrate substantial improvements in both soft and perspectivist evaluation metrics across three datasets. We also conduct in-depth error and calibration analyses, highlighting the conditions under which improvements occur. Our findings underscore the value of disagreement-aware modeling and offer insights into how system components interact with the complexity of human-annotated data.

[91] Efficient Speculative Decoding for Llama at Scale: Challenges and Solutions

Bangsheng Tang,Carl Chengyan Fu,Fei Kou,Grigory Sizov,Haoci Zhang,Jason Park,Jiawen Liu,Jie You,Qirui Yang,Sachin Mehta,Shengyong Cai,Xiaodong Wang,Xingyu Liu,Yunlu Li,Yanjun Zhou,Wei Wei,Zhiwei Zhao,Zixi Qi,Adolfo Victoria,Aya Ibrahim,Bram Wasti,Changkyu Kim,Daniel Haziza,Fei Sun,Giancarlo Delfin,Emily Guo,Jialin Ouyang,Jaewon Lee,Jianyu Huang,Jeremy Reizenstein,Lu Fang,Quinn Zhu,Ria Verma,Vlad Mihailescu,Xingwen Guo,Yan Cui,Ye Hu,Yejin Lee

Main category: cs.CL

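TL;DR: This paper describes production-scale optimizations for EAGLE-based speculative decoding on Llama models, addressing several engineering challenges of scaling speculative decoding in production. The optimizations achieve a new state-of-the-art inference latency for Llama models and, at production scale, a 1.4x to 2.0x speed-up for EAGLE-based speculative decoding at large batch sizes.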

Details

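Motivation: Speculative decoding is a standard method for accelerating large language model inference, but scaling it for production environments poses several engineering challenges, such as efficiently implementing different operations (e.g., tree attention and multi-round speculative decoding) on GPUs. Method: The paper presents production-scale optimizations for EAGLE-based speculative decoding on Llama models, including efficient implementations of operations such as tree attention and multi-round speculative decoding. Result: The optimizations achieve a new state-of-the-art inference latency for Llama models; for example, Llama4 Maverick decodes at about 4 ms per token (batch size 1) on 8 NVIDIA H100 GPUs, 10% faster than the previously best known method. In addition, at production scale the optimized EAGLE-based speculative decoding achieves a 1.4x to 2.0x speed-up at large batch sizes. Conclusion: The paper details the training and inference optimizations used to deploy EAGLE-based speculative decoding for Llama models at production scale, reaching a new state-of-the-art inference latency for Llama models. Abstract: Speculative decoding is a standard method for accelerating the inference speed of large language models. However, scaling it for production environments poses several engineering challenges, including efficiently implementing different operations (e.g., tree attention and multi-round speculative decoding) on GPU. In this paper, we detail the training and inference optimization techniques that we have implemented to enable EAGLE-based speculative decoding at a production scale for Llama models. With these changes, we achieve a new state-of-the-art inference latency for Llama models. For example, Llama4 Maverick decodes at a speed of about 4 ms per token (with a batch size of one) on 8 NVIDIA H100 GPUs, which is 10% faster than the previously best known method. Furthermore, for EAGLE-based speculative decoding, our optimizations enable us to achieve a speed-up for large batch sizes between 1.4x and 2.0x at production scale.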

[92] Human-Alignment and Calibration of Inference-Time Uncertainty in Large Language Models

Kyle Moore,Jesse Roberts,Daryl Watson

Main category: cs.CL

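TL;DR: This paper studies how well inference-time uncertainty measures align with human uncertainty and with model calibration, finding that many measures align strongly with human uncertainty and show good calibration.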

Details

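Motivation: There has been strong recent interest in evaluating the uncertainty calibration of large language models to facilitate model control and modulate user trust. Inference-time uncertainty is particularly important because it can provide a real-time signal to the model or to external control modules. However, comparatively little work has examined how closely model uncertainty aligns with human uncertainty. Method: The authors evaluate a collection of inference-time uncertainty measures, using both established metrics and novel variations, to determine how closely they align with human group-level uncertainty and with traditional notions of model calibration. Result: Many uncertainty measures show strong alignment with human uncertainty, and for the successful measures there is moderate to strong evidence of model calibration in terms of both correctness correlation and distributional analysis. Conclusion: Many uncertainty measures align strongly with human uncertainty even without alignment to human answer preference, and for the successful measures the models show moderate to strong evidence of calibration. Abstract: There has been much recent interest in evaluating large language models for uncertainty calibration to facilitate model control and modulate user trust. Inference-time uncertainty, which may provide a real-time signal to the model or external control modules, is particularly important for applying these concepts to improve LLM-user experience in practice. While many of the existing papers consider model calibration, comparatively little work has sought to evaluate how closely model uncertainty aligns to human uncertainty. In this work, we evaluate a collection of inference-time uncertainty measures, using both established metrics and novel variations, to determine how closely they align with both human group-level uncertainty and traditional notions of model calibration. We find that numerous measures show evidence of strong alignment to human uncertainty, despite the lack of alignment to human answer preference. For those successful metrics, we find moderate to strong evidence of model calibration in terms of both correctness correlation and distributional analysis.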

[93] SAEMark: Multi-bit LLM Watermarking with Inference-Time Scaling

Zhuohao Yu,Xingru Jiang,Weizheng Gu,Yidong Wang,Shikun Zhang,Wei Ye

Main category: cs.CL

TL;DR: SAEMark is a general framework for post-hoc multi-bit watermarking that operates on deterministic features extracted from generated text, preserving text quality and enabling content attribution with closed-source LLMs.

Details

Motivation: Existing methods for watermarking LLM-generated text have limitations such as compromising text quality and requiring white-box model access, excluding API-based models and multilingual scenarios. Method: SAEMark uses a post-hoc multi-bit watermarking approach via inference-time feature-based rejection sampling without altering model logits or requiring training. Result: Experiments across 4 datasets show SAEMark's consistent performance with 99.7% F1 on English and strong multi-bit detection accuracy. Conclusion: SAEMark provides a new paradigm for scalable watermarking that works efficiently with closed-source LLMs, enabling content attribution while maintaining text quality. Abstract: Watermarking LLM-generated text is critical for content attribution and misinformation prevention. However, existing methods compromise text quality, require white-box model access and logit manipulation. These limitations exclude API-based models and multilingual scenarios. We propose SAEMark, a general framework for post-hoc multi-bit watermarking that embeds personalized messages solely via inference-time, feature-based rejection sampling without altering model logits or requiring training. Our approach operates on deterministic features extracted from generated text, selecting outputs whose feature statistics align with key-derived targets. This framework naturally generalizes across languages and domains while preserving text quality through sampling LLM outputs instead of modifying. We provide theoretical guarantees relating watermark success probability and compute budget that hold for any suitable feature extractor. Empirically, we demonstrate the framework's effectiveness using Sparse Autoencoders (SAEs), achieving superior detection accuracy and text quality. Experiments across 4 datasets show SAEMark's consistent performance, with 99.7% F1 on English and strong multi-bit detection accuracy. 
SAEMark establishes a new paradigm for scalable watermarking that works out-of-the-box with closed-source LLMs while enabling content attribution.
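
The feature-based rejection sampling idea can be sketched in a few lines. This is a toy illustration only: the paper uses Sparse Autoencoder features, whereas the `vowel_fraction` feature, the SHA-256 key derivation, and all function names below are assumptions chosen to keep the example self-contained.

```python
import hashlib

def key_target(secret: str, message_bit: int) -> float:
    # Derive a pseudo-random target in [0, 1) from the secret key and the
    # message bit to embed (assumed derivation, not the paper's).
    digest = hashlib.sha256(f"{secret}:{message_bit}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") / 2**32

def vowel_fraction(text: str) -> float:
    # Deterministic toy feature of the text; a hypothetical stand-in for
    # SAE feature statistics.
    letters = [c for c in text.lower() if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c in "aeiou" for c in letters) / len(letters)

def select_watermarked(candidates, secret, message_bit, feature=vowel_fraction):
    # Rejection-sampling step: among sampled outputs, keep the one whose
    # feature value is closest to the key-derived target. Model logits are
    # never modified; only the choice among outputs carries the signal.
    target = key_target(secret, message_bit)
    return min(candidates, key=lambda text: abs(feature(text) - target))
```

A detector holding the same secret would recompute the feature on a text and test which message bit's target it lies closest to; a larger candidate pool raises the chance of a close match, which is the compute-versus-success trade-off the abstract refers to.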

[94] Capabilities of GPT-5 on Multimodal Medical Reasoning

Shansong Wang,Mingzhe Hu,Qiang Li,Mojtaba Safari,Xiaofeng Yang

Main category: cs.CL

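TL;DR: This study evaluates the zero-shot chain-of-thought reasoning performance of GPT-5 as a generalist multimodal reasoner for medical decision support and compares it with other models.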

Details

Motivation: Medical decision-making often requires integrating heterogeneous information sources, including patient narratives, structured data, and medical images. Recent advances in large language models (LLMs) have enabled general-purpose systems to perform increasingly complex domain-specific reasoning without extensive fine-tuning. Method: Under a unified protocol, GPT-5, GPT-5-mini, GPT-5-nano, and GPT-4o-2024-11-20 are benchmarked on standardized splits of MedQA, MedXpertQA, MMLU medical subsets, USMLE self-assessment exams, and VQA-RAD. Result: GPT-5 consistently outperforms all baselines, achieving state-of-the-art accuracy across all QA benchmarks and substantial gains in multimodal reasoning. On MedXpertQA MM, GPT-5 improves reasoning and understanding scores by +29.62% and +36.18% over GPT-4o, respectively, and surpasses pre-licensed human experts by +24.23% in reasoning and +29.40% in understanding. Conclusion: GPT-5 achieves multimodal reasoning performance above that of human experts, which may substantially influence the design of future clinical decision-support systems. Abstract: Recent advances in large language models (LLMs) have enabled general-purpose systems to perform increasingly complex domain-specific reasoning without extensive fine-tuning. In the medical domain, decision-making often requires integrating heterogeneous information sources, including patient narratives, structured data, and medical images. This study positions GPT-5 as a generalist multimodal reasoner for medical decision support and systematically evaluates its zero-shot chain-of-thought reasoning performance on both text-based question answering and visual question answering tasks under a unified protocol. We benchmark GPT-5, GPT-5-mini, GPT-5-nano, and GPT-4o-2024-11-20 against standardized splits of MedQA, MedXpertQA (text and multimodal), MMLU medical subsets, USMLE self-assessment exams, and VQA-RAD. Results show that GPT-5 consistently outperforms all baselines, achieving state-of-the-art accuracy across all QA benchmarks and delivering substantial gains in multimodal reasoning. On MedXpertQA MM, GPT-5 improves reasoning and understanding scores by +29.62% and +36.18% over GPT-4o, respectively, and surpasses pre-licensed human experts by +24.23% in reasoning and +29.40% in understanding. In contrast, GPT-4o remains below human expert performance in most dimensions. A representative case study demonstrates GPT-5's ability to integrate visual and textual cues into a coherent diagnostic reasoning chain, recommending appropriate high-stakes interventions. 
Our results show that, on these controlled multimodal reasoning benchmarks, GPT-5 moves from human-comparable to above human-expert performance. This improvement may substantially inform the design of future clinical decision-support systems.

[95] Exploring Safety Alignment Evaluation of LLMs in Chinese Mental Health Dialogues via LLM-as-Judge

Yunna Cai,Fan Wang,Haowei Wang,Kun Wang,Kailai Yang,Sophia Ananiadou,Moyan Li,Mingming Fan

Main category: cs.CL

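TL;DR: This paper proposes PsyCrisis-Bench, a reference-free evaluation benchmark, together with a high-quality Chinese mental health dialogue dataset, for evaluating the safety alignment of large language models in high-risk mental health dialogues.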

Details

Motivation: Evaluating the safety alignment of large language models in high-risk mental health dialogues is challenging because gold-standard answers are missing and the interactions are ethically sensitive. Method: A prompt-based LLM-as-Judge approach, combined with expert-defined reasoning chains, applies binary point-wise scoring to model responses across multiple safety dimensions. Result: Across 3600 judgments, the method achieves the highest agreement with expert assessments, and its evaluation rationales are more interpretable than those of existing approaches. Conclusion: As a reference-free evaluation method, PsyCrisis-Bench performs well at evaluating the safety alignment of large language models in high-risk mental health dialogues and offers better explainability and traceability than existing approaches. Abstract: Evaluating the safety alignment of LLM responses in high-risk mental health dialogues is particularly difficult due to missing gold-standard answers and the ethically sensitive nature of these interactions. To address this challenge, we propose PsyCrisis-Bench, a reference-free evaluation benchmark based on real-world Chinese mental health dialogues. It evaluates whether the model responses align with the safety principles defined by experts. Specifically designed for settings without standard references, our method adopts a prompt-based LLM-as-Judge approach that conducts in-context evaluation using expert-defined reasoning chains grounded in psychological intervention principles. We employ binary point-wise scoring across multiple safety dimensions to enhance the explainability and traceability of the evaluation. Additionally, we present a manually curated, high-quality Chinese-language dataset covering self-harm, suicidal ideation, and existential distress, derived from real-world online discourse. Experiments on 3600 judgments show that our method achieves the highest agreement with expert assessments and produces more interpretable evaluation rationales compared to existing approaches. Our dataset and evaluation tool are publicly available to facilitate further research.

[96] Jinx: Unlimited LLMs for Probing Alignment Failures

Jiahao Zhao,Liwei Dong

Main category: cs.CL

TL;DR: Jinx is a helpful-only language model developed to aid researchers in studying alignment failures and safety boundaries in AI systems.

Details

Motivation: Unlimited or helpful-only language models are essential for assessing safety alignment in AI systems but are not available to the research community, which motivated the development of Jinx. Method: The authors created Jinx by modifying popular open-weight LLMs to remove safety alignment constraints, enabling unhindered response generation. Result: Jinx was successfully developed as a helpful-only variant of open LLMs that responds to all queries without safety filtering while maintaining the capabilities of the base model. Conclusion: Jinx serves as an accessible tool for researchers to probe alignment failures, evaluate safety boundaries, and systematically study failure modes in language model safety. Abstract: Unlimited, or so-called helpful-only language models are trained without safety alignment constraints and never refuse user queries. They are widely used by leading AI companies as internal tools for red teaming and alignment evaluation. For example, if a safety-aligned model produces harmful outputs similar to an unlimited model, this indicates alignment failures that require further attention. Despite their essential role in assessing alignment, such models are not available to the research community. We introduce Jinx, a helpful-only variant of popular open-weight LLMs. Jinx responds to all queries without refusals or safety filtering, while preserving the base model's capabilities in reasoning and instruction following. It provides researchers with an accessible tool for probing alignment failures, evaluating safety boundaries, and systematically studying failure modes in language model safety.

cs.CV [Back]

[97] Med-GRIM: Enhanced Zero-Shot Medical VQA using prompt-embedded Multimodal Graph RAG

Rakesh Raj Madavan,Akshat Kaimal,Hashim Faisal,Chandrakala S

Main category: cs.CV

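TL;DR: Med-GRIM is an efficient medical VQA model that combines dense encoding, graph retrieval, and prompt engineering to achieve high accuracy at low computational cost; the paper also introduces the DermaGraph dataset to support medical research.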

Details

Motivation: Existing multimodal encoders and vision-language models lack the detailed precision needed for medical VQA, so a more efficient and accurate approach is required. Method: The BIND model refines the joint embedding space through dense query-token encodings, and is combined with graph retrieval and prompt engineering to integrate domain-specific knowledge. Result: Med-GRIM achieves large-language-model-level performance on medical VQA at a significantly lower computational cost, and the paper introduces the DermaGraph dataset, which supports both multimodal and unimodal queries. Conclusion: Using a low-compute, modular workflow with prompt-based retrieval, Med-GRIM achieves efficient and accurate performance on medical VQA, while the DermaGraph dataset supports scalable research on zero-shot multimodal medical applications. Abstract: An ensemble of trained multimodal encoders and vision-language models (VLMs) has become a standard approach for visual question answering (VQA) tasks. However, such models often fail to produce responses with the detailed precision necessary for complex, domain-specific applications such as medical VQA. Our representation model, BIND: BLIVA Integrated with Dense Encoding, extends prior multimodal work by refining the joint embedding space through dense, query-token-based encodings inspired by contrastive pretraining techniques. This refined encoder powers Med-GRIM, a model designed for medical VQA tasks that leverages graph-based retrieval and prompt engineering to integrate domain-specific knowledge. Rather than relying on compute-heavy fine-tuning of vision and language models on specific datasets, Med-GRIM applies a low-compute, modular workflow with small language models (SLMs) for efficiency. Med-GRIM employs prompt-based retrieval to dynamically inject relevant knowledge, ensuring both accuracy and robustness in its responses. By assigning distinct roles to each agent within the VQA system, Med-GRIM achieves large language model performance at a fraction of the computational cost. Additionally, to support scalable research in zero-shot multimodal medical applications, we introduce DermaGraph, a novel Graph-RAG dataset comprising diverse dermatological conditions. This dataset facilitates both multimodal and unimodal querying. The code and dataset are available at: https://github.com/Rakesh-123-cryp/Med-GRIM.git

[98] DiTalker: A Unified DiT-based Framework for High-Quality and Speaking Styles Controllable Portrait Animation

He Feng,Yongjia Ma,Donglin Di,Lei Fan,Tonghua Su,Xiangqian Wu

Main category: cs.CV

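TL;DR: This paper proposes DiTalker, a DiT-based framework for speaking-style-controllable portrait animation that outperforms existing methods in lip synchronization and style control.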

Details

Motivation: Existing diffusion-based portrait animation methods focus mainly on lip synchronization or static emotion transformation and overlook dynamic styles such as head movements. In addition, these methods typically rely on a dual U-Net architecture, which adds computational overhead. Method: DiTalker adopts a DiT-based framework with a Style-Emotion Encoding Module for extracting style information and an Audio-Style Fusion Module for fusing audio and style information, and introduces two optimization constraints to improve animation quality. Result: Extensive experiments show that DiTalker outperforms existing methods in lip synchronization and speaking-style controllability. Conclusion: DiTalker is a unified DiT-based framework for speaking-style-controllable portrait animation with superior lip synchronization and style control. Abstract: Portrait animation aims to synthesize talking videos from a static reference face, conditioned on audio and style frame cues (e.g., emotion and head poses), while ensuring precise lip synchronization and faithful reproduction of speaking styles. Existing diffusion-based portrait animation methods primarily focus on lip synchronization or static emotion transformation, often overlooking dynamic styles such as head movements. Moreover, most of these methods rely on a dual U-Net architecture, which preserves identity consistency but incurs additional computational overhead. To this end, we propose DiTalker, a unified DiT-based framework for speaking style-controllable portrait animation. We design a Style-Emotion Encoding Module that employs two separate branches: a style branch extracting identity-specific style information (e.g., head poses and movements), and an emotion branch extracting identity-agnostic emotion features. We further introduce an Audio-Style Fusion Module that decouples audio and speaking styles via two parallel cross-attention layers, using these features to guide the animation process. To enhance the quality of results, we adopt and modify two optimization constraints: one to improve lip synchronization and the other to preserve fine-grained identity and background details. Extensive experiments demonstrate the superiority of DiTalker in terms of lip synchronization and speaking style controllability. Project Page: https://thenameishope.github.io/DiTalker/

[99] BigTokDetect: A Clinically-Informed Vision-Language Model Framework for Detecting Pro-Bigorexia Videos on TikTok

Minh Duc Chu,Kshitij Pawar,Zihao He,Roxanna Sharifi,Ross Sonnenblick,Magdalayna Curry,Laura D'Adamo,Lindsay Young,Stuart B Murray,Kristina Lerman

Main category: cs.CV

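TL;DR: This study develops BigTokDetect, a multimodal detection framework for identifying pro-bigorexia content on TikTok, and the accompanying BigTok dataset; multimodal approaches clearly outperform text-only approaches, providing a new benchmark and practical tools for harmful content detection.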

Details

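Motivation: Social media platforms struggle to detect harmful content that promotes muscle dysmorphic behaviors, particularly pro-bigorexia content that affects adolescent males. Such content often masquerades as legitimate fitness content and evades traditional text-based detection systems through multimodal combinations, so new detection frameworks and datasets are needed. Method: The authors develop the BigTokDetect framework and build BigTok, the first expert-annotated multimodal dataset, containing over 2,200 TikTok videos labeled by clinical psychologists and psychiatrists. They comprehensively evaluate state-of-the-art vision-language models with domain-specific fine-tuning on primary-category and subcategory detection, and run ablation studies to assess the benefit of multimodal fusion over text-only approaches. Result: With domain-specific fine-tuning, the study reaches 82.9% accuracy on primary category classification and 69.0% on subcategory detection. Ablation studies show that multimodal fusion improves performance by 5-10% over text-only approaches, with video features providing the most discriminative signals. Conclusion: The clinically informed BigTokDetect framework, together with the expert-annotated multimodal BigTok dataset, can effectively identify pro-bigorexia content on TikTok. Multimodal fusion improves performance by 5-10% over text-only approaches, establishing a new benchmark for multimodal harmful content detection and providing content moderation tools and a methodological framework that can scale to specialized mental health domains. Abstract: Social media platforms increasingly struggle to detect harmful content that promotes muscle dysmorphic behaviors, particularly pro-bigorexia content that disproportionately affects adolescent males. Unlike traditional eating disorder detection focused on the "thin ideal," pro-bigorexia material masquerades as legitimate fitness content through complex multimodal combinations of visual displays, coded language, and motivational messaging that evade text-based detection systems. We address this challenge by developing BigTokDetect, a clinically-informed detection framework for identifying pro-bigorexia content on TikTok. We introduce BigTok, the first expert-annotated multimodal dataset of over 2,200 TikTok videos labeled by clinical psychologists and psychiatrists across five primary categories spanning body image, nutrition, exercise, supplements, and masculinity. Through a comprehensive evaluation of state-of-the-art vision language models, we achieve 82.9% accuracy on primary category classification and 69.0% on subcategory detection via domain-specific finetuning. Our ablation studies demonstrate that multimodal fusion improves performance by 5-10% over text-only approaches, with video features providing the most discriminative signals. 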
These findings establish new benchmarks for multimodal harmful content detection and provide both the computational tools and methodological framework needed for scalable content moderation in specialized mental health domains.

[100] Frequency Prior Guided Matching: A Data Augmentation Approach for Generalizable Semi-Supervised Polyp Segmentation

Haoran Xi,Chen Liu,Xiaolin Li

Main category: cs.CV

TL;DR: This paper proposes Frequency Prior Guided Matching (FPGM), a novel semi-supervised learning framework for automated polyp segmentation. By leveraging the consistent frequency signature of polyp edges, FPGM enhances cross-domain robustness and achieves state-of-the-art performance with significant improvements in data-scarce scenarios.

Details

Motivation: Automated polyp segmentation is crucial for early diagnosis of colorectal cancer, but robust model development is hindered by limited annotated data and performance issues under domain shifts. Current semi-supervised methods use generic augmentations that do not consider polyp-specific structural properties, leading to poor generalization. Method: The paper introduces Frequency Prior Guided Matching (FPGM), a novel augmentation framework that leverages the consistent frequency signature of polyp edges across datasets. It uses a domain-invariant frequency prior in a two-stage process involving spectral perturbations to align amplitude spectra while preserving phase information. Result: FPGM achieved a new state-of-the-art performance against ten competing methods across six public datasets. It demonstrated exceptional zero-shot generalization, with over a 10% absolute gain in Dice score in data-scarce scenarios. Conclusion: FPGM presents a powerful solution for clinically deployable polyp segmentation under limited supervision by significantly enhancing cross-domain robustness. Abstract: Automated polyp segmentation is essential for early diagnosis of colorectal cancer, yet developing robust models remains challenging due to limited annotated data and significant performance degradation under domain shift. Although semi-supervised learning (SSL) reduces annotation requirements, existing methods rely on generic augmentations that ignore polyp-specific structural properties, resulting in poor generalization to new imaging centers and devices. To address this, we introduce Frequency Prior Guided Matching (FPGM), a novel augmentation framework built on a key discovery: polyp edges exhibit a remarkably consistent frequency signature across diverse datasets. FPGM leverages this intrinsic regularity in a two-stage process. It first learns a domain-invariant frequency prior from the edge regions of labeled polyps. 
Then, it performs principled spectral perturbations on unlabeled images, aligning their amplitude spectra with this learned prior while preserving phase information to maintain structural integrity. This targeted alignment normalizes domain-specific textural variations, thereby compelling the model to learn the underlying, generalizable anatomical structure. Validated on six public datasets, FPGM establishes a new state-of-the-art against ten competing methods. It demonstrates exceptional zero-shot generalization capabilities, achieving over 10% absolute gain in Dice score in data-scarce scenarios. By significantly enhancing cross-domain robustness, FPGM presents a powerful solution for clinically deployable polyp segmentation under limited supervision.
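
The core spectral operation, moving an input's amplitude spectrum toward a prior while leaving its phase untouched, can be shown on a 1-D toy signal. This is a sketch, not the FPGM code: the paper works on 2-D images with a learned edge-frequency prior, while the naive DFT, the blending weight `lam`, and the function names here are assumptions.

```python
import cmath
import math

def dft(x):
    # Naive O(n^2) discrete Fourier transform of a real or complex sequence.
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)) for k in range(n)]

def idft(X):
    # Inverse of dft() above.
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n for t in range(n)]

def align_amplitude(signal, prior_amplitude, lam=0.5):
    # Blend each frequency's amplitude toward the prior while keeping its
    # original phase, then transform back. The prior should be symmetric
    # (prior[k] == prior[n - k]) so the result stays real; any residual
    # imaginary part is discarded.
    mixed = []
    for coeff, prior in zip(dft(signal), prior_amplitude):
        amplitude = (1.0 - lam) * abs(coeff) + lam * prior
        mixed.append(cmath.rect(amplitude, cmath.phase(coeff)))
    return [value.real for value in idft(mixed)]
```

Setting `lam=0` returns the input unchanged, and feeding a signal its own amplitude spectrum as the prior is also an identity, which gives two quick sanity checks before applying the idea to image spectra.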

[101] Large Language Models Facilitate Vision Reflection in Image Classification

Guoyuan An,JaeYoon Kim,SungEui Yoon

Main category: cs.CV

TL;DR: This paper explores how vision reflection in large multimodal models improves visual recognition accuracy and interpretability through internal analysis and experiments.

Details

Motivation: The motivation is to understand the explainability and performance of vision reflection in large multimodal models (LMMs), especially given their typically inferior performance compared to dedicated vision encoders. Method: The researchers explored vision reflection in LMMs by analyzing their internal behavior and conducting experiments involving prediction verification, vision-language connector analysis, and testing with reduced vision tokens. Result: Experiments showed improved recognition accuracy through vision reflection, reliance on distilled textual representations, and enhanced performance with a training-free connector in fine-grained recognition tasks. Conclusion: The study concludes that vision reflection in LMMs provides improved recognition accuracy and interpretability, suggesting its potential for robust visual recognition. Abstract: This paper presents several novel findings on the explainability of vision reflection in large multimodal models (LMMs). First, we show that prompting an LMM to verify the prediction of a specialized vision model can improve recognition accuracy, even on benchmarks like ImageNet, despite prior evidence that LMMs typically underperform dedicated vision encoders. Second, we analyze the internal behavior of vision reflection and find that the vision-language connector maps visual features into explicit textual concepts, allowing the language model to reason about prediction plausibility using commonsense knowledge. We further observe that replacing a large number of vision tokens with only a few text tokens still enables LLaVA to generate similar answers, suggesting that LMMs may rely primarily on a compact set of distilled textual representations rather than raw vision features. Third, we show that a training-free connector can enhance LMM performance in fine-grained recognition tasks, without extensive feature-alignment training. 
Together, these findings offer new insights into the explainability of vision-language models and suggest that vision reflection is a promising strategy for achieving robust and interpretable visual recognition.

[102] A Framework Combining 3D CNN and Transformer for Video-Based Behavior Recognition

Xiuliang Zhang,Tadiwa Elisha Nyamasvisva,Chuntao Liu

Main category: cs.CV

TL;DR: The paper proposes a hybrid framework combining 3D CNN and Transformer architectures for video-based behavior recognition.

Details

Motivation: Traditional 3D CNNs struggle to model long-range dependencies, while Transformers face high computational costs. Method: The paper proposes a hybrid framework that combines 3D CNN and Transformer architectures. Result: Evaluation on benchmark datasets shows that the model outperforms traditional 3D CNNs and standalone Transformers, achieving higher recognition accuracy with manageable complexity. Conclusion: The hybrid framework offers an effective and scalable solution for video-based behavior recognition. Abstract: Video-based behavior recognition is essential in fields such as public safety, intelligent surveillance, and human-computer interaction. Traditional 3D Convolutional Neural Networks (3D CNNs) effectively capture local spatiotemporal features but struggle with modeling long-range dependencies. Conversely, Transformers excel at learning global contextual information but face challenges with high computational costs. To address these limitations, we propose a hybrid framework combining 3D CNN and Transformer architectures. The 3D CNN module extracts low-level spatiotemporal features, while the Transformer module captures long-range temporal dependencies, with a fusion mechanism integrating both representations. Evaluated on benchmark datasets, the proposed model outperforms traditional 3D CNNs and standalone Transformers, achieving higher recognition accuracy with manageable complexity. Ablation studies further validate the complementary strengths of the two modules. This hybrid framework offers an effective and scalable solution for video-based behavior recognition.

[103] RMT-PPAD: Real-time Multi-task Learning for Panoptic Perception in Autonomous Driving

Jiayuan Wang,Q. M. Jonathan Wu,Katsuya Suto,Ning Zhang

Main category: cs.CV

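TL;DR: This work proposes RMT-PPAD, a real-time multi-task model for perception in autonomous driving. By introducing a lightweight module and adaptive mechanisms, it alleviates negative transfer between tasks and achieves state-of-the-art performance on multiple tasks.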

Details

Motivation: Autonomous driving systems need precise, real-time perception, which requires alleviating negative transfer between tasks while improving real-time performance and generalization. Method: RMT-PPAD is a Transformer-based multi-task learning framework that introduces a lightweight module and an adaptive segmentation decoder, and resolves the inconsistency between training and testing labels in lane line segmentation. Result: Experiments on the BDD100K dataset show that RMT-PPAD achieves state-of-the-art performance on object detection, drivable area segmentation, and lane line segmentation, with an inference speed of 32.6 FPS. Conclusion: RMT-PPAD achieves state-of-the-art performance on multiple tasks and holds up well in both inference speed and real-world scenarios. Abstract: Autonomous driving systems rely on panoptic driving perception that requires both precision and real-time performance. In this work, we propose RMT-PPAD, a real-time, transformer-based multi-task model that jointly performs object detection, drivable area segmentation, and lane line segmentation. We introduce a lightweight module, a gate control with an adapter, to adaptively fuse shared and task-specific features, effectively alleviating negative transfer between tasks. Additionally, we design an adaptive segmentation decoder to learn the weights over multi-scale features automatically during the training stage. This avoids the manual design of task-specific structures for different segmentation tasks. We also identify and resolve the inconsistency between training and testing labels in lane line segmentation. This allows fairer evaluation. Experiments on the BDD100K dataset demonstrate that RMT-PPAD achieves state-of-the-art results with mAP50 of 84.9% and Recall of 95.4% for object detection, mIoU of 92.6% for drivable area segmentation, and IoU of 56.8% and accuracy of 84.7% for lane line segmentation. The inference speed reaches 32.6 FPS. Moreover, we introduce real-world scenarios to evaluate RMT-PPAD performance in practice. The results show that RMT-PPAD consistently delivers stable performance. The source code and pre-trained models are released at https://github.com/JiayuanWang-JW/RMT-PPAD.
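One way to picture the gate-controlled fusion of shared and task-specific features is as a per-channel blend whose mixing weight comes from a gate. The sketch below is an illustration under assumptions, not the released RMT-PPAD code: the function name `gated_fuse`, the sigmoid gate, and the convex-combination form are hypothetical choices, and in a real model the gate logits would be produced by learned layers.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a logit to the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fuse(shared, task_specific, gate_logits):
    """Blend two feature vectors channel by channel (hypothetical form).

    A gate value near 1 keeps the shared feature; a value near 0 keeps
    the task-specific feature.
    """
    if not (len(shared) == len(task_specific) == len(gate_logits)):
        raise ValueError("all inputs must have the same length")
    fused = []
    for s, t, z in zip(shared, task_specific, gate_logits):
        g = sigmoid(z)
        fused.append(g * s + (1.0 - g) * t)
    return fused


# A zero logit gives an even 50/50 blend of the two feature sources.
print(gated_fuse([1.0, 2.0], [3.0, 4.0], [0.0, 0.0]))
```

Because each task can learn its own gate, a task that is hurt by a shared channel can down-weight it, which is the intuition behind reducing negative transfer.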

[104] What Makes "Good" Distractors for Object Hallucination Evaluation in Large Vision-Language Models?

Ming-Kun Xie,Jia-Hao Xiao,Gang Niu,Lei Feng,Zhiqiang Kou,Min-Ling Zhang,Masashi Sugiyama

Main category: cs.CV

TL;DR: This paper introduces HOPE, a new benchmark for evaluating object hallucination in LVLMs, which leverages image-specific information and description-based distractors to more rigorously assess model vulnerabilities.

Details

Motivation: LVLMs suffer from object hallucination issues, and the existing POPE benchmark is ineffective due to its simplistic sampling strategy that overlooks image-specific information. Method: HOPE uses CLIP to select negative objects with the highest predicted likelihood and pairs true objects with false descriptions to generate more effective distractors. Result: HOPE causes a precision drop of at least 9% and up to 23% in LVLMs, demonstrating its superior effectiveness in identifying hallucination vulnerabilities compared to POPE. Conclusion: HOPE benchmark effectively exposes hallucination vulnerabilities in LVLMs, outperforming POPE by significantly dropping precision across state-of-the-art models. Abstract: Large Vision-Language Models (LVLMs), empowered by the success of Large Language Models (LLMs), have achieved impressive performance across domains. Despite the great advances in LVLMs, they still suffer from object hallucination, in which the model generates objects inconsistent with the image content. The most commonly used Polling-based Object Probing Evaluation (POPE) benchmark evaluates this issue by sampling negative categories according to category-level statistics, e.g., category frequencies and co-occurrence. However, with the continuous advancement of LVLMs, the POPE benchmark has shown diminishing effectiveness in assessing object hallucination, as it employs a simplistic sampling strategy that overlooks image-specific information and restricts distractors to negative object categories only. In this paper, we introduce the Hallucination searching-based Object Probing Evaluation (HOPE) benchmark, aiming to generate the most misleading distractors (i.e., non-existent objects or incorrect image descriptions) that can trigger hallucination in LVLMs, which serves as a means to more rigorously assess their immunity to hallucination. 
To explore the image-specific information, the content-aware hallucination searching leverages Contrastive Language-Image Pre-Training (CLIP) to approximate the predictive behavior of LVLMs by selecting negative objects with the highest predicted likelihood as distractors. To expand the scope of hallucination assessment, the description-based hallucination searching constructs highly misleading distractors by pairing true objects with false descriptions. Experimental results show that HOPE leads to a precision drop of at least 9% and up to 23% across various state-of-the-art LVLMs, significantly outperforming POPE in exposing hallucination vulnerabilities. The code is available at https://github.com/xiemk/HOPE.
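The content-aware search step, picking absent objects that a CLIP-like scorer rates as most likely for the image, can be sketched as a filtered top-k selection. This is a minimal illustration, not the HOPE code: `select_distractors` is a hypothetical name, and the score dictionary stands in for image-text similarities that would come from CLIP in practice.

```python
def select_distractors(scores, present_objects, k=3):
    """Return the k absent objects with the highest predicted likelihood.

    scores: dict mapping object name -> similarity score for one image
            (placeholder values here; assumed to come from a CLIP-like model).
    present_objects: set of objects that really appear in the image.
    """
    negatives = {
        name: score
        for name, score in scores.items()
        if name not in present_objects
    }
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(negatives.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:k]]


# Hypothetical scores for an image that contains a dog and a frisbee.
example_scores = {"dog": 0.9, "cat": 0.7, "frisbee": 0.6, "bench": 0.4, "car": 0.1}
print(select_distractors(example_scores, {"dog", "frisbee"}, k=2))
```

In contrast to sampling negatives from category-level frequency or co-occurrence statistics, this selection depends on the individual image, which is the point the abstract makes about image-specific information.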

[105] Benchmarking Deep Learning-Based Object Detection Models on Feature Deficient Astrophotography Imagery Dataset

Shantanusinh Parmar

Main category: cs.CV

TL;DR: This paper discusses the limitations of current object detection models trained on common datasets and introduces MobilTelesco, a dataset for astrophotography, to address these limitations.

Details

Motivation: The motivation is to address the lack of signal sparsity in current object detection datasets, which limits the applicability of detection models in non-commercial domains like astrophotography. Method: The authors introduce MobilTelesco, a smartphone-based astrophotography dataset, and benchmark several detection models on it to evaluate their performance under feature-deficient conditions. Result: The results highlight the challenges faced by detection models when applied to sparse night-sky images and suggest the need for improved models that can handle feature-deficient conditions. Conclusion: The paper concludes that MobilTelesco provides a valuable resource for evaluating and improving object detection models under feature-deficient conditions, particularly in non-commercial domains. Abstract: Object detection models are typically trained on datasets like ImageNet, COCO, and PASCAL VOC, which focus on everyday objects. However, these datasets lack the signal sparsity found in non-commercial domains. MobilTelesco, a smartphone-based astrophotography dataset, addresses this by providing sparse night-sky images. We benchmark several detection models on it, highlighting challenges under feature-deficient conditions.

[106] MILD: Multi-Layer Diffusion Strategy for Complex and Precise Multi-IP Aware Human Erasing

Jinghan Yu,Zhiyuan Ma,Yue Ma,Kaiqi Liu,Yuhan Wang,Jianjun Li

Main category: cs.CV

TL;DR: The paper presents MILD, a new approach for human erasing in complex scenarios, which outperforms existing methods by decomposing generation into semantically separated pathways and enhancing human-centric understanding.

Details

Motivation: Existing methods struggle with complex multi-IP scenarios involving occlusions and background interferences due to dataset limitations and lack of spatial decoupling. Method: The authors introduced a high-quality multi-IP human erasing dataset and proposed Multi-Layer Diffusion (MILD), which decomposes generation into semantically separated pathways. They also introduced Human Morphology Guidance and Spatially-Modulated Attention to enhance understanding and attention flow. Result: MILD outperforms state-of-the-art methods on challenging human erasing benchmarks. Conclusion: MILD effectively addresses the challenges of human erasing in complex multi-IP scenarios by decomposing generation into semantically separated pathways and enhancing human-centric understanding. Abstract: Recent years have witnessed the success of diffusion models in image-customized tasks. Prior works have achieved notable progress on human-oriented erasing using explicit mask guidance and semantic-aware inpainting. However, they struggle under complex multi-IP scenarios involving human-human occlusions, human-object entanglements, and background interferences. These challenges are mainly due to: 1) Dataset limitations, as existing datasets rarely cover dense occlusions, camouflaged backgrounds, and diverse interactions; 2) Lack of spatial decoupling, where foreground instances cannot be effectively disentangled, limiting clean background restoration. In this work, we introduce a high-quality multi-IP human erasing dataset with diverse pose variations and complex backgrounds. We then propose Multi-Layer Diffusion (MILD), a novel strategy that decomposes generation into semantically separated pathways for each instance and the background. To enhance human-centric understanding, we introduce Human Morphology Guidance, integrating pose, parsing, and spatial relations. We further present Spatially-Modulated Attention to better guide attention flow. 
Extensive experiments show that MILD outperforms state-of-the-art methods on challenging human erasing benchmarks.

[107] Statistical Confidence Rescoring for Robust 3D Scene Graph Generation from Multi-View Images

Qi Xun Yeo,Yanyan Li,Gim Hee Lee

Main category: cs.CV

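TL;DR: This paper proposes a 3D semantic scene graph estimation method that requires no 3D annotations, using semantic masks and neighbor information to strengthen feature representations and improve estimation accuracy.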

Details

Motivation: Existing methods rely on 3D annotations; this paper aims at accurate scene graph estimation from multi-view RGB images alone, without 3D ground truth. Method: Using multi-view RGB images, the approach builds pseudo point-cloud geometry from predicted depth maps, filters background features with semantic masks, incorporates neighboring node information to strengthen feature representations, and refines predictions with statistical priors from the training set. Result: Experiments show that the method outperforms current methods that use only multi-view images as input. Conclusion: The paper proposes a 3D semantic scene graph estimation method based on multi-view RGB images that improves robustness through semantic masks and neighboring node information and refines node and edge predictions with explicit statistical priors. Abstract: Modern 3D semantic scene graph estimation methods utilize ground truth 3D annotations to accurately predict target objects, predicates, and relationships. In the absence of given 3D ground truth representations, we explore leveraging only multi-view RGB images to tackle this task. To attain robust features for accurate scene graph estimation, we must overcome the noisy reconstructed pseudo point-based geometry from predicted depth maps and reduce the amount of background noise present in multi-view image features. The key is to enrich node and edge features with accurate semantic and spatial information and through neighboring relations. We obtain semantic masks to guide feature aggregation to filter background features and design a novel method to incorporate neighboring node information to aid the robustness of our scene graph estimates. Furthermore, we leverage explicit statistical priors calculated from the training summary statistics to refine node and edge predictions based on their one-hop neighborhood. Our experiments show that our method outperforms current methods purely using multi-view images as the initial input. Our project page is available at https://qixun1.github.io/projects/SCRSSG.
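The idea of refining a node prediction with statistical priors over its one-hop neighborhood can be illustrated with a simple rescoring rule: multiply each class probability by how often that class co-occurs with the neighbors' predicted labels in the training set, then renormalize. This is a sketch under assumptions, not the authors' formulation; the function name `rescore`, the multiplicative rule, and the smoothing constant `eps` are all hypothetical.

```python
def rescore(node_probs, neighbor_labels, cooccurrence, eps=1e-6):
    """Reweight class probabilities with neighbor co-occurrence priors.

    node_probs: dict class -> predicted probability for one node.
    neighbor_labels: predicted labels of the node's one-hop neighbors.
    cooccurrence: dict (class, neighbor_class) -> frequency estimated
                  from training statistics (hypothetical values below).
    eps: smoothing so unseen pairs do not zero out a class entirely.
    """
    scores = {}
    for cls, prob in node_probs.items():
        weight = 1.0
        for neighbor in neighbor_labels:
            weight *= cooccurrence.get((cls, neighbor), 0.0) + eps
        scores[cls] = prob * weight
    total = sum(scores.values())
    if total == 0.0:
        return dict(node_probs)  # nothing to rescale; keep the input
    return {cls: score / total for cls, score in scores.items()}


# An ambiguous node next to a table becomes more likely to be a chair.
priors = {("chair", "table"): 0.8, ("toilet", "table"): 0.2}
print(rescore({"chair": 0.5, "toilet": 0.5}, ["table"], priors))
```

The same pattern could be applied to edge (predicate) scores by conditioning on the labels of the two endpoint nodes.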

[108] Slice or the Whole Pie? Utility Control for AI Models

Ye Tao

Main category: cs.CV

TL;DR: NNObfuscator enables a single AI model to dynamically adapt its performance based on user tiers, reducing resource waste and improving efficiency compared to traditional methods requiring multiple models.

Details

Motivation: The increasing resource intensity of training deep neural networks, coupled with the inefficiency and maintenance difficulty of traditional methods that require training multiple model versions for different user needs, motivated the development of a more efficient and adaptable solution. Method: The authors proposed NNObfuscator, a utility control mechanism that enables AI models to dynamically modify their performance according to predefined conditions. They validated the approach through experiments on image classification, semantic segmentation, and text-to-image generation tasks using established models like ResNet, DeepLab, VGG16, FCN, and Stable Diffusion. Result: Experimental results demonstrated that NNObfuscator successfully makes models more adaptable, enabling a single trained model to handle a broad range of tasks without requiring extensive modifications, while also allowing for tiered performance access for users. Conclusion: NNObfuscator provides a dynamic utility control mechanism for AI models, allowing a single model to adapt performance in real-time based on user tiers, thereby improving resource allocation and supporting sustainable AI deployment. Abstract: Training deep neural networks (DNNs) has become an increasingly resource-intensive task, requiring large volumes of labeled data, substantial computational power, and considerable fine-tuning efforts to achieve optimal performance across diverse use cases. Although pre-trained models offer a useful starting point, adapting them to meet specific user needs often demands extensive customization and infrastructure overhead. This challenge grows when a single model must support diverse applications with differing requirements for performance. Traditional solutions often involve training multiple model versions to meet varying requirements, which can be inefficient and difficult to maintain. 
In order to overcome this challenge, we propose NNObfuscator, a novel utility control mechanism that enables AI models to dynamically modify their performance according to predefined conditions. Unlike traditional methods that need a separate model for each user, NNObfuscator allows a single model to be adapted in real time, giving controlled access to multiple levels of performance. This mechanism enables model owners to set up tiered access, ensuring that free-tier users receive a baseline level of performance while premium users benefit from enhanced capabilities. The approach improves resource allocation, reduces unnecessary computation, and supports sustainable business models in AI deployment. To validate our approach, we conducted experiments on multiple tasks, including image classification, semantic segmentation, and text-to-image generation, using well-established models such as ResNet, DeepLab, VGG16, FCN, and Stable Diffusion. Experimental results show that NNObfuscator makes models more adaptable, so that a single trained model can handle a broad range of tasks without extensive changes.

[109] Age-Diverse Deepfake Dataset: Bridging the Age Gap in Deepfake Detection

Unisha Joshi

Main category: cs.CV

TL;DR: This paper addresses the issue of demographic bias in deepfake datasets by introducing an age-diverse deepfake dataset that improves fairness across age groups, with the dataset and implementation code publicly available.

Details

Motivation: The challenges associated with deepfake detection are increasing significantly with the latest advancements in technology and the growing popularity of deepfake videos and images. Despite the presence of numerous detection models, demographic bias in the deepfake dataset remains largely unaddressed. Method: The dataset is constructed through a modular pipeline incorporating the existing deepfake datasets Celeb-DF, FaceForensics++, and UTKFace datasets, and the creation of synthetic data to fill the age distribution gaps. Evaluation metrics, including AUC, pAUC, and EER, revealed that models trained on the age-diverse dataset demonstrated fairer performance across age groups, improved overall accuracy, and higher generalization across datasets. Result: Models trained on the age-diverse dataset demonstrated fairer performance across age groups, improved overall accuracy, and higher generalization across datasets. Conclusion: This study contributes a reproducible, fairness-aware deepfake dataset and model pipeline that can serve as a foundation for future research in fairer deepfake detection. Abstract: The challenges associated with deepfake detection are increasing significantly with the latest advancements in technology and the growing popularity of deepfake videos and images. Despite the presence of numerous detection models, demographic bias in the deepfake dataset remains largely unaddressed. This paper focuses on the mitigation of age-specific bias in the deepfake dataset by introducing an age-diverse deepfake dataset that will improve fairness across age groups. The dataset is constructed through a modular pipeline incorporating the existing deepfake datasets Celeb-DF, FaceForensics++, and UTKFace datasets, and the creation of synthetic data to fill the age distribution gaps. The effectiveness and generalizability of this dataset are evaluated using three deepfake detection models: XceptionNet, EfficientNet, and LipForensics. 
Evaluation metrics, including AUC, pAUC, and EER, revealed that models trained on the age-diverse dataset demonstrated fairer performance across age groups, improved overall accuracy, and higher generalization across datasets. This study contributes a reproducible, fairness-aware deepfake dataset and model pipeline that can serve as a foundation for future research in fairer deepfake detection. The complete dataset and implementation code are available at https://github.com/unishajoshi/age-diverse-deepfake-detection.
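For readers unfamiliar with two of the metrics named above, the sketch below computes AUC and an approximate EER from raw detector scores. It is a generic illustration, not the paper's evaluation code. It assumes label 1 means fake, label 0 means real, and a higher score means "more likely fake"; the EER is taken at the observed threshold where the false positive and false negative rates are closest, without interpolation.

```python
def _split(labels, scores):
    """Separate scores of positive (label 1) and negative (label 0) samples."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    return pos, neg


def auc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    pos, neg = _split(labels, scores)
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def eer(labels, scores):
    """Approximate equal error rate over the observed score thresholds."""
    pos, neg = _split(labels, scores)
    best_gap, best_rate = None, None
    for threshold in sorted(set(scores)):
        fpr = sum(1 for n in neg if n >= threshold) / len(neg)
        fnr = sum(1 for p in pos if p < threshold) / len(pos)
        gap = abs(fpr - fnr)
        if best_gap is None or gap < best_gap:
            best_gap, best_rate = gap, (fpr + fnr) / 2
    return best_rate


labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.3, 0.1]
print(auc(labels, scores), eer(labels, scores))
```

To compare fairness across age groups, the same functions can be called on the subset of samples belonging to each group and the per-group values compared.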

[110] Static and Plugged: Make Embodied Evaluation Simple

Jiahao Xiao,Jianbo Zhang,BoWen Yan,Shengyu Guo,Tongrui Ye,Kaiwei Zhang,Zicheng Zhang,Xiaohong Liu,Zhengxue Cheng,Lei Fan,Chuyi Li,Guangtao Zhai

Main category: cs.CV

TL;DR: The paper proposes StaticEmbodiedBench, a scalable and unified benchmark for embodied intelligence evaluation using static scene representations, offering a simple interface and diverse scenarios.

Details

Motivation: Current benchmarks for embodied intelligence are costly, fragmented, and difficult to scale, necessitating a more efficient and unified evaluation method. Method: The authors developed StaticEmbodiedBench, a plug-and-play benchmark using static scene representations, covering 42 scenarios and 8 core dimensions. They evaluated 19 Vision-Language Models (VLMs) and 11 Vision-Language-Action models (VLAs) to create a unified static leaderboard. Result: The benchmark enables scalable and comprehensive assessment of embodied intelligence and establishes the first unified static leaderboard. A subset of 200 samples was released to accelerate development. Conclusion: The paper introduces a new benchmark called StaticEmbodiedBench for evaluating embodied intelligence, providing a scalable and unified approach with a diverse set of scenarios and a simple interface. Abstract: Embodied intelligence is advancing rapidly, driving the need for efficient evaluation. Current benchmarks typically rely on interactive simulated environments or real-world setups, which are costly, fragmented, and hard to scale. To address this, we introduce StaticEmbodiedBench, a plug-and-play benchmark that enables unified evaluation using static scene representations. Covering 42 diverse scenarios and 8 core dimensions, it supports scalable and comprehensive assessment through a simple interface. Furthermore, we evaluate 19 Vision-Language Models (VLMs) and 11 Vision-Language-Action models (VLAs), establishing the first unified static leaderboard for Embodied intelligence. Moreover, we release a subset of 200 samples from our benchmark to accelerate the development of embodied intelligence.

[111] StyleTailor: Towards Personalized Fashion Styling via Hierarchical Negative Feedback

Hongbo Ma,Fei Shen,Hongbin Xu,Xiaoce Wang,Gang Xu,Jinkai Zheng,Liangqiong Qu,Ming Li

Main category: cs.CV

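TL;DR: StyleTailor is an agent framework that combines personalized apparel design, recommendation, virtual try-on, and evaluation, using negative feedback for closed-loop refinement to substantially improve personalized fashion solutions.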

Details

Motivation: Personalized fashion styling remains underexplored despite its potential to improve the shopping experience, which calls for an integrated intelligent solution. Method: StyleTailor uses an iterative visual refinement paradigm with multi-level negative feedback, progressively refining the outputs of two core agents, Designer and Consultant, in a closed-loop mechanism that improves recommendation quality. Result: StyleTailor outperforms baselines without negative feedback on style consistency, visual quality, face similarity, and artistic appraisal, setting a new benchmark for intelligent fashion systems. Conclusion: StyleTailor proposes a new collaborative agent framework that combines personalized apparel design, shopping recommendation, virtual try-on, and systematic evaluation, substantially improving personalized fashion styling. Abstract: The advancement of intelligent agents has revolutionized problem-solving across diverse domains, yet solutions for personalized fashion styling, which holds immense promise for improving shopping experiences, remain underexplored. In this work, we present StyleTailor, the first collaborative agent framework that seamlessly unifies personalized apparel design, shopping recommendation, virtual try-on, and systematic evaluation into a cohesive workflow. To this end, StyleTailor pioneers an iterative visual refinement paradigm driven by multi-level negative feedback, enabling adaptive and precise user alignment. Specifically, our framework features two core agents, i.e., Designer for personalized garment selection and Consultant for virtual try-on, whose outputs are progressively refined via hierarchical vision-language model feedback spanning individual items, complete outfits, and try-on efficacy. Counterexamples are aggregated into negative prompts, forming a closed-loop mechanism that enhances recommendation quality. To assess the performance, we introduce a comprehensive evaluation suite encompassing style consistency, visual quality, face similarity, and artistic appraisal. Extensive experiments demonstrate StyleTailor's superior performance in delivering personalized designs and recommendations, outperforming strong baselines without negative feedback and establishing a new benchmark for intelligent fashion systems.

[112] From Label Error Detection to Correction: A Modular Framework and Benchmark for Object Detection Datasets

Sarina Penquitt,Jonathan Klees,Rinor Cakaj,Daniel Kondermann,Matthias Rottmann,Lars Schmarje

Main category: cs.CV

TL;DR: The paper introduces a semi-automated framework named REC✓D for correcting label errors in object detection datasets. Using crowdsourced microtasks, it improves label quality by aggregating feedback from multiple annotators. Applied to the KITTI dataset, it identified a significant number of errors, highlighting both the effectiveness of the approach and the need for further research in label error detection and correction.

Details

Motivation: The motivation behind the paper is the prevalence of label errors in object detection datasets—such as missing labels, incorrect classification, or inaccurate localization—which compromise dataset quality and impact training outcomes and benchmark evaluations. The authors aim to address the open problem of correcting such errors systematically and at scale. Method: The paper introduces a semi-automated framework for label-error correction called REC✓D, which pairs error proposals from existing detectors with lightweight, crowd-sourced microtasks. Multiple annotators independently verify each candidate bounding box, and their responses are aggregated to estimate ambiguity and improve label quality. Result: The crowdsourced review using the REC✓D framework on the pedestrian class in the KITTI dataset revealed that at least 24% of the original annotations were missing or inaccurate. The results show that combining current label error detection methods with the correction framework can recover hundreds of errors efficiently. However, even the best methods miss up to 66% of true errors, and poor-quality labels can introduce more errors than they fix. Conclusion: The paper concludes that while current label error detection and correction methods show promise, there remains a significant gap in their effectiveness, emphasizing the need for further research in this area, which is now possible with the release of the new benchmark. Abstract: Object detection has advanced rapidly in recent years, driven by increasingly large and diverse datasets. However, label errors, defined as missing labels, incorrect classification or inaccurate localization, often compromise the quality of these datasets. This can have a significant impact on the outcomes of training and benchmark evaluations. Although several methods now exist for detecting label errors in object detection datasets, they are typically validated only on synthetic benchmarks or limited manual inspection. 
How to correct such errors systematically and at scale therefore remains an open problem. We introduce a semi-automated framework for label-error correction called REC✓D (Rechecked). Building on existing detectors, the framework pairs their error proposals with lightweight, crowd-sourced microtasks. These tasks enable multiple annotators to independently verify each candidate bounding box, and their responses are aggregated to estimate ambiguity and improve label quality. To demonstrate the effectiveness of REC✓D, we apply it to the pedestrian class in the KITTI dataset. Our crowdsourced review yields high-quality corrected annotations, which indicate that at least 24% of the original annotations are missing or inaccurate. This validated set will be released as a new real-world benchmark for label error detection and correction. We show that current label error detection methods, when combined with our correction framework, can recover hundreds of errors in the time it would take a human to annotate bounding boxes from scratch. However, even the best methods still miss up to 66% of the true errors and, with low-quality labels, introduce more errors than they find. This highlights the urgent need for further research, now enabled by our released benchmark.
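Aggregating several independent annotator answers for one candidate box can be sketched as a majority vote with an agreement score, where low agreement flags an ambiguous case. The abstract does not state the aggregation rule, so this is an assumption for illustration only; `aggregate_votes` and the answer vocabulary are hypothetical.

```python
from collections import Counter


def aggregate_votes(votes):
    """Return (majority_answer, agreement) for one candidate bounding box.

    votes: list of annotator answers such as "yes", "no", or "unsure"
           (a hypothetical answer set).
    agreement: fraction of annotators who gave the majority answer;
               1 - agreement serves as a simple ambiguity estimate.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(votes)
    # Sorting first makes tie-breaking deterministic (alphabetical order).
    answer, top_count = max(sorted(counts.items()), key=lambda kv: kv[1])
    return answer, top_count / len(votes)


answer, agreement = aggregate_votes(["yes", "yes", "no", "yes", "unsure"])
print(answer, agreement, "ambiguity:", 1 - agreement)
```

A practical pipeline might accept a correction only when agreement exceeds a chosen threshold and route the remaining boxes to expert review.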

[113] On the effectiveness of multimodal privileged knowledge distillation in two vision transformer based diagnostic applications

Simon Baur,Alexandra Benova,Emilio Dolgener Cantú,Jackie Ma

Main category: cs.CV

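TL;DR: The multimodal knowledge distillation method MMPKD improves the zero-shot localization ability of vision models, but the effect does not generalize across domains.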

Details

Motivation: In clinical practice, deep learning models need multimodal data for reliable decisions, but not all modalities are available at inference time. Method: Teacher models built on text and tabular data transfer their knowledge to a vision transformer student model through knowledge distillation. Result: MMPKD improves the zero-shot localization ability of attention maps, but the effect does not generalize across domains. Conclusion: MMPKD is an effective training strategy that can use additional modalities to improve vision models in different domains, but its effect does not generalize across domains. Abstract: Deploying deep learning models in clinical practice often requires leveraging multiple data modalities, such as images, text, and structured data, to achieve robust and trustworthy decisions. However, not all modalities are always available at inference time. In this work, we propose multimodal privileged knowledge distillation (MMPKD), a training strategy that utilizes additional modalities available solely during training to guide a unimodal vision model. Specifically, we used a text-based teacher model for chest radiographs (MIMIC-CXR) and a tabular metadata-based teacher model for mammography (CBIS-DDSM) to distill knowledge into a vision transformer student model. We show that MMPKD can improve the resulting attention maps' zero-shot capabilities of localizing ROI in input images, while this effect does not generalize across domains, contrary to what prior research suggests.
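The abstract does not specify the distillation objective, so the sketch below shows only the standard soft-target formulation as a point of reference: a temperature-softened KL divergence between teacher and student class distributions. The function names and the temperature value are assumptions, and the paper's actual loss may differ.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    The teacher would see the privileged modality (text or tabular data);
    the student would see only the image.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    return kl * temperature ** 2


# The loss is zero when the student matches the teacher and positive otherwise.
print(distillation_loss([2.0, 0.5], [2.0, 0.5]))
print(distillation_loss([0.5, 2.0], [2.0, 0.5]))
```

At inference time only the student is used, which is why the privileged modality is not needed after training.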

[114] Grounding Emotion Recognition with Visual Prototypes: VEGA -- Revisiting CLIP in MERC

Guanyu Hu,Dimitrios Kollias,Xinyu Yang

Main category: cs.CV

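TL;DR: This paper proposes a new Visual Emotion Guided Anchoring mechanism that introduces class-level visual semantics to improve multimodal emotion recognition, achieving state-of-the-art performance on the IEMOCAP and MELD datasets.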

Details

Motivation: Multimodal emotion recognition in conversations remains challenging because of the complex interplay of textual, acoustic, and visual signals. Existing models lack psychologically meaningful priors to guide multimodal alignment. Method: The paper revisits the use of CLIP and proposes the Visual Emotion Guided Anchoring (VEGA) mechanism, which uses CLIP's image encoder to build emotion-specific visual anchors from facial exemplars and adds a stochastic anchor sampling strategy to improve robustness. Result: The model achieves state-of-the-art performance on the IEMOCAP and MELD datasets. Conclusion: By introducing a psychologically aligned representation space, the VEGA mechanism effectively improves multimodal emotion recognition. Abstract: Multimodal Emotion Recognition in Conversations remains a challenging task due to the complex interplay of textual, acoustic and visual signals. While recent models have improved performance via advanced fusion strategies, they often lack psychologically meaningful priors to guide multimodal alignment. In this paper, we revisit the use of CLIP and propose a novel Visual Emotion Guided Anchoring (VEGA) mechanism that introduces class-level visual semantics into the fusion and classification process. Distinct from prior work that primarily utilizes CLIP's textual encoder, our approach leverages its image encoder to construct emotion-specific visual anchors based on facial exemplars. These anchors guide unimodal and multimodal features toward a perceptually grounded and psychologically aligned representation space, drawing inspiration from cognitive theories (prototypical emotion categories and multisensory integration). A stochastic anchor sampling strategy further enhances robustness by balancing semantic stability and intra-class diversity. Integrated into a dual-branch architecture with self-distillation, our VEGA-augmented model achieves state-of-the-art performance on IEMOCAP and MELD. Code is available at: https://github.com/dkollias/VEGA.
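Stochastic anchor sampling can be pictured as follows: for each emotion class, draw a random subset of exemplar embeddings, average them into an anchor, and compare a feature to the anchors by cosine similarity. This sketch is illustrative only and is not taken from the VEGA repository; the function names, the mean-pooling step, and the toy 2-D vectors (standing in for CLIP image embeddings of facial exemplars) are assumptions.

```python
import math
import random


def sample_anchor(exemplars, k, rng):
    """Average k randomly chosen exemplar embeddings into one anchor."""
    chosen = rng.sample(exemplars, k)
    dim = len(chosen[0])
    return [sum(vec[i] for vec in chosen) / k for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_emotion(feature, exemplars_by_class, k, rng):
    """Return the class whose freshly sampled anchor is closest to the feature."""
    anchors = {
        label: sample_anchor(exemplars, k, rng)
        for label, exemplars in exemplars_by_class.items()
    }
    return max(anchors, key=lambda label: cosine(feature, anchors[label]))


# Toy 2-D embeddings; real anchors would live in CLIP's embedding space.
bank = {
    "happy": [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]],
    "sad": [[0.0, 1.0], [0.1, 0.9], [0.1, 1.0]],
}
print(nearest_emotion([1.0, 0.05], bank, k=2, rng=random.Random(0)))
```

Resampling the subset at each training step varies the anchor slightly, which matches the abstract's stated goal of balancing semantic stability with intra-class diversity.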

[115] Bridging Brain Connectomes and Clinical Reports for Early Alzheimer's Disease Diagnosis

Jing Zhang,Xiaowei Yu,Minheng Chen,Lu Zhang,Tong Chen,Yan Zhuang,Chao Cao,Yanjun Lyu,Li Su,Tianming Liu,Dajiang Zhu

Main category: cs.CV

TL;DR: A novel framework aligns brain imaging data with clinical reports to enhance diagnosis, achieving high performance and offering insights into Alzheimer's disease by linking brain subnetworks with clinical observations.

Details

Motivation: To improve diagnosis in clinical settings by integrating brain imaging data with clinical reports, addressing the challenge of linking objective imaging data with subjective text-based reports. Method: The method involves aligning brain connectomes and clinical reports in a shared cross-modal latent space, treating brain subnetworks as tokens for alignment with word tokens in reports. Result: The method achieves state-of-the-art predictive performance and identifies clinically meaningful connectome-text pairs related to mild cognitive impairment and Alzheimer's disease. Conclusion: The proposed framework successfully aligns brain connectomes with clinical reports, enhancing the understanding of brain disorders by linking neuroimaging findings with clinical observations. Abstract: Integrating brain imaging data with clinical reports offers a valuable opportunity to leverage complementary multimodal information for more effective and timely diagnosis in practical clinical settings. This approach has gained significant attention in brain disorder research, yet a key challenge remains: how to effectively link objective imaging data with subjective text-based reports, such as doctors' notes. In this work, we propose a novel framework that aligns brain connectomes with clinical reports in a shared cross-modal latent space at both the subject and connectome levels, thereby enhancing representation learning. The key innovation of our approach is that we treat brain subnetworks as tokens of imaging data, rather than raw image patches, to align with word tokens in clinical reports. This enables a more efficient identification of system-level associations between neuroimaging findings and clinical observations, which is critical since brain disorders often manifest as network-level abnormalities rather than isolated regional alterations. We applied our method to mild cognitive impairment (MCI) using the ADNI dataset. 
Our approach not only achieves state-of-the-art predictive performance but also identifies clinically meaningful connectome-text pairs, offering new insights into the early mechanisms of Alzheimer's disease and supporting the development of clinically useful multimodal biomarkers.

[116] Surformer v1: Transformer-Based Surface Classification Using Tactile and Vision Features

Manish Kansana,Elias Hossain,Shahram Rahimi,Noorbakhsh Amiri Golilarz

Main category: cs.CV

TL;DR: This paper proposes Surformer v1, a transformer-based model for surface material recognition that combines tactile and visual inputs, achieving high accuracy and fast inference times suitable for real-time applications.

Details

Motivation: Surface material recognition is crucial for robotic perception and physical interaction. The work aims to explore efficient multimodal integration of tactile and visual inputs to enhance classification accuracy and computational efficiency. Method: The study introduces Surformer v1, a transformer-based architecture for surface classification using tactile features and PCA-reduced visual embeddings. It compares the performance of tactile-only and multimodal setups, using cross-modal attention for integrating vision and touch. Result: Surformer v1 achieved 99.4% accuracy with an inference time of 0.77 ms, demonstrating faster performance compared to Multimodal CNN, which had slightly higher accuracy but longer inference times. Conclusion: Surformer v1 offers a balance between accuracy, efficiency, and computational cost for surface material recognition, making it suitable for real-time applications. Abstract: Surface material recognition is a key component in robotic perception and physical interaction, particularly when leveraging both tactile and visual sensory inputs. In this work, we propose Surformer v1, a transformer-based architecture designed for surface classification using structured tactile features and PCA-reduced visual embeddings extracted via ResNet-50. The model integrates modality-specific encoders with cross-modal attention layers, enabling rich interactions between vision and touch. Currently, state-of-the-art deep learning models for vision tasks have achieved remarkable performance. With this in mind, our first set of experiments focused exclusively on tactile-only surface classification. Using feature engineering, we trained and evaluated multiple machine learning models, assessing their accuracy and inference time. We then implemented an encoder-only Transformer model tailored for tactile features. 
This model not only achieved the highest accuracy but also demonstrated significantly faster inference time compared to other evaluated models, highlighting its potential for real-time applications. To extend this investigation, we introduced a multimodal fusion setup by combining vision and tactile inputs. We trained both Surformer v1 (using structured features) and Multimodal CNN (using raw images) to examine the impact of feature-based versus image-based multimodal learning on classification accuracy and computational efficiency. The results showed that Surformer v1 achieved 99.4% accuracy with an inference time of 0.77 ms, while the Multimodal CNN achieved slightly higher accuracy but required significantly more inference time. These findings suggest Surformer v1 offers a compelling balance between accuracy, efficiency, and computational cost for surface material recognition.

[117] ImpliHateVid: A Benchmark Dataset and Two-stage Contrastive Learning Framework for Implicit Hate Speech Detection in Videos

Mohammad Zia Ur Rehman,Anukriti Bhatnagar,Omkar Kabde,Shubhi Bansal,Nagendra Kumar

Main category: cs.CV

TL;DR: This paper introduces a novel dataset, ImpliHateVid, and a two-stage contrastive learning framework for detecting implicit hate speech in videos, highlighting its effectiveness and significance.

Details

Motivation: Implicit hate speech detection in videos is underexplored compared to text and image-based detection, prompting the need for a dedicated dataset and tailored methods. Method: A two-stage contrastive learning framework with modality-specific and cross-encoders, incorporating sentiment, emotion, and caption-based features, was developed for hate speech detection. Result: ImpliHateVid, a large-scale dataset containing 2,009 videos, was created. The proposed method demonstrated effectiveness on ImpliHateVid and the HateMM dataset. Conclusion: The proposed multimodal contrastive learning approach and the ImpliHateVid dataset effectively detect hate speech in videos, particularly implicit hate speech. Abstract: While existing research has primarily focused on text and image-based hate speech detection, video-based approaches remain underexplored. In this work, we introduce a novel dataset, ImpliHateVid, specifically curated for implicit hate speech detection in videos. ImpliHateVid consists of 2,009 videos comprising 509 implicit hate videos, 500 explicit hate videos, and 1,000 non-hate videos, making it one of the first large-scale video datasets dedicated to implicit hate detection. We also propose a novel two-stage contrastive learning framework for hate speech detection in videos. In the first stage, we train modality-specific encoders for audio, text, and image using contrastive loss by concatenating features from the three encoders. In the second stage, we train cross-encoders using contrastive learning to refine multimodal representations. Additionally, we incorporate sentiment, emotion, and caption-based features to enhance implicit hate detection. 
We evaluate our method on two datasets, ImpliHateVid for implicit hate speech detection and another dataset for general hate speech detection in videos, HateMM dataset, demonstrating the effectiveness of the proposed multimodal contrastive learning for hateful content detection in videos and the significance of our dataset.

Sihan Ma,Qiming Wu,Ruotong Jiang,Frank Burns

Main category: cs.CV

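TL;DR: ContextGuard-LVLM is a new framework for improving the verification of content veracity in digital news media, particularly the consistency between visual and textual information.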

Details

Motivation: Traditional approaches fall short on the fine-grained cross-modal contextual consistency problem, so more effective methods are needed to verify the veracity of digital news media content. Method: The ContextGuard-LVLM framework is built on advanced Vision-Language Large Models (LVLMs) and integrates a multi-stage contextual reasoning mechanism. Result: Experiments show that ContextGuard-LVLM consistently outperforms state-of-the-art zero-shot LVLM baselines (such as InstructBLIP and LLaVA 1.5) on nearly all fine-grained consistency tasks. Conclusion: ContextGuard-LVLM outperforms existing zero-shot LVLM baselines at detecting subtle contextual inconsistencies and shows significant improvements in complex logical reasoning and nuanced contextual understanding. Abstract: The proliferation of digital news media necessitates robust methods for verifying content veracity, particularly regarding the consistency between visual and textual information. Traditional approaches often fall short in addressing the fine-grained cross-modal contextual consistency (FCCC) problem, which encompasses deeper alignment of visual narrative, emotional tone, and background information with text, beyond mere entity matching. To address this, we propose ContextGuard-LVLM, a novel framework built upon advanced Vision-Language Large Models (LVLMs) and integrating a multi-stage contextual reasoning mechanism. Our model is uniquely enhanced through reinforced or adversarial learning paradigms, enabling it to detect subtle contextual misalignments that evade zero-shot baselines. We extend and augment three established datasets (TamperedNews-Ent, News400-Ent, MMG-Ent) with new fine-grained contextual annotations, including "contextual sentiment," "visual narrative theme," and "scene-event logical coherence," and introduce a comprehensive CTXT (Contextual Coherence) entity type. Extensive experiments demonstrate that ContextGuard-LVLM consistently outperforms state-of-the-art zero-shot LVLM baselines (InstructBLIP and LLaVA 1.5) across nearly all fine-grained consistency tasks, showing significant improvements in complex logical reasoning and nuanced contextual understanding. Furthermore, our model exhibits superior robustness to subtle perturbations and a higher agreement rate with human expert judgments on challenging samples, affirming its efficacy in discerning sophisticated forms of context detachment.

[119] VL-MedGuide: A Visual-Linguistic Large Model for Intelligent and Explainable Skin Disease Auxiliary Diagnosis

Kexin Yu,Zihan Xu,Jialei Xie,Carter Adams

Main category: cs.CV

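TL;DR: This study proposes a new skin disease diagnosis framework based on vision-language models that combines strong performance with interpretability, supporting practical clinical use.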

Details

Motivation: Skin disease diagnosis is challenged by complex, diverse visual features and by the lack of interpretability in existing models, motivating a new diagnostic framework. Method: The study proposes VL-MedGuide, a framework comprising a Multi-modal Concept Perception Module and an Explainable Disease Reasoning Module, which identify visual features and generate diagnostic explanations, respectively. Result: Experiments on the Derm7pt dataset show that VL-MedGuide outperforms existing methods in both disease diagnosis and concept detection, and its explanations were rated highly in human evaluation. Conclusion: By combining visual and linguistic information, VL-MedGuide provides intelligent, explainable auxiliary diagnosis of skin diseases and bridges the gap between performance and clinical utility. Abstract: Accurate diagnosis of skin diseases remains a significant challenge due to the complex and diverse visual features present in dermatoscopic images, often compounded by a lack of interpretability in existing purely visual diagnostic models. To address these limitations, this study introduces VL-MedGuide (Visual-Linguistic Medical Guide), a novel framework leveraging the powerful multi-modal understanding and reasoning capabilities of Visual-Language Large Models (LVLMs) for intelligent and inherently interpretable auxiliary diagnosis of skin conditions. VL-MedGuide operates in two interconnected stages: a Multi-modal Concept Perception Module, which identifies and linguistically describes dermatologically relevant visual features through sophisticated prompt engineering, and an Explainable Disease Reasoning Module, which integrates these concepts with raw visual information via Chain-of-Thought prompting to provide precise disease diagnoses alongside transparent rationales. Comprehensive experiments on the Derm7pt dataset demonstrate that VL-MedGuide achieves state-of-the-art performance in both disease diagnosis (83.55% BACC, 80.12% F1) and concept detection (76.10% BACC, 67.45% F1), surpassing existing baselines. Furthermore, human evaluations confirm the high clarity, completeness, and trustworthiness of its generated explanations, bridging the gap between AI performance and clinical utility by offering actionable, explainable insights for dermatological practice.

[120] CycleDiff: Cycle Diffusion Models for Unpaired Image-to-image Translation

Shilong Zou,Yuhang Huang,Renjiao Yi,Chenyang Zhu,Kai Xu

Main category: cs.CV

TL;DR: This paper proposes a diffusion-based image translation method with a joint learning framework that aligns diffusion and translation processes, achieving superior performance in cross-domain image translation tasks.

Details

Motivation: To overcome the limitations of existing GAN-based methods and shallow integration of diffusion models in cross-domain image translation by achieving better modeling of data distribution and global optimization. Method: A diffusion-based cross-domain image translation approach that integrates a joint learning framework to align the diffusion and translation processes, using image components extracted from diffusion models and a time-dependent translation network. Result: The method outperforms state-of-the-art approaches in RGB↔RGB and cross-modality tasks such as RGB↔Edge, RGB↔Semantics, and RGB↔Depth. Conclusion: The proposed joint learning framework aligns the diffusion and translation processes, leading to improved global optimality, fidelity, and structural consistency in cross-domain image translation. Abstract: We introduce a diffusion-based cross-domain image translator in the absence of paired training data. Unlike GAN-based methods, our approach integrates diffusion models to learn the image translation process, allowing for better coverage of the data distribution and improved cross-domain translation performance. However, incorporating the translation process within the diffusion process is still challenging since the two processes are not aligned exactly, i.e., the diffusion process is applied to the noisy signal while the translation process is conducted on the clean signal. As a result, recent diffusion-based studies employ separate training or shallow integration to learn the two processes, yet this may trap the translation optimization in a local minimum, constraining the effectiveness of diffusion models. To address the problem, we propose a novel joint learning framework that aligns the diffusion and the translation process, thereby improving the global optimality. 
Specifically, we propose to extract the image components with diffusion models to represent the clean signal and employ the translation process with the image components, enabling end-to-end joint learning. In addition, we introduce a time-dependent translation network to learn the complex translation mapping, resulting in effective translation learning and significant performance improvement. Benefiting from the design of joint learning, our method enables global optimization of both processes, enhancing the optimality and achieving improved fidelity and structural consistency. We have conducted extensive experiments on RGB↔RGB and diverse cross-modality translation tasks including RGB↔Edge, RGB↔Semantics and RGB↔Depth, showcasing better generative performance than the state of the art.

[121] CoDe-NeRF: Neural Rendering via Dynamic Coefficient Decomposition

Wenpeng Xing,Jie Chen,Zaifeng Yang,Tiancheng Zhao,Gaolei Li,Changting Lin,Yike Guo,Meng Han

Main category: cs.CV

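TL;DR: This paper proposes a new neural rendering framework that improves view-dependent appearance modeling through dynamic coefficient decomposition, improving the rendering quality of specular reflections and highlights.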

Details

Motivation: Existing NeRF methods produce blurry reflections or suffer from unstable optimization when handling complex specular reflections and highlights; this paper aims to address these problems. Method: The method decomposes complex appearance into a shared static neural basis (encoding intrinsic material properties) and a set of dynamic coefficients generated from view and illumination conditions, then synthesizes the final radiance with a Dynamic Radiance Integrator. Result: Experimental results indicate that the proposed method produces sharper and more realistic specular highlights than existing techniques. Conclusion: The paper presents a neural rendering framework based on dynamic coefficient decomposition that aims to improve the modeling of view-dependent appearance and produce sharper, more realistic specular highlights. Abstract: Neural Radiance Fields (NeRF) have shown impressive performance in novel view synthesis, but challenges remain in rendering scenes with complex specular reflections and highlights. Existing approaches may produce blurry reflections due to entanglement between lighting and material properties, or encounter optimization instability when relying on physically-based inverse rendering. In this work, we present a neural rendering framework based on dynamic coefficient decomposition, aiming to improve the modeling of view-dependent appearance. Our approach decomposes complex appearance into a shared, static neural basis that encodes intrinsic material properties, and a set of dynamic coefficients generated by a Coefficient Network conditioned on view and illumination. A Dynamic Radiance Integrator then combines these components to synthesize the final radiance. Experimental results on several challenging benchmarks suggest that our method can produce sharper and more realistic specular highlights compared to existing techniques. We hope that this decomposition paradigm can provide a flexible and effective direction for modeling complex appearance in neural scene representations.

[122] Rethinking Key-frame-based Micro-expression Recognition: A Robust and Accurate Framework Against Key-frame Errors

Zheyuan Zhang,Weihao Tang,Hong Chen

Main category: cs.CV

TL;DR: This paper proposes CausalNet, a new micro-expression recognition framework that remains robust and accurate when key-frame indexes contain errors.

Details

Motivation: Existing key-frame-based methods perform poorly when key-frame indexes contain errors, which makes them hard to apply in practice, so a robust MER method is needed. Method: CausalNet takes the representation of the entire ME sequence as input and introduces the Causal Motion Position Learning Module (CMPLM) and the Causal Attention Block (CAB) to reduce redundant information and learn the causal relationships between muscle movements. Result: On multiple ME benchmark datasets, CausalNet is robust under different levels of key-frame index noise and surpasses state-of-the-art methods when using the standard annotated key-frames. Conclusion: CausalNet is a new framework that achieves robust micro-expression recognition (MER) under key-frame index errors while maintaining accurate recognition. Abstract: Micro-expression recognition (MER) is a highly challenging task in affective computing. With the reduced-sized micro-expression (ME) input that contains key information based on key-frame indexes, key-frame-based methods have significantly improved the performance of MER. However, most of these methods focus on improving the performance with relatively accurate key-frame indexes, while ignoring the difficulty of obtaining accurate key-frame indexes and the objective existence of key-frame index errors, which impedes them from moving towards practical applications. In this paper, we propose CausalNet, a novel framework to achieve robust MER facing key-frame index errors while maintaining accurate recognition. To enhance robustness, CausalNet takes the representation of the entire ME sequence as the input. To address the information redundancy brought by the complete ME range input and maintain accurate recognition, first, the Causal Motion Position Learning Module (CMPLM) is proposed to help the model locate the muscle movement areas related to Action Units (AUs), thereby reducing the attention to other redundant areas. Second, the Causal Attention Block (CAB) is proposed to deeply learn the causal relationships between the muscle contraction and relaxation movements in MEs. Empirical experiments have demonstrated that on popular ME benchmarks, the CausalNet has achieved robust MER under different levels of key-frame index noise. Meanwhile, it has surpassed state-of-the-art (SOTA) methods on several standard MER benchmarks when using the provided annotated key-frames. 
Code is available at https://github.com/tony19980810/CausalNet.

[123] Towards Robust Red-Green Watermarking for Autoregressive Image Generators

Denis Lukovnikov,Andreas Müller,Erwin Quiring,Asja Fischer

Main category: cs.CV

TL;DR: This paper explores in-generation watermarking for autoregressive image models, proposing cluster-level watermarking techniques that enhance robustness and detectability while maintaining image quality.

Details

Motivation: In-generation watermarking has demonstrated high robustness in latent diffusion models, but its application in autoregressive image models remains unexplored. This work aims to address this gap by investigating token-level watermarking schemes adapted from large language models. Method: Two novel watermarking methods relying on visual token clustering were proposed: a training-free approach using a cluster lookup table and a method involving finetuning VAE encoders to predict token clusters from perturbed images. Result: Cluster-level watermarks were found to improve robustness against perturbations and regeneration attacks, preserve image quality, and enhance watermark detectability through cluster classification. Conclusion: The proposed cluster-level watermarking methods improve robustness against perturbations and regeneration attacks while preserving image quality and offer fast verification runtime. Abstract: In-generation watermarking for detecting and attributing generated content has recently been explored for latent diffusion models (LDMs), demonstrating high robustness. However, the use of in-generation watermarks in autoregressive (AR) image models has not been explored yet. AR models generate images by autoregressively predicting a sequence of visual tokens that are then decoded into pixels using a vector-quantized decoder. Inspired by red-green watermarks for large language models, we examine token-level watermarking schemes that bias the next-token prediction based on prior tokens. We find that a direct transfer of these schemes works in principle, but the detectability of the watermarks decreases considerably under common image perturbations. As a remedy, we propose two novel watermarking methods that rely on visual token clustering to assign similar tokens to the same set. 
Firstly, we investigate a training-free approach that relies on a cluster lookup table, and secondly, we finetune VAE encoders to predict token clusters directly from perturbed images. Overall, our experiments show that cluster-level watermarks improve robustness against perturbations and regeneration attacks while preserving image quality. Cluster classification further boosts watermark detectability, outperforming a set of baselines. Moreover, our methods offer fast verification runtime, comparable to lightweight post-hoc watermarking methods.
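
The cluster lookup idea above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the keyed hash, the "hard" embedding rule, the toy `cluster_of` table, and all function names are assumptions made for illustration. The point it shows is that when the green/red split is defined over clusters rather than individual tokens, swapping a token for another member of the same cluster leaves the detection statistic unchanged.

```python
import hashlib
import math


def green_clusters(prev_cluster, num_clusters, gamma=0.5, secret="demo-key"):
    """Return the keyed 'green' subset of cluster ids for one position.

    The split depends only on the previous token's cluster and a secret
    string (both choices are assumptions of this sketch).
    """
    ranked = sorted(
        range(num_clusters),
        key=lambda c: hashlib.sha256(f"{secret}:{prev_cluster}:{c}".encode()).digest(),
    )
    k = max(1, int(gamma * num_clusters))
    return set(ranked[:k])


def embed_hard_watermark(length, cluster_of, num_clusters, gamma=0.5, secret="demo-key"):
    """Toy generator: always emit the smallest token whose cluster is green.

    A real generator would instead add a bias to the logits of green tokens.
    """
    tokens = [min(cluster_of)]
    while len(tokens) < length:
        greens = green_clusters(cluster_of[tokens[-1]], num_clusters, gamma, secret)
        tokens.append(min(t for t, c in cluster_of.items() if c in greens))
    return tokens


def detect_z_score(tokens, cluster_of, num_clusters, gamma=0.5, secret="demo-key"):
    """One-proportion z-test on the number of green-cluster hits."""
    clusters = [cluster_of[t] for t in tokens]
    n = len(clusters) - 1  # the first token has no predecessor
    hits = sum(
        cur in green_clusters(prev, num_clusters, gamma, secret)
        for prev, cur in zip(clusters, clusters[1:])
    )
    p = max(1, int(gamma * num_clusters)) / num_clusters  # actual green fraction
    return (hits - p * n) / math.sqrt(n * p * (1 - p))


if __name__ == "__main__":
    # Hypothetical lookup table: 64 visual tokens grouped into 8 clusters.
    cluster_of = {t: t % 8 for t in range(64)}
    tokens = embed_hard_watermark(100, cluster_of, 8)
    # Simulate a perturbation that moves each token to a same-cluster neighbour.
    perturbed = [(t + 8) % 64 for t in tokens]
    print(detect_z_score(tokens, cluster_of, 8))
    print(detect_z_score(perturbed, cluster_of, 8))
```

In practice the lookup table would presumably come from clustering the tokenizer's codebook, and detection would first have to re-encode the image into tokens; both steps are outside this sketch.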

[124] Learning More by Seeing Less: Line Drawing Pretraining for Efficient, Transferable, and Human-Aligned Vision

Tianqin Li,George Liu,Tai Sing Lee

Main category: cs.CV

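TL;DR: The paper proposes line drawings as a structure-first pretraining modality, demonstrates their efficiency and generalization on vision tasks, and extends the approach to the unsupervised setting.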

Details

Motivation: Modern recognition systems depend on rich visual inputs, whereas humans can understand sparse, minimal representations such as line drawings; the paper therefore explores the role of structure in visual understanding. Method: The paper uses line drawings as a pretraining modality and validates the approach experimentally with both supervised and unsupervised methods. Result: The pretrained models show stronger shape bias, more focused attention, and greater data efficiency across classification, detection, and segmentation tasks, along with lower intrinsic dimensionality and better compressibility. Conclusion: The paper concludes that structure-first visual learning fosters efficiency, generalization, and human-aligned inductive biases, offering a strategy for building more robust and adaptable vision systems. Abstract: Despite remarkable progress in computer vision, modern recognition systems remain limited by their dependence on rich, redundant visual inputs. In contrast, humans can effortlessly understand sparse, minimal representations like line drawings - suggesting that structure, rather than appearance, underlies efficient visual understanding. In this work, we propose using line drawings as a structure-first pretraining modality to induce more compact and generalizable visual representations. We show that models pretrained on line drawings develop stronger shape bias, more focused attention, and greater data efficiency across classification, detection, and segmentation tasks. Notably, these models also exhibit lower intrinsic dimensionality, requiring significantly fewer principal components to capture representational variance - echoing the similar observation in low dimensional efficient representation in the brain. Beyond performance improvements, line drawing pretraining produces more compressible representations, enabling better distillation into lightweight student models. Students distilled from line-pretrained teachers consistently outperform those trained from color-supervised teachers, highlighting the benefits of structurally compact knowledge. Finally, we demonstrate that the pretraining with line-drawing can also be extended to unsupervised setting via our proposed method "learning to draw". Together, our results support the view that structure-first visual learning fosters efficiency, generalization, and human-aligned inductive biases - offering a simple yet powerful strategy for building more robust and adaptable vision systems.

[125] MMFformer: Multimodal Fusion Transformer Network for Depression Detection

Md Rezwanul Haque,Md. Milon Islam,S M Taslim Uddin Raju,Hamdi Altaheri,Lobna Nassar,Fakhri Karray

Main category: cs.CV

TL;DR: The paper introduces MMFformer, a new multimodal network for depression detection using social media data, which outperforms current state-of-the-art methods.

Details

Motivation: Depression is a serious mental health issue that is difficult to detect due to its subjective nature. Early detection is crucial for adequate care and treatment. The use of social media content for early depression diagnosis has become a prominent research area. Method: The paper proposes MMFformer, a multimodal depression detection network that uses a transformer network with residual connections to capture spatial features from videos and a transformer encoder to model temporal dynamics in audio. The fusion architecture combines features through late and intermediate fusion strategies. Result: The proposed MMFformer network improves the F1-Score by 13.92% for D-Vlog dataset and 7.74% for LMVD dataset compared to existing approaches. Conclusion: The paper concludes that the proposed MMFformer network surpasses existing state-of-the-art approaches for depression detection, with improvements in F1-Score on two datasets. Abstract: Depression is a serious mental health illness that significantly affects an individual's well-being and quality of life, making early detection crucial for adequate care and treatment. Detecting depression is often difficult, as it is based primarily on subjective evaluations during clinical interviews. Hence, the early diagnosis of depression, thanks to the content of social networks, has become a prominent research area. The extensive and diverse nature of user-generated information poses a significant challenge, limiting the accurate extraction of relevant temporal information and the effective fusion of data across multiple modalities. This paper introduces MMFformer, a multimodal depression detection network designed to retrieve depressive spatio-temporal high-level patterns from multimodal social media information. The transformer network with residual connections captures spatial features from videos, and a transformer encoder is exploited to design important temporal dynamics in audio. 
Moreover, the fusion architecture fuses the extracted features through late and intermediate fusion strategies to find the most relevant intermodal correlations among them. Finally, the proposed network is assessed on two large-scale depression detection datasets, and the results show that it surpasses existing state-of-the-art approaches, improving the F1-Score by 13.92% on the D-Vlog dataset and 7.74% on the LMVD dataset. The code is publicly available at https://github.com/rezwanh001/Large-Scale-Multimodal-Depression-Detection.

[126] Fourier Optics and Deep Learning Methods for Fast 3D Reconstruction in Digital Holography

Justin London

Main category: cs.CV

TL;DR: This paper proposes an efficient pipeline for CGH synthesis using point cloud and MRI data, achieving better performance with optimization algorithms and noise filtering.

Details

Motivation: Computer-generated holography (CGH) offers potential for modulating waveforms with digital holograms, but efficient and high-quality synthesis techniques are needed. Method: Reconstructing volumetric objects from point cloud and MRI data, then applying non-convex Fourier optics optimization algorithms (alternating projection, SGD, quasi-Newton methods) for hologram generation. Performance is enhanced using 2D median filtering. Result: Improved performance metrics (MSE, RMSE, PSNR) for phase-only and complex hologram generation compared to existing methods like HoloNet deep learning CGH. Conclusion: The proposed pipeline framework for CGH synthesis demonstrates improved performance with optimization algorithms and filtering techniques. Abstract: Computer-generated holography (CGH) is a promising method that modulates user-defined waveforms with digital holograms. An efficient and fast pipeline framework is proposed to synthesize CGH using initial point cloud and MRI data. This input data is reconstructed into volumetric objects that are then input into non-convex Fourier optics optimization algorithms for phase-only hologram (POH) and complex-hologram (CH) generation using alternating projection, SGD, and quasi-Newton methods. The reconstruction performance of these algorithms, as measured by MSE, RMSE, and PSNR, is compared across methods and against HoloNet deep learning CGH. Performance metrics are shown to be improved by using 2D median filtering to remove artifacts and speckle noise during optimization.
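
For readers unfamiliar with the metrics named above, the following dependency-free Python sketch computes MSE, RMSE, and PSNR between two small images and applies a 3x3 median filter to a single speckle. It is only an illustration under stated assumptions: images are lists of rows with values in [0, 1], border windows are clipped rather than padded, and the filter is applied once to a finished image rather than inside an optimization loop as the abstract describes. All function names are hypothetical.

```python
import math
from statistics import median


def mse(a, b):
    """Mean squared error between two equally sized 2D images."""
    n = sum(len(row) for row in a)
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n


def rmse(a, b):
    """Root mean squared error."""
    return math.sqrt(mse(a, b))


def psnr(a, b, peak=1.0):
    """Peak signal-to-noise ratio in dB; infinite when the images match."""
    err = mse(a, b)
    return math.inf if err == 0 else 10 * math.log10(peak ** 2 / err)


def median_filter_3x3(img):
    """3x3 median filter; windows are clipped at the image border."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for i in range(h):
        for j in range(w):
            window = [
                img[y][x]
                for y in range(max(0, i - 1), min(h, i + 2))
                for x in range(max(0, j - 1), min(w, j + 2))
            ]
            out[i][j] = median(window)
    return out


if __name__ == "__main__":
    target = [[0.5] * 5 for _ in range(5)]
    noisy = [row[:] for row in target]
    noisy[2][2] = 1.0  # a single speckle in the centre
    print(mse(noisy, target), rmse(noisy, target), psnr(noisy, target))
    print(psnr(median_filter_3x3(noisy), target))
```

An isolated outlier is removed entirely by the median, which is why this kind of filter is a common choice against speckle-like artifacts.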

[127] Restage4D: Reanimating Deformable 3D Reconstruction from a Single Video

Jixuan He,Chieh Hubert Lin,Lu Qi,Ming-Hsuan Yang

Main category: cs.CV

TL;DR: Restage4D is a novel method for generating physically realistic 4D scenes by utilizing motion priors from real-world videos, outperforming existing approaches in structure preservation and error correction.

Details

Motivation: Current generative models struggle with physical realism and motion dynamics for 4D scene synthesis, whereas real-world videos offer grounded geometry and articulation cues. This work aims to explore whether physically consistent 4D content can be generated using motion priors from real videos. Method: Restage4D uses a geometry-preserving pipeline with a video-rewinding training strategy, occlusion-aware rigidity loss, and disocclusion backtracing mechanism to bridge real and synthetic videos through shared motion representation. Result: Restage4D was validated on DAVIS and PointOdyssey datasets, showing improved geometry consistency, motion quality, and 3D tracking performance. Conclusion: The Restage4D method effectively generates physically consistent 4D content by leveraging real-world video motion priors, preserving deformable structures under novel motion while correcting errors from generative models. Abstract: Creating deformable 3D content has gained increasing attention with the rise of text-to-image and image-to-video generative models. While these models provide rich semantic priors for appearance, they struggle to capture the physical realism and motion dynamics needed for authentic 4D scene synthesis. In contrast, real-world videos can provide physically grounded geometry and articulation cues that are difficult to hallucinate. One question is raised: Can we generate physically consistent 4D content by leveraging the motion priors of the real-world video? In this work, we explore the task of reanimating deformable 3D scenes from a single video, using the original sequence as a supervisory signal to correct artifacts from synthetic motion. We introduce Restage4D, a geometry-preserving pipeline for video-conditioned 4D restaging. Our approach uses a video-rewinding training strategy to temporally bridge a real base video and a synthetic driving video via a shared motion representation. 
We further incorporate an occlusion-aware rigidity loss and a disocclusion backtracing mechanism to improve structural and geometry consistency under challenging motion. We validate Restage4D on DAVIS and PointOdyssey, demonstrating improved geometry consistency, motion quality, and 3D tracking performance. Our method not only preserves deformable structure under novel motion, but also automatically corrects errors introduced by generative models, revealing the potential of video prior in 4D restaging task. Source code and trained models will be released.

[128] FoundBioNet: A Foundation-Based Model for IDH Genotyping of Glioma from Multi-Parametric MRI

Somayeh Farahani,Marjaneh Hejazi,Antonio Di Ieva,Sidong Liu

Main category: cs.CV

TL;DR: FoundBioNet is a deep learning model designed to noninvasively detect IDH mutations in gliomas using MRI data, achieving high accuracy across multiple datasets and offering potential for improved, personalized diagnosis.

Details

Motivation: Accurate, noninvasive detection of IDH mutation is crucial for effective glioma management, as traditional invasive methods may fail to capture tumor heterogeneity, and existing deep learning models are limited by scarce annotated data. Method: The study proposes FoundBioNet, a SWIN-UNETR-based model with Tumor-Aware Feature Encoding (TAFE) and Cross-Modality Differential (CMD) modules, trained and validated on a multi-center cohort of 1705 glioma patients from six public datasets. Result: FoundBioNet achieved AUCs of 90.58%, 88.08%, 65.41%, and 80.31% on independent test sets from EGD, TCGA, Ivy GAP, RHUH, and UPenn, consistently outperforming baseline approaches (p <= 0.05), with ablation studies confirming the importance of TAFE and CMD modules. Conclusion: FoundBioNet, a deep learning model incorporating TAFE and CMD modules, enables accurate, noninvasive detection of IDH mutation in gliomas, outperforming baseline approaches and offering potential for personalized patient care. Abstract: Accurate, noninvasive detection of isocitrate dehydrogenase (IDH) mutation is essential for effective glioma management. Traditional methods rely on invasive tissue sampling, which may fail to capture a tumor's spatial heterogeneity. While deep learning models have shown promise in molecular profiling, their performance is often limited by scarce annotated data. In contrast, foundation deep learning models offer a more generalizable approach for glioma imaging biomarkers. We propose a Foundation-based Biomarker Network (FoundBioNet) that utilizes a SWIN-UNETR-based architecture to noninvasively predict IDH mutation status from multi-parametric MRI. Two key modules are incorporated: Tumor-Aware Feature Encoding (TAFE) for extracting multi-scale, tumor-focused features, and Cross-Modality Differential (CMD) for highlighting subtle T2-FLAIR mismatch signals associated with IDH mutation. 
The model was trained and validated on a diverse, multi-center cohort of 1705 glioma patients from six public datasets. Our model achieved AUCs of 90.58%, 88.08%, 65.41%, and 80.31% on independent test sets from EGD, TCGA, Ivy GAP, RHUH, and UPenn, consistently outperforming baseline approaches (p <= 0.05). Ablation studies confirmed that both the TAFE and CMD modules are essential for improving predictive accuracy. By integrating large-scale pretraining and task-specific fine-tuning, FoundBioNet enables generalizable glioma characterization. This approach enhances diagnostic accuracy and interpretability, with the potential to enable more personalized patient care.

[129] VOccl3D: A Video Benchmark Dataset for 3D Human Pose and Shape Estimation under real Occlusions

Yash Garg,Saketh Bachu,Arindam Dutta,Rohit Lal,Sarosij Bose,Calvin-Khang Ta,M. Salman Asif,Amit Roy-Chowdhury

Main category: cs.CV

TL;DR: This paper introduces the VOccl3D dataset to improve human pose and shape estimation under occlusion.

Details

Motivation: Existing occlusion datasets are not realistic enough to reflect real-world challenges, so a more realistic occlusion dataset is needed. Method: The VOccl3D dataset is built with advanced computer graphics rendering techniques; the CLIFF and BEDLAM-CLIFF methods are fine-tuned on it, and the dataset is also used to improve YOLO11's detection performance under occlusion. Result: Fine-tuning on VOccl3D significantly improves the performance of HPS methods on multiple public datasets and on the dataset's test split. Conclusion: VOccl3D provides a more realistic occlusion dataset that supports future research on HPS methods under occlusion. Abstract: Human pose and shape (HPS) estimation methods have been extensively studied, with many demonstrating high zero-shot performance on in-the-wild images and videos. However, these methods often struggle in challenging scenarios involving complex human poses or significant occlusions. Although some studies address 3D human pose estimation under occlusion, they typically evaluate performance on datasets that lack realistic or substantial occlusions, e.g., most existing datasets introduce occlusions with random patches over the human or clipart-style overlays, which may not reflect real-world challenges. To bridge this gap in realistic occlusion datasets, we introduce a novel benchmark dataset, VOccl3D, a Video-based human Occlusion dataset with 3D body pose and shape annotations. Inspired by works such as AGORA and BEDLAM, we constructed this dataset using advanced computer graphics rendering techniques, incorporating diverse real-world occlusion scenarios, clothing textures, and human motions. Additionally, we fine-tuned recent HPS methods, CLIFF and BEDLAM-CLIFF, on our dataset, demonstrating significant qualitative and quantitative improvements across multiple public datasets, as well as on the test split of our dataset, while comparing its performance with other state-of-the-art methods. Furthermore, we leveraged our dataset to enhance human detection performance under occlusion by fine-tuning an existing object detector, YOLO11, thus leading to a robust end-to-end HPS estimation system under occlusions. Overall, this dataset serves as a valuable resource for future research aimed at benchmarking methods designed to handle occlusions, offering a more realistic alternative to existing occlusion datasets. 
See the Project page for code and dataset: https://yashgarg98.github.io/VOccl3D-dataset/

[130] SafePLUG: Empowering Multimodal LLMs with Pixel-Level Insight and Temporal Grounding for Traffic Accident Understanding

Zihao Sheng,Zilin Huang,Yen-Jung Chen,Yansong Qu,Yuhao Luo,Yue Leng,Sikai Chen

Main category: cs.CV

TL;DR: SafePLUG improves traffic accident understanding by enabling fine-grained visual analysis and temporal event recognition, addressing limitations in existing MLLMs.

Details

Motivation: Existing MLLMs struggle with fine-grained visual details and localized scene components in traffic accident understanding, limiting their applicability in complex scenarios. Method: SafePLUG utilizes a novel framework that enables pixel-level understanding and temporal grounding for traffic accident analysis, supporting region-aware question answering, pixel-level segmentation, and recognition of temporally anchored events. Result: SafePLUG achieved strong performance on multiple tasks such as region-based question answering, pixel-level segmentation, temporal event localization, and accident event understanding. Conclusion: SafePLUG provides a foundation for fine-grained understanding of complex traffic scenes and has the potential to enhance driving safety and situational awareness in smart transportation systems. Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress across a range of vision-language tasks and demonstrate strong potential for traffic accident understanding. However, existing MLLMs in this domain primarily focus on coarse-grained image-level or video-level comprehension and often struggle to handle fine-grained visual details or localized scene components, limiting their applicability in complex accident scenarios. To address these limitations, we propose SafePLUG, a novel framework that empowers MLLMs with both Pixel-Level Understanding and temporal Grounding for comprehensive traffic accident analysis. SafePLUG supports both arbitrary-shaped visual prompts for region-aware question answering and pixel-level segmentation based on language instructions, while also enabling the recognition of temporally anchored events in traffic accident scenarios. To advance the development of MLLMs for traffic accident understanding, we curate a new dataset containing multimodal question-answer pairs centered on diverse accident scenarios, with detailed pixel-level annotations and temporal event boundaries. 
Experimental results show that SafePLUG achieves strong performance on multiple tasks, including region-based question answering, pixel-level segmentation, temporal event localization, and accident event understanding. These capabilities lay a foundation for fine-grained understanding of complex traffic scenes, with the potential to improve driving safety and enhance situational awareness in smart transportation systems. The code, dataset, and model checkpoints will be made publicly available at: https://zihaosheng.github.io/SafePLUG

[131] DiffUS: Differentiable Ultrasound Rendering from Volumetric Imaging

Noe Bertramo,Gabriel Duguey,Vivek Gopalakrishnan

Main category: cs.CV

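TL;DR: DiffUS is a physics-based differentiable ultrasound renderer that generates realistic B-mode ultrasound images from MRI data and could be used for real-time intraoperative guidance.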

Details

Motivation: Intraoperative ultrasound is difficult to interpret because of noise, artifacts, and poor alignment with preoperative MRI/CT scans, so a method is needed to bridge the gap between preoperative planning and intraoperative guidance. Method: DiffUS converts 3D MRI scans into acoustic impedance volumes, simulates ultrasound beam propagation using ray tracing with coupled reflection-transmission equations, captures multiple internal reflections through a sparse linear system formulation, and extracts depth-resolved echoes across a fan-shaped acquisition geometry, including realistic noise and depth-dependent degradation. Result: Evaluation on the ReMIND dataset shows that DiffUS can generate anatomically accurate ultrasound images from brain MRI data. Conclusion: DiffUS is a differentiable ultrasound renderer that generates realistic B-mode ultrasound images from MRI data and has potential for intraoperative guidance. Abstract: Intraoperative ultrasound imaging provides real-time guidance during numerous surgical procedures, but its interpretation is complicated by noise, artifacts, and poor alignment with high-resolution preoperative MRI/CT scans. To bridge the gap between preoperative planning and intraoperative guidance, we present DiffUS, a physics-based, differentiable ultrasound renderer that synthesizes realistic B-mode images from volumetric imaging. DiffUS first converts MRI 3D scans into acoustic impedance volumes using a machine learning approach. Next, we simulate ultrasound beam propagation using ray tracing with coupled reflection-transmission equations. DiffUS formulates wave propagation as a sparse linear system that captures multiple internal reflections. Finally, we reconstruct B-mode images via depth-resolved echo extraction across fan-shaped acquisition geometry, incorporating realistic artifacts including speckle noise and depth-dependent degradation. DiffUS is entirely implemented as differentiable tensor operations in PyTorch, enabling gradient-based optimization for downstream applications such as slice-to-volume registration and volumetric reconstruction. Evaluation on the ReMIND dataset demonstrates DiffUS's ability to generate anatomically accurate ultrasound images from brain MRI data.
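
As a rough illustration of the reflection-transmission step, the sketch below computes first-order echoes along a single ray through a 1D acoustic impedance profile. It assumes normal incidence and lossless tissue, and it ignores the multiple internal reflections that DiffUS captures with its sparse linear system. The function names are hypothetical and this is not the authors' PyTorch implementation.

```python
import numpy as np

def interface_coefficients(z1, z2):
    """Intensity reflection and transmission at a flat interface (normal incidence)."""
    r = ((z2 - z1) / (z2 + z1)) ** 2
    return r, 1.0 - r

def single_scatter_echoes(impedance):
    """Echo intensity returned from each interface along one ray.

    Only first-order echoes are modeled: the pulse is attenuated by every
    interface it crosses on the way down and again on the way back.
    """
    z = np.asarray(impedance, dtype=float)
    r, t = interface_coefficients(z[:-1], z[1:])
    # Transmission accumulated before reaching interface i (exclusive product).
    down = np.concatenate(([1.0], np.cumprod(t)[:-1]))
    # Down-trip attenuation * reflection * return-trip attenuation.
    return down * r * down
```

Interfaces between samples of equal impedance return no echo, so a homogeneous region appears dark except for whatever speckle model is added on top.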

Aarav Mehta,Priya Deshmukh,Vikram Singh,Siddharth Malhotra,Krishnan Menon Iyer,Tanvi Iyer

Main category: cs.CV

TL;DR: This paper proposes a medically focused crisp edge detector using a novel backward refinement architecture that significantly improves organ boundary localization accuracy in medical images, enhancing key tasks like segmentation and registration.

Details

Motivation: Precise organ boundary localization is crucial in medical imaging, but deep ConvNets often lack the millimeter-level accuracy required in medical applications, motivating the need for a specialized solution. Method: The authors proposed a top-down backward refinement architecture tailored for medical images, progressively fusing high-level semantic features with low-level details. They extended the design to handle anisotropic volumes by combining 2D slice-wise refinement with lightweight 3D context aggregation. Result: Evaluations on CT and MRI datasets showed significant improvements in boundary localization accuracy, as measured by boundary F-measure and Hausdorff distance, outperforming baseline ConvNet detectors and contemporary medical edge/contour methods. The method also improved downstream tasks such as segmentation, registration, and lesion delineation. Conclusion: The proposed crisp edge detector substantially improves the accuracy of organ boundary localization in medical imaging, offering clinically valuable results and enhancing common medical-imaging tasks like segmentation, registration, and lesion delineation. Abstract: Accurate localization of organ boundaries is critical in medical imaging for segmentation, registration, surgical planning, and radiotherapy. While deep convolutional networks (ConvNets) have advanced general-purpose edge detection to near-human performance on natural images, their outputs often lack precise localization, a limitation that is particularly harmful in medical applications where millimeter-level accuracy is required. Building on a systematic analysis of ConvNet edge outputs, we propose a medically focused crisp edge detector that adapts a novel top-down backward refinement architecture to medical images (2D and volumetric). 
Our method progressively upsamples and fuses high-level semantic features with fine-grained low-level cues through a backward refinement pathway, producing high-resolution, well-localized organ boundaries. We further extend the design to handle anisotropic volumes by combining 2D slice-wise refinement with light 3D context aggregation to retain computational efficiency. Evaluations on several CT and MRI organ datasets demonstrate substantially improved boundary localization under strict criteria (boundary F-measure, Hausdorff distance) compared to baseline ConvNet detectors and contemporary medical edge/contour methods. Importantly, integrating our crisp edge maps into downstream pipelines yields consistent gains in organ segmentation (higher Dice scores, lower boundary errors), more accurate image registration, and improved delineation of lesions near organ interfaces. The proposed approach produces clinically valuable, crisp organ edges that materially enhance common medical-imaging tasks.

[133] DualResolution Residual Architecture with Artifact Suppression for Melanocytic Lesion Segmentation

Vikram Singh,Kabir Malhotra,Rohan Desai,Ananya Shankaracharya,Priyadarshini Chatterjee,Krishnan Menon Iyer

Main category: cs.CV

TL;DR: This paper introduces a novel ResNet-inspired dual-resolution architecture for melanocytic tumor segmentation that significantly improves boundary adherence and clinically relevant segmentation metrics compared to standard encoder-decoder baselines.

Details

Motivation: Accurate segmentation of melanocytic tumors in dermoscopic images is a critical step for automated skin cancer screening and clinical decision support. This is challenging due to subtle texture and color variations, frequent artifacts, and the need for precise boundary localization. Method: A novel ResNet-inspired dual-resolution architecture specifically designed for melanocytic tumor segmentation is introduced. It maintains a full-resolution stream that preserves fine-grained boundary information while a complementary pooled stream aggregates multi-scale contextual cues. The streams are tightly coupled by boundary-aware residual connections and a channel attention module. A lightweight artifact suppression block and a multi-task training objective are also proposed. Result: The combined design yields pixel-accurate masks without requiring heavy post-processing or complex pre-training protocols. Extensive experiments on public dermoscopic benchmarks demonstrate that the introduced method significantly improves boundary adherence and clinically relevant segmentation metrics. Conclusion: The introduced method significantly improves boundary adherence and clinically relevant segmentation metrics compared to standard encoder-decoder baselines, making it a practical building block for automated melanoma assessment systems. Abstract: Accurate segmentation of melanocytic tumors in dermoscopic images is a critical step for automated skin cancer screening and clinical decision support. Unlike natural scene segmentation, lesion delineation must reconcile subtle texture and color variations, frequent artifacts (hairs, rulers, bubbles), and a strong need for precise boundary localization to support downstream diagnosis. In this paper we introduce a novel ResNet-inspired dual-resolution architecture specifically designed for melanocytic tumor segmentation. 
Our method maintains a full-resolution stream that preserves fine-grained boundary information while a complementary pooled stream aggregates multi-scale contextual cues for robust lesion recognition. The streams are tightly coupled by boundary-aware residual connections that inject high-frequency edge information into deep feature maps, and by a channel attention module that adapts color and texture sensitivity to dermoscopic appearance. To further address common imaging artifacts and the limited size of clinical datasets, we propose a lightweight artifact suppression block and a multi-task training objective that combines a Dice-Tversky segmentation loss with an explicit boundary loss and a contrastive regularizer for feature stability. The combined design yields pixel-accurate masks without requiring heavy post-processing or complex pre-training protocols. Extensive experiments on public dermoscopic benchmarks demonstrate that our method significantly improves boundary adherence and clinically relevant segmentation metrics compared to standard encoder-decoder baselines, making it a practical building block for automated melanoma assessment systems.
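
The abstract does not spell out the exact form of its segmentation loss, so the following is only a minimal sketch of the standard Tversky index it builds on. The weights `alpha` and `beta` (penalties on false positives and false negatives) are illustrative assumptions, and the boundary loss and contrastive regularizer are omitted.

```python
import numpy as np

def tversky_index(pred, target, alpha=0.5, beta=0.5, eps=1e-7):
    """Tversky index TP / (TP + alpha*FP + beta*FN) for soft or binary masks."""
    pred = np.asarray(pred, dtype=float).ravel()
    target = np.asarray(target, dtype=float).ravel()
    tp = np.sum(pred * target)
    fp = np.sum(pred * (1.0 - target))
    fn = np.sum((1.0 - pred) * target)
    return (tp + eps) / (tp + alpha * fp + beta * fn + eps)

def tversky_loss(pred, target, alpha=0.5, beta=0.5):
    """Loss form: 1 - Tversky index. With alpha = beta = 0.5 this is the Dice loss."""
    return 1.0 - tversky_index(pred, target, alpha, beta)
```

Raising `beta` above `alpha` penalizes missed lesion pixels more than spurious ones, which is one common reason to prefer a Tversky term over plain Dice on small lesions.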

[134] VesselRW: Weakly Supervised Subcutaneous Vessel Segmentation via Learned Random Walk Propagation

Ayaan Nooruddin Siddiqui,Mahnoor Zaidi,Ayesha Nazneen Shahbaz,Priyadarshini Chatterjee,Krishnan Menon Iyer

Main category: cs.CV

TL;DR: This paper introduces a weakly supervised deep learning framework for subcutaneous vessel segmentation that uses sparse annotations to generate accurate and topologically correct vascular maps with reduced manual labeling effort.

Details

Motivation: The motivation stems from the challenges in accurately segmenting subcutaneous vessels due to scarce and expensive ground truth data, low contrast, and noisy vessel appearances across patients and imaging modalities. Method: The method involves a weakly supervised framework using sparse annotations, which are expanded into dense probabilistic supervision using a differentiable random walk label propagation model. This model incorporates vesselness cues and tubular continuity priors. The system also includes an uncertainty-weighted loss and a topology-aware regularizer to improve segmentation accuracy and connectivity. Result: The experiments show that the proposed method outperforms naive training on sparse labels and conventional dense pseudo-labeling. It produces more complete vascular maps, better-calibrated uncertainty, and improves clinical usability by ensuring centerline connectivity and reducing spurious branches. Conclusion: The paper concludes that their novel weakly supervised training framework improves subcutaneous vessel segmentation by reducing annotation burden while maintaining vessel topology and producing better-calibrated uncertainty estimates. Abstract: Accurate segmentation of subcutaneous vessels from clinical images is hampered by scarce, expensive ground truth and by low contrast, noisy appearance of vessels across patients and modalities. We present a novel weakly supervised training framework tailored for subcutaneous vessel segmentation that leverages inexpensive sparse annotations (e.g., centerline traces, dot markers, or short scribbles). Sparse labels are expanded into dense, probabilistic supervision via a differentiable random walk label propagation model whose transition weights incorporate image driven vesselness cues and tubular continuity priors. 
The propagation yields per-pixel hitting probabilities together with calibrated uncertainty estimates; these are incorporated into an uncertainty-weighted loss to avoid overfitting to ambiguous regions. Crucially, the label propagator is learned jointly with a CNN-based segmentation predictor, enabling the system to discover vessel edges and continuity constraints without explicit edge supervision. We further introduce a topology-aware regularizer that encourages centerline connectivity and penalizes spurious branches, improving clinical usability. In experiments on clinical subcutaneous imaging datasets, our method consistently outperforms naive training on sparse labels and conventional dense pseudo-labeling, producing more complete vascular maps and better-calibrated uncertainty for downstream decision making. The approach substantially reduces annotation burden while preserving clinically relevant vessel topology.
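
To make the idea of hitting probabilities concrete, the sketch below solves the classical random-walk labeling problem on a small graph: for every unlabeled node it returns the probability that a walker starting there reaches a seed of each label first. The edge weights are a fixed input here, whereas the paper learns them from vesselness cues and continuity priors, and the function name is hypothetical.

```python
import numpy as np

def random_walk_probabilities(weights, seed_idx, seed_labels, num_labels):
    """First-hit probabilities per label, given a symmetric edge-weight matrix.

    Assumes every unlabeled node is connected to at least one seed,
    otherwise the linear system below is singular.
    """
    w = np.asarray(weights, dtype=float)
    n = w.shape[0]
    laplacian = np.diag(w.sum(axis=1)) - w
    seed_idx = np.asarray(seed_idx)
    free_idx = np.setdiff1d(np.arange(n), seed_idx)
    # One-hot boundary conditions at the seed nodes.
    boundary = np.zeros((len(seed_idx), num_labels))
    boundary[np.arange(len(seed_idx)), seed_labels] = 1.0
    l_uu = laplacian[np.ix_(free_idx, free_idx)]
    l_us = laplacian[np.ix_(free_idx, seed_idx)]
    probs = np.zeros((n, num_labels))
    probs[seed_idx] = boundary
    # Harmonic solution: L_uu x = -L_us b for the unlabeled nodes.
    probs[free_idx] = np.linalg.solve(l_uu, -l_us @ boundary)
    return probs
```

On a five-node chain with unit weights and seeds of different labels at the two ends, the probabilities fall off linearly with distance from each seed. Non-uniform weights bend that profile toward strongly connected neighbors, which is the mechanism a vesselness-driven weighting would exploit.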

[135] Low-Rank Expert Merging for Multi-Source Domain Adaptation in Person Re-Identification

Taha Mustapha Nehdi,Nairouz Mrabah,Atif Belal,Marco Pedersoli,Eric Granger

Main category: cs.CV

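TL;DR: This paper proposes SAGE-reID, a multi-source domain adaptation method for cross-domain person re-identification that requires no access to source-domain data and transfers knowledge efficiently through low-rank adapters and a gating network.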

Details

Motivation: Multi-source domain adaptation (MSDA) offers higher accuracy and robustness than conventional domain adaptation for cross-domain person re-identification, but existing methods suffer from parameter growth and high computational cost. Method: The Source-free Adaptive Gated Experts (SAGE-reID) method trains source-specific low-rank adapters (LoRA) through source-free domain adaptation and introduces a lightweight gating network that dynamically assigns merging weights to the LoRA experts. Result: On three challenging benchmarks (Market-1501, DukeMTMC-reID, and MSMT17), SAGE-reID outperforms state-of-the-art methods with lower memory consumption and a lower risk of overfitting. Conclusion: SAGE-reID is a cost-effective, source-free MSDA method that performs well in cross-domain knowledge transfer while remaining computationally efficient. Abstract: Adapting person re-identification (reID) models to new target environments remains a challenging problem that is typically addressed using unsupervised domain adaptation (UDA) methods. Recent works show that when labeled data originates from several distinct sources (e.g., datasets and cameras), considering each source separately and applying multi-source domain adaptation (MSDA) typically yields higher accuracy and robustness compared to blending the sources and performing conventional UDA. However, state-of-the-art MSDA methods learn domain-specific backbone models or require access to source domain data during adaptation, resulting in significant growth in training parameters and computational cost. In this paper, a Source-free Adaptive Gated Experts (SAGE-reID) method is introduced for person reID. Our SAGE-reID is a cost-effective, source-free MSDA method that first trains individual source-specific low-rank adapters (LoRA) through source-free UDA. Next, a lightweight gating network is introduced and trained to dynamically assign optimal merging weights for fusion of LoRA experts, enabling effective cross-domain knowledge transfer. While the number of backbone parameters remains constant across source domains, LoRA experts scale linearly but remain negligible in size (<= 2% of the backbone), reducing both the memory consumption and risk of overfitting. Extensive experiments conducted on three challenging benchmarks: Market-1501, DukeMTMC-reID, and MSMT17 indicate that SAGE-reID outperforms state-of-the-art methods while being computationally efficient.
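
The merging step can be pictured as a weighted sum of low-rank updates added to a frozen weight matrix. The sketch below assumes the gate produces softmax-normalized weights and that each expert is a pair of factors `(B, A)`; both are illustrative assumptions, since the abstract does not specify the gating form.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def merge_lora_experts(w0, experts, gate_logits):
    """Return w0 + sum_k g_k * (B_k @ A_k), with g = softmax(gate_logits).

    experts: list of (B, A) pairs with shapes (d_out, r) and (r, d_in).
    The backbone weight w0 stays frozen; only the low-rank deltas are mixed.
    """
    gates = softmax(np.asarray(gate_logits, dtype=float))
    delta = sum(g * (b @ a) for g, (b, a) in zip(gates, experts))
    return w0 + delta, gates
```

Each expert stores only `r * (d_out + d_in)` parameters per adapted matrix, which is why adding sources grows the model linearly but slowly compared with keeping one backbone per domain.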

[136] Hybrid Machine Learning Framework for Predicting Geometric Deviations from 3D Surface Metrology

Hamidreza Samadi,Md Manjurul Ahsan,Shivakumar Raman

Main category: cs.CV

TL;DR: The paper presents a methodology for accurately forecasting geometric deviations in manufactured components using advanced 3D surface analysis and a hybrid machine learning framework. The proposed system significantly improves prediction accuracy over conventional methods and reveals hidden correlations between manufacturing parameters and geometric deviations.

Details

Motivation: The motivation behind the paper is to address the challenge of accurately forecasting geometric deviations in manufactured components, particularly for complex geometries, using advanced 3D surface analysis. Method: The paper presents a methodology that uses a high-resolution 3D scanner to gather multi-angle surface data, which is then processed through alignment, noise reduction, and merging techniques. This data is used to train a hybrid machine learning framework combining convolutional neural networks and gradient-boosted decision trees for predictive modeling. Result: The proposed system achieved a prediction accuracy of 0.012 mm at a 95% confidence level, which is a 73% improvement over conventional statistical process control methods. The model also revealed hidden correlations between manufacturing parameters and geometric deviations. Conclusion: The paper concludes that their proposed system, which combines convolutional neural networks and gradient-boosted decision trees, has immense potential for automated quality control, predictive maintenance, and design optimization in precision manufacturing. Abstract: This study addresses the challenge of accurately forecasting geometric deviations in manufactured components using advanced 3D surface analysis. Despite progress in modern manufacturing, maintaining dimensional precision remains difficult, particularly for complex geometries. We present a methodology that employs a high-resolution 3D scanner to acquire multi-angle surface data from 237 components produced across different batches. The data were processed through precise alignment, noise reduction, and merging techniques to generate accurate 3D representations. A hybrid machine learning framework was developed, combining convolutional neural networks for feature extraction with gradient-boosted decision trees for predictive modeling. 
The proposed system achieved a prediction accuracy of 0.012 mm at a 95% confidence level, representing a 73% improvement over conventional statistical process control methods. In addition to improved accuracy, the model revealed hidden correlations between manufacturing parameters and geometric deviations. This approach offers significant potential for automated quality control, predictive maintenance, and design optimization in precision manufacturing, and the resulting dataset provides a strong foundation for future predictive modeling research.

[137] AGIC: Attention-Guided Image Captioning to Improve Caption Relevance

L. D. M. S. Sai Teja,Ashok Urlana,Pruthwik Mishra

Main category: cs.CV

TL;DR: This paper proposes AGIC, an image captioning method that enhances salient visual regions and balances fluency and diversity through a hybrid decoding strategy, demonstrating strong performance and faster inference on benchmark datasets.

Details

Motivation: Despite significant progress in image captioning, generating accurate and descriptive captions remains a long-standing challenge. Method: We propose Attention-Guided Image Captioning (AGIC) which amplifies salient visual regions directly in the feature space to guide caption generation. We further introduce a hybrid decoding strategy that combines deterministic and probabilistic sampling. Result: AGIC matches or surpasses several state-of-the-art models while achieving faster inference. The evaluation was conducted on the Flickr8k and Flickr30k datasets. Conclusion: AGIC demonstrates strong performance across multiple evaluation metrics, offering a scalable and interpretable solution for image captioning. Abstract: Despite significant progress in image captioning, generating accurate and descriptive captions remains a long-standing challenge. In this study, we propose Attention-Guided Image Captioning (AGIC), which amplifies salient visual regions directly in the feature space to guide caption generation. We further introduce a hybrid decoding strategy that combines deterministic and probabilistic sampling to balance fluency and diversity. To evaluate AGIC, we conduct extensive experiments on the Flickr8k and Flickr30k datasets. The results show that AGIC matches or surpasses several state-of-the-art models while achieving faster inference. Moreover, AGIC demonstrates strong performance across multiple evaluation metrics, offering a scalable and interpretable solution for image captioning.

[138] A Joint Sparse Self-Representation Learning Method for Multiview Clustering

Mengxue Jia,Zhihua Allen-Zhao,You Zhao,Sanyang Liu

Main category: cs.CV

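TL;DR: This paper proposes a new multiview clustering method that improves clustering performance by introducing cardinality constraints and an alternating quadratic penalty method.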

Details

Motivation: Multiview clustering aims to group samples using consistent and complementary information across different views. Subspace clustering, a fundamental technique for multiview clustering, has attracted wide attention. Method: A joint sparse self-representation learning model is proposed that introduces cardinality (i.e., $\ell_0$-norm) constraints to extract view-specific local information, and an alternating quadratic penalty method with global convergence is developed to solve the model. Result: Experiments on six standard datasets show that the proposed model and method outperform eight state-of-the-art algorithms. Conclusion: By introducing cardinality constraints and an alternating quadratic penalty method, the paper improves multiview clustering performance. Abstract: Multiview clustering (MC) aims to group samples using consistent and complementary information across various views. Subspace clustering, as a fundamental technique of MC, has attracted significant attention. In this paper, we propose a novel joint sparse self-representation learning model for MC, where a featured difference is the extraction of view-specific local information by introducing cardinality (i.e., $\ell_0$-norm) constraints instead of Graph-Laplacian regularization. Specifically, under each view, cardinality constraints directly restrict the samples used in the self-representation stage to extract reliable local and global structure information, while the low-rank constraint aids in revealing a global coherent structure in the consensus affinity matrix during merging. The attendant challenge is that Augmented Lagrange Method (ALM)-based alternating minimization algorithms cannot guarantee convergence when applied directly to our nonconvex, nonsmooth model, thus resulting in poor generalization ability. To address this, we develop an alternating quadratic penalty (AQP) method with global convergence, where two subproblems are iteratively solved by closed-form solutions. Empirical results on six standard datasets demonstrate the superiority of our model and AQP method, compared to eight state-of-the-art algorithms.
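
Cardinality constraints typically admit a simple closed-form projection, which is one reason penalty-style splitting methods can handle them. The sketch below shows the standard Euclidean projection onto the set of vectors with at most `k` nonzeros. It illustrates the constraint only; the paper's actual AQP subproblems are not given in the abstract and may differ.

```python
import numpy as np

def project_cardinality(v, k):
    """Project a 1D vector onto {x : ||x||_0 <= k} by keeping the k largest |v_i|."""
    v = np.asarray(v, dtype=float)
    out = np.zeros_like(v)
    if k <= 0:
        return out
    k = min(k, v.size)
    # Indices of the k entries with the largest magnitude (order not guaranteed).
    keep = np.argpartition(np.abs(v), -k)[-k:]
    out[keep] = v[keep]
    return out
```

Applied to one column of a self-representation matrix, this keeps only the `k` samples that contribute most to reconstructing a given sample, which is the sense in which the constraint restricts the samples used in the self-representation stage.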

[139] VSI: Visual Subtitle Integration for Keyframe Selection to enhance Long Video Understanding

Jianxiang He,Shaoguang Wang,Weiyu Guo,Meisheng Hong,Jungang Li,Yijie Xu,Ziyang Chen,Hui Xiong

Main category: cs.CV

TL;DR: This paper proposes VSI, a dual-stream multimodal keyframe search method that improves long video understanding by integrating visual and textual data, achieving state-of-the-art results.

Details

Motivation: Keyframe retrieval in long video understanding is challenged by weak multimodal alignment and the inability to capture complex temporal semantic information, necessitating a more effective approach. Method: VSI integrates subtitles, timestamps, and scene boundaries into a unified multimodal search process using a dual-stream mechanism: Video Search Stream for visual information and Subtitle Match Stream for textual information. Result: VSI achieved 40.00% keyframe localization accuracy on LongVideoBench and 68.48% accuracy on downstream long Video-QA tasks, surpassing baselines by 20.35% and 15.79%, respectively. Conclusion: The proposed Visual-Subtitle Integration (VSI) method demonstrates robustness and generalizability in multimodal keyframe search, achieving state-of-the-art results on long video understanding tasks. Abstract: Long video understanding presents a significant challenge to multimodal large language models (MLLMs) primarily due to the immense data scale. A critical and widely adopted strategy for making this task computationally tractable is keyframe retrieval, which seeks to identify a sparse set of video frames that are most salient to a given textual query. However, the efficacy of this approach is hindered by weak multimodal alignment between textual queries and visual content, and the approach fails to capture the complex temporal semantic information required for precise reasoning. To address this, we propose Visual-Subtitle Integration (VSI), a multimodal keyframe search method that integrates subtitles, timestamps, and scene boundaries into a unified multimodal search process. The proposed method captures the visual information of video frames and the complementary textual information through a dual-stream search mechanism consisting of a Video Search Stream and a Subtitle Match Stream, respectively, and improves keyframe search accuracy through the interaction of the two search streams. 
Experimental results show that VSI achieves 40.00% keyframe localization accuracy on the text-relevant subset of LongVideoBench and 68.48% accuracy on downstream long Video-QA tasks, surpassing competitive baselines by 20.35% and 15.79%, respectively. Furthermore, on LongVideoBench, VSI achieves state-of-the-art (SOTA) results in medium-to-long video-QA tasks, demonstrating the robustness and generalizability of the proposed multimodal search strategy.

[140] NS-FPN: Improving Infrared Small Target Detection and Segmentation from Noise Suppression Perspective

Maoxun Yuan,Duanni Meng,Ziteng Xi,Tianyi Zhao,Shiji Zhao,Yimian Dai,Xingxing Wei

Main category: cs.CV

TL;DR: This paper introduces a novel noise-suppression feature pyramid network (NS-FPN) for infrared small target detection and segmentation, effectively reducing false alarms and achieving superior performance on public datasets.

Details

Motivation: Infrared small target detection and segmentation (IRSTDS) is challenging due to dim, shapeless targets and background clutter. Existing CNN-based methods focus on feature enhancement but suffer from increased false alarms; thus, a new approach focusing on noise suppression is needed. Method: The authors propose a novel noise-suppression feature pyramid network (NS-FPN) that incorporates a low-frequency guided feature purification (LFP) module and a spiral-aware feature sampling (SFS) module into the feature pyramid network (FPN) structure. Result: The NS-FPN achieves superior performance on IRSTDS tasks and significantly reduces false alarms, as demonstrated by extensive experiments on public IRSTDS datasets. Conclusion: The proposed NS-FPN effectively suppresses noise and enhances target-relevant features, significantly reducing false alarms in IRSTDS tasks and demonstrating superior performance on public datasets. Abstract: Infrared small target detection and segmentation (IRSTDS) is a critical yet challenging task in defense and civilian applications, owing to the dim, shapeless appearance of targets and severe background clutter. Recent CNN-based methods have achieved promising target perception results, but they focus only on enhancing feature representation to offset the impact of noise, which increases false alarms. In this paper, by analyzing the problem in the frequency domain, we pioneer improving performance from a noise suppression perspective and propose a novel noise-suppression feature pyramid network (NS-FPN), which integrates a low-frequency guided feature purification (LFP) module and a spiral-aware feature sampling (SFS) module into the original FPN structure. 
The LFP module suppresses noise features by purifying high-frequency components to achieve feature enhancement devoid of noise interference, while the SFS module further adopts spiral sampling to fuse target-relevant features in the feature fusion process. Our NS-FPN is designed to be lightweight yet effective and can be easily plugged into existing IRSTDS frameworks. Extensive experiments on public IRSTDS datasets demonstrate that our method significantly reduces false alarms and achieves superior performance on IRSTDS tasks.

[141] BASIC: Boosting Visual Alignment with Intrinsic Refined Embeddings in Multimodal Large Language Models

Jianting Tang,Yubo Wang,Haoyu Cao,Linli Xu

Main category: cs.CV

TL;DR: BASIC improves Multimodal Large Language Models' performance by directly supervising visual embeddings within the LLM, optimizing both embedding directions and semantic matching without needing extra annotations or models.

Details

Motivation: The motivation is to overcome the limitations of current alignment approaches that neglect direct visual supervision, thereby improving the finer alignment of visual embeddings in MLLMs. Method: BASIC optimizes the generation of visual embeddings by using refined visual embeddings within the LLM as supervision. This involves optimizing embedding directions and enhancing semantic matching through minimizing disparities in logit distributions. Result: BASIC significantly enhances the performance of MLLMs across various benchmarks by achieving better visual-textual alignment through direct visual supervision. Conclusion: BASIC is an effective method for improving the performance of Multimodal Large Language Models by introducing direct visual supervision, without the need for additional supervisory models or artificial annotations. Abstract: Mainstream Multimodal Large Language Models (MLLMs) achieve visual understanding by using a vision projector to bridge well-pretrained vision encoders and large language models (LLMs). The inherent gap between visual and textual modalities makes the embeddings from the vision projector critical for visual comprehension. However, current alignment approaches treat visual embeddings as contextual cues and merely apply auto-regressive supervision to textual outputs, neglecting the necessity of introducing equivalent direct visual supervision, which hinders the potential finer alignment of visual embeddings. In this paper, based on our analysis of the refinement process of visual embeddings in the LLM's shallow layers, we propose BASIC, a method that utilizes refined visual embeddings within the LLM as supervision to directly guide the projector in generating initial visual embeddings. 
Specifically, the guidance is conducted from two perspectives: (i) optimizing embedding directions by reducing angles between initial and supervisory embeddings in semantic space; (ii) improving semantic matching by minimizing disparities between the logit distributions of both visual embeddings. Without additional supervisory models or artificial annotations, BASIC significantly improves the performance of MLLMs across a wide range of benchmarks, demonstrating the effectiveness of our introduced direct visual supervision.
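
The two supervision signals described above can be sketched as a pair of simple losses. The version below assumes cosine distance for the direction term and a KL divergence from the refined to the initial logit distribution for the matching term; the abstract names neither choice explicitly, so both are illustrative and the function names are hypothetical.

```python
import numpy as np

def log_softmax(x):
    """Row-wise log-softmax, shifted for numerical stability."""
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def direction_loss(initial, refined):
    """Mean (1 - cosine similarity) between paired visual embeddings."""
    num = np.sum(initial * refined, axis=-1)
    den = np.linalg.norm(initial, axis=-1) * np.linalg.norm(refined, axis=-1)
    return float(np.mean(1.0 - num / den))

def distribution_loss(initial_logits, refined_logits):
    """Mean KL(refined || initial) over visual tokens, computed from logits."""
    log_p = log_softmax(refined_logits)
    log_q = log_softmax(initial_logits)
    return float(np.mean(np.sum(np.exp(log_p) * (log_p - log_q), axis=-1)))
```

In a training loop the refined embeddings would be treated as fixed targets so that gradients flow only into the projector, which matches the description of the refined embeddings acting as supervision.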

[142] Advancements in Chinese font generation since deep learning era: A survey

Weiran Chen,Guiqian Zhu,Ying Li,Yi Ji,Chunping Liu

Main category: cs.CV

TL;DR: This paper reviews recent deep learning approaches to Chinese font generation, categorizing them into many-shot and few-shot methods, and outlines challenges and future research directions in improving font generation quality.

Details

Motivation: Chinese font generation is an important topic for font designers and typographers, and while deep learning has advanced the field, improving the overall quality of generated Chinese character images remains a challenge. Method: The paper conducts a holistic survey of recent deep learning-based approaches to Chinese font generation, categorizing methods based on the number of reference samples required (many-shot vs. few-shot), and reviewing relevant architectures, datasets, and evaluation metrics. Result: A comprehensive review of deep learning methods for Chinese font generation, including categorization into many-shot and few-shot approaches, with discussion of their strengths, limitations, and evaluation metrics. Conclusion: The paper concludes with an outline of challenges and future directions in Chinese font generation, aiming to provide valuable insights for researchers in the field. Abstract: Chinese font generation aims to create a new Chinese font library based on some reference samples. It is a topic of great concern to many font designers and typographers. Over the past years, with the rapid development of deep learning algorithms, various new techniques have achieved flourishing and thriving progress. Nevertheless, how to improve the overall quality of generated Chinese character images remains a tough issue. In this paper, we conduct a holistic survey of the recent Chinese font generation approaches based on deep learning. To be specific, we first illustrate the research background of the task. Then, we outline our literature selection and analysis methodology, and review a series of related fundamentals, including classical deep learning architectures, font representation formats, public datasets, and frequently-used evaluation metrics. 
After that, relying on the number of reference samples required to generate a new font, we categorize the existing methods into two major groups: many-shot font generation and few-shot font generation methods. Within each category, representative approaches are summarized, and their strengths and limitations are also discussed in detail. Finally, we conclude our paper with the challenges and future directions, with the expectation to provide some valuable illuminations for the researchers in this field.

[143] eMotions: A Large-Scale Dataset and Audio-Visual Fusion Network for Emotion Analysis in Short-form Videos

Xuecheng Wu,Dingkang Yang,Danlei Huang,Xinyi Yin,Yifan Wang,Jia Zhang,Jiayu Nie,Liangyu Fu,Yang Liu,Junxiao Xue,Hadi Amirpour,Wei Zhou

Main category: cs.CV

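TL;DR: This paper presents the eMotions dataset and the AV-CANet model, which address the multimodal complexity and semantic gaps of emotion analysis in short-form videos.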

Details

Motivation: The multimodal complexity of short-form videos poses new challenges for emotion analysis, and existing datasets and methods fall short, so a new dataset and a new method are needed. Method: The authors propose AV-CANet, an end-to-end audio-visual fusion network that includes a Local-Global Fusion Module and the EP-CE Loss, to address the challenges of emotion analysis in short-form videos. Result: Experiments on three eMotions-related datasets and four public VEA datasets validate the effectiveness of AV-CANet and provide insights for future research. Conclusion: The paper introduces eMotions, a large-scale dataset for emotion analysis in short-form videos, and designs a new network architecture, AV-CANet, whose effectiveness is validated experimentally. Abstract: Short-form videos (SVs) have become a vital part of our online routine for acquiring and sharing information. Their multimodal complexity poses new challenges for video analysis, highlighting the need for video emotion analysis (VEA) within the community. Given the limited availability of SVs emotion data, we introduce eMotions, a large-scale dataset consisting of 27,996 videos with full-scale annotations. To ensure quality and reduce subjective bias, we emphasize better personnel allocation and propose a multi-stage annotation procedure. Additionally, we provide the category-balanced and test-oriented variants through targeted sampling to meet diverse needs. While there have been significant studies on videos with clear emotional cues (e.g., facial expressions), analyzing emotions in SVs remains a challenging task. The challenge arises from the broader content diversity, which introduces more distinct semantic gaps and complicates the representation learning of emotion-related features. Furthermore, the prevalence of audio-visual co-expressions in SVs leads to the local biases and collective information gaps caused by the inconsistencies in emotional expressions. To tackle this, we propose AV-CANet, an end-to-end audio-visual fusion network that leverages video transformer to capture semantically relevant representations. We further introduce the Local-Global Fusion Module designed to progressively capture the correlations of audio-visual features. Besides, EP-CE Loss is constructed to globally steer optimizations with tripolar penalties. 
Extensive experiments across three eMotions-related datasets and four public VEA datasets demonstrate the effectiveness of our proposed AV-CANet, while providing broad insights for future research. Moreover, we conduct ablation studies to examine the critical components of our method. Dataset and code will be made available on GitHub.

[144] A Simple yet Powerful Instance-Aware Prompting Framework for Training-free Camouflaged Object Segmentation

Chao Yin,Jide Li,Xiaoqiang Li

Main category: cs.CV

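TL;DR: The paper proposes IAPF, a new training-free COS pipeline that handles scenes with multiple discrete camouflaged instances and performs strongly on standard COS benchmarks.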

Details

Motivation: Existing training-free COS methods typically produce only semantic-level visual prompts, causing SAM to output coarse semantic masks that cannot effectively handle scenes with multiple discrete camouflaged instances. Method: IAPF comprises three steps: a Text Prompt Generator, an Instance Mask Generator, and Self-consistency Instance Mask Voting. Result: IAPF significantly surpasses existing state-of-the-art training-free COS methods on standard COS benchmarks. Conclusion: IAPF is a training-free COS pipeline that converts a task-generic prompt into fine-grained instance masks and significantly surpasses existing state-of-the-art training-free COS methods on standard COS benchmarks. Abstract: Camouflaged Object Segmentation (COS) remains highly challenging due to the intrinsic visual similarity between target objects and their surroundings. While training-based COS methods achieve good performance, their performance degrades rapidly with increased annotation sparsity. To circumvent this limitation, recent studies have explored training-free COS methods, leveraging the Segment Anything Model (SAM) by automatically generating visual prompts from a single task-generic prompt (e.g., "camouflaged animal") uniformly applied across all test images. However, these methods typically produce only semantic-level visual prompts, causing SAM to output coarse semantic masks and thus failing to handle scenarios with multiple discrete camouflaged instances effectively. To address this critical limitation, we propose a simple yet powerful Instance-Aware Prompting Framework (IAPF), the first training-free COS pipeline that explicitly converts a task-generic prompt into fine-grained instance masks. 
Specifically, the IAPF comprises three steps: (1) Text Prompt Generator, utilizing task-generic queries to prompt a Multimodal Large Language Model (MLLM) for generating image-specific foreground and background tags; (2) Instance Mask Generator, leveraging Grounding DINO to produce precise instance-level bounding box prompts, alongside the proposed Single-Foreground Multi-Background Prompting strategy to sample region-constrained point prompts within each box, enabling SAM to yield a candidate instance mask; (3) Self-consistency Instance Mask Voting, which selects the final COS prediction by identifying the candidate mask most consistent across multiple candidate instance masks. Extensive evaluations on standard COS benchmarks demonstrate that the proposed IAPF significantly surpasses existing state-of-the-art training-free COS methods.
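The voting step (3) can be sketched as follows. The abstract does not say how "consistency" is measured, so this sketch assumes mean intersection-over-union (IoU) against the other candidates; masks are represented as sets of pixel coordinates, and the function names are hypothetical.

```python
def iou(mask_a, mask_b):
    """Intersection-over-union of two binary masks given as sets of (row, col) pixels."""
    union = len(mask_a | mask_b)
    return len(mask_a & mask_b) / union if union else 1.0

def vote(candidates):
    """Return the index of the candidate with the highest mean IoU to all other candidates."""
    if len(candidates) == 1:
        return 0
    best_index, best_score = 0, -1.0
    for i, mask in enumerate(candidates):
        others = [iou(mask, other) for j, other in enumerate(candidates) if j != i]
        score = sum(others) / len(others)
        if score > best_score:
            best_index, best_score = i, score
    return best_index

# Three candidate masks for the same instance, of increasing size.
candidates = [
    {(0, 0), (0, 1)},
    {(0, 0), (0, 1), (1, 1)},
    {(0, 0), (0, 1), (1, 1), (1, 0)},
]
selected = vote(candidates)  # the middle mask agrees best with the other two
```

Choosing the candidate that agrees most with its peers favors masks that SAM reproduces under different point prompts, which is one plausible reading of "most consistent".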

[145] MultiRef: Controllable Image Generation with Multiple Visual References

Ruoxi Chen,Dongping Chen,Siyuan Wu,Sinan Wang,Shiyun Lang,Petr Sushko,Gaoyang Jiang,Yao Wan,Ranjay Krishna

Main category: cs.CV

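TL;DR: This paper introduces a new evaluation framework and dataset for studying and improving controllable image generation from multiple visual references, and finds that current state-of-the-art systems still struggle with the task.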

Details

Motivation: Current image generation frameworks rely mainly on single-source inputs, such as text prompts or individual reference images, whereas visual designers usually draw inspiration from multiple visual references, combining diverse elements and aesthetic principles to create artwork. Method: The paper introduces MultiRef-bench, a rigorous evaluation framework comprising 990 synthetic and 1,000 real-world samples, and uses the data engine RefBlend to build MultiRef, a dataset of 38k high-quality images. Result: Experiments across three interleaved image-text models (i.e., OmniGen, ACE, and Show-o) and six agentic frameworks (e.g., ChatDiT and LLM + SD) show that the best model, OmniGen, averages only 66.6% on synthetic samples and 79.0% on real-world cases relative to the golden answer. Conclusion: Even state-of-the-art systems struggle with multi-reference conditioning; the findings point toward more flexible, human-like creative tools that can effectively integrate multiple sources of visual inspiration. Abstract: Visual designers naturally draw inspiration from multiple visual references, combining diverse elements and aesthetic principles to create artwork. However, current image generative frameworks predominantly rely on single-source inputs -- either text prompts or individual reference images. In this paper, we focus on the task of controllable image generation using multiple visual references. We introduce MultiRef-bench, a rigorous evaluation framework comprising 990 synthetic and 1,000 real-world samples that require incorporating visual content from multiple reference images. The synthetic samples are generated through our data engine RefBlend, with 10 reference types and 33 reference combinations. Based on RefBlend, we further construct a dataset MultiRef containing 38k high-quality images to facilitate further research. Our experiments across three interleaved image-text models (i.e., OmniGen, ACE, and Show-o) and six agentic frameworks (e.g., ChatDiT and LLM + SD) reveal that even state-of-the-art systems struggle with multi-reference conditioning, with the best model OmniGen achieving only 66.6% in synthetic samples and 79.0% in real-world cases on average compared to the golden answer. These findings provide valuable directions for developing more flexible and human-like creative tools that can effectively integrate multiple sources of visual inspiration. The dataset is publicly available at: https://multiref.github.io/.

[146] MMReID-Bench: Unleashing the Power of MLLMs for Effective and Versatile Person Re-identification

Jinhao Li,Zijian Chen,Lirong Deng,Changbo Wang,Guangtao Zhai

Main category: cs.CV

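TL;DR: This paper introduces MMReID-Bench, a multi-task multi-modal benchmark designed for person re-identification that aims to make full use of the reasoning, instruction-following, and cross-modal understanding capabilities of multi-modal large language models.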

Details

Motivation: Traditional uni-modal person ReID models generalize poorly to multi-modal data (e.g., RGB, thermal, infrared, sketch images, and textual descriptions). Although multi-modal large language models (MLLMs) show promise, existing methods treat them merely as feature extractors or caption generators and do not fully unleash their reasoning, instruction-following, and cross-modal understanding capabilities. Method: The paper introduces MMReID-Bench, the first multi-task multi-modal benchmark specifically designed for person ReID. MMReID-Bench includes 20,710 multi-modal queries and gallery images covering 10 different person ReID tasks. Result: Comprehensive experiments show that MLLMs have remarkable capabilities in delivering effective and versatile person ReID. Conclusion: Although MLLMs still have limitations on a few modalities, particularly thermal and infrared data, they demonstrate remarkable capabilities in delivering effective and versatile person ReID. The authors hope MMReID-Bench will help the community develop more robust and generalizable multimodal foundation models for person ReID. Abstract: Person re-identification (ReID) aims to retrieve the images of a person of interest from the gallery images, with wide applications in medical rehabilitation, abnormal behavior detection, and public security. However, traditional person ReID models suffer from uni-modal capability, leading to poor generalization ability in multi-modal data, such as RGB, thermal, infrared, sketch images, textual descriptions, etc. Recently, the emergence of multi-modal large language models (MLLMs) shows a promising avenue for addressing this problem. Despite this potential, existing methods merely regard MLLMs as feature extractors or caption generators, which do not fully unleash their reasoning, instruction-following, and cross-modal understanding capabilities. To bridge this gap, we introduce MMReID-Bench, the first multi-task multi-modal benchmark specifically designed for person ReID. The MMReID-Bench includes 20,710 multi-modal queries and gallery images covering 10 different person ReID tasks. Comprehensive experiments demonstrate the remarkable capabilities of MLLMs in delivering effective and versatile person ReID. Nevertheless, they also have limitations in handling a few modalities, particularly thermal and infrared data. We hope MMReID-Bench can facilitate the community to develop more robust and generalizable multimodal foundation models for person ReID.

[147] Talk2Image: A Multi-Agent System for Multi-Turn Image Generation and Editing

Shichao Ma,Yunhe Guo,Jiahao Su,Qihe Huang,Zhengyang Zhou,Yang Wang

Main category: cs.CV

TL;DR: Talk2Image introduces a multi-agent system for interactive image generation and editing that effectively aligns with user intentions over multiple turns, providing better controllability, coherence, and user satisfaction.

Details

Motivation: Most text-to-image generation systems focus on single-turn scenarios and struggle with iterative, multi-turn creative tasks. Dialogue-based systems using a single agent often result in intention drift and incoherent edits. Method: Talk2Image utilizes intention parsing from dialogue history, task decomposition among specialized agents, and feedback-driven refinement based on multi-view evaluation to enable step-by-step alignment with user intentions. Result: Talk2Image outperforms existing baselines in controllability, coherence, and user satisfaction across iterative image generation and editing tasks. Conclusion: Talk2Image provides a multi-agent system that improves the controllability, coherence, and user satisfaction in iterative image generation and editing tasks compared to existing methods. Abstract: Text-to-image generation tasks have driven remarkable advances in diverse media applications, yet most focus on single-turn scenarios and struggle with iterative, multi-turn creative tasks. Recent dialogue-based systems attempt to bridge this gap, but their single-agent, sequential paradigm often causes intention drift and incoherent edits. To address these limitations, we present Talk2Image, a novel multi-agent system for interactive image generation and editing in multi-turn dialogue scenarios. Our approach integrates three key components: intention parsing from dialogue history, task decomposition and collaborative execution across specialized agents, and feedback-driven refinement based on a multi-view evaluation mechanism. Talk2Image enables step-by-step alignment with user intention and consistent image editing. Experiments demonstrate that Talk2Image outperforms existing baselines in controllability, coherence, and user satisfaction across iterative image generation and editing tasks.

[148] AR-GRPO: Training Autoregressive Image Generation Models via Reinforcement Learning

Shihao Yuan,Yahui Liu,Yang Yue,Jingyuan Zhang,Wangmeng Zuo,Qi Wang,Fuzheng Zhang,Guorui Zhou

Main category: cs.CV

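TL;DR: AR-GRPO optimizes autoregressive image generation models with online reinforcement learning training, significantly improving the quality and human preference of generated images.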

Details

Motivation: Inspired by the success of reinforcement learning in refining large language models, the authors propose AR-GRPO to integrate online reinforcement learning training into autoregressive image generation models and improve their performance. Method: The approach adapts the Group Relative Policy Optimization (GRPO) algorithm to refine the autoregressive models' outputs using carefully designed reward functions that evaluate the perceptual quality, realism, and semantic fidelity of generated images. Result: Experiments show that AR-GRPO significantly outperforms standard autoregressive baselines in image quality and human preference, with consistent improvements across various evaluation metrics. Conclusion: AR-GRPO successfully integrates online reinforcement learning training into autoregressive image generation models, significantly improving image quality and human preference and establishing the viability of RL-based optimization for autoregressive image generation. Abstract: Inspired by the success of reinforcement learning (RL) in refining large language models (LLMs), we propose AR-GRPO, an approach to integrate online RL training into autoregressive (AR) image generation models. We adapt the Group Relative Policy Optimization (GRPO) algorithm to refine the vanilla autoregressive models' outputs by carefully designed reward functions that evaluate generated images across multiple quality dimensions, including perceptual quality, realism, and semantic fidelity. We conduct comprehensive experiments on both class-conditional (i.e., class-to-image) and text-conditional (i.e., text-to-image) image generation tasks, demonstrating that our RL-enhanced framework significantly improves both the image quality and human preference of generated images compared to the standard AR baselines. Our results show consistent improvements across various evaluation metrics, establishing the viability of RL-based optimization for AR image generation and opening new avenues for controllable and high-quality image synthesis. The source codes and models are available at: https://github.com/Kwai-Klear/AR-GRPO.
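The core idea behind GRPO, scoring each sample relative to the other samples drawn for the same prompt, can be sketched in a few lines. This is a generic illustration of group-relative advantage normalization, not the authors' implementation: the use of the population standard deviation, the `eps` constant, and the example reward values are assumptions.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (reward - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; the exact estimator is an assumption
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical combined rewards (e.g., quality + realism + fidelity) for four
# images generated from the same class label or text prompt.
group_rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(group_rewards)
# Images scoring above the group mean get positive advantages and are reinforced;
# those below the mean get negative advantages.
```

Because the baseline is the group mean, no separate value network is needed, which is the main practical appeal of the group-relative formulation.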

[149] CannyEdit: Selective Canny Control and Dual-Prompt Guidance for Training-Free Image Editing

Weiyan Xie,Han Gao,Didan Deng,Kaican Li,April Hua Liu,Yongxiang Huang,Nevin L. Zhang

Main category: cs.CV

TL;DR: CannyEdit is a novel training-free framework for regional image editing that improves text adherence, context fidelity, and seamlessness through Selective Canny Control and Dual-Prompt Guidance.

Details

Motivation: Existing text-to-image editing methods struggle to balance text adherence in edited regions, context fidelity in unedited areas, and seamless integration of edits, which CannyEdit aims to address. Method: CannyEdit uses two key innovations: Selective Canny Control and Dual-Prompt Guidance, enabling precise text-driven edits while preserving unedited image details and maintaining coherent scene interactions. Result: CannyEdit achieves a 2.93 to 10.49 percent improvement in the balance of text adherence and context fidelity over prior methods like KV-Edit. User studies show significantly fewer users identified CannyEdit's results as AI-edited compared to competitors. Conclusion: CannyEdit is a training-free framework that successfully balances text adherence, context fidelity, and seamlessness of edits, outperforming existing methods in regional image editing tasks. Abstract: Recent advances in text-to-image (T2I) models have enabled training-free regional image editing by leveraging the generative priors of foundation models. However, existing methods struggle to balance text adherence in edited regions, context fidelity in unedited areas, and seamless integration of edits. We introduce CannyEdit, a novel training-free framework that addresses these challenges through two key innovations: (1) Selective Canny Control, which masks the structural guidance of Canny ControlNet in user-specified editable regions while strictly preserving details of the source images in unedited areas via inversion-phase ControlNet information retention. This enables precise, text-driven edits without compromising contextual integrity. (2) Dual-Prompt Guidance, which combines local prompts for object-specific edits with a global target prompt to maintain coherent scene interactions. 
On real-world image editing tasks (addition, replacement, removal), CannyEdit outperforms prior methods like KV-Edit, achieving a 2.93 to 10.49 percent improvement in the balance of text adherence and context fidelity. In terms of editing seamlessness, user studies reveal only 49.2 percent of general users and 42.0 percent of AIGC experts identified CannyEdit's results as AI-edited when paired with real images without edits, versus 76.08 to 89.09 percent for competitor methods.

[150] SLRTP2025 Sign Language Production Challenge: Methodology, Results, and Future Work

Harry Walsh,Ed Fish,Ozge Mercanoglu Sincan,Mohamed Ilyes Lakhal,Richard Bowden,Neil Fox,Bencie Woll,Kepeng Wu,Zecheng Li,Weichao Zhao,Haodong Wang,Wengang Zhou,Houqiang Li,Shengeng Tang,Jiayi He,Xu Wang,Ruobei Zhang,Yaxiong Wang,Lechao Cheng,Meryem Tasyurek,Tugce Kiziltepe,Hacer Yalim Keles

Main category: cs.CV

TL;DR: The paper introduces the first Sign Language Production Challenge to evaluate Text-to-Pose translation methods and presents the winning methodologies.

Details

Motivation: The motivation for the paper is the lack of standardized evaluation metrics for Sign Language Production approaches, which hampers meaningful comparisons across different systems. Method: The paper describes the design of the Sign Language Production Challenge, which evaluates architectures that translate spoken language sentences to sequences of skeleton poses. They used the RWTH-PHOENIX-Weather-2014T dataset and a custom hidden test set for evaluation data. Result: The challenge attracted 33 participants who submitted 231 solutions, with the top-performing team achieving BLEU-1 scores of 31.40 and DTW-MJE of 0.0574. The winning approach utilized a retrieval-based framework and a pre-trained language model. Conclusion: The paper concludes that the introduction of the Sign Language Production Challenge has provided a standardized evaluation network for the SLP field, enabling future researchers to compare their work against a broader range of methods. Abstract: Sign Language Production (SLP) is the task of generating sign language video from spoken language inputs. The field has seen a range of innovations over the last few years, with the introduction of deep learning-based approaches providing significant improvements in the realism and naturalness of generated outputs. However, the lack of standardized evaluation metrics for SLP approaches hampers meaningful comparisons across different systems. To address this, we introduce the first Sign Language Production Challenge, held as part of the third SLRTP Workshop at CVPR 2025. The competition's aims are to evaluate architectures that translate from spoken language sentences to a sequence of skeleton poses, known as Text-to-Pose (T2P) translation, over a range of metrics. For our evaluation data, we use the RWTH-PHOENIX-Weather-2014T dataset, a German Sign Language - Deutsche Gebärdensprache (DGS) weather broadcast dataset. 
In addition, we curate a custom hidden test set from a similar domain of discourse. This paper presents the challenge design and the winning methodologies. The challenge attracted 33 participants who submitted 231 solutions, with the top-performing team achieving BLEU-1 scores of 31.40 and DTW-MJE of 0.0574. The winning approach utilized a retrieval-based framework and a pre-trained language model. As part of the workshop, we release a standardized evaluation network, including high-quality skeleton extraction-based keypoints establishing a consistent baseline for the SLP field, which will enable future researchers to compare their work against a broader range of methods.
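For readers unfamiliar with the DTW-MJE metric named above, the following sketch shows one way such a score can be computed: dynamic time warping (DTW) aligns a predicted pose sequence to a reference sequence, using the mean joint error (MJE) between two frames as the local cost. The challenge's exact definition is not given in the abstract, so the Euclidean joint distance and the normalization by the longer sequence length are assumptions, and the function names are hypothetical.

```python
import math

def mean_joint_error(frame_a, frame_b):
    """Mean Euclidean distance between corresponding joints of two pose frames."""
    return sum(math.dist(p, q) for p, q in zip(frame_a, frame_b)) / len(frame_a)

def dtw_mje(predicted, reference):
    """DTW alignment cost with MJE as the frame distance, divided by the longer length."""
    n, m = len(predicted), len(reference)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = mean_joint_error(predicted[i - 1], reference[j - 1])
            # Extend the cheapest of the three admissible warping moves.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / max(n, m)  # normalization choice is an assumption

# Two-joint toy skeletons: the prediction holds the first pose one frame longer.
pose_a = [(0.0, 0.0), (1.0, 0.0)]
pose_b = [(0.0, 1.0), (1.0, 1.0)]
score = dtw_mje([pose_a, pose_a, pose_b], [pose_a, pose_b])
```

Because DTW tolerates differences in signing speed, a prediction that merely lingers on a pose is not penalized, whereas spatially wrong joints are.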

[151] Beyond Frequency: Seeing Subtle Cues Through the Lens of Spatial Decomposition for Fine-Grained Visual Classification

Qin Xu,Lili Zhu,Xiaoxia Cheng,Bo Jiang

Main category: cs.CV

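TL;DR: This paper proposes SCOPE, a new fine-grained visual classification method that uses dynamic multi-scale feature fusion in the spatial domain to improve detail representation and semantic understanding, achieving state-of-the-art results on multiple benchmarks.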

Details

Motivation: The key to fine-grained visual classification is capturing subtle visual characteristics that discriminate between classes. Frequency-decomposition-based methods have some discriminative ability, but because they rely on fixed basis functions they lack adaptability to image content and cannot adjust dynamically to different discriminative requirements; SCOPE is proposed to overcome these limitations. Method: The core of the method is two modules: the Subtle Detail Extractor (SDE) and the Salient Semantic Refiner (SSR). SDE dynamically enhances details such as edges and textures in shallow features, while SSR learns semantically coherent, structure-aware high-level features guided by the enhanced shallow features. The two modules are cascaded stage by stage to progressively combine local details with global semantics. Result: Experiments show that the proposed method achieves state-of-the-art performance on four popular fine-grained image classification benchmarks. Conclusion: The paper proposes SCOPE, a new fine-grained visual classification method that adaptively enhances low-level details and high-level semantics in the spatial domain, overcomes the fixed-scale limitation of the frequency domain, and sets a new state of the art on four popular fine-grained image classification benchmarks. Abstract: The crux of resolving fine-grained visual classification (FGVC) lies in capturing discriminative and class-specific cues that correspond to subtle visual characteristics. Recently, frequency decomposition/transform based approaches have attracted considerable interest because of their apparent ability to mine discriminative cues. However, the frequency-domain methods are based on fixed basis functions, lacking adaptability to image content and unable to dynamically adjust feature extraction according to the discriminative requirements of different images. To address this, we propose a novel method for FGVC, named Subtle-Cue Oriented Perception Engine (SCOPE), which adaptively enhances the representational capability of low-level details and high-level semantics in the spatial domain, breaking through the limitations of fixed scales in the frequency domain and improving the flexibility of multi-scale fusion. The core of SCOPE lies in two modules: the Subtle Detail Extractor (SDE), which dynamically enhances subtle details such as edges and textures from shallow features, and the Salient Semantic Refiner (SSR), which learns semantically coherent and structure-aware refinement features from the high-level features guided by the enhanced shallow features. The SDE and SSR are cascaded stage-by-stage to progressively combine local details with global semantics. Extensive experiments demonstrate that our method achieves new state-of-the-art on four popular fine-grained image classification benchmarks.

[152] Adversarial Video Promotion Against Text-to-Video Retrieval

Qiwei Tian,Chenhao Lin,Zhengyu Zhao,Qian Li,Shuai Liu,Chao Shen

Main category: cs.CV

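TL;DR: This paper proposes ViPro, an attack on text-to-video retrieval systems that uses Modal Refinement to strengthen the attack, performs well across multiple settings, exposes a system vulnerability, and offers insights for defence.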

Details

Motivation: Existing attacks on text-to-video retrieval focus on suppressing video ranks, while attacks that promote video ranks remain under-studied. Such attacks could bring attackers financial gain or spread misinformation. Method: The paper proposes an attack called ViPro, together with Modal Refinement (MoRe) to enhance black-box transferability, with the goal of promoting a video's rank across multiple queries. Result: ViPro surpasses existing baselines by over 30%, 10%, and 4% on average in white-box, grey-box, and black-box settings, respectively; the experiments cover multiple models and datasets and also evaluate imperceptibility and defences. Conclusion: ViPro outperforms existing methods at promoting video ranks, exposes an overlooked vulnerability in text-to-video retrieval systems, and offers insights for defence. Abstract: Thanks to the development of cross-modal models, text-to-video retrieval (T2VR) is advancing rapidly, but its robustness remains largely unexamined. Existing attacks against T2VR are designed to push videos away from queries, i.e., suppressing the ranks of videos, while the attacks that pull videos towards selected queries, i.e., promoting the ranks of videos, remain largely unexplored. These attacks can be more impactful as attackers may gain more views/clicks for financial benefits and widespread (mis)information. To this end, we pioneer the first attack against T2VR to promote videos adversarially, dubbed the Video Promotion attack (ViPro). We further propose Modal Refinement (MoRe) to capture the finer-grained, intricate interaction between visual and textual modalities to enhance black-box transferability. Comprehensive experiments cover 2 existing baselines, 3 leading T2VR models, 3 prevailing datasets with over 10k videos, evaluated under 3 scenarios. All experiments are conducted in a multi-target setting to reflect realistic scenarios where attackers seek to promote the video regarding multiple queries simultaneously. We also evaluated our attacks for defences and imperceptibility. Overall, ViPro surpasses other baselines by over 30/10/4% for white/grey/black-box settings on average. Our work highlights an overlooked vulnerability, provides a qualitative analysis on the upper/lower bound of our attacks, and offers insights into potential counterplays. Code will be publicly available at https://github.com/michaeltian108/ViPro.

[153] Evaluating Fisheye-Compatible 3D Gaussian Splatting Methods on Real Images Beyond 180 Degree Field of View

Ulas Gunes,Matias Turkulainen,Juho Kannala,Esa Rahtu

Main category: cs.CV

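TL;DR: This study evaluates Fisheye-GS and 3DGUT for wide-field-of-view 3D reconstruction and proposes a depth-prediction-based initialization that works where SfM often fails and yields reconstruction quality on par with SfM.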

Details

Motivation: Because fisheye images exhibit extreme distortion, conventional SfM-based initialization often fails, so a more robust initialization strategy is needed. Method: The paper evaluates Fisheye-GS and 3DGUT at different fields of view (200, 160, and 120 degrees) and proposes an initialization strategy based on UniK3D depth predictions. Result: Fisheye-GS improves as the field of view is reduced, particularly at 160 degrees; 3DGUT remains stable across all settings and maintains high perceptual quality at the full 200 degree view. UniK3D predictions produce dense point clouds that support SfM-level reconstruction quality even in difficult scenes with fog, glare, or sky. Conclusion: Fisheye-based 3DGS methods are practically viable for wide-angle 3D reconstruction, especially from sparse and distortion-heavy image inputs. Abstract: We present the first evaluation of fisheye-based 3D Gaussian Splatting methods, Fisheye-GS and 3DGUT, on real images with fields of view exceeding 180 degrees. Our study covers both indoor and outdoor scenes captured with 200 degree fisheye cameras and analyzes how each method handles extreme distortion in real world settings. We evaluate performance under varying fields of view (200 degree, 160 degree, and 120 degree) to study the tradeoff between peripheral distortion and spatial coverage. Fisheye-GS benefits from field of view (FoV) reduction, particularly at 160 degree, while 3DGUT remains stable across all settings and maintains high perceptual quality at the full 200 degree view. To address the limitations of SfM-based initialization, which often fails under strong distortion, we also propose a depth-based strategy using UniK3D predictions from only 2-3 fisheye images per scene. Although UniK3D is not trained on real fisheye data, it produces dense point clouds that enable reconstruction quality on par with SfM, even in difficult scenes with fog, glare, or sky. Our results highlight the practical viability of fisheye-based 3DGS methods for wide-angle 3D reconstruction from sparse and distortion-heavy image inputs.

[154] WeatherDiffusion: Weather-Guided Diffusion Model for Forward and Inverse Rendering

Yixin Zhu,Zuoliang Zhu,Miloš Hašan,Jian Yang,Jin Xie,Beibei Wang

Main category: cs.CV

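TL;DR: This paper proposes WeatherDiffusion, a diffusion-based rendering framework for autonomous driving scenes that tackles rendering under complex weather and lighting conditions and improves performance and controllability through a new attention mechanism and new datasets.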

Details

Motivation: Complex weather and illumination pose major challenges for rendering and reconstructing autonomous driving scenes, and existing large diffusion models are hard to control and lack robustness. Method: The paper proposes WeatherDiffusion, a diffusion-based framework that incorporates Intrinsic map-aware attention (MAA) and enables controllable weather and illumination editing through predicted intrinsic maps guided by text descriptions. It also introduces two datasets, WeatherSynthetic and WeatherReal. Result: WeatherDiffusion outperforms state-of-the-art methods on several benchmarks and, in downstream autonomous driving tasks such as object detection and image segmentation, enhances robustness in challenging weather scenarios. Conclusion: WeatherDiffusion performs well on forward and inverse rendering of autonomous driving scenes, particularly under complex weather and lighting conditions, with strong robustness and controllability. Abstract: Forward and inverse rendering have emerged as key techniques for enabling understanding and reconstruction in the context of autonomous driving (AD). However, complex weather and illumination pose great challenges to this task. The emergence of large diffusion models has shown promise in achieving reasonable results through learning from 2D priors, but these models are difficult to control and lack robustness. In this paper, we introduce WeatherDiffusion, a diffusion-based framework for forward and inverse rendering on AD scenes with various weather and lighting conditions. Our method enables authentic estimation of material properties, scene geometry, and lighting, and further supports controllable weather and illumination editing through the use of predicted intrinsic maps guided by text descriptions. We observe that different intrinsic maps should correspond to different regions of the original image. Based on this observation, we propose Intrinsic map-aware attention (MAA) to enable high-quality inverse rendering. Additionally, we introduce a synthetic dataset (i.e., WeatherSynthetic) and a real-world dataset (i.e., WeatherReal) for forward and inverse rendering on AD scenes with diverse weather and lighting. Extensive experiments show that our WeatherDiffusion outperforms state-of-the-art methods on several benchmarks. Moreover, our method demonstrates significant value in downstream tasks for AD, enhancing the robustness of object detection and image segmentation in challenging weather scenarios.

[155] TADoc: Robust Time-Aware Document Image Dewarping

Fangmin Zhao,Weichao Zeng,Zhenhang Li,Dongbao Yang,Yu Zhou

Main category: cs.CV

TL;DR: This paper introduces TADoc, a novel framework for document image dewarping modeled as a dynamic process, along with a new evaluation metric DLS, which together show robust and superior performance in handling complex document deformations.

Details

Motivation: Document image dewarping is increasingly important due to the rise of the digital economy and online working, but existing methods struggle with complex deformations and insufficient evaluation metrics. Method: TADoc (Time-Aware Document Dewarping Network), a lightweight framework that models dewarping as a dynamic process with intermediate states, and DLS (Document Layout Similarity), a new evaluation metric. Result: The TADoc framework demonstrates strong robustness and achieves superior performance on multiple benchmarks with varying document types and distortion levels. Conclusion: The proposed TADoc framework and DLS metric effectively address the challenges in document image dewarping, showing robustness and superior performance across various benchmarks. Abstract: Flattening curved, wrinkled, and rotated document images captured by portable photographing devices, termed document image dewarping, has become an increasingly important task with the rise of digital economy and online working. Although many methods have been proposed recently, they often struggle to achieve satisfactory results when confronted with intricate document structures and higher degrees of deformation in real-world scenarios. Our main insight is that, unlike other document restoration tasks (e.g., deblurring), dewarping in real physical scenes is a progressive motion rather than a one-step transformation. Based on this, we have undertaken two key initiatives. Firstly, we reformulate this task, modeling it for the first time as a dynamic process that encompasses a series of intermediate states. Secondly, we design a lightweight framework called TADoc (Time-Aware Document Dewarping Network) to address the geometric distortion of document images. In addition, due to the inadequacy of OCR metrics for document images containing sparse text, the comprehensiveness of evaluation is insufficient. 
To address this shortcoming, we propose a new metric, DLS (Document Layout Similarity), to evaluate the effectiveness of document dewarping in downstream tasks. Extensive experiments and in-depth evaluations have been conducted and the results indicate that our model possesses strong robustness, achieving superiority on several benchmarks with different document types and degrees of distortion.

[156] OctreeNCA: Single-Pass 184 MP Segmentation on Consumer Hardware

Nick Lemke,John Kalkhof,Niklas Babendererde,Anirban Mukhopadhyay

Main category: cs.CV

TL;DR: OctreeNCA improves VRAM efficiency and enables fast, global-context segmentation of large medical images and videos.

Details

Motivation: Medical applications require efficient segmentation of large inputs, but traditional architectures like UNets and Vision Transformers are VRAM-intensive and scale poorly. Method: Proposed OctreeNCA using an octree-based neighborhood definition and implemented an efficient CUDA-based NCA inference function. Result: OctreeNCA achieves 90% less VRAM usage than UNet and successfully segments 184 Megapixel pathology slices or 1-minute surgical videos in one go. Conclusion: OctreeNCA segments high-resolution images and videos efficiently with significantly less VRAM compared to UNet, enabling global consistency and fast inference. Abstract: Medical applications demand segmentation of large inputs, like prostate MRIs, pathology slices, or videos of surgery. These inputs should ideally be inferred at once to provide the model with proper spatial or temporal context. When segmenting large inputs, the VRAM consumption of the GPU becomes the bottleneck. Architectures like UNets or Vision Transformers scale very poorly in VRAM consumption, resulting in patch- or frame-wise approaches that compromise global consistency and inference speed. The lightweight Neural Cellular Automaton (NCA) is a bio-inspired model that is by construction size-invariant. However, due to its local-only communication rules, it lacks global knowledge. We propose OctreeNCA by generalizing the neighborhood definition using an octree data structure. Our generalized neighborhood definition enables the efficient traversal of global knowledge. Since deep learning frameworks are mainly developed for large multi-layer networks, their implementation does not fully leverage the advantages of NCAs. We implement an NCA inference function in CUDA that further reduces VRAM demands and increases inference speed. Our OctreeNCA segments high-resolution images and videos quickly while occupying 90% less VRAM than a UNet during evaluation. 
This allows us to segment 184 Megapixel pathology slices or 1-minute surgical videos at once.

[157] S2-UniSeg: Fast Universal Agglomerative Pooling for Scalable Segment Anything without Supervision

Huihui Xu,Jin Ye,Hongqiu Wang,Changkai Ji,Jiashi Lin,Ming Hu,Ziyan Huang,Ying Chen,Chenglong Ma,Tianbin Li,Lihao Liu,Junjun He,Lei Zhu

Main category: cs.CV

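TL;DR: S2-UniSeg introduces a new pseudo-mask algorithm and a new pretraining method that substantially improve self-supervised image segmentation, outperforming existing models on multiple benchmarks.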

Details

Motivation: Existing self-supervised image segmentation models require time-consuming multi-stage pretraining and pseudo-mask generation, which limits their scalability and optimization quality. Method: Proposes a new pseudo-mask algorithm, Fast Universal Agglomerative Pooling (UniAP), and a new pretraining method, Query-wise Self-Distillation (QuerySD), combined with a student and a momentum teacher for continuous pretraining. Result: S2-UniSeg achieves notable gains on four benchmarks: AP+6.9 on COCO, AR+11.1 on UVO, PixelAcc+4.5 on COCOStuff-27, and RQ+8.0 on Cityscapes. Conclusion: S2-UniSeg performs strongly on multiple benchmarks, outperforms the state-of-the-art UnSAM model, and improves further when scaled to a larger dataset. Abstract: Recent self-supervised image segmentation models have achieved promising performance on semantic segmentation and class-agnostic instance segmentation. However, their pretraining schedule is multi-stage, requiring a time-consuming pseudo-mask generation process between each training epoch. This time-consuming offline process not only makes it difficult to scale with training dataset size, but also leads to sub-optimal solutions due to its discontinuous optimization routine. To solve these problems, we first present a novel pseudo-mask algorithm, Fast Universal Agglomerative Pooling (UniAP). Each layer of UniAP can identify groups of similar nodes in parallel, allowing it to generate semantic-level, instance-level, and multi-granular pseudo-masks within tens of milliseconds for one image. Based on the fast UniAP, we propose the Scalable Self-Supervised Universal Segmentation (S2-UniSeg), which employs a student and a momentum teacher for continuous pretraining. A novel segmentation-oriented pretext task, Query-wise Self-Distillation (QuerySD), is proposed to pretrain S2-UniSeg to learn the local-to-global correspondences. Under the same setting, S2-UniSeg outperforms the SOTA UnSAM model, achieving notable improvements of AP+6.9 on COCO, AR+11.1 on UVO, PixelAcc+4.5 on COCOStuff-27, and RQ+8.0 on Cityscapes. After scaling up to a larger 2M-image subset of SA-1B, S2-UniSeg further achieves performance gains on all four benchmarks. Our code and pretrained models are available at https://github.com/bio-mlhui/S2-UniSeg
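The abstract mentions a student and a momentum teacher but does not state the update rule. A momentum teacher is commonly maintained as an exponential moving average (EMA) of the student's weights, so the sketch below assumes that standard form. The function name `ema_update`, the flat-list parameter layout, and the momentum value are illustrative assumptions, not details from the paper.

```python
def ema_update(teacher, student, momentum=0.99):
    """Return new teacher weights as an EMA of the student weights.

    Assumptions (not from the paper): parameters are flat lists of floats,
    and the momentum value is only illustrative.
    """
    if len(teacher) != len(student):
        raise ValueError("teacher and student must have the same number of parameters")
    # Each teacher weight moves a small step toward the matching student weight.
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


# One update step: the teacher drifts slowly toward the student.
teacher = ema_update([0.0, 1.0], [1.0, 1.0], momentum=0.9)
```

With a momentum close to 1 the teacher changes slowly, which is the usual reason such a teacher gives more stable targets than the student itself.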

[158] HiMat: DiT-based Ultra-High Resolution SVBRDF Generation

Zixiong Wang,Jian Yang,Yiwei Hu,Milos Hasan,Beibei Wang

Main category: cs.CV

TL;DR: HiMat is a lightweight diffusion-based framework that efficiently generates high-resolution SVBRDFs with strong coherence and detail, using a novel CrossStitch module to ensure consistency across maps.

Details

Motivation: Highly detailed SVBRDFs are crucial for 3D content creation, and existing high-resolution generative models need adaptation to produce aligned SVBRDF maps efficiently. Method: HiMat uses a lightweight CrossStitch module to maintain consistency across SVBRDF maps without altering the DiT backbone or training new VAEs. Result: HiMat achieves 4K-resolution SVBRDF generation with strong structural coherence and high-frequency details, demonstrating effectiveness through text-prompt tests and generalization to intrinsic decomposition tasks. Conclusion: HiMat is an effective framework for generating high-resolution, detailed SVBRDFs, showing strong performance and generalization capabilities. Abstract: Creating highly detailed SVBRDFs is essential for 3D content creation. The rise of high-resolution text-to-image generative models, based on diffusion transformers (DiT), suggests an opportunity to finetune them for this task. However, retargeting the models to produce multiple aligned SVBRDF maps instead of just RGB images, while achieving high efficiency and ensuring consistency across different maps, remains a challenge. In this paper, we introduce HiMat: a memory- and computation-efficient diffusion-based framework capable of generating native 4K-resolution SVBRDFs. A key challenge we address is maintaining consistency across different maps in a lightweight manner, without relying on training new VAEs or significantly altering the DiT backbone (which would damage its prior capabilities). To tackle this, we introduce the CrossStitch module, a lightweight convolutional module that captures inter-map dependencies through localized operations. Its weights are initialized such that the DiT backbone operation is unchanged before finetuning starts. HiMat enables generation with strong structural coherence and high-frequency details. Results with a large set of text prompts demonstrate the effectiveness of our approach for 4K SVBRDF generation. 
Further experiments suggest generalization to tasks such as intrinsic decomposition.
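The abstract says the CrossStitch weights are initialized so that the DiT backbone's operation is unchanged before finetuning starts, without giving the mechanism. One common way to obtain that property is a residual branch whose weights start at zero; the sketch below assumes that form. It replaces the paper's convolutional module with a plain per-element mix across maps, and the names `cross_map_mix` and `zero_init` are hypothetical.

```python
def zero_init(n_maps):
    """Mixing matrix of zeros: the residual branch contributes nothing at start."""
    return [[0.0] * n_maps for _ in range(n_maps)]


def cross_map_mix(maps, weights):
    """Add a weighted mix of all maps to each map (residual form).

    maps:    list of M feature vectors (one per SVBRDF map), all the same length
    weights: M x M mixing matrix; with all zeros the output equals the input
    Assumption: this stands in for the paper's convolutional module.
    """
    n_maps = len(maps)
    length = len(maps[0])
    mixed = []
    for i in range(n_maps):
        row = []
        for k in range(length):
            residual = sum(weights[i][j] * maps[j][k] for j in range(n_maps))
            row.append(maps[i][k] + residual)
        mixed.append(row)
    return mixed


maps = [[1.0, 2.0], [3.0, 4.0]]
unchanged = cross_map_mix(maps, zero_init(2))  # identical to `maps` at initialization
```

Starting from an identity mapping means finetuning begins from the pretrained backbone's behavior, and inter-map dependencies are learned only as the weights move away from zero.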

[159] TerraMAE: Learning Spatial-Spectral Representations from Hyperspectral Earth Observation Data via Adaptive Masked Autoencoders

Tanjim Bin Faruk,Abdul Matin,Shrideep Pallickara,Sangmi Lee Pallickara

Main category: cs.CV

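TL;DR: TerraMAE is a new framework for hyperspectral satellite imagery that uses self-supervised learning to improve image reconstruction and downstream geospatial task performance.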

Details

Motivation: Hyperspectral imaging offers a fine-grained view of Earth, but because of the complexity of its 200+ bands, existing self-supervised methods struggle to exploit its spatial-spectral correlations. Method: TerraMAE uses an adaptive channel grouping strategy and an enhanced reconstruction loss that combines spatial and spectral quality metrics. Result: TerraMAE demonstrates its effectiveness in high-fidelity image reconstruction and performs strongly on crop identification, land cover classification, and soil texture prediction. Conclusion: TerraMAE is an effective hyperspectral image encoding framework that learns highly representative spatial-spectral embeddings suited to diverse geospatial analyses. Abstract: Hyperspectral satellite imagery offers sub-30 m views of Earth in hundreds of contiguous spectral bands, enabling fine-grained mapping of soils, crops, and land cover. While self-supervised Masked Autoencoders excel on RGB and low-band multispectral data, they struggle to exploit the intricate spatial-spectral correlations in 200+ band hyperspectral images. We introduce TerraMAE, a novel HSI encoding framework specifically designed to learn highly representative spatial-spectral embeddings for diverse geospatial analyses. TerraMAE features an adaptive channel grouping strategy, based on statistical reflectance properties to capture spectral similarities, and an enhanced reconstruction loss function that incorporates spatial and spectral quality metrics. We demonstrate TerraMAE's effectiveness through superior spatial-spectral information preservation in high-fidelity image reconstruction. Furthermore, we validate its practical utility and the quality of its learned representations through strong performance on three key downstream geospatial tasks: crop identification, land cover classification, and soil texture prediction.

[160] DocRefine: An Intelligent Framework for Scientific Document Understanding and Content Optimization based on Multimodal Large Model Agents

Kun Qian,Wenjie Li,Tianyu Sun,Wenhong Wang,Wenhan Luo

Main category: cs.CV

TL;DR: DocRefine is an innovative framework for intelligent understanding and refinement of scientific PDF documents, using a multi-agent system and advanced LVLMs to achieve high semantic accuracy and visual fidelity, outperforming existing methods in various tasks.

Details

Motivation: The exponential growth of scientific literature in PDF format necessitates advanced tools for efficient and accurate document understanding, summarization, and content optimization. Traditional methods and direct applications of LLMs and LVLMs are inadequate for complex layouts and editing tasks. Method: The paper introduces DocRefine, a framework utilizing a multi-agent system with six specialized agents: Layout & Structure Analysis, Multimodal Content Understanding, Instruction Decomposition, Content Refinement, Summarization & Generation, and Fidelity & Consistency Verification. It employs advanced LVLMs like GPT-4o in a closed-loop feedback architecture. Result: Evaluated on the DocEditBench dataset, DocRefine outperformed state-of-the-art baselines with scores of 86.7% for Semantic Consistency Score, 93.9% for Layout Fidelity Index, and 85.0% for Instruction Adherence Rate. Conclusion: DocRefine represents a significant advancement in automated scientific document processing by effectively handling complex multimodal document editing, preserving semantic integrity, and maintaining visual consistency. Abstract: The exponential growth of scientific literature in PDF format necessitates advanced tools for efficient and accurate document understanding, summarization, and content optimization. Traditional methods fall short in handling complex layouts and multimodal content, while direct application of Large Language Models (LLMs) and Vision-Language Large Models (LVLMs) lacks precision and control for intricate editing tasks. This paper introduces DocRefine, an innovative framework designed for intelligent understanding, content refinement, and automated summarization of scientific PDF documents, driven by natural language instructions. 
DocRefine leverages the power of advanced LVLMs (e.g., GPT-4o) by orchestrating a sophisticated multi-agent system comprising six specialized and collaborative agents: Layout & Structure Analysis, Multimodal Content Understanding, Instruction Decomposition, Content Refinement, Summarization & Generation, and Fidelity & Consistency Verification. This closed-loop feedback architecture ensures high semantic accuracy and visual fidelity. Evaluated on the comprehensive DocEditBench dataset, DocRefine consistently outperforms state-of-the-art baselines across various tasks, achieving overall scores of 86.7% for Semantic Consistency Score (SCS), 93.9% for Layout Fidelity Index (LFI), and 85.0% for Instruction Adherence Rate (IAR). These results demonstrate DocRefine's superior capability in handling complex multimodal document editing, preserving semantic integrity, and maintaining visual consistency, marking a significant advancement in automated scientific document processing.

[161] MV-CoRe: Multimodal Visual-Conceptual Reasoning for Complex Visual Question Answering

Jingwei Peng,Jiehao Chen,Mateo Alejandro Rojas,Meilin Zhang

Main category: cs.CV

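TL;DR: This paper proposes MV-CoRe, a new model for complex visual question answering that combines global embeddings from large vision and language models with fine-grained visual features, outperforming existing models.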

Details

Motivation: Existing large vision-language models face significant challenges on Complex VQA tasks that require sophisticated multimodal reasoning and external knowledge integration, and they are often limited by their reliance on high-level global features. Method: Proposes MV-CoRe, which combines global embeddings from pre-trained vision and language large models with fine-grained, semantic-aware visual features (such as object detection features and scene graph representations) for deep multimodal fusion. Result: MV-CoRe reaches 77.5% overall accuracy on GQA and outperforms existing large vision-language models on multiple Complex VQA benchmarks. Conclusion: By fusing multimodal information, MV-CoRe performs strongly on complex visual question answering, demonstrating its capability for deep visual and conceptual understanding. Abstract: Complex Visual Question Answering (Complex VQA) tasks, which demand sophisticated multi-modal reasoning and external knowledge integration, present significant challenges for existing large vision-language models (LVLMs), which are often limited by their reliance on high-level global features. To address this, we propose MV-CoRe (Multimodal Visual-Conceptual Reasoning), a novel model designed to enhance Complex VQA performance through the deep fusion of diverse visual and linguistic information. MV-CoRe meticulously integrates global embeddings from pre-trained Vision Large Models (VLMs) and Language Large Models (LLMs) with fine-grained semantic-aware visual features, including object detection characteristics and scene graph representations. An innovative Multimodal Fusion Transformer then processes and deeply integrates these diverse feature sets, enabling rich cross-modal attention and facilitating complex reasoning. We evaluate MV-CoRe on challenging Complex VQA benchmarks, including GQA, A-OKVQA, and OKVQA, after training on VQAv2. Our experimental results demonstrate that MV-CoRe consistently outperforms established LVLM baselines, achieving an overall accuracy of 77.5% on GQA. Ablation studies confirm the critical contribution of both object and scene graph features, and human evaluations further validate MV-CoRe's superior factual correctness and reasoning depth, underscoring its robust capabilities for deep visual and conceptual understanding.

[162] Large Language Model Evaluated Stand-alone Attention-Assisted Graph Neural Network with Spatial and Structural Information Interaction for Precise Endoscopic Image Segmentation

Juntong Fan,Shuyi Fan,Debesh Jha,Changsheng Fang,Tieyong Zeng,Hengyong Yu,Dayang Wang

Main category: cs.CV

TL;DR: This paper proposes FOCUS-Med, a novel method for endoscopic image segmentation of polyps that addresses challenges like low contrast and indistinct boundaries. It combines graph-based modeling, attention mechanisms, and multi-scale fusion to achieve excellent performance and introduces the use of a Large Language Model for qualitative evaluation.

Details

Motivation: Accurate endoscopic image segmentation of polyps is crucial for early colorectal cancer detection, but challenges such as low contrast, specular highlights, and indistinct boundaries make this task difficult. Method: FOCUS-Med integrates a Dual Graph Convolutional Network (Dual-GCN) to capture spatial and structural dependencies, employs a location-fused stand-alone self-attention for global context integration, and uses a trainable weighted fast normalized fusion strategy for multi-scale aggregation. Additionally, a Large Language Model (LLM) is introduced for qualitative evaluations. Result: Extensive experiments on public benchmarks show that FOCUS-Med achieves state-of-the-art performance across five key metrics. Conclusion: The proposed FOCUS-Med model achieves state-of-the-art performance in endoscopic image segmentation for polyps, demonstrating its clinical potential for AI-assisted colonoscopy. Abstract: Accurate endoscopic image segmentation on the polyps is critical for early colorectal cancer detection. However, this task remains challenging due to low contrast with surrounding mucosa, specular highlights, and indistinct boundaries. To address these challenges, we propose FOCUS-Med, which stands for Fusion of spatial and structural graph with attentional context-aware polyp segmentation in endoscopic medical imaging. FOCUS-Med integrates a Dual Graph Convolutional Network (Dual-GCN) module to capture contextual spatial and topological structural dependencies. This graph-based representation enables the model to better distinguish polyps from background tissues by leveraging topological cues and spatial connectivity, which are often obscured in raw image intensities. It enhances the model's ability to preserve boundaries and delineate complex shapes typical of polyps. In addition, a location-fused stand-alone self-attention is employed to strengthen global context integration. 
To bridge the semantic gap between encoder-decoder layers, we incorporate a trainable weighted fast normalized fusion strategy for efficient multi-scale aggregation. Notably, we are the first to introduce the use of a Large Language Model (LLM) to provide detailed qualitative evaluations of segmentation quality. Extensive experiments on public benchmarks demonstrate that FOCUS-Med achieves state-of-the-art performance across five key metrics, underscoring its effectiveness and clinical potential for AI-assisted colonoscopy.
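The abstract names a trainable weighted fast normalized fusion strategy without defining it. The sketch below assumes it follows the fast normalized fusion used in BiFPN (EfficientDet), where each learned weight is clipped to be non-negative and divided by the sum of all clipped weights plus a small epsilon. The function name, the list-based feature layout, and the epsilon default are illustrative assumptions.

```python
def fast_normalized_fusion(features, raw_weights, eps=1e-4):
    """Fuse same-shaped feature vectors with normalized non-negative weights.

    features:    list of N feature vectors (lists of floats), all the same length
    raw_weights: N learned scalars; negatives are clipped to zero (ReLU)
    Assumption: BiFPN-style formulation, not confirmed by the abstract.
    """
    clipped = [max(0.0, w) for w in raw_weights]
    total = sum(clipped) + eps  # eps guards against division by zero
    length = len(features[0])
    return [
        sum(w * f[k] for w, f in zip(clipped, features)) / total
        for k in range(length)
    ]


# Two scales with equal weights are (almost exactly) averaged.
fused = fast_normalized_fusion([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0])
```

Compared with a softmax over the weights, this form avoids exponentials while still keeping each normalized weight between 0 and 1.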

[163] TeSO: Representing and Compressing 3D Point Cloud Scenes with Textured Surfel Octree

Yueyu Hu,Ran Gong,Tingyu Fan,Yao Wang

Main category: cs.CV

TL;DR: This paper introduces the Textured Surfel Octree (TeSO), a new 3D representation that provides high-quality rendering and efficient compression, outperforming existing methods like point clouds and 3D Gaussians in terms of rendering quality at lower bit-rates.

Details

Motivation: 3D visual content streaming is a key technology for emerging 3D telepresence and AR/VR applications. One fundamental element underlying the technology is a versatile 3D representation that is capable of producing high-quality renders and can be efficiently compressed at the same time. Existing 3D representations like point clouds, meshes and 3D Gaussians each have limitations in terms of rendering quality, surface definition, and compressibility. Method: We present the Textured Surfel Octree (TeSO), a novel 3D representation that is built from point clouds but addresses the aforementioned limitations. It represents a 3D scene as cube-bounded surfels organized on an octree, where each surfel is further associated with a texture patch. We further propose a compression scheme to encode the geometry and texture efficiently, leveraging the octree structure. Result: It reduces the number of primitives required to represent the 3D scene, and yet retains the high-frequency texture details through the texture map attached to each surfel. Conclusion: The proposed textured surfel octree combined with the compression scheme achieves higher rendering quality at lower bit-rates compared to multiple point cloud and 3D Gaussian-based baselines. Abstract: 3D visual content streaming is a key technology for emerging 3D telepresence and AR/VR applications. One fundamental element underlying the technology is a versatile 3D representation that is capable of producing high-quality renders and can be efficiently compressed at the same time. Existing 3D representations like point clouds, meshes and 3D Gaussians each have limitations in terms of rendering quality, surface definition, and compressibility. In this paper, we present the Textured Surfel Octree (TeSO), a novel 3D representation that is built from point clouds but addresses the aforementioned limitations. 
It represents a 3D scene as cube-bounded surfels organized on an octree, where each surfel is further associated with a texture patch. By approximating a smooth surface with a large surfel at a coarser level of the octree, it reduces the number of primitives required to represent the 3D scene, and yet retains the high-frequency texture details through the texture map attached to each surfel. We further propose a compression scheme to encode the geometry and texture efficiently, leveraging the octree structure. The proposed textured surfel octree combined with the compression scheme achieves higher rendering quality at lower bit-rates compared to multiple point cloud and 3D Gaussian-based baselines.

[164] ForeSight: Multi-View Streaming Joint Object Detection and Trajectory Forecasting

Sandro Papais,Letian Wang,Brian Cheong,Steven L. Waslander

Main category: cs.CV

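TL;DR: ForeSight is a novel joint detection and forecasting framework that uses multi-task streaming and bidirectional learning to achieve state-of-the-art vision-based 3D perception for autonomous vehicles.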

Details

Motivation: Traditional approaches that treat detection and forecasting as separate sequential tasks are limited because they cannot fully exploit temporal cues. Method: ForeSight uses a multi-task streaming and bidirectional learning approach that lets detection and forecasting share query memory and propagate information seamlessly. Result: Experiments show that ForeSight reaches an EPA of 54.9% on the nuScenes dataset, surpassing previous methods by 9.3%, while also attaining the best mAP and minADE among multi-view detection and forecasting models. Conclusion: ForeSight is a novel joint detection and forecasting framework for vision-based 3D perception in autonomous vehicles that achieves state-of-the-art performance and surpasses previous methods. Abstract: We introduce ForeSight, a novel joint detection and forecasting framework for vision-based 3D perception in autonomous vehicles. Traditional approaches treat detection and forecasting as separate sequential tasks, limiting their ability to leverage temporal cues. ForeSight addresses this limitation with a multi-task streaming and bidirectional learning approach, allowing detection and forecasting to share query memory and propagate information seamlessly. The forecast-aware detection transformer enhances spatial reasoning by integrating trajectory predictions from a multiple hypothesis forecast memory queue, while the streaming forecast transformer improves temporal consistency using past forecasts and refined detections. Unlike tracking-based methods, ForeSight eliminates the need for explicit object association, reducing error propagation with a tracking-free model that efficiently scales across multi-frame sequences. Experiments on the nuScenes dataset show that ForeSight achieves state-of-the-art performance, with an EPA of 54.9% that surpasses previous methods by 9.3%, while also attaining the best mAP and minADE among multi-view detection and forecasting models.

[165] Communication-Efficient Multi-Agent 3D Detection via Hybrid Collaboration

Yue Hu,Juntong Peng,Yunqiao Yang,Siheng Chen

Main category: cs.CV

TL;DR: HyComm is a communication-efficient LiDAR-based collaborative 3D detection system that adaptively integrates perceptual outputs and raw observations to achieve optimal perceptual information, adaptability, and a superior performance-bandwidth trade-off.

Details

Motivation: Collaborative 3D detection inherently results in a fundamental trade-off between detection performance and communication bandwidth. HyComm aims to tackle this bottleneck issue by proposing a novel hybrid collaboration approach. Method: HyComm integrates two types of communication messages: perceptual outputs and raw observations, prioritizing the most critical data within each type for optimal perceptual information and adaptability. Result: HyComm consistently outperforms previous methods on real-world and simulation datasets, achieving a superior performance-bandwidth trade-off and a more than 2,006× lower communication volume while still outperforming Where2comm on DAIR-V2X in terms of AP50. Conclusion: HyComm facilitates adaptable compression rates and uses standardized data formats, ensuring independence from specific detection models and fostering adaptability across different agent configurations. Abstract: Collaborative 3D detection can substantially boost detection performance by allowing agents to exchange complementary information. It inherently results in a fundamental trade-off between detection performance and communication bandwidth. To tackle this bottleneck issue, we propose a novel hybrid collaboration that adaptively integrates two types of communication messages: perceptual outputs, which are compact, and raw observations, which offer richer information. This approach focuses on two key aspects: i) integrating complementary information from two message types and ii) prioritizing the most critical data within each type. By adaptively selecting the most critical set of messages, it ensures optimal perceptual information and adaptability, effectively meeting the demands of diverse communication scenarios. Building on this hybrid collaboration, we present HyComm, a communication-efficient LiDAR-based collaborative 3D detection system. 
HyComm boasts two main benefits: i) it facilitates adaptable compression rates for messages, addressing various communication requirements, and ii) it uses standardized data formats for messages. This ensures they are independent of specific detection models, fostering adaptability across different agent configurations. To evaluate HyComm, we conduct experiments on both real-world and simulation datasets: DAIR-V2X and OPV2V. HyComm consistently outperforms previous methods and achieves a superior performance-bandwidth trade-off regardless of whether agents use the same or varied detection models. It achieves a more than 2,006× lower communication volume and still outperforms Where2comm on DAIR-V2X in terms of AP50. The related code will be released.

[166] AugLift: Boosting Generalization in Lifting-based 3D Human Pose Estimation

Nikolai Warner,Wenjin Zhang,Irfan Essa,Apaar Sadhwani

Main category: cs.CV

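TL;DR: AugLift substantially improves the generalization of 3D human pose estimation models through a simple and effective input augmentation strategy.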

Details

Motivation: Lifting-based methods generalize poorly to new datasets and real-world settings, so their performance needs improvement. Method: Enriches the 2D keypoint input with keypoint detection confidence and depth estimates, using off-the-shelf pre-trained models to extract the additional signals. Result: Experiments on four datasets show an average 10.1% improvement in cross-dataset performance and a 4.0% improvement in in-distribution performance. Conclusion: As a modular add-on, AugLift effectively improves the generalization of lifting-based 3D human pose estimation models without additional data or sensors. Abstract: Lifting-based methods for 3D Human Pose Estimation (HPE), which predict 3D poses from detected 2D keypoints, often generalize poorly to new datasets and real-world settings. To address this, we propose AugLift, a simple yet effective reformulation of the standard lifting pipeline that significantly improves generalization performance without requiring additional data collection or sensors. AugLift sparsely enriches the standard input, the 2D keypoint coordinates (x, y), by augmenting it with a keypoint detection confidence score c and a corresponding depth estimate d. These additional signals are computed from the image using off-the-shelf, pre-trained models (e.g., for monocular depth estimation), thereby inheriting their strong generalization capabilities. Importantly, AugLift serves as a modular add-on and can be readily integrated into existing lifting architectures. Our extensive experiments across four datasets demonstrate that AugLift boosts cross-dataset performance on unseen datasets by an average of 10.1%, while also improving in-distribution performance by 4.0%. These gains are consistent across various lifting architectures, highlighting the robustness of our method. Our analysis suggests that these sparse, keypoint-aligned cues provide robust frame-level context, offering a practical way to significantly improve the generalization of any lifting-based pose estimation model. Code will be made publicly available.
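The input augmentation described in the abstract can be sketched as follows: each 2D keypoint (x, y) is extended with its detection confidence c and a depth estimate d. How the depth is read out is not specified in the abstract; this sketch assumes a nearest-pixel lookup in a per-image depth map, clamped to the image bounds. The function name `augment_keypoints` and the data layout are hypothetical.

```python
def augment_keypoints(keypoints, confidences, depth_map):
    """Turn (x, y) keypoints into (x, y, c, d) tuples.

    keypoints:   list of (x, y) pixel coordinates from a 2D detector
    confidences: one detection confidence per keypoint
    depth_map:   2D list indexed as depth_map[row][col], e.g. from a
                 monocular depth model
    Assumption: depth is sampled at the nearest pixel, clamped to the image.
    """
    height = len(depth_map)
    width = len(depth_map[0])
    augmented = []
    for (x, y), c in zip(keypoints, confidences):
        col = min(max(int(round(x)), 0), width - 1)
        row = min(max(int(round(y)), 0), height - 1)
        augmented.append((x, y, c, depth_map[row][col]))
    return augmented


depth_map = [[1.0, 2.0], [3.0, 4.0]]
lifted_input = augment_keypoints([(0.2, 0.9)], [0.8], depth_map)
```

Because only the input vector per keypoint grows from two values to four, a lifting network would in principle need just a wider first layer to consume it, which fits the abstract's description of a modular add-on.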

[167] Perceptual Evaluation of GANs and Diffusion Models for Generating X-rays

Gregory Schuit,Denis Parra,Cecilia Besa

Main category: cs.CV

TL;DR: This study evaluates the effectiveness of GANs and DMs in synthesizing chest X-rays for medical imaging. While DMs produce more realistic images overall, GANs show better accuracy for specific conditions, highlighting their complementary strengths and the need for further refinement to improve their reliability in augmenting AI diagnostic training datasets.

Details

Motivation: Generative image models offer a potential solution to data scarcity in medical imaging, particularly for low-prevalence anomalies. However, concerns about the fidelity and clinical utility of synthetic images persist, as poor generation quality can affect model generalizability and trust. Method: The study evaluated state-of-the-art generative models (GANs and DMs) by synthesizing chest X-rays conditioned on four abnormalities. A benchmark with real images from the MIMIC-CXR dataset and synthetic images from both models was used, alongside a reader study involving three radiologists to assess image authenticity and abnormality consistency. Result: Diffusion Models were found to generate more visually realistic images overall, but Generative Adversarial Networks showed better accuracy for certain conditions, such as absence of Enlarged Cardiac Silhouette. The study also identified visual cues used by radiologists to detect synthetic images, shedding light on perceptual gaps in current models. Conclusion: The study concludes that while Diffusion Models generate more visually realistic images overall, Generative Adversarial Networks can offer better accuracy for specific conditions. The findings highlight the complementary strengths of both models and the need for further refinement to ensure reliable augmentation of training datasets for AI diagnostic systems. Abstract: Generative image models have achieved remarkable progress in both natural and medical imaging. In the medical context, these techniques offer a potential solution to data scarcity, especially for low-prevalence anomalies that impair the performance of AI-driven diagnostic and segmentation tools. However, questions remain regarding the fidelity and clinical utility of synthetic images, since poor generation quality can undermine model generalizability and trust. 
In this study, we evaluate the effectiveness of state-of-the-art generative models, namely Generative Adversarial Networks (GANs) and Diffusion Models (DMs), for synthesizing chest X-rays conditioned on four abnormalities: Atelectasis (AT), Lung Opacity (LO), Pleural Effusion (PE), and Enlarged Cardiac Silhouette (ECS). Using a benchmark composed of real images from the MIMIC-CXR dataset and synthetic images from both GANs and DMs, we conducted a reader study with three radiologists of varied experience. Participants were asked to distinguish real from synthetic images and assess the consistency between visual features and the target abnormality. Our results show that while DMs generate more visually realistic images overall, GANs can report better accuracy for specific conditions, such as absence of ECS. We further identify visual cues radiologists use to detect synthetic images, offering insights into the perceptual gaps in current models. These findings underscore the complementary strengths of GANs and DMs and point to the need for further refinement to ensure generative models can reliably augment training datasets for AI diagnostic systems.

[168] CMAMRNet: A Contextual Mask-Aware Network Enhancing Mural Restoration Through Comprehensive Mask Guidance

Yingtie Lei,Fanghai Yi,Yihang Dong,Weihuang Liu,Xiaofeng Zhang,Zimeng Li,Chi-Man Pun,Xuhang Chen

Main category: cs.CV

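TL;DR: Proposes CMAMRNet, a new mural restoration network whose MAUDS and CFA components address existing methods' shortcomings in maintaining consistent mask guidance and in feature extraction; experiments show it preserves mural structure and artistic detail well.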

Details

Motivation: Murals are important cultural heritage that face continuous damage from environmental and human factors. Existing learning-based methods struggle to maintain consistent mask guidance throughout the network, leading to insufficient focus on damaged regions and degraded restoration quality. Method: Proposes CMAMRNet, built around two key components, MAUDS and CFA, which addresses the limitations of existing mural restoration methods through comprehensive mask guidance and multi-scale feature extraction. Result: Experiments on benchmark datasets demonstrate CMAMRNet's effectiveness for mural restoration, where it outperforms existing methods. Conclusion: CMAMRNet improves mural restoration quality; experiments on benchmark datasets show that it outperforms existing methods and effectively preserves the structural integrity and artistic details of murals. Abstract: Murals, as invaluable cultural artifacts, face continuous deterioration from environmental factors and human activities. Digital restoration of murals faces unique challenges due to their complex degradation patterns and the critical need to preserve artistic authenticity. Existing learning-based methods struggle with maintaining consistent mask guidance throughout their networks, leading to insufficient focus on damaged regions and compromised restoration quality. We propose CMAMRNet, a Contextual Mask-Aware Mural Restoration Network that addresses these limitations through comprehensive mask guidance and multi-scale feature extraction. Our framework introduces two key components: (1) the Mask-Aware Up/Down-Sampler (MAUDS), which ensures consistent mask sensitivity across resolution scales through dedicated channel-wise feature selection and mask-guided feature fusion; and (2) the Co-Feature Aggregator (CFA), operating at both the highest and lowest resolutions to extract complementary features for capturing fine textures and global structures in degraded regions. Experimental results on benchmark datasets demonstrate that CMAMRNet outperforms state-of-the-art methods, effectively preserving both structural integrity and artistic details in restored murals. The code is available at https://github.com/CXH-Research/CMAMRNet.

[169] Dynamic Pattern Alignment Learning for Pretraining Lightweight Human-Centric Vision Models

Xuanhan Wang,Huimin Deng,Ke Liu,Jun Wang,Lianli Gao,Jingkuan Song

Main category: cs.CV

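TL;DR: This paper proposes DPAL, which uses a dynamic decoder and multi-level alignment objectives to give lightweight models generalization comparable to that of large HVMs.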

Details

Motivation: Large HVMs are hard to deploy because of their complex architectures and limited access to pretraining data, so more practical lightweight models are needed. Method: The paper proposes DPAL, a knowledge-distillation-based framework comprising a dynamic pattern decoder and three levels of alignment objectives. Result: DPAL-ViT/Ti, with only 5M parameters, performs strongly across 15 datasets and outperforms other distillation methods by a large margin. Conclusion: DPAL effectively improves the generalization of lightweight HVMs, bringing their performance on multiple tasks close to or beyond that of large HVMs. Abstract: Human-centric vision models (HVMs) have achieved remarkable generalization due to large-scale pretraining on massive person images. However, their dependence on large neural architectures and the restricted accessibility of pretraining data significantly limits their practicality in real-world applications. To address this limitation, we propose Dynamic Pattern Alignment Learning (DPAL), a novel distillation-based pretraining framework that efficiently trains lightweight HVMs to acquire strong generalization from large HVMs. In particular, human-centric visual perception is highly dependent on three typical visual patterns: the global identity pattern, the local shape pattern, and the multi-person interaction pattern. To achieve generalizable lightweight HVMs, we first design a dynamic pattern decoder (D-PaDe), acting as a dynamic Mixture of Experts (MoE) model. It incorporates three specialized experts dedicated to adaptively extracting typical visual patterns, conditioned on both the input image and pattern queries. We then present three levels of alignment objectives, which aim to minimize the generalization gap between lightweight HVMs and large HVMs at the global image level, local pixel level, and instance relation level. With these two deliberate designs, DPAL effectively guides the lightweight model to learn all typical human visual patterns from large HVMs, which can generalize to various human-centric vision tasks. Extensive experiments conducted on 15 challenging datasets demonstrate the effectiveness of DPAL.
Remarkably, when employing PATH-B as the teacher, DPAL-ViT/Ti (5M parameters) achieves surprising generalizability similar to existing large HVMs such as PATH-B (84M) and Sapiens-L (307M), and outperforms previous distillation-based pretraining methods including Proteus-ViT/Ti (5M) and TinyMiM-ViT/Ti (5M) by a large margin.

[170] Intention-Aware Diffusion Model for Pedestrian Trajectory Prediction

Yu Liu,Zhijie Liu,Xiao Ren,You-Fu Li,He Kong

Main category: cs.CV

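TL;DR: This paper proposes a diffusion model for pedestrian trajectory prediction that models both short-term and long-term motion intent, achieving strong results on multiple benchmarks.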

Details

Motivation: Existing diffusion-based models lack explicit semantic modelling of pedestrian intent, which leads to misinterpreted behaviors and reduced prediction accuracy. Method: The paper proposes a diffusion-based pedestrian trajectory prediction framework that models both short-term and long-term motion intent. Short-term intent uses a residual polar representation; long-term intent comes from a learnable, token-based endpoint predictor that generates multiple candidate goals with associated probabilities. The diffusion process is enhanced with adaptive guidance and a residual noise predictor. Result: The framework performs well on the ETH, UCY, and SDD benchmarks, demonstrating its effectiveness and competitiveness. Conclusion: The framework achieves results competitive with state-of-the-art methods on the ETH, UCY, and SDD benchmarks, demonstrating its effectiveness for pedestrian trajectory prediction. Abstract: Predicting pedestrian motion trajectories is critical for the path planning and motion control of autonomous vehicles. Recent diffusion-based models have shown promising results in capturing the inherent stochasticity of pedestrian behavior for trajectory prediction. However, the absence of explicit semantic modelling of pedestrian intent in many diffusion-based methods may result in misinterpreted behaviors and reduced prediction accuracy. To address the above challenges, we propose a diffusion-based pedestrian trajectory prediction framework that incorporates both short-term and long-term motion intentions. Short-term intent is modelled using a residual polar representation, which decouples direction and magnitude to capture fine-grained local motion patterns. Long-term intent is estimated through a learnable, token-based endpoint predictor that generates multiple candidate goals with associated probabilities, enabling multimodal and context-aware intention modelling. Furthermore, we enhance the diffusion process by incorporating adaptive guidance and a residual noise predictor that dynamically refines denoising accuracy. The proposed framework is evaluated on the widely used ETH, UCY, and SDD benchmarks, demonstrating competitive results against state-of-the-art methods.
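
As a rough illustration of what a representation that "decouples direction and magnitude" can look like, the sketch below writes each step of a 2D trajectory in polar form and then takes the change from the previous step as the residual. The function name `residual_polar` and this particular residual definition are assumptions made for illustration; the paper's actual formulation may differ.

```python
import math

def residual_polar(points):
    """Convert a 2D trajectory into polar steps and step-to-step residuals.

    Hypothetical helper: each displacement becomes (magnitude, heading), and
    each residual is the change in magnitude and heading from the prior step.
    """
    steps = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx, dy = x1 - x0, y1 - y0
        steps.append((math.hypot(dx, dy), math.atan2(dy, dx)))

    residuals = []
    for (m0, a0), (m1, a1) in zip(steps, steps[1:]):
        # Wrap the heading change into [-pi, pi] so small turns stay small.
        d_angle = math.atan2(math.sin(a1 - a0), math.cos(a1 - a0))
        residuals.append((m1 - m0, d_angle))
    return steps, residuals

# Walk straight along x for two steps, then turn left by 90 degrees.
trajectory = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 1.0)]
steps, residuals = residual_polar(trajectory)
```

In this toy trajectory the speed never changes, so the magnitude residuals are zero and only the final heading residual is non-zero, which is the kind of separation between "how fast" and "which way" that the abstract describes.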

[171] SketchAnimator: Animate Sketch via Motion Customization of Text-to-Video Diffusion Models

Ruolin Yang,Da Li,Honggang Zhang,Yi-Zhe Song

Main category: cs.CV

TL;DR: SketchAnimator automates sketch animation by integrating appearance and motion from a reference video into a pre-trained model, producing dynamic sketch videos with minimal user input.

Details

Motivation: The motivation is to simplify the time-consuming and skill-demanding process of animating sketches, making it accessible to amateurs by automating the addition of creative motion to static sketches. Method: The method involves three stages: Appearance Learning, Motion Learning, and Video Prior Distillation. LoRA is used to integrate sketch appearance and motion dynamics into a pre-trained T2V model, and Score Distillation Sampling (SDS) updates Bezier curve parameters based on motion information. Result: The result is a sketch video that effectively combines the original sketch's appearance with the dynamic movements of a reference video, showcasing the model's capability in one-shot motion customization. Conclusion: The paper concludes that the proposed SketchAnimator model successfully generates sketch videos that retain the original appearance while mirroring dynamic movements from a reference video, even under one-shot motion customization. Abstract: Sketching is a uniquely human tool for expressing ideas and creativity. The animation of sketches infuses life into these static drawings, opening a new dimension for designers. Animating sketches is a time-consuming process that demands professional skills and extensive experience, often proving daunting for amateurs. In this paper, we propose a novel sketch animation model SketchAnimator, which enables adding creative motion to a given sketch, like "a jumping car". Namely, given an input sketch and a reference video, we divide the sketch animation into three stages: Appearance Learning, Motion Learning and Video Prior Distillation. In stages 1 and 2, we utilize LoRA to integrate sketch appearance information and motion dynamics from the reference video into the pre-trained T2V model. In the third stage, we utilize Score Distillation Sampling (SDS) to update the parameters of the Bezier curves in each sketch frame according to the acquired motion information.
Consequently, our model produces a sketch video that not only retains the original appearance of the sketch but also mirrors the dynamic movements of the reference video. We compare our method with alternative approaches and demonstrate that it generates the desired sketch video under the challenge of one-shot motion customization.

[172] CoopDiff: Anticipating 3D Human-object Interactions via Contact-consistent Decoupled Diffusion

Xiaotong Lin,Tianming Liang,Jian-Fang Hu,Kun-Yu Lin,Yulei Kang,Chunwei Tian,Jianhuang Lai,Wei-Shi Zheng

Main category: cs.CV

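TL;DR: This paper proposes CoopDiff, a new contact-consistent decoupled diffusion model that models human and object motion separately and uses contact points as a consistency constraint, achieving better 3D human-object interaction anticipation.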

Details

Motivation: Humans and objects have different physical properties and therefore exhibit different motion patterns. Existing methods usually ignore this difference and try to model the dynamics of both with a single model, which makes it hard to capture their complex interactions accurately. Method: The paper proposes CoopDiff, a contact-consistent decoupled diffusion framework that uses two separate branches to model human and object motion and links them through shared contact points and a consistency constraint. It also designs a human-driven interaction module to guide object motion modeling. Result: Extensive experiments on the BEHAVE and Human-object Interaction datasets show that CoopDiff outperforms existing state-of-the-art methods on 3D human-object interaction anticipation. Conclusion: CoopDiff is a contact-consistent decoupled diffusion framework that, by modeling human and object motion separately and constraining them through contact points, effectively improves 3D human-object interaction anticipation. Abstract: 3D human-object interaction (HOI) anticipation aims to predict the future motion of humans and their manipulated objects, conditioned on the historical context. Generally, the articulated humans and rigid objects exhibit different motion patterns, due to their distinct intrinsic physical properties. However, this distinction is ignored by most of the existing works, which intend to capture the dynamics of both humans and objects within a single prediction model. In this work, we propose a novel contact-consistent decoupled diffusion framework CoopDiff, which employs two distinct branches to decouple human and object motion modeling, with the human-object contact points as shared anchors to bridge the motion generation across branches. The human dynamics branch aims to predict highly structured human motion, while the object dynamics branch focuses on the object motion with rigid translations and rotations. These two branches are bridged by a series of shared contact points with a consistency constraint for coherent human-object motion prediction. To further enhance human-object consistency and prediction reliability, we propose a human-driven interaction module to guide object motion modeling. Extensive experiments on the BEHAVE and Human-object Interaction datasets demonstrate that our CoopDiff outperforms state-of-the-art methods.

[173] Lightweight Multi-Scale Feature Extraction with Fully Connected LMF Layer for Salient Object Detection

Yunpeng Shi,Lei Chen,Xiaolu Shen,Yanju Guo

Main category: cs.CV

TL;DR: This paper introduces LMFNet, a lightweight network with a novel LMF layer for multi-scale feature extraction, achieving efficient and competitive performance in salient object detection.

Details

Motivation: Multi-scale feature extraction is crucial in tasks like salient object detection, but achieving this in lightweight networks is challenging due to the trade-off between efficiency and performance. Method: This paper proposes a novel lightweight multi-scale feature extraction layer called the LMF layer, which utilizes depthwise separable dilated convolutions in a fully connected structure. Multiple LMF layers are integrated to develop LMFNet, a lightweight network designed for salient object detection. Result: LMFNet achieves state-of-the-art or comparable results on five benchmark datasets with only 0.81M parameters, outperforming many traditional and lightweight models in terms of both efficiency and accuracy. Conclusion: LMFNet addresses the challenge of multi-scale feature extraction in lightweight networks and demonstrates potential for broader applications in image processing tasks. Abstract: In the domain of computer vision, multi-scale feature extraction is vital for tasks such as salient object detection. However, achieving this capability in lightweight networks remains challenging due to the trade-off between efficiency and performance. This paper proposes a novel lightweight multi-scale feature extraction layer, termed the LMF layer, which employs depthwise separable dilated convolutions in a fully connected structure. By integrating multiple LMF layers, we develop LMFNet, a lightweight network tailored for salient object detection. Our approach significantly reduces the number of parameters while maintaining competitive performance. Here, we show that LMFNet achieves state-of-the-art or comparable results on five benchmark datasets with only 0.81M parameters, outperforming several traditional and lightweight models in terms of both efficiency and accuracy. Our work not only addresses the challenge of multi-scale learning in lightweight networks but also demonstrates the potential for broader applications in image processing tasks. 
The related code files are available at https://github.com/Shi-Yun-peng/LMFNet
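
To make the parameter savings of depthwise separable dilated convolutions concrete, here is a small arithmetic sketch. The channel width (64) and kernel size (3) are arbitrary illustrative assumptions, not values taken from LMFNet, and the helper names are hypothetical.

```python
def conv_params(c_in, c_out, k, bias=False):
    """Number of weights in a standard k x k convolution."""
    return c_in * c_out * k * k + (c_out if bias else 0)

def depthwise_separable_params(c_in, c_out, k, bias=False):
    """Depthwise k x k conv (one filter per input channel) plus 1x1 pointwise conv."""
    depthwise = c_in * k * k + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise

def effective_kernel(k, dilation):
    """Width covered by a dilated kernel; dilation itself adds no weights."""
    return dilation * (k - 1) + 1

standard = conv_params(64, 64, 3)                   # 64 * 64 * 9
separable = depthwise_separable_params(64, 64, 3)   # 64 * 9 + 64 * 64
widths = [effective_kernel(3, d) for d in (1, 2, 4)]
```

Under these assumed sizes the separable layer needs roughly an eighth of the weights of the standard one, while larger dilation rates widen the receptive field at no extra parameter cost, which is one plausible reading of how a multi-scale layer can stay lightweight.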

[174] EventRR: Event Referential Reasoning for Referring Video Object Segmentation

Huihui Xu,Jiashi Lin,Haoyu Chen,Junjun He,Lei Zhu

Main category: cs.CV

TL;DR: This paper proposes EventRR, a framework for Referring Video Object Segmentation that captures the semantic structure of video-referring expressions through a Referential Event Graph and interpretable reasoning steps, achieving superior performance over existing methods.

Details

Motivation: Current RVOS methods treat referring expressions as unstructured sequences, ignoring their semantic structure. Video-referring expressions are more complex than image-referring expressions due to the inclusion of event attributes and temporal relations. This complexity necessitates a structured reasoning approach tailored for videos. Method: EventRR decouples RVOS into two parts: object summarization and referent reasoning. The summarization phase generates bottleneck tokens for each frame, which are aggregated at the video level to capture cross-modal temporal context. The reasoning phase extracts semantic event structures into a Referential Event Graph (REG), followed by Temporal Concept-Role Reasoning (TCRR) to compute referring scores through topological traversal of the REG. Result: Extensive experiments on four benchmark datasets show that EventRR quantitatively and qualitatively outperforms existing state-of-the-art RVOS methods. Conclusion: The proposed EventRR framework outperforms state-of-the-art RVOS methods by effectively capturing the semantic eventful structure of video-referring expressions and enabling interpretable reasoning through the Referential Event Graph (REG) and Temporal Concept-Role Reasoning (TCRR). Abstract: Referring Video Object Segmentation (RVOS) aims to segment out the object in a video referred by an expression. Current RVOS methods view referring expressions as unstructured sequences, neglecting their crucial semantic structure essential for referent reasoning. Besides, in contrast to image-referring expressions whose semantics focus only on object attributes and object-object relations, video-referring expressions also encompass event attributes and event-event temporal relations. This complexity challenges traditional structured reasoning image approaches. In this paper, we propose the Event Referential Reasoning (EventRR) framework. EventRR decouples RVOS into object summarization part and referent reasoning part. 
The summarization phase begins by summarizing each frame into a set of bottleneck tokens, which are then efficiently aggregated in the video-level summarization step to exchange the global cross-modal temporal context. For the reasoning part, EventRR extracts the semantic eventful structure of a video-referring expression into a highly expressive Referential Event Graph (REG), which is a single-rooted directed acyclic graph. Guided by topological traversal of the REG, we propose Temporal Concept-Role Reasoning (TCRR) to accumulate the referring score of each temporal query from the REG leaf nodes to the root node. Each reasoning step can be interpreted as a question-answer pair derived from the concept-role relations in the REG. Extensive experiments across four widely recognized benchmark datasets show that EventRR quantitatively and qualitatively outperforms state-of-the-art RVOS methods. Code is available at https://github.com/bio-mlhui/EventRR
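
The leaf-to-root accumulation over a single-rooted DAG can be sketched with Python's standard `graphlib` module. The graph, the node names, and the combination rule (a node's own score multiplied by its children's accumulated scores) are all placeholder assumptions; the abstract does not specify how TCRR combines scores.

```python
from graphlib import TopologicalSorter

def accumulate_scores(children, local_scores):
    """Propagate scores from the leaves to the root of a single-rooted DAG.

    `children` maps each node to the nodes it depends on. The combination
    rule used here (own score times child scores) is only an assumption.
    """
    scores = {}
    # static_order() yields a node only after all of its children.
    for node in TopologicalSorter(children).static_order():
        score = local_scores.get(node, 1.0)
        for child in children.get(node, ()):
            score *= scores[child]
        scores[node] = score
    return scores

# Hypothetical graph for an expression like "the dog that jumps after running".
graph = {"dog": ["jumps"], "jumps": ["running"], "running": []}
scores = accumulate_scores(graph, {"running": 0.9, "jumps": 0.8, "dog": 0.5})
```

Because every node is visited after its dependencies, each intermediate value in `scores` can be read as the answer to one sub-question, which mirrors the interpretability argument made in the abstract.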

[175] Similarity Matters: A Novel Depth-guided Network for Image Restoration and A New Dataset

Junyi He,Liuling Chen,Hongyang Zhou,Zhang xiaoxing,Xiaobin Zhu,Shengxiang Yu,Jingyan Qin,Xu-Cheng Yin

Main category: cs.CV

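TL;DR: This paper proposes a new Depth-Guided Network (DGN) for image restoration and builds a large-scale high-resolution dataset. By combining a depth estimation branch with an image restoration branch, it achieves state-of-the-art performance and generalizes well across several standard benchmarks.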

Details

Motivation: Existing image restoration methods usually neglect depth information, which causes attention distraction in shallow depth-of-field scenes and excessive enhancement of background content in deep depth-of-field settings. This paper aims to solve these problems by introducing depth information. Method: The paper proposes a Depth-Guided Network (DGN) with a depth estimation branch and an image restoration branch. The depth estimation branch provides structural guidance, while the image restoration branch performs the core restoration task and captures intra-object and inter-object similarity through progressive window-based self-attention and sparse non-local attention. Result: Experiments show that the method achieves state-of-the-art performance on several standard benchmarks and generalizes well to unseen plant images, demonstrating its effectiveness and robustness. Conclusion: By incorporating depth information, the proposed Depth-Guided Network (DGN) effectively addresses the neglect of depth in existing image restoration methods and has significant theoretical and practical value. Abstract: Image restoration has seen substantial progress in recent years. However, existing methods often neglect depth information, which hurts similarity matching, results in attention distractions in shallow depth-of-field (DoF) scenarios, and excessive enhancement of background content in deep DoF settings. To overcome these limitations, we propose a novel Depth-Guided Network (DGN) for image restoration, together with a novel large-scale high-resolution dataset. Specifically, the network consists of two interactive branches: a depth estimation branch that provides structural guidance, and an image restoration branch that performs the core restoration task. In addition, the image restoration branch exploits intra-object similarity through progressive window-based self-attention and captures inter-object similarity via sparse non-local attention. Through joint training, depth features contribute to improved restoration quality, while the enhanced visual features from the restoration branch in turn help refine depth estimation. Notably, we also introduce a new dataset for training and evaluation, consisting of 9,205 high-resolution images from 403 plant species, with diverse depth and texture variations. Extensive experiments show that our method achieves state-of-the-art performance on several standard benchmarks and generalizes well to unseen plant images, demonstrating its effectiveness and robustness.

[176] Unsupervised Real-World Super-Resolution via Rectified Flow Degradation Modelling

Hongyang Zhou,Xiaobin Zhu,Liuling Chen,Junyi He,Jingyan Qin,Xu-Cheng Yin,Zhang xiaoxing

Main category: cs.CV

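TL;DR: This paper proposes a new unsupervised real-world super-resolution method that models real-world degradation more precisely through rectified flow and a Fourier prior guided module, thereby improving super-resolution results.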

Details

Motivation: Because degradation distributions in real-world scenes are complex and unknown, existing super-resolution methods struggle to generalize from synthetic data to real data, so a more precise way to model real-world degradation is needed. Method: The paper proposes a Rectified Flow Degradation Module (RFDM) and a Fourier Prior Guided Degradation Module (FGDM), which combine continuous, invertible modeling of the degradation trajectory with Fourier phase information to generate LR-HR training pairs with realistic degradation. Result: Experiments on real-world datasets show that the method significantly improves the performance of existing super-resolution methods. Conclusion: The study proposes a rectified-flow-based unsupervised real-world super-resolution method that, by modeling the degradation process, significantly improves the performance of existing super-resolution methods in real-world scenarios. Abstract: Unsupervised real-world super-resolution (SR) faces critical challenges due to the complex, unknown degradation distributions in practical scenarios. Existing methods struggle to generalize from synthetic low-resolution (LR) and high-resolution (HR) image pairs to real-world data due to a significant domain gap. In this paper, we propose an unsupervised real-world SR method based on rectified flow to effectively capture and model real-world degradation, synthesizing LR-HR training pairs with realistic degradation. Specifically, given unpaired LR and HR images, we propose a novel Rectified Flow Degradation Module (RFDM) that introduces degradation-transformed LR (DT-LR) images as intermediaries. By modeling the degradation trajectory in a continuous and invertible manner, RFDM better captures real-world degradation and enhances the realism of generated LR images. Additionally, we propose a Fourier Prior Guided Degradation Module (FGDM) that leverages structural information embedded in Fourier phase components to ensure more precise modeling of real-world degradation. Finally, the LR images are processed by both FGDM and RFDM, producing final synthetic LR images with real-world degradation. The synthetic LR images are paired with the given HR images to train the off-the-shelf SR networks. Extensive experiments on real-world datasets demonstrate that our method significantly enhances the performance of existing SR approaches in real-world scenarios.

[177] Bridging Semantic Logic Gaps: A Cognition-Inspired Multimodal Boundary-Preserving Network for Image Manipulation Localization

Songlin Li,Zhiqing Guo,Yuanman Li,Zeyu Li,Yunfeng Diao,Gaobo Yang,Liejun Wang

Main category: cs.CV

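TL;DR: This paper proposes CMB-Net, a new image manipulation localization model that improves performance by introducing large language models and an image-text interaction module.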

Details

Motivation: Existing image manipulation localization (IML) models rely mainly on visual cues and ignore the semantic logical relationships between content features. Method: The paper proposes a cognition-inspired multimodal boundary-preserving network (CMB-Net) that combines large language models, an image-text interaction module, and related techniques. Result: Experiments show that CMB-Net performs strongly on IML tasks and outperforms most existing models. Conclusion: CMB-Net performs strongly on IML tasks and outperforms most existing models. Abstract: Existing image manipulation localization (IML) models mainly rely on visual cues but ignore the semantic logical relationships between content features. In fact, the content semantics conveyed by real images often conform to human cognitive laws. However, image manipulation technology usually destroys the internal relationship between content features, thus leaving semantic clues for IML. In this paper, we propose a cognition-inspired multimodal boundary-preserving network (CMB-Net). Specifically, CMB-Net utilizes large language models (LLMs) to analyze manipulated regions within images and generate prompt-based textual information to compensate for the lack of semantic relationships in the visual information. Considering that the erroneous texts induced by hallucination from LLMs will damage the accuracy of IML, we propose an image-text central ambiguity module (ITCAM). It assigns weights to the text features by quantifying the ambiguity between text and image features, thereby ensuring the beneficial impact of textual information. We also propose an image-text interaction module (ITIM) that aligns visual and text features using a correlation matrix for fine-grained interaction. Finally, inspired by invertible neural networks, we propose a restoration edge decoder (RED) that mutually generates input and output features to preserve boundary information in manipulated regions without loss. Extensive experiments show that CMB-Net outperforms most existing IML models.

[178] Generic Calibration: Pose Ambiguity/Linear Solution and Parametric-hybrid Pipeline

Yuqi Han,Qi Cai,Yuanxin Wu

Main category: cs.CV

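TL;DR: This paper proposes a hybrid calibration method that combines generic and parametric camera calibration to resolve pose ambiguity and improve calibration accuracy.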

Details

Motivation: Choosing a parametric camera calibration model depends on the user's practical experience, while generic calibration methods cannot provide traditional intrinsic parameters and suffer from a pose ambiguity that affects subsequent pose estimation. Method: The paper proposes a linear solver and a nonlinear optimization to resolve the pose ambiguity in generic calibration, and introduces a globally optimized hybrid calibration method that combines generic and parametric models. Result: Simulation and real-world experiments show that the generic-parametric hybrid calibration method performs well across different lens types and noise contamination, improving the extrinsic parameter accuracy of generic calibration and mitigating overfitting and numerical instability in parametric calibration. Conclusion: The proposed hybrid calibration method offers a reliable and accurate solution for camera calibration in complex scenarios. Abstract: Offline camera calibration techniques typically employ parametric or generic camera models. Selecting parametric models relies heavily on user experience, and an inappropriate camera model can significantly affect calibration accuracy. Meanwhile, generic calibration methods involve complex procedures and cannot provide traditional intrinsic parameters. This paper reveals a pose ambiguity in the pose solutions of generic calibration methods that irreversibly impacts subsequent pose estimation. A linear solver and a nonlinear optimization are proposed to address this ambiguity issue. Then a global optimization hybrid calibration method is introduced to integrate generic and parametric models together, which improves extrinsic parameter accuracy of generic calibration and mitigates overfitting and numerical instability in parametric calibration. Simulation and real-world experimental results demonstrate that the generic-parametric hybrid calibration method consistently excels across various lens types and noise contamination, hopefully serving as a reliable and accurate solution for camera calibration in complex scenarios.

[179] Landmark Guided Visual Feature Extractor for Visual Speech Recognition with Limited Resource

Lei Yang,Junshan Jin,Mingyuan Zhang,Yi He,Bofan Chen,Shilin Wang

Main category: cs.CV

TL;DR: This paper proposes a landmark-guided visual feature extractor to enhance visual speech recognition with limited data by leveraging facial landmarks and spatio-temporal features.

Details

Motivation: Deep learning-based visual speech recognition methods are affected by visual disturbances like lighting conditions and skin texture. These methods also require large amounts of data and computational resources. Method: A spatio-temporal multi-graph convolutional network was designed to exploit facial landmark features, and a multi-level lip dynamic fusion framework was introduced to combine these features with visual features from raw video frames. Result: The proposed method performs well with limited data and improves model accuracy for unseen speakers. Conclusion: The proposed landmark guided visual feature extractor effectively reduces the influence of user-specific features and enhances visual speech recognition performance with limited data. Abstract: Visual speech recognition is a technique to identify spoken content in silent speech videos, which has attracted significant attention in recent years. Advancements in data-driven deep learning methods have significantly improved both the speed and accuracy of recognition. However, these deep learning methods can be affected by visual disturbances, such as lighting conditions, skin texture and other user-specific features. Data-driven approaches could reduce the performance degradation caused by these visual disturbances using models pretrained on large-scale datasets. But these methods often require large amounts of training data and computational resources, making them costly. To reduce the influence of user-specific features and enhance performance with limited data, this paper proposes a landmark guided visual feature extractor. Facial landmarks are used as auxiliary information to aid in training the visual feature extractor. A spatio-temporal multi-graph convolutional network is designed to fully exploit the spatial locations and spatio-temporal features of facial landmarks.
Additionally, a multi-level lip dynamic fusion framework is introduced to combine the spatio-temporal features of the landmarks with the visual features extracted from the raw video frames. Experimental results show that this approach performs well with limited data and also improves the model's accuracy on unseen speakers.

[180] ASM-UNet: Adaptive Scan Mamba Integrating Group Commonalities and Individual Variations for Fine-Grained Segmentation

Bo Wang,Mengyuan Xu,Yue Yan,Yuqun Yang,Kechen Shu,Wei Ping,Xu Tang,Wei Jiang,Zheng You

Main category: cs.CV

TL;DR: ASM-UNet improves medical image segmentation adaptability by dynamically adjusting scanning orders for both coarse and fine-grained anatomical structures.

Details

Motivation: Existing coarse-grained segmentation methods struggle with fine-grained anatomical variations in clinical scenarios, and Mamba-based models are limited by fixed scanning orders. Method: ASM-UNet introduces adaptive scan scores that combine group-level commonalities and individual-level variations to dynamically guide the scanning order during segmentation. Result: ASM-UNet demonstrated superior performance on two public datasets (ACDC and Synapse) and a newly proposed biliary tract fine-grained segmentation dataset (BTMS). Conclusion: ASM-UNet, a new Mamba-based architecture, enhances both coarse-grained and fine-grained segmentation adaptability by dynamically adjusting scanning orders based on individual and group-level anatomical variations. Abstract: Precise lesion resection depends on accurately identifying fine-grained anatomical structures. While many coarse-grained segmentation (CGS) methods have been successful in large-scale segmentation (e.g., organs), they fall short in clinical scenarios requiring fine-grained segmentation (FGS), which remains challenging due to frequent individual variations in small-scale anatomical structures. Although recent Mamba-based models have advanced medical image segmentation, they often rely on fixed manually-defined scanning orders, which limit their adaptability to individual variations in FGS. To address this, we propose ASM-UNet, a novel Mamba-based architecture for FGS. It introduces adaptive scan scores to dynamically guide the scanning order, generated by combining group-level commonalities and individual-level variations. Experiments on two public datasets (ACDC and Synapse) and a newly proposed challenging biliary tract FGS dataset, namely BTMS, demonstrate that ASM-UNet achieves superior performance in both CGS and FGS tasks. Our code and dataset are available at https://github.com/YqunYang/ASM-UNet.

[181] Investigating the Design Space of Visual Grounding in Multimodal Large Language Model

Weitai Kang,Weiming Zhuang,Zhizhong Li,Yan Yan,Lingjuan Lyu

Main category: cs.CV

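TL;DR: This study systematically analyzes design choices for visual grounding in multimodal large language models, optimizes model performance, and achieves significant gains on multiple datasets.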

Details

Motivation: Although existing methods perform well on the VG task with MLLMs, they use disparate design choices during fine-tuning and lack systematic verification. The paper aims to fill this gap and provide broadly applicable optimizations. Method: Building on LLaVA-1.5, the paper explores visual grounding paradigms for MLLMs and runs ablation studies on the design of the training data to optimize VG performance. Result: The paper achieves significant gains on the RefCOCO/+/g datasets: +5.6%, +6.9%, and +7.0%, respectively. Conclusion: By systematically studying design choices for multimodal large language models (MLLMs) on the visual grounding (VG) task, the paper derives improvements that yield gains of +5.6% / +6.9% / +7.0% over LLaVA-1.5 on RefCOCO/+/g. Abstract: Fine-grained multimodal capability in Multimodal Large Language Models (MLLMs) has emerged as a critical research direction, particularly for tackling the visual grounding (VG) problem. Despite the strong performance achieved by existing approaches, they often employ disparate design choices when fine-tuning MLLMs for VG, lacking systematic verification to support these designs. To bridge this gap, this paper presents a comprehensive study of various design choices that impact the VG performance of MLLMs. We conduct our analysis using LLaVA-1.5, which has been widely adopted in prior empirical studies of MLLMs. While more recent models exist, we follow this convention to ensure our findings remain broadly applicable and extendable to other architectures. We cover two key aspects: (1) exploring different visual grounding paradigms in MLLMs, identifying the most effective design, and providing our insights; and (2) conducting ablation studies on the design of grounding data to optimize MLLMs' fine-tuning for the VG task. Finally, our findings contribute to a stronger MLLM for VG, achieving improvements of +5.6% / +6.9% / +7.0% on RefCOCO/+/g over LLaVA-1.5.

[182] Consistent and Controllable Image Animation with Motion Linear Diffusion Transformers

Xin Ma,Yaohui Wang,Genyun Jia,Xinyuan Chen,Tien-Tsin Wong,Cunjian Chen

Main category: cs.CV

TL;DR: MiraMo improves the efficiency and smoothness of image animation through an improved attention mechanism, motion modeling, and noise refinement.

Details

Motivation: The work targets poor appearance consistency, abrupt motion transitions, and heavy computational cost in image animation, while drawing on advanced text-to-video generation methods to improve performance. Method: MiraMo introduces three key elements to improve efficiency, appearance consistency, and motion smoothness: a text-to-video architecture that uses linear attention, a motion residual learning paradigm, and a DCT-based noise refinement strategy. Result: Experiments show that MiraMo outperforms existing methods in generation quality and inference speed, and achieves better temporal consistency and motion control. Conclusion: The MiraMo framework performs well at generating consistent, smooth, and controllable animations, and its faster inference and versatility show its potential for motion transfer and video editing tasks. Abstract: Image animation has seen significant progress, driven by the powerful generative capabilities of diffusion models. However, maintaining appearance consistency with static input images and mitigating abrupt motion transitions in generated animations remain persistent challenges. While text-to-video (T2V) generation has demonstrated impressive performance with diffusion transformer models, the image animation field still largely relies on U-Net-based diffusion models, which lag behind the latest T2V approaches. Moreover, the quadratic complexity of vanilla self-attention mechanisms in Transformers imposes heavy computational demands, making image animation particularly resource-intensive. To address these issues, we propose MiraMo, a framework designed to enhance efficiency, appearance consistency, and motion smoothness in image animation. Specifically, MiraMo introduces three key elements: (1) A foundational text-to-video architecture replacing vanilla self-attention with efficient linear attention to reduce computational overhead while preserving generation quality; (2) A novel motion residual learning paradigm that focuses on modeling motion dynamics rather than directly predicting frames, improving temporal consistency; and (3) A DCT-based noise refinement strategy during inference to suppress sudden motion artifacts, complemented by a dynamics control module to balance motion smoothness and expressiveness. Extensive experiments against state-of-the-art methods validate the superiority of MiraMo in generating consistent, smooth, and controllable animations with accelerated inference speed.
Additionally, we demonstrate the versatility of MiraMo through applications in motion transfer and video editing tasks.
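
For readers unfamiliar with why linear attention avoids the quadratic cost mentioned above, here is a minimal pure-Python sketch of kernelized linear attention. The abstract does not say which linear attention variant MiraMo uses, so the `elu(x) + 1` feature map and all function names below are assumptions chosen only because they are a common textbook formulation.

```python
import math

def phi(x):
    """Positive feature map elu(x) + 1 (an assumed, commonly used choice)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def linear_attention(queries, keys, values):
    """Kernelized attention with cost linear in sequence length.

    Builds S = sum_j phi(k_j) v_j^T and z = sum_j phi(k_j) once, then reuses
    them for every query instead of forming an n x n attention matrix.
    """
    d_k, d_v = len(keys[0]), len(values[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    z = [0.0] * d_k
    for k, v in zip(keys, values):
        fk = phi(k)
        for a in range(d_k):
            z[a] += fk[a]
            for b in range(d_v):
                S[a][b] += fk[a] * v[b]
    out = []
    for q in queries:
        fq = phi(q)
        norm = sum(fq[a] * z[a] for a in range(d_k))
        out.append([sum(fq[a] * S[a][b] for a in range(d_k)) / norm
                    for b in range(d_v)])
    return out

def quadratic_reference(queries, keys, values):
    """The same kernel attention computed pairwise, for comparison only."""
    out = []
    for q in queries:
        fq = phi(q)
        w = [sum(a * b for a, b in zip(fq, phi(k))) for k in keys]
        total = sum(w)
        out.append([sum(wi * v[b] for wi, v in zip(w, values)) / total
                    for b in range(len(values[0]))])
    return out
```

Both functions return the same values; the difference is that `linear_attention` never materializes the pairwise weights, so its work grows with the number of tokens rather than with its square.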

[183] SUIT: Spatial-Spectral Union-Intersection Interaction Network for Hyperspectral Object Tracking

Fengchao Xiong,Zhenxing Wu,Sen Jia,Yuntao Qian

Main category: cs.CV

TL;DR: This paper improves hyperspectral video tracking by focusing on spectral interactions through a novel architecture and training method, resulting in state-of-the-art performance.

Details

Motivation: The motivation is to improve tracking performance in hyperspectral videos by focusing on spectral interactions, which are often overlooked in existing methods that mainly consider spatial interactions. Method: The method involves establishing band-wise long-range spatial relationships using Transformers and modeling spectral interactions with the inclusion-exclusion principle from set theory. Additionally, a spectral loss is introduced to align material distributions. Result: Extensive experiments show that the proposed tracker achieves state-of-the-art performance, and the source code, trained models, and results are made publicly available for reproducibility. Conclusion: The paper concludes that by considering both architectural and training perspectives for modeling spectral interactions, the proposed tracker achieves state-of-the-art tracking performance in hyperspectral videos. Abstract: Hyperspectral videos (HSVs), with their inherent spatial-spectral-temporal structure, offer distinct advantages in challenging tracking scenarios such as cluttered backgrounds and small objects. However, existing methods primarily focus on spatial interactions between the template and search regions, often overlooking spectral interactions, leading to suboptimal performance. To address this issue, this paper investigates spectral interactions from both the architectural and training perspectives. At the architectural level, we first establish band-wise long-range spatial relationships between the template and search regions using Transformers. We then model spectral interactions using the inclusion-exclusion principle from set theory, treating them as the union of spatial interactions across all bands. This enables the effective integration of both shared and band-specific spatial cues. 
At the training level, we introduce a spectral loss to enforce material distribution alignment between the template and predicted regions, enhancing robustness to shape deformation and appearance variations. Extensive experiments demonstrate that our tracker achieves state-of-the-art tracking performance. The source code, trained models and results will be publicly available via https://github.com/bearshng/suit to support reproducibility.
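For reference, the inclusion-exclusion principle that the abstract invokes can be written for B sets as below. Reading each S_b as the set of spatial interactions found in band b is an illustrative assumption based on the abstract's wording, not the paper's exact formulation.

```latex
\left| \bigcup_{b=1}^{B} S_b \right|
  = \sum_{b} |S_b|
  - \sum_{b < c} |S_b \cap S_c|
  + \sum_{b < c < d} |S_b \cap S_c \cap S_d|
  - \cdots
  + (-1)^{B+1} \left| S_1 \cap \cdots \cap S_B \right|
```

The alternating intersection terms are what separate cues shared across bands from cues specific to a single band.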

[184] Understanding Dynamic Scenes in Ego Centric 4D Point Clouds

Junsheng Huang,Shengyu Hao,Bocheng Hu,Gaoang Wang

Main category: cs.CV

TL;DR: This paper introduces EgoDynamic4D, a novel QA benchmark for dynamic 4D scene understanding, along with a spatio-temporal reasoning framework that outperforms baselines in handling complex egocentric interactions over time.

Details

Motivation: The motivation is to address the lack of unified 4D annotations and task-driven evaluation protocols in existing egocentric datasets, especially for fine-grained spatio-temporal reasoning involving object and human motion and their interactions. Method: The authors introduce EgoDynamic4D, a new QA benchmark with comprehensive 4D annotations and Chain-of-Thought (CoT) reasoning. They propose a framework that uses instance-aware feature encoding, time and camera encoding, and spatially adaptive down-sampling to enable LLMs to process 4D scenes for spatio-temporal reasoning. Result: The experiments on EgoDynamic4D demonstrate that the proposed method consistently outperforms baselines, highlighting the effectiveness of multimodal temporal modeling in dynamic scene understanding. Conclusion: The paper concludes that the proposed end-to-end spatio-temporal reasoning framework effectively enhances egocentric dynamic scene understanding by integrating dynamic and static scene information, outperforming existing baselines. Abstract: Understanding dynamic 4D scenes from an egocentric perspective (modeling changes in 3D spatial structure over time) is crucial for human-machine interaction, autonomous navigation, and embodied intelligence. While existing egocentric datasets contain dynamic scenes, they lack unified 4D annotations and task-driven evaluation protocols for fine-grained spatio-temporal reasoning, especially on the motion of objects and humans, together with their interactions. To address this gap, we introduce EgoDynamic4D, a novel QA benchmark on highly dynamic scenes, comprising RGB-D video, camera poses, globally unique instance masks, and 4D bounding boxes. We construct 927K QA pairs accompanied by explicit Chain-of-Thought (CoT), enabling verifiable, step-by-step spatio-temporal reasoning.
We design 12 dynamic QA tasks covering agent motion, human-object interaction, trajectory prediction, relation understanding, and temporal-causal reasoning, with fine-grained, multidimensional metrics. To tackle these tasks, we propose an end-to-end spatio-temporal reasoning framework that unifies dynamic and static scene information, using instance-aware feature encoding, time and camera encoding, and spatially adaptive down-sampling to compress large 4D scenes into token sequences manageable by LLMs. Experiments on EgoDynamic4D show that our method consistently outperforms baselines, validating the effectiveness of multimodal temporal modeling for egocentric dynamic scene understanding.

[185] Small-Large Collaboration: Training-efficient Concept Personalization for Large VLM using a Meta Personalized Small VLM

Sihan Yang,Huitong Ji,Shaolin Lu,Jiayi Chen,Binxiao Xu,Ming Lu,Yuanxing Zhang,Wenhui Dong,Wentao Zhang

Main category: cs.CV

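TL;DR: This paper proposes Small-Large Collaboration (SLC), a new framework that trains a meta personalized small VLM to strengthen the personalization capability of large VLMs, enabling training-efficient personalization and broader real-world use.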

Details

Motivation: Large VLMs excel at complex multi-modal understanding, but their high training costs and restricted access limit direct personalization; small VLMs are easy to personalize and freely available, but they lack sufficient reasoning capabilities. Method: The authors develop a test-time reflection strategy to prevent potential hallucination by the small VLM, and validate the SLC framework through experiments across various benchmarks and large VLMs. Result: The SLC framework personalizes large VLMs without sacrificing performance and supports both open-source and closed-source large VLMs. Conclusion: The paper proposes Small-Large Collaboration (SLC), a new collaborative framework that trains a meta personalized small VLM to strengthen the personalization capability of large VLMs, enabling training-efficient personalization and broader real-world use. Abstract: Personalizing Vision-Language Models (VLMs) to transform them into daily assistants has emerged as a trending research direction. However, leading companies like OpenAI continue to increase model size and develop complex designs such as the chain of thought (CoT). While large VLMs are proficient in complex multi-modal understanding, their high training costs and limited access via paid APIs restrict direct personalization. Conversely, small VLMs are easily personalized and freely available, but they lack sufficient reasoning capabilities. Inspired by this, we propose a novel collaborative framework named Small-Large Collaboration (SLC) for large VLM personalization, where the small VLM is responsible for generating personalized information, while the large model integrates this personalized information to deliver accurate responses. To effectively incorporate personalized information, we develop a test-time reflection strategy, preventing the potential hallucination of the small VLM. Since SLC only needs to train a meta personalized small VLM for the large VLMs, the overall process is training-efficient. To the best of our knowledge, this is the first training-efficient framework that supports both open-source and closed-source large VLMs, enabling broader real-world personalized applications. We conduct thorough experiments across various benchmarks and large VLMs to demonstrate the effectiveness of the proposed SLC framework. The code will be released at https://github.com/Hhankyangg/SLC.

[186] OpenHAIV: A Framework Towards Practical Open-World Learning

Xiang Xiang,Qinhao Zhou,Zhuo Xu,Jing Ma,Jiaxin Dai,Yifan Liang,Hanlin Li

Main category: cs.CV

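TL;DR: This paper proposes OpenHAIV, a new framework that integrates OOD detection, new class discovery, and incremental continual fine-tuning so that models can autonomously acquire and update knowledge in open-world environments.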

Details

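Motivation: Despite substantial progress in techniques for open-world recognition, these methods still face limitations in open-world scenarios. Relying solely on OOD detection does not facilitate knowledge updates in the model, while incremental fine-tuning typically requires supervised conditions, which deviate significantly from open-world settings. A new framework is therefore needed that can autonomously acquire and update knowledge in open-world environments. Method: The paper proposes the OpenHAIV framework, which integrates OOD detection, new class discovery, and incremental continual fine-tuning. Result: A new framework, OpenHAIV, was developed that allows models to autonomously acquire and update knowledge in open-world environments. Conclusion: OpenHAIV is a new framework that integrates OOD detection, new class discovery, and incremental continual fine-tuning into a unified pipeline, allowing models to autonomously acquire and update knowledge in open-world environments. Abstract: Substantial progress has been made in various techniques for open-world recognition. Out-of-distribution (OOD) detection methods can effectively distinguish between known and unknown classes in the data, while incremental learning enables continuous model knowledge updates. However, in open-world scenarios, these approaches still face limitations. Relying solely on OOD detection does not facilitate knowledge updates in the model, and incremental fine-tuning typically requires supervised conditions, which significantly deviate from open-world settings. To address these challenges, this paper proposes OpenHAIV, a novel framework that integrates OOD detection, new class discovery, and incremental continual fine-tuning into a unified pipeline. This framework allows models to autonomously acquire and update knowledge in open-world environments. The proposed framework is available at https://haiv-lab.github.io/openhaiv.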
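To make the OOD detection stage concrete, here is a minimal sketch of one common baseline: flagging a sample as unknown when its maximum softmax probability is low. This is a generic technique chosen for illustration; the abstract does not say which detector OpenHAIV uses, and the `is_ood` name and the 0.7 threshold are assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def is_ood(logits, threshold=0.7):
    """Flag a sample as out-of-distribution when the top class
    probability falls below the (assumed) threshold."""
    return max(softmax(logits)) < threshold

confident = [8.0, 0.5, 0.2]   # one class clearly dominates
uncertain = [1.0, 0.9, 1.1]   # probabilities are nearly uniform
print(is_ood(confident), is_ood(uncertain))
```

In a pipeline like the one the abstract describes, samples flagged this way would be handed to the class discovery stage rather than discarded.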

[187] Representation Understanding via Activation Maximization

Hongbo Zhu,Angelo Cangelosi

Main category: cs.CV

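TL;DR: This paper proposes a new feature visualization framework for both convolutional neural networks and Vision Transformers that uses activation maximization to generate inputs, giving deeper insight into the feature representations and potential vulnerabilities of deep neural networks.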

Details

Motivation: Understanding the internal feature representations of deep neural networks (DNNs) is a key step toward model interpretability. Method: Inspired by neuroscience methods, the study uses Activation Maximization (AM) to synthesize inputs that elicit strong responses from artificial neurons, and extends feature visualization to intermediate layers. Result: Experiments show that the method applies to ViTs as well as CNNs, and that it reveals potential vulnerabilities and decision boundaries of DNNs. Conclusion: The paper proposes a unified feature visualization framework applicable to both Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), and demonstrates its effectiveness and generality on traditional CNNs and modern ViTs. Abstract: Understanding internal feature representations of deep neural networks (DNNs) is a fundamental step toward model interpretability. Inspired by neuroscience methods that probe biological neurons using visual stimuli, recent deep learning studies have employed Activation Maximization (AM) to synthesize inputs that elicit strong responses from artificial neurons. In this work, we propose a unified feature visualization framework applicable to both Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). Unlike prior efforts that predominantly focus on the last output-layer neurons in CNNs, we extend feature visualization to intermediate layers as well, offering deeper insights into the hierarchical structure of learned feature representations. Furthermore, we investigate how activation maximization can be leveraged to generate adversarial examples, revealing potential vulnerabilities and decision boundaries of DNNs. Our experiments demonstrate the effectiveness of our approach in both traditional CNNs and modern ViTs, highlighting its generalizability and interpretive value.
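Activation Maximization can be illustrated on a toy "neuron" with plain gradient ascent on the input. The sketch below assumes a single linear unit with an L2 penalty on the input (a common regularizer that keeps the synthesized input bounded); it is not the paper's framework, which targets real CNN and ViT layers.

```python
def activation(x, w, decay=1.0):
    """Toy neuron: linear response minus an L2 penalty on the input."""
    response = sum(wi * xi for wi, xi in zip(w, x))
    penalty = 0.5 * decay * sum(xi * xi for xi in x)
    return response - penalty

def maximize_activation(w, steps=200, lr=0.1, decay=1.0):
    """Gradient ascent on the input x, starting from zeros."""
    x = [0.0] * len(w)
    for _ in range(steps):
        # Gradient of the objective w.r.t. x is w - decay * x.
        grad = [wi - decay * xi for wi, xi in zip(w, x)]
        x = [xi + lr * gi for xi, gi in zip(x, grad)]
    return x

w = [0.5, -1.0, 2.0]          # hypothetical neuron weights
x_star = maximize_activation(w)
print(x_star)
```

With `decay=1.0` the optimum is the weight vector itself, so the synthesized input shows the pattern the neuron responds to most strongly. In a deep network the gradient would come from backpropagation through the chosen layer instead of a closed form.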

[188] SynMatch: Rethinking Consistency in Medical Image Segmentation with Sparse Annotations

Zhiqiang Shen,Peng Cao,Xiaoli Liu,Jinzhu Yang,Osmar R. Zaiane

Main category: cs.CV

TL;DR: SynMatch addresses label scarcity in medical image segmentation by synthesizing images that match the pseudo labels.

Details

Motivation: Label scarcity is a major challenge for deep learning in medical image segmentation. Recent studies use strong-weak pseudo supervision to exploit unlabeled data, but inconsistencies between pseudo labels and unlabeled images often limit performance. Method: SynMatch synthesizes images using texture and shape features extracted from the same segmentation model that generates the corresponding pseudo labels. Result: Experiments show that SynMatch performs especially well in the most challenging BSL setting. For example, on the polyp segmentation task with 5% and 10% scribble annotations, it improves on the recent strong-weak pseudo supervision-based method by 29.71% and 10.05%, respectively. Conclusion: SynMatch is a novel framework that tackles label scarcity in deep learning-based medical image segmentation by synthesizing matching images rather than improving pseudo labels. Abstract: Label scarcity remains a major challenge in deep learning-based medical image segmentation. Recent studies use strong-weak pseudo supervision to leverage unlabeled data. However, performance is often hindered by inconsistencies between pseudo labels and their corresponding unlabeled images. In this work, we propose SynMatch, a novel framework that sidesteps the need for improving pseudo labels by synthesizing images to match them instead. Specifically, SynMatch synthesizes images using texture and shape features extracted from the same segmentation model that generates the corresponding pseudo labels for unlabeled images. This design enables the generation of highly consistent synthesized-image-pseudo-label pairs without requiring any training parameters for image synthesis. We extensively evaluate SynMatch across diverse medical image segmentation tasks under semi-supervised learning (SSL), weakly-supervised learning (WSL), and barely-supervised learning (BSL) settings with increasingly limited annotations. The results demonstrate that SynMatch achieves superior performance, especially in the most challenging BSL setting. For example, it outperforms the recent strong-weak pseudo supervision-based method by 29.71% and 10.05% on the polyp segmentation task with 5% and 10% scribble annotations, respectively. The code will be released at https://github.com/Senyh/SynMatch.

[189] BEVANet: Bilateral Efficient Visual Attention Network for Real-Time Semantic Segmentation

Ping-Mao Huang,I-Tien Chao,Ping-Chia Huang,Jia-Wei Liao,Yung-Yu Chuang

Main category: cs.CV

TL;DR: This paper introduces BEVANet, a novel architecture for real-time semantic segmentation that effectively captures large receptive fields and refines detailed contours using efficient attention mechanisms.

Details

Motivation: The motivation is to design efficient architectures for real-time semantic segmentation that can capture large receptive fields while refining detailed contours, addressing the high computational cost of vision transformers. Method: The proposed Bilateral Efficient Visual Attention Network (BEVANet) uses Sparse Decomposed Large Separable Kernel Attentions (SDLSKA), Comprehensive Kernel Selection (CKS), and Deep Large Kernel Pyramid Pooling Module (DLKPPM) to enhance performance. Result: BEVANet achieves real-time segmentation at 33 FPS, yielding 79.3% mIoU without pretraining and 81.0% mIoU on Cityscapes after ImageNet pretraining. Conclusion: BEVANet demonstrates state-of-the-art performance in real-time semantic segmentation by introducing efficient mechanisms to capture large receptive fields and refine detailed contours. Abstract: Real-time semantic segmentation presents the dual challenge of designing efficient architectures that capture large receptive fields for semantic understanding while also refining detailed contours. Vision transformers model long-range dependencies effectively but incur high computational cost. To address these challenges, we introduce the Large Kernel Attention (LKA) mechanism. Our proposed Bilateral Efficient Visual Attention Network (BEVANet) expands the receptive field to capture contextual information and extracts visual and structural features using Sparse Decomposed Large Separable Kernel Attentions (SDLSKA). The Comprehensive Kernel Selection (CKS) mechanism dynamically adapts the receptive field to further enhance performance. Furthermore, the Deep Large Kernel Pyramid Pooling Module (DLKPPM) enriches contextual features by synergistically combining dilated convolutions and large kernel attention. 
The bilateral architecture facilitates frequent branch communication, and the Boundary Guided Adaptive Fusion (BGAF) module enhances boundary delineation by integrating spatial and semantic features under boundary guidance. BEVANet achieves real-time segmentation at 33 FPS, yielding 79.3% mIoU without pretraining and 81.0% mIoU on Cityscapes after ImageNet pretraining, demonstrating state-of-the-art performance. The code and model are available at https://github.com/maomao0819/BEVANet.
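The mIoU figures quoted above are means of per-class intersection-over-union. As a minimal sketch of the metric, assuming flattened per-pixel label lists (benchmark tooling typically accumulates a confusion matrix over the whole dataset instead), the computation looks like this; `mean_iou` is an illustrative helper, not code from the BEVANet repository.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes that appear in
    either the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

pred = [0, 0, 1, 1, 2, 2]      # predicted class per pixel
target = [0, 1, 1, 1, 2, 0]    # ground-truth class per pixel
print(mean_iou(pred, target, num_classes=3))
```

Here the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5.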

[190] DragonFruitQualityNet: A Lightweight Convolutional Neural Network for Real-Time Dragon Fruit Quality Inspection on Mobile Devices

Md Zahurul Haquea,Yeahyea Sarker,Muhammed Farhan Sadique Mahi,Syed Jubayer Jaman,Md Robiul Islam

Main category: cs.CV

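TL;DR: This study develops an efficient AI model for dragon fruit quality inspection and deploys it on mobile devices so that farmers can run quality checks in real time.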

Details

Motivation: As dragon fruit cultivation expands, efficient pre- and post-harvest quality inspection has become essential for raising agricultural productivity and reducing post-harvest losses. Method: The study builds a diverse dataset of 13,789 images and develops DragonFruitQualityNet, an optimized lightweight convolutional neural network for real-time dragon fruit quality assessment on mobile devices. Result: The proposed model reaches 93.98% accuracy, outperforming existing methods in fruit quality classification. Conclusion: By developing DragonFruitQualityNet, a lightweight convolutional neural network, and embedding it in a mobile application, the study provides an accurate, efficient, and scalable AI-driven solution for dragon fruit quality inspection that advances digital agriculture. Abstract: Dragon fruit, renowned for its nutritional benefits and economic value, has experienced rising global demand due to its affordability and local availability. As dragon fruit cultivation expands, efficient pre- and post-harvest quality inspection has become essential for improving agricultural productivity and minimizing post-harvest losses. This study presents DragonFruitQualityNet, a lightweight Convolutional Neural Network (CNN) optimized for real-time quality assessment of dragon fruits on mobile devices. We curated a diverse dataset of 13,789 images, integrating self-collected samples with public datasets (dataset from Mendeley Data), and classified them into four categories: fresh, immature, mature, and defective fruits to ensure robust model training. The proposed model achieves an impressive 93.98% accuracy, outperforming existing methods in fruit quality classification. To facilitate practical adoption, we embedded the model into an intuitive mobile application, enabling farmers and agricultural stakeholders to conduct on-device, real-time quality inspections. This research provides an accurate, efficient, and scalable AI-driven solution for dragon fruit quality control, supporting digital agriculture and empowering smallholder farmers with accessible technology. By bridging the gap between research and real-world application, our work advances post-harvest management and promotes sustainable farming practices.

[191] MCITlib: Multimodal Continual Instruction Tuning Library and Benchmark

Haiyang Guo,Fei Zhu,Hongbo Zhao,Fanhu Zeng,Wenzhuo Liu,Shijie Ma,Da-Han Wang,Xu-Yao Zhang

Main category: cs.CV

TL;DR: This paper introduces MCITlib, a library for continual instruction tuning of Multimodal Large Language Models, addressing challenges in continual learning involving multiple modalities.

Details

Motivation: Continual learning aims to enable AI systems to continuously learn and adapt without forgetting past knowledge, particularly in Multimodal settings involving cross-modal interactions which pose additional challenges. Method: The authors developed MCITlib, which includes 8 representative algorithms for Multimodal Continual Instruction Tuning, and evaluated them on two carefully selected benchmarks. Result: MCITlib was developed with 8 implemented algorithms, which were systematically evaluated on 2 benchmarks, and it will be updated to keep pace with advancements in Multimodal Continual Learning. Conclusion: MCITlib serves as a comprehensive and evolving library for continual instruction tuning of Multimodal Large Language Models, aiming to advance research in Multimodal Continual Learning. Abstract: Continual learning aims to equip AI systems with the ability to continuously acquire and adapt to new knowledge without forgetting previously learned information, similar to human learning. While traditional continual learning methods focusing on unimodal tasks have achieved notable success, the emergence of Multimodal Large Language Models has brought increasing attention to Multimodal Continual Learning tasks involving multiple modalities, such as vision and language. In this setting, models are expected to not only mitigate catastrophic forgetting but also handle the challenges posed by cross-modal interactions and coordination. To facilitate research in this direction, we introduce MCITlib, a comprehensive and constantly evolving code library for continual instruction tuning of Multimodal Large Language Models. In MCITlib, we have currently implemented 8 representative algorithms for Multimodal Continual Instruction Tuning and systematically evaluated them on 2 carefully selected benchmarks. MCITlib will be continuously updated to reflect advances in the Multimodal Continual Learning field. The codebase is released at https://github.com/Ghy0501/MCITlib.

[192] MobileViCLIP: An Efficient Video-Text Model for Mobile Devices

Min Yang,Zihan Jia,Zhilin Dai,Sheng Guo,Limin Wang

Main category: cs.CV

TL;DR: The paper proposes MobileViCLIP, an efficient video-text model for mobile devices with strong performance in zero-shot classification and retrieval.

Details

Motivation: The motivation is to address the gap in efficient video pre-trained models suitable for mobile devices, as existing models focus on high-latency ViT architectures. Method: The paper introduces temporal structural reparameterization into an efficient image-text model and trains it on a large-scale high-quality video-text dataset. Result: MobileViCLIP-Small achieves 55.4x faster inference speed than InternVideo2-L14 and 6.7x faster than InternVideo2-S14, with similar zero-shot retrieval performance to InternVideo2-L14 and 6.9% better performance than InternVideo2-S14 on MSR-VTT. Conclusion: The paper concludes that MobileViCLIP, an efficient video-text model, can run on mobile devices with strong zero-shot classification and retrieval capabilities, outperforming other models in inference speed and retrieval performance. Abstract: Efficient lightweight neural networks are receiving increasing attention due to their faster inference speed and easier deployment on mobile devices. However, existing video pre-trained models still focus on the common ViT architecture with high latency, and few works attempt to build efficient architectures for mobile devices. This paper bridges this gap by introducing temporal structural reparameterization into an efficient image-text model and training it on a large-scale high-quality video-text dataset, resulting in an efficient video-text model that can run on mobile devices with strong zero-shot classification and retrieval capabilities, termed MobileViCLIP. In particular, in terms of inference speed on mobile devices, our MobileViCLIP-Small is 55.4x faster than InternVideo2-L14 and 6.7x faster than InternVideo2-S14. In terms of zero-shot retrieval performance, our MobileViCLIP-Small obtains similar performance to InternVideo2-L14 and performs 6.9% better than InternVideo2-S14 on MSR-VTT. The code is available at https://github.com/MCG-NJU/MobileViCLIP.
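The general idea behind structural reparameterization is that a multi-branch block used during training can be folded into a single equivalent operator for inference. The sketch below shows this for a 3-tap convolution along the time axis plus an identity shortcut. It illustrates the generic technique under that assumed block layout; the abstract does not describe MobileViCLIP's actual temporal block.

```python
def conv1d_same(signal, kernel):
    """Zero-padded 1D correlation with an odd-length kernel, so the
    output has the same length as the input."""
    half = len(kernel) // 2
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t + k - half
            if 0 <= idx < len(signal):
                acc += w * signal[idx]
        out.append(acc)
    return out

def merge_branches(kernel):
    """Fold an identity shortcut into the centre tap of the kernel."""
    merged = list(kernel)
    merged[len(kernel) // 2] += 1.0
    return merged

frames = [1.0, 2.0, 4.0, 3.0]   # one feature value per frame
kernel = [0.25, 0.5, -0.25]     # hypothetical learned temporal kernel

# Training-time form: convolution branch plus identity branch.
two_branch = [c + x for c, x in zip(conv1d_same(frames, kernel), frames)]
# Inference-time form: a single convolution with the merged kernel.
one_branch = conv1d_same(frames, merge_branches(kernel))
print(two_branch == one_branch)
```

Because both branches are linear, the merged kernel reproduces the two-branch output exactly while running only one convolution, which is where the latency saving comes from.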

[193] DocR1: Evidence Page-Guided GRPO for Multi-Page Document Understanding

Junyu Xiong,Yonghui Wang,Weichao Zhao,Chenyu Liu,Bing Yin,Wengang Zhou,Houqiang Li

Main category: cs.CV

TL;DR: DocR1 is a new MLLM trained with EviGRPO, a reinforcement learning framework that improves multi-page document understanding by guiding models to retrieve relevant pages before answering. It achieves top performance on multi-page tasks and maintains strong results on single-page benchmarks.

Details

Motivation: Understanding multi-page documents is challenging for MLLMs due to the need for fine-grained visual comprehension and multi-hop reasoning. While RL has been used to improve reasoning in MLLMs, its application to multi-page understanding is underexplored, which motivated the development of DocR1. Method: DocR1 was trained using a novel RL framework called Evidence Page-Guided GRPO (EviGRPO), which uses an evidence-aware reward mechanism to guide a coarse-to-fine reasoning strategy. Additionally, a two-stage annotation pipeline and a curriculum learning strategy were designed to build the datasets. Result: Extensive experiments showed that DocR1 achieves state-of-the-art performance on multi-page tasks and maintains strong results on single-page benchmarks. Two datasets, EviBench and ArxivFullQA, were also created to support training and evaluation. Conclusion: DocR1, an MLLM trained with EviGRPO, achieves state-of-the-art performance on multi-page document understanding tasks while maintaining strong results on single-page benchmarks. Abstract: Understanding multi-page documents poses a significant challenge for multimodal large language models (MLLMs), as it requires fine-grained visual comprehension and multi-hop reasoning across pages. While prior work has explored reinforcement learning (RL) for enhancing advanced reasoning in MLLMs, its application to multi-page document understanding remains underexplored. In this paper, we introduce DocR1, an MLLM trained with a novel RL framework, Evidence Page-Guided GRPO (EviGRPO). EviGRPO incorporates an evidence-aware reward mechanism that promotes a coarse-to-fine reasoning strategy, guiding the model to first retrieve relevant pages before generating answers. This training paradigm enables us to build high-quality models with limited supervision. 
To support this, we design a two-stage annotation pipeline and a curriculum learning strategy, based on which we construct two datasets: EviBench, a high-quality training set with 4.8k examples, and ArxivFullQA, an evaluation benchmark with 8.6k QA pairs based on scientific papers. Extensive experiments across a wide range of benchmarks demonstrate that DocR1 achieves state-of-the-art performance on multi-page tasks, while consistently maintaining strong results on single-page benchmarks.
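GRPO (Group Relative Policy Optimization) scores each sampled response relative to the other responses drawn for the same prompt. The sketch below pairs that group normalization with a hypothetical evidence-aware reward: answer correctness plus a weighted F1 over the predicted evidence pages. The reward form, the `evidence_weight` value, and all function names are assumptions for illustration; the abstract does not specify EviGRPO's actual reward.

```python
import statistics

def page_f1(predicted_pages, evidence_pages):
    """F1 overlap between predicted and annotated evidence pages."""
    predicted, evidence = set(predicted_pages), set(evidence_pages)
    hits = len(predicted & evidence)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(evidence)
    return 2 * precision * recall / (precision + recall)

def reward(sample, evidence_pages, evidence_weight=0.5):
    # Hypothetical reward: correctness plus a weighted evidence term.
    return float(sample["correct"]) + evidence_weight * page_f1(
        sample["pages"], evidence_pages
    )

def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (zero mean, unit std)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

group = [
    {"pages": [3, 7], "correct": True},    # right pages, right answer
    {"pages": [3], "correct": False},      # partial evidence, wrong answer
    {"pages": [1, 2], "correct": False},   # wrong pages, wrong answer
]
rewards = [reward(s, evidence_pages=[3, 7]) for s in group]
advantages = group_advantages(rewards)
print(rewards, advantages)
```

Under this assumed reward, a response that retrieves part of the evidence still ranks above one that retrieves none, which is one way a coarse-to-fine "retrieve first, then answer" behavior could be encouraged.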

[194] RORPCap: Retrieval-based Objects and Relations Prompt for Image Captioning

Jinjing Gu,Tianbao Qin,Yuanyuan Pu,Zhengpeng Zhao

Main category: cs.CV

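TL;DR: RORPCap is an efficient image captioning method that uses retrieval-based object and relation prompts together with a Mamba mapping network, CLIP, and GPT-2, substantially reducing training cost while preserving performance.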

Details

Motivation: Traditional methods rely on redundant object detection and complex GCN construction, which leads to high training costs. Method: RORPCap uses retrieval-based object and relation prompts, combined with a Mamba mapping network, CLIP, and GPT-2, to generate image captions. Result: On the MS-COCO dataset, RORPCap needs only 2.6 hours of training and achieves a 120.5% CIDEr score and a 22.0% SPICE score. Conclusion: RORPCap performs well on image captioning, with a short training time and comparable performance metrics, and shows potential as an alternative approach. Abstract: Image captioning aims to generate natural language descriptions for input images in an open-form manner. To accurately generate descriptions related to the image, a critical step in image captioning is to identify objects and understand their relations within the image. Modern approaches typically capitalize on object detectors or combine detectors with Graph Convolutional Network (GCN). However, these models suffer from redundant detection information, difficulty in GCN construction, and high training costs. To address these issues, a Retrieval-based Objects and Relations Prompt for Image Captioning (RORPCap) is proposed, inspired by the fact that image-text retrieval can provide rich semantic information for input images. RORPCap employs an Objects and relations Extraction Model to extract object and relation words from the image. These words are then incorporated into predefined prompt templates and encoded as prompt embeddings. Next, a Mamba-based mapping network is designed to quickly map image embeddings extracted by CLIP to visual-text embeddings. Finally, the resulting prompt embeddings and visual-text embeddings are concatenated to form textual-enriched feature embeddings, which are fed into a GPT-2 model for caption generation. Extensive experiments conducted on the widely used MS-COCO dataset show that the RORPCap requires only 2.6 hours under cross-entropy loss training, achieving 120.5% CIDEr score and 22.0% SPICE score on the "Karpathy" test split. RORPCap achieves comparable performance metrics to detector-based and GCN-based models with the shortest training time and demonstrates its potential as an alternative for image captioning.
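The prompt construction step can be pictured as plain template filling before the text is encoded. The template wording and the `build_prompt` helper below are hypothetical; the abstract states only that extracted object and relation words are placed into predefined templates.

```python
def build_prompt(objects, relations,
                 template="Objects: {objects}. Relations: {relations}. Caption:"):
    """Fill a (hypothetical) prompt template with extracted words."""
    return template.format(
        objects=", ".join(objects),
        relations=", ".join(relations),
    )

prompt = build_prompt(["dog", "frisbee"], ["catching"])
print(prompt)  # Objects: dog, frisbee. Relations: catching. Caption:
```

In the pipeline the abstract describes, a string like this would be encoded into prompt embeddings and concatenated with the visual-text embeddings before being passed to GPT-2.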

Tuyen Tran,Thao Minh Le,Quang-Hung Le,Truyen Tran

Main category: cs.CV

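TL;DR: This paper proposes Planner-Refiner, a framework that iteratively refines visual representations under language guidance to handle the complexity of vision-language alignment in video, and is especially suited to complex language prompts.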

Details

Motivation: Vision-language alignment in video must handle the complexity of language, dynamically interacting entities and their action chains, and the semantic gap between language and vision. Method: The Planner-Refiner framework iteratively refines the space-time representation of visual elements and uses language guidance to reduce the semantic gap. The Planner module decomposes complex language prompts into chains of short sentences, and the Refiner module uses noun phrases and verb phrases to direct attention over visual tokens across space and then time. Result: Planner-Refiner outperforms existing methods on several video-language alignment tasks (such as Referring Video Object Segmentation and Temporal Grounding), especially when handling complex language prompts. The authors also introduce a new MeViS-X benchmark to assess models' ability to handle long queries. Conclusion: The Planner-Refiner framework performs well on video-language alignment tasks and shows a clear advantage on complex language prompts. Abstract: Vision-language alignment in video must address the complexity of language, evolving interacting entities, their action chains, and semantic gaps between language and vision. This work introduces Planner-Refiner, a framework to overcome these challenges. Planner-Refiner bridges the semantic gap by iteratively refining visual elements' space-time representation, guided by language until semantic gaps are minimal. A Planner module schedules language guidance by decomposing complex linguistic prompts into short sentence chains. The Refiner processes each short sentence, a noun-phrase and verb-phrase pair, to direct visual tokens' self-attention across space then time, achieving efficient single-step refinement. A recurrent system chains these steps, maintaining refined visual token representations. The final representation feeds into task-specific heads for alignment generation. We demonstrate Planner-Refiner's effectiveness on two video-language alignment tasks: Referring Video Object Segmentation and Temporal Grounding with varying language complexity. We further introduce a new MeViS-X benchmark to assess models' capability with long queries. Superior performance versus state-of-the-art methods on these benchmarks shows the approach's potential, especially for complex prompts.

[196] CoAR: Concept Injection into Autoregressive Models for Personalized Text-to-Image Generation

Fangtai Wu,Mushui Liu,Weijie He,Wanggui He,Hao Jiang,Zhao Wang,Yunlong Yu

Main category: cs.CV

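TL;DR: CoAR is a new framework for injecting subject concepts into unified AR models; it keeps all pre-trained parameters fully frozen and learns effective subject representations with a Layerwise Multimodal Context Learning strategy.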

Details

Motivation: Existing customized generation methods rely on full fine-tuning or adapters, which makes them costly and prone to overfitting or catastrophic forgetting. Method: CoAR uses a Layerwise Multimodal Context Learning strategy to learn effective, subject-specific representations, and introduces regularization to prevent overfitting and language drift. Result: CoAR achieves superior performance on both subject-driven personalization and style personalization, with significant gains in computational and memory efficiency. Conclusion: CoAR is a new framework that injects subject concepts into unified AR models while keeping all pre-trained parameters fully frozen. Abstract: The unified autoregressive (AR) model excels at multimodal understanding and generation, but its potential for customized image generation remains underexplored. Existing customized generation methods rely on full fine-tuning or adapters, making them costly and prone to overfitting or catastrophic forgetting. In this paper, we propose CoAR, a novel framework for injecting subject concepts into the unified AR models while keeping all pre-trained parameters completely frozen. CoAR learns effective, specific subject representations with only a minimal number of parameters using a Layerwise Multimodal Context Learning strategy. To address overfitting and language drift, we further introduce regularization that preserves the pre-trained distribution and anchors context tokens to improve subject fidelity and re-contextualization. Additionally, CoAR supports training-free subject customization in a user-provided style. Experiments demonstrate that CoAR achieves superior performance on both subject-driven personalization and style personalization, while delivering significant gains in computational and memory efficiency. Notably, CoAR tunes less than 0.05% of the parameters while achieving competitive performance compared to recent Proxy-Tuning. Code: https://github.com/KZF-kzf/CoAR

[197] SODiff: Semantic-Oriented Diffusion Model for JPEG Compression Artifacts Removal

Tingyu Yang,Jue Gong,Jinpei Guo,Wenbo Li,Yong Guo,Yulun Zhang

Main category: cs.CV

TL;DR: This paper proposes SODiff, a novel diffusion model for removing JPEG compression artifacts by leveraging semantic-oriented guidance and adaptive denoising, achieving superior restoration performance compared to existing techniques.

Details

Motivation: JPEG compression often introduces visual artifacts, especially at high compression ratios, and existing deep learning methods struggle to restore complex textures, leading to over-smoothed results. This necessitates a more effective approach for artifact removal that better preserves image details. Method: The paper introduces SODiff, a semantic-oriented one-step diffusion model for removing JPEG artifacts. It uses a semantic-aligned image prompt extractor (SAIPE) to extract and align features from low-quality images with the text encoder's space, while preserving reconstruction details. Additionally, it employs a quality factor-aware time predictor to adaptively select optimal denoising steps based on the image's compression quality. Result: Experimental results demonstrate that SODiff outperforms recent state-of-the-art methods in both visual quality and quantitative metrics for JPEG artifact removal. Conclusion: SODiff successfully addresses the issue of JPEG compression artifact removal by incorporating semantic-oriented guidance and a quality factor-aware time predictor, outperforming existing methods in visual quality and quantitative metrics. Abstract: JPEG, as a widely used image compression standard, often introduces severe visual artifacts when achieving high compression ratios. Although existing deep learning-based restoration methods have made considerable progress, they often struggle to recover complex texture details, resulting in over-smoothed outputs. To overcome these limitations, we propose SODiff, a novel and efficient semantic-oriented one-step diffusion model for JPEG artifacts removal. Our core idea is that effective restoration hinges on providing semantic-oriented guidance to the pre-trained diffusion model, thereby fully leveraging its powerful generative prior. To this end, SODiff incorporates a semantic-aligned image prompt extractor (SAIPE). 
SAIPE extracts rich features from low-quality (LQ) images and projects them into an embedding space semantically aligned with that of the text encoder. Simultaneously, it preserves crucial information for faithful reconstruction. Furthermore, we propose a quality factor-aware time predictor that implicitly learns the compression quality factor (QF) of the LQ image and adaptively selects the optimal denoising start timestep for the diffusion process. Extensive experimental results show that our SODiff outperforms recent leading methods in both visual quality and quantitative metrics. Code is available at: https://github.com/frakenation/SODiff

[198] GS4Buildings: Prior-Guided Gaussian Splatting for 3D Building Reconstruction

Qilin Zhang,Olaf Wysocki,Boris Jutzi

Main category: cs.CV

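TL;DR: GS4Buildings is a new Gaussian Splatting method based on semantic 3D building models that improves building surface reconstruction in large-scale urban scenes, substantially increasing reconstruction completeness and accuracy.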

Details

Motivation: 2D Gaussian Splatting (2DGS) performs poorly in large-scale, complex urban scenes, where frequent occlusions lead to incomplete building reconstructions, so a more robust method is needed. Method: GS4Buildings initializes Gaussians directly from low-level Level of Detail (LoD)2 semantic 3D building models and guides the optimization with prior depth and normal maps generated from the planar building geometry. It also introduces an optional building-focused mode that restricts reconstruction to building regions for a more efficient representation. Result: Experiments on urban datasets show a 20.5% improvement in reconstruction completeness and a 32.8% improvement in geometric accuracy, and the building-focused mode achieves a 71.8% reduction in Gaussian primitives. Conclusion: GS4Buildings effectively uses semantic 3D building models for Gaussian Splatting reconstruction, improving the robustness and scalability of building surface reconstruction and demonstrating significant gains in reconstruction completeness and geometric accuracy on urban datasets. Abstract: Recent advances in Gaussian Splatting (GS) have demonstrated its effectiveness in photo-realistic rendering and 3D reconstruction. Among these, 2D Gaussian Splatting (2DGS) is particularly suitable for surface reconstruction due to its flattened Gaussian representation and integrated normal regularization. However, its performance often degrades in large-scale and complex urban scenes with frequent occlusions, leading to incomplete building reconstructions. We propose GS4Buildings, a novel prior-guided Gaussian Splatting method leveraging the ubiquity of semantic 3D building models for robust and scalable building surface reconstruction. Instead of relying on traditional Structure-from-Motion (SfM) pipelines, GS4Buildings initializes Gaussians directly from low-level Level of Detail (LoD)2 semantic 3D building models. Moreover, we generate prior depth and normal maps from the planar building geometry and incorporate them into the optimization process, providing strong geometric guidance for surface consistency and structural accuracy. We also introduce an optional building-focused mode that limits reconstruction to building regions, achieving a 71.8% reduction in Gaussian primitives and enabling a more efficient and compact representation. Experiments on urban datasets demonstrate that GS4Buildings improves reconstruction completeness by 20.5% and geometric accuracy by 32.8%. These results highlight the potential of semantic building model integration to advance GS-based reconstruction toward real-world urban applications such as smart cities and digital twins. 
Our project is available: https://github.com/zqlin0521/GS4Buildings.

[199] Training and Inference within 1 Second -- Tackle Cross-Sensor Degradation of Real-World Pansharpening with Efficient Residual Feature Tailoring

Tianyu Xin,Jin-Liang Xiao,Zeyu Xia,Shan Yin,Liang-Jian Deng

Main category: cs.CV

TL;DR: This paper introduces a novel method for pansharpening that improves cross-sensor generalization while drastically reducing training and inference time.

Details

Motivation: Existing deep learning methods for pansharpening face challenges in generalizing across sensors, requiring either time-consuming retraining or extra data. This work aims to address these limitations. Method: The method involves modular decomposition of deep learning models, integration of a Feature Tailor at a critical interface, and efficient training with physics-aware unsupervised losses in a patch-wise manner. Result: The experiments showed significant improvements in cross-sensor performance, with training and inference times reduced to sub-seconds for smaller images and just seconds for larger images, making it over 100 times faster than zero-shot methods. Conclusion: The proposed method for pansharpening demonstrates improved generalization ability and low generalization cost, achieving state-of-the-art results in cross-sensor degradation scenarios. Abstract: Deep learning methods for pansharpening have advanced rapidly, yet models pretrained on data from a specific sensor often generalize poorly to data from other sensors. Existing methods to tackle such cross-sensor degradation include retraining the model or zero-shot methods, but they are highly time-consuming or even need extra training data. To address these challenges, our method first performs modular decomposition on deep learning-based pansharpening models, revealing a general yet critical interface where high-dimensional fused features begin mapping to the channel space of the final image. A Feature Tailor is then integrated at this interface to address cross-sensor degradation at the feature level, and is trained efficiently with physics-aware unsupervised losses. Moreover, our method operates in a patch-wise manner, training on partial patches and performing parallel inference on all patches to boost efficiency. Our method offers two key advantages: (1) $\textit{Improved Generalization Ability}$: it significantly enhances performance in cross-sensor cases. 
(2) $\textit{Low Generalization Cost}$: it achieves sub-second training and inference, requiring only partial test inputs and no external data, whereas prior methods often take minutes or even hours. Experiments on real-world data from multiple datasets demonstrate that our method achieves state-of-the-art quality and efficiency in tackling cross-sensor degradation. For example, training and inference complete within $\textit{0.2 seconds}$ for a $512\times512\times8$ image and within $\textit{3 seconds}$ for a $4000\times4000\times8$ image at the fastest setting on a commonly used RTX 3090 GPU, which is over 100 times faster than zero-shot methods.
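The patch-wise scheme mentioned in the abstract (train on a subset of patches, run inference on all of them) can be illustrated with a minimal tiling sketch. This is not the authors' code: the function names `patch_origins` and `split_for_training`, the 25% training fraction, and the choice to overlap the last patch at the border are all assumptions made for illustration.

```python
import random


def patch_origins(height, width, patch):
    """Return top-left (y, x) corners of patches that cover the whole image."""
    def starts(size):
        if size <= patch:
            return [0]
        s = list(range(0, size - patch + 1, patch))
        if s[-1] != size - patch:
            # Assumption: the last patch overlaps its neighbour to reach the border.
            s.append(size - patch)
        return s

    return [(y, x) for y in starts(height) for x in starts(width)]


def split_for_training(origins, fraction=0.25, seed=0):
    """Pick a random subset of patches for training (fraction is hypothetical)."""
    rng = random.Random(seed)
    n_train = max(1, int(len(origins) * fraction))
    return rng.sample(origins, n_train)


all_patches = patch_origins(512, 512, 128)     # inference would use every patch
train_patches = split_for_training(all_patches)  # training uses only a subset
```

Inference over `all_patches` is independent per patch, which is what makes batching them in parallel on a GPU straightforward.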

[200] DIP-GS: Deep Image Prior For Gaussian Splatting Sparse View Recovery

Rajaei Khatib,Raja Giryes

Main category: cs.CV

TL;DR: DIP-GS improves sparse-view 3D scene reconstruction by integrating a Deep Image Prior into the 3DGS framework without relying on pre-trained models.

Details

Motivation: 3D Gaussian Splatting (3DGS) struggles with sparse view reconstruction. The paper aims to enhance its performance in such scenarios. Method: DIP-GS integrates a Deep Image Prior (DIP) into the 3DGS framework, operating in a coarse-to-fine manner to improve sparse view reconstruction. Result: DIP-GS performs well in sparse-view recovery scenarios where traditional 3DGS fails, achieving state-of-the-art results. Conclusion: DIP-GS is able to achieve SOTA competitive results on various sparse-view reconstruction tasks without using any pre-trained models. Abstract: 3D Gaussian Splatting (3DGS) is a leading 3D scene reconstruction method, obtaining high-quality reconstruction with real-time rendering performance. The main idea behind 3DGS is to represent the scene as a collection of 3D Gaussians, while learning their parameters to fit the given views of the scene. While achieving superior performance in the presence of many views, 3DGS struggles with sparse view reconstruction, where the input views are sparse, do not fully cover the scene, and have low overlap. In this paper, we propose DIP-GS, a Deep Image Prior (DIP) 3DGS representation. By using the DIP prior, which utilizes internal structure and patterns, in a coarse-to-fine manner, DIP-based 3DGS can operate in scenarios where vanilla 3DGS fails, such as sparse view recovery. Note that our approach does not use any pre-trained models such as generative models and depth estimation, but rather relies only on the input frames. Among such methods, DIP-GS obtains state-of-the-art (SOTA) competitive results on various sparse-view reconstruction tasks, demonstrating its capabilities.

[201] LET-US: Long Event-Text Understanding of Scenes

Rui Chen,Xingyu Chen,Shaoan Wang,Shihan Kong,Junzhi Yu

Main category: cs.CV

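TL;DR: This paper proposes LET-US, a framework that handles long event streams through an adaptive compression mechanism to enable cross-modal understanding and analysis. It also builds a large-scale event-text aligned dataset for training and evaluation, improving the descriptive accuracy and semantic comprehension of multimodal large language models on long event streams.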

Details

Motivation: Existing multimodal large language models have achieved notable success on RGB video content, but they remain limited on the long event streams produced by event cameras: they either fail to interpret event streams effectively or can handle only very short sequences. Method: LET-US uses an adaptive compression mechanism to reduce the volume of input events while preserving critical visual details. A two-stage optimization paradigm narrows the modality gap between event streams and textual representations, and text-guided cross-modal queries, hierarchical clustering, and similarity computation extract the most representative event features. A large-scale event-text aligned dataset is also constructed to support training. Result: LET-US performs strongly on a comprehensive benchmark spanning multiple tasks (reasoning, captioning, classification, temporal localization, and moment retrieval). Experiments show that it outperforms current state-of-the-art multimodal large language models in descriptive accuracy and semantic comprehension on long event streams. Conclusion: The LET-US framework offers an effective solution for long event-stream-text understanding and extends the boundary of multimodal large language models in event-stream processing. Abstract: Event cameras output event streams as sparse, asynchronous data with microsecond-level temporal resolution, enabling visual perception with low latency and a high dynamic range. While existing Multimodal Large Language Models (MLLMs) have achieved significant success in understanding and analyzing RGB video content, they either fail to interpret event streams effectively or remain constrained to very short sequences. In this paper, we introduce LET-US, a framework for long event-stream--text comprehension that employs an adaptive compression mechanism to reduce the volume of input events while preserving critical visual details. LET-US thus establishes a new frontier in cross-modal inferential understanding over extended event sequences. To bridge the substantial modality gap between event streams and textual representations, we adopt a two-stage optimization paradigm that progressively equips our model with the capacity to interpret event-based scenes. To handle the voluminous temporal information inherent in long event streams, we leverage text-guided cross-modal queries for feature reduction, augmented by hierarchical clustering and similarity computation to distill the most representative event features. Moreover, we curate and construct a large-scale event-text aligned dataset to train our model, achieving tighter alignment of event features within the LLM embedding space. We also develop a comprehensive benchmark covering a diverse set of tasks -- reasoning, captioning, classification, temporal localization and moment retrieval. 
Experimental results demonstrate that LET-US outperforms prior state-of-the-art MLLMs in both descriptive accuracy and semantic comprehension on long-duration event streams. All datasets, codes, and models will be publicly available.

[202] ForensicsSAM: Toward Robust and Unified Image Forgery Detection and Localization Resisting to Adversarial Attack

Rongxuan Peng,Shunquan Tan,Chenqi Kong,Anwei Luo,Alex C. Kot,Jiwu Huang

Main category: cs.CV

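TL;DR: This paper proposes ForensicsSAM, a new adversarially robust IFDL framework that injects forgery experts and adversary experts and adds an adversary detector, effectively improving image forgery detection and localization performance.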

Details

Motivation: Existing PEFT methods overlook their vulnerability to adversarial attacks; this paper aims to address that problem. Method: The proposed ForensicsSAM framework injects forgery experts to strengthen the capture of forgery artifacts, designs a lightweight adversary detector to identify structured, task-specific artifacts in the RGB domain, and injects adversary experts to progressively correct feature shifts induced by adversarial noise. Result: Experiments show that ForensicsSAM achieves superior resistance to various adversarial attack methods across multiple benchmarks and reaches state-of-the-art performance in image-level forgery detection and pixel-level forgery localization. Conclusion: ForensicsSAM is a unified IFDL framework with built-in adversarial robustness that effectively resists various adversarial attacks while delivering state-of-the-art performance in image-level forgery detection and pixel-level forgery localization. Abstract: Parameter-efficient fine-tuning (PEFT) has emerged as a popular strategy for adapting large vision foundation models, such as the Segment Anything Model (SAM) and LLaVA, to downstream tasks like image forgery detection and localization (IFDL). However, existing PEFT-based approaches overlook their vulnerability to adversarial attacks. In this paper, we show that highly transferable adversarial images can be crafted solely via the upstream model, without accessing the downstream model or training data, significantly degrading the IFDL performance. To address this, we propose ForensicsSAM, a unified IFDL framework with built-in adversarial robustness. Our design is guided by three key ideas: (1) To compensate for the lack of forgery-relevant knowledge in the frozen image encoder, we inject forgery experts into each transformer block to enhance its ability to capture forgery artifacts. These forgery experts are always activated and shared across any input images. (2) To detect adversarial images, we design a lightweight adversary detector that learns to capture structured, task-specific artifacts in the RGB domain, enabling reliable discrimination across various attack methods. (3) To resist adversarial attacks, we inject adversary experts into the global attention layers and MLP modules to progressively correct feature shifts induced by adversarial noise. These adversary experts are adaptively activated by the adversary detector, thereby avoiding unnecessary interference with clean images. 
Extensive experiments across multiple benchmarks demonstrate that ForensicsSAM achieves superior resistance to various adversarial attack methods, while also delivering state-of-the-art performance in image-level forgery detection and pixel-level forgery localization. The resource is available at https://github.com/siriusPRX/ForensicsSAM.

[203] CharacterShot: Controllable and Consistent 4D Character Animation

Junyao Gao,Jiaxing Li,Wenran Liu,Yanhong Zeng,Fei Shen,Kai Chen,Yanan Sun,Cairong Zhao

Main category: cs.CV

TL;DR: This paper introduces CharacterShot, a framework for creating dynamic 3D characters from a single image and 2D pose sequence, which outperforms existing methods.

Details

Motivation: The motivation is to enable individual designers to create dynamic 3D characters from a single reference image and a 2D pose sequence in a controllable and consistent manner. Method: The paper proposes CharacterShot, which involves pretraining a 2D character animation model, lifting the model to 3D with a dual-attention module, and employing a neighbor-constrained 4D Gaussian splatting optimization. Additionally, a dataset named Character4D is constructed. Result: The experiments demonstrate that the proposed approach outperforms current state-of-the-art methods on a newly constructed benchmark called CharacterBench. Conclusion: The paper concludes that CharacterShot outperforms current state-of-the-art methods in creating dynamic 3D characters from a single reference image and a 2D pose sequence. Abstract: In this paper, we propose CharacterShot, a controllable and consistent 4D character animation framework that enables any individual designer to create dynamic 3D characters (i.e., 4D character animation) from a single reference character image and a 2D pose sequence. We begin by pretraining a powerful 2D character animation model based on a cutting-edge DiT-based image-to-video model, which allows for any 2D pose sequence as a controllable signal. We then lift the animation model from 2D to 3D by introducing a dual-attention module together with a camera prior to generate multi-view videos with spatial-temporal and spatial-view consistency. Finally, we employ a novel neighbor-constrained 4D Gaussian splatting optimization on these multi-view videos, resulting in continuous and stable 4D character representations. Moreover, to improve character-centric performance, we construct a large-scale dataset Character4D, containing 13,115 unique characters with diverse appearances and motions, rendered from multiple viewpoints. 
Extensive experiments on our newly constructed benchmark, CharacterBench, demonstrate that our approach outperforms current state-of-the-art methods. Code, models, and datasets will be publicly available at https://github.com/Jeoyal/CharacterShot.

[204] CLUE: Leveraging Low-Rank Adaptation to Capture Latent Uncovered Evidence for Image Forgery Localization

Youqi Wang,Shunquan Tan,Rongxuan Peng,Bin Li,Jiwu Huang

Main category: cs.CV

TL;DR: This paper introduces CLUE, a new method for detecting image forgeries by repurposing a text-to-image synthesis model, which outperforms existing techniques and is resistant to common post-processing attacks.

Details

Motivation: The motivation is to address the issue of visually convincing forgeries due to the increasing accessibility of image editing tools and generative AI, which compromises the authenticity of digital media. Method: CLUE (Capture Latent Uncovered Evidence) uses Low-Rank Adaptation (LoRA) to reconfigure Stable Diffusion 3 (SD3) as a forensic feature extractor, incorporating contextual features from the Segment Anything Model (SAM). Result: Extensive evaluations demonstrate CLUE's superior generalization performance and robustness against common post-processing attacks and Online Social Networks. Conclusion: The paper concludes that CLUE is a state-of-the-art forgery localization tool that outperforms previous methods and is robust against common post-processing attacks and Online Social Networks. Abstract: The increasing accessibility of image editing tools and generative AI has led to a proliferation of visually convincing forgeries, compromising the authenticity of digital media. In this paper, in addition to leveraging distortions from conventional forgeries, we repurpose the mechanism of a state-of-the-art (SOTA) text-to-image synthesis model by exploiting its internal generative process, turning it into a high-fidelity forgery localization tool. To this end, we propose CLUE (Capture Latent Uncovered Evidence), a framework that employs Low-Rank Adaptation (LoRA) to parameter-efficiently reconfigure Stable Diffusion 3 (SD3) as a forensic feature extractor. Our approach begins with the strategic use of SD3's Rectified Flow (RF) mechanism to inject noise at varying intensities into the latent representation, thereby steering the LoRA-tuned denoising process to amplify subtle statistical inconsistencies indicative of a forgery. 
To complement the latent analysis with high-level semantic context and precise spatial details, our method incorporates contextual features from the image encoder of the Segment Anything Model (SAM), which is parameter-efficiently adapted to better trace the boundaries of forged regions. Extensive evaluations demonstrate CLUE's SOTA generalization performance, significantly outperforming prior methods. Furthermore, CLUE shows superior robustness against common post-processing attacks and Online Social Networks (OSNs). Code is publicly available at https://github.com/SZAISEC/CLUE.

[205] Freeze and Reveal: Exposing Modality Bias in Vision-Language Models

Vivek Hruday Kavuri,Vysishtya Karanam,Venkata Jahnavi Venkamsetty,Kriti Madumadukala,Lakshmipathi Balaji Darur,Ponnurangam Kumaraguru

Main category: cs.CV

TL;DR: This paper investigates and addresses gender bias in Vision Language Models by identifying the sources of bias in either vision or text encoders and proposing efficient debiasing methods, CDA and DAUDoS, which reduce bias with minimal computational cost.

Details

Motivation: Vision Language Models often inherit gender biases from their training data, which can come from either the vision or text modalities. This work aims to identify the dominant source of bias and develop efficient debiasing techniques to reduce gender bias in multi-modal systems. Method: The authors used Counterfactual Data Augmentation (CDA) and Task Vector methods to dissect the contributions of vision and text backbones to gender bias. They introduced a novel metric called Degree of Stereotypicality and a corresponding debiasing method, DAUDoS. The methods were evaluated on the VisoGender benchmark using a gender-annotated dataset. Result: CDA reduced the gender gap by 6%, while DAUDoS reduced it by 3% but with only one-third of the data. Both methods improved the model's ability to correctly identify gender in images by 3%, with DAUDoS achieving this using minimal training data. The experiments revealed that CLIP's vision encoder is more biased, whereas PaliGemma2's text encoder is more biased. Conclusion: Vision Language Models (VLMs) inherit gender biases from training data, and these biases can be reduced using targeted debiasing methods like Counterfactual Data Augmentation (CDA) and Data Augmentation Using Degree of Stereotypicality (DAUDoS). The study identifies that biases in CLIP and PaliGemma2 models are primarily rooted in vision and text encoders respectively, suggesting that future multi-modal systems can adopt more targeted and effective bias mitigation strategies. Abstract: Vision Language Models achieve impressive multi-modal performance but often inherit gender biases from their training data. This bias might be coming from both the vision and text modalities. In this work, we dissect the contributions of vision and text backbones to these biases by applying targeted debiasing using Counterfactual Data Augmentation and Task Vector methods. 
Inspired by data-efficient approaches in hate-speech classification, we introduce a novel metric, Degree of Stereotypicality, and a corresponding debiasing method, Data Augmentation Using Degree of Stereotypicality (DAUDoS), to reduce bias with minimal computational cost. We curate a gender-annotated dataset and evaluate all methods on the VisoGender benchmark to quantify improvements and identify the dominant source of bias. Our results show that CDA reduces the gender gap by 6% and DAUDoS by 3%, the latter using only one-third of the data. Both methods also improve the model's ability to correctly identify gender in images by 3%, with DAUDoS achieving this improvement using only about one-third of the training data. From our experiments, we observed that CLIP's vision encoder is more biased whereas PaliGemma2's text encoder is more biased. By identifying whether bias stems more from vision or text encoders, our work enables more targeted and effective bias mitigation strategies in future multi-modal systems.
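As a rough illustration of Counterfactual Data Augmentation on the text side, the sketch below swaps gendered words in captions and adds the swapped copy to the training set. It is a generic CDA sketch and not the authors' pipeline: the word list `SWAP`, the function names, and the caption-only setting are assumptions, and the abstract does not say how the image side is handled.

```python
import re

# Hypothetical, deliberately small swap list; a real list would be much larger.
SWAP = {"he": "she", "she": "he", "man": "woman",
        "woman": "man", "boy": "girl", "girl": "boy"}
_PATTERN = re.compile(r"\b(" + "|".join(SWAP) + r")\b", re.IGNORECASE)


def counterfactual(caption):
    """Return the caption with each listed gendered word replaced by its pair."""
    def repl(match):
        word = match.group(0)
        swapped = SWAP[word.lower()]
        # Preserve an initial capital letter.
        return swapped.capitalize() if word[0].isupper() else swapped

    return _PATTERN.sub(repl, caption)


def augment(captions):
    """Keep every original caption and add its counterfactual when it differs."""
    out = []
    for caption in captions:
        out.append(caption)
        swapped = counterfactual(caption)
        if swapped != caption:
            out.append(swapped)
    return out
```

Captions without any listed word pass through unchanged, so the augmented set grows only where a gendered term is present.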

[206] Levarging Learning Bias for Noisy Anomaly Detection

Yuxin Zhang,Yunkang Cao,Yuqi Cheng,Yihan Sun,Weiming Shen

Main category: cs.CV

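TL;DR: This paper proposes a two-stage framework that exploits model learning bias for unsupervised image anomaly detection when the training data may contain unlabeled anomalies, showing strong performance and broad applicability.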

Details

Motivation: Conventional methods assume anomaly-free training data, but real-world data may be contaminated, causing models to treat anomalies as normal and degrading detection performance. A method that resists training-data contamination is therefore needed. Method: Stage 1 partitions the training set, trains sub-models, and aggregates cross-model anomaly scores to filter out a purified dataset; stage 2 trains the final detector on that dataset. Result: Experiments on the Real-IAD benchmark show that the method performs well under different noise conditions, with strong anomaly detection and localization; ablation studies also confirm its robustness to data contamination. Conclusion: The proposed two-stage framework effectively exploits the inherent learning bias of models to tackle fully unsupervised image anomaly detection, where training data may contain unlabeled anomalies, and offers broad compatibility and practicality. Abstract: This paper addresses the challenge of fully unsupervised image anomaly detection (FUIAD), where training data may contain unlabeled anomalies. Conventional methods assume anomaly-free training data, but real-world contamination leads models to absorb anomalies as normal, degrading detection performance. To mitigate this, we propose a two-stage framework that systematically exploits inherent learning bias in models. The learning bias stems from: (1) the statistical dominance of normal samples, driving models to prioritize learning stable normal patterns over sparse anomalies, and (2) feature-space divergence, where normal data exhibit high intra-class consistency while anomalies display high diversity, leading to unstable model responses. Leveraging the learning bias, stage 1 partitions the training set into subsets, trains sub-models, and aggregates cross-model anomaly scores to filter a purified dataset. Stage 2 trains the final detector on this dataset. Experiments on the Real-IAD benchmark demonstrate superior anomaly detection and localization performance under different noise conditions. Ablation studies further validate the framework's contamination resilience, emphasizing the critical role of learning bias exploitation. The model-agnostic design ensures compatibility with diverse unsupervised backbones, offering a practical solution for real-world scenarios with imperfect training data. Code is available at https://github.com/hustzhangyuxin/LLBNAD.
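The stage-1 filtering described above (partition, train sub-models, aggregate cross-model scores, filter) can be sketched on toy data. This is an illustration under stated assumptions, not the authors' implementation: the "sub-model" is a hypothetical scorer that measures distance to the mean of a 1-D feature, each sample is scored only by sub-models that did not train on it, and `purify`, `fit_mean_scorer`, `n_subsets`, and `keep_ratio` are invented names and parameters.

```python
import random
from statistics import mean


def fit_mean_scorer(subset):
    """Toy sub-model (assumption): anomaly score = distance to the subset mean."""
    center = mean(subset)
    return lambda x: abs(x - center)


def purify(samples, n_subsets=4, keep_ratio=0.9, seed=0):
    """Stage 1 sketch: partition, fit sub-models, aggregate scores, filter."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    # Partition the possibly contaminated training set into disjoint subsets.
    subsets = [indices[i::n_subsets] for i in range(n_subsets)]
    scorers = [fit_mean_scorer([samples[i] for i in part]) for part in subsets]
    # Score each sample with the sub-models trained on the other subsets
    # (an assumed reading of "cross-model"), then average the scores.
    scores = {}
    for k, part in enumerate(subsets):
        others = [s for j, s in enumerate(scorers) if j != k]
        for i in part:
            scores[i] = mean(s(samples[i]) for s in others)
    # Keep the lowest-scoring fraction as the purified training set.
    n_keep = int(len(samples) * keep_ratio)
    kept = sorted(indices, key=lambda i: scores[i])[:n_keep]
    return [samples[i] for i in sorted(kept)]


# 40 normal values in [0, 1) plus 4 unlabeled anomalies at 100.
data = [i / 40 for i in range(40)] + [100.0] * 4
purified = purify(data)
```

Stage 2 would then train the final detector on `purified`. Because normal samples dominate every subset, each sub-model's notion of "normal" stays close to the true one, which is the learning bias the abstract refers to.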

[207] Health Care Waste Classification Using Deep Learning Aligned with Nepal's Bin Color Guidelines

Suman Kunwar,Prabesh Rai

Main category: cs.CV

TL;DR: This study evaluates deep learning models for health care waste classification in Nepal, finding YOLOv5-s as the most accurate model, which has been deployed online for public use.

Details

Motivation: The increasing number of health care facilities in Nepal has led to challenges in managing health care waste (HCW). Improper segregation and disposal of HCW can cause contamination, spread infectious diseases, and endanger waste handlers. Method: The study benchmarks HCW classification models using Stratified K-fold techniques with 5 folds on combined HCW data. A repetitive ANOVA was used to assess statistical significance. Result: YOLOv5-s achieved the highest accuracy of 95.06% but had a slightly slower inference speed compared to YOLOv8-n. EfficientNet-B0 achieved 93.22% accuracy but had the highest inference time. Conclusion: YOLOv5-s is the best performing model for HCW classification in Nepal, and it has been deployed to the web for public usage. Abstract: The increasing number of health care facilities in Nepal has added to the challenges of managing health care waste (HCW). Improper segregation and disposal of HCW leads to contamination, spreads infectious diseases, and puts waste handlers at risk. This study benchmarks state-of-the-art waste classification models (ResNeXt-50, EfficientNet-B0, MobileNetV3-S, YOLOv8-n, and YOLOv5-s) using stratified K-fold cross-validation with 5 folds on combined HCW data. YOLOv5-s achieved the highest accuracy, 95.06%, but was a few milliseconds slower in inference than YOLOv8-n. EfficientNet-B0 showed promising results with 93.22% accuracy but had the highest inference time. A repetitive ANOVA was performed to assess statistical significance, and the best-performing model (YOLOv5-s) was deployed to the web with bin colors mapped according to Nepal's HCW management standards for public use. Further work on the data and on localized context is suggested.
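For readers unfamiliar with stratified K-fold splitting, the following standard-library sketch shows the idea with 5 folds: each class is dealt round-robin across folds so that class proportions stay similar in every fold. It is a simplified stand-in, not the study's code and not scikit-learn's `StratifiedKFold`; the class names in the usage lines are hypothetical.

```python
from collections import defaultdict


def stratified_folds(labels, k=5):
    """Assign each sample index to one of k folds, class by class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        # Deal the members of each class round-robin across the folds.
        for pos, idx in enumerate(members):
            folds[pos % k].append(idx)
    return folds


def train_val_splits(labels, k=5):
    """Yield (train_indices, val_indices) once per fold."""
    folds = stratified_folds(labels, k)
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val


# Hypothetical labels: 10 samples of one class, 5 of another.
labels = ["sharps"] * 10 + ["infectious"] * 5
splits = list(train_val_splits(labels))
```

With these labels, every validation fold holds two "sharps" samples and one "infectious" sample, mirroring the 2:1 ratio of the full set. A per-fold accuracy would then be computed for each model and compared across models.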

[208] AURA: A Fine-Grained Benchmark and Decomposed Metric for Audio-Visual Reasoning

Siminfar Samakoush Galougah,Rishie Raj,Sanjoy Chowdhury,Sayan Nag,Ramani Duraiswami

Main category: cs.CV

TL;DR: AURA is a new audio-visual benchmark designed to evaluate cross-modal reasoning by forcing models to use both audio and video for answers. It introduces AuraScore, which reveals that while models often get answers right, their reasoning is frequently flawed.

Details

Motivation: Current audio-visual benchmarks focus only on final answer accuracy, failing to assess the reasoning process. This makes it difficult to differentiate between genuine comprehension and correct answers achieved through flawed logic or hallucinations. Method: AURA benchmark was introduced, which includes questions across six cognitive domains that require cross-modal reasoning. A novel metric, AuraScore, was proposed to evaluate reasoning fidelity through Factual Consistency and Core Inference scores. Result: State-of-the-art models achieved high accuracy (up to 92% on some tasks) but scored below 45% on Factual Consistency and Core Inference, indicating flawed reasoning processes. Conclusion: The study concludes that despite high accuracy in answering questions, models often rely on flawed reasoning, highlighting the need for benchmarks like AURA to ensure logical and evidence-based reasoning. Abstract: Current audio-visual (AV) benchmarks focus on final answer accuracy, overlooking the underlying reasoning process. This makes it difficult to distinguish genuine comprehension from correct answers derived through flawed reasoning or hallucinations. To address this, we introduce AURA (Audio-visual Understanding and Reasoning Assessment), a benchmark for evaluating the cross-modal reasoning capabilities of Audio-Visual Large Language Models (AV-LLMs) and Omni-modal Language Models (OLMs). AURA includes questions across six challenging cognitive domains, such as causality, timbre and pitch, tempo and AV synchronization, unanswerability, implicit distractions, and skill profiling, explicitly designed to be unanswerable from a single modality. This forces models to construct a valid logical path grounded in both audio and video, setting AURA apart from AV datasets that allow uni-modal shortcuts. To assess reasoning traces, we propose a novel metric, AuraScore, which addresses the lack of robust tools for evaluating reasoning fidelity. 
It decomposes reasoning into two aspects: (i) Factual Consistency - whether reasoning is grounded in perceptual evidence, and (ii) Core Inference - the logical validity of each reasoning step. Evaluations of SOTA models on AURA reveal a critical reasoning gap: although models achieve high accuracy (up to 92% on some tasks), their Factual Consistency and Core Inference scores fall below 45%. This discrepancy highlights that models often arrive at correct answers through flawed logic, underscoring the need for our benchmark and paving the way for more robust multimodal evaluation.

[209] Novel View Synthesis with Gaussian Splatting: Impact on Photogrammetry Model Accuracy and Resolution

Pranav Chougule

Main category: cs.CV

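TL;DR: This paper compares Photogrammetry and Gaussian Splatting for 3D reconstruction and view synthesis. The author modifies the Gaussian Splatting pipeline to generate higher-quality novel views and demonstrates its potential for 3D modeling.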

Details

Motivation: The paper aims to compare Photogrammetry and Gaussian Splatting for 3D model reconstruction and view synthesis, and to explore the potential of Gaussian Splatting to improve photogrammetry quality. Method: The author creates an image dataset of a real-world scene and builds 3D models with both Photogrammetry and Gaussian Splatting for comparison. The author modifies and enhances the Gaussian Splatting repository to render images from novel camera poses generated in the Blender environment, and uses the augmented dataset to generate a new photogrammetry model. Result: The results show that Gaussian Splatting is effective at generating novel high-quality views and can improve photogrammetry-based 3D reconstruction. The strengths and limitations of both methods are also analyzed. Conclusion: The paper concludes that Gaussian Splatting performs well at generating novel high-quality views and has the potential to improve photogrammetry-based 3D reconstruction, providing valuable information for applications in extended reality (XR), photogrammetry, and autonomous vehicle simulations. Abstract: In this paper, I present a comprehensive study comparing Photogrammetry and Gaussian Splatting techniques for 3D model reconstruction and view synthesis. I created a dataset of images from a real-world scene and constructed 3D models using both methods. To evaluate the performance, I compared the models using structural similarity index (SSIM), peak signal-to-noise ratio (PSNR), learned perceptual image patch similarity (LPIPS), and lp/mm resolution based on the USAF resolution chart. A significant contribution of this work is the development of a modified Gaussian Splatting repository, which I forked and enhanced to enable rendering images from novel camera poses generated in the Blender environment. This innovation allows for the synthesis of high-quality novel views, showcasing the flexibility and potential of Gaussian Splatting. My investigation extends to an augmented dataset that includes both original ground images and novel views synthesized via Gaussian Splatting. This augmented dataset was employed to generate a new photogrammetry model, which was then compared against the original photogrammetry model created using only the original images. The results demonstrate the efficacy of using Gaussian Splatting to generate novel high-quality views and its potential to improve photogrammetry-based 3D reconstructions. The comparative analysis highlights the strengths and limitations of both approaches, providing valuable information for applications in extended reality (XR), photogrammetry, and autonomous vehicle simulations. 
Code is available at https://github.com/pranavc2255/gaussian-splatting-novel-view-render.git.
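Of the metrics listed in the abstract, PSNR has a simple closed form, PSNR = 10 log10(MAX^2 / MSE), and can be sketched in a few lines. This is a generic illustration, not code from the linked repository; it assumes images are given as flat sequences of pixel intensities with a peak value of 255.

```python
import math


def psnr(reference, rendered, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(rendered):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher values indicate a rendered view closer to the reference. SSIM and LPIPS need windowed statistics and a learned network respectively, so they are not reproduced here.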

[210] VisR-Bench: An Empirical Study on Visual Retrieval-Augmented Generation for Multilingual Long Document Understanding

Jian Chen,Ming Li,Jihyung Kil,Chenguang Wang,Tong Yu,Ryan Rossi,Tianyi Zhou,Changyou Chen,Ruiyi Zhang

Main category: cs.CV

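TL;DR: VisR-Bench is a multilingual benchmark for question-driven multimodal retrieval in long documents, with over 35K QA pairs across 1.2K documents in sixteen languages. MLLMs outperform text-based and multimodal encoder retrievers but still struggle with structured tables and low-resource languages.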

Details

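Motivation: Existing benchmarks focus on English-only document retrieval or consider multilingual question answering only on single-page images, leaving a gap for multilingual retrieval in long documents. Method: VisR-Bench is a multilingual benchmark for question-driven multimodal retrieval in long documents, comprising over 35K QA pairs across 1.2K documents in sixteen languages with three question types (figures, text, and tables), and including queries without explicit answers. Text-based methods, multimodal encoders, and MLLMs are evaluated on it. Result: MLLMs significantly outperform text-based and multimodal encoder models but still struggle with structured tables and low-resource languages. Conclusion: Structured tables and low-resource languages remain key challenges in multilingual visual retrieval. Abstract: Most organizational data in this world are stored as documents, and visual retrieval plays a crucial role in unlocking the collective intelligence from all these documents. However, existing benchmarks focus on English-only document retrieval or only consider multilingual question-answering on a single-page image. To bridge this gap, we introduce VisR-Bench, a multilingual benchmark designed for question-driven multimodal retrieval in long documents. Our benchmark comprises over 35K high-quality QA pairs across 1.2K documents, enabling fine-grained evaluation of multimodal retrieval. VisR-Bench spans sixteen languages with three question types (figures, text, and tables), offering diverse linguistic and question coverage. Unlike prior datasets, we include queries without explicit answers, preventing models from relying on superficial keyword matching. We evaluate various retrieval models, including text-based methods, multimodal encoders, and MLLMs, providing insights into their strengths and limitations. Our results show that while MLLMs significantly outperform text-based and multimodal encoder models, they still struggle with structured tables and low-resource languages, highlighting key challenges in multilingual visual retrieval.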

[211] FormCoach: Lift Smarter, Not Harder

Xiaoye Zuo,Nikos Athanasiou,Ginger Delmas,Yiming Huang,Xingyu Fu,Lingjie Liu

Main category: cs.CV

TL;DR: FormCoach is an AI-powered fitness coaching system that uses VLMs to provide real-time exercise form feedback, highlighting opportunities and challenges in embodied AI.

Details

Motivation: Expert fitness feedback is often inaccessible to at-home fitness enthusiasts, creating a need for an affordable, accessible, and intelligent coaching solution. Method: FormCoach uses vision-language models (VLMs) and a dataset of 1,700 expert-annotated video pairs to provide real-time feedback on exercise form through a web interface. Result: The system can detect subtle form errors and provide tailored corrections, though benchmarks reveal significant gaps compared to human-level coaching. Conclusion: FormCoach presents a novel approach to AI-driven fitness coaching by integrating vision-language models, showcasing both the potential and challenges of nuanced movement analysis in interactive systems. Abstract: Good form is the difference between strength and strain, yet for the fast-growing community of at-home fitness enthusiasts, expert feedback is often out of reach. FormCoach transforms a simple camera into an always-on, interactive AI training partner, capable of spotting subtle form errors and delivering tailored corrections in real time, leveraging vision-language models (VLMs). We showcase this capability through a web interface and benchmark state-of-the-art VLMs on a dataset of 1,700 expert-annotated user-reference video pairs spanning 22 strength and mobility exercises. To accelerate research in AI-driven coaching, we release both the dataset and an automated, rubric-based evaluation pipeline, enabling standardized comparison across models. Our benchmarks reveal substantial gaps compared to human-level coaching, underscoring both the challenges and opportunities in integrating nuanced, context-aware movement analysis into interactive AI systems. By framing form correction as a collaborative and creative process between humans and machines, FormCoach opens a new frontier in embodied AI.

[212] From Field to Drone: Domain Drift Tolerant Automated Multi-Species and Damage Plant Semantic Segmentation for Herbicide Trials

Artzai Picon,Itziar Eguskiza,Daniel Mugica,Javier Romero,Carlos Javier Jimenez,Eric White,Gabriel Do-Lago-Junqueira,Christian Klukas,Ramon Navarra-Mestre

Main category: cs.CV

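TL;DR: This study develops an improved segmentation model for automatically identifying crop and weed species and damage, substantially improving identification accuracy and cross-device applicability.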

Details

Motivation: Traditional manual visual assessment is time-consuming, labor-intensive, and subjective, and automated identification is challenging because of subtle visual differences. Method: The improved segmentation model combines a general-purpose self-supervised visual model with hierarchical inference based on botanical taxonomy. Result: The species identification F1-score rose from 0.52 to 0.85 and R-squared from 0.75 to 0.98; the damage classification F1-score rose from 0.28 to 0.44 and R-squared from 0.71 to 0.87. Conclusion: The model shows strong robustness in cross-device evaluation and is now deployed in BASF's phenotyping pipeline, enabling large-scale automated crop and weed monitoring. Abstract: Field trials are vital in herbicide research and development to assess effects on crops and weeds under varied conditions. Traditionally, evaluations rely on manual visual assessments, which are time-consuming, labor-intensive, and subjective. Automating species and damage identification is challenging due to subtle visual differences, but it can greatly enhance efficiency and consistency. We present an improved segmentation model combining a general-purpose self-supervised visual model with hierarchical inference based on botanical taxonomy. Trained on a multi-year dataset (2018-2020) from Germany and Spain using digital and mobile cameras, the model was tested on digital camera data (year 2023) and drone imagery from the United States, Germany, and Spain (year 2024) to evaluate robustness under domain shift. This cross-device evaluation marks a key step in assessing the model's generalization across platforms. Our model significantly improved species identification (F1-score: 0.52 to 0.85, R-squared: 0.75 to 0.98) and damage classification (F1-score: 0.28 to 0.44, R-squared: 0.71 to 0.87) over prior methods. Under domain shift (drone images), it maintained strong performance with moderate degradation (species: F1-score 0.60, R-squared 0.80; damage: F1-score 0.41, R-squared 0.62), where earlier models failed. These results confirm the model's robustness and real-world applicability. It is now deployed in BASF's phenotyping pipeline, enabling large-scale, automated crop and weed monitoring across diverse geographies.

[213] Exploring Multimodal Diffusion Transformers for Enhanced Prompt-based Image Editing

Joonghyuk Shin,Alchan Hwang,Yujin Kim,Daneul Kim,Jaesik Park

Main category: cs.CV

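TL;DR: This paper analyzes the attention mechanism of MM-DiT models and proposes a new image editing method for multimodal diffusion transformers that supports global to local edits and works across different MM-DiT variants.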

Details

Motivation: Transformer-based diffusion models have superseded traditional U-Net architectures and MM-DiT has become the dominant approach, so its attention mechanism needs in-depth analysis and editing techniques adapted to it. Method: The paper systematically analyzes MM-DiT's attention mechanism by decomposing attention matrices into four distinct blocks, and builds a new image editing method on that analysis. Result: The study reveals the inherent characteristics of MM-DiT's attention mechanism and proposes an efficient image editing method that applies to various MM-DiT variants, including few-step models that support fast editing. Conclusion: The paper proposes a robust, prompt-based image editing method for multimodal diffusion transformers (MM-DiT) and reveals the behavioral patterns of MM-DiT's attention mechanism, bridging the gap between existing U-Net-based methods and emerging architectures. Abstract: Transformer-based diffusion models have recently superseded traditional U-Net architectures, with multimodal diffusion transformers (MM-DiT) emerging as the dominant approach in state-of-the-art models like Stable Diffusion 3 and Flux.1. Previous approaches have relied on unidirectional cross-attention mechanisms, with information flowing from text embeddings to image latents. In contrast, MM-DiT introduces a unified attention mechanism that concatenates input projections from both modalities and performs a single full attention operation, allowing bidirectional information flow between text and image branches. This architectural shift presents significant challenges for existing editing techniques. In this paper, we systematically analyze MM-DiT's attention mechanism by decomposing attention matrices into four distinct blocks, revealing their inherent characteristics. Through these analyses, we propose a robust, prompt-based image editing method for MM-DiT that supports global to local edits across various MM-DiT variants, including few-step models. We believe our findings bridge the gap between existing U-Net-based methods and emerging architectures, offering deeper insights into MM-DiT's behavioral patterns.
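As a rough illustration of what decomposing a joint attention matrix into four blocks can look like, the sketch below splits a square matrix by modality. It is not taken from the paper: the function name `split_attention_blocks`, the block names, and the assumption that text tokens come first in the concatenated sequence are all hypothetical choices made for this example.

```python
def split_attention_blocks(attn, num_text_tokens):
    """Split a square joint attention matrix into four modality blocks.

    Assumption (hypothetical, not from the paper): the first
    `num_text_tokens` rows/columns are text tokens, the rest image tokens.
    Rows are treated as queries and columns as keys.
    """
    n = len(attn)
    if any(len(row) != n for row in attn):
        raise ValueError("attention matrix must be square")
    if not 0 <= num_text_tokens <= n:
        raise ValueError("num_text_tokens must be between 0 and the matrix size")

    t = num_text_tokens
    text_rows, image_rows = attn[:t], attn[t:]
    return {
        "text_query_text_key": [row[:t] for row in text_rows],
        "text_query_image_key": [row[t:] for row in text_rows],
        "image_query_text_key": [row[:t] for row in image_rows],
        "image_query_image_key": [row[t:] for row in image_rows],
    }


# Toy 3x3 matrix: one text token followed by two image tokens.
toy = [
    [0.2, 0.3, 0.5],
    [0.1, 0.6, 0.3],
    [0.4, 0.4, 0.2],
]
blocks = split_attention_blocks(toy, num_text_tokens=1)
```

In this layout, the image-query/text-key block is the one that most resembles the cross-attention maps used by U-Net-era editing methods, while the text-query/image-key block only exists because the joint attention is bidirectional.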

[214] Enhancing Reliability of Medical Image Diagnosis through Top-rank Learning with Rejection Module

Xiaotong Ji,Ryoma Bise,Seiichi Uchida

Main category: cs.CV

TL;DR: This paper proposes an enhanced top-rank learning method with a rejection module to address noisy labels and ambiguous instances in medical image processing, improving diagnostic accuracy.

Details

Motivation: Challenges in medical image processing, such as noisy labels and class-ambiguous instances, can hinder the effectiveness of top-rank learning. Method: A novel approach integrating a rejection module with top-rank learning to identify and mitigate outliers during training. Result: Experimental validation on a medical dataset showed that the methodology successfully detects and mitigates outliers. Conclusion: The proposed approach effectively improves the reliability and accuracy of medical image diagnoses by mitigating outliers through a rejection module. Abstract: In medical image processing, accurate diagnosis is of paramount importance. Leveraging machine learning techniques, particularly top-rank learning, shows significant promise by focusing on the most crucial instances. However, challenges arise from noisy labels and class-ambiguous instances, which can severely hinder the top-rank objective, as they may be erroneously placed among the top-ranked instances. To address these, we propose a novel approach that enhances top-rank learning by integrating a rejection module. Co-optimized with the top-rank loss, this module identifies and mitigates the impact of outliers that hinder training effectiveness. The rejection module functions as an additional branch, assessing instances based on a rejection function that measures their deviation from the norm. Through experimental validation on a medical dataset, our methodology demonstrates its efficacy in detecting and mitigating outliers, improving the reliability and accuracy of medical image diagnoses.

[215] Enhanced Generative Structure Prior for Chinese Text Image Super-resolution

Xiaoming Li,Wangmeng Zuo,Chen Change Loy

Main category: cs.CV

TL;DR: This paper proposes a high-quality text image super-resolution framework for Chinese characters that uses a codebook-based mechanism and StyleGAN to restore precise strokes of low-resolution characters.

Details

Motivation: Faithful text image super-resolution is challenging due to the unique structure and diverse font styles of characters, especially for complex scripts like Chinese. Method: The framework incorporates a structure prior within a StyleGAN model, using a codebook-based mechanism to restrict the generative space of StyleGAN. Each code in the codebook represents the structure of a specific character, while the vector $w$ in StyleGAN controls the character's style. Result: Experiments demonstrate that this structure prior provides robust, character-specific guidance, enabling the accurate restoration of clear strokes in degraded characters, even for real-world low-resolution Chinese text with irregular layouts. Conclusion: The paper proposes a novel framework for Chinese text image super-resolution that uses a codebook-based mechanism and StyleGAN to restore precise strokes of low-resolution characters. Abstract: Faithful text image super-resolution (SR) is challenging because each character has a unique structure and usually exhibits diverse font styles and layouts. While existing methods primarily focus on English text, less attention has been paid to more complex scripts like Chinese. In this paper, we introduce a high-quality text image SR framework designed to restore the precise strokes of low-resolution (LR) Chinese characters. Unlike methods that rely on character recognition priors to regularize the SR task, we propose a novel structure prior that offers structure-level guidance to enhance visual quality. Our framework incorporates this structure prior within a StyleGAN model, leveraging its generative capabilities for restoration. To maintain the integrity of character structures while accommodating various font styles and layouts, we implement a codebook-based mechanism that restricts the generative space of StyleGAN. 
Each code in the codebook represents the structure of a specific character, while the vector $w$ in StyleGAN controls the character's style, including typeface, orientation, and location. Through the collaborative interaction between the codebook and style, we generate a high-resolution structure prior that aligns with LR characters both spatially and structurally. Experiments demonstrate that this structure prior provides robust, character-specific guidance, enabling the accurate restoration of clear strokes in degraded characters, even for real-world LR Chinese text with irregular layouts. Our code and pre-trained models will be available at https://github.com/csxmli2016/MARCONetPlusPlus

[216] A DICOM Image De-identification Algorithm in the MIDI-B Challenge

Hongzhu Jiang,Sihan Xie,Zhiyu Wan

Main category: cs.CV

TL;DR: This paper discusses the importance of de-identifying DICOM medical images to protect patient privacy while maintaining data utility for research and diagnostics, detailing methods like pixel masking and text removal, and presents results from the MIDI-B challenge where their algorithm ranked 2nd with 99.92% accuracy.

Details

Motivation: Image de-identification is essential for publicly sharing medical images while protecting patient privacy as mandated by regulations such as HIPAA and DICOM PS3.15, and recommended by organizations like TCIA. Method: The paper details methods such as pixel masking, date shifting, date hashing, text recognition, text replacement, and text removal to process datasets during the test phase, ensuring compliance with standards like HIPAA and DICOM PS3.15. Result: The latest version of their solution algorithm achieved a 99.92% accuracy rate in executing required de-identification actions and ranked 2nd in the MIDI-B challenge. The paper also provides a comprehensive overview of standards and regulations governing DICOM image de-identification. Conclusion: The study concluded that de-identifying DICOM images is crucial for protecting patient privacy while maintaining the data's utility for research and diagnostics. Their solution algorithm ranked 2nd out of 10 teams in the MIDI-B challenge, correctly executing 99.92% of required actions. The paper also highlights the need for continued improvement in current approaches. Abstract: Image de-identification is essential for the public sharing of medical images, particularly in the widely used Digital Imaging and Communications in Medicine (DICOM) format as required by various regulations and standards, including Health Insurance Portability and Accountability Act (HIPAA) privacy rules, the DICOM PS3.15 standard, and best practices recommended by the Cancer Imaging Archive (TCIA). The Medical Image De-Identification Benchmark (MIDI-B) Challenge at the 27th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2024) was organized to evaluate rule-based DICOM image de-identification algorithms with a large dataset of clinical DICOM images. 
In this report, we explore the critical challenges of de-identifying DICOM images, emphasize the importance of removing personally identifiable information (PII) to protect patient privacy while ensuring the continued utility of medical data for research, diagnostics, and treatment, and provide a comprehensive overview of the standards and regulations that govern this process. Additionally, we detail the de-identification methods we applied - such as pixel masking, date shifting, date hashing, text recognition, text replacement, and text removal - to process datasets during the test phase in strict compliance with these standards. According to the final leaderboard of the MIDI-B challenge, the latest version of our solution algorithm correctly executed 99.92% of the required actions and ranked 2nd out of 10 teams that completed the challenge (from a total of 22 registered teams). Finally, we conducted a thorough analysis of the resulting statistics and discussed the limitations of current approaches and potential avenues for future improvement.
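To make two of the listed techniques concrete, here is a minimal sketch of date shifting driven by a keyed hash. It is an illustration under stated assumptions, not the authors' algorithm: the function names, the HMAC-SHA256 derivation, and the 365-day offset range are hypothetical. The only DICOM fact relied on is that DA (date) values use the `YYYYMMDD` format.

```python
import hashlib
import hmac
from datetime import datetime, timedelta


def patient_offset_days(patient_id, secret, max_days=365):
    """Derive a deterministic per-patient offset in [1, max_days].

    Hypothetical scheme: HMAC-SHA256 over the patient ID with a secret key,
    so the same patient always gets the same shift without storing a table.
    """
    digest = hmac.new(secret.encode("utf-8"),
                      patient_id.encode("utf-8"),
                      hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big") % max_days + 1


def shift_dicom_date(da_value, offset_days):
    """Shift a DICOM DA string (YYYYMMDD) by a number of days."""
    parsed = datetime.strptime(da_value, "%Y%m%d")
    return (parsed + timedelta(days=offset_days)).strftime("%Y%m%d")


# Usage: shift every date of one patient backwards by the same offset,
# which keeps the intervals between that patient's studies intact.
offset = patient_offset_days("PATIENT-001", secret="example-key")
shifted_study_date = shift_dicom_date("20240131", -offset)
```

Applying one offset per patient, rather than one per date, is the design choice that preserves longitudinal spacing between studies; whether the challenge submission did this is not stated in the abstract.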

[217] Domain Generalization of Pathological Image Segmentation by Patch-Level and WSI-Level Contrastive Learning

Yuki Shigeyasu,Shota Harada,Akihiko Yoshizawa,Kazuhiro Terada,Naoki Nakazima,Mariyo Kurata,Hiroyuki Abe,Tetsuo Ushiku,Ryoma Bise

Main category: cs.CV

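TL;DR: This paper proposes a new domain generalization method that clusters WSI-level features from non-tumor regions within a single hospital and applies two-stage contrastive learning (WSI-level and patch-level) to reduce feature gaps between domains and mitigate domain shift.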

Details

Motivation: Traditional domain generalization methods rely on multi-hospital data, but data collection challenges often make this impractical. This paper therefore focuses on domain shift within a hospital. Method: WSI-level features from non-tumor regions are clustered and the clusters are treated as domains; a two-stage approach with WSI-level and patch-level contrastive learning reduces feature gaps between domains. Result: Contrastive learning effectively reduces feature gaps between WSI pairs from different clusters, mitigating domain shift. Conclusion: The paper proposes a new domain generalization method that addresses domain shift in pathological images by exploiting whole slide image (WSI)-level domain shifts within a hospital rather than relying on multi-hospital data. Abstract: In this paper, we address domain shifts in pathological images by focusing on shifts within whole slide images (WSIs), such as patient characteristics and tissue thickness, rather than shifts between hospitals. Traditional approaches rely on multi-hospital data, but data collection challenges often make this impractical. Therefore, the proposed domain generalization method captures and leverages intra-hospital domain shifts by clustering WSI-level features from non-tumor regions and treating these clusters as domains. To mitigate domain shift, we apply contrastive learning to reduce feature gaps between WSI pairs from different clusters. The proposed method introduces a two-stage contrastive learning approach, WSI-level and patch-level contrastive learning, to minimize these gaps effectively.

[218] CoT-Pose: Chain-of-Thought Reasoning for 3D Pose Generation from Abstract Prompts

Junuk Cha,Jihyeon Kim

Main category: cs.CV

TL;DR: This paper introduces CoT-Pose, a novel framework that uses CoT reasoning to generate 3D human poses from abstract textual inputs, improving alignment with natural human communication.

Details

Motivation: The motivation stems from the limitations of current text-to-pose models that rely on low-level, detailed prompts, which do not align with how humans naturally communicate actions and intentions using abstract language. Method: The authors introduced a novel framework called CoT-Pose that integrates chain-of-thought (CoT) reasoning into the pose generation process. They also proposed a data synthesis pipeline to generate training triplets of abstract prompts, detailed prompts, and 3D poses. Result: Experimental results showed that the CoT-Pose model can effectively generate plausible and semantically aligned 3D human poses from abstract textual inputs, addressing the mismatch between human communication and existing model requirements. Conclusion: The study concludes that incorporating CoT reasoning into pose generation allows for the effective interpretation of abstract textual inputs into accurate 3D human poses, highlighting the importance of high-level understanding in this field. Abstract: Recent advances in multi-modal large language models (MLLMs) and chain-of-thought (CoT) reasoning have led to significant progress in image and text generation tasks. However, the field of 3D human pose generation still faces critical limitations. Most existing text-to-pose models rely heavily on detailed (low-level) prompts that explicitly describe joint configurations. In contrast, humans tend to communicate actions and intentions using abstract (high-level) language. This mismatch results in a practical challenge for deploying pose generation systems in real-world scenarios. To bridge this gap, we introduce a novel framework that incorporates CoT reasoning into the pose generation process, enabling the interpretation of abstract prompts into accurate 3D human poses. We further propose a data synthesis pipeline that automatically generates triplets of abstract prompts, detailed prompts, and corresponding 3D poses for the training process. 
Experimental results demonstrate that our reasoning-enhanced model, CoT-Pose, can effectively generate plausible and semantically aligned poses from abstract textual inputs. This work highlights the importance of high-level understanding in pose generation and opens new directions for reasoning-enhanced approaches to human pose generation.

[219] Commentary Generation for Soccer Highlights

Chidaksh Ravuru

Main category: cs.CV

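TL;DR: This paper extends the MatchVoice model to commentary generation for soccer highlights and finds that it generalizes well, though techniques from other video-language domains are still needed to improve performance.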

Details

Motivation: Existing soccer commentary generation frameworks such as SoccerNet-Caption cannot achieve fine-grained alignment between video content and commentary, so models need to be improved for better temporal synchronization and descriptive accuracy. Method: The MatchVoice model is extended to commentary generation for soccer highlights on the GOAL dataset and evaluated experimentally under different training configurations and hardware conditions. The effect of different window sizes on zero-shot performance is also explored. Result: The MatchTime results were successfully reproduced, and the impact of different training configurations and hardware limitations on model performance was evaluated. MatchVoice also showed good generalization in soccer highlight commentary generation. Conclusion: MatchVoice shows good generalization, but techniques from broader video-language domains are needed to further improve performance. Abstract: Automated soccer commentary generation has evolved from template-based systems to advanced neural architectures, aiming to produce real-time descriptions of sports events. While frameworks like SoccerNet-Caption laid foundational work, their inability to achieve fine-grained alignment between video content and commentary remains a significant challenge. Recent efforts such as MatchTime, with its MatchVoice model, address this issue through coarse and fine-grained alignment techniques, achieving improved temporal synchronization. In this paper, we extend MatchVoice to commentary generation for soccer highlights using the GOAL dataset, which emphasizes short clips over entire games. We conduct extensive experiments to reproduce the original MatchTime results and evaluate our setup, highlighting the impact of different training configurations and hardware limitations. Furthermore, we explore the effect of varying window sizes on zero-shot performance. While MatchVoice exhibits promising generalization capabilities, our findings suggest the need for integrating techniques from broader video-language domains to further enhance performance. Our code is available at https://github.com/chidaksh/SoccerCommentary.

[220] Adaptive Pseudo Label Selection for Individual Unlabeled Data by Positive and Unlabeled Learning

Takehiro Yamane,Itaru Tsuge,Susumu Saito,Ryoma Bise

Main category: cs.CV

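TL;DR: This paper proposes a new pseudo-labeling method for medical image segmentation based on PU learning, which performs well at selecting pseudo-labels for various background regions.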

Details

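Motivation: To learn on individual images and select effective pseudo-labels in medical image segmentation, the method addresses the difficulty of selecting pseudo-labels for various background regions. Method: The paper introduces PU learning, which uses only positive and unlabeled data for binary classification, to obtain an appropriate metric for discriminating foreground and background regions in each unlabeled image. Result: Experimental results show that the proposed method is effective. Conclusion: The paper proposes a new pseudo-labeling method for medical image segmentation that can effectively select pseudo-labels for various background regions. Abstract: This paper proposes a novel pseudo-labeling method for medical image segmentation that can perform learning on "individual images" to select effective pseudo-labels. We introduce Positive and Unlabeled Learning (PU learning), which uses only positive and unlabeled data for binary classification problems, to obtain the appropriate metric for discriminating foreground and background regions on each unlabeled image. Our PU learning makes it easy to select pseudo-labels for various background regions. The experimental results show the effectiveness of our method.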

[221] Decoupled Functional Evaluation of Autonomous Driving Models via Feature Map Quality Scoring

Ludan Zhang,Sihan Wang,Yuqi Dai,Shuofei Qiao,Lei He

Main category: cs.CV

TL;DR: This paper proposes a new evaluation framework for feature maps in end-to-end autonomous driving models, leading to enhanced performance in 3D object detection.

Details

Motivation: The motivation is to address the lack of explicit supervision signals for intermediate modules in end-to-end autonomous driving models, which limits interpretability and evaluation capabilities. Method: The study proposes an independent evaluation method based on Feature Map Convergence Score (FMCS) and constructs a Dual-Granularity Dynamic Weighted Scoring System (DG-DWSS). It also develops a CLIP-based Feature Map Quality Evaluation Network (CLIP-FMQE-Net) for real-time analysis. Result: Experiments on the NuScenes dataset showed a 3.89 percent gain in NDS, demonstrating improved 3D object detection performance with the proposed method. Conclusion: The study concludes that the proposed FMCS and DG-DWSS framework effectively improves the quality of feature maps and overall model performance in end-to-end autonomous driving systems. Abstract: End-to-end models are emerging as the mainstream in autonomous driving perception and planning. However, the lack of explicit supervision signals for intermediate functional modules leads to opaque operational mechanisms and limited interpretability, making it challenging for traditional methods to independently evaluate and train these modules. As an early effort on this issue, this study builds upon the feature map-truth representation similarity-based evaluation framework and proposes an independent evaluation method based on Feature Map Convergence Score (FMCS). A Dual-Granularity Dynamic Weighted Scoring System (DG-DWSS) is constructed, formulating a unified quantitative metric - Feature Map Quality Score - to enable comprehensive evaluation of the quality of feature maps generated by functional modules. A CLIP-based Feature Map Quality Evaluation Network (CLIP-FMQE-Net) is further developed, combining feature-truth encoders and quality score prediction heads to enable real-time quality analysis of feature maps generated by functional modules. 
Experimental results on the NuScenes dataset demonstrate that integrating our evaluation module into the training improves 3D object detection performance, achieving a 3.89 percent gain in NDS. These results verify the effectiveness of our method in enhancing feature representation quality and overall model performance.

[222] Splat4D: Diffusion-Enhanced 4D Gaussian Splatting for Temporally and Spatially Consistent Content Creation

Minghao Yin,Yukang Cao,Songyou Peng,Kai Han

Main category: cs.CV

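TL;DR: Splat4D is a new framework for generating high-quality 4D content from monocular video that maintains spatial-temporal consistency and supports guidance by user instructions.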

Details

Motivation: Generating high-quality 4D content from monocular video faces challenges in spatial-temporal consistency, detail preservation, and user guidance. Method: The framework uses multi-view rendering, inconsistency identification, a video diffusion model, and an asymmetric U-Net for refinement. Result: On multiple public benchmarks, Splat4D shows state-of-the-art performance across various metrics. Conclusion: The Splat4D framework performs well at generating high-quality 4D content with spatial-temporal consistency and can generate and edit content following user instructions. Abstract: Generating high-quality 4D content from monocular videos for applications such as digital humans and AR/VR poses challenges in ensuring temporal and spatial consistency, preserving intricate details, and incorporating user guidance effectively. To overcome these challenges, we introduce Splat4D, a novel framework enabling high-fidelity 4D content generation from a monocular video. Splat4D achieves superior performance while maintaining faithful spatial-temporal coherence by leveraging multi-view rendering, inconsistency identification, a video diffusion model, and an asymmetric U-Net for refinement. Through extensive evaluations on public benchmarks, Splat4D consistently demonstrates state-of-the-art performance across various metrics, underscoring the efficacy of our approach. Additionally, the versatility of Splat4D is validated in various applications such as text/image conditioned 4D generation, 4D human generation, and text-guided content editing, producing coherent outcomes following user instructions.

[223] Adaptive Cache Enhancement for Test-Time Adaptation of Vision-Language Models

Khanh-Binh Nguyen,Phuoc-Nguyen Bui,Hyunseung Choo,Duc Thanh Nguyen

Main category: cs.CV

TL;DR: The paper proposes the Adaptive Cache Enhancement (ACE) framework to improve the robustness and generalization of vision-language models (VLMs) under distribution shifts, achieving state-of-the-art performance among test-time adaptation (TTA) methods.

Details

Motivation: Vision-language models (VLMs) suffer performance degradation under distribution shifts in downstream tasks, particularly without labeled data. Existing cache-based TTA methods face challenges with unreliable confidence metrics and rigid decision boundaries, prompting the need for a more robust solution. Method: The Adaptive Cache Enhancement (ACE) framework constructs a robust cache by selectively storing high-confidence or low-entropy image embeddings per class, using dynamic, class-specific thresholds refined via an exponential moving average and exploration-augmented updates. Result: Extensive experiments on 15 diverse benchmark datasets show that ACE outperforms existing TTA methods, delivering superior robustness and generalization in challenging out-of-distribution scenarios. Conclusion: The ACE framework successfully addresses the challenges faced by cache-based TTA methods, achieving state-of-the-art performance with improved robustness and generalization in out-of-distribution scenarios. Abstract: Vision-language models (VLMs) exhibit remarkable zero-shot generalization but suffer performance degradation under distribution shifts in downstream tasks, particularly in the absence of labeled data. Test-Time Adaptation (TTA) addresses this challenge by enabling online optimization of VLMs during inference, eliminating the need for annotated data. Cache-based TTA methods exploit historical knowledge by maintaining a dynamic memory cache of low-entropy or high-confidence samples, promoting efficient adaptation to out-of-distribution data. Nevertheless, these methods face two critical challenges: (1) unreliable confidence metrics under significant distribution shifts, resulting in error accumulation within the cache and degraded adaptation performance; and (2) rigid decision boundaries that fail to accommodate substantial distributional variations, leading to suboptimal predictions. 
To overcome these limitations, we introduce the Adaptive Cache Enhancement (ACE) framework, which constructs a robust cache by selectively storing high-confidence or low-entropy image embeddings per class, guided by dynamic, class-specific thresholds initialized from zero-shot statistics and iteratively refined using an exponential moving average and exploration-augmented updates. This approach enables adaptive, class-wise decision boundaries, ensuring robust and accurate predictions across diverse visual distributions. Extensive experiments on 15 diverse benchmark datasets demonstrate that ACE achieves state-of-the-art performance, delivering superior robustness and generalization compared to existing TTA methods in challenging out-of-distribution scenarios.
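The idea of class-specific thresholds refined by an exponential moving average can be sketched as follows. This is a simplified illustration, not the ACE implementation: the class name `ClassThresholds`, the momentum value, and the use of raw confidence as the tracked statistic are assumptions, and the exploration-augmented updates mentioned in the abstract are left out.

```python
class ClassThresholds:
    """Per-class confidence thresholds updated with an exponential moving average.

    Hypothetical sketch: `initial` stands in for thresholds derived from
    zero-shot statistics; how those are computed is not covered here.
    """

    def __init__(self, initial, momentum=0.9):
        if not 0.0 <= momentum <= 1.0:
            raise ValueError("momentum must be in [0, 1]")
        self.thresholds = dict(initial)
        self.momentum = momentum

    def should_cache(self, class_id, confidence):
        """Admit a sample to the cache only if it clears its class threshold."""
        return confidence >= self.thresholds[class_id]

    def update(self, class_id, confidence):
        """Move the class threshold toward the newly observed confidence."""
        old = self.thresholds[class_id]
        new = self.momentum * old + (1.0 - self.momentum) * confidence
        self.thresholds[class_id] = new
        return new


# Usage: class 0 starts lenient, class 1 starts strict.
tracker = ClassThresholds({0: 0.5, 1: 0.8}, momentum=0.9)
if tracker.should_cache(0, 0.6):
    tracker.update(0, 0.6)
```

Keeping one threshold per class lets easy classes demand higher confidence while hard classes can still contribute cache entries, which is the intuition behind class-wise decision boundaries in the abstract.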

[224] Exploiting Layer Normalization Fine-tuning in Visual Transformer Foundation Models for Classification

Zhaorui Tan,Tan Pan,Kaizhu Huang,Weimiao Yu,Kai Yao,Chen Jiang,Qiufeng Wang,Anh Nguyen,Xin Guo,Yuan Cheng,Xi Yang

Main category: cs.CV

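TL;DR: This paper studies the fine-tuning dynamics of LayerNorm layers in Vision Transformers under data scarcity and domain shift, and proposes a fine-tuning framework based on the Fine-tuning Shift Ratio (FSR) and a scalar λ to improve model performance on data from different distributions.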

Details

Motivation: LayerNorm is pivotal in Vision Transformers (ViTs), but its fine-tuning dynamics under data scarcity and domain shift have not been studied in depth. Method: The paper proposes a simple yet effective rescaling mechanism using a scalar λ that is negatively correlated with FSR, combined with a cyclic framework that enhances LayerNorm fine-tuning. Result: Experiments across various target training sample regimes validate the proposed framework. In particular, with scarce data, OOD tasks show lower FSR and higher λ, indicating under-represented target training samples. Conclusion: The paper proposes a LayerNorm fine-tuning framework based on the Fine-tuning Shift Ratio (FSR) to address fine-tuning of Vision Transformers (ViTs) under data scarcity and domain shift. Abstract: LayerNorm is pivotal in Vision Transformers (ViTs), yet its fine-tuning dynamics under data scarcity and domain shifts remain underexplored. This paper shows that shifts in LayerNorm parameters after fine-tuning (LayerNorm shifts) are indicative of the transitions between source and target domains; its efficacy is contingent upon the degree to which the target training samples accurately represent the target domain, as quantified by our proposed Fine-tuning Shift Ratio ($FSR$). Building on this, we propose a simple yet effective rescaling mechanism using a scalar $\lambda$ that is negatively correlated with $FSR$ to align learned LayerNorm shifts with those ideal shifts achieved under fully representative data, combined with a cyclic framework that further enhances the LayerNorm fine-tuning. Extensive experiments across natural and pathological images, in both in-distribution (ID) and out-of-distribution (OOD) settings, and various target training sample regimes validate our framework. Notably, OOD tasks tend to yield lower $FSR$ and higher $\lambda$ in comparison to ID cases, especially with scarce data, indicating under-represented target training samples. Moreover, ViTFs fine-tuned on pathological data behave more like ID settings, favoring conservative LayerNorm updates. Our findings illuminate the underexplored dynamics of LayerNorm in transfer learning and provide practical strategies for LayerNorm fine-tuning.

[225] GAPNet: A Lightweight Framework for Image and Video Salient Object Detection via Granularity-Aware Paradigm

Yu-Huan Wu,Wei Liu,Zi-Xuan Zhu,Zizhou Wang,Yong Liu,Liangli Zhen

Main category: cs.CV

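TL;DR: This paper proposes GAPNet, a lightweight network for salient object detection that uses granularity-aware connections, efficient feature fusion modules, and a self-attention mechanism to achieve state-of-the-art performance among lightweight models on image and video tasks.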

Details

Motivation: Existing salient object detection (SOD) models typically rely on computationally heavy backbones, which limits their use in real-world settings such as edge devices. This paper therefore proposes a lightweight solution. Method: GAPNet adopts a granularity-aware paradigm comprising granularity-aware connections, granular pyramid convolution (GPC) and cross-scale attention (CSA) modules, and a self-attention module on top of the encoder, to optimize feature utilization and semantic interpretation. Result: GAPNet achieves new state-of-the-art performance among lightweight image and video SOD models while keeping computational cost low. Conclusion: GAPNet delivers lightweight salient object detection with state-of-the-art performance among lightweight image and video SOD models, while supporting practical use on edge devices. Abstract: Recent salient object detection (SOD) models predominantly rely on heavyweight backbones, incurring substantial computational cost and hindering their practical application in various real-world settings, particularly on edge devices. This paper presents GAPNet, a lightweight network built on the granularity-aware paradigm for both image and video SOD. We assign saliency maps of different granularities to supervise the multi-scale decoder side-outputs: coarse object locations for high-level outputs and fine-grained object boundaries for low-level outputs. Specifically, our decoder is built with granularity-aware connections which fuse high-level features of low granularity and low-level features of high granularity, respectively. To support these connections, we design granular pyramid convolution (GPC) and cross-scale attention (CSA) modules for efficient fusion of low-scale and high-scale features, respectively. On top of the encoder, a self-attention module is built to learn global information, enabling accurate object localization with negligible computational cost. Unlike traditional U-Net-based approaches, our proposed method optimizes feature utilization and semantic interpretation while applying appropriate supervision at each processing stage. Extensive experiments show that the proposed method achieves a new state-of-the-art performance among lightweight image and video SOD models. Code is available at https://github.com/yuhuan-wu/GAPNet.

[226] Voice Pathology Detection Using Phonation

Sri Raksha Siva,Nived Suthahar,Prakash Boominathan,Uma Ranjan

Main category: cs.CV

TL;DR: This study presents a noninvasive, machine learning-based framework for detecting voice pathologies using phonation data from the Saarbrücken Voice Database. Acoustic features and RNNs, including LSTM and attention mechanisms, are employed to classify samples into normal and pathological categories. Data augmentation techniques and preprocessing enhance model generalizability and signal quality. The proposed framework serves as an automated diagnostic tool for early detection of voice pathologies, contributing to AI-driven healthcare and improved patient outcomes.

Details

Motivation: Voice disorders significantly affect communication and quality of life, requiring an early and accurate diagnosis. Traditional methods like laryngoscopy are invasive, subjective, and often inaccessible. Method: Phonation data from the Saarbrücken Voice Database are analyzed using acoustic features such as Mel Frequency Cepstral Coefficients (MFCCs), chroma features, and Mel spectrograms. Recurrent Neural Networks (RNNs), including LSTM and attention mechanisms, classify samples into normal and pathological categories. Data augmentation techniques, including pitch shifting and Gaussian noise addition, enhance model generalizability, while preprocessing ensures signal quality. Scale-based features, such as Hölder and Hurst exponents, further capture signal irregularities and long-term dependencies. Result: This research proposes a noninvasive, machine learning-based framework for detecting voice pathologies using phonation data. Conclusion: The proposed framework offers a noninvasive, automated diagnostic tool for early detection of voice pathologies, supporting AI-driven healthcare and improving patient outcomes. Abstract: Voice disorders significantly affect communication and quality of life, requiring an early and accurate diagnosis. Traditional methods like laryngoscopy are invasive, subjective, and often inaccessible. This research proposes a noninvasive, machine learning-based framework for detecting voice pathologies using phonation data. Phonation data from the Saarbrücken Voice Database are analyzed using acoustic features such as Mel Frequency Cepstral Coefficients (MFCCs), chroma features, and Mel spectrograms. Recurrent Neural Networks (RNNs), including LSTM and attention mechanisms, classify samples into normal and pathological categories. Data augmentation techniques, including pitch shifting and Gaussian noise addition, enhance model generalizability, while preprocessing ensures signal quality. 
Scale-based features, such as Hölder and Hurst exponents, further capture signal irregularities and long-term dependencies. The proposed framework offers a noninvasive, automated diagnostic tool for early detection of voice pathologies, supporting AI-driven healthcare and improving patient outcomes.
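The Gaussian noise augmentation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the noise level or the sampling setup, so the target signal-to-noise ratio (`snr_db`), the synthetic sine "phonation", and the function names below are all assumptions made for illustration, not the authors' pipeline.

```python
import math
import random


def make_sine(freq_hz=220.0, sample_rate=16000, n_samples=16000):
    """Synthetic stand-in for a sustained vowel (hypothetical test signal)."""
    return [math.sin(2.0 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]


def add_gaussian_noise(signal, snr_db, seed=None):
    """Return a copy of `signal` with white Gaussian noise at roughly `snr_db` dB SNR."""
    rng = random.Random(seed)
    # Average signal power, then the noise power implied by the target SNR.
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


if __name__ == "__main__":
    clean = make_sine()
    noisy = add_gaussian_noise(clean, snr_db=20.0, seed=0)
    print(len(noisy), "samples augmented")
```

Fixing the SNR rather than the noise variance keeps the augmentation strength comparable across recordings of different loudness, which is one plausible design choice among several.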

[227] From Prediction to Explanation: Multimodal, Explainable, and Interactive Deepfake Detection Framework for Non-Expert Users

Shahroz Tariq,Simon S. Woo,Priyanka Singh,Irena Irmalasari,Saakshi Gupta,Dev Gupta

Main category: cs.CV

TL;DR: This paper introduces DF-P2E, a novel interpretable framework for deepfake detection that combines visual, semantic, and narrative explanations to support human reasoning.

Details

Motivation: Existing deepfake detection systems lack transparency and interpretability, limiting their usability in real-world contexts, especially for non-expert users. There is a need for systems that align detection with human reasoning. Method: DF-P2E integrates visual, semantic, and narrative explanation layers. It includes a deepfake classifier with Grad-CAM visualization, a visual captioning module, and a narrative refinement module using a fine-tuned LLM. Result: Experiments on the DF40 benchmark show competitive detection performance and high-quality explanations that align with Grad-CAM activations, making deepfake detection interpretable and accessible. Conclusion: The paper concludes that DF-P2E provides a scalable and interpretable framework for deepfake detection, aligning detection performance with high-quality, context-aware explanations to advance trustworthy AI systems. Abstract: The proliferation of deepfake technologies poses urgent challenges and serious risks to digital integrity, particularly within critical sectors such as forensics, journalism, and the legal system. While existing detection systems have made significant progress in classification accuracy, they typically function as black-box models, offering limited transparency and minimal support for human reasoning. This lack of interpretability hinders their usability in real-world decision-making contexts, especially for non-expert users. In this paper, we present DF-P2E (Deepfake: Prediction to Explanation), a novel multimodal framework that integrates visual, semantic, and narrative layers of explanation to make deepfake detection interpretable and accessible. 
The framework consists of three modular components: (1) a deepfake classifier with Grad-CAM-based saliency visualisation, (2) a visual captioning module that generates natural language summaries of manipulated regions, and (3) a narrative refinement module that uses a fine-tuned Large Language Model (LLM) to produce context-aware, user-sensitive explanations. We instantiate and evaluate the framework on the DF40 benchmark, the most diverse deepfake dataset to date. Experiments demonstrate that our system achieves competitive detection performance while providing high-quality explanations aligned with Grad-CAM activations. By unifying prediction and explanation in a coherent, human-aligned pipeline, this work offers a scalable approach to interpretable deepfake detection, advancing the broader vision of trustworthy and transparent AI systems in adversarial media environments.

[228] ShoulderShot: Generating Over-the-Shoulder Dialogue Videos

Yuang Zhang,Junqi Cheng,Haoyu Zhao,Jiaxi Gu,Fangyuan Zou,Zenghui Lu,Peng Shu

Main category: cs.CV

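TL;DR: This paper proposes ShoulderShot, a framework for generating over-the-shoulder dialogue videos that achieves character consistency and long dialogue generation.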

Details

Motivation: Over-the-shoulder dialogue videos are essential in films and advertisements, yet they remain largely unexplored in video generation research. Method: Combines dual-shot generation with looping video to maintain character consistency across shots and enable extended dialogues. Result: The results show that ShoulderShot surpasses existing methods in shot-reverse-shot layout, spatial continuity, and flexibility in dialogue length. Conclusion: ShoulderShot offers a novel video generation framework that outperforms existing methods at dialogue video generation. Abstract: Over-the-shoulder dialogue videos are essential in films, short dramas, and advertisements, providing visual variety and enhancing viewers' emotional connection. Despite their importance, such dialogue scenes remain largely underexplored in video generation research. The main challenges include maintaining character consistency across different shots, creating a sense of spatial continuity, and generating long, multi-turn dialogues within limited computational budgets. Here, we present ShoulderShot, a framework that combines dual-shot generation with looping video, enabling extended dialogues while preserving character consistency. Our results demonstrate capabilities that surpass existing methods in terms of shot-reverse-shot layout, spatial continuity, and flexibility in dialogue length, thereby opening up new possibilities for practical dialogue video generation. Videos and comparisons are available at https://shouldershot.github.io.

[229] LaVieID: Local Autoregressive Diffusion Transformers for Identity-Preserving Video Creation

Wenhui Song,Hanhui Li,Jiehui Huang,Panwen Hu,Yuhao Cheng,Long Chen,Yiqiang Yan,Xiaodan Liang

Main category: cs.CV

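TL;DR: LaVieID introduces a local router and a temporal autoregressive module that mitigate the loss of identity information when diffusion transformers generate video, enabling high-fidelity personalized videos.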

Details

Motivation: LaVieID is motivated by the need to mitigate, from both spatial and temporal perspectives, the loss of identity information inherent in the stochastic global generation process of diffusion transformers (DiTs). Method: LaVieID introduces a local router that explicitly represents latent states as weighted combinations of fine-grained local facial structures, and integrates a temporal autoregressive module that exploits the long-range temporal dependencies of latent tokens to predict biases for rectifying the tokens. Result: LaVieID generates high-fidelity personalized videos and achieves state-of-the-art performance on the identity-preserving text-to-video task. Conclusion: LaVieID is a new local autoregressive video diffusion framework for the challenging identity-preserving text-to-video task; it generates high-fidelity personalized videos and achieves state-of-the-art performance. Abstract: In this paper, we present LaVieID, a novel local autoregressive video diffusion framework designed to tackle the challenging identity-preserving text-to-video task. The key idea of LaVieID is to mitigate the loss of identity information inherent in the stochastic global generation process of diffusion transformers (DiTs) from both spatial and temporal perspectives. Specifically, unlike the global and unstructured modeling of facial latent states in existing DiTs, LaVieID introduces a local router to explicitly represent latent states by weighted combinations of fine-grained local facial structures. This alleviates undesirable feature interference and encourages DiTs to capture distinctive facial characteristics. Furthermore, a temporal autoregressive module is integrated into LaVieID to refine denoised latent tokens before video decoding. This module divides latent tokens temporally into chunks, exploiting their long-range temporal dependencies to predict biases for rectifying tokens, thereby significantly enhancing inter-frame identity consistency. Consequently, LaVieID can generate high-fidelity personalized videos and achieve state-of-the-art performance. Our code and models are available at https://github.com/ssugarwh/LaVieID.

[230] X2Edit: Revisiting Arbitrary-Instruction Image Editing through Self-Constructed Data and Task-Aware Representation Learning

Jian Ma,Xujie Zhu,Zihao Pan,Qirong Peng,Xu Guo,Chen Chen,Haonan Lu

Main category: cs.CV

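TL;DR: This paper proposes X2Edit, comprising a large-scale dataset covering 14 editing tasks and an efficient model training method based on FLUX.1, which significantly improves image editing performance.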

Details

Motivation: To address the shortcomings of existing open-source datasets for arbitrary-instruction image editing, and the lack of a plug-and-play editing module compatible with popular generative models. Method: Designs the X2Edit Dataset and a task-aware MoE-LoRA training method, combined with contrastive learning to improve model performance. Result: Builds a dataset of 3.7 million high-quality samples and, using only 8% of the full model's parameters, achieves editing performance competitive with other strong models. Conclusion: X2Edit contributes both a comprehensive dataset and an efficient training method that yields strong results in image editing, and it has been open-sourced. Abstract: Existing open-source datasets for arbitrary-instruction image editing remain suboptimal, while a plug-and-play editing module compatible with community-prevalent generative models is notably absent. In this paper, we first introduce the X2Edit Dataset, a comprehensive dataset covering 14 diverse editing tasks, including subject-driven generation. We utilize the industry-leading unified image generation models and expert models to construct the data. Meanwhile, we design reasonable editing instructions with the VLM and implement various scoring mechanisms to filter the data. As a result, we construct 3.7 million high-quality data with balanced categories. Second, to better integrate seamlessly with community image generation models, we design task-aware MoE-LoRA training based on FLUX.1, with only 8% of the parameters of the full model. To further improve the final performance, we utilize the internal representations of the diffusion model and define positive/negative samples based on image editing types to introduce contrastive learning. Extensive experiments demonstrate that the model's editing performance is competitive among many excellent models. Additionally, the constructed dataset exhibits substantial advantages over existing open-source datasets. The open-source code, checkpoints, and datasets for X2Edit can be found at the following link: https://github.com/OPPO-Mente-Lab/X2Edit.

[231] An Iterative Reconstruction Method for Dental Cone-Beam Computed Tomography with a Truncated Field of View

Hyoung Suk Park,Kiwan Jeon

Main category: cs.CV

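TL;DR: This paper proposes a two-stage method for mitigating truncation artifacts in CBCT, improving image quality through implicit neural representation and conventional iterative reconstruction.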

Details

Motivation: In CBCT system design, the use of small detectors truncates the field of view (FOV) and degrades image quality, a problem that is especially pronounced in iterative reconstruction. Method: The study uses Implicit Neural Representation (INR) to generate a prior image, then performs a second-stage reconstruction with discrepancy-corrected projection data in a conventional iterative reconstruction process. Result: Numerical results show that the proposed two-grid approach effectively suppresses truncation artifacts and improves CBCT image quality. Conclusion: The study proposes a two-stage approach that mitigates truncation artifacts in dental cone-beam computed tomography (CBCT) and thereby improves image quality. Abstract: In dental cone-beam computed tomography (CBCT), compact and cost-effective system designs often use small detectors, resulting in a truncated field of view (FOV) that does not fully encompass the patient's head. In iterative reconstruction approaches, the discrepancy between the actual projection and the forward projection within the truncated FOV accumulates over iterations, leading to significant degradation in the reconstructed image quality. In this study, we propose a two-stage approach to mitigate truncation artifacts in dental CBCT. In the first stage, we employ Implicit Neural Representation (INR), leveraging its superior representation power, to generate a prior image over an extended region so that its forward projection fully covers the patient's head. To reduce computational and memory burdens, INR reconstruction is performed with a coarse voxel size. The forward projection of this prior image is then used to estimate the discrepancy due to truncated FOV in the measured projection data. In the second stage, the discrepancy-corrected projection data is utilized in a conventional iterative reconstruction process within the truncated region. Our numerical results demonstrate that the proposed two-grid approach effectively suppresses truncation artifacts, leading to improved CBCT image quality.

[232] SOFA: Deep Learning Framework for Simulating and Optimizing Atrial Fibrillation Ablation

Yunsung Chung,Chanho Lim,Ghassan Bidaoui,Christian Massad,Nassir Marrouche,Jihun Hamm

Main category: cs.CV

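TL;DR: SOFA is a novel deep learning framework that simulates and optimizes atrial fibrillation ablation parameters, effectively lowering the predicted AF recurrence risk and enabling personalized treatment.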

Details

Motivation: Catheter ablation outcomes in atrial fibrillation treatment vary widely, and evaluating and improving ablation efficacy is challenging, so a tool is needed that can predict recurrence risk and optimize the treatment plan. Method: SOFA uses a multi-modal, multi-view generator that processes 2.5D representations of the atrium, simulates ablation outcomes by generating post-ablation scar images, and introduces an optimization scheme that refines ablation parameters to reduce AF recurrence risk. Result: In quantitative evaluations, SOFA accurately synthesizes post-ablation images, and its optimization scheme achieves a 22.18% reduction in the model-predicted recurrence risk. Conclusion: By simulating and optimizing ablation parameters, SOFA substantially reduces the predicted AF recurrence risk and offers a new tool for personalized AF ablation. Abstract: Atrial fibrillation (AF) is a prevalent cardiac arrhythmia often treated with catheter ablation procedures, but procedural outcomes are highly variable. Evaluating and improving ablation efficacy is challenging due to the complex interaction between patient-specific tissue and procedural factors. This paper asks two questions: Can AF recurrence be predicted by simulating the effects of procedural parameters? How should we ablate to reduce AF recurrence? We propose SOFA (Simulating and Optimizing Atrial Fibrillation Ablation), a novel deep-learning framework that addresses these questions. SOFA first simulates the outcome of an ablation strategy by generating a post-ablation image depicting scar formation, conditioned on a patient's pre-ablation LGE-MRI and the specific procedural parameters used (e.g., ablation locations, duration, temperature, power, and force). During this simulation, it predicts AF recurrence risk. Critically, SOFA then introduces an optimization scheme that refines these procedural parameters to minimize the predicted risk. Our method leverages a multi-modal, multi-view generator that processes 2.5D representations of the atrium. Quantitative evaluations show that SOFA accurately synthesizes post-ablation images and that our optimization scheme leads to a 22.18% reduction in the model-predicted recurrence risk. To the best of our knowledge, SOFA is the first framework to integrate the simulation of procedural effects, recurrence prediction, and parameter optimization, offering a novel tool for personalizing AF ablation.

[233] Enhancing Egocentric Object Detection in Static Environments using Graph-based Spatial Anomaly Detection and Correction

Vishakha Lall,Yisi Liu

Main category: cs.CV

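TL;DR: This paper proposes a graph neural network-based post-processing method that corrects detection anomalies by modeling spatial relationships between objects, improving object detection performance.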

Details

Motivation: Existing object detection models fail to fully exploit the consistency of object spatial layouts in static environments, leading to inconsistent predictions, missed detections, or misclassifications, particularly in cluttered or occluded scenes. Method: Proposes a graph-based post-processing pipeline that uses a graph neural network (GNN) to model spatial relationships between objects and correct detection anomalies. Result: Experiments show that incorporating spatial reasoning into object detection significantly improves detection performance, with mAP@50 gains of up to 4%. Conclusion: The method highlights the potential of leveraging the environment's spatial structure to improve the reliability of object detection systems. Abstract: In many real-world applications involving static environments, the spatial layout of objects remains consistent across instances. However, state-of-the-art object detection models often fail to leverage this spatial prior, resulting in inconsistent predictions, missed detections, or misclassifications, particularly in cluttered or occluded scenes. In this work, we propose a graph-based post-processing pipeline that explicitly models the spatial relationships between objects to correct detection anomalies in egocentric frames. Using a graph neural network (GNN) trained on manually annotated data, our model identifies invalid object class labels and predicts corrected class labels based on their neighbourhood context. We evaluate our approach both as a standalone anomaly detection and correction framework and as a post-processing module for standard object detectors such as YOLOv7 and RT-DETR. Experiments demonstrate that incorporating this spatial reasoning significantly improves detection performance, with mAP@50 gains of up to 4%. This method highlights the potential of leveraging the environment's spatial structure to improve reliability in object detection systems.

[234] A Trustworthy Method for Multimodal Emotion Recognition

Junxiao Xue,Xiaozhen Liu,Jie Wang,Xuecheng Wu,Bin Wu

Main category: cs.CV

TL;DR: This paper proposes a reliable emotion recognition framework (TER) that improves model robustness by incorporating uncertainty estimation and confidence-based fusion, achieving strong performance and high reliability on multiple datasets.

Details

Motivation: Existing emotion recognition methods focus on performance enhancement using complex models, but they often neglect the reliability of decisions, especially under noisy, corrupted, or out-of-distribution conditions. Method: The TER method utilizes uncertainty estimation to calculate prediction confidence and combines results from multiple modalities based on their confidence values. It also introduces trusted precision, recall, Acc, and F1 score for evaluating trusted performance. Result: TER achieves 82.40% accuracy on Music-video and trusted F1 scores of 0.7511 and 0.9035 on IEMOCAP and Music-video, respectively, outperforming other methods in trusted performance. Conclusion: The proposed trusted emotion recognition (TER) method enhances the reliability and robustness of emotion recognition models, achieving state-of-the-art performance on Music-video and superior trusted F1 scores on IEMOCAP and Music-video. Abstract: Existing emotion recognition methods mainly focus on enhancing performance by employing complex deep models, typically resulting in significantly higher model complexity. Although effective, it is also crucial to ensure the reliability of the final decision, especially for noisy, corrupted and out-of-distribution data. To this end, we propose a novel emotion recognition method called trusted emotion recognition (TER), which utilizes uncertainty estimation to calculate the confidence value of predictions. TER combines the results from multiple modalities based on their confidence values to output the trusted predictions. We also provide a new evaluation criterion to assess the reliability of predictions. Specifically, we incorporate trusted precision and trusted recall to determine the trusted threshold and formulate the trusted Acc. and trusted F1 score to evaluate the model's trusted performance. 
The proposed framework combines the confidence module that accordingly endows the model with reliability and robustness against possible noise or corruption. The extensive experimental results validate the effectiveness of our proposed model. The TER achieves state-of-the-art performance on the Music-video, achieving 82.40% Acc. In terms of trusted performance, TER outperforms other methods on the IEMOCAP and Music-video, achieving trusted F1 scores of 0.7511 and 0.9035, respectively.
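Confidence-based fusion of modality outputs can be sketched as follows. The abstract does not specify TER's uncertainty estimator, so this sketch substitutes a simple stand-in (one minus normalized entropy) and a confidence-weighted average; the function names and the example probabilities are hypothetical.

```python
import math


def confidence(probs):
    """Stand-in confidence: 1 - normalized entropy (1.0 for one-hot, 0.0 for uniform)."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return 1.0 - entropy / math.log(len(probs))


def fuse(modal_probs):
    """Confidence-weighted average of per-modality class probabilities."""
    num_classes = len(modal_probs[0])
    weights = [confidence(p) for p in modal_probs]
    total = sum(weights)
    if total == 0.0:
        # Every modality is maximally uncertain: fall back to a uniform prediction.
        return [1.0 / num_classes] * num_classes
    return [sum(w * p[c] for w, p in zip(weights, modal_probs)) / total
            for c in range(num_classes)]


if __name__ == "__main__":
    audio = [0.90, 0.05, 0.05]   # confident modality
    video = [0.34, 0.33, 0.33]   # nearly uninformative modality
    print(fuse([audio, video]))
```

With these inputs the fused prediction stays close to the confident modality, which is the behaviour a confidence-based fusion rule is meant to provide when one input is noisy or corrupted.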

[235] AR-VRM: Imitating Human Motions for Visual Robot Manipulation with Analogical Reasoning

Dejie Yang,Zijing Zhao,Yang Liu

Main category: cs.CV

TL;DR: This paper introduces AR-VRM, a method for visual robot manipulation that explicitly imitates human actions using keypoint prediction and analogical reasoning, achieving strong performance on benchmarks and real-world tasks, especially in data-limited settings.

Details

Motivation: Visual Robot Manipulation requires costly multi-modal data, and existing approaches show limited generalization due to reliance on implicit training or irrelevant web data. The study aims to improve generalization by explicitly imitating human actions from large-scale video datasets. Method: The method involves a keypoint Vision-Language Model (VLM) pretraining scheme to learn human action knowledge and predict human hand keypoints. During fine-tuning, human action videos with similar tasks and observations are retrieved, and an Analogical Reasoning (AR) map is learned to link human keypoints to robot components. Result: The method achieves leading performance on the CALVIN benchmark and real-world experiments, with significant improvements in few-shot scenarios, demonstrating the effectiveness of explicit imitation of human actions under data scarcity. Conclusion: The proposed AR-VRM method explicitly imitates human actions using large-scale human action video datasets and achieves leading performance on the CALVIN benchmark and real-world experiments, particularly excelling in few-shot scenarios. Abstract: Visual Robot Manipulation (VRM) aims to enable a robot to follow natural language instructions based on robot states and visual observations, and therefore requires costly multi-modal data. To compensate for the deficiency of robot data, existing approaches have employed vision-language pretraining with large-scale data. However, they either utilize web data that differs from robotic tasks, or train the model in an implicit way (e.g., predicting future frames at the pixel level), thus showing limited generalization ability under insufficient robot data. In this paper, we propose to learn from large-scale human action video datasets in an explicit way (i.e., imitating human actions from hand keypoints), introducing Visual Robot Manipulation with Analogical Reasoning (AR-VRM). 
To acquire action knowledge explicitly from human action videos, we propose a keypoint Vision-Language Model (VLM) pretraining scheme, enabling the VLM to learn human action knowledge and directly predict human hand keypoints. During fine-tuning on robot data, to facilitate the robotic arm in imitating the action patterns of human motions, we first retrieve human action videos that perform similar manipulation tasks and have similar historical observations, and then learn the Analogical Reasoning (AR) map between human hand keypoints and robot components. Taking advantage of focusing on action keypoints instead of irrelevant visual cues, our method achieves leading performance on the CALVIN benchmark and real-world experiments. In few-shot scenarios, our AR-VRM outperforms previous methods by large margins, underscoring the effectiveness of explicitly imitating human actions under data scarcity.

[236] LaRender: Training-Free Occlusion Control in Image Generation via Latent Rendering

Xiaohang Zhan,Dingming Liu

Main category: cs.CV

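TL;DR: This paper proposes a training-free image generation method based on volume rendering principles that controls occlusion relationships in latent space, significantly improving occlusion accuracy and enabling adjustment of a variety of visual effects.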

Details

Motivation: Existing image generation methods control occlusion relationships imprecisely, and layout-to-image methods do not address occlusion explicitly, so a more precise occlusion control method is needed. Method: Applies volume rendering principles to "render" the scene in latent space, guided by occlusion relationships and the estimated transmittance of objects, without retraining or fine-tuning the pre-trained image diffusion model. Result: In extensive experiments, the method significantly outperforms existing approaches in occlusion accuracy and enables adjustment of transparency and a variety of other visual effects. Conclusion: The paper proposes a training-free image generation algorithm that precisely controls occlusion relationships between objects in an image and achieves a variety of visual effects by adjusting object opacities. Abstract: We propose a novel training-free image generation algorithm that precisely controls the occlusion relationships between objects in an image. Existing image generation methods typically rely on prompts to influence occlusion, which often lack precision. While layout-to-image methods provide control over object locations, they fail to address occlusion relationships explicitly. Given a pre-trained image diffusion model, our method leverages volume rendering principles to "render" the scene in latent space, guided by occlusion relationships and the estimated transmittance of objects. This approach does not require retraining or fine-tuning the image diffusion model, yet it enables accurate occlusion control due to its physics-grounded foundation. In extensive experiments, our method significantly outperforms existing approaches in terms of occlusion accuracy. Furthermore, we demonstrate that by adjusting the opacities of objects or concepts during rendering, our method can achieve a variety of effects, such as altering the transparency of objects, the density of mass (e.g., forests), the concentration of particles (e.g., rain, fog), the intensity of light, and the strength of lens effects, etc.
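The volume rendering principle the abstract refers to can be illustrated with standard front-to-back compositing, where each layer contributes in proportion to its opacity and to the transmittance left over by the layers in front of it. How LaRender applies this inside a diffusion model is not described here, so treating each object as a single latent feature vector with a scalar opacity is an assumption of this sketch, and the function name is hypothetical.

```python
def composite_front_to_back(layers):
    """Composite (feature_vector, opacity) pairs ordered from nearest to farthest.

    Returns the composited feature vector and the transmittance remaining
    behind the last layer.
    """
    dim = len(layers[0][0])
    out = [0.0] * dim
    transmittance = 1.0  # fraction of "light" not yet absorbed by nearer layers
    for feature, alpha in layers:
        weight = transmittance * alpha
        out = [o + weight * f for o, f in zip(out, feature)]
        transmittance *= (1.0 - alpha)
    return out, transmittance


if __name__ == "__main__":
    front = ([1.0, 0.0], 0.5)   # semi-transparent object in front
    back = ([0.0, 1.0], 1.0)    # opaque object behind it
    print(composite_front_to_back([front, back]))
```

Raising the front layer's opacity towards 1.0 hides the object behind it completely, while lowering it lets more of the occluded object show through, which matches the transparency-style effects listed in the abstract.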

[237] Collaborative Learning of Scattering and Deep Features for SAR Target Recognition with Noisy Labels

Yimin Fu,Zhunga Liu,Dongxiu Guo,Longfei Wang

Main category: cs.CV

TL;DR: The paper proposes CLSDF, a collaborative learning method for SAR automatic target recognition that integrates scattering and deep features, effectively handles noisy labels, and achieves superior performance on the MSTAR dataset.

Details

Motivation: The motivation is to address the challenge of acquiring high-quality labeled SAR data, which often contains noisy labels that degrade the performance of SAR automatic target recognition (ATR). Existing noisy label learning methods are mainly focused on image data and are insufficient for SAR data due to its non-intuitive visual characteristics. Method: A multi-model feature fusion framework is designed to integrate scattering and deep features. Attributed scattering centers (ASCs) are treated as dynamic graph structure data, and physical characteristics are extracted to enrich deep image features. Loss distribution is modeled using class-wise Gaussian Mixture Models (GMMs) to separate clean and noisy samples. Semi-supervised learning with two divergent branches and a joint distribution alignment strategy is implemented to enhance co-guessed label reliability. Result: The proposed method demonstrates state-of-the-art performance on the MSTAR dataset under various operating conditions and label noise scenarios. Conclusion: The proposed CLSDF method achieves state-of-the-art performance for SAR ATR with noisy labels by integrating scattering and deep features, using a multi-model feature fusion framework, and implementing semi-supervised learning with GMM-based noise modeling. Abstract: The acquisition of high-quality labeled synthetic aperture radar (SAR) data is challenging due to the demanding requirement for expert knowledge. Consequently, the presence of unreliable noisy labels is unavoidable, which results in performance degradation of SAR automatic target recognition (ATR). Existing research on learning with noisy labels mainly focuses on image data. However, the non-intuitive visual characteristics of SAR data are insufficient to achieve noise-robust learning. To address this problem, we propose collaborative learning of scattering and deep features (CLSDF) for SAR ATR with noisy labels. 
Specifically, a multi-model feature fusion framework is designed to integrate scattering and deep features. The attributed scattering centers (ASCs) are treated as dynamic graph structure data, and the extracted physical characteristics effectively enrich the representation of deep image features. Then, the samples with clean and noisy labels are divided by modeling the loss distribution with multiple class-wise Gaussian Mixture Models (GMMs). Afterward, the semi-supervised learning of two divergent branches is conducted based on the data divided by each other. Moreover, a joint distribution alignment strategy is introduced to enhance the reliability of co-guessed labels. Extensive experiments have been done on the Moving and Stationary Target Acquisition and Recognition (MSTAR) dataset, and the results show that the proposed method can achieve state-of-the-art performance under different operating conditions with various label noises.
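Dividing clean from noisy labels by modeling the loss distribution can be sketched with a two-component, one-dimensional Gaussian mixture fitted by EM, where the low-mean component is read as "clean". The abstract describes multiple class-wise GMMs; this sketch fits a single mixture to one class's losses, and the initialization, iteration count, and example loss values are assumptions.

```python
import math


def gauss_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_two_component_gmm(losses, iters=50, min_var=1e-6):
    """Fit a 2-component 1-D GMM with EM; returns (means, variances, weights)."""
    lo, hi = min(losses), max(losses)
    means = [lo, hi]  # start the components at the extremes
    init_var = max(((hi - lo) / 2.0) ** 2, min_var)
    variances = [init_var, init_var]
    weights = [0.5, 0.5]
    n = len(losses)
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each loss.
        resp = []
        for x in losses:
            dens = [weights[k] * gauss_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(dens)
            resp.append([d / total for d in dens] if total > 0.0 else [0.5, 0.5])
        # M-step: re-estimate weights, means, and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, losses)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, losses)) / nk
            variances[k] = max(var, min_var)
    return means, variances, weights


def clean_probability(x, means, variances, weights):
    """Posterior probability that loss `x` belongs to the low-mean (clean) component."""
    clean = 0 if means[0] <= means[1] else 1
    dens = [weights[k] * gauss_pdf(x, means[k], variances[k]) for k in range(2)]
    total = sum(dens)
    return dens[clean] / total if total > 0.0 else 0.5


if __name__ == "__main__":
    losses = [0.05, 0.08, 0.10, 0.12, 0.07, 2.0, 2.2, 1.9, 2.4]
    params = fit_two_component_gmm(losses)
    print([round(clean_probability(x, *params), 3) for x in losses])
```

Samples whose clean probability exceeds a threshold such as 0.5 would be kept as labeled data, and the rest treated as unlabeled for the semi-supervised stage.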

[238] Undress to Redress: A Training-Free Framework for Virtual Try-On

Zhiying Li,Junhao Wu,Yeying Jin,Daiheng Gao,Yun Ji,Kaichuan Kong,Lei Yu,Hao Xu,Kai Chen,Bruce Gu,Nana Wang,Zhaoxin Fan

Main category: cs.CV

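TL;DR: To address the difficulty of long-sleeve-to-short-sleeve conversion in virtual try-on, this paper proposes UR-VTON, a novel training-free framework that first virtually undresses and then redresses the user, combined with Dynamic Classifier-Free Guidance scheduling and a Structural Refiner, effectively improving the detail and quality of generated images.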

Details

Motivation: Existing VTON methods often produce unrealistic outputs on long-sleeve-to-short-sleeve conversions because little exposed skin is visible in the original image, a failure that stems from the "majority" completion rule in current models. Method: Introduces an "undress-to-redress" mechanism, combined with Dynamic Classifier-Free Guidance scheduling and a Structural Refiner, to improve the diversity and detail fidelity of generated images. Result: On the LS-TON benchmark, UR-VTON outperforms existing state-of-the-art methods in both detail preservation and image quality. Conclusion: UR-VTON is a novel training-free framework that can be seamlessly integrated with any existing VTON method and effectively addresses skin restoration in long-sleeve-to-short-sleeve conversions. Abstract: Virtual try-on (VTON) is a crucial task for enhancing user experience in online shopping by generating realistic garment previews on personal photos. Although existing methods have achieved impressive results, they struggle with long-sleeve-to-short-sleeve conversions, a common and practical scenario, often producing unrealistic outputs when exposed skin is underrepresented in the original image. We argue that this challenge arises from the "majority" completion rule in current VTON models, which leads to inaccurate skin restoration in such cases. To address this, we propose UR-VTON (Undress-Redress Virtual Try-ON), a novel, training-free framework that can be seamlessly integrated with any existing VTON method. UR-VTON introduces an "undress-to-redress" mechanism: it first reveals the user's torso by virtually "undressing," then applies the target short-sleeve garment, effectively decomposing the conversion into two more manageable steps. Additionally, we incorporate Dynamic Classifier-Free Guidance scheduling to balance diversity and image quality during DDPM sampling, and employ Structural Refiner to enhance detail fidelity using high-frequency cues. Finally, we present LS-TON, a new benchmark for long-sleeve-to-short-sleeve try-on. Extensive experiments demonstrate that UR-VTON outperforms state-of-the-art methods in both detail preservation and image quality. Code will be released upon acceptance.
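Classifier-free guidance with a step-dependent scale can be sketched in a few lines. The guidance combination itself is the standard formula; the abstract does not say which schedule UR-VTON uses, so the linear schedule, its default endpoints (`w_start`, `w_end`), and the function names are hypothetical placeholders.

```python
def guidance_scale(step, num_steps, w_start=7.5, w_end=2.0):
    """Hypothetical linear schedule from `w_start` (first step) to `w_end` (last step)."""
    if num_steps < 2:
        return w_start
    frac = step / (num_steps - 1)
    return w_start + frac * (w_end - w_start)


def cfg_combine(eps_uncond, eps_cond, w):
    """Standard classifier-free guidance: eps = eps_uncond + w * (eps_cond - eps_uncond)."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]


if __name__ == "__main__":
    num_steps = 5
    eps_uncond, eps_cond = [0.0, 0.0], [1.0, 2.0]  # toy noise predictions
    for step in range(num_steps):
        w = guidance_scale(step, num_steps)
        print(step, w, cfg_combine(eps_uncond, eps_cond, w))
```

A scale of 0 reproduces the unconditional prediction and a scale of 1 the conditional one; varying the scale over the sampling steps is how a dynamic schedule trades diversity against fidelity.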

[239] TAR-TVG: Enhancing VLMs with Timestamp Anchor-Constrained Reasoning for Temporal Video Grounding

Chaohong Guo,Xun Mo,Yongwei Nie,Xuemiao Xu,Chao Xu,Fei Yu,Chengjiang Long

Main category: cs.CV

TL;DR: This paper proposes TAR-TVG, a novel framework for Temporal Video Grounding that enhances reasoning and prediction quality using timestamp anchors and a multi-stage training strategy.

Details

Motivation: The motivation is to explicitly constrain the reasoning process in existing models to ensure the quality of temporal predictions in Temporal Video Grounding. Method: The method introduces timestamp anchors as intermediate verification points within the reasoning process and uses a three-stage training strategy involving initial GRPO training, supervised fine-tuning (SFT), and final GRPO optimization. Result: The model achieves state-of-the-art performance while generating interpretable and verifiable reasoning chains with progressively refined temporal estimations. Conclusion: The proposed TAR-TVG framework achieves state-of-the-art performance in Temporal Video Grounding by incorporating timestamp anchors in the reasoning process and utilizing a three-stage training strategy. Abstract: Temporal Video Grounding (TVG) aims to precisely localize video segments corresponding to natural language queries, which is a critical capability for long-form video understanding. Although existing reinforcement learning approaches encourage models to generate reasoning chains before predictions, they fail to explicitly constrain the reasoning process to ensure the quality of the final temporal predictions. To address this limitation, we propose Timestamp Anchor-constrained Reasoning for Temporal Video Grounding (TAR-TVG), a novel framework that introduces timestamp anchors within the reasoning process to enforce explicit supervision to the thought content. These anchors serve as intermediate verification points. More importantly, we require each reasoning step to produce increasingly accurate temporal estimations, thereby ensuring that the reasoning process contributes meaningfully to the final prediction. 
To address the challenge of low-probability anchor generation in models (e.g., Qwen2.5-VL-3B), we develop an efficient self-distillation training strategy: (1) initial GRPO training to collect 30K high-quality reasoning traces containing multiple timestamp anchors, (2) supervised fine-tuning (SFT) on distilled data, and (3) final GRPO optimization on the SFT-enhanced model. This three-stage training strategy enables robust anchor generation while maintaining reasoning quality. Experiments show that our model achieves state-of-the-art performance while producing interpretable, verifiable reasoning chains with progressively refined temporal estimations.
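
The requirement that each reasoning step yields a more accurate temporal estimate can be illustrated with a small check. This is a minimal sketch, not the authors' reward function: it assumes each timestamp anchor is a `(start, end)` pair in seconds and measures accuracy by temporal IoU against the ground-truth segment. The names `temporal_iou` and `anchors_progressively_refine` are hypothetical.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    # When the segments overlap, the union equals their hull; when they
    # do not, inter is 0 and the result is 0 regardless of the denominator.
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0


def anchors_progressively_refine(anchors, gt):
    """Return (ok, ious): ok is True when the IoU never decreases across anchors.

    Assumption: "increasingly accurate" is read as non-decreasing IoU.
    """
    ious = [temporal_iou(a, gt) for a in anchors]
    ok = all(later >= earlier for earlier, later in zip(ious, ious[1:]))
    return ok, ious


# Hypothetical reasoning trace with three anchors converging on the target.
ok, ious = anchors_progressively_refine([(0, 30), (8, 22), (10, 20)], (10, 20))
```

A check of this kind could serve as a filter when collecting reasoning traces, although the abstract does not state how the traces were actually selected.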

[240] Make Your MoVe: Make Your 3D Contents by Adapting Multi-View Diffusion Models to External Editing

Weitao Wang,Haoran Xu,Jun Meng,Haoqian Wang

Main category: cs.CV

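TL;DR: This paper presents a tuning-free, plug-and-play scheme that preserves the original geometry when generating edited 3D assets, improving multi-view consistency and mesh quality.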

Details

Motivation: As 3D generation techniques advance, demand for personalized content is growing, but existing editing tools focus mainly on the 2D domain, and feeding their results directly into 3D generation causes information loss, so a new approach is needed. Method: A geometry preservation module and an injection switcher are proposed to guide the edited multi-view generation and to control the extent of supervision from the original normals. Result: Experiments show that the method consistently improves the multi-view consistency and mesh quality of edited 3D assets across multiple combinations of multi-view diffusion models and editing methods. Conclusion: The paper proposes a tuning-free, plug-and-play scheme that preserves the original geometry of edited 3D assets, using a geometry preservation module and an injection switcher to improve multi-view consistency and mesh quality. Abstract: As 3D generation techniques continue to flourish, the demand for generating personalized content is rapidly rising. Users increasingly seek to apply various editing methods to polish generated 3D content, aiming to enhance its color, style, and lighting without compromising the underlying geometry. However, most existing editing tools focus on the 2D domain, and directly feeding their results into 3D generation methods (like multi-view diffusion models) will introduce information loss, degrading the quality of the final 3D assets. In this paper, we propose a tuning-free, plug-and-play scheme that aligns edited assets with their original geometry in a single inference run. Central to our approach is a geometry preservation module that guides the edited multi-view generation with original input normal latents. In addition, an injection switcher is proposed to deliberately control the supervision extent of the original normals, ensuring the alignment between the edited color and normal views. Extensive experiments show that our method consistently improves both the multi-view consistency and mesh quality of edited 3D assets, across multiple combinations of multi-view diffusion models and editing methods.

[241] Multi-view Normal and Distance Guidance Gaussian Splatting for Surface Reconstruction

Bo Jia,Yanan Guo,Ying Chang,Benkui Zhang,Ying Xie,Kangning Du,Lin Cao

Main category: cs.CV

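TL;DR: This paper proposes a multi-view normal and distance guided Gaussian splatting method to address the distance and global matching challenges in multi-view scenes.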

Details

Motivation: When Gaussian normal vectors are aligned within a single-view projection plane, biases may emerge upon switching to nearby views; the work aims to address the resulting distance and global matching challenges in multi-view scenes. Method: A multi-view distance reprojection regularization module and a multi-view normal enhancement module are proposed. The former achieves multi-view Gaussian alignment by computing the distance loss between two nearby views and the same Gaussian surface; the latter ensures cross-view consistency by matching the normals of pixel points in nearby views and computing the loss. Result: By constraining nearby depth maps and aligning 3D normals, the method achieves geometric depth unification and high-accuracy reconstruction. Conclusion: Experimental results show that the method outperforms the baseline in both quantitative and qualitative evaluations, significantly enhancing the surface reconstruction capability of 3DGS. Abstract: 3D Gaussian Splatting (3DGS) achieves remarkable results in the field of surface reconstruction. However, when Gaussian normal vectors are aligned within the single-view projection plane, while the geometry appears reasonable in the current view, biases may emerge upon switching to nearby views. To address the distance and global matching challenges in multi-view scenes, we design multi-view normal and distance-guided Gaussian splatting. This method achieves geometric depth unification and high-accuracy reconstruction by constraining nearby depth maps and aligning 3D normals. Specifically, for the reconstruction of small indoor and outdoor scenes, we propose a multi-view distance reprojection regularization module that achieves multi-view Gaussian alignment by computing the distance loss between two nearby views and the same Gaussian surface. Additionally, we develop a multi-view normal enhancement module, which ensures consistency across views by matching the normals of pixel points in nearby views and calculating the loss. Extensive experimental results demonstrate that our method outperforms the baseline in both quantitative and qualitative evaluations, significantly enhancing the surface reconstruction capability of 3DGS.

[242] DoorDet: Semi-Automated Multi-Class Door Detection Dataset via Object Detection and Large Language Models

Licheng Zhang,Bach Le,Naveed Akhtar,Tuan Ngo

Main category: cs.CV

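TL;DR: This study proposes a semi-automated pipeline that combines deep learning and a large language model to efficiently build a high-quality dataset for multi-class door detection while reducing manual annotation cost.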

Details

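Motivation: Although door detection and classification are critical for building compliance checking and indoor scene understanding, publicly available datasets designed for fine-grained multi-class door detection are lacking. This work aims to fill that gap while reducing manual annotation cost. Method: A semi-automated pipeline is proposed: a state-of-the-art object detection model first detects doors as a unified category, a large language model (LLM) then classifies each detected instance based on its visual and contextual features, and a human-in-the-loop stage ensures high-quality labels and bounding boxes. Result: A multi-class door detection dataset was built, and the method is shown to produce data suitable for benchmarking neural models in floor plan analysis. Conclusion: The study demonstrates an efficient dataset construction approach that combines deep learning and multimodal reasoning, significantly reducing annotation cost while producing a high-quality dataset for complex real-world domains such as floor plan analysis. Abstract: Accurate detection and classification of diverse door types in floor plan drawings is critical for multiple applications, such as building compliance checking and indoor scene understanding. Despite their importance, publicly available datasets specifically designed for fine-grained multi-class door detection remain scarce. In this work, we present a semi-automated pipeline that leverages a state-of-the-art object detector and a large language model (LLM) to construct a multi-class door detection dataset with minimal manual effort. Doors are first detected as a unified category using a deep object detection model. Next, an LLM classifies each detected instance based on its visual and contextual features. Finally, a human-in-the-loop stage ensures high-quality labels and bounding boxes. Our method significantly reduces annotation cost while producing a dataset suitable for benchmarking neural models in floor plan analysis. This work demonstrates the potential of combining deep learning and multimodal reasoning for efficient dataset construction in complex real-world domains.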

[243] A Registration-Based Star-Shape Segmentation Model and Fast Algorithms

Daoping Zhang,Xue-Cheng Tai,Lok Ming Lui

Main category: cs.CV

TL;DR: This paper proposes a novel star-shape segmentation model using a registration framework and level set representation to improve segmentation accuracy for complex images, validated through experiments on synthetic and real datasets.

Details

Motivation: The motivation is to improve the accuracy of image segmentation in the presence of occlusions, obscurities, or noise by incorporating star-shape priors and prior information. Method: The method combines level set representation with a registration framework while imposing constraints on the deformed level set function. It uses the alternating direction method of multipliers to solve the proposed model. Result: Numerical experiments on synthetic and real images demonstrate the efficacy of the proposed approach in achieving accurate star-shape segmentation. Conclusion: The proposed star-shape segmentation model based on the registration framework effectively achieves accurate segmentation, accommodating full and partial star-shapes with single or multiple centers, and enforcing boundaries through specified landmarks. Abstract: Image segmentation plays a crucial role in extracting objects of interest and identifying their boundaries within an image. However, accurate segmentation becomes challenging when dealing with occlusions, obscurities, or noise in corrupted images. To tackle this challenge, prior information is often utilized, with recent attention on star-shape priors. In this paper, we propose a star-shape segmentation model based on the registration framework. By combining the level set representation with the registration framework and imposing constraints on the deformed level set function, our model enables both full and partial star-shape segmentation, accommodating single or multiple centers. Additionally, our approach allows for the enforcement of identified boundaries to pass through specified landmark locations. We tackle the proposed models using the alternating direction method of multipliers. Through numerical experiments conducted on synthetic and real images, we demonstrate the efficacy of our approach in achieving accurate star-shape segmentation.

[244] Enhancing Small-Scale Dataset Expansion with Triplet-Connection-based Sample Re-Weighting

Ting Xiang,Changjian Chen,Zhuo Tang,Qifeng Zhang,Fei Lyu,Li Yang,Jiapeng Zhang,Kenli Li

Main category: cs.CV

TL;DR: The paper introduces TriReWeight, a novel re-weighting method for generative data augmentation, which improves model performance in both natural and medical image datasets.

Details

Motivation: The scarcity of images in real-world applications like medical diagnosis limits the performance of computer vision models. Using pre-trained generative models to expand datasets is effective, but noisy images can be generated due to the uncontrollable generation process and ambiguity of natural language. Method: TriReWeight, a triplet-connection-based sample re-weighting method, is developed based on theoretical analysis of three types of supervision for generated images. Result: In theory, TriReWeight can be integrated with any generative data augmentation method without degrading its performance, and its generalization approaches the optimal rate. Experimental results show that it outperforms existing SOTA methods by 7.9% on average over six natural image datasets and by 3.4% on average over three medical datasets. Conclusion: TriReWeight is an effective and versatile method for improving generative data augmentation, capable of enhancing various methods without degrading their performance. Abstract: The performance of computer vision models in certain real-world applications, such as medical diagnosis, is often limited by the scarcity of available images. Expanding datasets using pre-trained generative models is an effective solution. However, due to the uncontrollable generation process and the ambiguity of natural language, noisy images may be generated. Re-weighting is an effective way to address this issue by assigning low weights to such noisy images. We first theoretically analyze three types of supervision for the generated images. Based on the theoretical analysis, we develop TriReWeight, a triplet-connection-based sample re-weighting method to enhance generative data augmentation. Theoretically, TriReWeight can be integrated with any generative data augmentation method and never degrades its performance. Moreover, its generalization approaches the optimal in the order $O(\sqrt{d\ln (n)/n})$. 
Our experiments validate the correctness of the theoretical analysis and demonstrate that our method outperforms the existing SOTA methods by $7.9\%$ on average over six natural image datasets and by $3.4\%$ on average over three medical datasets. We also experimentally validate that our method can enhance the performance of different generative data augmentation methods.
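
The general idea of re-weighting, in which noisy generated images receive low weights so that they contribute less to training, can be sketched as a weighted average of per-sample losses. This is only an illustration of generic sample re-weighting: how TriReWeight derives its weights from the three types of supervision is not described in the abstract, so the weights below are hypothetical inputs and `reweighted_loss` is an assumed name.

```python
def reweighted_loss(losses, weights):
    """Weighted mean of per-sample losses.

    losses  -- per-sample loss values (e.g. cross-entropy per image)
    weights -- non-negative per-sample weights; low values down-weight
               samples believed to be noisy (assumed to be given)
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return sum(w * l for w, l in zip(weights, losses)) / total


# Two clean generated images and one noisy image with a large loss.
losses = [0.2, 0.4, 3.0]
uniform = reweighted_loss(losses, [1.0, 1.0, 1.0])      # noisy sample dominates
downweighted = reweighted_loss(losses, [1.0, 1.0, 0.1])  # noisy sample suppressed
```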

[245] Grouped Speculative Decoding for Autoregressive Image Generation

Junhyuk So,Juncheol Shin,Hyunho Kook,Eunhyeok Park

Main category: cs.CV

TL;DR: This paper introduces Grouped Speculative Decoding (GSD), a training-free method to accelerate autoregressive image generation by leveraging the redundancy and diversity of image tokens through dynamic clustering, achieving a 3.7x speed-up without quality loss.

Details

Motivation: Autoregressive (AR) image models, while powerful, suffer from long inference times due to their sequential nature. Existing Speculative Decoding (SD) methods offer limited acceleration and often require additional training. The authors aim to address this by leveraging the inherent redundancy and diversity of image tokens for more effective acceleration. Method: The paper proposes Grouped Speculative Decoding (GSD), which evaluates clusters of visually valid tokens dynamically, instead of relying on a single most-likely token as in traditional Speculative Decoding (SD). Result: GSD achieves an average acceleration of 3.7x for AR image models while preserving image quality, and it does not require any additional training. Conclusion: GSD proves to be an effective, training-free method for accelerating AR image models, achieving an average speed-up of 3.7x while maintaining image quality. Abstract: Recently, autoregressive (AR) image models have demonstrated remarkable generative capabilities, positioning themselves as a compelling alternative to diffusion models. However, their sequential nature leads to long inference times, limiting their practical scalability. In this work, we introduce Grouped Speculative Decoding (GSD), a novel, training-free acceleration method for AR image models. While recent studies have explored Speculative Decoding (SD) as a means to speed up AR image generation, existing approaches either provide only modest acceleration or require additional training. Our in-depth analysis reveals a fundamental difference between language and image tokens: image tokens exhibit inherent redundancy and diversity, meaning multiple tokens can convey valid semantics. However, traditional SD methods are designed to accept only a single most-likely token, which fails to leverage this difference, leading to excessive false-negative rejections. 
To address this, we propose a new SD strategy that evaluates clusters of visually valid tokens rather than relying on a single target token. Additionally, we observe that static clustering based on embedding distance is ineffective, which motivates our dynamic GSD approach. Extensive experiments show that GSD accelerates AR image models by an average of 3.7x while preserving image quality, all without requiring any additional training. The source code is available at https://github.com/junhyukso/GSD
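
The difference between token-level and cluster-level acceptance can be sketched as follows. Standard speculative sampling accepts a drafted token `x` with probability `min(1, p(x)/q(x))`, where `p` is the target distribution and `q` the draft distribution. The grouped variant below compares probability mass summed over the cluster containing `x`. This is an assumed reading of "evaluates clusters of visually valid tokens": the paper's exact acceptance rule and its dynamic clustering are not given in the abstract, and the clusters here are fixed, hypothetical inputs.

```python
def token_accept_prob(p_target, q_draft, token):
    """Standard speculative sampling: accept with probability min(1, p/q)."""
    return min(1.0, p_target[token] / q_draft[token])


def grouped_accept_prob(p_target, q_draft, token, clusters):
    """Assumed grouped rule: compare probability mass of the token's cluster."""
    group = next(c for c in clusters if token in c)
    p_mass = sum(p_target[t] for t in group)
    q_mass = sum(q_draft[t] for t in group)
    return min(1.0, p_mass / q_mass)


# Toy 4-token vocabulary. Tokens 0 and 1 are treated as visually
# interchangeable (hypothetical cluster), as are tokens 2 and 3.
p = [0.05, 0.45, 0.40, 0.10]   # target model distribution
q = [0.60, 0.10, 0.20, 0.10]   # draft model distribution
clusters = [{0, 1}, {2, 3}]

single = token_accept_prob(p, q, 0)               # low: target prefers token 1
grouped = grouped_accept_prob(p, q, 0, clusters)  # higher: cluster mass agrees
```

In this toy case the draft picks token 0 while the target prefers its near-equivalent token 1, so token-level acceptance rejects most of the time. This is the kind of false-negative rejection the abstract describes.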

[246] Comparison Reveals Commonality: Customized Image Generation through Contrastive Inversion

Minseo Kim,Minchan Kwon,Dongyeun Lee,Yunho Jeon,Junmo Kim

Main category: cs.CV

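TL;DR: Contrastive Inversion uses contrastive learning and disentangled cross-attention fine-tuning to extract the common concept from a set of images, improving generation quality.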

Details

Motivation: Existing methods rely on additional guidance (such as text prompts or spatial masks), which can lead to incomplete separation of auxiliary features and degrade generation quality. Method: The target token and auxiliary text tokens are trained with contrastive learning, followed by disentangled cross-attention fine-tuning. Result: Experiments show that Contrastive Inversion performs strongly in both concept representation and editing, outperforming existing techniques. Conclusion: Contrastive Inversion is a new approach that requires no additional information; through contrastive learning and disentangled cross-attention fine-tuning, it extracts the common concept from an image set and improves generation quality. Abstract: The recent demand for customized image generation raises a need for techniques that effectively extract the common concept from small sets of images. Existing methods typically rely on additional guidance, such as text prompts or spatial masks, to capture the common target concept. Unfortunately, relying on manually provided guidance can lead to incomplete separation of auxiliary features, which degrades generation quality. In this paper, we propose Contrastive Inversion, a novel approach that identifies the common concept by comparing the input images without relying on additional information. We train the target token along with the image-wise auxiliary text tokens via contrastive learning, which extracts the well-disentangled true semantics of the target. Then we apply disentangled cross-attention fine-tuning to improve concept fidelity without overfitting. Experimental results and analysis demonstrate that our method achieves a balanced, high-level performance in both concept representation and editing, outperforming existing techniques.

[247] Correspondence as Video: Test-Time Adaption on SAM2 for Reference Segmentation in the Wild

Haoran Wang,Zekun Li,Jian Zhang,Lei Qi,Yinghuan Shi

Main category: cs.CV

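TL;DR: This paper proposes CAV-SAM, which treats reference-target image pairs as a pseudo video and leverages SAM2's video segmentation capability for efficient model adaptation, significantly improving segmentation performance.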

Details

Motivation: Existing reference segmentation methods rely mainly on meta-learning, which requires large amounts of data and computation, so a lightweight model adaptation method is needed. Method: CAV-SAM comprises two core modules: the Diffusion-Based Semantic Transition (DBST) module constructs a semantic transformation sequence, and the Test-Time Geometric Alignment (TTGA) module aligns the geometric changes within this sequence through test-time fine-tuning. Result: Evaluated on widely used datasets, CAV-SAM improves segmentation performance by more than 5% over current state-of-the-art methods. Conclusion: By representing the correspondence between reference-target image pairs as a pseudo video, CAV-SAM leverages SAM2's iVOS capability to achieve lightweight adaptation to downstream tasks and significantly improves segmentation performance on multiple datasets. Abstract: Large vision models like the Segment Anything Model (SAM) exhibit significant limitations when applied to downstream tasks in the wild. Consequently, reference segmentation, which leverages reference images and their corresponding masks to impart novel knowledge to the model, emerges as a promising new direction for adapting vision models. However, existing reference segmentation approaches predominantly rely on meta-learning, which still necessitates an extensive meta-training process and brings massive data and computational cost. In this study, we propose a novel approach by representing the inherent correspondence between reference-target image pairs as a pseudo video. This perspective allows the latest version of SAM, known as SAM2, which is equipped with interactive video object segmentation (iVOS) capabilities, to be adapted to downstream tasks in a lightweight manner. We term this approach Correspondence As Video for SAM (CAV-SAM). CAV-SAM comprises two key modules: the Diffusion-Based Semantic Transition (DBST) module employs a diffusion model to construct a semantic transformation sequence, while the Test-Time Geometric Alignment (TTGA) module aligns the geometric changes within this sequence through test-time fine-tuning. We evaluated CAV-SAM on widely-used datasets, achieving segmentation performance improvements exceeding 5% over SOTA methods. Implementation is provided in the supplementary materials.

[248] UniSVG: A Unified Dataset for Vector Graphic Understanding and Generation with Multimodal Large Language Models

Jinke Li,Jiarui Yu,Chenxing Wei,Hande Dong,Qiang Lin,Liangjing Yang,Zhicai Wang,Yanbin Hao

Main category: cs.CV

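TL;DR: This paper proposes an approach to SVG understanding and generation based on multimodal large language models and introduces UniSVG, the first comprehensive unified SVG dataset; open-source multimodal large language models trained on it perform strongly on SVG tasks.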

Details

Motivation: As AI systems proliferate, enabling AI to understand and generate SVG has become increasingly urgent, yet AI-driven SVG understanding and generation still face significant challenges. Method: The paper proposes an approach to SVG understanding and generation based on multimodal large language models and builds a dataset named UniSVG for training and evaluating such models in the SVG domain. Result: After learning on the UniSVG dataset, open-source multimodal large language models perform strongly on various SVG understanding and generation tasks, surpassing the state-of-the-art closed-source multimodal large language model GPT-4V. Conclusion: The paper proposes an approach to SVG understanding and generation based on multimodal large language models and introduces UniSVG, the first comprehensive unified SVG dataset; learning on this dataset enables open-source multimodal large language models to perform strongly on various SVG U&G tasks, surpassing state-of-the-art closed-source models such as GPT-4V. Abstract: Unlike bitmap images, scalable vector graphics (SVG) maintain quality when scaled and are frequently employed in computer vision and artistic design, where they are represented as SVG code. In this era of proliferating AI-powered systems, enabling AI to understand and generate SVG has become increasingly urgent. However, AI-driven SVG understanding and generation (U&G) remain significant challenges. SVG code, equivalent to a set of curves and lines controlled by floating-point parameters, demands high precision in SVG U&G. In addition, SVG generation operates under diverse conditional constraints, including textual prompts and visual references, which requires powerful multi-modal processing for condition-to-SVG transformation. Recently, rapidly growing Multi-modal Large Language Models (MLLMs) have demonstrated capabilities to process multi-modal inputs and generate complex vector controlling parameters, suggesting the potential to address SVG U&G tasks within a unified model. To unlock MLLMs' capabilities in the SVG area, we propose an SVG-centric dataset called UniSVG, comprising 525k data items, tailored for MLLM training and evaluation. To the best of our knowledge, it is the first comprehensive dataset designed for unified SVG generation (from textual prompts and images) and SVG understanding (color, category, usage, etc.). As expected, learning on the proposed dataset boosts open-source MLLMs' performance on various SVG U&G tasks, surpassing SOTA closed-source MLLMs like GPT-4V. We release the dataset, benchmark, weights, code, and experiment details at https://ryanlijinke.github.io/.

[249] Dream4D: Lifting Camera-Controlled I2V towards Spatiotemporally Consistent 4D Generation

Xiaoyan Liu,Kangrui Li,Jiaxin Liu

Main category: cs.CV

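TL;DR: Dream4D is a new framework for generating spatiotemporally coherent 4D content; by combining the strengths of controllable video generation and neural 4D reconstruction, it achieves higher quality than existing methods.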

Details

Motivation: Current methods struggle to maintain view consistency while handling complex scene dynamics, particularly in large-scale scenes with multiple interacting elements. Method: Dream4D adopts a two-stage architecture: it first predicts optimal camera trajectories from a single image using few-shot learning, then generates geometrically consistent multi-view sequences via a specialized pose-conditioned diffusion process, which are finally converted into a persistent 4D representation. Result: Dream4D leverages both the rich temporal priors of video diffusion models and the geometric awareness of reconstruction models, and its 4D generation quality (e.g., mPSNR, mSSIM) is significantly higher than that of existing methods. Conclusion: Dream4D bridges the gap between high-fidelity spatial representations and physically plausible temporal dynamics, opening new possibilities for 4D content synthesis. Abstract: The synthesis of spatiotemporally coherent 4D content presents fundamental challenges in computer vision, requiring simultaneous modeling of high-fidelity spatial representations and physically plausible temporal dynamics. Current approaches often struggle to maintain view consistency while handling complex scene dynamics, particularly in large-scale environments with multiple interacting elements. This work introduces Dream4D, a novel framework that bridges this gap through a synergy of controllable video generation and neural 4D reconstruction. Our approach seamlessly combines a two-stage architecture: it first predicts optimal camera trajectories from a single image using few-shot learning, then generates geometrically consistent multi-view sequences via a specialized pose-conditioned diffusion process, which are finally converted into a persistent 4D representation. This framework is the first to leverage both rich temporal priors from video diffusion models and geometric awareness of the reconstruction models, which significantly facilitates 4D generation and shows higher quality (e.g., mPSNR, mSSIM) over existing methods.

[250] Prototype-Guided Curriculum Learning for Zero-Shot Learning

Lei Wang,Shiming Chen,Guo-Sen Xie,Ziming Hong,Chaojian Yu,Qinmu Peng,Xinge You

Main category: cs.CV

TL;DR: This paper proposes a new framework, CLZSL, to improve Zero-Shot Learning by addressing issues caused by manually defined semantic prototypes through two modules: PCL and PUP.

Details

Motivation: The motivation is to improve knowledge transfer in ZSL by addressing the noisy supervision caused by instance-level mismatches and class-level imprecision in manually defined semantic prototypes. Method: The paper introduces a prototype-guided curriculum learning framework (CLZSL) with two modules: PCL and PUP. The PCL module prioritizes samples with high cosine similarity between visual mappings and class-level semantic prototypes, while the PUP module dynamically updates the semantic prototypes using learned visual mappings. Result: Experiments on standard benchmark datasets (AWA2, SUN, and CUB) demonstrate the effectiveness of the proposed method in reducing the impact of noisy supervision and improving visual-semantic mapping. Conclusion: The proposed CLZSL framework effectively addresses the issues of instance-level mismatch and class-level imprecision in ZSL by using a PCL module and a PUP module, leading to improved visual-semantic mapping and knowledge transfer. Abstract: In Zero-Shot Learning (ZSL), embedding-based methods enable knowledge transfer from seen to unseen classes by learning a visual-semantic mapping from seen-class images to class-level semantic prototypes (e.g., attributes). However, these semantic prototypes are manually defined and may introduce noisy supervision for two main reasons: (i) instance-level mismatch: variations in perspective, occlusion, and annotation bias will cause discrepancies between individual samples and the class-level semantic prototypes; and (ii) class-level imprecision: the manually defined semantic prototypes may not accurately reflect the true semantics of the class. Consequently, the visual-semantic mapping will be misled, reducing the effectiveness of knowledge transfer to unseen classes. 
In this work, we propose a prototype-guided curriculum learning framework (dubbed CLZSL), which mitigates instance-level mismatches through a Prototype-Guided Curriculum Learning (PCL) module and addresses class-level imprecision via a Prototype Update (PUP) module. Specifically, the PCL module prioritizes samples with high cosine similarity between their visual mappings and the class-level semantic prototypes, and progressively advances to less-aligned samples, thereby reducing the interference of instance-level mismatches to achieve accurate visual-semantic mapping. In addition, the PUP module dynamically updates the class-level semantic prototypes by leveraging the visual mappings learned from instances, thereby reducing class-level imprecision and further improving the visual-semantic mapping. Experiments were conducted on standard benchmark datasets (AWA2, SUN, and CUB) to verify the effectiveness of our method.
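
The two modules can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the curriculum is modeled as keeping the top fraction of samples ranked by cosine similarity to their class prototype, and the prototype update is modeled as an exponential moving average toward the class mean of the visual mappings. The function names, the `keep_fraction` schedule, and the `momentum` parameter are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def curriculum_subset(mapped, labels, prototypes, keep_fraction):
    """PCL-style selection: indices of the best-aligned samples, best first.

    Raising keep_fraction over training admits less-aligned samples later.
    """
    order = sorted(
        range(len(mapped)),
        key=lambda i: cosine(mapped[i], prototypes[labels[i]]),
        reverse=True,
    )
    k = max(1, math.ceil(keep_fraction * len(order)))
    return order[:k]


def update_prototypes(prototypes, mapped, labels, momentum=0.9):
    """PUP-style update (assumed EMA): move each prototype toward its class mean."""
    updated = {}
    for cls, proto in prototypes.items():
        members = [m for m, y in zip(mapped, labels) if y == cls]
        if not members:
            updated[cls] = list(proto)  # no samples seen: keep as-is
            continue
        mean = [sum(col) / len(members) for col in zip(*members)]
        updated[cls] = [momentum * p + (1 - momentum) * m
                        for p, m in zip(proto, mean)]
    return updated


# Toy example: one class "a" with a 2-D semantic prototype.
prototypes = {"a": [1.0, 0.0]}
mapped = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # visual mappings of three samples
labels = ["a", "a", "a"]
early = curriculum_subset(mapped, labels, prototypes, keep_fraction=0.5)
late = curriculum_subset(mapped, labels, prototypes, keep_fraction=1.0)
new_protos = update_prototypes(prototypes, mapped, labels, momentum=0.5)
```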

[251] Forecasting Continuous Non-Conservative Dynamical Systems in SO(3)

Lennart Bastian,Mohammad Rashed,Nassir Navab,Tolga Birdal

Main category: cs.CV

TL;DR: This paper proposes a novel method for modeling 3D rotational dynamics using Neural Controlled Differential Equations, which is robust to noise, applicable to non-inertial systems, and generalizes well to real-world scenarios.

Details

Motivation: Modeling the rotation of moving objects is a fundamental task in computer vision, but SO(3) extrapolation faces challenges like unknown physical quantities, non-conservative kinematics due to external forces, and the need for robustness in sparse and noisy observations. Method: The method involves modeling trajectories of noisy pose estimates on the manifold of 3D rotations using Neural Controlled Differential Equations guided with SO(3) Savitzky-Golay paths. Result: The model achieves robust extrapolation capabilities in simulation and real-world settings by learning to approximate object dynamics from noisy states during training, without relying on energy or momentum conservation assumptions. Conclusion: The paper concludes that their proposed method effectively models rotational dynamics in complex, non-inertial systems, showing robustness to noise and generalizing well to real-world scenarios with unknown physical parameters. Abstract: Modeling the rotation of moving objects is a fundamental task in computer vision, yet $SO(3)$ extrapolation still presents numerous challenges: (1) unknown quantities such as the moment of inertia complicate dynamics, (2) the presence of external forces and torques can lead to non-conservative kinematics, and (3) estimating evolving state trajectories under sparse, noisy observations requires robustness. We propose modeling trajectories of noisy pose estimates on the manifold of 3D rotations in a physically and geometrically meaningful way by leveraging Neural Controlled Differential Equations guided with $SO(3)$ Savitzky-Golay paths. Existing extrapolation methods often rely on energy conservation or constant velocity assumptions, limiting their applicability in real-world scenarios involving non-conservative forces. In contrast, our approach is agnostic to energy and momentum conservation while being robust to input noise, making it applicable to complex, non-inertial systems. 
Our approach is easily integrated as a module in existing pipelines and generalizes well to trajectories with unknown physical parameters. By learning to approximate object dynamics from noisy states during training, our model attains robust extrapolation capabilities in simulation and various real-world settings. Code is available at https://github.com/bastianlb/forecasting-rotational-dynamics

[252] GaitSnippet: Gait Recognition Beyond Unordered Sets and Ordered Sequences

Saihui Hou,Chenye Wang,Wenpeng Lang,Zhengxiang Lan,Yongzhen Huang

Main category: cs.CV

TL;DR: This paper introduces a snippet-based approach for gait recognition that effectively captures multi-scale temporal context, outperforming traditional set-based and sequence-based methods.

Details

Motivation: The motivation stems from the limitations of current gait recognition approaches: set-based methods miss short-range temporal context, while sequence-based methods struggle with long-range dependencies. Method: The method involves treating gait as a composition of individualized actions (snippets), with random frames selected from continuous segments, and introduces Snippet Sampling and Snippet Modeling for effective recognition. Result: The proposed method achieves rank-1 accuracy of 77.5% on Gait3D and 81.7% on GREW using a 2D convolution-based backbone, demonstrating its effectiveness. Conclusion: The proposed snippet-based approach for gait recognition addresses the limitations of existing set-based and sequence-based methods by incorporating multi-scale temporal context, showing promising performance on multiple datasets. Abstract: Recent advancements in gait recognition have significantly enhanced performance by treating silhouettes as either an unordered set or an ordered sequence. However, both set-based and sequence-based approaches exhibit notable limitations. Specifically, set-based methods tend to overlook short-range temporal context for individual frames, while sequence-based methods struggle to capture long-range temporal dependencies effectively. To address these challenges, we draw inspiration from human identification and propose a new perspective that conceptualizes human gait as a composition of individualized actions. Each action is represented by a series of frames, randomly selected from a continuous segment of the sequence, which we term a snippet. Fundamentally, the collection of snippets for a given sequence enables the incorporation of multi-scale temporal context, facilitating more comprehensive gait feature learning. Moreover, we introduce a non-trivial solution for snippet-based gait recognition, focusing on Snippet Sampling and Snippet Modeling as key components. 
Extensive experiments on four widely-used gait datasets validate the effectiveness of our proposed approach and, more importantly, highlight the potential of gait snippets. For instance, our method achieves the rank-1 accuracy of 77.5% on Gait3D and 81.7% on GREW using a 2D convolution-based backbone.
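
The definition of a snippet, a series of frames randomly selected from a continuous segment of the sequence, can be sketched as follows. This is an assumed sampling scheme for illustration only: the sequence is split into equal, non-overlapping segments and the selected frames are kept in temporal order. The paper's actual Snippet Sampling strategy may differ, and `sample_snippets` with its parameters is a hypothetical name.

```python
import random


def sample_snippets(num_frames, num_snippets, frames_per_snippet, rng=None):
    """Return a list of snippets, each a sorted list of frame indices.

    The sequence [0, num_frames) is split into num_snippets contiguous
    segments; frames_per_snippet distinct frames are drawn from each.
    """
    rng = rng or random.Random()
    seg_len = num_frames // num_snippets
    if seg_len < frames_per_snippet:
        raise ValueError("segments are too short for the requested snippet size")
    snippets = []
    for s in range(num_snippets):
        start = s * seg_len
        # The last segment absorbs any remainder frames.
        end = num_frames if s == num_snippets - 1 else start + seg_len
        snippets.append(sorted(rng.sample(range(start, end), frames_per_snippet)))
    return snippets


# 30-frame silhouette sequence, 3 snippets of 4 frames each.
snips = sample_snippets(30, 3, 4, rng=random.Random(0))
```

Frames within a snippet stay close in time, which supplies short-range context, while the set of snippets spans the whole sequence, which supplies long-range context. This matches the multi-scale motivation given in the abstract.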

[253] Boosting Active Defense Persistence: A Two-Stage Defense Framework Combining Interruption and Poisoning Against Deepfake

Hongrui Zheng,Yuezun Li,Liejun Wang,Yunfeng Diao,Zhiqing Guo

Main category: cs.CV

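TL;DR: This paper proposes a new Two-Stage Defense Framework (TSDF) to address the lack of persistence in existing active defense strategies. Using dual-function adversarial perturbations, it both directly distorts forged content and disrupts the data preparation process attackers need to retrain their models, with the aim of keeping the defense effective over the long term.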

Details

Motivation: Existing active defense strategies lack persistence: attackers can bypass them by retraining their models, so a defense that remains effective over the long term is needed. Method: An innovative Two-Stage Defense Framework (TSDF) is proposed that uses dual-function adversarial perturbations to directly distort forged results and to disrupt the data preparation process an attacker needs for retraining. Result: Experiments show that the performance of traditional interruption methods degrades sharply under adversarial retraining, whereas TSDF shows a strong dual defense capability and improves the persistence of active defense. Conclusion: TSDF effectively improves the persistence of active defense through its dual defense capability, and the code will be made public to support further research in the field. Abstract: Active defense strategies have been developed to counter the threat of deepfake technology. However, a primary challenge is their lack of persistence, as their effectiveness is often short-lived. Attackers can bypass these defenses by simply collecting protected samples and retraining their models. This means that static defenses inevitably fail when attackers retrain their models, which severely limits practical use. We argue that an effective defense not only distorts forged content but also blocks the model's ability to adapt, which occurs when attackers retrain their models on protected images. To achieve this, we propose an innovative Two-Stage Defense Framework (TSDF). Benefiting from the intensity separation mechanism designed in this paper, the framework uses dual-function adversarial perturbations to perform two roles. First, it can directly distort the forged results. Second, it acts as a poisoning vehicle that disrupts the data preparation process essential for an attacker's retraining pipeline. By poisoning the data source, TSDF aims to prevent the attacker's model from adapting to the defensive perturbations, thus ensuring the defense remains effective long-term. Comprehensive experiments show that the performance of traditional interruption methods degrades sharply when they are subjected to adversarial retraining. However, our framework shows a strong dual defense capability, which can improve the persistence of active defense. Our code will be available at https://github.com/vpsg-research/TSDF.

[254] Power Battery Detection

Xiaoqi Zhao,Peiqian Cao,Lihe Zhang,Zonglei Feng,Hanqi Liu,Jiaming Zuo,Youwei Pang,Weisi Lin,Georges El Fakhri,Huchuan Lu,Xiaofeng Liu

Main category: cs.CV

TL;DR: The paper introduces PBD5K, a large-scale benchmark for power battery detection, and proposes MDCNeXt, a model designed to improve detection accuracy by integrating multi-dimensional structure clues.

Details

Motivation: The motivation behind the paper is to address the challenges of manual inspection inefficiency and error-proneness, as well as the struggles of traditional vision algorithms with densely packed plates, low contrast, scale variation, and imaging artifacts in power battery detection. Method: The paper formulates PBD as a point-level segmentation problem and proposes MDCNeXt, which incorporates two state space modules: a prompt-filtered module and a density-aware reordering module. Additionally, a distance-adaptive mask generation strategy is proposed. Result: The paper introduces PBD5K, the first large-scale benchmark for power battery detection, and an intelligent annotation pipeline for scalable and consistent labeling. Conclusion: The paper concludes by presenting MDCNeXt, a model designed to extract and integrate multi-dimensional structure clues for power battery detection, along with a distance-adaptive mask generation strategy to improve discrimination and suppress visual interference. Abstract: Power batteries are essential components in electric vehicles, where internal structural defects can pose serious safety risks. We conduct a comprehensive study on a new task, power battery detection (PBD), which aims to localize the dense endpoints of cathode and anode plates from industrial X-ray images for quality inspection. Manual inspection is inefficient and error-prone, while traditional vision algorithms struggle with densely packed plates, low contrast, scale variation, and imaging artifacts. To address this issue and drive more attention into this meaningful task, we present PBD5K, the first large-scale benchmark for this task, consisting of 5,000 X-ray images from nine battery types with fine-grained annotations and eight types of real-world visual interference. 
To support scalable and consistent labeling, we develop an intelligent annotation pipeline that combines image filtering, model-assisted pre-labeling, cross-verification, and layered quality evaluation. We formulate PBD as a point-level segmentation problem and propose MDCNeXt, a model designed to extract and integrate multi-dimensional structure clues including point, line, and count information from the plate itself. To improve discrimination between plates and suppress visual interference, MDCNeXt incorporates two state space modules. The first is a prompt-filtered module that learns contrastive relationships guided by task-specific prompts. The second is a density-aware reordering module that refines segmentation in regions with high plate density. In addition, we propose a distance-adaptive mask generation strategy to provide robust supervision under varying spatial distributions of anode and cathode positions. The source code and datasets (PBD5K) will be publicly available at https://github.com/Xiaoqi-Zhao-DLUT/X-ray-PBD.

[255] MambaTrans: Multimodal Fusion Image Translation via Large Language Model Priors for Downstream Visual Tasks

Yushen Xu,Xiaosong Li,Zhenyu Kuang,Xiaoqi Cheng,Haishu Tan,Huafeng Li

Main category: cs.CV

TL;DR: This paper introduces MambaTrans, a new method for adapting multimodal fused images to improve object detection and semantic segmentation tasks without modifying pre-trained models.

Details

Motivation: The motivation stems from the issue that existing downstream pre-training models are primarily trained on visible images, which can lead to degraded performance when applied to multimodal fused images due to significant pixel distribution differences. Method: The paper proposes MambaTrans, a novel multimodal fusion image modality translator that uses a Multi-Model State Space Block combining mask-image-text cross-attention and a 3D-Selective Scan Module to enhance visual capabilities. Result: Experiments on public datasets show that MambaTrans effectively improves the performance of multimodal images in downstream tasks. Conclusion: MambaTrans effectively enhances the performance of multimodal image fusion in downstream tasks, such as object detection and semantic segmentation, without requiring adjustments to pre-trained models. Abstract: The goal of multimodal image fusion is to integrate complementary information from infrared and visible images, generating multimodal fused images for downstream tasks. Existing downstream pre-training models are typically trained on visible images. However, the significant pixel distribution differences between visible and multimodal fusion images can degrade downstream task performance, sometimes even below that of using only visible images. This paper explores adapting multimodal fused images with significant modality differences to object detection and semantic segmentation models trained on visible images. To address this, we propose MambaTrans, a novel multimodal fusion image modality translator. MambaTrans uses descriptions from a multimodal large language model and masks from semantic segmentation models as input. Its core component, the Multi-Model State Space Block, combines mask-image-text cross-attention and a 3D-Selective Scan Module, enhancing pure visual capabilities. 
By leveraging object detection prior knowledge, MambaTrans minimizes detection loss during training and captures long-term dependencies among text, masks, and images. This enables favorable results in pre-trained models without adjusting their parameters. Experiments on public datasets show that MambaTrans effectively improves multimodal image performance in downstream tasks.

[256] Pose-RFT: Enhancing MLLMs for 3D Pose Generation via Hybrid Action Reinforcement Fine-Tuning

Bao Li,Xiaomei Zhang,Miao Xu,Zhaoxin Fan,Xiangyu Zhu,Zhen Lei

Main category: cs.CV

TL;DR: Pose-RFT introduces a hybrid reinforcement learning framework to enhance 3D human pose generation in MLLMs, outperforming existing methods in both image-to-pose and text-to-pose tasks.

Details

Motivation: Existing pose-specific MLLMs struggle with modeling ambiguity and achieving task-specific alignment for accurate 3D pose generation due to reliance on supervised objectives like SMPL parameter regression. Method: Pose-RFT formulates 3D pose generation as a hybrid action reinforcement learning problem, combining discrete language prediction and continuous pose generation, optimized using the HyGRPO algorithm with group-wise reward normalization and task-specific reward functions. Result: Pose-RFT demonstrates significant performance improvements over existing MLLMs on multiple 3D pose generation benchmarks, validating its effectiveness in capturing spatial and semantic correspondences. Conclusion: Pose-RFT, a reinforcement fine-tuning framework, significantly enhances 3D human pose generation in MLLMs by addressing ambiguity and task-specific alignment through HyGRPO and task-specific reward functions. Abstract: Generating 3D human poses from multimodal inputs such as images or text requires models to capture both rich spatial and semantic correspondences. While pose-specific multimodal large language models (MLLMs) have shown promise in this task, they are typically trained with supervised objectives such as SMPL parameter regression or token-level prediction, which struggle to model the inherent ambiguity and achieve task-specific alignment required for accurate 3D pose generation. To address these limitations, we propose Pose-RFT, a reinforcement fine-tuning framework tailored for 3D human pose generation in MLLMs. We formulate the task as a hybrid action reinforcement learning problem that jointly optimizes discrete language prediction and continuous pose generation. To this end, we introduce HyGRPO, a hybrid reinforcement learning algorithm that performs group-wise reward normalization over sampled responses to guide joint optimization of discrete and continuous actions. 
Pose-RFT further incorporates task-specific reward functions to guide optimization towards spatial alignment in image-to-pose generation and semantic consistency in text-to-pose generation. Extensive experiments on multiple pose generation benchmarks demonstrate that Pose-RFT significantly improves performance over existing pose-specific MLLMs, validating the effectiveness of hybrid action reinforcement fine-tuning for 3D pose generation.
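The abstract above mentions group-wise reward normalization over sampled responses but does not give HyGRPO's exact formula. As a minimal sketch, assuming the common GRPO-style standardization (subtract the group mean, divide by the group standard deviation), the idea can be illustrated as follows. The function name, the `eps` constant, and the reward values are hypothetical and not taken from the paper.

```python
from statistics import mean, pstdev


def group_normalized_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses.

    Assumption: GRPO-style normalization, not necessarily HyGRPO's exact rule.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps keeps the division finite when every reward in the group is equal
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled from the same prompt
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_normalized_advantages(rewards)
```

Responses that score above the group mean receive a positive advantage and those below receive a negative one, so the update signal depends on relative rather than absolute reward.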

[257] DiTVR: Zero-Shot Diffusion Transformer for Video Restoration

Sicheng Gao,Nancy Mehta,Zongwei Wu,Radu Timofte

Main category: cs.CV

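TL;DR: This paper proposes DiTVR, a zero-shot video restoration framework that combines a diffusion transformer, trajectory-aware attention, and a wavelet-guided, flow-consistent sampler to address the shortcomings of earlier methods in detail realism, temporal consistency, and data requirements.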

Details

Motivation: Traditional regression-based video restoration methods produce unrealistic details and require large paired datasets, while recent generative diffusion models struggle with temporal consistency. The paper therefore proposes a zero-shot framework to address these problems. Method: DiTVR couples a diffusion transformer with trajectory-aware attention and a wavelet-guided, flow-consistent sampler. The attention mechanism aligns tokens along optical flow trajectories and emphasizes the layers most sensitive to temporal dynamics. A spatiotemporal neighbour cache dynamically selects relevant tokens based on motion correspondences across frames. The sampler injects data consistency only into low-frequency bands, preserving high-frequency priors while accelerating convergence. Result: DiTVR establishes a new zero-shot state of the art on video restoration benchmarks, with superior temporal consistency and detail preservation and robustness to flow noise and occlusions. Conclusion: DiTVR provides an effective zero-shot solution for video restoration that overcomes the limitations of traditional methods in data requirements, detail realism, and temporal consistency. Abstract: Video restoration aims to reconstruct high quality video sequences from low quality inputs, addressing tasks such as super resolution, denoising, and deblurring. Traditional regression based methods often produce unrealistic details and require extensive paired datasets, while recent generative diffusion models face challenges in ensuring temporal consistency. We introduce DiTVR, a zero shot video restoration framework that couples a diffusion transformer with trajectory aware attention and a wavelet guided, flow consistent sampler. Unlike prior 3D convolutional or frame wise diffusion approaches, our attention mechanism aligns tokens along optical flow trajectories, with particular emphasis on vital layers that exhibit the highest sensitivity to temporal dynamics. A spatiotemporal neighbour cache dynamically selects relevant tokens based on motion correspondences across frames. The flow guided sampler injects data consistency only into low-frequency bands, preserving high frequency priors while accelerating convergence. DiTVR establishes a new zero shot state of the art on video restoration benchmarks, demonstrating superior temporal consistency and detail preservation while remaining robust to flow noise and occlusions.

[258] Semi-supervised Multiscale Matching for SAR-Optical Image

Jingze Gai,Changchun Li

Main category: cs.CV

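TL;DR: This paper proposes S2M2-SAR, an efficient semi-supervised SAR-optical image matching method that combines a small amount of labeled data with abundant unlabeled data, outperforming existing semi-supervised methods and performing competitively with fully supervised ones.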

Details

Motivation: Existing SAR-optical image matching methods rely on supervision from pixel-level matched correspondences, and the time-consuming, complex manual annotation this requires makes it difficult to collect enough labeled SAR-optical image pairs. Method: The paper designs a semi-supervised SAR-optical image matching pipeline that combines scarce labeled data with abundant unlabeled data, and proposes a semi-supervised multiscale matching method (S2M2-SAR). Result: Experiments show that S2M2-SAR outperforms existing semi-supervised methods on benchmark datasets and achieves performance competitive with fully supervised SOTA methods. Conclusion: S2M2-SAR not only surpasses existing semi-supervised methods but also performs competitively with fully supervised SOTA methods, demonstrating its efficiency and practical potential. Abstract: Driven by the complementary nature of optical and synthetic aperture radar (SAR) images, SAR-optical image matching has garnered significant interest. Most existing SAR-optical image matching methods aim to capture effective matching features by employing the supervision of pixel-level matched correspondences within SAR-optical image pairs, which, however, suffers from time-consuming and complex manual annotation, making it difficult to collect sufficient labeled SAR-optical image pairs. To handle this, we design a semi-supervised SAR-optical image matching pipeline that leverages both scarce labeled and abundant unlabeled image pairs and propose a semi-supervised multiscale matching for SAR-optical image matching (S2M2-SAR). Specifically, we pseudo-label those unlabeled SAR-optical image pairs with pseudo ground-truth similarity heatmaps by combining both deep and shallow level matching results, and train the matching model by employing labeled and pseudo-labeled similarity heatmaps. In addition, we introduce a cross-modal feature enhancement module trained using a cross-modality mutual independence loss, which requires no ground-truth labels. This unsupervised objective promotes the separation of modality-shared and modality-specific features by encouraging statistical independence between them, enabling effective feature disentanglement across optical and SAR modalities. To evaluate the effectiveness of S2M2-SAR, we compare it with existing competitors on benchmark datasets. 
Experimental results demonstrate that S2M2-SAR not only surpasses existing semi-supervised methods but also achieves performance competitive with fully supervised SOTA methods, demonstrating its efficiency and practical potential.

[259] Segmenting and Understanding: Region-aware Semantic Attention for Fine-grained Image Quality Assessment with Large Language Models

Chenyue Song,Chen Hui,Haiqi Zhu,Feng Jiang,Yachun Mi,Wei Zhang,Shaohui Liu

Main category: cs.CV

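TL;DR: This paper proposes RSFIQA, a new no-reference image quality assessment model that integrates region-level distortion information to perceive multi-dimensional quality discrepancies, improving sensitivity to local quality variations and performing well on multiple benchmark datasets.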

Details

Motivation: Existing no-reference image quality assessment methods either focus on global representations and offer limited insight into semantically salient regions, or weight region features uniformly, which weakens sensitivity to local quality variations. Method: The Segment Anything Model (SAM) dynamically partitions the input image into non-overlapping semantic regions, and a Multi-modal Large Language Model (MLLM) extracts descriptive content and perceives multi-dimensional distortions for each region. A Region-Aware Semantic Attention (RSA) mechanism then generates a global attention map. Result: RSFIQA achieves competitive quality prediction performance on multiple benchmark datasets, demonstrating the robustness and effectiveness of the method. Conclusion: RSFIQA is a no-reference image quality assessment model that integrates region-level distortion information to perceive multi-dimensional quality discrepancies. It is backbone-agnostic and can be integrated into various deep neural network architectures. Abstract: No-reference image quality assessment (NR-IQA) aims to simulate the process of perceiving image quality aligned with subjective human perception. However, existing NR-IQA methods either focus on global representations that lead to limited insights into the semantically salient regions or employ a uniform weighting for region features that weakens the sensitivity to local quality variations. In this paper, we propose a fine-grained image quality assessment model, named RSFIQA, which integrates region-level distortion information to perceive multi-dimensional quality discrepancies. To enhance regional quality awareness, we first utilize the Segment Anything Model (SAM) to dynamically partition the input image into non-overlapping semantic regions. For each region, we teach a powerful Multi-modal Large Language Model (MLLM) to extract descriptive content and perceive multi-dimensional distortions, enabling a comprehensive understanding of both local semantics and quality degradations. To effectively leverage this information, we introduce the Region-Aware Semantic Attention (RSA) mechanism, which generates a global attention map by aggregating fine-grained representations from local regions. In addition, RSFIQA is backbone-agnostic and can be seamlessly integrated into various deep neural network architectures. Extensive experiments demonstrate the robustness and effectiveness of the proposed method, which achieves competitive quality prediction performance across multiple benchmark datasets.

[260] Architectural Co-Design for Zero-Shot Anomaly Detection: Decoupling Representation and Dynamically Fusing Features in CLIP

Ke Ma,Jun Long,Hongxiao Fei,Liujie Hua,Yueyi Luo

Main category: cs.CV

TL;DR: This paper proposes an Architectural Co-Design framework that enhances Vision-Language Models for Zero-Shot Anomaly Detection by integrating local inductive biases and enabling adaptive text-visual fusion, resulting in better accuracy and robustness across diverse benchmarks.

Details

Motivation: The motivation is to bridge the adaptation gap in applying pre-trained Vision-Language Models to Zero-Shot Anomaly Detection due to their lack of local inductive biases and reliance on inflexible feature fusion paradigms. Method: The method involves a Convolutional Low-Rank Adaptation (Conv-LoRA) adapter for local inductive biases and a Dynamic Fusion Gateway (DFG) that adaptively modulates text prompts based on visual context. Result: Extensive experiments on various industrial and medical benchmarks showed superior accuracy and robustness, highlighting the importance of the co-design approach in adapting foundation models to dense perception tasks. Conclusion: The Architectural Co-Design framework effectively addresses the limitations of pre-trained Vision-Language Models in Zero-Shot Anomaly Detection by integrating local inductive biases and enabling adaptive bidirectional fusion, leading to improved performance in dense perception tasks. Abstract: Pre-trained Vision-Language Models (VLMs) face a significant adaptation gap when applied to Zero-Shot Anomaly Detection (ZSAD), stemming from their lack of local inductive biases for dense prediction and their reliance on inflexible feature fusion paradigms. We address these limitations through an Architectural Co-Design framework that jointly refines feature representation and cross-modal fusion. Our method integrates a parameter-efficient Convolutional Low-Rank Adaptation (Conv-LoRA) adapter to inject local inductive biases for fine-grained representation, and introduces a Dynamic Fusion Gateway (DFG) that leverages visual context to adaptively modulate text prompts, enabling a powerful bidirectional fusion. Extensive experiments on diverse industrial and medical benchmarks demonstrate superior accuracy and robustness, validating that this synergistic co-design is critical for robustly adapting foundation models to dense perception tasks.

[261] MIMIC: Multimodal Inversion for Model Interpretation and Conceptualization

Animesh Jain,Alexandros Stergiou

Main category: cs.CV

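TL;DR: This paper proposes MIMIC, a framework for visualizing the internal representations of VLMs, which uses joint inversion and regularization to improve model interpretability.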

Details

Motivation: Existing VLM architectures are complex and difficult to interpret, which limits transparency and trust, so a method for visualizing their internal representations is needed. Method: MIMIC uses a joint VLM-based inversion and a feature alignment objective, and adds three regularizers: spatial alignment, natural image smoothness, and semantic realism. Result: In quantitative and qualitative evaluations, MIMIC inverts visual concepts, and results are reported with both visual quality metrics and semantic text-based metrics. Conclusion: MIMIC is the first model inversion approach to address visual interpretations of VLM concepts. Abstract: Vision Language Models (VLMs) encode multimodal inputs over large, complex, and difficult-to-interpret architectures, which limit transparency and trust. We propose a Multimodal Inversion for Model Interpretation and Conceptualization (MIMIC) framework to visualize the internal representations of VLMs by synthesizing visual concepts corresponding to internal encodings. MIMIC uses a joint VLM-based inversion and a feature alignment objective to account for VLM's autoregressive processing. It additionally includes a triplet of regularizers for spatial alignment, natural image smoothness, and semantic realism. We quantitatively and qualitatively evaluate MIMIC by inverting visual concepts over a range of varying-length free-form VLM output texts. Reported results include both standard visual quality metrics as well as semantic text-based metrics. To the best of our knowledge, this is the first model inversion approach addressing visual interpretations of VLM concepts.

[262] Effortless Vision-Language Model Specialization in Histopathology without Annotation

Jingna Qiu,Nishanth Jain,Jonas Ammeling,Marc Aubreville,Katharina Breininger

Main category: cs.CV

TL;DR: This paper proposes an annotation-free adaptation method for Vision-Language Models (VLMs) in histopathology through continued pretraining on domain- and task-relevant image-caption pairs, which enhances zero-shot and few-shot performance without manual labeling.

Details

Motivation: Recent VLMs in histopathology may lead to suboptimal performance in specific downstream applications due to their general-purpose design. Supervised fine-tuning methods address this issue but require manually labeled samples for adaptation. Method: Continued pretraining on domain- and task-relevant image-caption pairs extracted from existing databases. Result: Experiments on two VLMs across three downstream tasks show that image-caption pairs substantially enhance both zero-shot and few-shot performance; with larger training sizes, continued pretraining matches the performance of few-shot methods while eliminating manual labeling. Conclusion: Continued pretraining is a promising pathway for adapting VLMs to new histopathology tasks since it is effective, task-agnostic, and annotation-free. Abstract: Recent advances in Vision-Language Models (VLMs) in histopathology, such as CONCH and QuiltNet, have demonstrated impressive zero-shot classification capabilities across various tasks. However, their general-purpose design may lead to suboptimal performance in specific downstream applications. While supervised fine-tuning methods address this issue, they require manually labeled samples for adaptation. This paper investigates annotation-free adaptation of VLMs through continued pretraining on domain- and task-relevant image-caption pairs extracted from existing databases. Our experiments on two VLMs, CONCH and QuiltNet, across three downstream tasks reveal that these pairs substantially enhance both zero-shot and few-shot performance. Notably, with larger training sizes, continued pretraining matches the performance of few-shot methods while eliminating manual labeling. Its effectiveness, task-agnostic design, and annotation-free workflow make it a promising pathway for adapting VLMs to new histopathology tasks. Code is available at https://github.com/DeepMicroscopy/Annotation-free-VLM-specialization.

[263] CBDES MoE: Hierarchically Decoupled Mixture-of-Experts for Functional Modules in Autonomous Driving

Qi Xiang,Kunsong Shi,Zhigui Lin,Lei He

Main category: cs.CV

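TL;DR: CBDES MoE is a modular Mixture-of-Experts framework that selects expert paths dynamically and achieves better performance on multi-modal BEV perception for autonomous driving.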

Details

Motivation: Existing multi-modal BEV methods suffer from limited input adaptability, constrained modeling capacity, and suboptimal generalization, so a more efficient and flexible solution is needed. Method: The paper proposes CBDES MoE, an architecture that uses a lightweight Self-Attention Router (SAR) to select expert paths dynamically, enabling sparse, input-aware, efficient inference. Result: On the nuScenes dataset, CBDES MoE outperforms fixed single-expert baselines in 3D object detection, gaining 1.6 points in mAP and 4.1 points in NDS over the strongest single-expert model. Conclusion: With a hierarchically decoupled Mixture-of-Experts architecture at the functional module level, CBDES MoE improves the adaptability, modeling capacity, and generalization of multi-modal BEV perception systems for autonomous driving. Abstract: Bird's Eye View (BEV) perception systems based on multi-sensor feature fusion have become a fundamental cornerstone for end-to-end autonomous driving. However, existing multi-modal BEV methods commonly suffer from limited input adaptability, constrained modeling capacity, and suboptimal generalization. To address these challenges, we propose a hierarchically decoupled Mixture-of-Experts architecture at the functional module level, termed Computing Brain DEvelopment System Mixture-of-Experts (CBDES MoE). CBDES MoE integrates multiple structurally heterogeneous expert networks with a lightweight Self-Attention Router (SAR) gating mechanism, enabling dynamic expert path selection and sparse, input-aware efficient inference. To the best of our knowledge, this is the first modular Mixture-of-Experts framework constructed at the functional module granularity within the autonomous driving domain. Extensive evaluations on the real-world nuScenes dataset demonstrate that CBDES MoE consistently outperforms fixed single-expert baselines in 3D object detection. Compared to the strongest single-expert model, CBDES MoE achieves a 1.6-point increase in mAP and a 4.1-point improvement in NDS, demonstrating the effectiveness and practical advantages of the proposed approach.
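The abstract describes dynamic expert path selection through a gating mechanism but gives no details of the Self-Attention Router. As a rough sketch of sparse routing in general, assuming the router has already produced one logit per expert and that only the single highest-scoring expert runs, the selection step might look like this. The toy experts, the logits, and the top-1 rule are illustrative assumptions, not the paper's design.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top1(router_logits, experts, x):
    """Run only the expert with the highest gate probability (sparse inference)."""
    probs = softmax(router_logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return idx, experts[idx](x)


# Hypothetical experts: plain functions standing in for heterogeneous networks
experts = [lambda v: v + 1, lambda v: v * 2, lambda v: -v]
chosen, output = route_top1([0.1, 2.0, -1.0], experts, 3)
```

Because the unselected experts are never called, the cost per input stays close to that of a single expert even as more experts are added.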

[264] Deep Space Weather Model: Long-Range Solar Flare Prediction from Multi-Wavelength Images

Shunya Nagashima,Komei Sugiura

Main category: cs.CV

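TL;DR: This paper proposes Deep SWM, a new solar flare prediction method that combines deep state space models with a sparse masked autoencoder, improving prediction performance and reliability.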

Details

Motivation: Existing methods based on heuristic physical features lack representation learning from solar images, while end-to-end learning approaches struggle to model long-range temporal dependencies in solar images, so a more accurate and reliable solar flare prediction method is needed. Method: The paper proposes Deep SWM, which uses deep state space models to handle ten-channel solar images and long-range spatio-temporal dependencies, and pretrains with a sparse masked autoencoder. Result: Deep SWM outperforms baseline methods, and even human experts, on standard metrics in terms of performance and reliability. Conclusion: By combining deep state space models with a sparse masked autoencoder, Deep SWM handles multi-channel solar images and long-range spatio-temporal dependencies effectively and improves the performance and reliability of solar flare prediction. Abstract: Accurate, reliable solar flare prediction is crucial for mitigating potential disruptions to critical infrastructure, while predicting solar flares remains a significant challenge. Existing methods based on heuristic physical features often lack representation learning from solar images. On the other hand, end-to-end learning approaches struggle to model long-range temporal dependencies in solar images. In this study, we propose Deep Space Weather Model (Deep SWM), which is based on multiple deep state space models for handling both ten-channel solar images and long-range spatio-temporal dependencies. Deep SWM also features a sparse masked autoencoder, a novel pretraining strategy that employs a two-phase masking approach to preserve crucial regions such as sunspots while compressing spatial information. Furthermore, we built FlareBench, a new public benchmark for solar flare prediction covering a full 11-year solar activity cycle, to validate our method. Our method outperformed baseline methods and even human expert performance on standard metrics in terms of performance and reliability. The project page can be found at https://keio-smilab25.github.io/DeepSWM.

[265] Morphological Analysis of Semiconductor Microstructures using Skeleton Graphs

Noriko Nitta,Rei Miyata,Naoto Oishi

Main category: cs.CV

TL;DR: This paper investigates how ion beam irradiation affects the morphology of Ge surfaces, finding that irradiation angle has a greater influence than fluence.

Details

Motivation: To understand the impact of ion beam irradiation parameters on the morphological properties of Ge surfaces. Method: Electron microscopy images were processed to extract topological features as skeleton graphs, embedded using a graph convolutional network, and analyzed using principal component analysis. Cluster separability was evaluated using the Davies-Bouldin index. Result: Variations in irradiation angle significantly affect the morphology of Ge surfaces, more so than variations in irradiation fluence. Conclusion: Irradiation angle has a more significant impact on the morphological properties of Ge surfaces compared to irradiation fluence. Abstract: In this paper, electron microscopy images of microstructures formed on Ge surfaces by ion beam irradiation were processed to extract topological features as skeleton graphs, which were then embedded using a graph convolutional network. The resulting embeddings were analyzed using principal component analysis, and cluster separability in the resulting PCA space was evaluated using the Davies-Bouldin index. The results indicate that variations in irradiation angle have a more significant impact on the morphological properties of Ge surfaces than variations in irradiation fluence.
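For readers unfamiliar with the Davies-Bouldin index used above, the following minimal sketch computes it from its standard definition: for each cluster, take the worst-case ratio of summed within-cluster scatter to centroid distance against any other cluster, then average over clusters. Lower values indicate better-separated clusters. The 2D points are toy data standing in for PCA-projected embeddings and are not taken from the paper.

```python
import math


def centroid(points):
    """Mean position of a non-empty list of equal-length coordinate tuples."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters (each a list of points)."""
    cents = [centroid(c) for c in clusters]
    # Scatter: mean distance from each point to its own cluster centroid
    scatter = [sum(math.dist(p, m) for p in c) / len(c)
               for c, m in zip(clusters, cents)]
    k = len(clusters)
    worst = [max((scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
                 for j in range(k) if j != i)
             for i in range(k)]
    return sum(worst) / k


# Toy data: the same two clusters placed far apart and close together
far_apart = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
close_by = [[(0, 0), (0, 2)], [(2, 0), (2, 2)]]
db_far = davies_bouldin(far_apart)
db_close = davies_bouldin(close_by)
```

In this toy case each cluster has scatter 1, so the index is 2/10 for the well-separated pair and 2/2 for the crowded pair.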

[266] Tracking Any Point Methods for Markerless 3D Tissue Tracking in Endoscopic Stereo Images

Konrad Reuter,Suresh Guttikonda,Sarah Latus,Lennart Maack,Christian Betz,Tobias Maurer,Alexander Schlaefer

Main category: cs.CV

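TL;DR: This paper proposes a markerless 3D tissue tracking method based on TAP networks for accurate tissue tracking in minimally invasive surgery, with the goal of supporting surgical guidance and improving safety.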

Details

Motivation: Dynamic tissue motion and a limited field of view make minimally invasive surgery challenging, and accurate tissue tracking is needed to improve surgical safety and outcomes. Method: Two CoTracker models are combined, one for temporal tracking and one for stereo matching, to estimate 3D motion from stereo endoscopic images. Result: The system was evaluated with a clinical laparoscopic setup and a robotic arm simulating tissue motion. Tracking on a chicken tissue phantom was more reliable, with Euclidean distance errors as low as 1.1 mm at a velocity of 10 mm/s. Conclusion: TAP-based methods show potential for accurate, markerless 3D tracking in challenging surgical scenarios. Abstract: Minimally invasive surgery presents challenges such as dynamic tissue motion and a limited field of view. Accurate tissue tracking has the potential to support surgical guidance, improve safety by helping avoid damage to sensitive structures, and enable context-aware robotic assistance during complex procedures. In this work, we propose a novel method for markerless 3D tissue tracking by leveraging 2D Tracking Any Point (TAP) networks. Our method combines two CoTracker models, one for temporal tracking and one for stereo matching, to estimate 3D motion from stereo endoscopic images. We evaluate the system using a clinical laparoscopic setup and a robotic arm simulating tissue motion, with experiments conducted on a synthetic 3D-printed phantom and a chicken tissue phantom. Tracking on the chicken tissue phantom yielded more reliable results, with Euclidean distance errors as low as 1.1 mm at a velocity of 10 mm/s. These findings highlight the potential of TAP-based models for accurate, markerless 3D tracking in challenging surgical scenarios.
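The abstract does not say how matched 2D points are lifted to 3D. One standard option, assuming a rectified pinhole stereo pair with known focal length, baseline, and principal point, is disparity-based triangulation, sketched below together with the Euclidean distance used as the error measure. All camera parameters and pixel coordinates here are made-up illustrative values, and the function name is hypothetical.

```python
import math


def triangulate(u_left, v_left, u_right, focal_px, baseline_mm, cx, cy):
    """Lift a matched point from a rectified stereo pair to 3D (in mm).

    Assumption: rectified pinhole cameras, so depth = focal * baseline / disparity.
    """
    disparity = u_left - u_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a valid match")
    z = focal_px * baseline_mm / disparity
    x = (u_left - cx) * z / focal_px
    y = (v_left - cy) * z / focal_px
    return (x, y, z)


# Hypothetical camera: 1000 px focal length, 5 mm baseline, 1280x720 image
point = triangulate(u_left=700, v_left=360, u_right=650,
                    focal_px=1000, baseline_mm=5, cx=640, cy=360)

# Euclidean distance error against a hypothetical ground-truth position
ground_truth = (6.0, 0.0, 101.0)
error_mm = math.dist(point, ground_truth)
```

In such a setup the temporal tracker would supply the left-image coordinates over time and the stereo matcher the corresponding right-image coordinate, with triangulation applied per frame.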

[267] Being-M0.5: A Real-Time Controllable Vision-Language-Motion Model

Bin Cao,Sipeng Zheng,Ye Wang,Lujie Xia,Qianshan Wei,Qin Jin,Jing Liu,Zongqing Lu

Main category: cs.CV

TL;DR: This paper introduces Being-M0.5, a real-time, controllable vision-language-motion model that overcomes key limitations in existing models, enabling precise, granular motion control for practical applications.

Details

Motivation: Existing vision-language-motion models (VLMMs) face limitations in controllability, including poor response to commands, limited pose initialization, long-term sequence performance, handling of unseen scenarios, and lack of fine-grained body control. These issues hinder practical deployment. Method: The study introduces Being-M0.5, a real-time, controllable vision-language-motion model (VLMM), built using the HuMo100M dataset. A part-aware residual quantization technique was developed for motion tokenization to enable fine-grained control over body parts. Result: Being-M0.5 achieves state-of-the-art performance in motion generation tasks with precise control over individual body parts and real-time capabilities, validated through extensive experiments and efficiency analysis. Conclusion: Being-M0.5 represents a significant advancement in human motion generation technology, offering precise control and real-time performance, and is expected to accelerate the adoption of such technologies in real-world applications. Abstract: Human motion generation has emerged as a critical technology with transformative potential for real-world applications. However, existing vision-language-motion models (VLMMs) face significant limitations that hinder their practical deployment. We identify controllability as a main bottleneck, manifesting in five key aspects: inadequate response to diverse human commands, limited pose initialization capabilities, poor performance on long-term sequences, insufficient handling of unseen scenarios, and lack of fine-grained control over individual body parts. To overcome these limitations, we present Being-M0.5, the first real-time, controllable VLMM that achieves state-of-the-art performance across multiple motion generation tasks. 
Our approach is built upon HuMo100M, the largest and most comprehensive human motion dataset to date, comprising over 5 million self-collected motion sequences, 100 million multi-task instructional instances, and detailed part-level annotations that address a critical gap in existing datasets. We introduce a novel part-aware residual quantization technique for motion tokenization that enables precise, granular control over individual body parts during generation. Extensive experimental validation demonstrates Being-M0.5's superior performance across diverse motion benchmarks, while comprehensive efficiency analysis confirms its real-time capabilities. Our contributions include design insights and detailed computational analysis to guide future development of practical motion generators. We believe that HuMo100M and Being-M0.5 represent significant advances that will accelerate the adoption of motion generation technologies in real-world applications. The project page is available at https://beingbeyond.github.io/Being-M0.5.

[268] CATP: Contextually Adaptive Token Pruning for Efficient and Enhanced Multimodal In-Context Learning

Yanshu Li,Jianjiang Yang,Zhennan Shen,Ligong Han,Haoyan Xu,Ruixiang Tang

Main category: cs.CV

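TL;DR: This paper proposes CATP, a training-free pruning method that addresses image token redundancy in multimodal in-context learning. CATP prunes progressively in two stages to account for the complex cross-modal interactions in the input sequence. Across multiple models and benchmarks it outperforms all baselines, improving both efficiency and performance.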

Details

Motivation: Existing image token pruning methods mainly consider single-image tasks and overlook multimodal in-context learning, where redundancy is greater and efficiency is more critical. Method: CATP prunes progressively in two stages to account for the complex cross-modal interactions in the input sequence. Result: After removing 77.8% of the image tokens, CATP improves average performance by 0.6% over the vanilla model across four LVLMs and eight benchmarks, and reduces inference latency by 10.78% on average. Conclusion: CATP is a training-free pruning method for multimodal in-context learning that improves both efficiency and performance. Abstract: Modern large vision-language models (LVLMs) convert each input image into a large set of tokens, far outnumbering the text tokens. Although this improves visual perception, it introduces severe image token redundancy. Because image tokens carry sparse information, many add little to reasoning, yet greatly increase inference cost. The emerging image token pruning methods tackle this issue by identifying the most important tokens and discarding the rest. These methods can raise efficiency with only modest performance loss. However, most of them only consider single-image tasks and overlook multimodal in-context learning (ICL), where redundancy is greater and efficiency is more critical. Redundant tokens weaken the advantage of multimodal ICL for rapid domain adaptation and cause unstable performance. Applying existing pruning methods in this setting leads to large accuracy drops, exposing a clear gap and the need for new techniques. Thus, we propose Contextually Adaptive Token Pruning (CATP), a training-free pruning method targeted at multimodal ICL. CATP consists of two stages that perform progressive pruning to fully account for the complex cross-modal interactions in the input sequence. After removing 77.8% of the image tokens, CATP produces an average performance gain of 0.6% over the vanilla model on four LVLMs and eight benchmarks, exceeding all baselines remarkably. Meanwhile, it effectively improves efficiency by achieving an average reduction of 10.78% in inference latency. CATP enhances the practical value of multimodal ICL and lays the groundwork for future progress in interleaved image-text scenarios.
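The abstract describes image token pruning in general as identifying the most important tokens and discarding the rest, but it does not specify how CATP scores tokens in its two stages. The sketch below shows only the generic keep-the-top-scoring-tokens step under a fixed keep ratio. It is not CATP itself, and the importance scores are hypothetical placeholders for whatever criterion a real method would compute.

```python
def prune_tokens(scores, keep_ratio):
    """Return the indices of the highest-scoring tokens, in original order.

    Assumption: one precomputed importance score per image token.
    """
    n_keep = max(1, round(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore sequence order so positional structure is preserved
    return sorted(ranked[:n_keep])


# Hypothetical importance scores for nine image tokens
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.05, 0.4, 0.6, 0.7]
# Removing 77.8% of tokens corresponds to keeping 22.2% of them
kept = prune_tokens(scores, keep_ratio=1 - 0.778)
```

With nine tokens, a 22.2% keep ratio rounds to two survivors, here the tokens at positions 1 and 3.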

[269] Selective Contrastive Learning for Weakly Supervised Affordance Grounding

WonJun Moon,Hyun Seok Seong,Jae-Pil Heo

Main category: cs.CV

TL;DR: This paper introduces a novel method for identifying functional parts of objects through a combination of CLIP-based object identification and contrastive learning objectives, improving the understanding of object affordances without needing detailed pixel-level annotations.

Details

Motivation: The motivation is to overcome the limitations of existing models that rely heavily on classification and focus on irrelevant class-specific patterns, rather than affordance-related parts. Method: The method involves using CLIP to identify action-associated objects in egocentric and exocentric images, followed by cross-referencing these objects to discover part-level affordance clues. It incorporates selective prototypical and pixel contrastive objectives to adaptively learn affordance-relevant cues at both part and object levels. Result: Experimental results show that the approach effectively identifies affordance-relevant regions, shifting activation from irrelevant areas to meaningful cues, thereby enhancing object interaction capabilities. Conclusion: The paper concludes that the proposed method effectively improves the identification of affordance-relevant parts in objects through selective prototypical and pixel contrastive objectives. Abstract: Facilitating an entity's interaction with objects requires accurately identifying parts that afford specific actions. Weakly supervised affordance grounding (WSAG) seeks to imitate human learning from third-person demonstrations, where humans intuitively grasp functional parts without needing pixel-level annotations. To achieve this, grounding is typically learned using a shared classifier across images from different perspectives, along with distillation strategies incorporating part discovery process. However, since affordance-relevant parts are not always easily distinguishable, models primarily rely on classification, often focusing on common class-specific patterns that are unrelated to affordance. To address this limitation, we move beyond isolated part-level learning by introducing selective prototypical and pixel contrastive objectives that adaptively learn affordance-relevant cues at both the part and object levels, depending on the granularity of the available information. 
Initially, we find the action-associated objects in both egocentric (object-focused) and exocentric (third-person example) images by leveraging CLIP. Then, by cross-referencing the discovered objects of complementary views, we excavate the precise part-level affordance clues in each perspective. By consistently learning to distinguish affordance-relevant regions from affordance-irrelevant background context, our approach effectively shifts activation from irrelevant areas toward meaningful affordance cues. Experimental results demonstrate the effectiveness of our method. Codes are available at github.com/hynnsk/SelectiveCL.

[270] TAP: Parameter-efficient Task-Aware Prompting for Adverse Weather Removal

Hanting Wang,Shengpeng Ji,Shulei Wang,Hai Huang,Xiao Jin,Qifei Zhang,Tao Jin

Main category: cs.CV

TL;DR: This paper proposes a parameter-efficient All-in-One framework for image restoration under adverse weather conditions, using task-aware enhanced prompts to improve performance while significantly reducing model parameters.

Details

Motivation: Most existing All-in-One image restoration methods require dedicated network modules or parameters for each degradation type, overlooking inter-task relatedness and leading to significant parameter overhead. Method: A two-stage training paradigm (pretraining and prompt-tuning) is used, incorporating low-rank decomposition and contrastive constraints on trainable soft prompts to enhance task modeling and parameter efficiency. Result: The proposed method achieves superior performance on various image restoration tasks while maintaining high parameter efficiency, as evidenced by t-SNE analysis and experimental results. Conclusion: The proposed parameter-efficient All-in-One image restoration framework effectively handles various adverse weather degradations by leveraging task-aware enhanced prompts, achieving superior performance with only 2.75M parameters. Abstract: Image restoration under adverse weather conditions has been extensively explored, leading to numerous high-performance methods. In particular, recent advances in All-in-One approaches have shown impressive results by training on multi-task image restoration datasets. However, most of these methods rely on dedicated network modules or parameters for each specific degradation type, resulting in a significant parameter overhead. Moreover, the relatedness across different restoration tasks is often overlooked. In light of these issues, we propose a parameter-efficient All-in-One image restoration framework that leverages task-aware enhanced prompts to tackle various adverse weather degradations. Specifically, we adopt a two-stage training paradigm consisting of a pretraining phase and a prompt-tuning phase to mitigate parameter conflicts across tasks. We first employ supervised learning to acquire general restoration knowledge, and then adapt the model to handle specific degradation via trainable soft prompts. Crucially, we enhance these task-specific prompts in a task-aware manner. 
We apply low-rank decomposition to these prompts to capture both task-general and task-specific characteristics, and impose contrastive constraints to better align them with the actual inter-task relatedness. These enhanced prompts not only improve the parameter efficiency of the restoration model but also enable more accurate task modeling, as evidenced by t-SNE analysis. Experimental results on different restoration tasks demonstrate that the proposed method achieves superior performance with only 2.75M parameters.
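
To illustrate why a low-rank decomposition of prompts can save parameters, here is a minimal sketch. It assumes one possible factorization: a shared (rank x dim) basis holding task-general structure and a per-task (prompt_len x rank) coefficient matrix holding task-specific structure. The abstract does not specify the actual factorization, so the shapes and the names `compose_prompt` and `param_counts` are hypothetical.

```python
def compose_prompt(task_coeffs, shared_basis):
    """Multiply an (L x r) task-specific matrix by an (r x d) shared basis."""
    rank = len(shared_basis)
    dim = len(shared_basis[0])
    return [
        [sum(row[k] * shared_basis[k][j] for k in range(rank)) for j in range(dim)]
        for row in task_coeffs
    ]


def param_counts(num_tasks, prompt_len, dim, rank):
    """Compare full per-task prompts against the assumed low-rank form."""
    full = num_tasks * prompt_len * dim
    low_rank = num_tasks * prompt_len * rank + rank * dim
    return full, low_rank


prompt = compose_prompt([[1, 0], [0, 2]], [[1, 2, 3], [4, 5, 6]])
full, low_rank = param_counts(num_tasks=3, prompt_len=8, dim=64, rank=4)
```

Under these illustrative sizes the low-rank form needs 352 values instead of 1536; the numbers are arbitrary and unrelated to the 2.75M parameters reported in the paper.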

[271] NeeCo: Image Synthesis of Novel Instrument States Based on Dynamic and Deformable 3D Gaussian Reconstruction

Tianle Zeng,Junlei Hu,Gerardo Loza Galindo,Sharib Ali,Duygu Sarikaya,Pietro Valdastri,Dominic Jones

Main category: cs.CV

TL;DR: This study introduces a new dynamic Gaussian Splatting approach to overcome data scarcity in surgical automation, enabling realistic synthetic image generation and improving neural network performance.

Details

Motivation: Current data-driven approaches in surgical automation are limited by the need for large, high-quality labeled datasets. Method: A dynamic Gaussian model was introduced to represent surgical scenes, enabling synthetic image rendering and automatic annotation. A dynamic training adjustment strategy was also implemented. Result: The method produced photo-realistic synthetic images with the highest PSNR (29.87); models trained on its synthetic images outperformed those trained with state-of-the-art standard data augmentation by 10%, for an overall improvement in model performance of nearly 15%. Conclusion: The proposed dynamic Gaussian Splatting technique effectively addresses data scarcity in surgical image datasets and enhances the performance of medical-specific neural networks. Abstract: Computer vision-based technologies significantly enhance surgical automation by advancing tool tracking, detection, and localization. However, current data-driven approaches are data-voracious, requiring large, high-quality labeled image datasets, which limits their application in surgical data science. Our work introduces a novel dynamic Gaussian Splatting technique to address the data scarcity in surgical image datasets. We propose a dynamic Gaussian model to represent dynamic surgical scenes, enabling the rendering of surgical instruments from unseen viewpoints and deformations with real tissue backgrounds. We utilize a dynamic training adjustment strategy to address challenges posed by poorly calibrated camera poses from real-world scenarios. Additionally, we propose a method based on dynamic Gaussians for automatically generating annotations for our synthetic data. For evaluation, we constructed a new dataset featuring seven scenes with 14,000 frames of tool and camera motion and tool jaw articulation, with a background of an ex-vivo porcine model. Using this dataset, we synthetically replicate the scene deformation from the ground truth data, allowing direct comparisons of synthetic image quality. 
Experimental results illustrate that our method generates photo-realistic labeled image datasets with the highest values in Peak-Signal-to-Noise Ratio (29.87). We further evaluate the performance of medical-specific neural networks trained on real and synthetic images using an unseen real-world image dataset. Our results show that the performance of models trained on synthetic images generated by the proposed method outperforms those trained with state-of-the-art standard data augmentation by 10%, leading to an overall improvement in model performances by nearly 15%.
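
For readers unfamiliar with the metric, the sketch below shows the standard Peak Signal-to-Noise Ratio definition, 10 * log10(MAX^2 / MSE). It assumes 8-bit intensities (peak value 255) and flattened pixel lists; the paper's evaluation protocol is not described in the abstract, so this is only the textbook formula.

```python
import math


def psnr(reference, rendered, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(rendered) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


value = psnr([10, 20, 30], [11, 19, 31])  # every pixel is off by one, so MSE = 1
```

Higher values indicate that the rendered image is closer to the reference.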

[272] Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation

Bowen Xue,Qixin Yan,Wenjing Wang,Hao Liu,Chen Li

Main category: cs.CV

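TL;DR: This paper proposes Stand-In, a lightweight video generation framework that achieves efficient identity preservation and versatile integration with only a small number of additional parameters.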

Details

Motivation: Existing methods rely on an excessive number of training parameters and lack compatibility with other AIGC tools, calling for an efficient and flexible solution. Method: A conditional image branch and restricted self-attention are introduced, so that identity control is learned quickly with about 1% additional parameters. Result: The framework learns quickly from only 2000 pairs and outperforms full-parameter training methods in video quality and identity preservation. Conclusion: Stand-In achieves efficient identity preservation in video generation while remaining lightweight and easy to integrate with other tasks. Abstract: Generating high-fidelity human videos that match user-specified identities is important yet challenging in the field of generative AI. Existing methods often rely on an excessive number of training parameters and lack compatibility with other AIGC tools. In this paper, we propose Stand-In, a lightweight and plug-and-play framework for identity preservation in video generation. Specifically, we introduce a conditional image branch into the pre-trained video generation model. Identity control is achieved through restricted self-attentions with conditional position mapping, and can be learned quickly with only 2000 pairs. Despite incorporating and training just ~1% additional parameters, our framework achieves excellent results in video quality and identity preservation, outperforming other full-parameter training methods. Moreover, our framework can be seamlessly integrated for other tasks, such as subject-driven video generation, pose-referenced video generation, stylization, and face swapping.

[273] CTC Transcription Alignment of the Bullinger Letters: Automatic Improvement of Annotation Quality

Marco Peer,Anna Scius-Bertrand,Andreas Fischer

Main category: cs.CV

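TL;DR: This paper proposes a self-training method based on a CTC alignment algorithm to address annotation errors, particularly hyphenation issues, in handwritten text recognition for historical documents, and releases a new manually corrected dataset.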

Details

Motivation: Handwritten text recognition for historical documents is challenging because of handwriting variability, degraded sources, and limited layout-aware annotations; annotation errors, especially hyphenation issues, need to be addressed. Method: A self-training method based on a CTC alignment algorithm is introduced, which matches full transcriptions to text line images using dynamic programming and the output probabilities of a model trained with the CTC loss. Result: The method improves performance (e.g., CER improves by 1.1 percentage points with PyLaia) and increases alignment accuracy; weaker models are found to yield more accurate alignments, which enables an iterative training strategy. Conclusion: The method can be applied iteratively to improve both the CER and the alignment quality of text recognition pipelines, and a new manually corrected dataset is released. Abstract: Handwritten text recognition for historical documents remains challenging due to handwriting variability, degraded sources, and limited layout-aware annotations. In this work, we address annotation errors - particularly hyphenation issues - in the Bullinger correspondence, a large 16th-century letter collection. We introduce a self-training method based on a CTC alignment algorithm that matches full transcriptions to text line images using dynamic programming and model output probabilities trained with the CTC loss. Our approach improves performance (e.g., by 1.1 percentage points CER with PyLaia) and increases alignment accuracy. Interestingly, we find that weaker models yield more accurate alignments, enabling an iterative training strategy. We release a new manually corrected subset of 100 pages from the Bullinger dataset, along with our code and benchmarks. Our approach can be applied iteratively to further improve the CER as well as the alignment quality for text recognition pipelines. Code and data are available via https://github.com/andreas-fischer-unifr/nntp.
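
The dynamic-programming step mentioned above can be illustrated with the textbook Viterbi forced alignment over CTC outputs. The sketch below aligns one label sequence to the frames of a single line; it is not the paper's full procedure for matching whole-page transcriptions to line images. The function name `ctc_forced_align`, the label ids, and the toy probabilities are assumptions.

```python
import math


def ctc_forced_align(log_probs, target, blank=0):
    """Most likely frame-level label path that collapses to `target`.

    `log_probs[t][v]` is the log-probability of label `v` at frame `t`.
    """
    # Extended sequence: blank, l1, blank, l2, ..., blank.
    ext = [blank]
    for label in target:
        ext += [label, blank]
    n_frames, n_states = len(log_probs), len(ext)
    score = [[-math.inf] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    score[0][0] = log_probs[0][ext[0]]
    if n_states > 1:
        score[0][1] = log_probs[0][ext[1]]
    for t in range(1, n_frames):
        for s in range(n_states):
            # Allowed moves: stay, advance by one, or skip a blank
            # between two different labels.
            prev = [s]
            if s >= 1:
                prev.append(s - 1)
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                prev.append(s - 2)
            best = max(prev, key=lambda p: score[t - 1][p])
            if score[t - 1][best] == -math.inf:
                continue  # state not reachable at this frame
            score[t][s] = score[t - 1][best] + log_probs[t][ext[s]]
            back[t][s] = best
    # A valid path ends in the final blank or the final label.
    ends = [n_states - 1] if n_states == 1 else [n_states - 1, n_states - 2]
    s = max(ends, key=lambda p: score[n_frames - 1][p])
    if score[n_frames - 1][s] == -math.inf:
        raise ValueError("target is too long for the number of frames")
    path = []
    for t in range(n_frames - 1, -1, -1):
        path.append(ext[s])
        s = back[t][s]
    return path[::-1]


# Toy example: label 0 is blank, 1 and 2 are two characters.
probs = [[0.1, 0.8, 0.1], [0.8, 0.1, 0.1], [0.1, 0.1, 0.8], [0.1, 0.1, 0.8]]
frames = [[math.log(p) for p in row] for row in probs]
path = ctc_forced_align(frames, target=[1, 2])
```

The returned path assigns one label (or blank) to every frame, which is the information an alignment step needs to decide where each character of the transcription falls.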

[274] Generative Video Matting

Yongtao Ge,Kangyang Xie,Guangkai Xu,Mingyu Liu,Li Ke,Longtao Huang,Hui Xue,Hao Chen,Chunhua Shen

Main category: cs.CV

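TL;DR: This paper proposes a new video matting method that uses large-scale pre-training and a synthetic data generation pipeline to address the poor generalization of traditional methods in real-world scenes.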

Details

Motivation: Because high-quality ground-truth data is lacking, traditional video matting methods usually generalize poorly to real-world scenes. Existing video matting datasets provide only imperfect human-annotated alpha and foreground annotations. Method: The paper performs large-scale pre-training on synthetic and pseudo-labeled segmentation datasets and develops a scalable synthetic data generation pipeline. It also introduces a novel video matting approach that effectively leverages the rich priors in pre-trained video diffusion models. Result: A comprehensive quantitative evaluation on three benchmark datasets demonstrates the method's superior performance, and qualitative results across diverse real-world scenes illustrate its strong generalization capability. Conclusion: The paper proposes a new video matting method that uses large-scale pre-training and a synthetic data generation pipeline to achieve strong generalization in real-world scenes. Abstract: Video matting has traditionally been limited by the lack of high-quality ground-truth data. Most existing video matting datasets provide only human-annotated imperfect alpha and foreground annotations, which must be composited to background images or videos during the training stage. Thus, the generalization capability of previous methods in real-world scenarios is typically poor. In this work, we propose to solve the problem from two perspectives. First, we emphasize the importance of large-scale pre-training by pursuing diverse synthetic and pseudo-labeled segmentation datasets. We also develop a scalable synthetic data generation pipeline that can render diverse human bodies and fine-grained hairs, yielding around 200 video clips with a 3-second duration for fine-tuning. Second, we introduce a novel video matting approach that can effectively leverage the rich priors from pre-trained video diffusion models. This architecture offers two key advantages. First, strong priors play a critical role in bridging the domain gap between synthetic and real-world scenes. Second, unlike most existing methods that process video matting frame-by-frame and use an independent decoder to aggregate temporal information, our model is inherently designed for video, ensuring strong temporal consistency. We provide a comprehensive quantitative evaluation across three benchmark datasets, demonstrating our approach's superior performance, and present comprehensive qualitative results in diverse real-world scenes, illustrating the strong generalization capability of our method. The code is available at https://github.com/aim-uofa/GVM.

[275] Mem4D: Decoupling Static and Dynamic Memory for Dynamic Scene Reconstruction

Xudong Cai,Shuo Wang,Peng Wang,Yongcai Wang,Zhaoxin Fan,Wanting Li,Tianbao Zhang,Jianrong Tao,Yeying Jin,Deying Li

Main category: cs.CV

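TL;DR: Mem4D decouples the memory modeling of static geometry and dynamic motion, resolving the memory demand dilemma in dynamic scene reconstruction and enabling efficient, accurate reconstruction.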

Details

Motivation: When reconstructing dense geometry for dynamic scenes, existing memory-based methods cannot simultaneously maintain the long-term stability of static structures and retain high-fidelity detail for dynamic motion, which leads to geometric drift or blurred reconstructions. Method: A dual-memory architecture is proposed, comprising a Transient Dynamics Memory (TDM) that captures dynamic content and a Persistent Structure Memory (PSM) that keeps static elements consistent; alternating queries to the two memories enable efficient online reconstruction. Result: Experiments show that the method achieves state-of-the-art or competitive performance on challenging benchmarks while maintaining high efficiency. Conclusion: By decoupling the modeling of static geometry and dynamic motion, Mem4D resolves the memory demand dilemma of existing methods while maintaining high efficiency and strong performance. Abstract: Reconstructing dense geometry for dynamic scenes from a monocular video is a critical yet challenging task. Recent memory-based methods enable efficient online reconstruction, but they fundamentally suffer from a Memory Demand Dilemma: The memory representation faces an inherent conflict between the long-term stability required for static structures and the rapid, high-fidelity detail retention needed for dynamic motion. This conflict forces existing methods into a compromise, leading to either geometric drift in static structures or blurred, inaccurate reconstructions of dynamic objects. To address this dilemma, we propose Mem4D, a novel framework that decouples the modeling of static geometry and dynamic motion. Guided by this insight, we design a dual-memory architecture: 1) The Transient Dynamics Memory (TDM) focuses on capturing high-frequency motion details from recent frames, enabling accurate and fine-grained modeling of dynamic content; 2) The Persistent Structure Memory (PSM) compresses and preserves long-term spatial information, ensuring global consistency and drift-free reconstruction for static elements. By alternating queries to these specialized memories, Mem4D simultaneously maintains static geometry with global consistency and reconstructs dynamic elements with high fidelity. Experiments on challenging benchmarks demonstrate that our method achieves state-of-the-art or competitive performance while maintaining high efficiency. Code will be made publicly available.

[276] RSVLM-QA: A Benchmark Dataset for Remote Sensing Vision Language Model-based Question Answering

Xing Zi,Jinghao Xiao,Yunxiao Shi,Xian Tao,Jun Li,Ali Braytee,Mukesh Prasad

Main category: cs.CV

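TL;DR: This paper presents RSVLM-QA, a new large-scale remote sensing VQA dataset built by integrating multiple datasets and using large language models to generate rich annotations and question-answer pairs, in order to better evaluate understanding and reasoning on remote sensing imagery.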

Details

Motivation: Existing RS VQA datasets are limited in annotation richness, question diversity, and the assessment of specific reasoning capabilities. Method: Several RS segmentation and detection datasets are integrated, and a dual-track annotation generation pipeline combines large language models with an automated process to produce annotations and question-answer pairs. Result: RSVLM-QA comprises 13,820 images and 162,373 VQA pairs, with extensive annotations and diverse question types. Conclusion: RSVLM-QA is expected to serve as a key resource for the RS VQA and VLM research communities and to advance the field. Abstract: Visual Question Answering (VQA) in remote sensing (RS) is pivotal for interpreting Earth observation data. However, existing RS VQA datasets are constrained by limitations in annotation richness, question diversity, and the assessment of specific reasoning capabilities. This paper introduces the RSVLM-QA dataset, a new large-scale, content-rich VQA dataset for the RS domain. RSVLM-QA is constructed by integrating data from several prominent RS segmentation and detection datasets: WHU, LoveDA, INRIA, and iSAID. We employ an innovative dual-track annotation generation pipeline. Firstly, we leverage Large Language Models (LLMs), specifically GPT-4.1, with meticulously designed prompts to automatically generate a suite of detailed annotations including image captions, spatial relations, and semantic tags, alongside complex caption-based VQA pairs. Secondly, to address the challenging task of object counting in RS imagery, we have developed a specialized automated process that extracts object counts directly from the original segmentation data; GPT-4.1 then formulates natural language answers from these counts, which are paired with preset question templates to create counting QA pairs. RSVLM-QA comprises 13,820 images and 162,373 VQA pairs, featuring extensive annotations and diverse question types. We provide a detailed statistical analysis of the dataset and a comparison with existing RS VQA benchmarks, highlighting the superior depth and breadth of RSVLM-QA's annotations. Furthermore, we conduct benchmark experiments on six mainstream Vision Language Models (VLMs), demonstrating that RSVLM-QA effectively evaluates and challenges the understanding and reasoning abilities of current VLMs in the RS domain. 
We believe RSVLM-QA will serve as a pivotal resource for the RS VQA and VLM research communities, poised to catalyze advancements in the field.
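
The counting track described in the abstract (object counts taken from segmentation data, then paired with preset question templates) might be sketched as follows. The abstract does not say how counts are extracted, so counting 4-connected regions in a binary mask is an assumption, as are the template wording and the function names. In the paper GPT-4.1 phrases the answer; a plain string stands in for that step here.

```python
from collections import deque


def count_objects(mask):
    """Count 4-connected foreground regions in a binary mask (rows of 0/1)."""
    if not mask:
        return 0
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                count += 1
                seen[r][c] = True
                queue = deque([(r, c)])
                # Breadth-first flood fill over the current region.
                while queue:
                    y, x = queue.popleft()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count


def make_counting_qa(mask, category):
    """Pair a count with a hypothetical preset question template."""
    count = count_objects(mask)
    return {
        "question": f"How many {category} are visible in the image?",
        "answer": str(count),
    }


building_mask = [
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]
qa = make_counting_qa(building_mask, "buildings")
```

Deriving the count from the mask rather than asking a language model to estimate it keeps the ground-truth answer tied to the original annotations, which appears to be the motivation for the separate counting track.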

[277] Safeguarding Generative AI Applications in Preclinical Imaging through Hybrid Anomaly Detection

Jakub Binda,Valentina Paneta,Vasileios Eleftheriadis,Hongkyou Chung,Panagiotis Papadimitroulas,Neo Christopher Chung

Main category: cs.CV

TL;DR: This paper proposes a hybrid anomaly detection framework to enhance the reliability and robustness of generative AI models in nuclear medicine, with applications in synthetic X-ray generation and radiation dose estimation.

Details

Motivation: The high-stakes nature of biomedical imaging requires robust mechanisms to detect and manage unexpected or erroneous behavior in generative AI models. Method: Development and implementation of a hybrid anomaly detection framework for safeguarding generative AI models, applied to two systems: Pose2Xray and DosimetrEYE. Result: The proposed outlier detection approach improves reliability, reduces manual oversight, and enables real-time quality control in the two applications. Conclusion: The hybrid anomaly detection framework strengthens the industrial viability of generative AI in preclinical settings by improving robustness, scalability, and regulatory compliance. Abstract: Generative AI holds great potential to automate and enhance data synthesis in nuclear medicine. However, the high-stakes nature of biomedical imaging necessitates robust mechanisms to detect and manage unexpected or erroneous model behavior. We introduce the development and implementation of a hybrid anomaly detection framework to safeguard GenAI models in BIOEMTECH's eyes(TM) systems. Two applications are demonstrated: Pose2Xray, which generates synthetic X-rays from photographic mouse images, and DosimetrEYE, which estimates 3D radiation dose maps from 2D SPECT/CT scans. In both cases, our outlier detection (OD) enhances reliability, reduces manual oversight, and supports real-time quality control. This approach strengthens the industrial viability of GenAI in preclinical settings by increasing robustness, scalability, and regulatory compliance.

[278] TAG: A Simple Yet Effective Temporal-Aware Approach for Zero-Shot Video Temporal Grounding

Jin-Seop Lee,SungJoon Lee,Jaehan Ahn,YunSeok Choi,Jee-Hyong Lee

Main category: cs.CV

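TL;DR: This paper proposes TAG, a zero-shot video temporal grounding method that uses temporal pooling, temporal coherence clustering, and similarity adjustment to address semantic fragmentation and LLM dependence in existing methods, achieving state-of-the-art performance.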

Details

Motivation: Existing zero-shot video temporal grounding methods suffer from semantic fragmentation, rely on skewed similarity distributions, and require computationally expensive LLM inference; this paper aims to address these limitations. Method: The proposed TAG method combines temporal pooling, temporal coherence clustering, and similarity adjustment to better capture the temporal context of videos and correct the similarity distribution, thereby improving grounding accuracy. Result: TAG achieves state-of-the-art results on the Charades-STA and ActivityNet Captions datasets without relying on LLMs, which reduces computational cost. Conclusion: TAG is a simple yet effective zero-shot video temporal grounding method that overcomes the limitations of existing methods by introducing temporal awareness, without relying on LLMs. Abstract: Video Temporal Grounding (VTG) aims to extract relevant video segments based on a given natural language query. Recently, zero-shot VTG methods have gained attention by leveraging pretrained vision-language models (VLMs) to localize target moments without additional training. However, existing approaches suffer from semantic fragmentation, where temporally continuous frames sharing the same semantics are split across multiple segments. When segments are fragmented, it becomes difficult to predict an accurate target moment that aligns with the text query. Also, they rely on skewed similarity distributions for localization, making it difficult to select the optimal segment. Furthermore, they heavily depend on the use of LLMs which require expensive inferences. To address these limitations, we propose TAG, a simple yet effective Temporal-Aware approach for zero-shot video temporal Grounding, which incorporates temporal pooling, temporal coherence clustering, and similarity adjustment. Our proposed method effectively captures the temporal context of videos and addresses distorted similarity distributions without training. Our approach achieves state-of-the-art results on Charades-STA and ActivityNet Captions benchmark datasets without relying on LLMs. Our code is available at https://github.com/Nuetee/TAG

[279] VOIDFace: A Privacy-Preserving Multi-Network Face Recognition With Enhanced Security

Ajnas Muhammed,Iurri Medvedev,Nuno Gonçalves

Main category: cs.CV

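TL;DR: This paper presents VOIDFace, a privacy-preserving facial recognition framework that addresses the data replication problem and gives users greater control over their personal data.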

Details

Motivation: Facial recognition systems typically replicate and store large amounts of face data during training, which complicates data management and raises user privacy and ethical concerns. Method: The VOIDFace framework combines visual secret sharing with a patch-based multi-training network to eliminate data replication and build a privacy-preserving facial recognition system. Result: Experiments on the VGGFace2 dataset show that VOIDFace provides the Right-To-Be-Forgotten, improved data control, security, and privacy while maintaining competitive recognition performance. Conclusion: VOIDFace is a novel facial recognition framework that addresses data replication and privacy protection; by using visual secret sharing and a patch-based multi-training network, it improves the privacy, security, and efficiency of facial recognition training and lets users exercise the Right-To-Be-Forgotten over their personal data. Abstract: Advancement of machine learning techniques, combined with the availability of large-scale datasets, has significantly improved the accuracy and efficiency of facial recognition. Modern facial recognition systems are trained using large face datasets collected from diverse individuals or public repositories. However, for training, these datasets are often replicated and stored in multiple workstations, resulting in data replication, which complicates database management and oversight. Currently, once a user submits their face for dataset preparation, they lose control over how their data is used, raising significant privacy and ethical concerns. This paper introduces VOIDFace, a novel framework for facial recognition systems that addresses two major issues. First, it eliminates the need for data replication and improves data control to securely store training face data by using visual secret sharing. Second, it proposes a patch-based multi-training network that uses this novel training data storage mechanism to develop a robust, privacy-preserving facial recognition system. By integrating these advancements, VOIDFace aims to improve the privacy, security, and efficiency of facial recognition training, while ensuring greater control over sensitive personal face data. VOIDFace also enables users to exercise their Right-To-Be-Forgotten property to control their personal data. Experimental evaluations on the VGGFace2 dataset show that VOIDFace provides Right-To-Be-Forgotten, improved data control, security, and privacy while maintaining competitive facial recognition performance. Code is available at: https://github.com/ajnasmuhammed89/VOIDFace
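
To give a feel for secret sharing as a storage mechanism, the sketch below implements a generic (n, n) XOR-based scheme over bytes: every share alone looks random, and the data can be recovered only when all shares are combined. The abstract does not specify which visual secret sharing scheme VOIDFace uses, so this is a stand-in under that assumption, with bytes playing the role of pixel values and all function names hypothetical.

```python
import secrets


def split_secret(data: bytes, num_shares: int) -> list:
    """Split `data` into `num_shares` shares; all are needed to recover it."""
    if num_shares < 2:
        raise ValueError("need at least two shares")
    # The first n-1 shares are uniformly random bytes.
    shares = [secrets.token_bytes(len(data)) for _ in range(num_shares - 1)]
    # The last share is the data XORed with every random share.
    last = bytearray(data)
    for share in shares:
        for i, byte in enumerate(share):
            last[i] ^= byte
    shares.append(bytes(last))
    return shares


def reconstruct(shares) -> bytes:
    """XOR all shares together to recover the original bytes."""
    out = bytearray(len(shares[0]))
    for share in shares:
        for i, byte in enumerate(share):
            out[i] ^= byte
    return bytes(out)


patch = b"face patch pixels"
shares = split_secret(patch, num_shares=3)
recovered = reconstruct(shares)
```

In a scheme of this kind, destroying any single share makes the original unrecoverable from the remaining ones, which is one way a storage design could support deletion requests.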

[280] TrackOR: Towards Personalized Intelligent Operating Rooms Through Robust Tracking

Tony Danjun Wang,Christian Heiliger,Nassir Navab,Lennart Bastian

Main category: cs.CV

TL;DR: TrackOR improves surgical team tracking with 3D geometric signatures, enabling personalized intelligent support systems in operating rooms.

Details

Motivation: Improving patient outcomes through intelligent support systems in surgical environments requires persistent and personalized tracking of surgical staff. Method: TrackOR framework for long-term multi-person tracking and re-identification using 3D geometric signatures. Result: TrackOR achieves state-of-the-art online tracking performance with +11% Association Accuracy over the strongest baseline and enables offline recovery for analysis-ready trajectories. Conclusion: TrackOR enables persistent identity tracking in the operating room by leveraging 3D geometric information, paving the way for personalized intelligent systems. Abstract: Providing intelligent support to surgical teams is a key frontier in automated surgical scene understanding, with the long-term goal of improving patient outcomes. Developing personalized intelligence for all staff members requires maintaining a consistent state of who is located where for long surgical procedures, which still poses numerous computational challenges. We propose TrackOR, a framework for tackling long-term multi-person tracking and re-identification in the operating room. TrackOR uses 3D geometric signatures to achieve state-of-the-art online tracking performance (+11% Association Accuracy over the strongest baseline), while also enabling an effective offline recovery process to create analysis-ready trajectories. Our work shows that by leveraging 3D geometric information, persistent identity tracking becomes attainable, enabling a critical shift towards the more granular, staff-centric analyses required for personalized intelligent systems in the operating room. This new capability opens up various applications, including our proposed temporal pathway imprints that translate raw tracking data into actionable insights for improving team efficiency and safety and ultimately providing personalized support.

[281] Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation

Fangyuan Mao,Aiming Hao,Jintao Chen,Dongxia Liu,Xiaokun Feng,Jiashu Zhu,Meiqi Wu,Chubin Chen,Jiahong Wu,Xiangxiang Chu

Main category: cs.CV

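TL;DR: Omni-Effects proposes a unified visual effects generation framework that supports spatially controllable generation of multiple effects, using LoRA-MoE and SAP to mitigate cross-task interference and spatial uncontrollability.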

Details

Motivation: Current VFX generation methods are constrained by per-effect LoRA training and struggle to produce spatially controllable composite effects, so a unified approach is needed. Method: A LoRA-based Mixture of Experts (LoRA-MoE) and a Spatial-Aware Prompt (SAP) are proposed, together with an Independent-Information Flow (IIF) module that prevents interference between effects. Result: Omni-Effects achieves precise spatial control and diverse effect generation; users can specify both the category and location of the desired effects, which markedly improves the flexibility and practicality of VFX generation. Conclusion: Omni-Effects is the first unified framework capable of generating prompt-guided effects and spatially controllable composite effects, addressing diversity and spatial controllability in VFX generation. Abstract: Visual effects (VFX) are essential visual enhancements fundamental to modern cinematic production. Although video generation models offer cost-efficient solutions for VFX production, current methods are constrained by per-effect LoRA training, which limits generation to single effects. This fundamental limitation impedes applications that require spatially controllable composite effects, i.e., the concurrent generation of multiple effects at designated locations. However, integrating diverse effects into a unified framework faces major challenges: interference from effect variations and spatial uncontrollability during multi-VFX joint training. To tackle these challenges, we propose Omni-Effects, the first unified framework capable of generating prompt-guided effects and spatially controllable composite effects. The core of our framework comprises two key innovations: (1) LoRA-based Mixture of Experts (LoRA-MoE), which employs a group of expert LoRAs, integrating diverse effects within a unified model while effectively mitigating cross-task interference. (2) Spatial-Aware Prompt (SAP), which incorporates spatial mask information into the text token, enabling precise spatial control. Furthermore, we introduce an Independent-Information Flow (IIF) module integrated within the SAP, isolating the control signals corresponding to individual effects to prevent any unwanted blending. To facilitate this research, we construct a comprehensive VFX dataset Omni-VFX via a novel data collection pipeline combining image editing and First-Last Frame-to-Video (FLF2V) synthesis, and introduce a dedicated VFX evaluation framework for validating model performance. 
Extensive experiments demonstrate that Omni-Effects achieves precise spatial control and diverse effect generation, enabling users to specify both the category and location of desired effects.

[282] The Escalator Problem: Identifying Implicit Motion Blindness in AI for Accessibility

Xiantao Zhang

Main category: cs.CV

TL;DR: The paper discusses a significant limitation in Multimodal Large Language Models called Implicit Motion Blindness, exemplified by the Escalator Problem, and calls for improved approaches to video understanding that prioritize safety and reliability for visually impaired users.

Details

Motivation: The motivation for the paper is to identify and address a critical limitation in Multimodal Large Language Models (MLLMs) that affects their trustworthiness in real-world applications for the blind and visually impaired community. Method: The paper uses the Escalator Problem as a canonical example to highlight the issue of Implicit Motion Blindness. It analyzes the implications of this problem and advocates for a new approach in video understanding. Result: The result of the paper is the identification of the Escalator Problem as an example of Implicit Motion Blindness, an analysis of its implications, and a call to action for a paradigm shift in video understanding. Conclusion: The paper concludes that there is a critical issue termed Implicit Motion Blindness in MLLMs that affects their ability to perceive continuous, low-signal motion. This issue has implications for user trust and requires a shift in approach towards robust physical perception. Abstract: Multimodal Large Language Models (MLLMs) hold immense promise as assistive technologies for the blind and visually impaired (BVI) community. However, we identify a critical failure mode that undermines their trustworthiness in real-world applications. We introduce the Escalator Problem -- the inability of state-of-the-art models to perceive an escalator's direction of travel -- as a canonical example of a deeper limitation we term Implicit Motion Blindness. This blindness stems from the dominant frame-sampling paradigm in video understanding, which, by treating videos as discrete sequences of static images, fundamentally struggles to perceive continuous, low-signal motion. As a position paper, our contribution is not a new model but rather to: (I) formally articulate this blind spot, (II) analyze its implications for user trust, and (III) issue a call to action. 
We advocate for a paradigm shift from purely semantic recognition towards robust physical perception and urge the development of new, human-centered benchmarks that prioritize safety, reliability, and the genuine needs of users in dynamic environments.

Thinesh Thiyakesan Ponbagavathi,Chengzheng Yang,Alina Roitberg

Main category: cs.CV

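TL;DR: This paper proposes ProGraD, a method for detecting group activities in videos that uses learnable group prompts and a lightweight GroupContext Transformer to adapt Vision Foundation Models (VFMs) to group activity detection, with strong results on multiple benchmarks.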

Details

Motivation: Although Vision Foundation Models (VFMs) perform well on many vision tasks, they are pretrained primarily on object-centric data and remain underexplored for modeling group dynamics. Existing group activity detection methods typically rely on task-specific architectures that require full fine-tuning, and simply replacing the CNN backbones in these methods with VFMs brings little gain. Method: The paper introduces Prompt-driven Group Activity Detection (ProGraD), which comprises 1) learnable group prompts that guide VFM attention toward social configurations, and 2) a lightweight two-layer GroupContext Transformer that infers actor-group associations and collective behavior. Result: The method is evaluated on two recent GAD benchmarks, Cafe and Social-CAD. It surpasses the state of the art in both settings and is especially effective in complex multi-group scenarios, where it gains 6.5% on Group mAP@1.0 and 8.2% on Group mAP@0.5 with only 10M trainable parameters. Experiments also show that ProGraD produces interpretable attention maps that offer insight into actor-group reasoning. Conclusion: The paper presents ProGraD, a new group activity detection method that combines Vision Foundation Models with group-aware reasoning through learnable group prompts and a lightweight GroupContext Transformer, achieving strong results on multiple benchmarks. Abstract: Group Activity Detection (GAD) involves recognizing social groups and their collective behaviors in videos. Vision Foundation Models (VFMs), like DinoV2, offer excellent features, but are pretrained primarily on object-centric data and remain underexplored for modeling group dynamics. While they are a promising alternative to highly task-specific GAD architectures that require full fine-tuning, our initial investigation reveals that simply swapping CNN backbones used in these methods with VFMs brings little gain, underscoring the need for structured, group-aware reasoning on top. We introduce Prompt-driven Group Activity Detection (ProGraD) -- a method that bridges this gap through 1) learnable group prompts to guide the VFM attention toward social configurations, and 2) a lightweight two-layer GroupContext Transformer that infers actor-group associations and collective behavior. We evaluate our approach on two recent GAD benchmarks: Cafe, which features multiple concurrent social groups, and Social-CAD, which focuses on single-group interactions. While we surpass state-of-the-art in both settings, our method is especially effective in complex multi-group scenarios, where we yield a gain of 6.5% (Group mAP@1.0) and 8.2% (Group mAP@0.5) using only 10M trainable parameters. 
Furthermore, our experiments reveal that ProGraD produces interpretable attention maps, offering insights into actor-group reasoning. Code and models will be released.

[284] Sample-aware RandAugment: Search-free Automatic Data Augmentation for Effective Image Recognition

Anqi Xiao,Weichen Yu,Hongyuan Yu

Main category: cs.CV

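TL;DR: SRA is a new automatic data augmentation method that dynamically adjusts its augmentation policy, improves model performance, and performs well across multiple tasks without a costly search process.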

Details

Motivation: Mainstream automatic data augmentation methods face two challenges: the search process is too time-consuming, or performance is suboptimal. SRA is proposed to address both with a more efficient and practical solution. Method: The paper proposes Sample-aware RandAugment (SRA), which combines a heuristic scoring module with an asymmetric augmentation strategy to adjust the augmentation policy dynamically according to sample complexity. Result: SRA reaches 78.31% Top-1 accuracy on ImageNet with ResNet-50, shows good compatibility and generalization, and improves recognition in downstream object detection tasks. Conclusion: SRA is a simple and effective automatic data augmentation method. Its search-free design tailors the augmentation policy to each sample, narrows the performance gap between search-based and search-free methods, and shows good performance and broad applicability across multiple experiments. Abstract: Automatic data augmentation (AutoDA) plays an important role in enhancing the generalization of neural networks. However, mainstream AutoDA methods often encounter two challenges: either the search process is excessively time-consuming, hindering practical application, or the performance is suboptimal due to insufficient policy adaptation during training. To address these issues, we propose Sample-aware RandAugment (SRA), an asymmetric, search-free AutoDA method that dynamically adjusts augmentation policies while maintaining straightforward implementation. SRA incorporates a heuristic scoring module that evaluates the complexity of the original training data, enabling the application of tailored augmentations for each sample. Additionally, an asymmetric augmentation strategy is employed to maximize the potential of this scoring module. In multiple experimental settings, SRA narrows the performance gap between search-based and search-free AutoDA methods, achieving a state-of-the-art Top-1 accuracy of 78.31% on ImageNet with ResNet-50. Notably, SRA demonstrates good compatibility with existing augmentation pipelines and solid generalization across new tasks, without requiring hyperparameter tuning. The pretrained models leveraging SRA also enhance recognition in downstream object detection tasks. SRA represents a promising step towards simpler, more effective, and practical AutoDA designs applicable to a variety of future tasks. Our code is available at https://github.com/ainieli/Sample-awareRandAugment

[285] Mitigating Biases in Surgical Operating Rooms with Geometry

Tony Danjun Wang,Tobias Czempiel,Nassir Navab,Lennart Bastian

Main category: cs.CV

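TL;DR: This paper studies how deep neural networks learn spurious correlations in operating-room settings and proposes geometric representations that capture more meaningful human biometric features to address the problem.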

Details

Motivation: Deep neural networks readily learn spurious correlations, particularly in operating rooms, where standardized surgical attire obscures reliable identifying landmarks and introduces model bias. Method: Gradient-based analysis is used to expose the shortcuts, and personnel are encoded as 3D point cloud sequences to disentangle identity-relevant shape and motion patterns from appearance-based confounders. Result: Experiments show that RGB and geometric methods perform comparably on datasets with apparent simulation artifacts, but RGB models suffer a 12% accuracy drop in realistic clinical settings with reduced visual diversity. Conclusion: Geometric representations capture more meaningful biometric features and offer a robust way to model personnel in the operating room. Abstract: Deep neural networks are prone to learning spurious correlations, exploiting dataset-specific artifacts rather than meaningful features for prediction. In surgical operating rooms (OR), these manifest through the standardization of smocks and gowns that obscure robust identifying landmarks, introducing model bias for tasks related to modeling OR personnel. Through gradient-based saliency analysis on two public OR datasets, we reveal that CNN models succumb to such shortcuts, fixating on incidental visual cues such as footwear beneath surgical gowns, distinctive eyewear, or other role-specific identifiers. Avoiding such biases is essential for the next generation of intelligent assistance systems in the OR, which should accurately recognize personalized workflow traits, such as surgical skill level or coordination with other staff members. We address this problem by encoding personnel as 3D point cloud sequences, disentangling identity-relevant shape and motion patterns from appearance-based confounders. Our experiments demonstrate that while RGB and geometric methods achieve comparable performance on datasets with apparent simulation artifacts, RGB models suffer a 12% accuracy drop in realistic clinical settings with decreased visual diversity due to standardizations. This performance gap confirms that geometric representations capture more meaningful biometric features, providing an avenue to developing robust methods of modeling humans in the OR.

[286] TRIDE: A Text-assisted Radar-Image weather-aware fusion network for Depth Estimation

Huawei Sun,Zixu Wang,Hao Feng,Julius Ott,Lorenzo Servadei,Robert Wille

Main category: cs.CV

TL;DR: This paper introduces TRIDE, a radar-camera fusion algorithm with a weather-aware fusion block and text feature integration, achieving significant improvements in depth estimation for autonomous driving.

Details

Motivation: The motivation is to enhance depth estimation for autonomous driving by incorporating radar, camera, and language features while addressing the impact of weather on sensor performance. Method: The method involves a text-generation strategy, feature extraction and fusion techniques, and a weather-aware fusion block that adaptively adjusts radar weighting based on weather conditions. Result: The proposed method achieves a 12.87% improvement in MAE and a 9.08% improvement in RMSE over the state-of-the-art on the nuScenes dataset. Conclusion: The paper concludes that the proposed TRIDE algorithm, which integrates radar, camera, and language features with a weather-aware fusion block, achieves significant performance improvements in depth estimation for autonomous driving. Abstract: Depth estimation, essential for autonomous driving, seeks to interpret the 3D environment surrounding vehicles. The development of radar sensors, known for their cost-efficiency and robustness, has spurred interest in radar-camera fusion-based solutions. However, existing algorithms fuse features from these modalities without accounting for weather conditions, despite radars being known to be more robust than cameras under adverse weather. Additionally, while Vision-Language models have seen rapid advancement, utilizing language descriptions alongside other modalities for depth estimation remains an open challenge. This paper first introduces a text-generation strategy along with feature extraction and fusion techniques that can assist monocular depth estimation pipelines, leading to improved accuracy across different algorithms on the KITTI dataset. Building on this, we propose TRIDE, a radar-camera fusion algorithm that enhances text feature extraction by incorporating radar point information. To address the impact of weather on sensor performance, we introduce a weather-aware fusion block that adaptively adjusts radar weighting based on current weather conditions. 
Our method, benchmarked on the nuScenes dataset, demonstrates performance gains over the state-of-the-art, achieving a 12.87% improvement in MAE and a 9.08% improvement in RMSE. Code: https://github.com/harborsarah/TRIDE

[287] S^2VG: 3D Stereoscopic and Spatial Video Generation via Denoising Frame Matrix

Peng Dai,Feitong Tan,Qiangeng Xu,Yihua Huang,David Futschik,Ruofei Du,Sean Fanello,Yinda Zhang,Xiaojuan Qi

Main category: cs.CV

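TL;DR: This paper proposes a training-free and pose-free method that extends off-the-shelf monocular video generation models to immersive 3D video generation, and validates its effectiveness experimentally.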

Details

Motivation: Video generation models excel at producing high-quality monocular videos, but generating 3D stereoscopic and spatial videos for immersive applications remains challenging, so an efficient and practical solution is needed. Method: The method first warps the monocular video into pre-defined camera viewpoints using estimated depth, then applies a new frame matrix inpainting framework together with a dual-update scheme that improves inpainting quality, and finally converts the multi-view videos into stereoscopic pairs or optimizes them into 4D Gaussians to produce spatial video. Result: Experiments on videos from several generative models (Sora, Lumiere, WALT, and Zeroscope) show significant improvements over previous methods. Conclusion: The paper proposes a pose-free and training-free method that uses an off-the-shelf monocular video generation model to produce immersive 3D videos, and its experiments show significant improvements over previous methods. Abstract: While video generation models excel at producing high-quality monocular videos, generating 3D stereoscopic and spatial videos for immersive applications remains an underexplored challenge. We present a pose-free and training-free method that leverages an off-the-shelf monocular video generation model to produce immersive 3D videos. Our approach first warps the generated monocular video into pre-defined camera viewpoints using estimated depth information, then applies a novel frame matrix inpainting framework. This framework utilizes the original video generation model to synthesize missing content across different viewpoints and timestamps, ensuring spatial and temporal consistency without requiring additional model fine-tuning. Moreover, we develop a dual-update scheme that further improves the quality of video inpainting by alleviating the negative effects propagated from disoccluded areas in the latent space. The resulting multi-view videos are then adapted into stereoscopic pairs or optimized into 4D Gaussians for spatial video synthesis. We validate the efficacy of our proposed method by conducting experiments on videos from various generative models, such as Sora, Lumiere, WALT, and Zeroscope. The experiments demonstrate that our method has a significant improvement over previous methods. Project page at: https://daipengwa.github.io/S-2VG_ProjectPage/
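
The depth-based warping step can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a rectified horizontal stereo setup, a single image row, positive depths, and the standard relation disparity = focal length x baseline / depth. The names `warp_scanline`, `focal_px`, and `baseline` are hypothetical. Pixels left empty after warping correspond to disoccluded areas that an inpainting stage would have to fill.

```python
def warp_scanline(colors, depths, focal_px, baseline):
    """Forward-warp one image row into a horizontally shifted view.

    Assumes depths are strictly positive. Returns a row in which None
    marks a hole (disoccluded pixel) left for a later inpainting step.
    """
    width = len(colors)
    warped = [None] * width
    nearest = [float("inf")] * width  # z-buffer: the closest surface wins
    for x, (color, depth) in enumerate(zip(colors, depths)):
        disparity = focal_px * baseline / depth
        # Sign convention (an assumption): the new camera sits to the right,
        # so content shifts left, and nearer pixels shift further.
        target = x - int(round(disparity))
        if 0 <= target < width and depth < nearest[target]:
            warped[target] = color
            nearest[target] = depth
    return warped


if __name__ == "__main__":
    row = warp_scanline(["a", "b", "c", "d"], [10.0, 10.0, 1.0, 10.0],
                        focal_px=10.0, baseline=0.2)
    print(row)
```

In the toy row, the near pixel "c" moves two columns left and covers the far pixel "a", leaving a hole where "c" used to be.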

[288] PrIINeR: Towards Prior-Informed Implicit Neural Representations for Accelerated MRI

Ziad Al-Haj Hemidi,Eytan Kats,Mattias P. Heinrich

Main category: cs.CV

TL;DR: PrIINeR improves accelerated MRI reconstruction by integrating prior knowledge into INRs, outperforming existing methods in quality and artifact reduction.

Details

Motivation: MRI acceleration often leads to image degradation. Existing INR methods struggle at high acceleration factors due to weak prior constraints, causing structural loss and aliasing artefacts. Method: PrIINeR combines population-level knowledge from pre-trained deep learning models with instance-based optimization and enforces dual data consistency within the INR framework. Result: Evaluated on the NYU fastMRI dataset, PrIINeR outperformed state-of-the-art INR-based and learning-based methods, significantly improving structural preservation, fidelity, and reducing aliasing artefacts. Conclusion: PrIINeR provides a reliable solution for high-quality, accelerated MRI reconstruction by integrating prior knowledge into the INR framework. Abstract: Accelerating Magnetic Resonance Imaging (MRI) reduces scan time but often degrades image quality. While Implicit Neural Representations (INRs) show promise for MRI reconstruction, they struggle at high acceleration factors due to weak prior constraints, leading to structural loss and aliasing artefacts. To address this, we propose PrIINeR, an INR-based MRI reconstruction method that integrates prior knowledge from pre-trained deep learning models into the INR framework. By combining population-level knowledge with instance-based optimization and enforcing dual data consistency, PrIINeR aligns both with the acquired k-space data and the prior-informed reconstruction. Evaluated on the NYU fastMRI dataset, our method not only outperforms state-of-the-art INR-based approaches but also improves upon several learning-based state-of-the-art methods, significantly improving structural preservation and fidelity while effectively removing aliasing artefacts. PrIINeR bridges deep learning and INR-based techniques, offering a more reliable solution for high-quality, accelerated MRI reconstruction. The code is publicly available on https://github.com/multimodallearning/PrIINeR.

[289] Information Bottleneck-based Causal Attention for Multi-label Medical Image Recognition

Xiaoxiao Cui,Yiran Li,Kai He,Shanzhi Jiang,Mengli Xue,Wentao Li,Junhong Leng,Zhi Liu,Lizhen Cui,Shuo Li

Main category: cs.CV

TL;DR: This paper proposes IBCA, a novel method for multi-label classification of medical images that improves diagnosis accuracy and interpretability by effectively filtering out irrelevant features and capturing class-specific attention patterns.

Details

Motivation: Current methods for multi-label classification in medical imaging struggle to distinguish class-specific features due to attention to irrelevant features, prompting the need for a more effective approach to interpretability and diagnosis. Method: A structural causal model (SCM) and Information Bottleneck-based Causal Attention (IBCA) were proposed to filter out class-irrelevant information and capture class-specific attention patterns using Gaussian mixture multi-label spatial attention and contrastive enhancement-based causal intervention. Result: IBCA showed significant improvements over the second-best methods, with increases of 6.35% in CR, 7.72% in OR, and 5.02% in mAP for MuReD, and improvements of 1.47% in CR, 1.65% in CF1, and 1.42% in mAP for Endo. Conclusion: The proposed IBCA method outperforms existing methods in multi-label classification of medical images, showing significant improvements in performance metrics on the MuReD and Endo datasets. Abstract: Multi-label classification (MLC) of medical images aims to identify multiple diseases and holds significant clinical potential. A critical step is to learn class-specific features for accurate diagnosis and improved interpretability effectively. However, current works focus primarily on causal attention to learn class-specific features, yet they struggle to interpret the true cause due to the inadvertent attention to class-irrelevant features. To address this challenge, we propose a new structural causal model (SCM) that treats class-specific attention as a mixture of causal, spurious, and noisy factors, and a novel Information Bottleneck-based Causal Attention (IBCA) that is capable of learning the discriminative class-specific attention for MLC of medical images. Specifically, we propose learning Gaussian mixture multi-label spatial attention to filter out class-irrelevant information and capture each class-specific attention pattern. 
Then a contrastive enhancement-based causal intervention is proposed to gradually mitigate the spurious attention and reduce noise information by aligning multi-head attention with the Gaussian mixture multi-label spatial attention. Quantitative and ablation results on Endo and MuReD show that IBCA outperforms all compared methods. Compared to the second-best results for each metric, IBCA achieves improvements of 6.35% in CR, 7.72% in OR, and 5.02% in mAP on MuReD, and of 1.47% in CR, 1.65% in CF1, and 1.42% in mAP on Endo.

[290] ME-TST+: Micro-expression Analysis via Temporal State Transition with ROI Relationship Awareness

Zizheng Guo,Bochao Zou,Junbao Zhuo,Huimin Ma

Main category: cs.CV

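TL;DR: This paper proposes two new architectures, ME-TST and ME-TST+, which improve micro-expression analysis through video-level regression and a synergy strategy, significantly boosting performance.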

Details

Motivation: Conventional methods use fixed window lengths and hard classification, and treat micro-expression spotting and recognition as two separate tasks, overlooking the intrinsic connection between them. Method: Temporal state transition mechanisms replace conventional window-level classification with video-level regression. Multi-granularity ROI modeling and the slowfast Mamba framework are introduced, and a synergy strategy is proposed at both the feature and result levels. Result: Experiments show that the proposed methods achieve state-of-the-art performance in micro-expression analysis. Conclusion: The paper proposes two state space model-based architectures, ME-TST and ME-TST+, that address key problems in micro-expression analysis and achieve state-of-the-art performance. Abstract: Micro-expressions (MEs) are regarded as important indicators of an individual's intrinsic emotions, preferences, and tendencies. ME analysis requires spotting of ME intervals within long video sequences and recognition of their corresponding emotional categories. Previous deep learning approaches commonly employ sliding-window classification networks. However, the use of fixed window lengths and hard classification presents notable limitations in practice. Furthermore, these methods typically treat ME spotting and recognition as two separate tasks, overlooking the essential relationship between them. To address these challenges, this paper proposes two state space model-based architectures, namely ME-TST and ME-TST+, which utilize temporal state transition mechanisms to replace conventional window-level classification with video-level regression. This enables a more precise characterization of the temporal dynamics of MEs and supports the modeling of MEs with varying durations. In ME-TST+, we further introduce multi-granularity ROI modeling and the slowfast Mamba framework to alleviate information loss associated with treating ME analysis as a time-series task. Additionally, we propose a synergy strategy for spotting and recognition at both the feature and result levels, leveraging their intrinsic connection to enhance overall analysis performance. Extensive experiments demonstrate that the proposed methods achieve state-of-the-art performance. The codes are available at https://github.com/zizheng-guo/ME-TST.

[291] Matrix-3D: Omnidirectional Explorable 3D World Generation

Zhongqi Yang,Wenhang Ge,Yuqi Li,Jiaqi Chen,Haoyuan Li,Mengyin An,Fei Kang,Hua Xue,Baixin Xu,Yuyang Yin,Eric Li,Yang Liu,Yikai Wang,Hao-Xiang Guo,Yahui Zhou

Main category: cs.CV

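TL;DR: This paper proposes Matrix-3D, a framework that uses a panoramic representation for wide-coverage, explorable 3D world generation, combining video generation with panoramic 3D reconstruction to improve generation quality and reconstruction accuracy.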

Details

Motivation: Existing 3D world generation methods produce scenes of limited scope, so approaches with wider coverage are needed. Method: A trajectory-guided panoramic video diffusion model is trained, and two 3D scene reconstruction methods are proposed: a feed-forward panorama reconstruction model and an optimization-based reconstruction pipeline. Result: The Matrix-3D framework achieves state-of-the-art performance in panoramic video generation and 3D world generation. Conclusion: Matrix-3D enables high-quality, geometrically consistent scene video generation and improves 3D scene reconstruction through its two reconstruction methods. Abstract: Explorable 3D world generation from a single image or text prompt forms a cornerstone of spatial intelligence. Recent works utilize video models to achieve wide-scope and generalizable 3D world generation. However, existing approaches often suffer from a limited scope in the generated scenes. In this work, we propose Matrix-3D, a framework that utilizes panoramic representation for wide-coverage omnidirectional explorable 3D world generation that combines conditional video generation and panoramic 3D reconstruction. We first train a trajectory-guided panoramic video diffusion model that employs scene mesh renders as condition, to enable high-quality and geometrically consistent scene video generation. To lift the panorama scene video to a 3D world, we propose two separate methods: (1) a feed-forward large panorama reconstruction model for rapid 3D scene reconstruction and (2) an optimization-based pipeline for accurate and detailed 3D scene reconstruction. To facilitate effective training, we also introduce the Matrix-Pano dataset, the first large-scale synthetic collection comprising 116K high-quality static panoramic video sequences with depth and trajectory annotations. Extensive experiments demonstrate that our proposed framework achieves state-of-the-art performance in panoramic video generation and 3D world generation. See more in https://matrix-3d.github.io.

[292] MDD-Net: Multimodal Depression Detection through Mutual Transformer

Md Rezwanul Haque,Md. Milon Islam,S M Taslim Uddin Raju,Hamdi Altaheri,Lobna Nassar,Fakhri Karray

Main category: cs.CV

TL;DR: A new multimodal depression detection network (MDD-Net) effectively detects depression using acoustic and visual data from social media, outperforming current methods.

Details

Motivation: Depression is a major mental health issue, and social media offers a simple way to collect data for mental health research. Method: MDD-Net uses acoustic and visual feature extraction modules, a mutual transformer, and a detection layer to detect depression from multimodal data. Result: Experiments on the D-Vlog dataset show that MDD-Net outperforms existing methods by up to 17.37% in F1-Score. Conclusion: The proposed MDD-Net surpasses state-of-the-art approaches by up to 17.37% for F1-Score in depression detection using multimodal data. Abstract: Depression is a major mental health condition that severely impacts the emotional and physical well-being of individuals. The simple nature of data collection from social media platforms has attracted significant interest in properly utilizing this information for mental health research. A Multimodal Depression Detection Network (MDD-Net), utilizing acoustic and visual data obtained from social media networks, is proposed in this work where mutual transformers are exploited to efficiently extract and fuse multimodal features for efficient depression detection. The MDD-Net consists of four core modules: an acoustic feature extraction module for retrieving relevant acoustic attributes, a visual feature extraction module for extracting significant high-level patterns, a mutual transformer for computing the correlations among the generated features and fusing these features from multiple modalities, and a detection layer for detecting depression using the fused feature representations. The extensive experiments are performed using the multimodal D-Vlog dataset, and the findings reveal that the developed multimodal depression detection network surpasses the state-of-the-art by up to 17.37% for F1-Score, demonstrating the greater performance of the proposed system. The source code is accessible at https://github.com/rezwanh001/Multimodal-Depression-Detection.

[293] 3D Plant Root Skeleton Detection and Extraction

Jiakai Lin,Jinchang Zhang,Ge Jin,Wenzhan Song,Tianming Liu,Guoyu Lu

Main category: cs.CV

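TL;DR: The paper introduces a 3D root skeleton extraction method that derives the 3D architecture of plant roots from a few images, with automated breeding robots as a target application.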

Details

Motivation: Plant roots have a complex, dense architecture and lack texture and color information, which makes them hard to capture and track visually; prior root research has largely been confined to 2D, although 3D phenotypic information is more critical for studying genetic traits. Method: The method detects and matches lateral roots, triangulates them to extract their skeletal structure, and integrates lateral and primary roots. Result: On a complex root dataset developed by the authors, the extracted 3D root skeletons showed considerable similarity to the ground truth. Conclusion: The method can support automated breeding robots by enabling precise 3D root structure analysis for selecting seeds with superior root systems. Abstract: Plant roots typically exhibit a highly complex and dense architecture, incorporating numerous slender lateral roots and branches, which significantly hinders the precise capture and modeling of the entire root system. Additionally, roots often lack sufficient texture and color information, making it difficult to identify and track root traits using visual methods. Previous research on roots has been largely confined to 2D studies; however, exploring the 3D architecture of roots is crucial in botany. Since roots grow in real 3D space, 3D phenotypic information is more critical for studying genetic traits and their impact on root development. We have introduced a 3D root skeleton extraction method that efficiently derives the 3D architecture of plant roots from a few images. This method includes the detection and matching of lateral roots, triangulation to extract the skeletal structure of lateral roots, and the integration of lateral and primary roots. We developed a highly complex root dataset and tested our method on it. The extracted 3D root skeletons showed considerable similarity to the ground truth, validating the effectiveness of the model. This method can play a significant role in automated breeding robots. Through precise 3D root structure analysis, breeding robots can better identify plant phenotypic traits, especially root structure and growth patterns, helping practitioners select seeds with superior root systems. This automated approach not only improves breeding efficiency but also reduces manual intervention, making the breeding process more intelligent and efficient, thus advancing modern agriculture.
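
The abstract does not say which triangulation scheme is used, so the following is only an illustrative sketch of one common option, midpoint triangulation: given two viewing rays through a matched root point (camera centre plus direction), it returns the midpoint of the shortest segment between them. The function name and the ray inputs are assumptions made for illustration, not the paper's API.

```python
def triangulate_midpoint(origin_a, dir_a, origin_b, dir_b, eps=1e-12):
    """Return the 3D point closest to two rays, or None if they are parallel."""
    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    w = [oa - ob for oa, ob in zip(origin_a, origin_b)]
    a, b, c = dot(dir_a, dir_a), dot(dir_a, dir_b), dot(dir_b, dir_b)
    d, e = dot(dir_a, w), dot(dir_b, w)
    denom = a * c - b * b
    if abs(denom) < eps:
        return None  # parallel rays carry no depth information
    # Ray parameters of the closest points, from the normal equations of
    # min |origin_a + s*dir_a - origin_b - t*dir_b|^2.
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    point_a = [o + s * k for o, k in zip(origin_a, dir_a)]
    point_b = [o + t * k for o, k in zip(origin_b, dir_b)]
    return [(pa + pb) / 2.0 for pa, pb in zip(point_a, point_b)]


if __name__ == "__main__":
    # Two hypothetical cameras one unit apart, both looking at (1, 2, 5).
    print(triangulate_midpoint((0, 0, 0), (1, 2, 5), (1, 0, 0), (0, 2, 5)))
```

With noisy matches the two rays no longer intersect, and the midpoint serves as a simple estimate; a full pipeline would more likely minimise reprojection error over all views.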

[294] TBAC-UniImage: Unified Understanding and Generation by Ladder-Side Diffusion Tuning

Junzhe Xu,Yuyang Yin,Xi Chen

Main category: cs.CV

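TL;DR: This paper proposes TBAC-UniImage, a new unified model that deeply integrates a pre-trained diffusion model with an MLLM to achieve a deeper and more fine-grained unification of multimodal understanding and generation.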

Details

Motivation: Existing diffusion-based unified models face two main limitations: one approach uses only the MLLM's final hidden state as the generative condition, and the other pretrains a unified generative architecture from scratch, which is computationally expensive. Method: Representations from multiple, diverse layers of the MLLM are used as generative conditions for the diffusion model. Result: By treating the pre-trained generator as a ladder that receives guidance from different depths of the MLLM's understanding process, TBAC-UniImage achieves a deeper and more fine-grained unification of understanding and generation. Conclusion: TBAC-UniImage achieves a deeper and more fine-grained unification of understanding and generation. Abstract: This paper introduces TBAC-UniImage, a novel unified model for multimodal understanding and generation. We achieve this by deeply integrating a pre-trained Diffusion Model, acting as a generative ladder, with a Multimodal Large Language Model (MLLM). Previous diffusion-based unified models face two primary limitations. One approach uses only the MLLM's final hidden state as the generative condition. This creates a shallow connection, as the generator is isolated from the rich, hierarchical representations within the MLLM's intermediate layers. The other approach, pretraining a unified generative architecture from scratch, is computationally expensive and prohibitive for many researchers. To overcome these issues, our work explores a new paradigm. Instead of relying on a single output, we use representations from multiple, diverse layers of the MLLM as generative conditions for the diffusion model. This method treats the pre-trained generator as a ladder, receiving guidance from various depths of the MLLM's understanding process. Consequently, TBAC-UniImage achieves a much deeper and more fine-grained unification of understanding and generation.

[295] Hyperspectral Imaging

Danfeng Hong,Chenyu Li,Naoto Yokoya,Bing Zhang,Xiuping Jia,Antonio Plaza,Paolo Gamba,Jon Atli Benediktsson,Jocelyn Chanussot

Main category: cs.CV

TL;DR: Hyperspectral imaging (HSI) captures spatial and spectral data for advanced, non-invasive analysis across multiple domains, with ongoing advancements addressing its technical challenges and expanding its applicability.

Details

Motivation: HSI enables non-invasive, label-free analysis of material, chemical, and biological properties by capturing both spatial and spectral information, making it valuable across a wide range of applications. Method: The paper provides a comprehensive overview of HSI, covering its physical principles, sensor architectures, data acquisition methods, and both classical and modern analysis techniques, such as AI-driven approaches. Result: The paper summarizes HSI's capabilities in uncovering sub-visual features for advanced monitoring, diagnostics, and decision-making, while also addressing persistent challenges and emerging solutions in hardware, data complexity, and computational methods. Conclusion: Hyperspectral imaging (HSI) is becoming a versatile, cross-disciplinary platform with potential for transformative applications in various fields, including science, technology, and society. Abstract: Hyperspectral imaging (HSI) is an advanced sensing modality that simultaneously captures spatial and spectral information, enabling non-invasive, label-free analysis of material, chemical, and biological properties. This Primer presents a comprehensive overview of HSI, from the underlying physical principles and sensor architectures to key steps in data acquisition, calibration, and correction. We summarize common data structures and highlight classical and modern analysis methods, including dimensionality reduction, classification, spectral unmixing, and AI-driven techniques such as deep learning. Representative applications across Earth observation, precision agriculture, biomedicine, industrial inspection, cultural heritage, and security are also discussed, emphasizing HSI's ability to uncover sub-visual features for advanced monitoring, diagnostics, and decision-making. 
Persistent challenges, such as hardware trade-offs, acquisition variability, and the complexity of high-dimensional data, are examined alongside emerging solutions, including computational imaging, physics-informed modeling, cross-modal fusion, and self-supervised learning. Best practices for dataset sharing, reproducibility, and metadata documentation are further highlighted to support transparency and reuse. Looking ahead, we explore future directions toward scalable, real-time, and embedded HSI systems, driven by sensor miniaturization, self-supervised learning, and foundation models. As HSI evolves into a general-purpose, cross-disciplinary platform, it holds promise for transformative applications in science, technology, and society.

[296] GRASPTrack: Geometry-Reasoned Association via Segmentation and Projection for Multi-Object Tracking

Xudong Han,Pengcheng Fang,Yueying Tian,Jianhui Yu,Xiaohao Cai,Daniel Roggen,Philip Birch

Main category: cs.CV

TL;DR: GRASPTrack is a new depth-aware multi-object tracking framework that introduces 3D geometric reasoning to make tracking more robust in complex scenes.

Details

Motivation: Conventional tracking-by-detection methods lack geometric awareness and therefore struggle with the occlusions and depth ambiguity that arise in multi-object tracking from monocular video. Method: GRASPTrack adds high-fidelity 3D point cloud generation, voxel-based 3D Intersection-over-Union (IoU) computation, Depth-aware Adaptive Noise Compensation, and a Depth-enhanced Observation-Centric Momentum strategy to a standard tracking-by-detection (TBD) pipeline. Result: On the MOT17, MOT20, and DanceTrack benchmarks, the method significantly improves tracking robustness in complex scenes with frequent occlusions and intricate motion patterns, and achieves competitive performance. Conclusion: By integrating monocular depth estimation and instance segmentation, GRASPTrack builds a geometry-aware multi-object tracking framework that addresses the challenges posed by occlusion and depth ambiguity. Abstract: Multi-object tracking (MOT) in monocular videos is fundamentally challenged by occlusions and depth ambiguity, issues that conventional tracking-by-detection (TBD) methods struggle to resolve owing to a lack of geometric awareness. To address these limitations, we introduce GRASPTrack, a novel depth-aware MOT framework that integrates monocular depth estimation and instance segmentation into a standard TBD pipeline to generate high-fidelity 3D point clouds from 2D detections, thereby enabling explicit 3D geometric reasoning. These 3D point clouds are then voxelized to enable a precise and robust Voxel-Based 3D Intersection-over-Union (IoU) for spatial association. To further enhance tracking robustness, our approach incorporates Depth-aware Adaptive Noise Compensation, which dynamically adjusts the Kalman filter process noise based on occlusion severity for more reliable state estimation. Additionally, we propose a Depth-enhanced Observation-Centric Momentum, which extends the motion direction consistency from the image plane into 3D space to improve motion-based association cues, particularly for objects with complex trajectories. Extensive experiments on the MOT17, MOT20, and DanceTrack benchmarks demonstrate that our method achieves competitive performance, significantly improving tracking robustness in complex scenes with frequent occlusions and intricate motion patterns.
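
The voxelized IoU idea can be sketched in a few lines. This is a minimal illustration, not the GRASPTrack code: it assumes each object is already available as a list of (x, y, z) points, uses a uniform grid whose `voxel_size` is a hypothetical parameter, and counts a voxel as occupied if at least one point falls in it.

```python
import math


def voxelize(points, voxel_size):
    """Map 3D points to the set of integer voxel indices they occupy."""
    return {
        (math.floor(x / voxel_size),
         math.floor(y / voxel_size),
         math.floor(z / voxel_size))
        for x, y, z in points
    }


def voxel_iou(points_a, points_b, voxel_size=0.1):
    """Intersection-over-Union of the occupied voxel sets of two point clouds."""
    voxels_a = voxelize(points_a, voxel_size)
    voxels_b = voxelize(points_b, voxel_size)
    union = voxels_a | voxels_b
    if not union:
        return 0.0  # two empty clouds: define the overlap as zero
    return len(voxels_a & voxels_b) / len(union)


if __name__ == "__main__":
    track = [(0.05, 0.05, 0.05), (0.15, 0.05, 0.05)]
    detection = [(0.15, 0.05, 0.05), (0.25, 0.05, 0.05)]
    print(voxel_iou(track, detection, voxel_size=0.1))
```

In a tracker, a score like this would fill the cost matrix used to associate existing tracks with new detections, in place of or alongside a 2D box IoU.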

[297] Follow-Your-Shape: Shape-Aware Image Editing via Trajectory-Guided Region Control

Zeqian Long,Mingzhe Zheng,Kunyu Feng,Xinhua Zhang,Hongyu Liu,Harry Yang,Linfeng Zhang,Qifeng Chen,Yue Ma

Main category: cs.CV

TL;DR: The paper proposes Follow-Your-Shape, a method for precise and controllable object shape editing, and introduces the ReShapeBench benchmark for evaluation.

Details

Motivation: Existing flow-based image editing models perform poorly in challenging scenarios that involve large-scale shape transformations, often failing to achieve the intended shape change or inadvertently altering non-target regions. Method: A Trajectory Divergence Map (TDM) is computed by comparing token-wise velocity differences between the inversion and denoising paths, and it guides a Scheduled KV Injection mechanism that ensures stable and faithful editing. Result: Experiments show that the method achieves superior editability and visual fidelity in tasks requiring large-scale shape replacement. Conclusion: Follow-Your-Shape is a training-free and mask-free framework that supports precise and controllable editing of object shapes while strictly preserving non-target content. Abstract: While recent flow-based image editing models demonstrate general-purpose capabilities across diverse tasks, they often struggle to specialize in challenging scenarios -- particularly those involving large-scale shape transformations. When performing such structural edits, these methods either fail to achieve the intended shape change or inadvertently alter non-target regions, resulting in degraded background quality. We propose Follow-Your-Shape, a training-free and mask-free framework that supports precise and controllable editing of object shapes while strictly preserving non-target content. Motivated by the divergence between inversion and editing trajectories, we compute a Trajectory Divergence Map (TDM) by comparing token-wise velocity differences between the inversion and denoising paths. The TDM enables precise localization of editable regions and guides a Scheduled KV Injection mechanism that ensures stable and faithful editing. To facilitate a rigorous evaluation, we introduce ReShapeBench, a new benchmark comprising 120 new images and enriched prompt pairs specifically curated for shape-aware editing. Experiments demonstrate that our method achieves superior editability and visual fidelity, particularly in tasks requiring large-scale shape replacement.
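
A rough sketch of how a token-wise divergence map might be computed is given below. The abstract only states that token-wise velocity differences between the inversion and denoising paths are compared. The use of an L2 norm per token, min-max scaling to [0, 1], a single timestep, and a fixed threshold for the editable region are all assumptions made here for illustration, as are the function names.

```python
import math


def trajectory_divergence_map(v_inversion, v_denoising):
    """Per-token L2 distance between two velocity fields, scaled to [0, 1].

    Each argument is a list of tokens, and each token is a list of floats.
    """
    raw = [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(tok_inv, tok_den)))
        for tok_inv, tok_den in zip(v_inversion, v_denoising)
    ]
    low, high = min(raw), max(raw)
    if high == low:
        return [0.0] * len(raw)  # no token diverges more than any other
    return [(r - low) / (high - low) for r in raw]


def editable_mask(tdm, threshold=0.5):
    """Mark tokens whose divergence meets a (hypothetical) threshold."""
    return [score >= threshold for score in tdm]


if __name__ == "__main__":
    v_inv = [[0.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
    v_den = [[0.0, 0.0], [4.0, 5.0], [0.0, 1.0]]
    tdm = trajectory_divergence_map(v_inv, v_den)
    print(tdm, editable_mask(tdm))
```

Tokens with high divergence are the ones the edit prompt wants to change, so a map of this kind can localize the editable region without a user-supplied mask.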

[298] FantasyStyle: Controllable Stylized Distillation for 3D Gaussian Splatting

Yitong Yang,Yinglin Wang,Changshuo Wang,Huajie Wang,Shuting He

Main category: cs.CV

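TL;DR: FantasyStyle is a novel 3DGS-based style transfer framework that uses Multi-View Frequency Consistency and Controllable Stylized Distillation to address style conflicts and content leakage, outperforming existing methods.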

Details

Motivation: Existing 3DGS style transfer methods suffer from multi-view inconsistency and a reliance on VGG features, which lead to style conflicts and content leakage, so a new approach is needed to improve style transfer quality. Method: FantasyStyle comprises two key components: Multi-View Frequency Consistency, which applies a 3D filter to selectively reduce low-frequency components and improve cross-view consistency; and Controllable Stylized Distillation, which introduces negative guidance to exclude undesired content leakage. Result: Experiments show that FantasyStyle outperforms existing methods across a variety of scenes and styles, achieving higher stylization quality and visual realism. Conclusion: By addressing multi-view inconsistency and content leakage, FantasyStyle delivers higher-quality 3DGS-based style transfer and surpasses the performance of existing techniques. Abstract: The success of 3DGS in generative and editing applications has sparked growing interest in 3DGS-based style transfer. However, current methods still face two major challenges: (1) multi-view inconsistency often leads to style conflicts, resulting in appearance smoothing and distortion; and (2) heavy reliance on VGG features, which struggle to disentangle style and content from style images, often causing content leakage and excessive stylization. To tackle these issues, we introduce FantasyStyle, a 3DGS-based style transfer framework, and the first to rely entirely on diffusion model distillation. It comprises two key components: (1) Multi-View Frequency Consistency. We enhance cross-view consistency by applying a 3D filter to multi-view noisy latent, selectively reducing low-frequency components to mitigate stylized prior conflicts. (2) Controllable Stylized Distillation. To suppress content leakage from style images, we introduce negative guidance to exclude undesired content. In addition, we identify the limitations of Score Distillation Sampling and Delta Denoising Score in 3D style transfer and remove the reconstruction term accordingly. Building on these insights, we propose a controllable stylized distillation that leverages negative guidance to more effectively optimize the 3D Gaussians. Extensive experiments demonstrate that our method consistently outperforms state-of-the-art approaches, achieving higher stylization quality and visual realism across various scenes and styles.

[299] Pindrop it! Audio and Visual Deepfake Countermeasures for Robust Detection and Fine Grained-Localization

Nicholas Klein,Hemlata Tak,James Fullwood,Krishna Regmi,Leonidas Spinoulas,Ganesh Sivaraman,Tianxiang Chen,Elie Khoury

Main category: cs.CV

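TL;DR: This paper presents an effective deepfake video detection approach that performs strongly on both classification and temporal localization tasks.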

Details

Motivation: The rapid emergence of new techniques for visual and audio generation raises the bar for detecting synthetic content in videos, and fine-grained, localized manipulations in particular pose new challenges for detection algorithms. Method: Methods for deepfake video classification and localization were submitted to the ACM 1M Deepfakes Detection Challenge, where their performance was evaluated. Result: In the ACM 1M Deepfakes Detection Challenge, the approach achieved the best performance in the temporal localization task and ranked in the top four in the classification task on the TestA split. Conclusion: The proposed solutions achieved the best performance in the temporal localization task of the ACM 1M Deepfakes Detection Challenge and a top-four ranking in the classification task, demonstrating their effectiveness for deepfake video detection. Abstract: The field of visual and audio generation is burgeoning with new state-of-the-art methods. This rapid proliferation of new techniques underscores the need for robust solutions for detecting synthetic content in videos. In particular, when fine-grained alterations via localized manipulations are performed in visual, audio, or both domains, these subtle modifications add challenges to the detection algorithms. This paper presents solutions for the problems of deepfake video classification and localization. The methods were submitted to the ACM 1M Deepfakes Detection Challenge, achieving the best performance in the temporal localization task and a top four ranking in the classification task for the TestA split of the evaluation dataset.

[300] Integrating Task-Specific and Universal Adapters for Pre-Trained Model-based Class-Incremental Learning

Yan Wang,Da-Wei Zhou,Han-Jia Ye

Main category: cs.CV

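TL;DR: This paper proposes TUNA, a new class-incremental learning method that combines task-specific and universal adapters to address incorrect module selection and neglected shared knowledge, performing strongly on multiple datasets.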

Details

Motivation: Existing pre-trained model-based class-incremental learning methods typically freeze the pre-trained network and adapt to incremental tasks with lightweight modules such as adapters, but incorrect module selection during inference hurts performance, and task-specific modules may overlook shared general knowledge, making it hard to distinguish similar classes across tasks. Method: TUNA trains task-specific adapters to capture the key features of each task, introduces an entropy-based selection mechanism to choose the most suitable adapter, and uses an adapter fusion strategy to construct a universal adapter that encodes the most discriminative features shared across tasks. Result: Extensive experiments on multiple benchmark datasets demonstrate the superior performance of TUNA; the code is publicly available. Conclusion: The paper proposes TUNA, a new adapter-based class-incremental learning method that combines task-specific and universal adapters and achieves state-of-the-art performance on multiple benchmark datasets. Abstract: Class-Incremental Learning (CIL) requires a learning system to continually learn new classes without forgetting. Existing pre-trained model-based CIL methods often freeze the pre-trained network and adapt to incremental tasks using additional lightweight modules such as adapters. However, incorrect module selection during inference hurts performance, and task-specific modules often overlook shared general knowledge, leading to errors in distinguishing between similar classes across tasks. To address the aforementioned challenges, we propose integrating Task-Specific and Universal Adapters (TUNA) in this paper. Specifically, we train task-specific adapters to capture the most crucial features relevant to their respective tasks and introduce an entropy-based selection mechanism to choose the most suitable adapter. Furthermore, we leverage an adapter fusion strategy to construct a universal adapter, which encodes the most discriminative features shared across tasks. We combine task-specific and universal adapter predictions to harness both specialized and general knowledge during inference. Extensive experiments on various benchmark datasets demonstrate the state-of-the-art performance of our approach. Code is available at: https://github.com/LAMDA-CL/ICCV2025-TUNA
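One plausible reading of the entropy-based selection mechanism in TUNA is sketched below: each task-specific adapter produces class logits for the same input, and the adapter whose softmax prediction has the lowest entropy (the most confident one) is selected. The abstract does not state this exact rule, so the lowest-entropy criterion and all names here are assumptions.

```python
from math import exp, log


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * log(p) for p in probs if p > 0)


def select_adapter(logits_per_adapter):
    """Return (index of lowest-entropy adapter, list of all entropies).

    Assumption: the most confident adapter is the best match for the input.
    """
    entropies = [entropy(softmax(logits)) for logits in logits_per_adapter]
    best = min(range(len(entropies)), key=entropies.__getitem__)
    return best, entropies


# Logits from three hypothetical task-specific adapters for one input.
adapter_logits = [
    [1.0, 1.0, 1.0],  # uniform prediction, maximal entropy
    [5.0, 0.0, 0.0],  # sharply peaked prediction
    [2.0, 1.0, 0.0],  # moderately peaked prediction
]
best_index, entropies = select_adapter(adapter_logits)
```

Per the abstract, the selected adapter's prediction is then combined with the universal adapter's prediction at inference; that combination step is not shown here.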

[301] ReconDreamer-RL: Enhancing Reinforcement Learning via Diffusion-based Scene Reconstruction

Chaojun Ni,Guosheng Zhao,Xiaofeng Wang,Zheng Zhu,Wenkang Qin,Xinze Chen,Guanghong Jia,Guan Huang,Wenjun Mei

Main category: cs.CV

TL;DR: ReconDreamer-RL is a framework that uses video diffusion priors and scene reconstruction to improve reinforcement learning for autonomous driving, addressing the sim2real gap and enhancing performance with a 5x reduction in collision ratio compared to imitation learning methods.

Details

Motivation: To bridge the simulation-to-reality (sim2real) gap in autonomous driving training by improving the realism of simulation environments and addressing limitations in training data distribution. Method: The proposed ReconDreamer-RL framework integrates video diffusion priors into scene reconstruction to enhance reinforcement learning for autonomous driving. It includes ReconSimulator for scenario reconstruction, Dynamic Adversary Agent (DAA) for generating corner-case scenarios, and Cousin Trajectory Generator (CTG) to address biased training data. Result: ReconDreamer-RL reduces the Collision Ratio by 5x compared to imitation learning methods and enhances the training of end-to-end autonomous driving models. Conclusion: ReconDreamer-RL improves end-to-end autonomous driving training, outperforming imitation learning methods with a 5x reduction in the Collision Ratio. Abstract: Reinforcement learning for training end-to-end autonomous driving models in closed-loop simulations is gaining growing attention. However, most simulation environments differ significantly from real-world conditions, creating a substantial simulation-to-reality (sim2real) gap. To bridge this gap, some approaches utilize scene reconstruction techniques to create photorealistic environments as a simulator. While this improves realistic sensor simulation, these methods are inherently constrained by the distribution of the training data, making it difficult to render high-quality sensor data for novel trajectories or corner case scenarios. Therefore, we propose ReconDreamer-RL, a framework designed to integrate video diffusion priors into scene reconstruction to aid reinforcement learning, thereby enhancing end-to-end autonomous driving training. 
Specifically, in ReconDreamer-RL, we introduce ReconSimulator, which combines the video diffusion prior for appearance modeling and incorporates a kinematic model for physical modeling, thereby reconstructing driving scenarios from real-world data. This narrows the sim2real gap for closed-loop evaluation and reinforcement learning. To cover more corner-case scenarios, we introduce the Dynamic Adversary Agent (DAA), which adjusts the trajectories of surrounding vehicles relative to the ego vehicle, autonomously generating corner-case traffic scenarios (e.g., cut-in). Finally, the Cousin Trajectory Generator (CTG) is proposed to address the issue of training data distribution, which is often biased toward simple straight-line movements. Experiments show that ReconDreamer-RL improves end-to-end autonomous driving training, outperforming imitation learning methods with a 5x reduction in the Collision Ratio.

[302] CD-TVD: Contrastive Diffusion for 3D Super-Resolution with Scarce High-Resolution Time-Varying Data

Chongke Bi,Xin Gao,Jiangkang Deng,Guan

Main category: cs.CV

TL;DR: CD-TVD is a new framework for 3D super-resolution that uses contrastive learning and an improved diffusion model, requiring minimal high-resolution data while achieving accurate and resource-efficient results for large-scale scientific simulations.

Details

Motivation: Existing super-resolution methods require large amounts of high-resolution training data, limiting their applicability to diverse simulation scenarios. CD-TVD aims to reduce this dependency while maintaining accuracy. Method: CD-TVD combines contrastive learning and an improved diffusion-based super-resolution model with a local attention mechanism, pre-trained on historical simulation data and fine-tuned using only one newly generated high-resolution timestep. Result: Experimental results on fluid and atmospheric simulation datasets confirm that CD-TVD delivers accurate and resource-efficient 3D super-resolution, marking a significant advancement in data augmentation for large-scale scientific simulations. Conclusion: CD-TVD is a novel framework that minimizes reliance on large-scale high-resolution datasets while maintaining the ability to recover fine-grained details in 3D super-resolution tasks. Abstract: Large-scale scientific simulations require significant resources to generate high-resolution time-varying data (TVD). While super-resolution is an efficient post-processing strategy to reduce costs, existing methods rely on a large amount of HR training data, limiting their applicability to diverse simulation scenarios. To address this constraint, we propose CD-TVD, a novel framework that combines contrastive learning and an improved diffusion-based super-resolution model to achieve accurate 3D super-resolution from limited time-step high-resolution data. During pre-training on historical simulation data, the contrastive encoder and diffusion super-resolution modules learn degradation patterns and detailed features of high-resolution and low-resolution samples. In the training phase, the improved diffusion model with a local attention mechanism is fine-tuned using only one newly generated high-resolution timestep, leveraging the degradation knowledge learned by the encoder. 
This design minimizes the reliance on large-scale high-resolution datasets while maintaining the capability to recover fine-grained details. Experimental results on fluid and atmospheric simulation datasets confirm that CD-TVD delivers accurate and resource-efficient 3D super-resolution, marking a significant advancement in data augmentation for large-scale scientific simulations. The code is available at https://github.com/Xin-Gao-private/CD-TVD.

[303] MedReasoner: Reinforcement Learning Drives Reasoning Grounding from Clinical Thought to Pixel-Level Precision

Zhonghao Yan,Muxi Diao,Yuxuan Yang,Jiayuan Xu,Kaizhou Zhang,Ruoyan Jing,Lele Yang,Yanxi Liu,Kongming Liang,Zhanyu Ma

Main category: cs.CV

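TL;DR: This work proposes MedReasoner, a framework that uses reinforcement learning to optimize vision-language models for medical image grounding, addressing the reliance of conventional approaches on supervised data.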

Details

Motivation: Current medical image grounding relies on supervised fine-tuning and struggles with the implicit queries common in clinical practice. Method: The paper introduces MedReasoner, a modular framework in which an MLLM performs the reasoning and a segmentation expert model generates pixel-level masks, with alignment achieved through format and accuracy rewards. Result: MedReasoner achieves state-of-the-art performance on the U-MRG-14K dataset and shows strong generalization to unseen clinical queries. Conclusion: By separating reasoning from segmentation and optimizing with reinforcement learning, MedReasoner demonstrates significant promise and state-of-the-art performance in medical image grounding. Abstract: Accurately grounding regions of interest (ROIs) is critical for diagnosis and treatment planning in medical imaging. While multimodal large language models (MLLMs) combine visual perception with natural language, current medical-grounding pipelines still rely on supervised fine-tuning with explicit spatial hints, making them ill-equipped to handle the implicit queries common in clinical practice. This work makes three core contributions. We first define Unified Medical Reasoning Grounding (UMRG), a novel vision-language task that demands clinical reasoning and pixel-level grounding. Second, we release U-MRG-14K, a dataset of 14K samples featuring pixel-level masks alongside implicit clinical queries and reasoning traces, spanning 10 modalities, 15 super-categories, and 108 specific categories. Finally, we introduce MedReasoner, a modular framework that distinctly separates reasoning from segmentation: an MLLM reasoner is optimized with reinforcement learning, while a frozen segmentation expert converts spatial prompts into masks, with alignment achieved through format and accuracy rewards. MedReasoner achieves state-of-the-art performance on U-MRG-14K and demonstrates strong generalization to unseen clinical queries, underscoring the significant promise of reinforcement learning for interpretable medical grounding.

[304] 3D Human Mesh Estimation from Single View RGBD

Ozhan Suat,Bedirhan Uguz,Batuhan Karagoz,Muhammed Can Keles,Emre Akbas

Main category: cs.CV

TL;DR: Proposes M$^3$ (Masked Mesh Modeling), a method for estimating 3D human meshes from a single RGBD view that leverages existing Motion Capture (MoCap) datasets to produce complete 3D human meshes.

Details

Motivation: Despite significant progress in 3D human mesh estimation from RGB images, RGBD cameras remain underutilized. Existing RGBD datasets are small and limited in pose and shape diversity, making it hard to train high-quality models. Method: Complete 3D meshes are first obtained from MoCap datasets and then projected to a virtual camera to create partial, single-view meshes that simulate the depth data an RGBD camera provides. A masked autoencoder is then trained to complete these partial meshes. At inference time, M$^3$ matches the depth values from the sensor to the vertices of a template human mesh and produces a complete 3D human mesh. Result: M$^3$ achieves 16.8 mm and 22.0 mm per-vertex error (PVE) on the SURREAL and CAPE datasets, respectively, outperforming existing methods that use full-body point clouds as input. On the BEHAVE dataset it obtains a PVE of 70.9 mm, an improvement of 18.4 mm over a recently published RGB-based method. Conclusion: The method makes full use of the depth data from RGBD cameras and, by completing partial meshes with a masked autoencoder, achieves more accurate 3D human mesh estimation. Abstract: Despite significant progress in 3D human mesh estimation from RGB images, RGBD cameras, offering additional depth data, remain underutilized. In this paper, we present a method for accurate 3D human mesh estimation from a single RGBD view, leveraging the affordability and widespread adoption of RGBD cameras for real-world applications. A fully supervised approach for this problem requires a dataset with RGBD image and 3D mesh label pairs. However, collecting such a dataset is costly and challenging; hence, existing datasets are small and limited in pose and shape diversity. To overcome this data scarcity, we leverage existing Motion Capture (MoCap) datasets. We first obtain complete 3D meshes from the body models found in MoCap datasets, and create partial, single-view versions of them by projection to a virtual camera. This simulates the depth data provided by an RGBD camera from a single viewpoint. Then, we train a masked autoencoder to complete the partial, single-view mesh. During inference, our method, which we name M$^3$ for "Masked Mesh Modeling", matches the depth values coming from the sensor to vertices of a template human mesh, which creates a partial, single-view mesh. We effectively recover parts of the 3D human body mesh model that are not visible, resulting in a full body mesh. M$^3$ achieves 16.8 mm and 22.0 mm per-vertex-error (PVE) on the SURREAL and CAPE datasets, respectively, outperforming existing methods that use full-body point clouds as input. 
We obtain a competitive 70.9 mm PVE on the BEHAVE dataset, outperforming a recently published RGB-based method by 18.4 mm, highlighting the usefulness of depth data. Code will be released.
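For readers unfamiliar with the metric, per-vertex error (PVE) is commonly computed as the mean Euclidean distance between corresponding predicted and ground-truth mesh vertices. The sketch below assumes that definition and assumes the two meshes share vertex ordering; the paper's evaluation may add an alignment step first (for example root or pelvis alignment), which is not shown.

```python
from math import dist


def per_vertex_error(pred_vertices, gt_vertices):
    """Mean Euclidean distance between corresponding vertices.

    The result is in the same unit as the inputs (e.g. millimetres).
    Assumption: both meshes use the same vertex ordering and no
    alignment is applied beforehand.
    """
    if len(pred_vertices) != len(gt_vertices):
        raise ValueError("meshes must have the same number of vertices")
    if not pred_vertices:
        raise ValueError("meshes must contain at least one vertex")
    total = sum(dist(p, g) for p, g in zip(pred_vertices, gt_vertices))
    return total / len(pred_vertices)


# Two-vertex toy mesh in millimetres: one exact vertex, one off by 5 mm.
pred = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
pve_mm = per_vertex_error(pred, gt)
```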

[305] PP-Motion: Physical-Perceptual Fidelity Evaluation for Human Motion Generation

Sihan Zhao,Zixuan Wang,Tianyu Luan,Jia Jia,Wentao Zhu,Jiebo Luo,Junsong Yuan,Nan Xi

Main category: cs.CV

TL;DR: This paper proposes PP-Motion, a novel metric for evaluating human motion generation by combining physical alignment and perceptual fidelity, showing superior performance over existing methods.

Details

Motivation: Evaluating the fidelity of generated human motion is crucial but challenging due to the gap between human perception and physical feasibility. Existing methods relying on subjective or binary labeling lack robustness and objectivity. Method: The authors introduce a physical labeling method that evaluates motion fidelity by calculating the minimum modifications needed for a motion to align with physical laws. They propose PP-Motion, a data-driven metric that combines physical alignment with a perceptual fidelity loss and is trained using Pearson's correlation loss. Result: PP-Motion achieves better alignment with both physical laws and human perception than previous approaches, offering a more accurate and objective evaluation of motion fidelity. Conclusion: The proposed PP-Motion metric effectively evaluates both physical and perceptual fidelity of human motion by incorporating physical alignment annotations and human perception, outperforming previous methods. Abstract: Human motion generation has found widespread applications in AR/VR, film, sports, and medical rehabilitation, offering a cost-effective alternative to traditional motion capture systems. However, evaluating the fidelity of such generated motions is a crucial, multifaceted task. Although previous approaches have attempted motion fidelity evaluation using human perception or physical constraints, there remains an inherent gap between human-perceived fidelity and physical feasibility. Moreover, the subjective and coarse binary labeling of human perception further undermines the development of a robust data-driven metric. We address these issues by introducing a physical labeling method. This method evaluates motion fidelity by calculating the minimum modifications needed for a motion to align with physical laws. With this approach, we are able to produce fine-grained, continuous physical alignment annotations that serve as objective ground truth. 
With these annotations, we propose PP-Motion, a novel data-driven metric to evaluate both physical and perceptual fidelity of human motion. To effectively capture underlying physical priors, we employ Pearson's correlation loss for the training of our metric. Additionally, by incorporating a human-based perceptual fidelity loss, our metric can capture fidelity that simultaneously considers both human perception and physical alignment. Experimental results demonstrate that our metric, PP-Motion, not only aligns with physical laws but also aligns better with human perception of motion fidelity than previous work.
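The abstract says the metric is trained with "Pearson's correlation loss" but does not give the formula. A common form, assumed here, is one minus the Pearson correlation coefficient between predicted scores and ground-truth annotations over a batch, so that perfectly correlated predictions give zero loss. The function names are hypothetical.

```python
from math import sqrt


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = sqrt(var_x * var_y)
    if denom == 0:
        return 0.0  # constant input: correlation is undefined, treat as 0
    return cov / denom


def pearson_loss(predicted, target):
    """Assumed loss form: 1 - r, ranging from 0 (r = 1) to 2 (r = -1)."""
    return 1.0 - pearson_correlation(predicted, target)


# Predicted fidelity scores vs. physical-alignment annotations (toy values).
predicted = [1.0, 2.0, 3.0, 4.0]
annotations = [2.0, 4.0, 6.0, 8.0]
loss_aligned = pearson_loss(predicted, annotations)
loss_reversed = pearson_loss(predicted, annotations[::-1])
```

Because the correlation is invariant to scale and offset, a loss of this form rewards getting the ranking and relative spacing of scores right rather than matching annotation values exactly.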

[306] THAT: Token-wise High-frequency Augmentation Transformer for Hyperspectral Pansharpening

Hongkun Jin,Hongcheng Jiang,Zejun Zhang,Yuan Zhang,Jia Fu,Tingfeng Li,Kai Luo

Main category: cs.CV

TL;DR: This paper proposes THAT, a novel transformer framework for hyperspectral pansharpening that enhances high-frequency detail learning and token selection, achieving state-of-the-art results.

Details

Motivation: Transformer-based methods face limitations in hyperspectral pansharpening due to redundant token representations, attention dispersion, and lack of multi-scale feature modeling. Hyperspectral images require preservation of high-frequency components and localized details for accurate reconstruction. Method: Token-wise High-frequency Augmentation Transformer (THAT), including Pivotal Token Selective Attention (PTSA) and Multi-level Variance-aware Feed-forward Network (MVFN). Result: Experiments on standard benchmarks show that THAT improves reconstruction quality and efficiency, achieving state-of-the-art performance. Conclusion: The proposed THAT framework achieves state-of-the-art performance in hyperspectral pansharpening by improving high-frequency feature representation and token selection. Abstract: Transformer-based methods have demonstrated strong potential in hyperspectral pansharpening by modeling long-range dependencies. However, their effectiveness is often limited by redundant token representations and a lack of multi-scale feature modeling. Hyperspectral images exhibit intrinsic spectral priors (e.g., abundance sparsity) and spatial priors (e.g., non-local similarity), which are critical for accurate reconstruction. From a spectral-spatial perspective, Vision Transformers (ViTs) face two major limitations: they struggle to preserve high-frequency components--such as material edges and texture transitions--and suffer from attention dispersion across redundant tokens. These issues stem from the global self-attention mechanism, which tends to dilute high-frequency signals and overlook localized details. To address these challenges, we propose the Token-wise High-frequency Augmentation Transformer (THAT), a novel framework designed to enhance hyperspectral pansharpening through improved high-frequency feature representation and token selection. 
Specifically, THAT introduces: (1) Pivotal Token Selective Attention (PTSA) to prioritize informative tokens and suppress redundancy; (2) a Multi-level Variance-aware Feed-forward Network (MVFN) to enhance high-frequency detail learning. Experiments on standard benchmarks show that THAT achieves state-of-the-art performance with improved reconstruction quality and efficiency. The source code is available at https://github.com/kailuo93/THAT.

[307] KARMA: Efficient Structural Defect Segmentation via Kolmogorov-Arnold Representation Learning

Md Meftahul Ferdaus,Mahdi Abdelguerfi,Elias Ioup,Steven Sloan,Kendall N. Niles,Ken Pathak

Main category: cs.CV

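TL;DR: KARMA is an efficient semantic segmentation framework whose architecture substantially reduces parameter count and computational cost, making it suitable for real-time infrastructure defect inspection.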

Details

Motivation: Semantic segmentation of structural defects is challenging because of highly variable defect appearance, harsh imaging conditions, and class imbalance, and current deep learning methods have too many parameters to suit real-time inspection systems. Method: KARMA models complex defect patterns through compositions of one-dimensional functions rather than conventional convolutions, and includes three technical innovations: a parameter-efficient TiKAN module, an optimized feature pyramid structure, and a static-dynamic prototype mechanism. Result: Experiments on benchmark infrastructure inspection datasets show that KARMA achieves competitive or superior mean IoU compared to state-of-the-art methods while using far fewer parameters (0.959M vs. 31.04M) and only 0.264 GFLOPS of computation. Conclusion: KARMA substantially reduces parameter count and computation while maintaining accuracy, which makes it suitable for real-time deployment and enables practical automated infrastructure inspection systems. Abstract: Semantic segmentation of structural defects in civil infrastructure remains challenging due to variable defect appearances, harsh imaging conditions, and significant class imbalance. Current deep learning methods, despite their effectiveness, typically require millions of parameters, rendering them impractical for real-time inspection systems. We introduce KARMA (Kolmogorov-Arnold Representation Mapping Architecture), a highly efficient semantic segmentation framework that models complex defect patterns through compositions of one-dimensional functions rather than conventional convolutions. KARMA features three technical innovations: (1) a parameter-efficient Tiny Kolmogorov-Arnold Network (TiKAN) module leveraging low-rank factorization for KAN-based feature transformation; (2) an optimized feature pyramid structure with separable convolutions for multi-scale defect analysis; and (3) a static-dynamic prototype mechanism that enhances feature representation for imbalanced classes. Extensive experiments on benchmark infrastructure inspection datasets demonstrate that KARMA achieves competitive or superior mean IoU performance compared to state-of-the-art approaches, while using significantly fewer parameters (0.959M vs. 31.04M, a 97% reduction). Operating at 0.264 GFLOPS, KARMA maintains inference speeds suitable for real-time deployment, enabling practical automated infrastructure inspection systems without compromising accuracy. The source code can be accessed at the following URL: https://github.com/faeyelab/karma.

[308] Reinforcement Learning in Vision: A Survey

Weijia Wu,Chen Gao,Joya Chen,Kevin Qinghong Lin,Qingwei Meng,Yiming Zhang,Yuke Qiu,Hong Zhou,Mike Zheng Shou

Main category: cs.CV

TL;DR: This survey provides a comprehensive overview of recent developments in visual reinforcement learning, categorizing key research areas, analyzing trends, and identifying future research directions.

Details

Motivation: Recent advances at the intersection of reinforcement learning and visual intelligence have led to agents capable of perceiving, reasoning, generating, and acting within complex visual scenes. This survey aims to provide a coherent map of the rapidly expanding landscape of visual RL for researchers and practitioners. Method: The paper organizes over 200 representative works into four thematic pillars: multi-modal large language models, visual generation, unified model frameworks, and vision-language-action models. It reviews policy-optimization strategies, reward engineering, benchmark progress, algorithmic design, and trends in the field of visual reinforcement learning. Result: The survey formalizes visual RL problems, traces the evolution of policy-optimization strategies, organizes over 200 works into thematic pillars, examines algorithmic design and reward engineering, distills trends such as curriculum-driven training and unified reward modeling, and reviews evaluation protocols while identifying open challenges like sample efficiency and safe deployment. Conclusion: This survey aims to provide a critical and up-to-date synthesis of recent advances in visual reinforcement learning, highlighting the evolution of policy-optimization strategies, categorizing over 200 works into thematic pillars, reviewing evaluation protocols, and identifying open challenges and future directions in the field. Abstract: Recent advances at the intersection of reinforcement learning (RL) and visual intelligence have enabled agents that not only perceive complex visual scenes but also reason, generate, and act within them. This survey offers a critical and up-to-date synthesis of the field. We first formalize visual RL problems and trace the evolution of policy-optimization strategies from RLHF to verifiable reward paradigms, and from Proximal Policy Optimization to Group Relative Policy Optimization. 
We then organize more than 200 representative works into four thematic pillars: multi-modal large language models, visual generation, unified model frameworks, and vision-language-action models. For each pillar we examine algorithmic design, reward engineering, benchmark progress, and we distill trends such as curriculum-driven training, preference-aligned diffusion, and unified reward modeling. Finally, we review evaluation protocols spanning set-level fidelity, sample-level preference, and state-level stability, and we identify open challenges that include sample efficiency, generalization, and safe deployment. Our goal is to provide researchers and practitioners with a coherent map of the rapidly expanding landscape of visual RL and to highlight promising directions for future inquiry. Resources are available at: https://github.com/weijiawu/Awesome-Visual-Reinforcement-Learning.

[309] Spatial-ORMLLM: Improve Spatial Relation Understanding in the Operating Room with Multimodal Large Language Model

Peiqi He,Zhenhao Zhang,Yixiang Zhang,Xiongjun Zhao,Shaoliang Peng

Main category: cs.CV

TL;DR: Spatial-ORMLLM is a large vision-language model for 3D spatial reasoning in operating rooms that uses only RGB modality to infer volumetric and semantic cues, enabling medical tasks with detailed spatial context.

Details

Motivation: Precise spatial modeling in the operating room is essential for clinical tasks, but existing approaches either require multimodal 3D data that is difficult to obtain or fail to capture fine-grained details from 2D data. Method: Spatial-ORMLLM incorporates a Spatial-Enhanced Feature Fusion Block that integrates 2D modality inputs with rich 3D spatial knowledge extracted by an estimation algorithm, feeding the combined features into a visual tower within a unified end-to-end MLLM framework. Result: Spatial-ORMLLM achieves state-of-the-art performance and generalizes robustly to previously unseen surgical scenarios and downstream tasks. Conclusion: Spatial-ORMLLM is effective in 3D spatial reasoning in operating rooms using only RGB modality, achieving state-of-the-art performance and robust generalization. Abstract: Precise spatial modeling in the operating room (OR) is foundational to many clinical tasks, supporting intraoperative awareness, hazard avoidance, and surgical decision-making. While existing approaches leverage large-scale multimodal datasets for latent-space alignment to implicitly learn spatial relationships, they overlook the 3D capabilities of MLLMs. However, this approach raises two issues: (1) Operating rooms typically lack multiple video and audio sensors, making multimodal 3D data difficult to obtain; (2) Training solely on readily available 2D data fails to capture fine-grained details in complex scenes. To address this gap, we introduce Spatial-ORMLLM, the first large vision-language model for 3D spatial reasoning in operating rooms using only RGB modality to infer volumetric and semantic cues, enabling downstream medical tasks with detailed and holistic spatial context. Spatial-ORMLLM incorporates a Spatial-Enhanced Feature Fusion Block, which integrates 2D modality inputs with rich 3D spatial knowledge extracted by the estimation algorithm and then feeds the combined features into the visual tower. 
By employing a unified end-to-end MLLM framework, it combines powerful spatial features with textual features to deliver robust 3D scene reasoning without any additional expert annotations or sensor inputs. Experiments on multiple benchmark clinical datasets demonstrate that Spatial-ORMLLM achieves state-of-the-art performance and generalizes robustly to previously unseen surgical scenarios and downstream tasks.

[310] SAGOnline: Segment Any Gaussians Online

Wentao Sun,Quanyun Wu,Hanqing Xu,Kyle Gao,Zhengsen Xu,Yiping Chen,Dedong Zhang,Lingfei Ma,John S. Zelek,Jonathan Li

Main category: cs.CV

TL;DR: SAGOnline enables fast and accurate real-time 3D segmentation and tracking in Gaussian scenes by combining 2D video models with efficient 3D labeling techniques.

Details

Motivation: 3D Gaussian Splatting struggles with efficient and consistent 3D segmentation and multi-object tracking. Existing methods are computationally expensive and lack spatial reasoning. Method: SAGOnline uses a decoupled strategy integrating video foundation models (e.g., SAM2) for 2D mask propagation and a GPU-accelerated algorithm for 3D mask generation and instance labeling. Result: SAGOnline achieves 92.7% mIoU on NVOS and 95.2% mIoU on Spin-NeRF, with inference speeds 15 to 1500 times faster than existing methods. Conclusion: SAGOnline provides a lightweight, zero-shot framework that enables real-time 3D segmentation and multi-object tracking in Gaussian scenes, achieving state-of-the-art performance and efficiency. Abstract: 3D Gaussian Splatting (3DGS) has emerged as a powerful paradigm for explicit 3D scene representation, yet achieving efficient and consistent 3D segmentation remains challenging. Current methods suffer from prohibitive computational costs, limited 3D spatial reasoning, and an inability to track multiple objects simultaneously. We present Segment Any Gaussians Online (SAGOnline), a lightweight and zero-shot framework for real-time 3D segmentation in Gaussian scenes that addresses these limitations through two key innovations: (1) a decoupled strategy that integrates video foundation models (e.g., SAM2) for view-consistent 2D mask propagation across synthesized views; and (2) a GPU-accelerated 3D mask generation and Gaussian-level instance labeling algorithm that assigns unique identifiers to 3D primitives, enabling lossless multi-object tracking and segmentation across views. SAGOnline achieves state-of-the-art performance on NVOS (92.7% mIoU) and Spin-NeRF (95.2% mIoU) benchmarks, outperforming Feature3DGS, OmniSeg3D-gs, and SA3D by 15 to 1500 times in inference speed (27 ms/frame). Qualitative results demonstrate robust multi-object segmentation and tracking in complex scenes. 
Our contributions include: (i) a lightweight and zero-shot framework for 3D segmentation in Gaussian scenes, (ii) explicit labeling of Gaussian primitives enabling simultaneous segmentation and tracking, and (iii) the effective adaptation of 2D video foundation models to the 3D domain. This work allows real-time rendering and 3D scene understanding, paving the way for practical AR/VR and robotic applications.

[311] Learning User Preferences for Image Generation Model

Wenyi Mo,Ying Ba,Tianyu Zhang,Yalong Bai,Biye Li

Main category: cs.CV

TL;DR: This paper proposes a method for predicting user preferences using Multimodal Large Language Models, enabling personalized and group-specific preference modeling for improved image generation.

Details

Motivation: Existing methods often neglect individual variability and the dynamic, multifaceted nature of personal taste, relying on general human preferences or static user profiles. Method: The approach uses Multimodal Large Language Models with contrastive preference loss and learnable preference tokens to model individual and shared user preferences. Result: Experiments show that the model outperforms other methods in preference prediction accuracy, effectively identifying users with similar aesthetic inclinations and improving image generation alignment with individual tastes. Conclusion: The proposed approach effectively captures personalized and group-specific preferences, enhancing preference prediction accuracy and generating images aligned with individual tastes. Abstract: User preference prediction requires a comprehensive and accurate understanding of individual tastes. This includes both surface-level attributes, such as color and style, and deeper content-related aspects, such as themes and composition. However, existing methods typically rely on general human preferences or assume static user profiles, often neglecting individual variability and the dynamic, multifaceted nature of personal taste. To address these limitations, we propose an approach built upon Multimodal Large Language Models, introducing contrastive preference loss and preference tokens to learn personalized user preferences from historical interactions. The contrastive preference loss is designed to effectively distinguish between user "likes" and "dislikes", while the learnable preference tokens capture shared interest representations among existing users, enabling the model to activate group-specific preferences and enhance consistency across similar users.
Extensive experiments demonstrate that our model outperforms other methods in preference prediction accuracy, effectively identifying users with similar aesthetic inclinations and providing more precise guidance for generating images that align with individual tastes. The project page is https://learn-user-pref.github.io/.

[312] OMGSR: You Only Need One Mid-timestep Guidance for Real-World Image Super-Resolution

Zhiqiang Wu,Zhaomang Sun,Tong Zhou,Bingtao Fu,Ji Cong,Yitong Dong,Huaqi Zhang,Xuan Tang,Mingsong Chen,Xian Wei

Main category: cs.CV

TL;DR: This study introduces the OMGSR framework for Real-ISR tasks, which improves performance by injecting the LQ image latent distribution at a mid-timestep and using specialized loss functions. The framework achieves strong results at multiple resolutions.

Details

Motivation: Recent one-step Real-ISR models face limitations due to a fundamental gap between the LQ image latent distribution and the Gaussian noisy latent distribution. This study aims to bridge this gap by leveraging the alignment of the noisy latent distribution at DDPM/FM mid-timesteps with the LQ image latent distribution. Method: The study proposes the OMGSR framework, which injects the LQ image latent distribution at a pre-computed mid-timestep. It also introduces the Latent Distribution Refinement loss and the Overlap-Chunked LPIPS/GAN loss to enhance performance. The framework is applied to DDPM/FM-based generative models with two variants: OMGSR-S (SD-Turbo) and OMGSR-F (FLUX.1-dev). Result: The proposed OMGSR framework demonstrates improved performance in Real-ISR tasks. OMGSR-S/F achieves balanced/excellent performance at 512-resolution, with OMGSR-F showing overwhelming dominance in all reference metrics. A 1k-resolution OMGSR-F model yields excellent results in image detail generation, and 2k-resolution images are successfully generated using a two-stage Tiled VAE & Diffusion method. Conclusion: OMGSR-S/F achieves balanced/excellent performance across quantitative and qualitative metrics at 512-resolution, with OMGSR-F showing overwhelming dominance in all reference metrics. Abstract: Denoising Diffusion Probabilistic Models (DDPM) and Flow Matching (FM) generative models show promising potential for one-step Real-World Image Super-Resolution (Real-ISR). Recent one-step Real-ISR models typically inject a Low-Quality (LQ) image latent distribution at the initial timestep. However, a fundamental gap exists between the LQ image latent distribution and the Gaussian noisy latent distribution, limiting the effective utilization of generative priors. We observe that the noisy latent distribution at DDPM/FM mid-timesteps aligns more closely with the LQ image latent distribution. 
Based on this insight, we present One Mid-timestep Guidance Real-ISR (OMGSR), a universal framework applicable to DDPM/FM-based generative models. OMGSR injects the LQ image latent distribution at a pre-computed mid-timestep, incorporating the proposed Latent Distribution Refinement loss to alleviate the latent distribution gap. We also design the Overlap-Chunked LPIPS/GAN loss to eliminate checkerboard artifacts in image generation. Within this framework, we instantiate OMGSR for DDPM/FM-based generative models with two variants: OMGSR-S (SD-Turbo) and OMGSR-F (FLUX.1-dev). Experimental results demonstrate that OMGSR-S/F achieves balanced/excellent performance across quantitative and qualitative metrics at 512-resolution. Notably, OMGSR-F establishes overwhelming dominance in all reference metrics. We further train a 1k-resolution OMGSR-F to match the default resolution of FLUX.1-dev, which yields excellent results, especially in the details of the image generation. We also generate 2k-resolution images by the 1k-resolution OMGSR-F using our two-stage Tiled VAE & Diffusion.

[313] Cut2Next: Generating Next Shot via In-Context Tuning

Jingwen He,Hongbo Liu,Jiajun Li,Ziqi Huang,Yu Qiao,Wanli Ouyang,Ziwei Liu

Main category: cs.CV

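TL;DR: This paper introduces Cut2Next, a framework for generating high-quality next shots that conform to professional editing patterns and maintain cinematic continuity.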

Details

Motivation: Current methods often neglect the editing patterns that drive narrative flow (e.g., shot/reverse shot, cutaways), yielding outputs that lack narrative sophistication and true cinematic integrity. Method: Cut2Next, built on a Diffusion Transformer (DiT), uses in-context tuning guided by a novel Hierarchical Multi-Prompting strategy comprising Relational Prompts and Individual Prompts, combined with two architectural innovations: Context-Aware Condition Injection (CACI) and Hierarchical Attention Mask (HAM). Result: Experiments show that Cut2Next excels in visual consistency and text fidelity, and user studies reveal a strong preference for it, particularly for its adherence to intended editing patterns and overall cinematic continuity. Conclusion: Cut2Next generates high-quality, narratively expressive, and cinematically coherent next shots, as validated by user studies. Abstract: Effective multi-shot generation demands purposeful, film-like transitions and strict cinematic continuity. Current methods, however, often prioritize basic visual consistency, neglecting crucial editing patterns (e.g., shot/reverse shot, cutaways) that drive narrative flow for compelling storytelling. This yields outputs that may be visually coherent but lack narrative sophistication and true cinematic integrity. To bridge this, we introduce Next Shot Generation (NSG): synthesizing a subsequent, high-quality shot that critically conforms to professional editing patterns while upholding rigorous cinematic continuity. Our framework, Cut2Next, leverages a Diffusion Transformer (DiT). It employs in-context tuning guided by a novel Hierarchical Multi-Prompting strategy. This strategy uses Relational Prompts to define overall context and inter-shot editing styles. Individual Prompts then specify per-shot content and cinematographic attributes. Together, these guide Cut2Next to generate cinematically appropriate next shots. Architectural innovations, Context-Aware Condition Injection (CACI) and Hierarchical Attention Mask (HAM), further integrate these diverse signals without introducing new parameters. We construct RawCuts (large-scale) and CuratedCuts (refined) datasets, both with hierarchical prompts, and introduce CutBench for evaluation. Experiments show Cut2Next excels in visual consistency and text fidelity.
Crucially, user studies reveal a strong preference for Cut2Next, particularly for its adherence to intended editing patterns and overall cinematic continuity, validating its ability to generate high-quality, narratively expressive, and cinematically coherent subsequent shots.

[314] StableAvatar: Infinite-Length Audio-Driven Avatar Video Generation

Shuyuan Tu,Yueming Pan,Yinming Huang,Xintong Han,Zhen Xing,Qi Dai,Chong Luo,Zuxuan Wu,Yu-Gang Jiang

Main category: cs.CV

TL;DR: This paper introduces StableAvatar, a novel video diffusion transformer that generates infinite-length, high-quality videos with natural audio synchronization and identity consistency, overcoming the limitations of current diffusion models by introducing a Time-step-aware Audio Adapter, an Audio Native Guidance Mechanism, and a Dynamic Weighted Sliding-window Strategy.

Details

Motivation: Current diffusion models struggle to synthesize long videos with natural audio synchronization and identity consistency, primarily due to their reliance on third-party audio extractors and lack of audio-related priors, which leads to latent distribution error accumulation. Method: The paper introduces a Time-step-aware Audio Adapter, an Audio Native Guidance Mechanism, and a Dynamic Weighted Sliding-window Strategy to improve audio modeling, prevent error accumulation, enhance audio synchronization, and ensure smoothness in infinite-length videos. Result: Experiments on benchmarks show that StableAvatar is effective both qualitatively and quantitatively in generating high-quality, infinite-length videos with natural audio synchronization and identity consistency. Conclusion: StableAvatar is the first end-to-end video diffusion transformer that can synthesize infinite-length, high-quality videos with natural audio synchronization and identity consistency. Abstract: Current diffusion models for audio-driven avatar video generation struggle to synthesize long videos with natural audio synchronization and identity consistency. This paper presents StableAvatar, the first end-to-end video diffusion transformer that synthesizes infinite-length high-quality videos without post-processing. Conditioned on a reference image and audio, StableAvatar integrates tailored training and inference modules to enable infinite-length video generation. We observe that the main reason preventing existing models from generating long videos lies in their audio modeling. They typically rely on third-party off-the-shelf extractors to obtain audio embeddings, which are then directly injected into the diffusion model via cross-attention. 
Since current diffusion backbones lack any audio-related priors, this approach causes severe latent distribution error accumulation across video clips, leading the latent distribution of subsequent segments to drift away from the optimal distribution gradually. To address this, StableAvatar introduces a novel Time-step-aware Audio Adapter that prevents error accumulation via time-step-aware modulation. During inference, we propose a novel Audio Native Guidance Mechanism to further enhance the audio synchronization by leveraging the diffusion's own evolving joint audio-latent prediction as a dynamic guidance signal. To enhance the smoothness of the infinite-length videos, we introduce a Dynamic Weighted Sliding-window Strategy that fuses latent over time. Experiments on benchmarks show the effectiveness of StableAvatar both qualitatively and quantitatively.

[315] ReferSplat: Referring Segmentation in 3D Gaussian Splatting

Shuting He,Guangquan Jie,Changshuo Wang,Yun Zhou,Shuming Hu,Guanbin Li,Henghui Ding

Main category: cs.CV

TL;DR: The paper introduces R3DGS, a new task for segmenting 3D objects based on natural language descriptions, proposes the ReferSplat framework to address its challenges, and presents the Ref-LERF dataset.

Details

Motivation: The motivation is to develop a model capable of segmenting target objects in a 3D Gaussian scene based on natural language descriptions, particularly focusing on objects that may be occluded or not directly visible in a novel view, to advance embodied AI. Method: The authors proposed ReferSplat, a framework that explicitly models 3D Gaussian points with natural language expressions in a spatially aware paradigm, and introduced the first dataset for this task called Ref-LERF. Result: ReferSplat achieves state-of-the-art performance on both the newly proposed R3DGS task and 3D open-vocabulary segmentation benchmarks. Conclusion: The paper concludes that ReferSplat effectively addresses the challenges of 3D multi-modal understanding and spatial relationship modeling in the newly introduced R3DGS task, achieving state-of-the-art performance. Abstract: We introduce Referring 3D Gaussian Splatting Segmentation (R3DGS), a new task that aims to segment target objects in a 3D Gaussian scene based on natural language descriptions, which often contain spatial relationships or object attributes. This task requires the model to identify newly described objects that may be occluded or not directly visible in a novel view, posing a significant challenge for 3D multi-modal understanding. Developing this capability is crucial for advancing embodied AI. To support research in this area, we construct the first R3DGS dataset, Ref-LERF. Our analysis reveals that 3D multi-modal understanding and spatial relationship modeling are key challenges for R3DGS. To address these challenges, we propose ReferSplat, a framework that explicitly models 3D Gaussian points with natural language expressions in a spatially aware paradigm. ReferSplat achieves state-of-the-art performance on both the newly proposed R3DGS task and 3D open-vocabulary segmentation benchmarks. Dataset and code are available at https://github.com/heshuting555/ReferSplat.

[316] Learning an Implicit Physics Model for Image-based Fluid Simulation

Emily Yue-Ting Jia,Jiageng Mao,Zhiyuan Gao,Yajie Zhao,Yue Wang

Main category: cs.CV

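TL;DR: This paper proposes a physics-informed neural network method that generates physically consistent 4D animations from a single image; experiments show it outperforms existing techniques.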

Details

Motivation: Humans can imagine 4D scenes from a single image, whereas existing methods typically rely on simplistic 2D motion estimation, producing animations that defy physical principles; a more realistic approach is needed. Method: A physics-informed neural network predicts the motion of each surface point, guided by a loss term derived from physical principles (including the Navier-Stokes equations), and feature-based 3D Gaussians are predicted to capture appearance. Result: Experiments show significant improvements in producing physically plausible animations, which can be rendered from any camera perspective. Conclusion: The authors propose a physics-informed neural network method that generates physically consistent 4D scenes from a single image, and experiments show that it outperforms existing methods in producing physically plausible animations. Abstract: Humans possess an exceptional ability to imagine 4D scenes, encompassing both motion and 3D geometry, from a single still image. This ability is rooted in our accumulated observations of similar scenes and an intuitive understanding of physics. In this paper, we aim to replicate this capacity in neural networks, specifically focusing on natural fluid imagery. Existing methods for this task typically employ simplistic 2D motion estimators to animate the image, leading to motion predictions that often defy physical principles, resulting in unrealistic animations. Our approach introduces a novel method for generating 4D scenes with physics-consistent animation from a single image. We propose the use of a physics-informed neural network that predicts motion for each surface point, guided by a loss term derived from fundamental physical principles, including the Navier-Stokes equations. To capture appearance, we predict feature-based 3D Gaussians from the input image and its estimated depth, which are then animated using the predicted motions and rendered from any desired camera perspective. Experimental results highlight the effectiveness of our method in producing physically plausible animations, showcasing significant performance improvements over existing methods. Our project page is https://physfluid.github.io/.
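Illustrative note for [316]: the abstract mentions a loss term derived from physical principles, including the Navier-Stokes equations, but does not give its form. As a hedged sketch only, the incompressible Navier-Stokes equations and a generic physics-informed residual penalty could be written as follows. The symbols (velocity u, pressure p, density ρ, kinematic viscosity ν, body force g), the weight λ, the N sample points, and the loss itself are assumptions for illustration, not the authors' formulation.

```latex
\begin{aligned}
\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u}\cdot\nabla)\mathbf{u}
  &= -\frac{1}{\rho}\nabla p + \nu \nabla^{2}\mathbf{u} + \mathbf{g},
  \qquad \nabla\cdot\mathbf{u} = 0, \\
\mathbf{r}(\mathbf{x}, t)
  &= \frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u}\cdot\nabla)\mathbf{u}
     + \frac{1}{\rho}\nabla p - \nu \nabla^{2}\mathbf{u} - \mathbf{g}, \\
\mathcal{L}_{\text{phys}}
  &= \frac{1}{N}\sum_{i=1}^{N}
     \Big( \big\|\mathbf{r}(\mathbf{x}_i, t_i)\big\|_2^{2}
     + \lambda \big(\nabla\cdot\mathbf{u}(\mathbf{x}_i, t_i)\big)^{2} \Big).
\end{aligned}
```

A predicted velocity field that satisfies both equations at every sampled point gives a zero residual, so minimizing a penalty of this kind pushes the predicted motion toward physically consistent flow.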

eess.IV [Back]

[317] Transfer Learning with EfficientNet for Accurate Leukemia Cell Classification

Faisal Ahmed

Main category: eess.IV

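TL;DR: This study uses transfer learning with pretrained convolutional neural networks (CNNs) and data augmentation to address class imbalance in Acute Lymphoblastic Leukemia (ALL) diagnosis and to develop an accurate and robust diagnostic tool.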

Details

Motivation: Accurate classification of Acute Lymphoblastic Leukemia (ALL) from peripheral blood smear images is essential for early diagnosis and effective treatment planning. Method: The study uses transfer learning with pretrained convolutional neural networks (CNNs) to improve diagnostic performance and applies extensive data augmentation to address class imbalance in the dataset. Result: EfficientNet-B3 achieved the best results, with an F1-score of 94.30%, accuracy of 92.02%, and AUC of 94.79%, outperforming previously reported methods in the C-NMC Challenge. Conclusion: The study demonstrates the effectiveness of combining data augmentation with advanced transfer learning models, particularly EfficientNet-B3, for developing accurate and robust diagnostic tools for hematologic malignancy detection. Abstract: Accurate classification of Acute Lymphoblastic Leukemia (ALL) from peripheral blood smear images is essential for early diagnosis and effective treatment planning. This study investigates the use of transfer learning with pretrained convolutional neural networks (CNNs) to improve diagnostic performance. To address the class imbalance in the dataset of 3,631 Hematologic and 7,644 ALL images, we applied extensive data augmentation techniques to create a balanced training set of 10,000 images per class. We evaluated several models, including ResNet50, ResNet101, and EfficientNet variants B0, B1, and B3. EfficientNet-B3 achieved the best results, with an F1-score of 94.30%, accuracy of 92.02%, and AUC of 94.79%, outperforming previously reported methods in the C-NMC Challenge. These findings demonstrate the effectiveness of combining data augmentation with advanced transfer learning models, particularly EfficientNet-B3, in developing accurate and robust diagnostic tools for hematologic malignancy detection.

[318] LWT-ARTERY-LABEL: A Lightweight Framework for Automated Coronary Artery Identification

Shisheng Zhang,Ramtin Gharleghi,Sonit Singh,Daniel Moses,Dona Adikari,Arcot Sowmya,Susann Beier

Main category: eess.IV

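TL;DR: This paper presents a lightweight method that combines anatomical knowledge with rule-based topology constraints for automated coronary artery labelling, achieving state-of-the-art performance.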

Details

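Motivation: Coronary arterial analysis using computed tomography coronary angiography (CTCA), such as identifying artery-specific features from computational modelling, is labour-intensive and time-consuming. Traditional knowledge-based labelling methods fall short in leveraging data-driven insights, while recent deep-learning approaches often demand substantial computational resources and overlook critical clinical knowledge. Method: The authors propose a lightweight method that integrates anatomical knowledge with rule-based topology constraints for effective coronary artery labelling. Result: The proposed method achieves state-of-the-art performance on benchmark datasets. Conclusion: The paper proposes a lightweight coronary artery labelling method that integrates anatomical knowledge with rule-based topology constraints; it achieves state-of-the-art performance on benchmark datasets and offers a promising alternative for automated coronary artery labelling. Abstract: Coronary artery disease (CAD) remains the leading cause of death globally, with computed tomography coronary angiography (CTCA) serving as a key diagnostic tool. However, coronary arterial analysis using CTCA, such as identifying artery-specific features from computational modelling, is labour-intensive and time-consuming. Automated anatomical labelling of coronary arteries offers a potential solution, yet the inherent anatomical variability of coronary trees presents a significant challenge. Traditional knowledge-based labelling methods fall short in leveraging data-driven insights, while recent deep-learning approaches often demand substantial computational resources and overlook critical clinical knowledge. To address these limitations, we propose a lightweight method that integrates anatomical knowledge with rule-based topology constraints for effective coronary artery labelling. Our approach achieves state-of-the-art performance on benchmark datasets, providing a promising alternative for automated coronary artery labelling.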

[319] Fusion-Based Brain Tumor Classification Using Deep Learning and Explainable AI, and Rule-Based Reasoning

Melika Filvantorkaman,Mohsen Piri,Maral Filvan Torkaman,Ashkan Zabihi,Hamidreza Moradi

Main category: eess.IV

TL;DR: This study introduces an interpretable deep learning framework combining MobileNetV2 and DenseNet121 with XAI for accurate brain tumor classification from MRI scans.

Details

Motivation: Accurate and interpretable classification of brain tumors from MRI is critical for effective diagnosis and treatment planning, and current methods may lack either performance or transparency. Method: The study used an ensemble-based deep learning framework combining MobileNetV2 and DenseNet121 CNNs with a soft voting strategy. It integrated an XAI module using Grad-CAM++ and a Clinical Decision Rule Overlay (CDRO). The models were trained and evaluated using stratified 5-fold cross-validation on the Figshare dataset. Result: The ensemble classifier outperformed individual CNNs with an accuracy of 91.7%, precision of 91.9%, recall of 91.7%, and F1-score of 91.6%. Grad-CAM++ visualizations aligned with expert annotations (Dice up to 0.88, IoU up to 0.78), and clinical rule activation validated predictions. Radiologists rated the interpretability highly (Likert-scale usefulness mean = 4.4, heatmap-region correspondence mean = 4.0). Conclusion: The study successfully developed an interpretable deep learning framework for brain tumor classification, combining MobileNetV2 and DenseNet121 with an XAI module, showing high performance and clinical relevance. Abstract: Accurate and interpretable classification of brain tumors from magnetic resonance imaging (MRI) is critical for effective diagnosis and treatment planning. This study presents an ensemble-based deep learning framework that combines MobileNetV2 and DenseNet121 convolutional neural networks (CNNs) using a soft voting strategy to classify three common brain tumor types: glioma, meningioma, and pituitary adenoma. The models were trained and evaluated on the Figshare dataset using a stratified 5-fold cross-validation protocol. 
To enhance transparency and clinical trust, the framework integrates an Explainable AI (XAI) module employing Grad-CAM++ for class-specific saliency visualization, alongside a symbolic Clinical Decision Rule Overlay (CDRO) that maps predictions to established radiological heuristics. The ensemble classifier achieved superior performance compared to individual CNNs, with an accuracy of 91.7%, precision of 91.9%, recall of 91.7%, and F1-score of 91.6%. Grad-CAM++ visualizations revealed strong spatial alignment between model attention and expert-annotated tumor regions, supported by Dice coefficients up to 0.88 and IoU scores up to 0.78. Clinical rule activation further validated model predictions in cases with distinct morphological features. A human-centered interpretability assessment involving five board-certified radiologists yielded high Likert-scale scores for both explanation usefulness (mean = 4.4) and heatmap-region correspondence (mean = 4.0), reinforcing the framework's clinical relevance. Overall, the proposed approach offers a robust, interpretable, and generalizable solution for automated brain tumor classification, advancing the integration of deep learning into clinical neurodiagnostics.
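Illustrative note for [319]: the abstract describes a soft voting ensemble of two CNNs and reports Dice and IoU overlap between saliency maps and expert annotations. The sketch below shows the general idea in plain Python. The equal model weights, the probability values, the toy masks, and all function names are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

# Class names as listed in the abstract.
LABELS = ["glioma", "meningioma", "pituitary adenoma"]


def soft_vote(prob_sets: Sequence[Sequence[float]]) -> List[float]:
    """Average per-class probabilities across models (equal weights assumed)."""
    n_models = len(prob_sets)
    n_classes = len(prob_sets[0])
    return [sum(p[c] for p in prob_sets) / n_models for c in range(n_classes)]


def predict(prob_sets: Sequence[Sequence[float]], labels: Sequence[str]) -> str:
    """Return the label with the highest averaged probability."""
    avg = soft_vote(prob_sets)
    best = max(range(len(avg)), key=avg.__getitem__)
    return labels[best]


def dice_and_iou(pred: Sequence[int], true: Sequence[int]) -> Tuple[float, float]:
    """Dice coefficient and IoU for two binary masks given as flat 0/1 sequences."""
    inter = sum(1 for p, t in zip(pred, true) if p and t)
    pred_sum = sum(1 for p in pred if p)
    true_sum = sum(1 for t in true if t)
    union = pred_sum + true_sum - inter
    dice = 2 * inter / (pred_sum + true_sum) if (pred_sum + true_sum) else 1.0
    iou = inter / union if union else 1.0
    return dice, iou


# Hypothetical softmax outputs from the two backbones for one MRI slice.
mobilenet_probs = [0.50, 0.30, 0.20]
densenet_probs = [0.20, 0.70, 0.10]
print(predict([mobilenet_probs, densenet_probs], LABELS))

# Hypothetical thresholded heatmap vs. expert mask, flattened to 1D.
print(dice_and_iou([1, 1, 0, 0], [1, 0, 0, 0]))
```

In this toy case the two models disagree on their top class, and averaging the probabilities lets the more confident model decide, which is the usual motivation for soft voting over hard (majority) voting.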

[320] Spatio-Temporal Conditional Diffusion Models for Forecasting Future Multiple Sclerosis Lesion Masks Conditioned on Treatments

Gian Mario Favero,Ge Ya Luo,Nima Fathi,Justin Szeto,Douglas L. Arnold,Brennan Nichyporuk,Chris Pal,Tal Arbel

Main category: eess.IV

TL;DR: This paper introduces a treatment-aware spatio-temporal diffusion model that predicts future lesion evolution in Multiple Sclerosis (MS) using multi-modal patient data, including MRI and treatment information, with potential applications in real-world clinical tasks.

Details

Motivation: The motivation behind the study is to address the heterogeneous progression of diseases like Multiple Sclerosis (MS) through image-based personalized medicine, particularly focusing on the prediction of lesion evolution. Method: The research uses a voxel-space approach incorporating multi-modal patient data, including MRI and treatment information, to forecast NET2 lesion masks at a future time point using a spatio-temporal diffusion model. Result: Extensive experiments on a multi-centre dataset of 2131 patient 3D MRIs show that the generative model can accurately predict future NET2 lesion masks for patients with MS. The model also shows potential in downstream tasks like lesion count and location estimation, binary lesion activity classification, and generating counterfactual lesion masks for different treatments. Conclusion: The study concludes that the treatment-aware spatio-temporal diffusion model can accurately predict NET2 lesion masks for patients with relapsing-remitting MS across different treatments and has potential in real-world clinical applications. Abstract: Image-based personalized medicine has the potential to transform healthcare, particularly for diseases that exhibit heterogeneous progression such as Multiple Sclerosis (MS). In this work, we introduce the first treatment-aware spatio-temporal diffusion model that is able to generate future masks demonstrating lesion evolution in MS. Our voxel-space approach incorporates multi-modal patient data, including MRI and treatment information, to forecast new and enlarging T2 (NET2) lesion masks at a future time point. Extensive experiments on a multi-centre dataset of 2131 patient 3D MRIs from randomized clinical trials for relapsing-remitting MS demonstrate that our generative model is able to accurately predict NET2 lesion masks for patients across six different treatments. 
Moreover, we demonstrate our model has the potential for real-world clinical applications through downstream tasks such as future lesion count and location estimation, binary lesion activity classification, and generating counterfactual future NET2 masks for several treatments with different efficacies. This work highlights the potential of causal, image-based generative models as powerful tools for advancing data-driven prognostics in MS.

[321] Trustworthy Medical Imaging with Large Language Models: A Study of Hallucinations Across Modalities

Anindya Bijoy Das,Shahnewaz Karim Sakib,Shibbir Ahmed

Main category: eess.IV

TL;DR: This study examines hallucinations of LLMs in medical imaging, their implications for clinical reliability, and avenues for improvement.

Details

Motivation: Large Language Models (LLMs) are increasingly applied to medical imaging tasks, but they often produce hallucinations that can mislead clinical decisions. Method: The study analyzes LLM hallucinations in two directions, image to text and text to image, and evaluates outputs using expert-informed criteria. Result: The study reveals common hallucination patterns in both interpretive and generative tasks and assesses factors that affect clinical reliability. Conclusion: LLMs exhibit common hallucination patterns in medical imaging tasks; the study discusses ways to improve the safety and trustworthiness of LLM-driven medical imaging systems. Abstract: Large Language Models (LLMs) are increasingly applied to medical imaging tasks, including image interpretation and synthetic image generation. However, these models often produce hallucinations, which are confident but incorrect outputs that can mislead clinical decisions. This study examines hallucinations in two directions: image to text, where LLMs generate reports from X-ray, CT, or MRI scans, and text to image, where models create medical images from clinical prompts. We analyze errors such as factual inconsistencies and anatomical inaccuracies, evaluating outputs using expert-informed criteria across imaging modalities. Our findings reveal common patterns of hallucination in both interpretive and generative tasks, with implications for clinical reliability. We also discuss factors contributing to these failures, including model architecture and training data. By systematically studying both image understanding and generation, this work provides insights into improving the safety and trustworthiness of LLM-driven medical imaging systems.

[322] 3DGS-VBench: A Comprehensive Video Quality Evaluation Benchmark for 3DGS Compression

Yuke Xing,William Gordon,Qi Yang,Kaifa Yang,Jiarui Wang,Yiling Xu

Main category: eess.IV

TL;DR: This paper introduces 3DGS-VBench, a large-scale dataset for assessing the visual quality of compressed 3D Gaussian Splatting models, enabling better evaluation and development of compression techniques.

Details

Motivation: The motivation is to address the lack of systematic quality assessment for 3D Gaussian Splatting (3DGS) compression techniques, despite their storage challenges and widespread use in real-time novel view synthesis. Method: The authors created 3DGS-VBench, a large-scale VQA dataset with 660 compressed 3DGS models and video sequences from 11 scenes across 6 state-of-the-art compression algorithms. They collected annotations from 50 participants to obtain MOS scores and validated dataset reliability. Result: A large-scale VQA dataset was established with systematically designed compression parameters, benchmarking 6 compression algorithms on storage efficiency and visual quality, and evaluating 15 quality assessment metrics across multiple paradigms. Conclusion: The study concludes that the proposed 3DGS-VBench dataset serves as a reliable benchmark for evaluating 3D Gaussian Splatting compression algorithms and enables specialized Video Quality Assessment (VQA) model training, advancing compression and quality assessment research. Abstract: 3D Gaussian Splatting (3DGS) enables real-time novel view synthesis with high visual fidelity, but its substantial storage requirements hinder practical deployment, prompting state-of-the-art (SOTA) 3DGS methods to incorporate compression modules. However, these 3DGS generative compression techniques introduce unique distortions lacking systematic quality assessment research. To this end, we establish 3DGS-VBench, a large-scale Video Quality Assessment (VQA) Dataset and Benchmark with 660 compressed 3DGS models and video sequences generated from 11 scenes across 6 SOTA 3DGS compression algorithms with systematically designed parameter levels. With annotations from 50 participants, we obtained MOS scores with outlier removal and validated dataset reliability. We benchmark 6 3DGS compression algorithms on storage efficiency and visual quality, and evaluate 15 quality assessment metrics across multiple paradigms. 
Our work enables specialized VQA model training for 3DGS, serving as a catalyst for compression and quality assessment research. The dataset is available at https://github.com/YukeXing/3DGS-VBench.
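Illustrative note for [322]: the abstract states that MOS scores were obtained from 50 participants with outlier removal, without describing the procedure. The sketch below shows one simple possibility: dropping ratings that lie more than a chosen number of standard deviations from the mean for a given video. The threshold, the function name, and the ratings are assumptions. Subjective studies often screen whole participants instead, for example following ITU-R BT.500, and the paper may have done so.

```python
from statistics import mean, pstdev
from typing import Sequence


def mos_with_outlier_removal(ratings: Sequence[float], z_thresh: float = 2.0) -> float:
    """Mean opinion score after discarding ratings far from the mean.

    A rating is kept if it lies within z_thresh population standard
    deviations of the raw mean (a simplified, per-video screening rule).
    """
    mu = mean(ratings)
    sigma = pstdev(ratings)
    if sigma == 0:
        return mu  # all raters agree, nothing to remove
    kept = [r for r in ratings if abs(r - mu) <= z_thresh * sigma]
    return mean(kept)


# Hypothetical 5-point ratings for one compressed 3DGS video.
ratings = [4, 4, 5, 4, 4, 4, 5, 4, 4, 1]
print(round(mos_with_outlier_removal(ratings), 3))
```

Here the single rating of 1 falls outside two standard deviations of the raw mean (3.9) and is discarded, so the reported score is the mean of the remaining nine ratings.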

[323] SAGCNet: Spatial-Aware Graph Completion Network for Missing Slice Imputation in Population CMR Imaging

Junkai Liu,Nay Aung,Theodoros N. Arvanitis,Stefan K. Piechnik,Joao A C Lima,Steffen E. Petersen,Le Zhang

Main category: eess.IV

TL;DR: This paper introduces SAGCNet, a novel method for synthesizing missing MRI slices by effectively modeling inter-slice relationships and leveraging 3D spatial context, outperforming current state-of-the-art techniques.

Details

Motivation: Missing or unusable MRI slices hinder diagnostic accuracy; existing methods struggle with modeling local inter-slice correlations and exploring 3D spatial information. Method: SAGCNet incorporates a volumetric slice graph completion module and a volumetric spatial adapter to model inter-slice relationships and capture 3D spatial context. Result: SAGCNet achieves superior performance in both quantitative and qualitative evaluations on cardiac MRI datasets. Conclusion: The proposed SAGCNet effectively synthesizes absent CMR slices and outperforms existing methods even with limited data. Abstract: Magnetic resonance imaging (MRI) provides detailed soft-tissue characteristics that assist in disease diagnosis and screening. However, the accuracy of clinical practice is often hindered by missing or unusable slices due to various factors. Volumetric MRI synthesis methods have been developed to address this issue by imputing missing slices from available ones. The inherent 3D nature of volumetric MRI data, such as cardiac magnetic resonance (CMR), poses significant challenges for missing slice imputation approaches, including (1) the difficulty of modeling local inter-slice correlations and dependencies of volumetric slices, and (2) the limited exploration of crucial 3D spatial information and global context. In this study, to mitigate these issues, we present Spatial-Aware Graph Completion Network (SAGCNet) to overcome the dependency on complete volumetric data, featuring two main innovations: (1) a volumetric slice graph completion module that incorporates the inter-slice relationships into a graph structure, and (2) a volumetric spatial adapter component that enables our model to effectively capture and utilize various forms of 3D spatial context. 
Extensive experiments on cardiac MRI datasets demonstrate that SAGCNet is capable of synthesizing absent CMR slices, outperforming competitive state-of-the-art MRI synthesis methods both quantitatively and qualitatively. Notably, our model maintains superior performance even with limited slice data.

[324] Large-scale Multi-sequence Pretraining for Generalizable MRI Analysis in Versatile Clinical Applications

Zelin Qiu,Xi Wang,Zhuoyao Xie,Juan Zhou,Yu Wang,Lingjie Yang,Xinrui Jiang,Juyoung Bae,Moo Hyun Son,Qiang Ye,Dexuan Chen,Rui Zhang,Tao Li,Neeraj Ramesh Mahboobani,Varut Vardhanabhuti,Xiaohui Duan,Yinghua Zhao,Hao Chen

Main category: eess.IV

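TL;DR: PRISM is a foundation model pretrained on large-scale multi-sequence MRI. It addresses the challenges posed by heterogeneity across MRI sequences, improving the generalization of deep learning models under different acquisition parameters and their potential for clinical use.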

Details

Motivation: Multi-sequence magnetic resonance imaging (MRI) can clearly visualize different tissue types, but the inherent heterogeneity among MRI sequences poses a major challenge to the generalization of deep learning models. This undermines model performance under varying acquisition parameters and severely limits clinical use. Method: The authors propose PRISM, a foundation model pretrained on large-scale multi-sequence MRI. They collected 64 datasets and curated 336,476 scans from 34 of them into the largest multi-organ, multi-sequence MRI pretraining corpus to date. They propose a new pretraining paradigm that disentangles anatomically invariant features from sequence-specific variations in MRI while preserving high-level semantic representations. Result: PRISM consistently outperformed non-pretrained models and existing foundation models, ranking first on 39 of 44 downstream benchmarks. These results underscore its ability to learn robust, generalizable representations across different MRI protocols. Conclusion: PRISM provides a scalable framework for multi-sequence MRI analysis, enhancing the translational potential of AI in radiology. It performs consistently across diverse imaging protocols, reinforcing its clinical applicability. Abstract: Multi-sequence Magnetic Resonance Imaging (MRI) offers remarkable versatility, enabling the distinct visualization of different tissue types. Nevertheless, the inherent heterogeneity among MRI sequences poses significant challenges to the generalization capability of deep learning models. These challenges undermine model performance when faced with varying acquisition parameters, thereby severely restricting their clinical utility. In this study, we present PRISM, a foundation model PRe-trained with large-scale multI-Sequence MRI. We collected a total of 64 datasets from both public and private sources, encompassing a wide range of whole-body anatomical structures, with scans spanning diverse MRI sequences. Among them, 336,476 volumetric MRI scans from 34 datasets (8 public and 26 private) were curated to construct the largest multi-organ multi-sequence MRI pretraining corpus to date. We propose a novel pretraining paradigm that disentangles anatomically invariant features from sequence-specific variations in MRI, while preserving high-level semantic representations. We established a benchmark comprising 44 downstream tasks, including disease diagnosis, image segmentation, registration, progression prediction, and report generation. These tasks were evaluated on 32 public datasets and 5 private cohorts. PRISM consistently outperformed both non-pretrained models and existing foundation models, achieving first-rank results in 39 out of 44 downstream benchmarks with statistically significant improvements. 
These results underscore its ability to learn robust and generalizable representations across unseen data acquired under diverse MRI protocols. PRISM provides a scalable framework for multi-sequence MRI analysis, thereby enhancing the translational potential of AI in radiology. It delivers consistent performance across diverse imaging protocols, reinforcing its clinical applicability.

[325] HaDM-ST: Histology-Assisted Differential Modeling for Spatial Transcriptomics Generation

Xuepeng Liu,Zheng Jiang,Pinan Zhu,Hanyu Liu,Chao Li

Main category: eess.IV

TL;DR: HaDM-ST is a new framework for high-resolution spatial transcriptomics that effectively uses H&E images and addresses key challenges in the field, leading to improved spatial and gene-level accuracy.

Details

Motivation: Spatial transcriptomics (ST) has limited resolution due to current platforms, and while recent methods enhance resolution via H&E-stained histology, several challenges persist. Method: HaDM-ST includes a semantic distillation network, a spatial alignment module, and a channel-aware adversarial learner for high-resolution ST generation. Result: HaDM-ST consistently outperforms prior methods in experiments on 200 genes across diverse tissues and species, enhancing spatial fidelity and gene-level coherence in predictions. Conclusion: HaDM-ST is a high-resolution ST generation framework that effectively integrates H&E images and low-resolution ST, addressing three major challenges in the field. Abstract: Spatial transcriptomics (ST) reveals spatial heterogeneity of gene expression, yet its resolution is limited by current platforms. Recent methods enhance resolution via H&E-stained histology, but three major challenges persist: (1) isolating expression-relevant features from visually complex H&E images; (2) achieving spatially precise multimodal alignment in diffusion-based frameworks; and (3) modeling gene-specific variation across expression channels. We propose HaDM-ST (Histology-assisted Differential Modeling for ST Generation), a high-resolution ST generation framework conditioned on H&E images and low-resolution ST. HaDM-ST includes: (i) a semantic distillation network to extract predictive cues from H&E; (ii) a spatial alignment module enforcing pixel-wise correspondence with low-resolution ST; and (iii) a channel-aware adversarial learner for fine-grained gene-level modeling. Experiments on 200 genes across diverse tissues and species show HaDM-ST consistently outperforms prior methods, enhancing spatial fidelity and gene-level coherence in high-resolution ST predictions.

[326] DiffVC-OSD: One-Step Diffusion-based Perceptual Neural Video Compression Framework

Wenzhuo Ma,Zhenzhong Chen

Main category: eess.IV

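TL;DR: DiffVC-OSD is a one-step diffusion-based video compression framework that achieves more efficient video compression by better exploiting temporal dependencies and improving compression performance.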

Details

Motivation: Conventional multi-step diffusion methods are computationally inefficient for video compression, which motivates a more efficient approach that improves both perceptual quality and compression performance. Method: The authors propose DiffVC-OSD, a perceptual neural video compression framework based on a one-step diffusion model, and design a Temporal Context Adapter and an End-to-End Finetuning strategy to improve compression performance. Result: Experiments show that DiffVC-OSD reaches state-of-the-art perceptual compression performance while substantially speeding up decoding and lowering bitrate. Conclusion: DiffVC-OSD delivers state-of-the-art perceptual video compression performance, with about 20$\times$ faster decoding and an 86.92\% bitrate reduction relative to the corresponding multi-step diffusion-based variant. Abstract: In this work, we first propose DiffVC-OSD, a One-Step Diffusion-based Perceptual Neural Video Compression framework. Unlike conventional multi-step diffusion-based methods, DiffVC-OSD feeds the reconstructed latent representation directly into a One-Step Diffusion Model, enhancing perceptual quality through a single diffusion step guided by both temporal context and the latent itself. To better leverage temporal dependencies, we design a Temporal Context Adapter that encodes conditional inputs into multi-level features, offering more fine-grained guidance for the Denoising Unet. Additionally, we employ an End-to-End Finetuning strategy to improve overall compression performance. Extensive experiments demonstrate that DiffVC-OSD achieves state-of-the-art perceptual compression performance, offers about 20$\times$ faster decoding and an 86.92\% bitrate reduction compared to the corresponding multi-step diffusion-based variant.

[327] Anatomy-Aware Low-Dose CT Denoising via Pretrained Vision Models and Semantic-Guided Contrastive Learning

Runze Wang,Zeli Chen,Zhiyun Song,Wei Fang,Jiajin Zhang,Danyang Tu,Yuxing Tang,Minfeng Xu,Xianghua Ye,Le Lu,Dakai Jin

Main category: eess.IV

TL;DR: ALDEN improves low-dose CT denoising by incorporating anatomical semantics through adversarial and contrastive learning, leading to better diagnostic efficacy and reduced radiation exposure.

Details

Motivation: To address the problem of suboptimal denoising outcomes caused by the neglect of anatomical semantics in most deep learning-based denoising methods for low-dose computed tomography. Method: ALDEN integrates semantic features of pretrained vision models with adversarial and contrastive learning, using an anatomy-aware discriminator and a semantic-guided contrastive learning module. Result: ALDEN achieves state-of-the-art performance in LDCT denoising, offering superior anatomy preservation and substantially reducing the over-smoothing issue present in previous works. Conclusion: ALDEN, an anatomy-aware LDCT denoising method, achieves state-of-the-art performance by preserving anatomical structures and reducing over-smoothing issues, further validated by a multi-organ segmentation task. Abstract: To reduce radiation exposure and improve the diagnostic efficacy of low-dose computed tomography (LDCT), numerous deep learning-based denoising methods have been developed to mitigate noise and artifacts. However, most of these approaches ignore the anatomical semantics of human tissues, which may result in suboptimal denoising outcomes. To address this problem, we propose ALDEN, an anatomy-aware LDCT denoising method that integrates semantic features of pretrained vision models (PVMs) with adversarial and contrastive learning. Specifically, we introduce an anatomy-aware discriminator that dynamically fuses hierarchical semantic features from reference normal-dose CT (NDCT) via cross-attention mechanisms, enabling tissue-specific realism evaluation in the discriminator. In addition, we propose a semantic-guided contrastive learning module that enforces anatomical consistency by contrasting PVM-derived features from LDCT, denoised CT and NDCT, preserving tissue-specific patterns through positive pairs and suppressing artifacts via dual negative pairs. 
Extensive experiments conducted on two LDCT denoising datasets reveal that ALDEN achieves state-of-the-art performance, offering superior anatomy preservation and substantially reducing the over-smoothing issue of previous work. Further validation on a downstream multi-organ segmentation task (encompassing 117 anatomical structures) affirms the model's ability to maintain anatomical awareness.

[328] Towards Human-AI Collaboration System for the Detection of Invasive Ductal Carcinoma in Histopathology Images

Shuo Han,Ahmed Karam Eldaly,Solomon Sunday Oyelere

Main category: eess.IV

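TL;DR: This paper proposes a deep learning system in which AI and medical experts collaborate to detect invasive ductal carcinoma, improving diagnostic accuracy through an iterative feedback loop.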

Details

Motivation: Invasive ductal carcinoma (IDC) is the most common form of breast cancer, and early, accurate diagnosis is critical to improving patient survival. Combining medical expertise with artificial intelligence promises to improve the precision and efficiency of IDC detection. Method: The authors propose a human-in-the-loop (HITL) deep learning system built on an EfficientNetV2S model: the AI produces an initial diagnosis, medical experts review and correct it, and the corrected data is fed back into model training, forming a feedback loop. Result: The EfficientNetV2S model achieves state-of-the-art performance relative to existing methods in the literature, with an overall accuracy of 93.65\%. Incorporating expert corrections of misclassified images further improved model performance across four experimental groups. Conclusion: The paper concludes that human-AI collaboration can improve the accuracy of IDC detection and offers a promising direction for future AI-assisted medical diagnostics. Abstract: Invasive ductal carcinoma (IDC) is the most prevalent form of breast cancer, and early, accurate diagnosis is critical to improving patient survival rates by guiding treatment decisions. Combining medical expertise with artificial intelligence (AI) holds significant promise for enhancing the precision and efficiency of IDC detection. In this work, we propose a human-in-the-loop (HITL) deep learning system designed to detect IDC in histopathology images. The system begins with an initial diagnosis provided by a high-performance EfficientNetV2S model, offering feedback from AI to the human expert. Medical professionals then review the AI-generated results, correct any misclassified images, and integrate the revised labels into the training dataset, forming a feedback loop from the human back to the AI. This iterative process refines the model's performance over time. The EfficientNetV2S model itself achieves state-of-the-art performance compared to existing methods in the literature, with an overall accuracy of 93.65\%. Incorporating the human-in-the-loop system further improves the model's accuracy using four experimental groups with misclassified images. These results demonstrate the potential of this collaborative approach to enhance AI performance in diagnostic systems. This work contributes to advancing automated, efficient, and highly accurate methods for IDC detection through human-AI collaboration, offering a promising direction for future AI-assisted medical diagnostics.

Johanna P. Müller,Anika Knupfer,Pedro Blöss,Edoardo Berardi Vittur,Bernhard Kainz,Jana Hutter

Main category: eess.IV

TL;DR: The paper introduces a new diffusion-based framework for generating synthetic uterine MRI images that are anatomically precise and of high fidelity, aiming to address data scarcity and privacy issues in gynaecological imaging and enhance diagnostic model training.

Details

Motivation: The motivation is to overcome the limitations of existing diffusion models in producing anatomically precise female pelvic images, which is crucial for gynaecological imaging due to data scarcity and patient privacy concerns. Method: The method involves a novel diffusion-based framework for uterine MRI synthesis, integrating both unconditional and conditioned Denoising Diffusion Probabilistic Models (DDPMs) and Latent Diffusion Models (LDMs) in 2D and 3D. Result: The result is the generation of anatomically coherent, high fidelity synthetic images. These images showed substantial gains in diagnostic accuracy on a key classification task and were validated for clinical realism through a blinded expert evaluation. Conclusion: The paper concludes that their novel diffusion-based framework for uterine MRI synthesis effectively generates high-quality synthetic images that closely mimic real scans, providing valuable resources for training robust diagnostic models and advancing equitable AI in gynaecology. Abstract: Despite significant progress in generative modelling, existing diffusion models often struggle to produce anatomically precise female pelvic images, limiting their application in gynaecological imaging, where data scarcity and patient privacy concerns are critical. To overcome these barriers, we introduce a novel diffusion-based framework for uterine MRI synthesis, integrating both unconditional and conditioned Denoising Diffusion Probabilistic Models (DDPMs) and Latent Diffusion Models (LDMs) in 2D and 3D. Our approach generates anatomically coherent, high fidelity synthetic images that closely mimic real scans and provide valuable resources for training robust diagnostic models. We evaluate generative quality using advanced perceptual and distributional metrics, benchmarking against standard reconstruction methods, and demonstrate substantial gains in diagnostic accuracy on a key classification task. 
A blinded expert evaluation further validates the clinical realism of our synthetic images. We release our models with privacy safeguards and a comprehensive synthetic uterine MRI dataset to support reproducible research and advance equitable AI in gynaecology.

[330] A Physics-Driven Neural Network with Parameter Embedding for Generating Quantitative MR Maps from Weighted Images

Lingjing Chen,Chengxiu Zhang,Yinqiao Yi,Yida Wang,Yang Song,Xu Yan,Shengfang Xu,Dalin Zhu,Mengqiu Cao,Yan Zhou,Chenglong Wang,Guang Yang

Main category: eess.IV

TL;DR: A physics-driven neural network was developed to integrate MRI sequence parameters into the deep learning model via parameter embedding to improve the accuracy and generalizability of quantitative MRI image synthesis. The model showed superior performance and generalization capability, indicating great potential for accelerating qMRI and enhancing its clinical utility.

Details

Motivation: To improve the accuracy and generalizability of quantitative image synthesis from clinical weighted MRI by integrating MRI sequence parameters into a deep learning-based approach. Method: A physics-driven neural network that embeds MRI sequence parameters (TR, TE, and TI) directly into the model via parameter embedding. The model takes conventional T1-weighted, T2-weighted, and T2-FLAIR images as input and synthesizes T1, T2, and proton density (PD) quantitative maps. Result: The proposed method achieved high performance with PSNR values exceeding 34 dB and SSIM values above 0.92 for all synthesized parameter maps. It outperformed conventional deep learning models in accuracy and robustness, including data with previously unseen brain structures and lesions. Conclusion: Incorporating MRI sequence parameters via parameter embedding allows the neural network to better learn the physical characteristics of MR signals, significantly enhancing the performance and reliability of quantitative MRI synthesis. This method shows great potential for accelerating qMRI and improving its clinical utility. Abstract: We propose a deep learning-based approach that integrates MRI sequence parameters to improve the accuracy and generalizability of quantitative image synthesis from clinical weighted MRI. Our physics-driven neural network embeds MRI sequence parameters -- repetition time (TR), echo time (TE), and inversion time (TI) -- directly into the model via parameter embedding, enabling the network to learn the underlying physical principles of MRI signal formation. The model takes conventional T1-weighted, T2-weighted, and T2-FLAIR images as input and synthesizes T1, T2, and proton density (PD) quantitative maps. Trained on healthy brain MR images, it was evaluated on both internal and external test datasets. The proposed method achieved high performance with PSNR values exceeding 34 dB and SSIM values above 0.92 for all synthesized parameter maps. 
It outperformed conventional deep learning models in accuracy and robustness, including data with previously unseen brain structures and lesions. Notably, our model accurately synthesized quantitative maps for these unseen pathological regions, highlighting its superior generalization capability. Incorporating MRI sequence parameters via parameter embedding allows the neural network to better learn the physical characteristics of MR signals, significantly enhancing the performance and reliability of quantitative MRI synthesis. This method shows great potential for accelerating qMRI and improving its clinical utility.
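
For readers unfamiliar with the PSNR figure quoted above, the following is a minimal sketch of how peak signal-to-noise ratio is commonly computed between a reference map and a synthesized map. It uses the standard definition and is not the authors' evaluation code; the function name `psnr` and the `data_range` argument are illustrative assumptions, and the abstract does not state which intensity range or masking was used.

```python
import numpy as np


def psnr(reference, estimate, data_range):
    """Peak signal-to-noise ratio in dB (standard definition).

    `data_range` is the assumed maximum possible intensity span of the
    reference map (hypothetical parameter; the paper's choice is not given).
    """
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    # Mean squared error over all voxels.
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        # Identical inputs: PSNR is unbounded.
        return float("inf")
    return 10.0 * np.log10(data_range ** 2 / mse)


# Toy usage: a constant error of 0.1 on a unit-range map.
ref = np.zeros((4, 4))
est = np.full((4, 4), 0.1)
value = psnr(ref, est, data_range=1.0)
```

Under this definition, a higher value means a smaller mean squared error relative to the intensity range, so a threshold such as 34 dB only has meaning once the range convention is fixed.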

[331] RedDino: A foundation model for red blood cell analysis

Luca Zedda,Andrea Loddo,Cecilia Di Ruberto,Carsten Marr

Main category: eess.IV

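TL;DR: RedDino is a self-supervised learning model designed for red blood cell image analysis. It performs strongly on red blood cell shape classification and helps advance diagnostic tools for blood disorders.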

Details

Motivation: Despite the promise of foundation models in medical diagnostics, comprehensive AI solutions for red blood cell (RBC) analysis remain scarce. Method: RedDino uses an RBC-specific adaptation of the DINOv2 self-supervised learning framework and is trained on a dataset of 1.25 million RBC images. Result: RedDino outperforms existing state-of-the-art models on RBC shape classification, and evaluations including linear probing and nearest neighbor classification confirm its strong feature representations and generalization ability. Conclusion: RedDino is a pretrained model for RBC analysis that addresses key challenges in computational hematology by capturing nuanced morphological features, advancing the development of reliable diagnostic tools. Abstract: Red blood cells (RBCs) are essential to human health, and their precise morphological analysis is important for diagnosing hematological disorders. Despite the promise of foundation models in medical diagnostics, comprehensive AI solutions for RBC analysis remain scarce. We present RedDino, a self-supervised foundation model designed for RBC image analysis. RedDino uses an RBC-specific adaptation of the DINOv2 self-supervised learning framework and is trained on a curated dataset of 1.25 million RBC images from diverse acquisition modalities and sources. Extensive evaluations show that RedDino outperforms existing state-of-the-art models on RBC shape classification. Through assessments including linear probing and nearest neighbor classification, we confirm its strong feature representations and generalization ability. Our main contributions are: (1) a foundation model tailored for RBC analysis, (2) ablation studies exploring DINOv2 configurations for RBC modeling, and (3) a detailed evaluation of generalization performance. RedDino addresses key challenges in computational hematology by capturing nuanced morphological features, advancing the development of reliable diagnostic tools. The source code and pretrained models for RedDino are available at https://github.com/Snarci/RedDino, and the pretrained models can be downloaded from our Hugging Face collection at https://huggingface.co/collections/Snarcy/reddino-689a13e29241d2e5690202fc
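
The nearest neighbor evaluation mentioned in the abstract can be illustrated with a small sketch: features from a frozen encoder are compared by cosine similarity, and each query takes the label of its closest training example. This is a generic 1-nearest-neighbor probe, not RedDino's evaluation code; the function name `nearest_neighbor_predict` and the toy feature vectors are assumptions for illustration, and the paper's distance metric and neighbor count are not stated in the abstract.

```python
import numpy as np


def nearest_neighbor_predict(train_feats, train_labels, query_feats):
    """Label each query with the label of its most similar training feature.

    Features are assumed to come from a frozen encoder; here they are plain
    arrays of shape (n_samples, dim).
    """
    train = np.asarray(train_feats, dtype=np.float64)
    query = np.asarray(query_feats, dtype=np.float64)
    # L2-normalize rows so the dot product equals cosine similarity.
    train = train / np.linalg.norm(train, axis=1, keepdims=True)
    query = query / np.linalg.norm(query, axis=1, keepdims=True)
    sims = query @ train.T  # shape: (n_query, n_train)
    # Index of the most similar training sample for every query.
    nearest = np.argmax(sims, axis=1)
    return np.asarray(train_labels)[nearest]


# Toy usage with two hypothetical classes in a 2-D feature space.
train_feats = [[1.0, 0.0], [0.0, 1.0]]
train_labels = [0, 1]
query_feats = [[0.9, 0.1], [0.2, 0.8]]
predictions = nearest_neighbor_predict(train_feats, train_labels, query_feats)
```

Because no classifier weights are trained, accuracy under this kind of probe depends only on how well the frozen features separate the classes, which is why it is often used to assess representation quality.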

Table of Contents

cs.CL [Back]

[1] Semi-automated Fact-checking in Portuguese: Corpora Enrichment using Retrieval with Claim extraction

[2] Retrieval augmented generation based dynamic prompting for few-shot biomedical named entity recognition using large language models

[3] CarbonScaling: Extending Neural Scaling Laws for Carbon Footprint in Large Language Models

[4] The Art of Breaking Words: Rethinking Multilingual Tokenizer Design

[5] Factor Augmented Supervised Learning with Text Embeddings

[6] Discerning minds or generic tutors? Evaluating instructional guidance capabilities in Socratic LLMs

[7] LLM Unlearning Without an Expert Curated Dataset

[8] BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent

[9] Train It and Forget It: Merge Lists are Unnecessary for BPE Inference in Language Models

[10] Measuring Stereotype and Deviation Biases in Large Language Models

[11] Testing the Limits of Machine Translation from One Book

[12] Do Biased Models Have Biased Thoughts?

[13] Play Favorites: A Statistical Method to Measure Self-Bias in LLM-as-a-Judge

[14] Large Language Models for Oral History Understanding with Text Classification and Sentiment Analysis

[15] Many-Turn Jailbreaking

[16] SEVADE: Self-Evolving Multi-Agent Analysis with Decoupled Evaluation for Hallucination-Resistant Irony Detection

[17] Annotating Errors in English Learners' Written Language Production: Advancing Automated Written Feedback Systems

[18] Text to Speech System for Meitei Mayek Script

[19] ESNERA: Empirical and semantic named entity alignment for named entity dataset merging

[20] The ReQAP System for Question Answering over Personal Information

[21] Score Before You Speak: Improving Persona Consistency in Dialogue Generation using Response Quality Scores

[22] Model-Agnostic Sentiment Distribution Stability Analysis for Robust LLM-Generated Texts Detection

[23] Two-Stage Quranic QA via Ensemble Retrieval and Instruction-Tuned Answer Extraction

[24] Rethinking 1-bit Optimization Leveraging Pre-trained Large Language Models

[25] Vec2Summ: Text Summarization via Probabilistic Sentence Embeddings

[26] SEADialogues: A Multilingual Culturally Grounded Multi-turn Dialogue Dataset on Southeast Asian Languages

[27] BharatBBQ: A Multilingual Bias Benchmark for Question Answering in the Indian Context

[28] Less Is More: Training-Free Sparse Attention with Global Locality for Efficient Reasoning

[29] Investigating Intersectional Bias in Large Language Models using Confidence Disparities in Coreference Resolution

[30] Fairness of Automatic Speech Recognition: Looking Through a Philosophical Lens

[31] Gradient Surgery for Safe LLM Fine-Tuning

[32] Omni-SafetyBench: A Benchmark for Safety Evaluation of Audio-Visual Large Language Models

[33] Improved Personalized Headline Generation via Denoising Fake Interests from Implicit Feedback

[34] Schema Lineage Extraction at Scale: Multilingual Pipelines, Composite Evaluation, and Language-Model Benchmarks

[35] DySK-Attn: A Framework for Efficient, Real-Time Knowledge Updating in Large Language Models via Dynamic Sparse Knowledge Attention

[36] Adapting LLMs to Time Series Forecasting via Temporal Heterogeneity Modeling and Semantic Alignment

[37] Enhancing Rumor Detection Methods with Propagation Structure Infused Language Model

[38] How Does a Deep Neural Network Look at Lexical Stress?

[39] Prompt Tuning for Few-Shot Continual Learning Named Entity Recognition

[40] The 2D+ Dynamic Articulatory Model DYNARTmo: Tongue-Palate Contact Area Estimation

[41] Incorporating Contextual Paralinguistic Understanding in Large Speech-Language Models

[42] MAQuA: Adaptive Question-Asking for Multidimensional Mental Health Screening using Item Response Theory

[43] "Pull or Not to Pull?": Investigating Moral Biases in Leading Large Language Models Across Ethical Dilemmas

[44] Arce: Augmented Roberta with Contextualized Elucidations for Ner in Automated Rule Checking

[45] CCFQA: A Benchmark for Cross-Lingual and Cross-Modal Speech and Text Factuality Evaluation

[46] HealthBranches: Synthesizing Clinically-Grounded Question Answering Datasets via Decision Pathways

[47] ObfusQAte: A Proposed Framework to Evaluate LLM Robustness on Obfuscated Factual Question Answering

[48] Strategies of Code-switching in Human-Machine Dialogs

[49] Think Before You Talk: Enhancing Meaningful Dialogue Generation in Full-Duplex Speech Language Models with Planning-Inspired Text Guidance

[50] Grounding Multilingual Multimodal LLMs With Cultural Knowledge

[51] Let's Revise Step-by-Step: A Unified Local Search Framework for Code Generation with LLMs

[52] Positional Biases Shift as Inputs Approach Context Window Limits

[53] ALOPE: Adaptive Layer Optimization for Translation Quality Estimation using Large Language Models

[54] Augmenting Bias Detection in LLMs Using Topological Data Analysis

[55] Word Clouds as Common Voices: LLM-Assisted Visualization of Participant-Weighted Themes in Qualitative Interviews

[56] From Trial-and-Error to Improvement: A Systematic Analysis of LLM Exploration Mechanisms in RLVR

[57] IBPS: Indian Bail Prediction System

[58] Keyword-Centric Prompting for One-Shot Event Detection with Self-Generated Rationale Enhancements

[59] InterChart: Benchmarking Visual Reasoning Across Decomposed and Distributed Chart Information

[60] LoSemB: Logic-Guided Semantic Bridging for Inductive Tool Retrieval

[61] What am I missing here?: Evaluating Large Language Models for Masked Sentence Prediction

[62] Exploring Causal Effect of Social Bias on Faithfulness Hallucinations in Large Language Models

[63] SASST: Leveraging Syntax-Aware Chunking and LLMs for Simultaneous Speech Translation

[64] Grove MoE: Towards Efficient and Superior MoE LLMs with Adjugate Experts

[65] Can You Trick the Grader? Adversarial Persuasion of LLM Judges

[66] Evaluating Compositional Approaches for Focus and Sentiment Analysis

[67] Evaluating Large Language Models as Expert Annotators

[68] LLMs for Law: Evaluating Legal-Specific LLMs on Contract Understanding

[69] Large Language Models for Czech Aspect-Based Sentiment Analysis

[70] Few-shot Cross-lingual Aspect-Based Sentiment Analysis with Sequence-to-Sequence Models

[71] Tailored Emotional LLM-Supporter: Enhancing Cultural Sensitivity

[72] Challenges and opportunities in portraying emotion in generated sign language

[73] Expert Preference-based Evaluation of Automated Related Work Generation

[74] Large Language Models for Subjective Language Understanding: A Survey

[75] Toward Machine Interpreting: Lessons from Human Interpreting Studies

[76] Understanding Syntactic Generalization in Structure-inducing Language Models

[77] Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL

[78] The Medical Metaphors Corpus (MCC)