cs.CL

[1] Gender-Neutral Rewriting in Italian: Models, Approaches, and Trade-offs

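Andrea Piergentili, Beatrice Savoldi, Matteo Negri, Luisa Bentivogli

Main category: cs.CL

TL;DR: This paper presents the first systematic evaluation of state-of-the-art large language models on gender-neutral rewriting (GNR) in Italian. It introduces a two-dimensional framework that measures neutrality and semantic fidelity, finds that open-weight LLMs outperform the only existing dedicated model, shows that fine-tuned smaller models match or exceed the best open-weight LLM, and discusses the trade-off between neutrality and meaning preservation in the training data.

Details

Motivation: Gender-neutral rewriting is challenging in grammatical-gender languages such as Italian, and existing approaches fall short, so LLM performance on this task needs to be systematically evaluated and improved.

Method: Propose a two-dimensional evaluation framework, compare few-shot prompting across multiple LLMs, fine-tune selected models, and apply targeted cleaning to boost task relevance.

Result: Open-weight LLMs outperform the existing dedicated GNR model; fine-tuned smaller models match or exceed the best open-weight LLM at a fraction of its size; the study exposes a trade-off between neutrality and meaning preservation when optimizing the training data.

Conclusion: With suitable fine-tuning and data cleaning, small models can match or even surpass large models on Italian GNR, offering a practical path to efficient, high-fidelity gender-neutral rewriting.

Abstract: Gender-neutral rewriting (GNR) aims to reformulate text to eliminate unnecessary gender specifications while preserving meaning, a particularly challenging task in grammatical-gender languages like Italian. In this work, we conduct the first systematic evaluation of state-of-the-art large language models (LLMs) for Italian GNR, introducing a two-dimensional framework that measures both neutrality and semantic fidelity to the input. We compare few-shot prompting across multiple LLMs, fine-tune selected models, and apply targeted cleaning to boost task relevance. Our findings show that open-weight LLMs outperform the only existing model dedicated to GNR in Italian, whereas our fine-tuned models match or exceed the best open-weight LLM's performance at a fraction of its size. Finally, we discuss the trade-off between optimizing the training data for neutrality and meaning preservation.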

[2] Op-Fed: Opinion, Stance, and Monetary Policy Annotations on FOMC Transcripts Using Active Learning

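Alisa Kanganis, Katherine A. Keith

Main category: cs.CL

TL;DR: This paper introduces Op-Fed, a dataset of 1044 human-annotated sentences and their contexts from FOMC transcripts, built to identify stance towards monetary policy. To cope with class imbalance (fewer than 8% of sentences express a non-neutral stance) and inter-sentence dependence (65% of instances need context beyond the sentence), the authors design a five-stage hierarchical annotation schema and use active learning to increase the number of positive instances. A top-performing closed-weight LLM reaches only 0.61 zero-shot accuracy on stance classification, well below the human baseline of 0.89, so the task remains challenging.

Details

Motivation: Identifying opinions and stances towards monetary policy in FOMC transcripts is essential for understanding how policy is made, but existing methods struggle with class imbalance and context dependence, so a high-quality annotated dataset is needed.

Method: Propose a five-stage hierarchical annotation schema that separates opinion, monetary policy, and stance, and use active learning to select high-value instances for annotation, raising the share of positive instances and the coverage of context.

Result: The resulting Op-Fed dataset contains 1044 annotated sentences. A closed-weight LLM reaches 0.61 zero-shot accuracy on stance classification, far below the 0.89 accuracy of human annotators.

Conclusion: Op-Fed is a valuable resource for research on monetary policy stance detection. It highlights how current models still struggle with stance expressed in complex contexts, and it can support future model training, confidence calibration, and further annotation efforts.

Abstract: The U.S. Federal Open Market Committee (FOMC) regularly discusses and sets monetary policy, affecting the borrowing and spending decisions of millions of people. In this work, we release Op-Fed, a dataset of 1044 human-annotated sentences and their contexts from FOMC transcripts. We faced two major technical challenges in dataset creation: imbalanced classes -- we estimate fewer than 8% of sentences express a non-neutral stance towards monetary policy -- and inter-sentence dependence -- 65% of instances require context beyond the sentence-level. To address these challenges, we first developed a five-stage hierarchical schema to isolate aspects of opinion, monetary policy, and stance towards monetary policy as well as the level of context needed. Second, we selected instances to annotate using active learning, roughly doubling the number of positive instances across all schema aspects. Using Op-Fed, we found a top-performing, closed-weight LLM achieves 0.80 zero-shot accuracy in opinion classification but only 0.61 zero-shot accuracy classifying stance towards monetary policy -- below our human baseline of 0.89. We expect Op-Fed to be useful for future model training, confidence calibration, and as a seed dataset for future annotation efforts.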

[3] Overview of Dialog System Evaluation Track: Dimensionality, Language, Culture and Safety at DSTC 12

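John Mendonça, Lining Zhang, Rahul Mallidi, Alon Lavie, Isabel Trancoso, Luis Fernando D'Haro, João Sedoc

Main category: cs.CL

TL;DR: DSTC12 Track 1 addresses the multi-dimensional, multilingual, multicultural, and safety aspects of dialogue system evaluation through two subtasks: dialogue-level multi-dimensional automatic evaluation metrics, and multilingual and multicultural safety detection. The results show that current methods still fall well short on culturally-aware safety.

Details

Motivation: Traditional dialogue evaluation metrics are insufficient, and safety is often narrowly defined or culturally biased, so a more comprehensive, cross-cultural, and multilingual evaluation framework is needed.

Method: Run two subtasks: (1) automatic evaluation metrics over 10 dialogue dimensions; (2) multilingual and multicultural safety detection. Datasets, baseline models, and evaluation results are provided for both.

Result: In Task 1, the Llama-3-8B baseline achieved the highest average Spearman correlation (0.1681), leaving substantial room for improvement. In Task 2, participating teams outperformed the baseline on the multilingual safety subset (top ROC-AUC 0.9648), but the baseline did better on the cultural subset (0.5126 ROC-AUC), showing that culturally-aware safety remains challenging.

Conclusion: Dialogue system evaluation is still immature with respect to cultural sensitivity and multi-dimensional modeling, and further research is needed to reach fairer and more robust evaluation.

Abstract: The rapid advancement of Large Language Models (LLMs) has intensified the need for robust dialogue system evaluation, yet comprehensive assessment remains challenging. Traditional metrics often prove insufficient, and safety considerations are frequently narrowly defined or culturally biased. The DSTC12 Track 1, "Dialog System Evaluation: Dimensionality, Language, Culture and Safety," is part of the ongoing effort to address these critical gaps. The track comprised two subtasks: (1) Dialogue-level, Multi-dimensional Automatic Evaluation Metrics, and (2) Multilingual and Multicultural Safety Detection. For Task 1, focused on 10 dialogue dimensions, a Llama-3-8B baseline achieved the highest average Spearman's correlation (0.1681), indicating substantial room for improvement. In Task 2, while participating teams significantly outperformed a Llama-Guard-3-1B baseline on the multilingual safety subset (top ROC-AUC 0.9648), the baseline proved superior on the cultural subset (0.5126 ROC-AUC), highlighting critical needs in culturally-aware safety. This paper describes the datasets and baselines provided to participants, as well as submission evaluation results for each of the two proposed subtasks.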
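For readers unfamiliar with the Task 1 metric, the following is a minimal sketch of Spearman's rank correlation between metric scores and human ratings. It is a generic textbook formulation (Pearson correlation of average ranks), not the track's official scoring script; the function names and the toy scores in the usage note are illustrative assumptions.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(metric_scores, human_scores):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(metric_scores), average_ranks(human_scores))
```

For example, `spearman([0.1, 0.4, 0.35, 0.8], [1, 3, 2, 5])` compares a hypothetical metric against hypothetical human ratings for four dialogues; a value near 0.17, as reported for the baseline, would indicate only a weak monotonic relationship.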

[4] Latent Traits and Cross-Task Transfer: Deconstructing Dataset Interactions in LLM Fine-tuning

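Shambhavi Krishna, Atharva Naik, Chaitali Agarwal, Sudharshan Govindan, Taesung Lee, Haw-Shiuan Chang

Main category: cs.CL

TL;DR: Proposes an analysis framework that builds a transfer learning matrix and applies dimensionality reduction to study cross-task interactions in large language models, revealing that hidden statistical factors influence transfer more than surface-level dataset characteristics.

Details

Motivation: LLMs often face tasks they did not see during training, and obtaining high-quality training data for every task is infeasible. Practitioners therefore have to rely on transfer learning from datasets with different characteristics and anticipate out-of-distribution requests.

Method: Build a transfer learning matrix, apply dimensionality reduction, and train and analyze 10 models to identify latent abilities (e.g., reasoning, sentiment classification, NLU, arithmetic) and the interactions across tasks.

Result: Performance improvements often cannot be explained by surface-level dataset similarity or source data quality. Hidden statistical factors of the source dataset (such as class distribution and generation length proclivities) and specific linguistic features are more influential.

Conclusion: The work sheds light on the complex dynamics of transfer learning and points towards more predictable and effective LLM adaptation.

Abstract: Large language models are increasingly deployed across diverse applications. This often includes tasks LLMs have not encountered during training. This implies that enumerating and obtaining the high-quality training data for all tasks is infeasible. Thus, we often need to rely on transfer learning using datasets with different characteristics, and anticipate out-of-distribution requests. Motivated by this practical need, we propose an analysis framework, building a transfer learning matrix and dimensionality reduction, to dissect these cross-task interactions. We train and analyze 10 models to identify latent abilities (e.g., Reasoning, Sentiment Classification, NLU, Arithmetic) and discover the side effects of the transfer learning. Our findings reveal that performance improvements often defy explanations based on surface-level dataset similarity or source data quality. Instead, hidden statistical factors of the source dataset, such as class distribution and generation length proclivities, alongside specific linguistic features, are actually more influential. This work offers insights into the complex dynamics of transfer learning, paving the way for more predictable and effective LLM adaptation.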

[5] Sparse Neurons Carry Strong Signals of Question Ambiguity in LLMs

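Zhuoxuan Zhang, Jinhao Duan, Edward Kim, Kaidi Xu

Main category: cs.CL

TL;DR: The study finds that large language models (LLMs) linearly encode question ambiguity in their internal representations, that a small number of neurons (as few as one) suffice to detect and control ambiguity, and that manipulating these Ambiguity-Encoding Neurons (AENs) can switch the model from direct answering to abstention.

Details

Motivation: Real-world questions are often ambiguous, yet LLMs tend to answer confidently rather than seek clarification. The study asks whether models recognize ambiguity internally and whether their behavior can be controlled through intervention.

Method: During the model's pre-filling stage, identify the few neurons that encode ambiguity information (AENs), train probes on them for ambiguity detection, analyze how the probes generalize across datasets and how AENs are distributed across layers, and intervene on AENs to control output behavior.

Result: AEN probes perform strongly on ambiguity detection and outperform prompting-based and representation-based baselines. AENs emerge from shallow layers, indicating that ambiguity signals are encoded early. Manipulating AENs effectively steers the model towards abstention.

Conclusion: LLMs form compact, interpretable internal representations of ambiguity. This enables fine-grained control of model behavior and offers a new route to more reliable and interactive models.

Abstract: Ambiguity is pervasive in real-world questions, yet large language models (LLMs) often respond with confident answers rather than seeking clarification. In this work, we show that question ambiguity is linearly encoded in the internal representations of LLMs and can be both detected and controlled at the neuron level. During the model's pre-filling stage, we identify that a small number of neurons, as few as one, encode question ambiguity information. Probes trained on these Ambiguity-Encoding Neurons (AENs) achieve strong performance on ambiguity detection and generalize across datasets, outperforming prompting-based and representation-based baselines. Layerwise analysis reveals that AENs emerge from shallow layers, suggesting early encoding of ambiguity signals in the model's processing pipeline. Finally, we show that through manipulating AENs, we can control an LLM's behavior from direct answering to abstention. Our findings reveal that LLMs form compact internal representations of question ambiguity, enabling interpretable and controllable behavior.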

[6] CL$^2$GEC: A Multi-Discipline Benchmark for Continual Learning in Chinese Literature Grammatical Error Correction

Shang Qin, Jingheng Ye, Yinghui Li, Hai-Tao Zheng, Qi Li, Jinxiao Shan, Zhixing Li, Hong-Gee Kim

Main category: cs.CL

TL;DR: This paper presents CL$^2$GEC, the first continual learning benchmark for Chinese literature grammatical error correction. It contains 10,000 human-annotated sentences across 10 disciplines and is used to evaluate continual learning in cross-disciplinary grammatical error correction.

Details

Motivation: Existing Chinese grammatical error correction research lacks dedicated benchmarks for multi-disciplinary settings and does not account for linguistic variation across domains or catastrophic forgetting, which makes it hard to meet the changing needs of real academic writing.

Method: Build CL$^2$GEC, a multi-domain Chinese GEC dataset of 10,000 human-annotated sentences covering 10 disciplines, design an evaluation framework that simulates sequential learning, and run experiments with sequential tuning, parameter-efficient adaptation, and four representative continual learning algorithms.

Result: Regularization-based methods mitigate forgetting more effectively than replay-based methods and naive sequential training, as measured by both standard GEC metrics and continual learning metrics.

Conclusion: CL$^2$GEC is the first continual learning benchmark for cross-disciplinary Chinese GEC and supports the development of adaptive correction systems for multi-domain academic writing.

Abstract: The growing demand for automated writing assistance in diverse academic domains highlights the need for robust Chinese Grammatical Error Correction (CGEC) systems that can adapt across disciplines. However, existing CGEC research largely lacks dedicated benchmarks for multi-disciplinary academic writing, overlooking continual learning (CL) as a promising solution to handle domain-specific linguistic variation and prevent catastrophic forgetting. To fill this crucial gap, we introduce CL$^2$GEC, the first Continual Learning benchmark for Chinese Literature Grammatical Error Correction, designed to evaluate adaptive CGEC across multiple academic fields. Our benchmark includes 10,000 human-annotated sentences spanning 10 disciplines, each exhibiting distinct linguistic styles and error patterns. CL$^2$GEC focuses on evaluating grammatical error correction in a continual learning setting, simulating sequential exposure to diverse academic disciplines to reflect real-world editorial dynamics. We evaluate large language models under sequential tuning, parameter-efficient adaptation, and four representative CL algorithms, using both standard GEC metrics and continual learning metrics adapted to task-level variation. Experimental results reveal that regularization-based methods mitigate forgetting more effectively than replay-based or naive sequential approaches. Our benchmark provides a rigorous foundation for future research in adaptive grammatical error correction across diverse academic domains.

[7] AgentCTG: Harnessing Multi-Agent Collaboration for Fine-Grained Precise Control in Text Generation

Xinxu Zhou, Jiaqi Bai, Zhenqi Sun, Fanxiang Zeng, Yue Liu

Main category: cs.CL

TL;DR: This paper proposes AgentCTG, a new scalable framework that strengthens precise and complex control over text generation by simulating the control mechanisms of multi-agent workflows. It reaches state-of-the-art results on multiple public datasets, and a newly proposed Character-Driven Rewriting task is used to validate it in practical applications.

Details

Motivation: Controlled text generation in NLP still faces challenges in fine-grained conditional control, cost, scalability, domain knowledge learning, and precise control.

Method: Propose the AgentCTG framework, which simulates the control and regulation mechanisms of multi-agent workflows, explores collaboration methods among different agents, and adds an auto-prompt module to improve generation quality.

Result: State-of-the-art results on multiple public datasets; strong performance on the newly proposed Character-Driven Rewriting task; and, in an online navigation application with role-playing, a markedly better driving experience through improved content delivery.

Conclusion: AgentCTG achieves complex and precise control over text generation, scales well, has practical potential, and improves immersion, personalization, and user engagement in online interaction.

Abstract: Although significant progress has been made in many tasks within the field of Natural Language Processing (NLP), Controlled Text Generation (CTG) continues to face numerous challenges, particularly in achieving fine-grained conditional control over generation. Additionally, in real-world scenarios and online applications, cost considerations, scalability, domain knowledge learning and more precise control are required, posing further challenges for CTG. This paper introduces a novel and scalable framework, AgentCTG, which aims to enhance precise and complex control over the text generation by simulating the control and regulation mechanisms in multi-agent workflows. We explore various collaboration methods among different agents and introduce an auto-prompt module to further enhance the generation effectiveness. AgentCTG achieves state-of-the-art results on multiple public datasets. To validate its effectiveness in practical applications, we propose a new challenging Character-Driven Rewriting task, which aims to convert the original text into new text that conforms to specific character profiles and simultaneously preserves the domain knowledge. When applied to online navigation with role-playing, our approach significantly enhances the driving experience through improved content delivery. By optimizing the generation of contextually relevant text, we enable a more immersive interaction within online communities, fostering greater personalization and user engagement.

[8] Improving Context Fidelity via Native Retrieval-Augmented Reasoning

Suyuchen Wang, Jinlin Wang, Xinyu Wang, Shiqi Li, Xiangru Tang, Sirui Hong, Xiao-Wen Chang, Chenglin Wu, Bang Liu

Main category: cs.CL

TL;DR: Proposes CARE, a novel retrieval-augmented reasoning framework that uses the model's own retrieval capabilities to explicitly integrate in-context evidence into the reasoning process, significantly improving both retrieval accuracy and answer generation in question answering.

Details

Motivation: LLMs often give answers that are inconsistent with the provided context. Existing methods rely on expensive supervised fine-tuning or external retrieval and do not necessarily improve how the given context is used.

Method: Propose the CARE framework, which uses the model's own retrieval capabilities to integrate evidence by strategically retrieving in-context tokens within the reasoning chain, and which needs only a limited amount of labeled evidence data for training.

Result: Experiments on multiple real-world and counterfactual QA benchmarks show that the method substantially outperforms supervised fine-tuning, traditional retrieval-augmented generation, and external retrieval solutions.

Conclusion: CARE is a fundamental step towards making LLMs more accurate, reliable, and efficient on knowledge-intensive tasks.

Abstract: Large language models (LLMs) often struggle with context fidelity, producing inconsistent answers when responding to questions based on provided information. Existing approaches either rely on expensive supervised fine-tuning to generate evidence post-answer or train models to perform web searches without necessarily improving utilization of the given context. We propose CARE, a novel native retrieval-augmented reasoning framework that teaches LLMs to explicitly integrate in-context evidence within their reasoning process with the model's own retrieval capabilities. Our method requires limited labeled evidence data while significantly enhancing both retrieval accuracy and answer generation performance through strategically retrieved in-context tokens in the reasoning chain. Extensive experiments on multiple real-world and counterfactual QA benchmarks demonstrate that our approach substantially outperforms supervised fine-tuning, traditional retrieval-augmented generation methods, and external retrieval solutions. This work represents a fundamental advancement in making LLMs more accurate, reliable, and efficient for knowledge-intensive tasks.

[9] Can Large Language Models Robustly Perform Natural Language Inference for Japanese Comparatives?

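Yosuke Mikami, Daiki Matsuoka, Hitomi Yanaka

Main category: cs.CL

TL;DR: This paper builds a Japanese natural language inference (NLI) dataset focused on comparatives and evaluates large language models in zero-shot and few-shot settings. Model performance depends on the prompt format and on the labels of the few-shot examples, the models struggle with linguistic phenomena unique to Japanese, and prompts that include logical semantic representations help.

Details

Motivation: LLMs still struggle with NLI involving numerical and logical expressions, and comparative inference in languages that are not dominant in the training data, such as Japanese, has not been sufficiently explored.

Method: Construct a Japanese NLI dataset focused on comparatives, evaluate various LLMs in zero-shot and few-shot settings, and analyze how prompt formats and example labels affect the results.

Result: Model performance is sensitive to the prompt format, and the labels in the few-shot examples influence the output. The models struggle with linguistic phenomena unique to Japanese. Prompts containing logical semantic representations help the models solve problems they otherwise fail to infer.

Conclusion: Although LLMs perform well on NLI in general, they remain limited on complex phenomena such as Japanese comparatives. Prompt design, in particular the inclusion of logical semantic representations, can markedly improve their inference.

Abstract: Large Language Models (LLMs) perform remarkably well in Natural Language Inference (NLI). However, NLI involving numerical and logical expressions remains challenging. Comparatives are a key linguistic phenomenon related to such inference, but the robustness of LLMs in handling them, especially in languages that are not dominant in the models' training data, such as Japanese, has not been sufficiently explored. To address this gap, we construct a Japanese NLI dataset that focuses on comparatives and evaluate various LLMs in zero-shot and few-shot settings. Our results show that the performance of the models is sensitive to the prompt formats in the zero-shot setting and influenced by the gold labels in the few-shot examples. The LLMs also struggle to handle linguistic phenomena unique to Japanese. Furthermore, we observe that prompts containing logical semantic representations help the models predict the correct labels for inference problems that they struggle to solve even with few-shot examples.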

[10] Integrating Text and Time-Series into (Large) Language Models to Predict Medical Outcomes

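Iyadh Ben Cheikh Larbi, Ajay Madhavan Ravichandran, Aljoscha Burchardt, Roland Roller

Main category: cs.CL

TL;DR: This study explores the potential of large language models for clinical classification tasks, in particular joint modeling of clinical text and structured electronic health record (EHR) data. Adapting instruction-tuned LLMs with DSPy-based prompt optimization yields performance on par with specialized multimodal systems, with less complexity and better adaptability across tasks.

Details

Motivation: LLMs excel at text generation, but their ability to handle clinical classification tasks involving structured data such as time series is underexplored. The study aims to make LLMs more applicable and flexible for clinical decision support.

Method: Use DSPy-based prompt optimization to adapt instruction-tuned LLMs so that they process clinical notes and structured EHR inputs jointly.

Result: The approach performs on par with purpose-built multimodal systems on clinical classification tasks, with lower system complexity and greater adaptability across tasks.

Conclusion: With prompt optimization, LLMs can effectively combine clinical text and structured data, offering a simpler and more flexible solution for a range of clinical classification tasks while maintaining strong performance.

Abstract: Large language models (LLMs) excel at text generation, but their ability to handle clinical classification tasks involving structured data, such as time series, remains underexplored. In this work, we adapt instruction-tuned LLMs using DSPy-based prompt optimization to process clinical notes and structured EHR inputs jointly. Our results show that this approach achieves performance on par with specialized multimodal systems while requiring less complexity and offering greater adaptability across tasks.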

[11] DSCC-HS: A Dynamic Self-Reinforcing Framework for Hallucination Suppression in Large Language Models

Xiao Zheng

Main category: cs.CL

TL;DR: Proposes DSCC-HS, a novel framework that proactively suppresses LLM hallucination by intervening dynamically during autoregressive decoding. It requires no modification to the target model and reaches state-of-the-art factual consistency on multiple benchmarks.

Details

Motivation: Hallucination is a major barrier to the reliable deployment of LLMs. Existing methods such as RAG are mostly reactive and lack a mechanism for suppressing hallucination during generation.

Method: Inspired by dual-process cognitive theory, DSCC-HS uses a compact proxy model trained in adversarial roles as a Factual Alignment Proxy (FAP) and a Hallucination Detection Proxy (HDP). At each decoding step during inference, the difference between the FAP and HDP logits is injected into the target model as a real-time steering vector, providing dynamic calibration.

Result: A 99.2% Factual Consistency Rate (FCR) on TruthfulQA and the highest FActScore, 46.50, on the long-form generation benchmark BioGEN, clearly ahead of existing methods.

Conclusion: DSCC-HS is a plug-and-play, efficient hallucination suppression framework that requires no modification to the target model and offers a principled way to improve LLM factuality.

Abstract: Large Language Model (LLM) hallucination is a significant barrier to their reliable deployment. Current methods like Retrieval-Augmented Generation (RAG) are often reactive. We introduce Dynamic Self-reinforcing Calibration for Hallucination Suppression (DSCC-HS), a novel, proactive framework that intervenes during autoregressive decoding. Inspired by dual-process cognitive theory, DSCC-HS uses a compact proxy model, trained in adversarial roles as a Factual Alignment Proxy (FAP) and a Hallucination Detection Proxy (HDP). During inference, these proxies dynamically steer a large target model by injecting a real-time steering vector, which is the difference between FAP and HDP logits, at each decoding step. This plug-and-play approach requires no modification to the target model. Our experiments on TruthfulQA and BioGEN show DSCC-HS achieves state-of-the-art performance. On TruthfulQA, it reached a 99.2% Factual Consistency Rate (FCR). On the long-form BioGEN benchmark, it attained the highest FActScore of 46.50. These results validate DSCC-HS as a principled and efficient solution for enhancing LLM factuality.
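The steering step described in the abstract can be illustrated with a small sketch. The abstract only states that the steering vector is the difference between FAP and HDP logits; how it is injected is not specified there, so the additive update on the target model's next-token logits and the scale factor `alpha` below are assumptions made for illustration, as are all function names and the toy numbers.

```python
import math


def steer_logits(target_logits, fap_logits, hdp_logits, alpha=1.0):
    """Add alpha * (FAP - HDP) to the target logits (assumed injection rule)."""
    steering = [f - h for f, h in zip(fap_logits, hdp_logits)]
    return [t + alpha * s for t, s in zip(target_logits, steering)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy three-token vocabulary: the target model slightly prefers token 0,
# the factual proxy favors token 1, the hallucination proxy favors token 0.
target = [2.0, 1.5, 0.1]
fap = [0.0, 1.0, 0.0]
hdp = [1.0, 0.0, 0.0]

steered = steer_logits(target, fap, hdp)
probs = softmax(steered)
next_token = max(range(len(probs)), key=lambda i: probs[i])  # greedy pick
```

In this toy case the steering vector lowers the logit of token 0 and raises that of token 1, so greedy decoding switches from token 0 to token 1. The target model's weights are never touched, which matches the plug-and-play property claimed in the abstract.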

[12] Automated Triaging and Transfer Learning of Incident Learning Safety Reports Using Large Language Representational Models

Peter Beidler, Mark Nguyen, Kevin Lybarger, Ola Holmberg, Eric Ford, John Kang

Main category: cs.CL

TL;DR: This study develops natural language processing (NLP) models that automatically detect high-severity incident reports in radiation oncology. Support vector machine (SVM) and BlueBERT models perform well on single-institution data but generalize poorly across institutions; transfer learning (BlueBERT_TRANSFER) substantially improves cross-institution performance, approaching human-level judgment.

Details

Motivation: Manual review of healthcare safety incident reports is time-consuming and depends on expert knowledge, so automated tools are needed to identify high-severity incidents efficiently and improve quality and safety.

Method: Train and evaluate NLP models (SVM and BlueBERT) on two text datasets: 7,094 reports from the authors' institution and 571 from IAEA SAFRON. Assess generalizability through cross-institution testing, build a BlueBERT_TRANSFER model fine-tuned in two stages to improve performance, and analyze the effect of manually edited text on the models.

Result: On the institutional test set, SVM and BlueBERT reach AUROC 0.82 and 0.81. Applied directly across institutions, performance drops (SVM 0.42, BlueBERT 0.56), while BlueBERT_TRANSFER raises cross-institution AUROC to 0.78. On the manually edited subset, SVM and BlueBERT_TRANSFER (AUROC 0.85 and 0.74) perform similarly to humans (AUROC 0.81).

Conclusion: The authors developed NLP models that can be applied across institutions to detect high-severity incident reports in radiation oncology. Transfer learning markedly improves performance, bringing it close to that of human experts.

Abstract: PURPOSE: Incident reports are an important tool for safety and quality improvement in healthcare, but manual review is time-consuming and requires subject matter expertise. Here we present a natural language processing (NLP) screening tool to detect high-severity incident reports in radiation oncology across two institutions. METHODS AND MATERIALS: We used two text datasets to train and evaluate our NLP models: 7,094 reports from our institution (Inst.), and 571 from IAEA SAFRON (SF), all of which had severity scores labeled by clinical content experts. We trained and evaluated two types of models: baseline support vector machines (SVM) and BlueBERT, which is a large language model pretrained on PubMed abstracts and hospitalized patient data. We assessed for generalizability of our model in two ways. First, we evaluated models trained using Inst.-train on SF-test. Second, we trained a BlueBERT_TRANSFER model that was first fine-tuned on Inst.-train then on SF-train before testing on SF-test set. To further analyze model performance, we also examined a subset of 59 reports from our Inst. dataset, which were manually edited for clarity. RESULTS: Classification performance on the Inst. test achieved AUROC 0.82 using SVM and 0.81 using BlueBERT. Without cross-institution transfer learning, performance on the SF test was limited to an AUROC of 0.42 using SVM and 0.56 using BlueBERT. BlueBERT_TRANSFER, which was fine-tuned on both datasets, improved the performance on SF test to AUROC 0.78. Performance of the SVM and BlueBERT_TRANSFER models on the manually curated Inst. reports (AUROC 0.85 and 0.74) was similar to human performance (AUROC 0.81). CONCLUSION: In summary, we successfully developed cross-institution NLP models on incident report text from radiation oncology centers. These models were able to detect high-severity reports similarly to humans on a curated dataset.
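To make the AUROC figures easier to interpret, here is a minimal sketch of the metric in its pairwise (Mann-Whitney) form: the probability that a randomly chosen high-severity report receives a higher score than a randomly chosen low-severity one. This is a generic definition, not the authors' evaluation code; the function name and the label convention (1 = high severity, 0 = otherwise) are assumptions.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    labels: iterable of 0/1 ground-truth values (1 = high severity, assumed).
    scores: iterable of model scores, higher meaning more likely positive.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

Under this definition 0.5 corresponds to random ranking, so the cross-institution SVM result of 0.42 is slightly worse than chance, while 0.78 after transfer learning means about four out of five positive/negative pairs are ordered correctly.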

[13] DSPC: Dual-Stage Progressive Compression Framework for Efficient Long-Context Reasoning

Yaxin Gao, Yao Lu, Zongfei Zhang, Jiaqi Nie, Shanqing Yu, Qi Xuan

Main category: cs.CL

TL;DR: Proposes DSPC, a training-free, dual-stage progressive prompt compression method that reduces the computational cost of large language models while preserving semantics, and that outperforms existing methods on multiple tasks.

Details

Motivation: Prompt inflation leads to high computational cost, and existing compression methods add further cost because they require extra training.

Method: Compress in two stages. The first stage performs coarse-grained, sentence-level filtering based on TF-IDF. The second stage performs fine-grained token pruning that combines attention contribution, cross-model loss difference, and positional importance.

Result: Validated on LLaMA-3.1-8B-Instruct and GPT-3.5-Turbo. Using 3x fewer tokens, DSPC scores 49.17 on the FewShot task of Longbench, 7.76 above the LongLLMLingua baseline.

Conclusion: DSPC is an efficient, training-free prompt compression method that maintains or improves model performance while using far fewer tokens.

Abstract: Large language models (LLMs) have achieved remarkable success in many natural language processing (NLP) tasks. To achieve more accurate output, the prompts used to drive LLMs have become increasingly long, which incurs higher computational costs. To address this prompt inflation problem, prompt compression has been proposed. However, most existing methods require training a small auxiliary model for compression, incurring a significant amount of additional computation. To avoid this, we propose a two-stage, training-free approach, called Dual-Stage Progressive Compression (DSPC). In the coarse-grained stage, semantic-related sentence filtering removes sentences with low semantic value based on TF-IDF. In the fine-grained stage, token importance is assessed using attention contribution, cross-model loss difference, and positional importance, enabling the pruning of low-utility tokens while preserving semantics. We validate DSPC on LLaMA-3.1-8B-Instruct and GPT-3.5-Turbo under a constrained token budget and observe consistent improvements. For instance, in the FewShot task of the Longbench dataset, DSPC achieves a performance of 49.17 while using 3x fewer tokens, outperforming the best state-of-the-art baseline LongLLMLingua by 7.76.
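As a rough illustration of the coarse-grained stage only, the sketch below scores each sentence by the summed TF-IDF weight of its terms and keeps the top-scoring fraction in original order. The abstract does not give the exact scoring or selection rule, so treating each sentence as its own document, whitespace tokenization, the `keep_ratio` parameter, and all function names are assumptions; the fine-grained token-pruning stage is not shown.

```python
import math
from collections import Counter


def tfidf_sentence_scores(sentences):
    """Score each sentence by the sum of tf * idf over its distinct terms.

    Each sentence is treated as one document (an assumption for this sketch).
    """
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    scores = []
    for tokens in tokenized:
        if not tokens:
            scores.append(0.0)
            continue
        term_freq = Counter(tokens)
        score = sum(
            (count / len(tokens)) * math.log(n / doc_freq[term])
            for term, count in term_freq.items()
        )
        scores.append(score)
    return scores


def filter_sentences(sentences, keep_ratio=0.5):
    """Keep the highest-scoring sentences, preserving their original order."""
    scores = tfidf_sentence_scores(sentences)
    k = max(1, math.ceil(len(sentences) * keep_ratio))
    top = sorted(range(len(sentences)), key=lambda i: -scores[i])[:k]
    return [sentences[i] for i in sorted(top)]
```

For instance, given `["the cat sat", "the the the", "quantum entanglement violates locality"]` with `keep_ratio=0.5`, the repetitive middle sentence receives the lowest score and is dropped.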

[14] Implementing a Logical Inference System for Japanese Comparatives

Yosuke Mikami, Daiki Matsuoka, Hitomi Yanaka

Main category: cs.CL

TL;DR: This paper proposes ccg-jcomp, a logical inference system based on compositional semantics for handling Japanese comparatives, and validates it on a Japanese natural language inference dataset containing comparative expressions.

Details

Motivation: Because of morphological and semantic differences between Japanese and English, existing logical inference systems for English comparatives are hard to apply directly to Japanese, so a dedicated system for Japanese comparatives is needed.

Method: Build ccg-jcomp, a logical inference system based on compositional semantics, evaluate it on a Japanese NLI dataset, and compare its accuracy with that of existing large language models.

Result: ccg-jcomp handles Japanese comparatives with high accuracy and performs better than or comparably to existing LLMs.

Conclusion: ccg-jcomp offers an effective and robust logic-based approach to NLI for Japanese comparatives and shows the potential of compositional semantics for handling numerical and logical expressions.

Abstract: Natural Language Inference (NLI) involving comparatives is challenging because it requires understanding quantities and comparative relations expressed by sentences. While some approaches leverage Large Language Models (LLMs), we focus on logic-based approaches grounded in compositional semantics, which are promising for robust handling of numerical and logical expressions. Previous studies along these lines have proposed logical inference systems for English comparatives. However, it has been pointed out that there are several morphological and semantic differences between Japanese and English comparatives. These differences make it difficult to apply such systems directly to Japanese comparatives. To address this gap, this study proposes ccg-jcomp, a logical inference system for Japanese comparatives based on compositional semantics. We evaluate the proposed system on a Japanese NLI dataset containing comparative expressions. We demonstrate the effectiveness of our system by comparing its accuracy with that of existing LLMs.

[15] Exploring Data and Parameter Efficient Strategies for Arabic Dialect Identifications

Vani Kanjirangat, Ljiljana Dolamic, Fabio Rinaldi

Main category: cs.CL

TL;DR: This paper explores data-efficient and parameter-efficient approaches to Arabic Dialect Identification (ADI), including soft-prompting strategies (such as prefix-tuning and prompt-tuning) and LoRA reparameterization, and evaluates how well large language models identify dialects in zero-shot and few-shot settings.

Details

Motivation: Given resource constraints and the need for efficient models, it is important to find methods that identify Arabic dialects effectively with little labeled data.

Method: Apply soft prompting (prefix-tuning, prompt-tuning, P-tuning, and P-tuning V2) and LoRA to Arabic-specific encoder models, and run zero-shot and few-shot hard-prompting inference on open-source decoder-only models (Phi-3.5 and SILMA).

Result: LLMs struggle to distinguish dialectal nuances in zero-shot and few-shot setups. Soft-prompted encoder models perform better, and LoRA-based fine-tuned models perform best, even surpassing full fine-tuning.

Conclusion: Parameter-efficient fine-tuning methods such as LoRA are highly effective for Arabic dialect identification and are a strong choice in resource-constrained settings.

Abstract: This paper discusses our exploration of different data-efficient and parameter-efficient approaches to Arabic Dialect Identification (ADI). In particular, we investigate various soft-prompting strategies, including prefix-tuning, prompt-tuning, P-tuning, and P-tuning V2, as well as LoRA reparameterizations. For the data-efficient strategy, we analyze hard prompting with zero-shot and few-shot inferences to analyze the dialect identification capabilities of Large Language Models (LLMs). For the parameter-efficient PEFT approaches, we conducted our experiments using Arabic-specific encoder models on several major datasets. We also analyzed the n-shot inferences on open-source decoder-only models, a general multilingual model (Phi-3.5), and an Arabic-specific one (SILMA). We observed that the LLMs generally struggle to differentiate the dialectal nuances in the few-shot or zero-shot setups. The soft-prompted encoder variants perform better, while the LoRA-based fine-tuned models perform best, even surpassing full fine-tuning.

[16] Teaching According to Talents! Instruction Tuning LLMs with Competence-Aware Curriculum Learning

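Yangning Li, Tingwei Lu, Yinghui Li, Yankai Chen, Wei-Chieh Huang, Wenhao Jiang, Hui Wang, Hai-Tao Zheng, Philip S. Yu

Main category: cs.CL

TL;DR: Proposes CAMPUS, a multi-perspective, competence-aware curriculum learning framework for efficient instruction tuning. Through dynamic sub-curriculum selection, competence-aware schedule adjustment, and scheduling over multiple difficulty measures, it overcomes the curriculum rigidity that static difficulty metrics cause in conventional methods.

Details

Motivation: Existing curriculum learning methods rely on static heuristic difficulty metrics and cannot adapt as the model's capabilities change during training, which yields a fixed and potentially sub-optimal learning trajectory. A curriculum mechanism that adjusts dynamically is needed.

Method: Propose the CAMPUS framework with three core components: dynamic sub-curriculum selection, competence-aware adjustment of the curriculum schedule, and scheduling based on multiple difficulty measures, so that instruction data is organized and used for training efficiently.

Result: Across multiple experiments, CAMPUS shows clear advantages over state-of-the-art baselines in both instruction tuning efficiency and final model performance.

Conclusion: By introducing a dynamic, multi-dimensional curriculum design, CAMPUS improves the efficiency of instruction tuning for LLMs on limited data and alleviates curriculum rigidity.

Abstract: Efficient instruction tuning aims to enhance the ultimate performance of large language models (LLMs) trained on a given instruction dataset. Curriculum learning as a typical data organization strategy has shown preliminary effectiveness in instruction tuning. However, current curriculum tuning methods suffer from curriculum rigidity, since they rely solely on static heuristic difficulty metrics. These methods fail to adapt to the evolving capabilities of models during training, resulting in a fixed and potentially sub-optimal learning trajectory. To address the issue, a Competence-Aware Multi-Perspective cUrriculum inStruction tuning framework, termed CAMPUS, is proposed. CAMPUS offers several advantages: (1) Dynamic selection for sub-curriculum. (2) Competency-aware adjustment to the curriculum schedule. (3) Multiple difficulty-based scheduling. Extensive experiments prove the superior performance of CAMPUS, compared to other state-of-the-art baselines for efficient instruction tuning.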

[17] Measuring Gender Bias in Job Title Matching for Grammatical Gender Languages

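Laura García-Sardiña, Hermenegildo Fabregat, Daniel Deniz, Rabih Zbib

Main category: cs.CL

TL;DR: This paper studies how explicit grammatical gender in job titles affects the results of automatic job ranking systems. It proposes gender-controlled ranking comparison metrics (such as RBO) to evaluate gender bias, builds test sets in four grammatical gender languages, and uses them to evaluate multilingual models, all of which show some degree of bias.

Details

Motivation: Examine how grammatical gender affects automatic job ranking systems and expose gender bias that existing systems may carry.

Method: Propose ranking comparison metrics that control for gender, in particular RBO; build multilingual test sets that contain job titles in masculine and feminine form, annotated by gender and matching relevance; and evaluate the gender bias of out-of-the-box multilingual models.

Result: Test sets for four grammatical gender languages were generated and shared. All evaluated multilingual models exhibit varying degrees of gender bias.

Conclusion: Explicit grammatical gender affects the fairness of job ranking systems, and dedicated metrics and datasets are needed to evaluate and mitigate this bias.

Abstract: This work sets the ground for studying how explicit grammatical gender assignment in job titles can affect the results of automatic job ranking systems. We propose the usage of metrics for ranking comparison controlling for gender to evaluate gender bias in job title ranking systems, in particular RBO (Rank-Biased Overlap). We generate and share test sets for a job title matching task in four grammatical gender languages, including occupations in masculine and feminine form and annotated by gender and matching relevance. We use the new test sets and the proposed methodology to evaluate the gender bias of several out-of-the-box multilingual models to set as baselines, showing that all of them exhibit varying degrees of gender bias.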
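The RBO metric named in the abstract can be sketched as follows. This is the basic truncated form of Rank-Biased Overlap evaluated to the depth of the shorter list, without the extrapolation term of the full definition; how the paper applies it (depth, persistence `p`, normalization) is not stated in the abstract, so those choices and the job-title example in the usage note are assumptions.

```python
def rbo_truncated(ranking_a, ranking_b, p=0.9):
    """Truncated Rank-Biased Overlap of two rankings without duplicates.

    Computes (1 - p) * sum_{d=1..k} p**(d-1) * |A[:d] & B[:d]| / d,
    where k is the length of the shorter ranking. Two identical rankings
    of length k score 1 - p**k, not 1, because the sum is truncated.
    """
    depth = min(len(ranking_a), len(ranking_b))
    seen_a, seen_b = set(), set()
    total = 0.0
    for d in range(1, depth + 1):
        seen_a.add(ranking_a[d - 1])
        seen_b.add(ranking_b[d - 1])
        overlap = len(seen_a & seen_b)  # items shared by both top-d prefixes
        total += (p ** (d - 1)) * overlap / d
    return (1 - p) * total
```

One hypothetical use is to retrieve ranked matches for the masculine and the feminine form of the same job title and compare the two result lists: the further the score falls below the value for identical lists, the more the grammatical gender of the query changed the ranking. Lower `p` concentrates the weight on the top ranks.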

[18] Geometric Uncertainty for Detecting and Correcting Hallucinations in LLMs

Edward Phillips, Sean Wu, Soheila Molaei, Danielle Belgrave, Anshul Thakur, David Clifton

Main category: cs.CL

TL;DR: Proposes a black-box method built on a geometric framework that uses archetypal analysis to estimate both global and local uncertainty, detecting LLM hallucinations effectively and performing especially well in high-risk domains such as medicine.

Details

Motivation: LLMs perform well but still generate incorrect content (hallucinations), and no existing black-box uncertainty quantification method provides both global and local uncertainty estimates.

Method: Sample responses with black-box access only and apply archetypal analysis to build a geometric framework. Geometric Volume measures the convex hull volume of archetypes derived from response embeddings and serves as the global uncertainty measure. Geometric Suspicion ranks individual responses by reliability and serves as the local measure.

Result: The method performs comparably to or better than prior methods on short-form question-answering datasets and clearly better on medical datasets. The authors also prove a theoretical link between convex hull volume and entropy.

Conclusion: The geometric framework delivers global and local uncertainty estimates in the black-box setting and provides an interpretable, practical tool for hallucination detection, especially in high-risk domains.

Abstract: Large language models demonstrate impressive results across diverse tasks but are still known to hallucinate, generating linguistically plausible but incorrect answers to questions. Uncertainty quantification has been proposed as a strategy for hallucination detection, but no existing black-box approach provides estimates for both global and local uncertainty. The former attributes uncertainty to a batch of responses, while the latter attributes uncertainty to individual responses. Current local methods typically rely on white-box access to internal model states, whilst black-box methods only provide global uncertainty estimates. We introduce a geometric framework to address this, based on archetypal analysis of batches of responses sampled with only black-box model access. At the global level, we propose Geometric Volume, which measures the convex hull volume of archetypes derived from response embeddings. At the local level, we propose Geometric Suspicion, which ranks responses by reliability and enables hallucination reduction through preferential response selection. Unlike prior dispersion methods which yield only a single global score, our approach provides semantic boundary points which have utility for attributing reliability to individual responses. Experiments show that our framework performs comparably to or better than prior methods on short form question-answering datasets, and achieves superior results on medical datasets where hallucinations carry particularly critical risks. We also provide theoretical justification by proving a link between convex hull volume and entropy.

[19] Findings of the Third Automatic Minuting (AutoMin) Challenge

Kartik Shinde,Laurent Besacier,Ondrej Bojar,Thibaut Thonet,Tirthankar Ghosal

Main category: cs.CL

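TL;DR: This paper presents the 2025 third edition of the AutoMin shared task on automatic minuting, which comprises structured minutes generation and question answering over meeting transcripts in English and Czech, and evaluates current large language models.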

Details Motivation: To advance automatic minuting and cross-lingual question answering by providing a multilingual, multi-domain evaluation platform. Method: Two tasks are set up: minuting (English and Czech; project meetings and European Parliament sessions) and question answering over meeting transcripts (monolingual and cross-lingual settings), with multiple baseline systems included in the evaluation. Result: Participation in 2025 was lower than in previous years, with only one team in the minuting task and two teams in QA; the added baseline systems nevertheless enabled a comprehensive evaluation of current large language models. Conclusion: Despite the small number of participants, the task provides valuable benchmark data and an evaluation framework for large language models across languages and domains. Abstract: This paper presents the third edition of AutoMin, a shared task on automatic meeting summarization into minutes. In 2025, AutoMin featured the main task of minuting, the creation of structured meeting minutes, as well as a new task: question answering (QA) based on meeting transcripts. The minuting task covered two languages, English and Czech, and two domains: project meetings and European Parliament sessions. The QA task focused solely on project meetings and was available in two settings: monolingual QA in English, and cross-lingual QA, where questions were asked and answered in Czech based on English meetings. Participation in 2025 was more limited compared to previous years, with only one team joining the minuting task and two teams participating in QA. However, as organizers, we included multiple baseline systems to enable a comprehensive evaluation of current (2025) large language models (LLMs) on both tasks.

[20] Large Language Models Discriminate Against Speakers of German Dialects

Minh Duc Bui,Carolin Holtermann,Valentin Hofmann,Anne Lauscher,Katharina von der Wense

Main category: cs.CL

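TL;DR: This study examines whether large language models (LLMs) hold social stereotypes about speakers of German dialects and finds that LLMs show significant dialect naming and dialect usage bias, with explicit mention of dialect speakers amplifying the bias.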

Details Motivation: Dialects have cultural value, yet their speakers often face negative stereotypes. The paper asks whether large language models reproduce these societal biases. Method: Drawing on sociolinguistic research on dialect perception, the authors design two tasks, an association task and a decision task, and build a new evaluation corpus pairing sentences from seven German dialects with their standard German counterparts in order to assess dialect naming and dialect usage bias in LLMs. Result: (1) All evaluated LLMs show significant dialect naming and dialect usage bias in the association task, reflected in negative adjective associations; (2) the models reproduce these biases in the decision task; (3) contrary to prior work, explicitly labeling the linguistic group (dialect speakers) amplifies bias more than implicit cues do. Conclusion: Large language models not only reproduce negative societal stereotypes about dialect speakers but show stronger bias when dialect identity is mentioned explicitly, which calls for attention to fairness with respect to linguistic diversity in model training and deployment. Abstract: Dialects represent a significant component of human culture and are found across all regions of the world. In Germany, more than 40% of the population speaks a regional dialect (Adler and Hansen, 2022). However, despite cultural importance, individuals speaking dialects often face negative societal stereotypes. We examine whether such stereotypes are mirrored by large language models (LLMs). We draw on the sociolinguistic literature on dialect perception to analyze traits commonly associated with dialect speakers. Based on these traits, we assess the dialect naming bias and dialect usage bias expressed by LLMs in two tasks: an association task and a decision task. To assess a model's dialect usage bias, we construct a novel evaluation corpus that pairs sentences from seven regional German dialects (e.g., Alemannic and Bavarian) with their standard German counterparts. We find that: (1) in the association task, all evaluated LLMs exhibit significant dialect naming and dialect usage bias against German dialect speakers, reflected in negative adjective associations; (2) all models reproduce these dialect naming and dialect usage biases in their decision making; and (3) contrary to prior work showing minimal bias with explicit demographic mentions, we find that explicitly labeling linguistic demographics--German dialect speakers--amplifies bias more than implicit cues like dialect usage.

[21] Do LLMs Align Human Values Regarding Social Biases? Judging and Explaining Social Biases with LLMs

Yang Liu,Chenhui Chu

Main category: cs.CL

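TL;DR: This study investigates how well large language models (LLMs) align with human values across different types of social-bias scenarios, finding that larger parameter scale does not necessarily lower the misalignment rate and that judgment consistency varies across model families.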

Details Motivation: To determine whether the alignment of large language models with human values differs across types of social-bias scenarios, in particular between negative and non-negative questions. Method: Twelve LLMs from four model families are analyzed on four datasets, measuring misalignment rate, attack success rate, and explanatory ability in social-bias scenarios, and comparing model-generated explanations with human understanding. Result: Models with more parameters do not necessarily have lower misalignment rates; LLMs from the same family judge more consistently; LLMs show an alignment preference for specific scenario types; understanding of HVSB does not differ significantly across LLMs, but models prefer their own explanations; fine-tuned smaller language models produce explanations that are more readable but have lower model agreeability. Conclusion: LLM alignment on social-bias questions depends on scenario type and model family rather than on parameter scale alone, and models' preference for their own explanations may affect the reliability of those explanations. Abstract: Large language models (LLMs) can lead to undesired consequences when misaligned with human values, especially in scenarios involving complex and sensitive social biases. Previous studies have revealed the misalignment of LLMs with human values using expert-designed or agent-based emulated bias scenarios. However, it remains unclear whether the alignment of LLMs with human values differs across different types of scenarios (e.g., scenarios containing negative vs. non-negative questions). In this study, we investigate the alignment of LLMs with human values regarding social biases (HVSB) in different types of bias scenarios. Through extensive analysis of 12 LLMs from four model families and four datasets, we demonstrate that LLMs with large model parameter scales do not necessarily have lower misalignment rates and attack success rates. Moreover, LLMs show a certain degree of alignment preference for specific types of scenarios, and LLMs from the same model family tend to have higher judgment consistency. In addition, we study the understanding capacity of LLMs with their explanations of HVSB. We find no significant differences in the understanding of HVSB across LLMs. We also find that LLMs prefer their own generated explanations. Additionally, we endow smaller language models (LMs) with the ability to explain HVSB. The generation results show that the explanations generated by the fine-tuned smaller LMs are more readable, but have a relatively lower model agreeability.

[22] Combining Evidence and Reasoning for Biomedical Fact-Checking

Mariano Barone,Antonio Romano,Giuseppe Riccio,Marco Postiglione,Vincenzo Moscato

Main category: cs.CL

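TL;DR: Proposes CER (Combining Evidence and Reasoning), a new framework for biomedical fact-checking that integrates scientific evidence retrieval, large language model reasoning, and supervised veracity prediction, achieving state-of-the-art performance on several datasets.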

Details Motivation: Verifying biomedical claims is challenging because of complex terminology, the need for domain expertise, and heavy reliance on scientific evidence; existing methods struggle to counter misinformation in healthcare. Method: CER combines retrieval of high-quality biomedical evidence, reasoning with large language models, and a supervised veracity prediction module, and reduces hallucination risk by grounding generated explanations in the retrieved scientific evidence. Result: Evaluations on the expert-annotated datasets HealthFC, BioASQ-7b, and SciFact show state-of-the-art performance and promising cross-dataset generalization. Conclusion: By fusing evidence retrieval with language model reasoning, CER offers a reliable, explainable, and reproducible approach to biomedical fact-checking that helps counter the spread of medical misinformation. Abstract: Misinformation in healthcare, from vaccine hesitancy to unproven treatments, poses risks to public health and trust in medical systems. While machine learning and natural language processing have advanced automated fact-checking, validating biomedical claims remains uniquely challenging due to complex terminology, the need for domain expertise, and the critical importance of grounding in scientific evidence. We introduce CER (Combining Evidence and Reasoning), a novel framework for biomedical fact-checking that integrates scientific evidence retrieval, reasoning via large language models, and supervised veracity prediction. By integrating the text-generation capabilities of large language models with advanced retrieval techniques for high-quality biomedical scientific evidence, CER effectively mitigates the risk of hallucinations, ensuring that generated outputs are grounded in verifiable, evidence-based sources. Evaluations on expert-annotated datasets (HealthFC, BioASQ-7b, SciFact) demonstrate state-of-the-art performance and promising cross-dataset generalization. Code and data are released for transparency and reproducibility: https://github.com/PRAISELab-PicusLab/CER

[23] Combating Biomedical Misinformation through Multi-modal Claim Detection and Evidence-based Verification

Mariano Barone,Antonio Romano,Giuseppe Riccio,Marco Postiglione,Vincenzo Moscato

Main category: cs.CL

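TL;DR: Proposes CER (Combining Evidence and Reasoning), a new framework for biomedical fact-checking that integrates scientific evidence retrieval, large language model reasoning, and supervised veracity prediction, achieving state-of-the-art performance on several expert-annotated datasets.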

Details Motivation: Verifying biomedical claims is challenging because of complex terminology, the need for domain expertise, and heavy reliance on scientific evidence; existing methods struggle to counter misinformation in healthcare. Method: CER combines retrieval of high-quality biomedical evidence, a reasoning step based on large language models, and a supervised veracity prediction module, exploiting generative models while using evidence retrieval to reduce hallucination risk. Result: Evaluations on HealthFC, BioASQ-7b, and SciFact show state-of-the-art performance and promising cross-dataset generalization. Conclusion: By combining retrieval, reasoning, and verification, CER provides a reliable and reproducible approach to automated fact-checking in the biomedical domain. Abstract: Misinformation in healthcare, from vaccine hesitancy to unproven treatments, poses risks to public health and trust in medical systems. While machine learning and natural language processing have advanced automated fact-checking, validating biomedical claims remains uniquely challenging due to complex terminology, the need for domain expertise, and the critical importance of grounding in scientific evidence. We introduce CER (Combining Evidence and Reasoning), a novel framework for biomedical fact-checking that integrates scientific evidence retrieval, reasoning via large language models, and supervised veracity prediction. By integrating the text-generation capabilities of large language models with advanced retrieval techniques for high-quality biomedical scientific evidence, CER effectively mitigates the risk of hallucinations, ensuring that generated outputs are grounded in verifiable, evidence-based sources. Evaluations on expert-annotated datasets (HealthFC, BioASQ-7b, SciFact) demonstrate state-of-the-art performance and promising cross-dataset generalization. Code and data are released for transparency and reproducibility: https://github.com/PRAISELab-PicusLab/CER

[24] Do Large Language Models Understand Word Senses?

Domenico Meconi,Simone Stirpe,Federico Martelli,Leonardo Lavalle,Roberto Navigli

Main category: cs.CL

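TL;DR: This paper evaluates instruction-tuned large language models (LLMs) on Word Sense Disambiguation (WSD) and generation tasks, finding that leading models such as GPT-4o and DeepSeek-V3 match specialized systems on WSD and explain word meaning in context with up to 98% accuracy.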

Details Motivation: Despite extensive evaluation efforts, whether large language models truly grasp word senses remains underexplored. The paper addresses this gap by systematically evaluating LLMs on word sense disambiguation and on generating word meanings in context. Method: Instruction-tuned LLMs are evaluated on WSD and compared with specialized WSD systems; two top-performing LLMs, one open and one closed, are then tested on three generative settings: definition generation, free-form explanation, and example generation. Result: On WSD, GPT-4o and DeepSeek-V3 perform on par with state-of-the-art specialized systems while being more robust across domains and difficulty levels; in the generative tasks, LLMs explain word meaning in context with up to 98% accuracy, with the best results in free-form explanation. Conclusion: Leading LLMs match specialized systems on WSD and generate highly accurate in-context explanations of word meaning, indicating strong word sense understanding, especially on tasks that fit their generative nature. Abstract: Understanding the meaning of words in context is a fundamental capability for Large Language Models (LLMs). Despite extensive evaluation efforts, the extent to which LLMs show evidence that they truly grasp word senses remains underexplored. In this paper, we address this gap by evaluating both i) the Word Sense Disambiguation (WSD) capabilities of instruction-tuned LLMs, comparing their performance to state-of-the-art systems specifically designed for the task, and ii) the ability of two top-performing open- and closed-source LLMs to understand word senses in three generative settings: definition generation, free-form explanation, and example generation. Notably, we find that, in the WSD task, leading models such as GPT-4o and DeepSeek-V3 achieve performance on par with specialized WSD systems, while also demonstrating greater robustness across domains and levels of difficulty. In the generation tasks, results reveal that LLMs can explain the meaning of words in context with up to 98% accuracy, with the highest performance observed in the free-form explanation task, which best aligns with their generative capabilities.

[25] Linguistic Nepotism: Trading-off Quality for Language Preference in Multilingual RAG

Dayeon Ki,Marine Carpuat,Paul McNamee,Daniel Khashabi,Eugene Yang,Dawn Lawrie,Kevin Duh

Main category: cs.CL

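TL;DR: This study examines how mixing document languages affects generation and citation in multilingual Retrieval-Augmented Generation (mRAG) systems, finding that models preferentially cite English sources for English queries and that this preference can outweigh document relevance.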

Details Motivation: To find out whether mixing document languages in a multilingual setting affects generation and citation in unintended ways, and in particular how language preference acts when document relevance is held constant. Method: A controlled methodology uses model internals to measure language preference while fixing factors such as document relevance, with experiments across eight languages and six open-weight models. Result: Models preferentially cite English documents when queries are in English; the bias is stronger for lower-resource languages and for documents positioned mid-context; models sometimes sacrifice document relevance for language preference. Conclusion: Language models show a language-driven citation preference in multilingual contexts, so citation choices are not always driven by informativeness, which has implications for the design of mRAG systems. Abstract: Multilingual Retrieval-Augmented Generation (mRAG) systems enable language models to answer knowledge-intensive queries with citation-supported responses across languages. While such systems have been proposed, an open question is whether the mixture of different document languages impacts generation and citation in unintended ways. To investigate, we introduce a controlled methodology using model internals to measure language preference while holding other factors such as document relevance constant. Across eight languages and six open-weight models, we find that models preferentially cite English sources when queries are in English, with this bias amplified for lower-resource languages and for documents positioned mid-context. Crucially, we find that models sometimes trade off document relevance for language preference, indicating that citation choices are not always driven by informativeness alone. Our findings shed light on how language models leverage multilingual context and influence citation behavior.

[26] Long-context Reference-based MT Quality Estimation

Sami Ul Haq,Chinonso Cynthia Osuji,Sheila Castilho,Brian Davis

Main category: cs.CL

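TL;DR: This paper presents a translation quality evaluation system built on the COMET framework that uses augmented long-context data to predict segment-level Error Span Annotation (ESA) scores.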

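Details Motivation: To improve the correlation between automatic translation quality evaluation and human judgments, particularly on long-context input. Method: Long-context training data is built by concatenating in-domain, human-annotated sentences and taking a weighted average of their scores; several human judgment datasets (MQM, SQM, DA) are integrated by normalising their scales, and multilingual regression models are trained to predict quality scores from the source, hypothesis, and reference translations. Result: Incorporating long-context information improves correlation with human judgments compared to models trained only on short segments. Conclusion: Combining long-context information with multiple human judgment datasets improves the performance of automatic translation quality evaluation systems. Abstract: In this paper, we present our submission to the Tenth Conference on Machine Translation (WMT25) Shared Task on Automated Translation Quality Evaluation. Our systems are built upon the COMET framework and trained to predict segment-level Error Span Annotation (ESA) scores using augmented long-context data. To construct long-context training data, we concatenate in-domain, human-annotated sentences and compute a weighted average of their scores. We integrate multiple human judgment datasets (MQM, SQM, and DA) by normalising their scales and train multilingual regression models to predict quality scores from the source, hypothesis, and reference translations. Experimental results show that incorporating long-context information improves correlations with human judgments compared to models trained only on short segments.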
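The long-context data construction described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the abstract does not say how the average is weighted or how scales are normalised, so weighting by hypothesis length in whitespace tokens and min-max normalisation are assumptions, and `normalise` and `build_long_context` are hypothetical helper names.

```python
def normalise(score, lo, hi):
    """Min-max map a raw human score from [lo, hi] onto [0, 1] (assumed scheme)."""
    return (score - lo) / (hi - lo)


def build_long_context(segments):
    """Concatenate annotated segments into one long-context training example.

    segments: list of (source, hypothesis, reference, score) tuples whose
    scores are already on a common scale. The length-based weighting is an
    assumption; hypotheses are assumed to be non-empty.
    """
    weights = [len(hyp.split()) for _, hyp, _, _ in segments]
    total = sum(weights)
    score = sum(w * seg[3] for w, seg in zip(weights, segments)) / total
    return {
        "src": " ".join(seg[0] for seg in segments),
        "mt": " ".join(seg[1] for seg in segments),
        "ref": " ".join(seg[2] for seg in segments),
        "score": score,
    }


example = build_long_context([
    ("a b", "x y", "r s", 0.5),
    ("c", "z w v u t q", "p", 1.0),
])
```

With these inputs the second segment carries three times the weight of the first, so the combined score sits closer to 1.0 than a plain mean would.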

[27] Slim-SC: Thought Pruning for Efficient Scaling with Self-Consistency

Colin Hong,Xu Guo,Anand Chaanan Singh,Esha Choukse,Dmitrii Ustiugov

Main category: cs.CL

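TL;DR: Proposes Slim-SC, a step-wise pruning strategy based on thought-level similarity between reasoning chains, which substantially reduces the inference latency and KV cache usage of Self-Consistency while maintaining or improving accuracy.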

Details Motivation: Self-Consistency (SC) improves LLM reasoning performance, but its computational overhead limits broad deployment; existing acceleration methods rely on model confidence scores or heuristics with limited empirical support. Method: After a theoretical and empirical analysis of SC's inefficiencies, the authors propose Slim-SC, which prunes step by step using thought-level similarity between reasoning chains to remove redundant chains. Result: On three STEM datasets and two recent LLM architectures, Slim-SC reduces inference latency by up to 45% and KV cache usage by up to 26% (with R1-Distill) while maintaining or improving accuracy. Conclusion: Slim-SC is a simple yet efficient test-time scaling alternative to Self-Consistency that lowers computational cost without sacrificing performance. Abstract: Recently, Test-Time Scaling (TTS) has gained increasing attention for improving LLM reasoning performance at test time without retraining the model. A notable TTS technique is Self-Consistency (SC), which generates multiple reasoning chains in parallel and selects the final answer via majority voting. While effective, the order-of-magnitude computational overhead limits its broad deployment. Prior attempts to accelerate SC mainly rely on model-based confidence scores or heuristics with limited empirical support. For the first time, we theoretically and empirically analyze the inefficiencies of SC and reveal actionable opportunities for improvement. Building on these insights, we propose Slim-SC, a step-wise pruning strategy that identifies and removes redundant chains using inter-chain similarity at the thought level. Experiments on three STEM reasoning datasets and two recent LLM architectures show that Slim-SC reduces inference latency and KVC usage by up to 45% and 26%, respectively, with R1-Distill, while maintaining or improving accuracy, thus offering a simple yet efficient TTS alternative for SC.
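A rough sketch of the two ingredients named in the abstract, majority voting over chains and similarity-based pruning of redundant chains, might look like the code below. It is illustrative only: the abstract does not specify the similarity measure or which chain of a redundant pair is dropped, so word-level Jaccard similarity, the 0.8 threshold, and keeping the earlier chain are all assumptions, and the function names are hypothetical.

```python
from collections import Counter


def jaccard(a, b):
    """Word-overlap similarity between two thoughts (assumed stand-in metric)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


def prune_redundant(latest_thoughts, threshold=0.8):
    """Return indices of chains to keep, given each chain's latest thought.

    A chain is dropped when its latest thought is at least `threshold`
    similar to that of a chain already kept.
    """
    kept = []
    for i, thought in enumerate(latest_thoughts):
        if all(jaccard(thought, latest_thoughts[j]) < threshold for j in kept):
            kept.append(i)
    return kept


def majority_vote(answers):
    """Standard Self-Consistency aggregation: the most frequent final answer."""
    return Counter(answers).most_common(1)[0][0]


thoughts = [
    "add 3 and 4 to get 7",
    "add 3 and 4 to get 7",
    "multiply 3 by 4 to get 12",
]
kept = prune_redundant(thoughts)
winner = majority_vote(["7", "12", "7"])
```

In a real system the pruning step would run after each reasoning step during generation, so that pruned chains stop consuming compute and KV cache.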

[28] Early Stopping Chain-of-thoughts in Large Language Models

Minjia Mao,Bowen Yin,Yu Zhu,Xiao Fang

Main category: cs.CL

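TL;DR: Proposes ES-CoT, an inference-time method that stops chain-of-thought generation early by detecting answer convergence, substantially reducing inference cost while preserving accuracy.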

Details Motivation: Long chain-of-thought (CoT) reasoning improves performance on complex problems but incurs high inference costs, so an efficient way to shorten CoT with little loss is needed. Method: At the end of each reasoning step the LLM is prompted to output its current final answer; the run length of consecutive identical answers is tracked, and generation is terminated once the run length jumps sharply and exceeds a threshold. Result: Experiments on five reasoning datasets and three LLMs show that ES-CoT cuts inference tokens by about 41% on average while keeping accuracy comparable to standard CoT; it is compatible with self-consistency prompting and robust to hyperparameter choices. Conclusion: ES-CoT is a practical and effective method for efficient reasoning that lowers inference cost with almost no loss in performance. Abstract: Reasoning large language models (LLMs) have demonstrated superior capacities in solving complicated problems by generating long chain-of-thoughts (CoT), but such a lengthy CoT incurs high inference costs. In this study, we introduce ES-CoT, an inference-time method that shortens CoT generation by detecting answer convergence and stopping early with minimal performance loss. At the end of each reasoning step, we prompt the LLM to output its current final answer, denoted as a step answer. We then track the run length of consecutive identical step answers as a measure of answer convergence. Once the run length exhibits a sharp increase and exceeds a minimum threshold, the generation is terminated. We provide both empirical and theoretical support for this heuristic: step answers steadily converge to the final answer, and large run-length jumps reliably mark this convergence. Experiments on five reasoning datasets across three LLMs show that ES-CoT reduces the number of inference tokens by about 41% on average while maintaining accuracy comparable to standard CoT. Further, ES-CoT integrates seamlessly with self-consistency prompting and remains robust across hyperparameter choices, highlighting it as a practical and effective approach for efficient reasoning.
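The stopping rule can be sketched in a few lines. This is a simplified reading of the abstract rather than the paper's exact criterion: the abstract does not define a "sharp increase", so the sketch assumes it means the current run is at least `jump_ratio` times the longest earlier run. `early_stop_index`, `min_run`, and `jump_ratio` are hypothetical names.

```python
def early_stop_index(step_answers, min_run=3, jump_ratio=2.0):
    """Return the step index at which generation would stop, or None.

    step_answers: the answer the model reports after each reasoning step.
    Stops when the current run of identical answers is at least `min_run`
    long and at least `jump_ratio` times the longest earlier run (the
    jump test is an assumption, not taken from the paper).
    """
    longest_prev_run = 0  # longest run that has already ended
    run = 0               # length of the current run
    prev = None
    for i, ans in enumerate(step_answers):
        if i > 0 and ans == prev:
            run += 1
        else:
            longest_prev_run = max(longest_prev_run, run)
            run = 1
            prev = ans
        if run >= min_run and run >= jump_ratio * max(longest_prev_run, 1):
            return i
    return None


stop_at = early_stop_index(["12", "15", "15", "18", "18", "18", "18"])
```

Here the answer "18" must repeat four times before stopping, because the earlier run of two "15" answers raises the bar for what counts as a jump.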

[29] Hala Technical Report: Building Arabic-Centric Instruction & Translation Models at Scale

Hasan Abed Al Kader Hammoud,Mohammad Zbeeb,Bernard Ghanem

Main category: cs.CL

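TL;DR: Hala is a family of Arabic-centric instruction and translation models built with a translate-and-tune pipeline: a compressed teacher model generates high-quality bilingual supervision, a lightweight language model translates English instruction sets into a large Arabic instruction corpus, and models trained at several parameter scales are combined with slerp merging to reach state-of-the-art results on Arabic benchmarks.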

Details Motivation: To advance Arabic natural language processing by addressing the weakness of existing models on Arabic instruction following and providing efficient, high-quality Arabic-centric models. Method: A translate-and-tune pipeline: a strong AR↔EN teacher is first compressed to FP8 for efficiency and used to generate high-fidelity bilingual supervision; the lightweight language model LFM2-1.2B is fine-tuned on this data and used to translate high-quality English instruction sets into Arabic, yielding a million-scale instruction corpus; Hala models are then trained at several parameter scales (350M to 9B), and slerp merging is applied to balance Arabic specialization with base-model strengths. Result: On Arabic-centric benchmarks, Hala achieves state-of-the-art results in both the "nano" (≤2B) and "small" (7-9B) categories and outperforms its base models. Conclusion: With an efficient translate-and-tune pipeline and slerp merging, Hala improves Arabic instruction following and translation, and the released models, data, evaluation, and recipes support reproducible Arabic NLP research. Abstract: We present Hala, a family of Arabic-centric instruction and translation models built with our translate-and-tune pipeline. We first compress a strong AR↔EN teacher to FP8 (yielding ~2× higher throughput with no quality loss) and use it to create high-fidelity bilingual supervision. A lightweight language model LFM2-1.2B is then fine-tuned on this data and used to translate high-quality English instruction sets into Arabic, producing a million-scale corpus tailored to instruction following. We train Hala models at 350M, 700M, 1.2B, and 9B parameters, and apply slerp merging to balance Arabic specialization with base-model strengths. On Arabic-centric benchmarks, Hala achieves state-of-the-art results within both the "nano" (≤2B) and "small" (7-9B) categories, outperforming their bases. We release models, data, evaluation, and recipes to accelerate research in Arabic NLP.
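For readers unfamiliar with slerp merging, the sketch below shows spherical linear interpolation between two weight vectors. It illustrates the general formula only; applying it per tensor to flattened fine-tuned and base weights, and the choice of interpolation factor `t`, are assumptions about how such a merge is typically set up, not details taken from the report.

```python
import math


def slerp(p, q, t, eps=1e-8):
    """Spherical linear interpolation between non-zero vectors p and q.

    t = 0 returns p and t = 1 returns q. Falls back to plain linear
    interpolation when the vectors are nearly parallel.
    """
    norm_p = math.sqrt(sum(x * x for x in p))
    norm_q = math.sqrt(sum(x * x for x in q))
    cos_omega = sum(a * b for a, b in zip(p, q)) / (norm_p * norm_q)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding
    omega = math.acos(cos_omega)                # angle between p and q
    if math.sin(omega) < eps:
        return [(1 - t) * a + t * b for a, b in zip(p, q)]
    wp = math.sin((1 - t) * omega) / math.sin(omega)
    wq = math.sin(t * omega) / math.sin(omega)
    return [wp * a + wq * b for a, b in zip(p, q)]


midpoint = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```

Unlike a straight average, the midpoint of two orthogonal unit vectors keeps unit length, which is the property that motivates slerp over linear interpolation when merging weights.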

[30] Audio-Based Crowd-Sourced Evaluation of Machine Translation Quality

Sami Ul Haq,Sheila Castilho,Yvette Graham

Main category: cs.CL

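TL;DR: This study compares text-based and audio-based evaluation of machine translation quality, finds that audio-based evaluation distinguishes system performance more effectively in some cases, and recommends incorporating speech-based assessment into future MT evaluation frameworks.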

Details Motivation: Machine translation quality assessment relies mainly on text, while many real-world applications produce spoken output, so a more natural speech-based form of evaluation is needed. Method: Crowd-sourced judgments are collected via Amazon Mechanical Turk to compare text-only and audio-based evaluations of 10 WMT machine translation systems, with statistical significance testing and self-replication experiments to check the reliability of the audio-based approach. Result: Audio-based evaluation yields rankings largely consistent with text-only evaluation but identifies significant differences between some systems, which the authors attribute to speech being a richer, more natural modality. Conclusion: Speech-based assessment should be incorporated into future machine translation evaluation frameworks. Abstract: Machine Translation (MT) has achieved remarkable performance, with growing interest in speech translation and multimodal approaches. However, despite these advancements, MT quality assessment remains largely text-centric, typically relying on human experts who read and compare texts. Since many real-world MT applications (e.g., Google Translate Voice Mode, iFLYTEK Translator) involve translation being spoken rather than printed or read, a more natural way to assess translation quality would be through speech as opposed to text-only evaluations. This study compares text-only and audio-based evaluations of 10 MT systems from the WMT General MT Shared Task, using crowd-sourced judgments collected via Amazon Mechanical Turk. We additionally performed statistical significance testing and self-replication experiments to test the reliability and consistency of the audio-based approach. Crowd-sourced assessments based on audio yield rankings largely consistent with text-only evaluations but, in some cases, identify significant differences between translation systems. We attribute this to speech being a richer, more natural modality and propose incorporating speech-based assessments into future MT evaluation frameworks.

[31] You Are What You Train: Effects of Data Composition on Training Context-aware Machine Translation Models

Paweł Mąka,Yusuf Can Semerci,Jan Scholtes,Gerasimos Spanakis

Main category: cs.CL

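TL;DR: By constructing training data with controlled proportions of contextually relevant examples, this paper confirms that training data sparsity is the main bottleneck for context utilization in machine translation, and proposes two training strategies that improve ctxPro accuracy by up to 6 and 8 percentage points in single- and multilingual settings respectively.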

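Details Motivation: Contextually rich examples are sparse in standard training data, which is hypothesized to be why models struggle to use context in translation; the paper sets out to validate this hypothesis systematically and explore remedies. Method: In single- and multilingual settings, training datasets are built with controlled proportions of contextually relevant examples, and different training strategies are evaluated for their effect on context utilization. Result: Training data sparsity is strongly associated with model performance, and improvements in one contextual phenomenon do not generalize to others; some cross-lingual transfer is observed, but it is not significantly higher between languages in the same sub-family; the two proposed training strategies yield accuracy gains of up to 6 and 8 percentage points in single- and multilingual settings respectively. Conclusion: Sparsity of contextually relevant training examples is a key factor limiting context utilization, and targeted training strategies improve model performance on context-dependent translation. Abstract: Achieving human-level translations requires leveraging context to ensure coherence and handle complex phenomena like pronoun disambiguation. Sparsity of contextually rich examples in the standard training data has been hypothesized as the reason for the difficulty of context utilization. In this work, we systematically validate this claim in both single- and multilingual settings by constructing training datasets with controlled proportions of contextually relevant examples. We demonstrate a strong association between training data sparsity and model performance, confirming sparsity as a key bottleneck. Importantly, we reveal that improvements in one contextual phenomenon do not generalize to others. While we observe some cross-lingual transfer, it is not significantly higher between languages within the same sub-family. Finally, we propose and empirically evaluate two training strategies designed to leverage the available data. These strategies improve context utilization, resulting in accuracy gains of up to 6 and 8 percentage points on the ctxPro evaluation in single- and multilingual settings respectively.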
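Building a training set with a controlled proportion of contextually relevant examples could be sketched as below. This is a generic sampling illustration, not the authors' procedure: how examples are labelled as contextually relevant is outside the sketch, and `mix_dataset` with its parameters is a hypothetical helper.

```python
import random


def mix_dataset(relevant, other, proportion, size, seed=0):
    """Sample `size` examples with a fixed share of context-relevant ones.

    relevant / other: pools of examples already labelled by whether
    translating them requires context (labelling is assumed done upstream).
    """
    rng = random.Random(seed)
    n_rel = round(proportion * size)
    n_other = size - n_rel
    if n_rel > len(relevant) or n_other > len(other):
        raise ValueError("not enough examples for the requested mix")
    mixed = rng.sample(relevant, n_rel) + rng.sample(other, n_other)
    rng.shuffle(mixed)  # avoid grouping the two kinds of example together
    return mixed


relevant_pool = [("rel", i) for i in range(50)]
other_pool = [("oth", i) for i in range(200)]
train = mix_dataset(relevant_pool, other_pool, proportion=0.2, size=100)
```

Holding `size` fixed while sweeping `proportion` keeps the total amount of training data constant, so any change in context-sensitive accuracy can be attributed to the mix rather than to dataset size.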

[32] Enhancing Multi-Agent Debate System Performance via Confidence Expression

Zijie Lin,Bryan Hooi

Main category: cs.CL

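TL;DR: Proposes incorporating confidence expression into Multi-Agent Debate (MAD) systems to improve the decision performance of large language models.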

Details Motivation: In existing MAD systems, LLMs struggle to convey their advantage in knowledge or reasoning, and missing or inappropriate confidence expression leads agents to hold on to incorrect beliefs or converge prematurely. Method: The ConfMAD framework integrates confidence expression throughout the debate process. Result: Experiments show that the method is effective, and the effect of confidence on debate dynamics is analyzed. Conclusion: Confidence expression improves debate effectiveness and overall performance in MAD systems and offers insights for designing confidence-aware MAD systems. Abstract: Generative Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of tasks. Recent research has introduced Multi-Agent Debate (MAD) systems, which leverage multiple LLMs to simulate human debate and thereby improve task performance. However, while some LLMs may possess superior knowledge or reasoning capabilities for specific tasks, they often struggle to clearly communicate this advantage during debates, in part due to a lack of confidence expression. Moreover, inappropriate confidence expression can cause agents in MAD systems to either stubbornly maintain incorrect beliefs or converge prematurely on suboptimal answers, ultimately reducing debate effectiveness and overall system performance. To address these challenges, we propose incorporating confidence expression into MAD systems to allow LLMs to explicitly communicate their confidence levels. To validate this approach, we develop ConfMAD, a MAD framework that integrates confidence expression throughout the debate process. Experimental results demonstrate the effectiveness of our method, and we further analyze how confidence influences debate dynamics, offering insights into the design of confidence-aware MAD systems.

[33] SSL-SSAW: Self-Supervised Learning with Sigmoid Self-Attention Weighting for Question-Based Sign Language Translation

Zekang Liu,Wei Feng,Fanhua Shang,Lianyu Hu,Jichao Feng,Liqing Gao

Main category: cs.CL

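TL;DR: This paper proposes Question-based Sign Language Translation (QB-SLT), a new task that uses dialogue context to improve translation, together with SSL-SSAW, a cross-modality self-supervised fusion method combining contrastive learning with Sigmoid Self-attention Weighting, which achieves state-of-the-art performance on newly constructed datasets.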

Details Motivation: Sign language translation lacks effective context modeling, while dialogue information such as questions occurs naturally in real communication, is easy to annotate, and can serve as useful context for translation. Method: SSL-SSAW uses contrastive learning to align multimodality features, a Sigmoid Self-attention Weighting (SSAW) module to adaptively fuse features from the question text and the sign language sequence, and self-supervised learning on the question text to strengthen representations. Result: On the newly constructed CSL-Daily-QA and PHOENIX-2014T-QA datasets, SSL-SSAW achieves SOTA performance; visualizations show that dialogue information improves translation quality; easily accessible question assistance can match or even surpass gloss assistance. Conclusion: Using questions as context is an efficient paradigm for sign language translation; the proposed method fuses multimodal information effectively, improves translation quality, and reduces reliance on gloss annotation. Abstract: Sign Language Translation (SLT) bridges the communication gap between deaf people and hearing people, where dialogue provides crucial contextual cues to aid in translation. Building on this foundational concept, this paper proposes Question-based Sign Language Translation (QB-SLT), a novel task that explores the efficient integration of dialogue. Unlike gloss (sign language transcription) annotations, dialogue naturally occurs in communication and is easier to annotate. The key challenge lies in aligning multimodality features while leveraging the context of the question to improve translation. To address this issue, we propose a cross-modality Self-supervised Learning with Sigmoid Self-attention Weighting (SSL-SSAW) fusion method for sign language translation. Specifically, we employ contrastive learning to align multimodality features in QB-SLT, then introduce a Sigmoid Self-attention Weighting (SSAW) module for adaptive feature extraction from question and sign language sequences. Additionally, we leverage available question text through self-supervised learning to enhance representation and translation capabilities. We evaluated our approach on newly constructed CSL-Daily-QA and PHOENIX-2014T-QA datasets, where SSL-SSAW achieved SOTA performance. Notably, easily accessible question assistance can achieve or even surpass the performance of gloss assistance. Furthermore, visualization results demonstrate the effectiveness of incorporating dialogue in improving translation quality.

[34] Canary-1B-v2 & Parakeet-TDT-0.6B-v3: Efficient and High-Performance Models for Multilingual ASR and AST

Monica Sekoyan,Nithin Rao Koluguri,Nune Tadevosyan,Piotr Zelasko,Travis Bartley,Nick Karpov,Jagadeesh Balam,Boris Ginsburg

Main category: cs.CL

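TL;DR: This report introduces Canary-1B-v2, a fast, robust multilingual model for Automatic Speech Recognition (ASR) and Speech-to-Text Translation (AST) built on a FastConformer encoder and Transformer decoder; it supports 25 languages, primarily European, was trained on 1.7M hours of data, and outperforms Whisper-large-v3 on English ASR while being 10x faster. The smaller Parakeet-TDT-0.6B-v3 model is released alongside it.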

Details Motivation: To build an efficient, fast ASR and AST model with broad multilingual support that overcomes the limitations of existing models in speed, hallucinations, and multilingual performance. Method: A FastConformer encoder with a Transformer decoder, two-stage pre-training and fine-tuning with dynamic data balancing, and non-speech audio added to reduce hallucinations; timestamps are produced with the NeMo Forced Aligner (NFA) and an auxiliary CTC model. Result: Canary-1B-v2 outperforms Whisper-large-v3 on English ASR while being 10x faster, and is competitive on multilingual ASR and AST with larger models such as Seamless-M4T-v2-large and LLM-based systems; the nGPT encoder scales well with massive data, while FastConformer excels after fine-tuning. Conclusion: Canary-1B-v2 is an efficient, high-performance multilingual ASR/AST model suited to practical deployment, and the accompanying Parakeet-TDT-0.6B-v3 further reduces model size and compute requirements. Abstract: This report introduces Canary-1B-v2, a fast, robust multilingual model for Automatic Speech Recognition (ASR) and Speech-to-Text Translation (AST). Built with a FastConformer encoder and Transformer decoder, it supports 25 languages, primarily European. The model was trained on 1.7M hours of total data samples, including Granary and NeMo ASR Set 3.0, with non-speech audio added to reduce hallucinations for ASR and AST. We describe its two-stage pre-training and fine-tuning process with dynamic data balancing, as well as experiments with an nGPT encoder. Results show nGPT scales well with massive data, while FastConformer excels after fine-tuning. For timestamps, Canary-1B-v2 uses the NeMo Forced Aligner (NFA) with an auxiliary CTC model, providing reliable segment-level timestamps for ASR and AST. Evaluations show Canary-1B-v2 outperforms Whisper-large-v3 on English ASR while being 10x faster, and delivers competitive multilingual ASR and AST performance against larger models like Seamless-M4T-v2-large and LLM-based systems. We also release Parakeet-TDT-0.6B-v3, a successor to v2, offering multilingual ASR across the same 25 languages with just 600M parameters.

[35] CS-FLEURS: A Massively Multilingual and Code-Switched Speech Dataset

Brian Yan,Injy Hamed,Shuichiro Shimizu,Vasista Lodagala,William Chen,Olga Iakovenko,Bashar Talafha,Amir Hussein,Alexander Polok,Kalvin Chang,Dominik Klement,Sara Althubaiti,Puyuan Peng,Matthew Wiesner,Thamar Solorio,Ahmed Ali,Sanjeev Khudanpur,Shinji Watanabe,Chih-Chen Chen,Zhen Wu,Karim Benharrak,Anuj Diwan,Samuele Cornell,Eunjung Yeo,Kwanghee Choi,Carlos Carvalho,Karen Rosero

Main category: cs.CL

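TL;DR: This paper introduces CS-FLEURS, a new dataset for developing and evaluating code-switched speech recognition and translation systems, covering 113 code-switched language pairs across 52 languages with four test sets and a training set.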

Details Motivation: To advance code-switched speech recognition and translation beyond high-resourced languages and address the limited language diversity of existing datasets. Method: The dataset comprises four test sets and one training set, built from real voices reading synthetically generated code-switched sentences, generative TTS, and concatenative TTS. Result: The released CS-FLEURS dataset covers 113 language pairs and includes four test sets plus a 128-hour generative TTS training set, supporting a broader range of code-switched speech research. Conclusion: CS-FLEURS helps broaden the scope of future code-switched speech research, particularly for lower-resourced languages. Abstract: We present CS-FLEURS, a new dataset for developing and evaluating code-switched speech recognition and translation systems beyond high-resourced languages. CS-FLEURS consists of 4 test sets which cover in total 113 unique code-switched language pairs across 52 languages: 1) a 14 X-English language pair set with real voices reading synthetically generated code-switched sentences, 2) a 16 X-English language pair set with generative text-to-speech, 3) a 60 {Arabic, Mandarin, Hindi, Spanish}-X language pair set with generative text-to-speech, and 4) a 45 X-English lower-resourced language pair test set with concatenative text-to-speech. Besides the four test sets, CS-FLEURS also provides a training set with 128 hours of generative text-to-speech data across 16 X-English language pairs. Our hope is that CS-FLEURS helps to broaden the scope of future code-switched speech research. Dataset link: https://huggingface.co/datasets/byan/cs-fleurs.

[36] AssoCiAm: A Benchmark for Evaluating Association Thinking while Circumventing Ambiguity

Yifan Liu,Wenkuan Zhao,Shanshan Zhong,Jinghui Qin,Mingfu Liang,Zhongzhan Huang,Wushao Wen

Main category: cs.CL

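TL;DR: This paper proposes AssoCiAm, a new benchmark for evaluating the associative ability of multimodal large language models (MLLMs). By decomposing ambiguity into internal and external types and applying a hybrid computational method, it addresses the unreliability that the inherent ambiguity of association tasks causes in existing evaluation frameworks. Experiments show a strong positive correlation between cognition and association and confirm that the method makes evaluation more accurate and reliable.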

Details Motivation: Existing frameworks for assessing the associative ability of MLLMs often overlook the inherent ambiguity of association tasks, which stems from the divergent nature of associations and undermines the reliability of evaluation. A method that overcomes this problem is needed to measure association, the foundation of creativity, more accurately. Method: Ambiguity is decomposed into internal and external ambiguity; the AssoCiAm benchmark uses a hybrid computational method to circumvent its effect on evaluation; extensive experiments on multiple MLLMs analyze the relationship between cognition and association and the effect of ambiguity on model behavior. Result: Cognition and association in MLLMs are strongly positively correlated; ambiguity in the evaluation process makes model behavior more random-like; AssoCiAm improves the accuracy and reliability of evaluation. Conclusion: AssoCiAm offers a more reliable way to evaluate the associative ability of MLLMs by handling ambiguity, addressing a shortcoming of existing evaluation frameworks and supporting progress toward creative AGI systems. Abstract: Recent advancements in multimodal large language models (MLLMs) have garnered significant attention, offering a promising pathway toward artificial general intelligence (AGI). Among the essential capabilities required for AGI, creativity has emerged as a critical trait for MLLMs, with association serving as its foundation. Association reflects a model's ability to think creatively, making it vital to evaluate and understand. While several frameworks have been proposed to assess associative ability, they often overlook the inherent ambiguity in association tasks, which arises from the divergent nature of associations and undermines the reliability of evaluations. To address this issue, we decompose ambiguity into two types, internal ambiguity and external ambiguity, and introduce AssoCiAm, a benchmark designed to evaluate associative ability while circumventing the ambiguity through a hybrid computational method. We then conduct extensive experiments on MLLMs, revealing a strong positive correlation between cognition and association. Additionally, we observe that the presence of ambiguity in the evaluation process causes MLLMs' behavior to become more random-like. Finally, we validate the effectiveness of our method in ensuring more accurate and reliable evaluations. See Project Page for the data and code.

[37] Synthesizing Behaviorally-Grounded Reasoning Chains: A Data-Generation Framework for Personal Finance LLMs

Akhil Theerthala

Main category: cs.CL

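TL;DR: Proposes a new framework that combines financial context with behavioral finance to build supervision data for end-to-end personalized financial advisors. A Qwen-3-8B model fine-tuned on this data performs comparably to larger models at 80% lower cost.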

Details Motivation: Existing personalized financial advice systems rely on large agentic pipelines that are costly to maintain and yield insufficient returns, so a more efficient, reproducible approach is needed to improve the quality of practical financial advice. Method: The authors build a framework that integrates financial context with behavioral finance, generate a 19k-sample reasoning dataset, fine-tune Qwen-3-8B on it, and evaluate performance on a held-out test split and with a blind LLM-jury study. Result: The 8B model performs comparably to baselines with 14-32B parameters on factual accuracy, fluency, and personalization metrics, at 80% lower cost. Conclusion: With careful data construction and integration of behavioral factors, small and mid-sized models can match large models on personalized financial advice while cutting costs substantially, which makes them more practical and scalable. Abstract: Personalized financial advice requires consideration of user goals, constraints, risk tolerance, and jurisdiction. Prior LLM work has focused on support systems for investors and financial planners. Simultaneously, numerous recent studies examine broader personal finance tasks, including budgeting, debt management, retirement, and estate planning, through agentic pipelines that incur high maintenance costs, yielding less than 25% of their expected financial returns. In this study, we introduce a novel and reproducible framework that integrates relevant financial context with behavioral finance studies to construct supervision data for end-to-end advisors. Using this framework, we create a 19k sample reasoning dataset and conduct a comprehensive fine-tuning of the Qwen-3-8B model on the dataset. Using a held-out test split and a blind LLM-jury study, we demonstrate that, with careful data curation and behavioral integration, our 8B model achieves performance comparable to significantly larger baselines (14-32B parameters) across factual accuracy, fluency, and personalization metrics while incurring 80% lower costs than the larger counterparts.

[38] Framing Migration: A Computational Analysis of UK Parliamentary Discourse

Vahid Ghafouri,Robert McNeil,Teodor Yankov,Madeleine Sumption,Luc Rocher,Scott A. Hale,Adam Mahdi

Main category: cs.CL

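TL;DR: This study uses open-weight large language models for a large-scale computational analysis of migration-related discourse in the UK Parliament over 75 years and compares it with US congressional discourse, revealing how trends in migration discourse differ between the two countries.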

Details Motivation: To understand how attitudes toward migration evolve in long-term political discourse and how the degree of political polarization differs between countries. Method: Open-weight large language models (LLMs) annotate stances in UK and US legislative debates and track tone over time; for the UK, a semi-automated framework extracts fine-grained narrative frames. Result: US migration discourse has grown increasingly polarized, whereas UK attitudes remain relatively aligned across parties, although an ideological gap between Labour and the Conservatives persists and reaches its most negative level in 2025. UK narratives shift from social integration toward securitized topics such as border control and illegal immigration, and the focus of discussion moves from national law to international law and human rights. Conclusion: LLMs can effectively support scalable, fine-grained discourse analysis in political and historical contexts, offering a new way to study how policy discourse evolves over the long term. Abstract: We present a large-scale computational analysis of migration-related discourse in UK parliamentary debates spanning over 75 years and compare it with US congressional discourse. Using open-weight LLMs, we annotate each statement with high-level stances toward migrants and track the net tone toward migrants across time and political parties. For the UK, we extend this with a semi-automated framework for extracting fine-grained narrative frames to capture nuances of migration discourse. Our findings show that, while US discourse has grown increasingly polarised, UK parliamentary attitudes remain relatively aligned across parties, with a persistent ideological gap between Labour and the Conservatives, reaching its most negative level in 2025. The analysis of narrative frames in the UK parliamentary statements reveals a shift toward securitised narratives such as border control and illegal immigration, while longer-term integration-oriented frames such as social integration have declined. Moreover, discussions of national law about immigration have been replaced over time by international law and human rights, revealing nuances in discourse trends. Taken together, our findings demonstrate how LLMs can support scalable, fine-grained discourse analysis in political and historical contexts.

[39] Apertus: Democratizing Open and Compliant LLMs for Global Language Environments

Alejandro Hernández-Cano,Alexander Hägele,Allen Hao Huang,Angelika Romanou,Antoni-Joan Solergibert,Barna Pasztor,Bettina Messmer,Dhia Garbaya,Eduard Frank Ďurech,Ido Hakimi,Juan García Giraldo,Mete Ismayilzada,Negar Foroutan,Skander Moalla,Tiancheng Chen,Vinko Sabolčec,Yixuan Xu,Michael Aerni,Badr AlKhamissi,Ines Altemir Marinas,Mohammad Hossein Amani,Matin Ansaripour,Ilia Badanin,Harold Benoit,Emanuela Boros,Nicholas Browning,Fabian Bösch,Maximilian Böther,Niklas Canova,Camille Challier,Clement Charmillot,Jonathan Coles,Jan Deriu,Arnout Devos,Lukas Drescher,Daniil Dzenhaliou,Maud Ehrmann,Dongyang Fan,Simin Fan,Silin Gao,Miguel Gila,María Grandury,Diba Hashemi,Alexander Hoyle,Jiaming Jiang,Mark Klein,Andrei Kucharavy,Anastasiia Kucherenko,Frederike Lübeck,Roman Machacek,Theofilos Manitaras,Andreas Marfurt,Kyle Matoba,Simon Matrenok,Henrique Mendonça,Fawzi Roberto Mohamed,Syrielle Montariol,Luca Mouchel,Sven Najem-Meyer,Jingwei Ni,Gennaro Oliva,Matteo Pagliardini,Elia Palme,Andrei Panferov,Léo Paoletti,Marco Passerini,Ivan Pavlov,Auguste Poiroux,Kaustubh Ponkshe,Nathan Ranchin,Javi Rando,Mathieu Sauser,Jakhongir Saydaliev,Muhammad Ali Sayfiddinov,Marian Schneider,Stefano Schuppli,Marco Scialanga,Andrei Semenov,Kumar Shridhar,Raghav Singhal,Anna Sotnikova,Alexander Sternfeld,Ayush Kumar Tarun,Paul Teiletche,Jannis Vamvas,Xiaozhe Yao,Hao Zhao,Alexander Ilic,Ana Klimovic,Andreas Krause,Caglar Gulcehre,David Rosenthal,Elliott Ash,Florian Tramèr,Joost VandeVondele,Livio Veraldi,Martin Rajman,Thomas Schulthess,Torsten Hoefler,Antoine Bosselut,Martin Jaggi,Imanol Schlag

Main category: cs.CL

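TL;DR: Apertus is a fully open suite of large language models focused on data compliance and multilingual support. It is pretrained on openly available data that respects robots.txt, adopts the Goldfish objective to reduce memorization, and achieves strong results on multilingual benchmarks at the 8B and 70B scales.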

Details Motivation: To address insufficient data compliance and limited multilingual representation in the current open model ecosystem, ensuring that content-owner rights are respected and that non-English languages are better supported. Method: The models are pretrained only on openly available data, retroactively honoring robots.txt exclusions and filtering out non-permissive, toxic, and sensitive content, and they use the Goldfish objective to suppress verbatim memorization of training data. Training covers 15T tokens from more than 1800 languages, with about 40% non-English content. Result: Apertus matches or surpasses comparable open models on multilingual benchmarks, performing especially well on non-English tasks; both the 8B and 70B versions are competitive with the strongest existing fully open models. Conclusion: By being transparent, compliant, and reproducible, Apertus advances open large language models and sets a new standard for multilingual support and responsible data use. Abstract: We present Apertus, a fully open suite of large language models (LLMs) designed to address two systemic shortcomings in today's open model ecosystem: data compliance and multilingual representation. Unlike many prior models that release weights without reproducible data pipelines or regard for content-owner rights, Apertus models are pretrained exclusively on openly available data, retroactively respecting robots.txt exclusions and filtering for non-permissive, toxic, and personally identifiable content. To mitigate risks of memorization, we adopt the Goldfish objective during pretraining, strongly suppressing verbatim recall of data while retaining downstream task performance. The Apertus models also expand multilingual coverage, training on 15T tokens from over 1800 languages, with ~40% of pretraining data allocated to non-English content. Released at 8B and 70B scales, Apertus approaches state-of-the-art results among fully open models on multilingual benchmarks, rivalling or surpassing open-weight counterparts. Beyond model weights, we release all scientific artifacts from our development cycle with a permissive license, including data preparation scripts, checkpoints, evaluation suites, and training code, enabling transparent audit and extension.
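
The abstract names the Goldfish objective without describing it. As a rough illustration only, a Goldfish-style loss is usually described as excluding a pseudo-random subset of tokens from the next-token loss so that no passage is ever supervised in full. The sketch below assumes that description; the function names, the drop rate `k`, the hash, and the context width are hypothetical and are not taken from Apertus.

```python
def goldfish_mask(token_ids, k=4, context=3):
    """Return a list of 0/1 flags; 0 marks a token excluded from the loss.

    A token is dropped when a hash of its local context is divisible by k,
    so roughly 1 in k tokens is ignored and a repeated passage always gets
    the same mask. Both k and context are illustrative values.
    """
    mask = []
    for i in range(len(token_ids)):
        window = token_ids[max(0, i - context + 1): i + 1]
        h = 0
        for t in window:
            h = (h * 1_000_003 + t) % 2_147_483_647  # simple polynomial hash
        mask.append(0 if h % k == 0 else 1)
    return mask


def masked_nll(token_log_probs, mask):
    """Average negative log-likelihood over the kept tokens only."""
    kept = [-lp for lp, m in zip(token_log_probs, mask) if m == 1]
    return sum(kept) / len(kept) if kept else 0.0
```

Hashing the local context rather than the position is one way to keep the mask stable when the same passage appears in several documents, which is what stops the dropped tokens from being learned through a duplicate.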

cs.CV [Back]

[40] Proximity-Based Evidence Retrieval for Uncertainty-Aware Neural Networks

Hassan Gharoun,Mohammad Sadegh Khorshidi,Kasra Ranjbarigderi,Fang Chen,Amir H. Gandomi

Main category: cs.CV

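TL;DR: Proposes an evidence-retrieval mechanism for uncertainty-aware decision-making that replaces a fixed global threshold with an instance-adaptive one. The predictive distributions of nearby exemplars are fused into a per-instance belief threshold, improving the reliability and interpretability of decisions.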

Details Motivation: Fixed thresholds on prediction entropy tend to produce confidently incorrect outcomes in uncertainty-aware decision-making and lack transparency and auditability, so a more reliable and interpretable alternative is needed. Method: For each test instance, nearby exemplars are retrieved in an embedding space and their predictive distributions are fused with Dempster-Shafer theory; the resulting instance-dependent fused belief serves as a dynamic threshold for the decision. Result: Experiments on CIFAR-10/100 with BiT and ViT backbones show higher or comparable performance relative to prediction-entropy thresholding, with fewer confidently incorrect outcomes and a sustainable review load; a small amount of evidence is enough to obtain most of the gain. Conclusion: Evidence-conditioned tagging is a more reliable and interpretable alternative to fixed entropy thresholds for operational uncertainty-aware decision-making. Abstract: This work proposes an evidence-retrieval mechanism for uncertainty-aware decision-making that replaces a single global cutoff with an evidence-conditioned, instance-adaptive criterion. For each test instance, proximal exemplars are retrieved in an embedding space; their predictive distributions are fused via Dempster-Shafer theory. The resulting fused belief acts as a per-instance thresholding mechanism. Because the supporting evidence is explicit, decisions are transparent and auditable. Experiments on CIFAR-10/100 with BiT and ViT backbones show higher or comparable uncertainty-aware performance with materially fewer confidently incorrect outcomes and a sustainable review load compared with applying a threshold on prediction entropy. Notably, only a few pieces of evidence are sufficient to realize these gains; increasing the evidence set yields only modest changes. These results indicate that evidence-conditioned tagging provides a more reliable and interpretable alternative to fixed prediction entropy thresholds for operational uncertainty-aware decision-making.
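
To make the fusion step concrete, here is a minimal sketch of Dempster's rule of combination applied to two mass functions. The rule itself is standard; how the paper converts a neighbor's predictive distribution into a mass function is not stated in the abstract, so `mass_from_probs` and its `reliability` discount are assumptions made for this example.

```python
def mass_from_probs(probs, reliability=0.8):
    """Turn a class-probability dict into a mass function.

    Each class gets reliability * p as singleton mass; the remainder goes
    to the full frame (total ignorance). The discount factor is a
    hypothetical choice for this sketch.
    """
    frame = frozenset(probs)
    mass = {frozenset([c]): reliability * p for c, p in probs.items() if p > 0}
    mass[frame] = mass.get(frame, 0.0) + (1.0 - reliability)
    return mass


def combine(m1, m2):
    """Dempster's rule: multiply masses, keep non-empty intersections,
    and renormalize by the non-conflicting mass."""
    fused = {}
    conflict = 0.0
    for a, wa in m1.items():
        for b, wb in m2.items():
            inter = a & b
            if inter:
                fused[inter] = fused.get(inter, 0.0) + wa * wb
            else:
                conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    return {k: v / (1.0 - conflict) for k, v in fused.items()}


def fuse_neighbors(neighbor_probs, reliability=0.8):
    """Fold the mass functions of all retrieved neighbors into one belief."""
    fused = mass_from_probs(neighbor_probs[0], reliability)
    for probs in neighbor_probs[1:]:
        fused = combine(fused, mass_from_probs(probs, reliability))
    return fused
```

The mass left on the full frame after fusion is one natural uncertainty signal: when neighbors disagree or are individually unsure, less mass concentrates on any single class.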

[41] Hybrid Quantum-Classical Model for Image Classification

Muhammad Adnan Shahzad

Main category: cs.CV

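TL;DR: This study systematically compares hybrid quantum-classical neural networks with purely classical models on the MNIST, CIFAR100, and STL10 datasets. It finds that the hybrid models outperform classical CNNs in accuracy, training efficiency, and parameter usage, with larger advantages on more complex datasets.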

Details Motivation: To explore whether hybrid quantum-classical neural networks can surpass conventional classical models in performance, efficiency, and robustness, particularly by exploiting the potential of quantum computing for image classification. Method: Parameterized quantum circuits are embedded in classical deep learning architectures to build hybrid models, which are compared with standard convolutional neural networks (CNNs) on several benchmark datasets; the evaluation covers accuracy, training time, resource consumption, and adversarial robustness. Result: The hybrid models achieve clearly higher validation accuracy than the classical models (for example, +9.44% on CIFAR100 and +10.29% on STL10), train 5 to 12 times faster, use 6-32% fewer parameters, and consume less memory and CPU; they are more adversarially robust on simple datasets, but both model families are fragile on complex ones. Conclusion: Hybrid quantum-classical architectures show clear advantages in accuracy, training efficiency, and parameter scalability, particularly on complex vision tasks, indicating the potential of quantum-enhanced deep learning. Abstract: This study presents a systematic comparison between hybrid quantum-classical neural networks and purely classical models across three benchmark datasets (MNIST, CIFAR100, and STL10) to evaluate their performance, efficiency, and robustness. The hybrid models integrate parameterized quantum circuits with classical deep learning architectures, while the classical counterparts use conventional convolutional neural networks (CNNs). Experiments were conducted over 50 training epochs for each dataset, with evaluations on validation accuracy, test accuracy, training time, computational resource usage, and adversarial robustness (tested with $\epsilon=0.1$ perturbations). Key findings demonstrate that hybrid models consistently outperform classical models in final accuracy, achieving 99.38% (MNIST), 41.69% (CIFAR100), and 74.05% (STL10) validation accuracy, compared to classical benchmarks of 98.21%, 32.25%, and 63.76%, respectively. Notably, the hybrid advantage scales with dataset complexity, showing the most significant gains on CIFAR100 (+9.44%) and STL10 (+10.29%). Hybrid models also train 5 to 12 times faster (e.g., 21.23s vs. 108.44s per epoch on MNIST) and use 6-32% fewer parameters while maintaining superior generalization to unseen test data. Adversarial robustness tests reveal that hybrid models are significantly more resilient on simpler datasets (e.g., 45.27% robust accuracy on MNIST vs. 10.80% for classical) but show comparable fragility on complex datasets like CIFAR100 (about 1% robustness for both).
Resource efficiency analyses indicate that hybrid models consume less memory (4-5GB vs. 5-6GB for classical) and lower CPU utilization (9.5% vs. 23.2% on average). These results suggest that hybrid quantum-classical architectures offer compelling advantages in accuracy, training efficiency, and parameter scalability, particularly for complex vision tasks.

[42] Research on Expressway Congestion Warning Technology Based on YOLOv11-DIoU and GRU-Attention

Tong Yulin,Liang Xuechen

Main category: cs.CV

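TL;DR: This paper proposes an integrated technical framework that addresses key problems in expressway congestion perception and prediction. Improved YOLOv11 and DeepSort algorithms raise vehicle detection accuracy, and a GRU-Attention model delivers high-accuracy congestion warnings.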

Details Motivation: Existing congestion detection and prediction systems perceive vehicles poorly under occlusion and struggle to capture long-sequence dependencies, which degrades prediction quality. Method: For traffic flow perception, YOLOv11 is upgraded to YOLOv11-DIoU and the DeepSort algorithm is optimized; for congestion warning, a GRU-Attention model is trained on flow, density, and speed data. Result: YOLOv11-DIoU reaches 95.7% mAP, DeepSort reaches 93.8% MOTA, and the GRU-Attention model reaches 99.7% test accuracy; 10-minute advance warnings have a time error of at most 1 minute, and validation on an independent video gives 95% warning accuracy. Conclusion: The framework improves traffic perception accuracy and congestion prediction performance, provides quantitative support for expressway congestion control, and has promising applications in intelligent transportation. Abstract: Expressway traffic congestion severely reduces travel efficiency and hinders regional connectivity. Existing "detection-prediction" systems have critical flaws: low vehicle perception accuracy under occlusion and loss of long-sequence dependencies in congestion forecasting. This study proposes an integrated technical framework to resolve these issues. For traffic flow perception, two baseline algorithms were optimized. Traditional YOLOv11 was upgraded to YOLOv11-DIoU by replacing GIoU Loss with DIoU Loss, and DeepSort was improved by fusing Mahalanobis (motion) and cosine (appearance) distances. Experiments on Chang-Shen Expressway videos showed YOLOv11-DIoU achieved 95.7% mAP (6.5 percentage points higher than baseline) with a 5.3% occlusion miss rate. DeepSort reached 93.8% MOTA (11.3 percentage points higher than SORT) with only 4 ID switches. Using the Greenberg model (for 10-15 vehicles/km high-density scenarios), speed and density showed a strong negative correlation (r=-0.97), conforming to traffic flow theory. For congestion warning, a GRU-Attention model was built to capture congestion precursors. Trained for 300 epochs with flow, density, and speed, it achieved 99.7% test accuracy (7-9 percentage points higher than traditional GRU). In 10-minute advance warnings for 30-minute congestion, the time error was at most 1 minute.
Validation with an independent video showed 95% warning accuracy, over 90% spatial overlap of congestion points, and stable performance in high-flow (more than 5 vehicles/second) scenarios. This framework provides quantitative support for expressway congestion control, with promising intelligent transportation applications.
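
The abstract's central detection change is swapping GIoU Loss for DIoU Loss. As a minimal sketch of the standard DIoU formulation (not the authors' training code), the loss is 1 - IoU plus the squared distance between box centers divided by the squared diagonal of the smallest enclosing box. Boxes are assumed to be `(x1, y1, x2, y2)` tuples; the function name is hypothetical.

```python
def diou_loss(box_a, box_b):
    """Distance-IoU loss for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Plain IoU from the intersection rectangle.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union if union > 0 else 0.0

    # Squared distance between the two box centers.
    center_dist_sq = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
        + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2

    # Squared diagonal of the smallest box enclosing both inputs.
    diag_sq = (max(ax2, bx2) - min(ax1, bx1)) ** 2 \
        + (max(ay2, by2) - min(ay1, by1)) ** 2

    penalty = center_dist_sq / diag_sq if diag_sq > 0 else 0.0
    return 1.0 - iou + penalty
```

Unlike plain IoU, the center-distance term still gives a useful signal when the boxes do not overlap, which may be part of why it helps with partially occluded vehicles.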

[43] Parking Space Ground Truth Test Automation by Artificial Intelligence Using Convolutional Neural Networks

Tony Rohe,Martin Margreiter,Markus Moertl

Main category: cs.CV

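TL;DR: This study uses convolutional neural networks to automate the test process for a real-time on-street parking service based on in-vehicle ultrasonic sensor data, sharply reducing manual involvement: human working time falls by up to 99.58%.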

Details Motivation: To improve the quality of a real-time parking information service built on crowd-sourced vehicle data, the existing manual ground-truth test process needs to be optimized to lower labor costs and raise efficiency. Method: Machine learning methods, in particular convolutional neural networks, perform image pattern recognition on crowd-sourced ultrasonic sensor data, automating data analysis and database enrichment and replacing manual analysis steps. Result: The test analysis process becomes highly automated; the evaluation metrics show that human working time falls by up to 99.58%, greatly improving test efficiency and system scalability. Conclusion: The automation markedly improves the efficiency and accuracy of parking service testing, has good application prospects, and could be extended to data analysis tasks in other intelligent transportation systems. Abstract: This research is part of a study of a real-time, cloud-based on-street parking service using crowd-sourced in-vehicle fleet data. The service provides real-time information about available parking spots by classifying crowd-sourced detections observed via ultrasonic sensors. The goal of this research is to optimize the current parking service quality by analyzing the automation of the existing test process for ground truth tests. Therefore, methods from the field of machine learning, especially image pattern recognition, are applied to enrich the database and substitute human engineering work in major areas of the analysis process. After an introduction to the related areas of machine learning, this paper explains the methods and implementations made to achieve a high level of automation, applying convolutional neural networks. Finally, predefined metrics present the performance level achieved, showing a reduction in human working time of up to 99.58%. The overall improvements are discussed, summarized, and followed by an outlook for future development and potential application of the analysis automation tool.

[44] An Empirical Analysis of VLM-based OOD Detection: Mechanisms, Advantages, and Sensitivity

Yuxiao Lee,Xiaofeng Cao,Wei Ye,Jiangchao Yao,Jingkuan Song,Heng Tao Shen

Main category: cs.CV

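TL;DR: This paper systematically analyzes the mechanisms, advantages, and sensitivity of vision-language models (VLMs) in zero-shot out-of-distribution (OOD) detection, showing that they benefit from semantic novelty but are highly sensitive to prompt phrasing.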

Details Motivation: Although VLMs perform well in zero-shot OOD detection, the reasons for their effectiveness, their advantages over single-modal methods, and their behavioral robustness are still poorly understood. Method: Using ID and OOD prompts, the authors empirically analyze operational properties of the VLM embedding space, quantify performance relative to single-modal methods, and evaluate robustness to image noise and prompt variation. Result: The study identifies the key mechanism by which VLMs exploit semantic novelty, demonstrates their advantage over single-modal methods, and reveals their high sensitivity to prompt phrasing. Conclusion: VLMs have clear advantages for OOD detection, but their prompt sensitivity is a potential weakness that future designs should account for to improve robustness and reliability. Abstract: Vision-Language Models (VLMs), such as CLIP, have demonstrated remarkable zero-shot out-of-distribution (OOD) detection capabilities, vital for reliable AI systems. Despite this promising capability, a comprehensive understanding of (1) why they work so effectively, (2) what advantages they have over single-modal methods, and (3) how robust their behavior is remains notably incomplete within the research community. This paper presents a systematic empirical analysis of VLM-based OOD detection using in-distribution (ID) and OOD prompts. (1) Mechanisms: We systematically characterize and formalize key operational properties within the VLM embedding space that facilitate zero-shot OOD detection. (2) Advantages: We empirically quantify the superiority of these models over established single-modal approaches, attributing this distinct advantage to the VLM's capacity to leverage rich semantic novelty. (3) Sensitivity: We uncover a significant and previously under-explored asymmetry in their robustness profile: while exhibiting resilience to common image noise, these VLM-based methods are highly sensitive to prompt phrasing. Our findings contribute a more structured understanding of the strengths and critical vulnerabilities inherent in VLM-based OOD detection, offering crucial, empirically-grounded guidance for developing more robust and reliable future designs.
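
For readers unfamiliar with prompt-based OOD scoring, the following sketch shows one common way such a score can be computed from precomputed embeddings: a temperature-scaled softmax over cosine similarities to ID and OOD prompt embeddings, with the ID score being the probability mass on the ID prompts. This is an assumed, generic scoring rule, not necessarily the one analyzed in the paper; the function names and the temperature are hypothetical, and the embeddings would come from a VLM such as CLIP in practice.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def id_score(image_emb, id_prompt_embs, ood_prompt_embs, temperature=0.01):
    """Probability mass assigned to ID prompts; low values suggest OOD."""
    prompts = list(id_prompt_embs) + list(ood_prompt_embs)
    logits = [cosine(image_emb, p) / temperature for p in prompts]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    return sum(exps[:len(id_prompt_embs)]) / sum(exps)
```

Because the score depends entirely on where the prompt embeddings sit, rewording a prompt moves the decision boundary, which is consistent with the prompt sensitivity the abstract reports.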

[45] Curvature as a tool for evaluating dimensionality reduction and estimating intrinsic dimension

Charlotte Beylier,Parvaneh Joharinad,Jürgen Jost,Nahid Torbati

Main category: cs.CV

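TL;DR: Proposes a method for constructing a curvature-based geometric profile of discrete metric spaces and uses it to evaluate the effectiveness of data representations and to estimate the intrinsic dimension of datasets.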

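Details Motivation: To use recently developed notions of sectional curvature to capture the metric relations between triples of points and other points in a discrete metric space, and thereby better understand the structure of data. Method: A curvature profile is built from the new sectional curvature concept, and a quantitative measure based on it evaluates data representations such as those produced by dimensionality reduction; the same analysis estimates the intrinsic dimension of datasets. Result: Experiments show that the curvature-based analysis can estimate the intrinsic dimension of datasets and can be used to evaluate dimensionality reduction techniques and to explore the large-scale geometry of empirical networks. Conclusion: The curvature-based method is a new tool for analyzing discrete metric spaces, with potential applications in evaluating data representations and understanding network geometry. Abstract: Utilizing recently developed abstract notions of sectional curvature, we introduce a method for constructing a curvature-based geometric profile of discrete metric spaces. The curvature concept that we use here captures the metric relations between triples of points and other points. More significantly, based on this curvature profile, we introduce a quantitative measure to evaluate the effectiveness of data representations, such as those produced by dimensionality reduction techniques. Furthermore, our experiments demonstrate that this curvature-based analysis can be employed to estimate the intrinsic dimensionality of datasets. We use this to explore the large-scale geometry of empirical networks and to evaluate the effectiveness of dimensionality reduction techniques.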

[46] Landcover classification and change detection using remote sensing and machine learning: a case study of Western Fiji

Yadvendra Gurjar,Ruoni Wan,Ehsan Farahbakhsh,Rohitash Chandra

Main category: cs.CV

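TL;DR: This study uses machine learning and remote sensing to analyze land use and land cover change in Nadi, Fiji, from 2013 to 2024, with the aim of providing technical support for monitoring urbanisation.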

Details Motivation: Fiji, a developing country, is urbanising rapidly and needs effective technical means of monitoring land change to support planning and management. Method: Landsat-8 satellite imagery is processed on the Google Earth Engine platform; k-means clustering generates the land cover map, and convolutional neural networks perform supervised classification and change detection. Result: The study produces a visualisation of land cover change in the Nadi area that clearly shows the dynamics of urban expansion. Conclusion: The framework can provide effective technical support for monitoring urbanisation in developing countries and could be applied more widely. Abstract: As a developing country, Fiji is facing rapid urbanisation, which is visible in the massive development projects that include housing, roads, and civil works. In this study, we present machine learning and remote sensing frameworks to compare land use and land cover change from 2013 to 2024 in Nadi, Fiji. The ultimate goal of this study is to provide technical support in land cover/land use modelling and change detection. We used Landsat-8 satellite imagery for the study region and created our training dataset with labels for supervised machine learning. We used Google Earth Engine and unsupervised machine learning via k-means clustering to generate the land cover map. We used convolutional neural networks to classify the selected regions' land cover types. We present a visualisation of change detection, highlighting urban area changes over time to monitor changes in the map.
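
Once two classified maps exist, the change-detection step can be as simple as counting per-pixel class transitions. The sketch below illustrates that idea on tiny hand-written label grids; the class names, the function names, and the grid layout are hypothetical and do not reflect the authors' pipeline or data.

```python
from collections import Counter


def change_matrix(labels_before, labels_after):
    """Count pixel transitions between two classified maps of equal shape."""
    transitions = Counter()
    for row_before, row_after in zip(labels_before, labels_after):
        for before, after in zip(row_before, row_after):
            transitions[(before, after)] += 1
    return transitions


def changed_fraction(transitions):
    """Share of pixels whose class differs between the two dates."""
    total = sum(transitions.values())
    changed = sum(n for (before, after), n in transitions.items() if before != after)
    return changed / total if total else 0.0
```

A transition such as `("vegetation", "urban")` with a large count is the kind of signal a change map would highlight as urban expansion.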

[47] Real-Time Detection and Tracking of Foreign Object Intrusions in Power Systems via Feature-Based Edge Intelligence

Xinan Wang,Di Shi,Fengyu Wang

Main category: cs.CV

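TL;DR: Proposes a three-stage framework for real-time detection and tracking of foreign object intrusions in power transmission systems, combining YOLOv7 segmentation, ConvNeXt feature extraction, and feature-assisted IoU tracking, with support for edge deployment and incremental updates.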

Details Motivation: To address the real-time performance, accuracy, and scalability of foreign object intrusion detection in power transmission systems under complex conditions, in particular multi-object tracking under occlusion and motion. Method: A three-stage framework: (1) YOLOv7 for fast object localization; (2) a ConvNeXt-based feature extractor trained with triplet loss to produce discriminative embeddings; (3) a feature-assisted IoU tracker for more robust tracking. Mixed-precision inference optimizes edge deployment. Result: The framework shows high accuracy and strong robustness on real-world surveillance and drone video datasets, and hardware tests on NVIDIA Jetson devices confirm its efficiency and scalability for edge computing. Conclusion: The framework achieves accurate detection and stable tracking of foreign objects at real-time speed, supports incremental updates without retraining, and suits large-scale field deployment on low-cost edge devices. Abstract: This paper presents a novel three-stage framework for real-time foreign object intrusion (FOI) detection and tracking in power transmission systems. The framework integrates: (1) a YOLOv7 segmentation model for fast and robust object localization, (2) a ConvNeXt-based feature extractor trained with triplet loss to generate discriminative embeddings, and (3) a feature-assisted IoU tracker that ensures resilient multi-object tracking under occlusion and motion. To enable scalable field deployment, the pipeline is optimized for deployment on low-cost edge hardware using mixed-precision inference. The system supports incremental updates by adding embeddings from previously unseen objects into a reference database without requiring model retraining. Extensive experiments on real-world surveillance and drone video datasets demonstrate the framework's high accuracy and robustness across diverse FOI scenarios. In addition, hardware benchmarks on NVIDIA Jetson devices confirm the framework's practicality and scalability for real-world edge applications.
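
The incremental-update idea (adding embeddings of new object types to a reference database instead of retraining) can be sketched as a nearest-neighbor lookup over stored embeddings. The class name, the cosine-similarity threshold, and the `"unknown"` label below are assumptions made for illustration; in the described system the embeddings would come from the ConvNeXt extractor.

```python
import math


class ReferenceDatabase:
    """Label objects by nearest stored embedding; new classes need no retraining."""

    def __init__(self, threshold=0.8):
        self.entries = []           # list of (label, unit-norm embedding)
        self.threshold = threshold  # minimum cosine similarity for a match

    @staticmethod
    def _normalize(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        return [x / norm for x in vector]

    def add(self, label, embedding):
        """Register an embedding for a (possibly new) object type."""
        self.entries.append((label, self._normalize(embedding)))

    def identify(self, embedding):
        """Return the best-matching label, or 'unknown' below the threshold."""
        query = self._normalize(embedding)
        best_label, best_sim = "unknown", self.threshold
        for label, ref in self.entries:
            sim = sum(a * b for a, b in zip(query, ref))
            if sim >= best_sim:
                best_label, best_sim = label, sim
        return best_label
```

For example, an object that returns `"unknown"` can be labeled by an operator and added with `add`, after which the same embedding resolves to the new label on the next query.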

[48] EdiVal-Agent: An Object-Centric Framework for Automated, Scalable, Fine-Grained Evaluation of Multi-Turn Editing

Tianyu Chen,Yasi Zhang,Zhi Zhang,Peiyu Yu,Shu Wang,Zhendong Wang,Kevin Lin,Xiaofei Wang,Zhengyuan Yang,Linjie Li,Chung-Ching Lin,Jianwen Xie,Oscar Leong,Lijuan Wang,Ying Nian Wu,Mingyuan Zhou

Main category: cs.CV

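TL;DR: This paper proposes EdiVal-Agent, an automated, scalable, and fine-grained evaluation framework for multi-turn instruction-based image editing. It combines vision-language models with expert tools such as object detectors to evaluate edits from an object-centric perspective more reliably and interpretably, and it introduces the EdiVal-Bench benchmark covering 9 instruction types and 11 state-of-the-art editing models.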

Details Motivation: Existing image editing evaluation relies either on paired reference images or on zero-shot vision-language models, which leads to limited coverage, biases inherited from generative models, or imprecise assessments; a reliable and interpretable evaluation method is lacking. Method: EdiVal-Agent first decomposes an image into semantic objects and generates diverse, context-aware editing instructions. For evaluation, it combines vision-language models with open-vocabulary object detectors to assess instruction following, uses semantic-level feature extractors to assess content consistency, and uses human preference models to judge visual quality. Result: Experiments show that VLMs combined with object detectors agree with human judgments more closely than VLMs alone or CLIP-based metrics; EdiVal-Bench covers 9 instruction types and 11 mainstream editing models and identifies failure modes of existing models. Conclusion: Through its modular design, EdiVal-Agent delivers more accurate and scalable image editing evaluation, supports integration of future tools, and helps drive the development of next-generation editing models. Abstract: Instruction-based image editing has advanced rapidly, yet reliable and interpretable evaluation remains a bottleneck. Current protocols either (i) depend on paired reference images -- resulting in limited coverage and inheriting biases from prior generative models -- or (ii) rely solely on zero-shot vision-language models (VLMs), whose prompt-based assessments of instruction following, content consistency, and visual quality are often imprecise. To address this, we introduce EdiVal-Agent, an automated, scalable, and fine-grained evaluation framework for multi-turn instruction-based editing from an object-centric perspective, supported by a suite of expert tools. Given an image, EdiVal-Agent first decomposes it into semantically meaningful objects, then synthesizes diverse, context-aware editing instructions. For evaluation, it integrates VLMs with open-vocabulary object detectors to assess instruction following, uses semantic-level feature extractors to evaluate content consistency, and leverages human preference models to judge visual quality. We show that combining VLMs with object detectors yields stronger agreement with human judgments in instruction-following evaluation compared to using VLMs alone and CLIP-based metrics. Furthermore, the pipeline's modular design allows future tools to be seamlessly integrated, enhancing evaluation accuracy over time.
Instantiating this pipeline, we build EdiVal-Bench, a multi-turn editing benchmark covering 9 instruction types and 11 state-of-the-art editing models spanning autoregressive (AR) (including Nano Banana, GPT-Image-1), flow-matching, and diffusion paradigms. We demonstrate that EdiVal-Agent can be used to identify existing failure modes, thereby informing the development of the next generation of editing models. Project page: https://tianyucodings.github.io/EdiVAL-page/.

[49] MapAnything: Universal Feed-Forward Metric 3D Reconstruction

Nikhil Keetha,Norman Müller,Johannes Schönberger,Lorenzo Porzi,Yuchen Zhang,Tobias Fischer,Arno Knapitsch,Duncan Zauss,Ethan Weber,Nelson Antunes,Jonathon Luiten,Manuel Lopez-Antequera,Samuel Rota Bulò,Christian Richardt,Deva Ramanan,Sebastian Scherer,Peter Kontschieder

Main category: cs.CV

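TL;DR: MapAnything is a unified transformer-based feed-forward model that handles a range of 3D vision tasks in a single forward pass, using a factored representation to produce globally consistent metric reconstructions.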

Details Motivation: Existing 3D vision tasks usually require specialized models, and no general framework handles them jointly; the authors therefore aim to build a unified model that accepts varied inputs and tasks. Method: MapAnything uses a transformer-based feed-forward architecture whose inputs are images plus optional geometric information (such as camera intrinsics, poses, and depth). Through a factored scene representation (depth maps, ray maps, camera poses, and a scale factor), it directly regresses metric 3D scene geometry and camera parameters. Result: The model performs strongly on multiple 3D vision tasks (such as uncalibrated SfM, multi-view stereo, and monocular depth estimation), outperforming or matching specialist feed-forward models while supporting efficient joint training. Conclusion: MapAnything unifies multi-task 3D reconstruction in one model and shows potential as a universal 3D reconstruction backbone. Abstract: We introduce MapAnything, a unified transformer-based feed-forward model that ingests one or more images along with optional geometric inputs such as camera intrinsics, poses, depth, or partial reconstructions, and then directly regresses the metric 3D scene geometry and cameras. MapAnything leverages a factored representation of multi-view scene geometry, i.e., a collection of depth maps, local ray maps, camera poses, and a metric scale factor that effectively upgrades local reconstructions into a globally consistent metric frame. Standardizing the supervision and training across diverse datasets, along with flexible input augmentation, enables MapAnything to address a broad range of 3D vision tasks in a single feed-forward pass, including uncalibrated structure-from-motion, calibrated multi-view stereo, monocular depth estimation, camera localization, depth completion, and more. We provide extensive experimental analyses and model ablations demonstrating that MapAnything outperforms or matches specialist feed-forward models while offering more efficient joint training behavior, thus paving the way toward a universal 3D reconstruction backbone.
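
To illustrate how the four factors named in the abstract could combine into one metric point, the sketch below assumes a specific composition: the depth is measured along a unit ray in the camera frame, the pose maps camera coordinates to an up-to-scale world frame, and the metric scale multiplies the result. This ordering and the function name are assumptions for illustration; the paper may define the factors differently.

```python
def to_metric_world(depth, ray, rotation, translation, metric_scale):
    """Compose depth, ray, pose, and scale into one metric 3D point.

    depth        : distance along the ray (up-to-scale units, assumed)
    ray          : unit direction in the camera frame, length 3
    rotation     : 3x3 camera-to-world rotation as nested lists
    translation  : camera position in the up-to-scale world frame
    metric_scale : single factor converting the scene to metres
    """
    local = [depth * r for r in ray]  # point in the camera frame
    world = [
        sum(rotation[i][j] * local[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    return [metric_scale * w for w in world]
```

Keeping the scale as a separate scalar means every per-view quantity can be predicted up to scale, with one number responsible for making the whole scene metric.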

[50] Semantic-Enhanced Cross-Modal Place Recognition for Robust Robot Localization

Yujia Lin,Nicholas Evans

Main category: cs.CV

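TL;DR: Proposes SCM-PR, a semantic-enhanced cross-modal place recognition framework that combines semantic and geometric information from RGB images and LiDAR maps to achieve more robust localization in complex scenes.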

Details Motivation: Existing RGB-based visual localization methods are sensitive to changes in illumination and weather, and current cross-modal methods perform poorly in complex scenes, high-resolution matching, and viewpoint changes, so robustness and accuracy need to improve. Method: The framework uses a VMamba backbone to extract RGB features, a Semantic-Aware Feature Fusion (SAFF) module, LiDAR descriptors that combine semantics and geometry, and a cross-modal semantic attention mechanism in NetVLAD; Multi-View Semantic-Geometric Matching and a Semantic Consistency Loss are used for contrastive learning. Result: Experiments on the KITTI and KITTI-360 datasets show that SCM-PR achieves state-of-the-art performance on cross-modal place recognition. Conclusion: By fusing high-level semantics with geometric information, SCM-PR improves the robustness and accuracy of cross-modal localization in complex environments and outperforms existing methods. Abstract: Ensuring accurate localization of robots in environments without GPS capability is a challenging task. Visual Place Recognition (VPR) techniques can potentially achieve this goal, but existing RGB-based methods are sensitive to changes in illumination, weather, and other seasonal changes. Existing cross-modal localization methods leverage the geometric properties of RGB images and 3D LiDAR maps to reduce the sensitivity issues highlighted above. Currently, state-of-the-art methods struggle in complex scenes, fine-grained or high-resolution matching, and situations where the viewpoint changes. In this work, we introduce a framework we call Semantic-Enhanced Cross-Modal Place Recognition (SCM-PR) that combines high-level semantics utilizing RGB images for robust localization in LiDAR maps. Our proposed method introduces: a VMamba backbone for feature extraction of RGB images; a Semantic-Aware Feature Fusion (SAFF) module for using both place descriptors and segmentation masks; LiDAR descriptors that incorporate both semantics and geometry; and a cross-modal semantic attention mechanism in NetVLAD to improve matching. Incorporating the semantic information was also instrumental in designing a Multi-View Semantic-Geometric Matching and a Semantic Consistency Loss, both in a contrastive learning framework. Our experimental work on the KITTI and KITTI-360 datasets shows that SCM-PR achieves state-of-the-art performance compared to other cross-modal place recognition methods.

[51] Improving 3D Gaussian Splatting Compression by Scene-Adaptive Lattice Vector Quantization

Hao Xu,Xiaolin Wu,Xi Zhang

Main category: cs.CV

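TL;DR: Proposes scene-adaptive lattice vector quantization (SALVQ) to improve compression of 3D Gaussian Splatting (3DGS) data. It raises rate-distortion efficiency substantially while keeping complexity low, and it lets a single model serve multiple bit rates.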

Details Motivation: Existing 3DGS compression methods mostly use uniform scalar quantization (USQ), which is simple but limited in efficiency; the work explores a more efficient quantizer with low overhead to improve compression performance. Method: USQ is replaced with lattice vector quantization (LVQ), and the lattice basis is optimized for each scene to obtain scene-adaptive LVQ (SALVQ); scaling the lattice basis vectors adjusts lattice density dynamically and supports multiple bit rates. Result: SALVQ improves rate-distortion performance across several 3DGS compression architectures with small computational overhead, and a single model supports multiple bit rates, reducing training time and memory consumption. Conclusion: SALVQ is an efficient, flexible, and easily integrated quantization method that markedly improves the rate-distortion efficiency of 3DGS compression and has practical deployment advantages. Abstract: 3D Gaussian Splatting (3DGS) is rapidly gaining popularity for its photorealistic rendering quality and real-time performance, but it generates massive amounts of data. Hence compressing 3DGS data is necessary for the cost effectiveness of 3DGS models. Recently, several anchor-based neural compression methods have been proposed, achieving good 3DGS compression performance. However, they all rely on uniform scalar quantization (USQ) due to its simplicity. A tantalizing question is whether more sophisticated quantizers can improve the current 3DGS compression methods with very little extra overhead and minimal change to the system. The answer is yes: replace USQ with lattice vector quantization (LVQ). To better capture scene-specific characteristics, we optimize the lattice basis for each scene, improving LVQ's adaptability and R-D efficiency. This scene-adaptive LVQ (SALVQ) strikes a balance between the R-D efficiency of vector quantization and the low complexity of USQ. SALVQ can be seamlessly integrated into existing 3DGS compression architectures, enhancing their R-D performance with minimal modifications and computational overhead. Moreover, by scaling the lattice basis vectors, SALVQ can dynamically adjust lattice density, enabling a single model to accommodate multiple bit rate targets. This flexibility eliminates the need to train separate models for different compression levels, significantly reducing training time and memory consumption.
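
A small two-dimensional sketch may help show what quantizing to a lattice and scaling its basis mean in practice. It uses Babai rounding (express the input in lattice coordinates, round, map back), which is a simple approximation to the nearest lattice point and is exact for orthogonal bases. The function names are hypothetical, and the paper's actual quantizer and learned bases may differ.

```python
def quantize(x, basis):
    """Quantize a 2D point to the lattice spanned by the columns of basis.

    basis is [[a, b], [c, d]] with lattice vectors (a, c) and (b, d).
    Returns the reconstructed point and its integer lattice coordinates.
    """
    (a, b), (c, d) = basis
    det = a * d - b * c
    # Lattice coordinates: inverse(basis) applied to x.
    u = (d * x[0] - b * x[1]) / det
    v = (-c * x[0] + a * x[1]) / det
    ku, kv = round(u), round(v)  # Babai rounding
    return [a * ku + b * kv, c * ku + d * kv], (ku, kv)


def scale_basis(basis, s):
    """Shrink (s < 1) or stretch (s > 1) the lattice to change its density."""
    return [[s * entry for entry in row] for row in basis]
```

With the identity basis this reduces to uniform scalar quantization with step 1; `scale_basis(basis, 0.5)` gives a denser lattice (higher rate, lower distortion) without changing the basis shape, which mirrors the single-model multi-rate idea in the abstract.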

[52] MINGLE: VLMs for Semantically Complex Region Detection in Urban Scenes

Liu Liu,Alexandra Kudaeva,Marco Cipriano,Fatimeh Al Ghannam,Freya Tan,Gerard de Melo,Andres Sevtsuk

Main category: cs.CV

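TL;DR: This paper proposes MINGLE, a three-stage framework for detecting socially interacting groups in street-view images. It combines vision-language models with a spatial aggregation algorithm to localize abstract interpersonal relations, and it releases a large-scale dataset of 100K annotated images.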

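Details Motivation: Understanding group social interaction in public spaces matters for urban planning, but existing methods struggle to capture complex visual cues such as interpersonal relations, proximity, and co-movement; a new task and model are needed to detect group regions defined by abstract social relations. Method: The MINGLE framework has three stages: the first uses off-the-shelf human detection and depth estimation; the second uses a vision-language model (VLM) to classify pairwise social affiliation; the third uses a lightweight spatial aggregation algorithm to cluster affiliated individuals into social groups. The authors also build a dataset of 100K street-view images that combines human annotations with model-generated labels. Result: The approach localizes socially interacting groups in urban street views and identifies group structure based on abstract social relations; the new dataset covers a wide range of real-world scenes and supports further research on social interaction analysis. Conclusion: MINGLE is an effective approach to understanding social group behavior from visual data, and the proposed task, model, and dataset advance computer vision for social perception and intelligent urban planning. Abstract: Understanding group-level social interactions in public spaces is crucial for urban planning, informing the design of socially vibrant and inclusive environments. Detecting such interactions from images involves interpreting subtle visual cues such as relations, proximity, and co-movement, which are semantically complex signals that go beyond traditional object detection. To address this challenge, we introduce a social group region detection task, which requires inferring and spatially grounding visual regions defined by abstract interpersonal relations. We propose MINGLE (Modeling INterpersonal Group-Level Engagement), a modular three-stage pipeline that integrates: (1) off-the-shelf human detection and depth estimation, (2) VLM-based reasoning to classify pairwise social affiliation, and (3) a lightweight spatial aggregation algorithm to localize socially connected groups. To support this task and encourage future research, we present a new dataset of 100K urban street-view images annotated with bounding boxes and labels for both individuals and socially interacting groups. The annotations combine human-created labels and outputs from the MINGLE pipeline, ensuring semantic richness and broad coverage of real-world scenarios.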
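
The third stage, turning pairwise affiliation decisions into group regions, can be sketched with a union-find pass followed by a bounding-box union. This is one plausible reading of "lightweight spatial aggregation" rather than the authors' algorithm; the function name and the `(x1, y1, x2, y2)` box format are assumptions.

```python
def group_people(boxes, affiliated_pairs):
    """Merge pairwise affiliations into groups and return each group's box.

    boxes            : list of (x1, y1, x2, y2), one per detected person
    affiliated_pairs : list of (i, j) index pairs judged socially affiliated
    Returns a list of (member_indices, enclosing_box); singletons are skipped.
    """
    parent = list(range(len(boxes)))

    def find(i):
        # Path-halving lookup of the set representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in affiliated_pairs:
        parent[find(i)] = find(j)

    members = {}
    for i in range(len(boxes)):
        members.setdefault(find(i), []).append(i)

    groups = []
    for ids in members.values():
        if len(ids) < 2:
            continue  # a lone person is not a social group
        xs1, ys1, xs2, ys2 = zip(*(boxes[i] for i in ids))
        groups.append((ids, (min(xs1), min(ys1), max(xs2), max(ys2))))
    return groups
```

Because affiliation is merged transitively, three people linked as A-B and B-C end up in one region even if A and C were never classified as a pair.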

[53] BiasMap: Leveraging Cross-Attentions to Discover and Mitigate Hidden Social Biases in Text-to-Image Generation

Rajatsubhra Chakraborty,Xujun Che,Depeng Xu,Cori Faklaris,Xi Niu,Shuhan Yuan

Main category: cs.CV

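TL;DR: Proposes BiasMap, a model-agnostic framework that uncovers latent concept-level representational biases in Stable Diffusion models and mitigates them through energy-guided diffusion sampling.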

Details Motivation: Existing work focuses mainly on output-level demographic distributions, which cannot guarantee that concept representations are disentangled after mitigation, so a deeper method is needed to discover and address concept-level representational bias. Method: Cross-attention attribution maps reveal structural entanglement between demographics (e.g., gender, race) and semantics (e.g., professions), and Intersection over Union (IoU) quantifies the spatial concept entanglement; energy-guided diffusion sampling then modifies the latent noise space to reduce SoftIoU. Result: Experiments show that existing fairness interventions may narrow the output distribution gap but often fail to remove concept-level coupling, whereas the proposed method mitigates concept entanglement in image generation while complementing distributional bias mitigation. Conclusion: BiasMap reveals and mitigates concept-level representational bias in Stable Diffusion models, offering a path to fairness evaluation and improvement beyond traditional output-distribution analysis. Abstract: Bias discovery is critical for black-box generative models, especially text-to-image (TTI) models. Existing works predominantly focus on output-level demographic distributions, which do not necessarily guarantee concept representations to be disentangled post-mitigation. We propose BiasMap, a model-agnostic framework for uncovering latent concept-level representational biases in stable diffusion models. BiasMap leverages cross-attention attribution maps to reveal structural entanglements between demographics (e.g., gender, race) and semantics (e.g., professions), going deeper into representational bias during the image generation. Using attribution maps of these concepts, we quantify the spatial demographics-semantics concept entanglement via Intersection over Union (IoU), offering a lens into bias that remains hidden in existing fairness discovery approaches. In addition, we further utilize BiasMap for bias mitigation through energy-guided diffusion sampling that directly modifies latent noise space and minimizes the expected SoftIoU during the denoising process. Our findings show that existing fairness interventions may reduce the output distributional gap but often fail to disentangle concept-level coupling, whereas our mitigation method can mitigate concept entanglement in image generation while complementing distributional bias mitigation.
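To make the overlap measure concrete, the sketch below computes a thresholded IoU and a differentiable-style soft IoU between two attribution maps given as nested lists of values in [0, 1]. The abstract does not define SoftIoU, so the formula used here (sum of products over sum of values minus sum of products) is an assumed, commonly used relaxation, and the function names and the 0.5 threshold are illustrative.

```python
def hard_iou(map_a, map_b, threshold=0.5):
    """IoU of the two maps after binarizing each at `threshold`."""
    inter = union = 0
    for row_a, row_b in zip(map_a, map_b):
        for a, b in zip(row_a, row_b):
            in_a, in_b = a >= threshold, b >= threshold
            inter += in_a and in_b
            union += in_a or in_b
    return inter / union if union else 0.0


def soft_iou(map_a, map_b):
    """Soft IoU: products stand in for intersection, so no thresholding.

    Assumed relaxation, not necessarily the definition used in the paper.
    """
    prod = total_a = total_b = 0.0
    for row_a, row_b in zip(map_a, map_b):
        for a, b in zip(row_a, row_b):
            prod += a * b
            total_a += a
            total_b += b
    denom = total_a + total_b - prod
    return prod / denom if denom else 0.0


# Hypothetical attribution maps for a demographic token and a profession token.
gender_map = [[0.9, 0.8], [0.1, 0.0]]
profession_map = [[0.7, 0.2], [0.6, 0.0]]
entanglement_hard = hard_iou(gender_map, profession_map)
entanglement_soft = soft_iou(gender_map, profession_map)
```

A high value means the two concepts attend to the same image regions; a mitigation objective of the kind the abstract describes would push the soft variant down during denoising.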

[54] LivePyxel: Accelerating image annotations with a Python-integrated webcam live streaming

Uriel Garcilazo-Cruz,Joseph O. Okeme,Rodrigo A. Vargas--Hernández

Main category: cs.CV

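TL;DR: This paper introduces LivePyxel, a Python-based graphical user interface tool that integrates with imaging systems such as microscopes and supports real-time image annotation, speeding up AI model development in scientific experiments.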

Details Motivation: Existing image annotation tools usually require users to upload a dataset in advance, which limits on-demand pipelines and is especially inconvenient where images are acquired in real time in the laboratory. Method: The authors developed LivePyxel, a Python graphical interface tool that integrates with webcams, microscopes, and similar devices, supports Bézier splines, binary masks, and non-destructive layers, and combines OpenCV with NumPy for efficient image processing and object detection. Result: LivePyxel is compatible with a wide range of video devices, supports real-time image acquisition and precise annotation, and simplifies data collection and labeling. Conclusion: LivePyxel fills the gap in flexible annotation tools for scientific experiments and speeds up the development and deployment of AI models in experimental workflows. Abstract: The lack of flexible annotation tools has hindered the deployment of AI models in some scientific areas. Most existing image annotation software requires users to upload a precollected dataset, which limits support for on-demand pipelines and introduces unnecessary steps to acquire images. This constraint is particularly problematic in laboratory environments, where real-time data acquisition from instruments such as microscopes is increasingly common. In this work, we introduce LivePyxel, a Python-based graphical user interface that integrates with imaging systems, such as webcams, microscopes, and others, to enable real-time image annotation. LivePyxel is designed to be easy to use through a simple interface that allows users to precisely delimit areas for annotation using tools commonly found in commercial graphics editing software. Of particular interest is the availability of Bézier splines and binary masks, and the software's capacity to work with non-destructive layers that enable high-performance editing. LivePyxel also offers wide compatibility across video devices, and it is optimized for object detection operations through OpenCV combined with NumPy, a high-performance library for matrix and linear algebra operations. LivePyxel facilitates seamless data collection and labeling, accelerating the development of AI models in experimental workflows. LivePyxel is freely available at https://github.com/UGarCil/LivePyxel

[55] DEFT-VTON: Efficient Virtual Try-On with Consistent Generalised H-Transform

Xingzi Xu,Qi Li,Shuwen Qiu,Julien Han,Karim Bouyarmane

Main category: cs.CV

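TL;DR: This paper proposes DEFT-VTON, a virtual try-on method that fine-tunes efficiently via Doob's h-transform and adds an adaptive consistency loss, achieving state-of-the-art performance with as few as 15 denoising steps.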

Details Motivation: 现有的虚拟试衣方法依赖大规模模型的端到端训练,训练和推理成本高,难以在实际应用中部署。因此需要一种参数高效、低预算的微调方法。 Method: 采用Doob's h-transform进行高效微调(DEFT),冻结预训练模型参数,仅训练一个小的h-transform网络来实现条件生成;并提出自适应一致性损失,将一致性损失与去噪得分匹配损失结合,以低成本提升性能并减少推理时间。 Result: DEFT-VTON在虚拟试衣任务上达到了最先进水平,仅需15步去噪,且只需微调1.42%的参数,显著低于传统PEFT的5.52%,同时保持了竞争力的性能。 Conclusion: DEFT-VTON通过参数高效的微调策略和自适应一致性损失,实现了高性能、低推理成本的虚拟试衣,适合实际部署。 Abstract: Diffusion models enable high-quality virtual try-on (VTO) with their established image synthesis abilities. Despite the extensive end-to-end training of large pre-trained models involved in current VTO methods, real-world applications often prioritize limited training and inference, serving, and deployment budgets for VTO. To solve this obstacle, we apply Doob's h-transform efficient fine-tuning (DEFT) for adapting large pre-trained unconditional models for downstream image-conditioned VTO abilities. DEFT freezes the pre-trained model's parameters and trains a small h-transform network to learn a conditional h-transform. The h-transform network allows training only 1.42 percent of the frozen parameters, compared to a baseline of 5.52 percent in traditional parameter-efficient fine-tuning (PEFT). To further improve DEFT's performance and decrease existing models' inference time, we additionally propose an adaptive consistency loss. Consistency training distills slow but high-performing diffusion models into a fast one while retaining performance by enforcing consistencies along the inference path. Inspired by constrained optimization, instead of distillation, we combine the consistency loss and the denoising score matching loss in a data-adaptive manner for fine-tuning existing VTO models at a low cost. Empirical results show the proposed DEFT-VTON method achieves state-of-the-art performance on VTO tasks, with as few as 15 denoising steps, while maintaining competitive results.

[56] Adversarial Appearance Learning in Augmented Cityscapes for Pedestrian Recognition in Autonomous Driving

Artem Savkin,Thomas Lapotre,Kevin Strauss,Uzair Akbar,Federico Tombari

Main category: cs.CV

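TL;DR: This paper proposes a data augmentation pipeline based on virtual pedestrians to improve pedestrian recognition in autonomous driving, particularly on semantic and instance segmentation tasks.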

Details Motivation: Because a domain gap exists between synthetic and real data, more realistic data augmentation is needed to improve pedestrian recognition. Method: Virtual pedestrians are inserted into the Cityscapes dataset, and a new generative network architecture uses adversarial learning to model the dataset's lighting conditions, making the augmentation more realistic. Result: The proposed augmentation is evaluated on semantic segmentation and instance segmentation tasks with good results. Conclusion: The method narrows the domain gap between synthetic and real data and improves recognition of VRUs (vulnerable road users) in autonomous driving. Abstract: In autonomous driving, synthetic data is crucial for covering specific traffic scenarios that an autonomous vehicle must handle. This data commonly introduces a domain gap between the synthetic and real domains. In this paper we deploy data augmentation to generate custom traffic scenarios with VRUs in order to improve pedestrian recognition. We provide a pipeline for augmenting the Cityscapes dataset with virtual pedestrians. To improve the augmentation realism of the pipeline, we present a novel generative network architecture for adversarial learning of the dataset's lighting conditions. We also evaluate our approach on the tasks of semantic and instance segmentation.

[57] FunKAN: Functional Kolmogorov-Arnold Network for Medical Image Enhancement and Segmentation

Maksim Penkin,Andrey Krylov

Main category: cs.CV

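TL;DR: Proposes FunKAN, a new interpretable neural network based on the Kolmogorov-Arnold theorem combined with Fourier-Hermite decomposition, for medical image enhancement and segmentation; it outperforms existing KAN methods across multiple tasks and datasets.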

Details Motivation: Traditional deep learning models lack interpretability in medical image processing, and although Kolmogorov-Arnold networks are interpretable, their flattened feature representations destroy the spatial structure of images. Method: FunKAN generalizes the Kolmogorov-Arnold representation theorem to functional spaces and learns inner functions using Fourier decomposition over a basis of Hermite functions; U-FunKAN is further proposed for medical image segmentation. Result: The method performs Gibbs ringing suppression on the IXI dataset and binary segmentation on the BUSI, GlaS, and CVC-ClinicDB datasets, with metrics (PSNR, TV, IoU, F1) better than other KAN baselines. Conclusion: FunKAN preserves the spatial structure of images while remaining highly interpretable, bridging function approximation theory and medical image analysis and offering a robust, interpretable solution for clinical applications. Abstract: Medical image enhancement and segmentation are critical yet challenging tasks in modern clinical practice, constrained by artifacts and complex anatomical variations. Traditional deep learning approaches often rely on complex architectures with limited interpretability. While Kolmogorov-Arnold networks offer interpretable solutions, their reliance on flattened feature representations fundamentally disrupts the intrinsic spatial structure of imaging data. To address this issue we propose a Functional Kolmogorov-Arnold Network (FunKAN) -- a novel interpretable neural framework, designed specifically for image processing, that formally generalizes the Kolmogorov-Arnold representation theorem onto functional spaces and learns inner functions using Fourier decomposition over a basis of Hermite functions. We explore FunKAN on several medical image processing tasks, including Gibbs ringing suppression in magnetic resonance images, benchmarked on the IXI dataset. We also propose U-FunKAN as a state-of-the-art binary medical segmentation model with benchmarks on three medical datasets: BUSI (ultrasound images), GlaS (histological structures) and CVC-ClinicDB (colonoscopy videos), detecting breast cancer, glands and polyps, respectively. Experiments on those diverse datasets demonstrate that our approach outperforms other KAN-based backbones in both medical image enhancement (PSNR, TV) and segmentation (IoU, F1). Our work bridges the gap between theoretical function approximation and medical image analysis, offering a robust, interpretable solution for clinical applications.

[58] Multimodal Hate Detection Using Dual-Stream Graph Neural Networks

Jiangbei Yue,Shuonan Yang,Tailin Chen,Jianbo Jiao,Zeyu Fu

Main category: cs.CV

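TL;DR: Proposes a dual-stream multimodal model based on graph neural networks for hateful video detection, using an instance graph and a weight graph to highlight hateful content and achieve better classification performance.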

Details Motivation: 现有方法未能有效强调视频中的仇恨内容,且无法系统建模多模态间的结构关系,导致检测效果受限。 Method: 构建实例图以提取实例级特征,并通过互补权重图分配重要性权重,突出仇恨实例;利用图神经网络融合多模态信息进行视频分类。 Result: 在公开数据集上的实验表明,该模型在仇恨视频分类任务上达到最先进水平,并具有良好的可解释性。 Conclusion: 所提出的双流图神经网络能有效捕捉多模态结构信息并强调关键仇恨内容,显著提升检测性能。 Abstract: Hateful videos present serious risks to online safety and real-world well-being, necessitating effective detection methods. Although multimodal classification approaches integrating information from several modalities outperform unimodal ones, they typically neglect that even minimal hateful content defines a video's category. Specifically, they generally treat all content uniformly, instead of emphasizing the hateful components. Additionally, existing multimodal methods cannot systematically capture structured information in videos, limiting the effectiveness of multimodal fusion. To address these limitations, we propose a novel multimodal dual-stream graph neural network model. It constructs an instance graph by separating the given video into several instances to extract instance-level features. Then, a complementary weight graph assigns importance weights to these features, highlighting hateful instances. Importance weights and instance features are combined to generate video labels. Our model employs a graph-based framework to systematically model structured relationships within and across modalities. Extensive experiments on public datasets show that our model is state-of-the-art in hateful video classification and has strong explainability. Code is available: https://github.com/Multimodal-Intelligence-Lab-MIL/MultiHateGNN.

[59] ColonCrafter: A Depth Estimation Model for Colonoscopy Videos Using Diffusion Priors

Romain Hardy,Tyler Berzin,Pranav Rajpurkar

Main category: cs.CV

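TL;DR: Proposes ColonCrafter, a diffusion-based depth estimation method for colonoscopy videos that learns geometric priors from synthetic data and introduces a style transfer technique, producing temporally consistent depth maps and achieving state-of-the-art zero-shot performance.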

Details Motivation: 现有内窥镜深度估计模型在视频序列中的时间一致性较差,限制了其在三维重建中的应用。 Method: 提出ColonCrafter,一种基于扩散模型的方法,利用合成结肠镜序列学习几何先验,并结合保持几何结构的风格迁移技术,使真实临床视频适应合成训练域。 Result: 在C3VD数据集上实现了最先进的零样本深度估计性能,优于通用和专用内窥镜方法,并展示了3D点云生成和表面覆盖评估等临床相关应用。 Conclusion: ColonCrafter能有效生成时间一致的深度图,具有良好的零样本泛化能力,为结肠镜三维场景理解提供了可行方案。 Abstract: Three-dimensional (3D) scene understanding in colonoscopy presents significant challenges that necessitate automated methods for accurate depth estimation. However, existing depth estimation models for endoscopy struggle with temporal consistency across video sequences, limiting their applicability for 3D reconstruction. We present ColonCrafter, a diffusion-based depth estimation model that generates temporally consistent depth maps from monocular colonoscopy videos. Our approach learns robust geometric priors from synthetic colonoscopy sequences to generate temporally consistent depth maps. We also introduce a style transfer technique that preserves geometric structure while adapting real clinical videos to match our synthetic training domain. ColonCrafter achieves state-of-the-art zero-shot performance on the C3VD dataset, outperforming both general-purpose and endoscopy-specific approaches. Although full trajectory 3D reconstruction remains a challenge, we demonstrate clinically relevant applications of ColonCrafter, including 3D point cloud generation and surface coverage assessment.

[60] MemGS: Memory-Efficient Gaussian Splatting for Real-Time SLAM

Yinlong Bai,Hongxin Zhang,Sheng Zhong,Junkai Niu,Hai Li,Yijia He,Yi Zhou

Main category: cs.CV

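TL;DR: This paper proposes a 3D Gaussian Splatting method optimized for embedded platforms such as micro air vehicles. It merges redundant Gaussian primitives in voxel space based on geometric similarity to reduce GPU memory usage, and uses Patch-Grid point sampling to improve rendering quality.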

Details Motivation: 现有3D高斯点阵技术主要依赖高性能桌面GPU,难以应用于计算资源受限的嵌入式平台,如微型飞行器(MAV),需在系统性能与重建质量之间做出权衡。 Method: 提出在体素空间中基于几何相似性合并冗余的3D高斯图元以降低GPU内存使用;并通过Patch-Grid(PG)点采样初始化3D高斯图元,提高场景建模的准确性与渲染质量。 Result: 在公开数据集上的定量与定性实验表明,该方法有效降低了内存消耗,同时提升了渲染质量,且未影响系统运行时性能。 Conclusion: 所提出的方法在保证实时性能的同时,显著优化了3D高斯点阵在资源受限设备上的内存使用和重建质量,推动其在嵌入式SLAM系统中的应用潜力。 Abstract: Recent advancements in 3D Gaussian Splatting (3DGS) have made a significant impact on rendering and reconstruction techniques. Current research predominantly focuses on improving rendering performance and reconstruction quality using high-performance desktop GPUs, largely overlooking applications for embedded platforms like micro air vehicles (MAVs). These devices, with their limited computational resources and memory, often face a trade-off between system performance and reconstruction quality. In this paper, we improve existing methods in terms of GPU memory usage while enhancing rendering quality. Specifically, to address redundant 3D Gaussian primitives in SLAM, we propose merging them in voxel space based on geometric similarity. This reduces GPU memory usage without impacting system runtime performance. Furthermore, rendering quality is improved by initializing 3D Gaussian primitives via Patch-Grid (PG) point sampling, enabling more accurate modeling of the entire scene. Quantitative and qualitative evaluations on publicly available datasets demonstrate the effectiveness of our improvements.

[61] Dynamic Aware: Adaptive Multi-Mode Out-of-Distribution Detection for Trajectory Prediction in Autonomous Vehicles

Tongfei Guo,Lili Su

Main category: cs.CV

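TL;DR: Proposes an adaptive trajectory-level out-of-distribution detection framework that models prediction error modes, achieving lower detection delay and false alarm rates in real-world driving environments.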

Details Motivation: 现有自动驾驶中的异常检测多集中于视觉任务,而轨迹级的分布外(OOD)检测研究不足,且实际中存在训练与现实数据分布不一致的问题。 Method: 将轨迹级OOD检测建模为 quickest change detection (QCD) 问题,并引入自适应机制,显式建模随时间变化的、模式依赖的预测误差分布。 Result: 在多个真实世界数据集上验证了误差分布的动态特性,所提方法在检测延迟和误报率方面显著优于现有的不确定性量化和基于视觉的OOD方法。 Conclusion: 该框架在准确性和计算效率上均表现优越,为实现可靠、驾驶感知的自动驾驶提供了实用路径。 Abstract: Trajectory prediction is central to the safe and seamless operation of autonomous vehicles (AVs). In deployment, however, prediction models inevitably face distribution shifts between training data and real-world conditions, where rare or underrepresented traffic scenarios induce out-of-distribution (OOD) cases. While most prior OOD detection research in AVs has concentrated on computer vision tasks such as object detection and segmentation, trajectory-level OOD detection remains largely underexplored. A recent study formulated this problem as a quickest change detection (QCD) task, providing formal guarantees on the trade-off between detection delay and false alarms [1]. Building on this foundation, we propose a new framework that introduces adaptive mechanisms to achieve robust detection in complex driving environments. Empirical analysis across multiple real-world datasets reveals that prediction errors -- even on in-distribution samples -- exhibit mode-dependent distributions that evolve over time with dataset-specific dynamics. By explicitly modeling these error modes, our method achieves substantial improvements in both detection delay and false alarm rates. Comprehensive experiments on established trajectory prediction benchmarks show that our framework significantly outperforms prior UQ- and vision-based OOD approaches in both accuracy and computational efficiency, offering a practical path toward reliable, driving-aware autonomy.
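The quickest change detection formulation mentioned in the abstract can be illustrated with the classical CUSUM test applied to a stream of prediction errors. This is the textbook baseline, not the adaptive multi-mode method the paper proposes: it assumes Gaussian errors with a single known pre-change mean, a known post-change mean, and a shared standard deviation, and the function and parameter names are illustrative.

```python
def cusum_alarm(errors, mean_pre, mean_post, sigma, threshold):
    """Return the first index at which CUSUM raises an alarm, else None.

    Each step adds the Gaussian log-likelihood ratio of "post-change"
    versus "pre-change" for the new error, clamps the running statistic
    at zero, and fires once it reaches `threshold`.
    """
    stat = 0.0
    for t, err in enumerate(errors):
        llr = ((mean_post - mean_pre) / sigma ** 2) * (
            err - (mean_pre + mean_post) / 2.0)
        stat = max(0.0, stat + llr)
        if stat >= threshold:
            return t
    return None


# Errors sit near 0 for five steps, then jump to 2 (a simulated shift).
stream = [0.0] * 5 + [2.0] * 5
alarm_index = cusum_alarm(stream, mean_pre=0.0, mean_post=2.0,
                          sigma=1.0, threshold=5.0)
```

Raising `threshold` lowers the false alarm rate at the cost of a longer detection delay, which is the trade-off the abstract refers to. The paper's observation that in-distribution errors are mode-dependent and drift over time is exactly what this single-mode sketch does not capture.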

[62] Annotating Satellite Images of Forests with Keywords from a Specialized Corpus in the Context of Change Detection

Nathalie Neptune,Josiane Mothe

Main category: cs.CV

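TL;DR: Proposes a deep-learning method that analyzes satellite image pairs to detect deforestation in the Amazon rainforest, combined with a visual semantic model that automatically generates semantic annotations.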

Details Motivation: Deforestation of the Amazon rainforest severely affects global carbon emissions and biodiversity, so effective monitoring is urgently needed. Method: Deep learning compares satellite images of the same area taken at different dates to detect changes in forest cover, and a visual semantic model annotates the changes automatically with keywords extracted from scientific literature. Result: The method is validated on a dataset of Amazon image pairs, where it detects deforested areas and generates relevant semantic annotations. Conclusion: The method is a useful tool for monitoring Amazon deforestation and has the potential to extend to other remote-sensing monitoring domains. Abstract: The Amazon rain forest is a vital ecosystem that plays a crucial role in regulating the Earth's climate and providing habitat for countless species. Deforestation in the Amazon is a major concern as it has a significant impact on global carbon emissions and biodiversity. In this paper, we present a method for detecting deforestation in the Amazon using image pairs from Earth observation satellites. Our method leverages deep learning techniques to compare the images of the same area at different dates and identify changes in the forest cover. We also propose a visual semantic model that automatically annotates the detected changes with relevant keywords. The candidate annotations for the images are extracted from scientific documents related to the Amazon region. We evaluate our approach on a dataset of Amazon image pairs and demonstrate its effectiveness in detecting deforestation and generating relevant annotations. Our method provides a useful tool for monitoring and studying the impact of deforestation in the Amazon. While we focus on environmental applications of our work by using images of deforestation in the Amazon rain forest to demonstrate the effectiveness of our proposed approach, it is generic enough to be applied to other domains.

[63] Intelligent Healthcare Imaging Platform An VLM-Based Framework for Automated Medical Image Analysis and Clinical Report Generation

Samer Al-Hamadani

Main category: cs.CV

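TL;DR: Proposes a multimodal medical image analysis framework based on vision-language models that uses Google Gemini 2.5 Flash for cross-modality tumor detection and report generation, combining visualization with probabilistic modeling to improve clinical confidence.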

Details Motivation: 为了提升医疗影像诊断的自动化水平和临床决策效率,减少对大规模标注数据的依赖,并增强模型在多模态影像中的泛化能力。 Method: 采用Vision-Language Models(VLMs)构建智能框架,集成Google Gemini 2.5 Flash进行视觉特征提取与自然语言处理,结合坐标验证机制和高斯概率建模分析异常分布,通过多层可视化技术生成医学图示与统计表示,并使用精确提示工程提取结构化临床信息。 Result: 在多种影像模态上实现了高性能的异常检测,定位误差平均为80像素,具备零样本学习能力,支持用户友好的Gradio界面集成,有效提升放射科工作流效率。 Conclusion: 该框架显著提升了医疗影像自动分析的能力和临床实用性,但在广泛部署前仍需进行临床验证和多中心评估。 Abstract: The rapid advancement of artificial intelligence (AI) in healthcare imaging has revolutionized diagnostic medicine and clinical decision-making processes. This work presents an intelligent multimodal framework for medical image analysis that leverages Vision-Language Models (VLMs) in healthcare diagnostics. The framework integrates Google Gemini 2.5 Flash for automated tumor detection and clinical report generation across multiple imaging modalities including CT, MRI, X-ray, and Ultrasound. The system combines visual feature extraction with natural language processing to enable contextual image interpretation, incorporating coordinate verification mechanisms and probabilistic Gaussian modeling for anomaly distribution. Multi-layered visualization techniques generate detailed medical illustrations, overlay comparisons, and statistical representations to enhance clinical confidence, with location measurement achieving 80 pixels average deviation. Result processing utilizes precise prompt engineering and textual analysis to extract structured clinical information while maintaining interpretability. Experimental evaluations demonstrated high performance in anomaly detection across multiple modalities. The system features a user-friendly Gradio interface for clinical workflow integration and demonstrates zero-shot learning capabilities to reduce dependence on large datasets. This framework represents a significant advancement in automated diagnostic support and radiological workflow efficiency, though clinical validation and multi-center evaluation are necessary prior to widespread adoption.

[64] A Generalization of CLAP from 3D Localization to Image Processing, A Connection With RANSAC & Hough Transforms

Ruochen Hou,Gabriel I. Fernandez,Alex Xu,Dennis W. Hong

Main category: cs.CV

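TL;DR: This paper extends CLAP, a clustering-based 2D localization algorithm, into a general framework for 3D localization and image stitching, and examines how CLAP relates to RANSAC and Hough transforms, showing its broad potential for handling noise and uncertainty.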

Details Motivation: 为了克服传统异常值抑制方法(如RANSAC)的局限性,并提升在复杂环境下的鲁棒性,需要一种更通用且抗噪能力强的定位框架。 Method: 通过引入基于聚类的策略替代传统的重投影误差验证方法,将CLAP从2D定位推广到3D定位和图像拼接任务,并分析其与RANSAC、霍夫变换的理论联系。 Result: CLAP被成功扩展至3D定位和图像拼接领域,表现出对噪声和误匹配的良好鲁棒性,并揭示了其与RANSAC、霍夫变换之间的内在关联。 Conclusion: CLAP作为一种通用的聚类基础框架,在多种应用场景中具有良好的适应性和抗噪能力,为处理感知中的不确定性和噪声提供了有效工具。 Abstract: In previous work, we introduced a 2D localization algorithm called CLAP, Clustering to Localize Across $n$ Possibilities, which was used during our championship win in RoboCup 2024, an international autonomous humanoid soccer competition. CLAP is particularly recognized for its robustness against outliers, where clustering is employed to suppress noise and mitigate against erroneous feature matches. This clustering-based strategy provides an alternative to traditional outlier rejection schemes such as RANSAC, in which candidates are validated by reprojection error across all data points. In this paper, CLAP is extended to a more general framework beyond 2D localization, specifically to 3D localization and image stitching. We also show how CLAP, RANSAC, and Hough transforms are related. The generalization of CLAP is widely applicable to many different fields and can be a useful tool to deal with noise and uncertainty.

[65] SAMIR, an efficient registration framework via robust feature learning from SAM

Yue He,Min Liu,Qinghao Liu,Jiazheng Wang,Yaonan Wang,Hang Zhang,Xiang Chen

Main category: cs.CV

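TL;DR: This paper proposes SAMIR, a medical image registration framework built on the visual foundation model SAM. By exploiting SAM's strong feature extraction together with a task-specific adaptation design and a hierarchical feature consistency loss, it markedly improves registration accuracy without requiring weak labels.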

Details Motivation: 现有弱监督配准方法依赖分割掩码或标志点等解剖先验,但这些标签常难以获得,限制了实际应用。因此需要一种无需额外标注、能有效提取结构感知特征的方法。 Method: 利用预训练的Segment Anything Model(SAM)图像编码器提取结构感知的特征嵌入,并设计任务特定的适应性流程;引入轻量级3D头在嵌入空间中细化特征以适应局部形变;提出分层特征一致性损失实现由粗到细的特征匹配。 Result: 在ACDC心脏数据集和腹部CT数据集上显著优于现有最先进方法,分别提升2.68%和6.44%的性能。 Conclusion: SAMIR通过有效利用视觉基础模型增强了医学图像配准中的特征表示,实现了更精确的解剖结构对齐,且不依赖额外标注,具有良好的实用性和推广潜力。 Abstract: Image registration is a fundamental task in medical image analysis. Deformations are often closely related to the morphological characteristics of tissues, making accurate feature extraction crucial. Recent weakly supervised methods improve registration by incorporating anatomical priors such as segmentation masks or landmarks, either as inputs or in the loss function. However, such weak labels are often not readily available, limiting their practical use. Motivated by the strong representation learning ability of visual foundation models, this paper introduces SAMIR, an efficient medical image registration framework that utilizes the Segment Anything Model (SAM) to enhance feature extraction. SAM is pretrained on large-scale natural image datasets and can learn robust, general-purpose visual representations. Rather than using raw input images, we design a task-specific adaptation pipeline using SAM's image encoder to extract structure-aware feature embeddings, enabling more accurate modeling of anatomical consistency and deformation patterns. We further design a lightweight 3D head to refine features within the embedding space, adapting to local deformations in medical images. Additionally, we introduce a Hierarchical Feature Consistency Loss to guide coarse-to-fine feature matching and improve anatomical alignment. 
Extensive experiments demonstrate that SAMIR significantly outperforms state-of-the-art methods on benchmark datasets for both intra-subject cardiac image registration and inter-subject abdomen CT image registration, achieving performance improvements of 2.68% on ACDC and 6.44% on the abdomen dataset. The source code will be publicly available on GitHub following the acceptance of this paper.

[66] Federated Learning for Deforestation Detection: A Distributed Approach with Satellite Imagery

Yuvraj Dutta,Aaditya Sikder,Basabdatta Palit

Main category: cs.CV

TL;DR: Proposes a distributed approach based on federated learning to identify and locate deforested areas while preserving data privacy.

Details Motivation: Accurately identifying deforestation is essential for understanding the geographical situation of a region, and traditional centralized methods raise data security concerns. Method: A federated learning setup (FLOWER + RAY) collaboratively trains YOLOS-small and Faster R-CNN (ResNet50 and MobileNetV3 backbones) models across multiple edge satellite centers. Result: On publicly available datasets, the approach detects and locates deforestation in satellite images and supports privacy-preserving distributed image segmentation tasks. Conclusion: The framework offers a new distributed solution for satellite image analysis that balances performance and data security. Abstract: Accurate identification of deforestation from satellite images is essential in order to understand the geographical situation of an area. This paper introduces a new distributed approach to identify as well as locate deforestation across different clients using Federated Learning (FL). Federated Learning enables distributed network clients to collaboratively train a model while maintaining data privacy and security of the active users. In our framework, a client corresponds to an edge satellite center responsible for local data processing. Moreover, FL provides an advantage over centralized training methods, which require combining data and thereby compromise the data security of the clients. Our framework leverages the FLOWER framework with the RAY framework to execute the distributed learning workload. Furthermore, RAY ensures efficient client spawning, as it can select a definite number of users to create an emulation environment. Our FL framework uses YOLOS-small (a Vision Transformer variant), Faster R-CNN with a ResNet50 backbone, and Faster R-CNN with a MobileNetV3 backbone, trained and tested on publicly available datasets. Our approach provides a different view of image segmentation-based tasks on satellite imagery.

[67] Gaussian Alignment for Relative Camera Pose Estimation via Single-View Reconstruction

Yumin Li,Dylan Campbell

Main category: cs.CV

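TL;DR: This paper proposes GARPS, a training-free framework that estimates metric relative camera pose by directly aligning two independently reconstructed 3D scenes. Combining single-view perception with multi-view geometry, it outperforms classical and state-of-the-art learning-based methods on the RealEstate10K dataset.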

Details Motivation: Conventional two-view pose estimation methods cannot recover translation at metric scale and perform poorly with wide baselines and textureless or reflective surfaces, so a more robust and accurate metric pose estimation method is needed. Method: GARPS uses a metric monocular depth estimator and a Gaussian scene reconstructor to build a metric 3D Gaussian Mixture Model (GMM) for each image, then refines the initial pose from a feed-forward two-view pose estimator by optimizing a differentiable GMM alignment objective that jointly considers geometric structure, view-independent colour, anisotropic covariance, and semantic feature consistency. Result: Extensive experiments on the RealEstate10K dataset show that GARPS outperforms classical and state-of-the-art learning-based methods (such as MASt3R) and is more robust with wide baselines and in texture-poor regions. Conclusion: GARPS shows the potential of combining single-view perception with multi-view geometry for robust, metrically consistent relative pose estimation, achieving strong performance without training. Abstract: Estimating metric relative camera pose from a pair of images is of great importance for 3D reconstruction and localisation. However, conventional two-view pose estimation methods are not metric, with camera translation known only up to a scale, and struggle with wide baselines and textureless or reflective surfaces. This paper introduces GARPS, a training-free framework that casts this problem as the direct alignment of two independently reconstructed 3D scenes. GARPS leverages a metric monocular depth estimator and a Gaussian scene reconstructor to obtain a metric 3D Gaussian Mixture Model (GMM) for each image. It then refines an initial pose from a feed-forward two-view pose estimator by optimising a differentiable GMM alignment objective. This objective jointly considers geometric structure, view-independent colour, anisotropic covariance, and semantic feature consistency, and is robust to occlusions and texture-poor regions without requiring explicit 2D correspondences. Extensive experiments on the RealEstate10K dataset demonstrate that GARPS outperforms both classical and state-of-the-art learning-based methods, including MASt3R. These results highlight the potential of bridging single-view perception with multi-view geometry to achieve robust and metric relative pose estimation.

[68] Deep Lookup Network

Yulan Guo,Longguang Wang,Wendong Mao,Xiaoyu Dong,Yingqian Wang,Li Liu,Wei An

Main category: cs.CV

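TL;DR: This paper proposes a differentiable lookup operation that replaces multiplication in convolutional neural networks, reducing computational complexity and energy consumption and speeding up inference, while achieving state-of-the-art performance across multiple tasks and data types.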

Details Motivation: Multiplication in convolutional neural networks is computationally complex and energy-hungry, which limits deployment on mobile devices. Inspired by edge devices that use lookup tables to cut computational cost, the paper aims to provide a more efficient alternative. Method: A differentiable lookup operation is designed as a basic building block of neural networks; lookup tables replace the multiplication of weights and activation values, with end-to-end optimization and several training strategies to promote convergence. Result: On image classification, image super-resolution, and point cloud classification, the proposed lookup networks markedly improve energy efficiency and inference speed while keeping competitive performance, and reach state-of-the-art results across multiple tasks and data types. Conclusion: By replacing expensive multiplications, lookup networks improve model efficiency, suit resource-constrained edge devices, and have broad application prospects. Abstract: Convolutional neural networks are built from a massive number of operations of different types and are highly computationally intensive. Among these operations, the multiplication operation has higher computational complexity and usually requires more energy and longer inference time than other operations, which hinders the deployment of convolutional neural networks on mobile devices. In many resource-limited edge devices, complicated operations can be calculated via lookup tables to reduce computational cost. Motivated by this, in this paper, we introduce a generic and efficient lookup operation which can be used as a basic operation for the construction of neural networks. Instead of calculating the multiplication of weights and activation values, simple yet efficient lookup operations are adopted to compute their responses. To enable end-to-end optimization of the lookup operation, we construct the lookup tables in a differentiable manner and propose several training strategies to promote their convergence. By replacing computationally expensive multiplication operations with our lookup operations, we develop lookup networks for the image classification, image super-resolution, and point cloud classification tasks. It is demonstrated that our lookup networks can benefit from the lookup operations to achieve higher efficiency in terms of energy consumption and inference speed while maintaining competitive performance to vanilla convolutional networks. 
Extensive experiments show that our lookup networks produce state-of-the-art performance on different tasks (both classification and regression tasks) and different data types (both images and point clouds).
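The basic idea of replacing multiplications with table lookups can be sketched at inference time as follows. This is an illustration only: it assumes weights and activations are snapped to small fixed sets of levels and that the table stores their exact products, whereas the paper learns its tables in a differentiable manner. All names (`build_table`, `nearest_index`, `lookup_dot`) and the level sets are hypothetical.

```python
def build_table(weight_levels, act_levels):
    """Precompute one response per (weight level, activation level) pair."""
    return [[w * a for a in act_levels] for w in weight_levels]


def nearest_index(value, levels):
    """Index of the level closest to `value` (the quantization step)."""
    return min(range(len(levels)), key=lambda i: abs(levels[i] - value))


def lookup_dot(weights, activations, weight_levels, act_levels, table):
    """Approximate a dot product using table reads and additions only."""
    total = 0.0
    for w, a in zip(weights, activations):
        total += table[nearest_index(w, weight_levels)][
            nearest_index(a, act_levels)]
    return total


w_levels = [-1.0, 0.0, 1.0]
a_levels = [0.0, 0.5, 1.0]
table = build_table(w_levels, a_levels)
response = lookup_dot([0.9, -1.1, 0.1], [1.0, 0.4, 0.7],
                      w_levels, a_levels, table)
```

In a deployed model the indices would be stored directly instead of being recomputed from floating-point values, so the inner loop reduces to one memory read and one addition per weight.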

[69] Re-purposing SAM into Efficient Visual Projectors for MLLM-Based Referring Image Segmentation

Xiaobo Yang,Xiaojin Gong

Main category: cs.CV

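TL;DR: Proposes a visual projector based on semantic superpixels that cuts the number of visual tokens by 93% while maintaining performance, efficiently accelerating MLLM training and inference for referring image segmentation.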

Details Motivation: 传统基于patch的视觉投影器在减少视觉token和保持语义清晰之间难以平衡,导致计算开销大。 Method: 利用SAM生成的语义超像素作为“视觉词”,设计语义视觉投影器,结合语义超像素位置编码和聚合模块,自适应压缩token序列并保留细粒度细节与全局上下文。 Result: 视觉token减少93%,在不损失性能的前提下显著提升MLLM的训练和推理速度,在RIS任务上优于现有的压缩方法。 Conclusion: 所提方法有效解决了MLLM在图像分割中的视觉token冗余问题,在保持高性能的同时大幅降低计算成本。 Abstract: Recently, Referring Image Segmentation (RIS) frameworks that pair the Multimodal Large Language Model (MLLM) with the Segment Anything Model (SAM) have achieved impressive results. However, adapting MLLM to segmentation is computationally intensive, primarily due to visual token redundancy. We observe that traditional patch-wise visual projectors struggle to strike a balance between reducing the number of visual tokens and preserving semantic clarity, often retaining overly long token sequences to avoid performance drops. Inspired by text tokenizers, we propose a novel semantic visual projector that leverages semantic superpixels generated by SAM to identify "visual words" in an image. By compressing and projecting semantic superpixels as visual tokens, our approach adaptively shortens the token sequence according to scene complexity while minimizing semantic loss in compression. To mitigate loss of information, we propose a semantic superpixel positional embedding to strengthen MLLM's awareness of superpixel geometry and position, alongside a semantic superpixel aggregator to preserve both fine-grained details inside superpixels and global context outside. Experiments show that our method cuts visual tokens by 93% without compromising performance, notably speeding up MLLM training and inference, and outperforming existing compressive visual projectors on RIS.

[70] FishBEV: Distortion-Resilient Bird's Eye View Segmentation with Surround-View Fisheye Cameras

Hang Li,Dianmo Sheng,Qiankun Dong,Zichun Wang,Zhiwei Xu,Tao Li

Main category: cs.CV

TL;DR: Proposes FishBEV, a bird's-eye-view segmentation framework designed for fisheye cameras that uses multi-scale feature extraction, uncertainty-aware cross-attention, and distance-aware temporal self-attention to handle geometric distortion, cross-view alignment, and temporal stability.

Details Motivation: Existing BEV segmentation methods built for pinhole cameras do not transfer directly to fisheye cameras, which suffer from severe geometric distortion, ambiguous cross-view correspondences, and unstable temporal dynamics, so performance degrades. Method: FishBEV comprises three core components: a Distortion-Resilient Multi-scale Extraction (DRME) backbone, an Uncertainty-aware Spatial Cross-Attention (U-SCA) mechanism, and a Distance-aware Temporal Self-Attention (D-TSA) module. Result: Experiments on the Synwoodscapes dataset show that FishBEV significantly outperforms state-of-the-art methods on surround-view fisheye BEV segmentation. Conclusion: FishBEV addresses the key challenges of BEV segmentation with fisheye cameras, improving segmentation performance and robustness. Abstract: As a cornerstone technique for autonomous driving, Bird's Eye View (BEV) segmentation has recently achieved remarkable progress with pinhole cameras. However, it is non-trivial to extend the existing methods to fisheye cameras with severe geometric distortion, ambiguous multi-view correspondences and unstable temporal dynamics, all of which significantly degrade BEV performance. To address these challenges, we propose FishBEV, a novel BEV segmentation framework specifically tailored for fisheye cameras. This framework introduces three complementary innovations: a Distortion-Resilient Multi-scale Extraction (DRME) backbone that learns robust features under distortion while preserving scale consistency, an Uncertainty-aware Spatial Cross-Attention (U-SCA) mechanism that leverages uncertainty estimation for reliable cross-view alignment, and a Distance-aware Temporal Self-Attention (D-TSA) module that adaptively balances near-field details and far-field context to ensure temporal coherence. Extensive experiments on the Synwoodscapes dataset demonstrate that FishBEV consistently outperforms SOTA baselines on surround-view fisheye BEV segmentation tasks.

[71] Taylor-Series Expanded Kolmogorov-Arnold Network for Medical Imaging Classification

Kaniz Fatema,Emad A. Mohammed,Sukhjit Singh Sehra

Main category: cs.CV

TL;DR: This study proposes spline-based Kolmogorov-Arnold Networks (KANs) for classifying small, diverse medical image datasets; the models learn directly from raw data without preprocessing, offer high accuracy, strong generalization, and interpretability, and have very few parameters, making them suitable for resource-constrained clinical settings.

Details Motivation: In resource-limited clinical settings, medical image classification faces data scarcity, high model complexity, and poor interpretability, so an efficient, lightweight, and interpretable classifier is needed. Method: Three spline-based KAN models are proposed: SBTAYLOR-KAN (B-splines with Taylor series), SBRBF-KAN (B-splines with radial basis functions), and SBWAVELET-KAN (B-splines embedded in Morlet wavelet transforms); spline-based function approximation captures local and global nonlinearities, and Grad-CAM is used for interpretability. Result: Evaluated on brain MRI, chest X-ray, tuberculosis X-ray, and skin lesion images, SBTAYLOR-KAN reaches up to 98.93% accuracy and stays above 86% accuracy with only 30% of the training data; on the skin cancer dataset it reaches 68.22% accuracy, outperforming the other models; compared with ResNet50 (24.18M parameters), SBTAYLOR-KAN has only 2,872 trainable parameters and shows stronger generalization and stability. Conclusion: The spline-based KAN framework offers a lightweight, interpretable, and generalizable solution for medical image classification, particularly for clinical AI applications with scarce data and limited compute. Abstract: Effective and interpretable classification of medical images is a challenge in computer-aided diagnosis, especially in resource-limited clinical settings. This study introduces spline-based Kolmogorov-Arnold Networks (KANs) for accurate medical image classification with limited, diverse datasets. The models include SBTAYLOR-KAN, integrating B-splines with Taylor series; SBRBF-KAN, combining B-splines with Radial Basis Functions; and SBWAVELET-KAN, embedding B-splines in Morlet wavelet transforms. These approaches leverage spline-based function approximation to capture both local and global nonlinearities. The models were evaluated on brain MRI, chest X-rays, tuberculosis X-rays, and skin lesion images without preprocessing, demonstrating the ability to learn directly from raw data. Extensive experiments, including cross-dataset validation and data reduction analysis, showed strong generalization and stability. SBTAYLOR-KAN achieved up to 98.93% accuracy, with a balanced F1-score, maintaining over 86% accuracy using only 30% of the training data across three datasets. Despite class imbalance in the skin cancer dataset, experiments on both imbalanced and balanced versions showed SBTAYLOR-KAN outperforming other models, achieving 68.22% accuracy. Unlike traditional CNNs, which require millions of parameters (e.g., ResNet50 with 24.18M), SBTAYLOR-KAN achieves comparable performance with just 2,872 trainable parameters, making it more suitable for constrained medical environments. 
Gradient-weighted Class Activation Mapping (Grad-CAM) was used for interpretability, highlighting relevant regions in medical images. This framework provides a lightweight, interpretable, and generalizable solution for medical image classification, addressing the challenges of limited datasets and data-scarce scenarios in clinical AI applications.
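A KAN places a learnable univariate function on every edge instead of a scalar weight. The sketch below illustrates only the Taylor-series part of that idea, assuming each edge function is a truncated polynomial with learnable coefficients; how SBTAYLOR-KAN combines this with B-splines is not described in the abstract, so `taylor_edge`, `kan_layer`, and `count_params` are hypothetical helpers, not the authors' architecture.

```python
def taylor_edge(x, coeffs, center=0.0):
    """Evaluate sum_k coeffs[k] * (x - center)**k using Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * (x - center) + c
    return result

def kan_layer(inputs, edge_coeffs):
    """Sum the edge functions feeding each output node.

    edge_coeffs[j][i] holds the polynomial coefficients for the edge
    from input i to output j (assumed layout for this sketch).
    """
    return [
        sum(taylor_edge(x, edge_coeffs[j][i]) for i, x in enumerate(inputs))
        for j in range(len(edge_coeffs))
    ]

def count_params(n_in, n_out, order):
    """Each of the n_in * n_out edges carries (order + 1) coefficients."""
    return n_in * n_out * (order + 1)

# One output node fed by two inputs, each edge a second-order polynomial.
coeffs = [[[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]]
outputs = kan_layer([2.0, 1.0], coeffs)   # (1 + 2*2 + 3*4) + (0 + 1*1 + 0) = 18
n_params = count_params(2, 1, 2)          # 2 edges * 3 coefficients = 6
```

Because the parameter count grows with the number of edges times the polynomial order, a narrow network of this kind can stay in the low thousands of parameters, which is consistent in spirit with the small budget the abstract reports.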

[72] StyleProtect: Safeguarding Artistic Identity in Fine-tuned Diffusion Models

Qiuyu Tang,Joshua Krinsky,Aparna Bharati

Main category: cs.CV

TL;DR: This paper proposes StyleProtect, a lightweight protection strategy that defends artistic styles against mimicry by fine-tuned diffusion models by updating only selected cross-attention layers.

Details Motivation: The rapid progress of generative models, diffusion models in particular, brings a risk of misuse: malicious users can cheaply replicate an artist's distinctive style, so methods for protecting artworks against style mimicry are needed. Method: Starting from the hypothesis that certain cross-attention layers are more sensitive to artistic style, the authors measure the activation strength of these layers in response to style and content representations, analyze correlations with features extracted by external models, and select the key layers to update for protection. Result: Experiments on artworks from WikiArt and the Anita dataset show that the method protects artistic and anime styles from malicious diffusion customization while maintaining good imperceptibility. Conclusion: StyleProtect is an efficient, lightweight defense that specifically counters fine-tuning attacks targeting artistic style. Abstract: The rapid advancement of generative models, particularly diffusion-based approaches, has inadvertently facilitated their potential for misuse. Such models enable malicious exploiters to replicate artistic styles that capture an artist's creative labor, personal vision, and years of dedication in an inexpensive manner. This has led to a growing need for, and exploration of, methods for protecting artworks against style mimicry. Although generic diffusion models can easily mimic an artistic style, finetuning amplifies this capability, enabling the model to internalize and reproduce the style with higher fidelity and control. We hypothesize that certain cross-attention layers exhibit heightened sensitivity to artistic styles. Sensitivity is measured through activation strengths of attention layers in response to style and content representations, and by assessing their correlations with features extracted from external models. Based on our findings, we introduce an efficient and lightweight protection strategy, StyleProtect, that achieves effective style defense against fine-tuned diffusion models by updating only selected cross-attention layers. Our experiments utilize a carefully curated artwork dataset based on WikiArt, comprising representative works from 30 artists known for their distinctive and influential styles, and cartoon animations from the Anita dataset. The proposed method demonstrates promising performance in safeguarding unique styles of artworks and anime from malicious diffusion customization, while maintaining competitive imperceptibility.

[73] UM-Depth : Uncertainty Masked Self-Supervised Monocular Depth Estimation with Visual Odometry

Tae-Wook Um,Ki-Hyeon Kim,Hyun-Duck Choi,Hyo-Sung Ahn

Main category: cs.CV

TL;DR: Proposes UM-Depth, a framework that combines motion- and uncertainty-aware refinement to improve monocular depth accuracy at dynamic object boundaries and in textureless regions.

Details Motivation: Existing self-supervised monocular depth estimation methods lose accuracy in low-texture or dynamic regions because of uncertainty in the input data, so supervision in these regions needs strengthening. Method: A teacher-student training strategy introduces uncertainty estimation during training and uses optical flow within the teacher network for motion-aware refinement, adding no inference overhead. Result: The method is validated on the KITTI and Cityscapes datasets, and UM-Depth achieves state-of-the-art self-supervised depth and pose estimation on KITTI. Conclusion: Uncertainty-aware refinement markedly improves depth accuracy in complex scenes without extra labels or inference cost. Abstract: Monocular depth estimation has been increasingly adopted in robotics and autonomous driving for its ability to infer scene geometry from a single camera. In self-supervised monocular depth estimation frameworks, the network jointly generates and exploits depth and pose estimates during training, thereby eliminating the need for depth labels. However, these methods remain challenged by uncertainty in the input data, such as low-texture or dynamic regions, which can cause reduced depth accuracy. To address this, we introduce UM-Depth, a framework that combines motion- and uncertainty-aware refinement to enhance depth accuracy at dynamic object boundaries and in textureless regions. Specifically, we develop a teacher-student training strategy that embeds uncertainty estimation into both the training pipeline and network architecture, thereby strengthening supervision where photometric signals are weak. Unlike prior motion-aware approaches that incur inference-time overhead and rely on additional labels or auxiliary networks for real-time generation, our method uses optical flow exclusively within the teacher network during training, which eliminates extra labeling demands and any runtime cost. Extensive experiments on the KITTI and Cityscapes datasets demonstrate the effectiveness of our uncertainty-aware refinement. Overall, UM-Depth achieves state-of-the-art results in both self-supervised depth and pose estimation on the KITTI dataset.
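To make the notion of uncertainty masking concrete, here is a minimal sketch of a masked photometric loss. It assumes a per-pixel uncertainty map is already available and uses a hard threshold; the abstract does not state how UM-Depth derives or applies its mask, so the function name, the threshold, and the L1 error are all assumptions for illustration.

```python
def masked_photometric_loss(target, warped, uncertainty, threshold=0.5):
    """Mean absolute photometric error over pixels with low uncertainty.

    target, warped: flat lists of pixel intensities (target frame and the
                    source frame warped into the target view).
    uncertainty:    flat list of per-pixel uncertainty values (assumed given).
    Pixels whose uncertainty exceeds `threshold` are excluded from the loss.
    """
    kept = [abs(t - w) for t, w, u in zip(target, warped, uncertainty) if u <= threshold]
    return sum(kept) / len(kept) if kept else 0.0

target = [1.0, 5.0, 2.0]
warped = [0.5, 0.0, 2.0]
uncertainty = [0.1, 0.9, 0.2]   # the middle pixel is unreliable, e.g. a moving object

loss = masked_photometric_loss(target, warped, uncertainty)
unmasked = sum(abs(t - w) for t, w in zip(target, warped)) / len(target)
```

In this toy case the unreliable pixel dominates the unmasked error, while the masked loss reflects only the pixels where the photometric signal can be trusted.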

[74] Mitigating Query Selection Bias in Referring Video Object Segmentation

Dingwei Zhang,Dong Zhang,Jinhui Tang

Main category: cs.CV

TL;DR: Proposes Triple Query Former (TQF), which factorizes the referring query into appearance, intra-frame interaction, and inter-frame motion components and builds the queries dynamically from linguistic cues and visual guidance, improving referring video object segmentation.

Details Motivation: Existing methods based on static textual queries are easily misled by distractors with similar appearance or motion, which produces query selection bias. Method: The query is factorized into three specialized components for appearance, intra-frame spatial relations, and inter-frame motion, and is constructed dynamically by fusing linguistic and visual information; two motion-aware aggregation modules strengthen intra-frame object interactions and cross-frame motion consistency, respectively. Result: Experiments on multiple RVOS benchmarks show that TQF outperforms existing methods and confirm the effectiveness of the structured query design and the motion-aware aggregation modules. Conclusion: Factorizing the query and adding visually guided dynamic queries with motion-aware aggregation mitigates query selection bias and improves cross-modal alignment in RVOS. Abstract: Recently, query-based methods have achieved remarkable performance in Referring Video Object Segmentation (RVOS) by using textual static object queries to drive cross-modal alignment. However, these static queries are easily misled by distractors with similar appearance or motion, resulting in query selection bias. To address this issue, we propose Triple Query Former (TQF), which factorizes the referring query into three specialized components: an appearance query for static attributes, an intra-frame interaction query for spatial relations, and an inter-frame motion query for temporal association. Instead of relying solely on textual embeddings, our queries are dynamically constructed by integrating both linguistic cues and visual guidance. Furthermore, we introduce two motion-aware aggregation modules that enhance object token representations: Intra-frame Interaction Aggregation incorporates position-aware interactions among objects within a single frame, while Inter-frame Motion Aggregation leverages trajectory-guided alignment across frames to ensure temporal coherence. Extensive experiments on multiple RVOS benchmarks demonstrate the advantages of TQF and the effectiveness of our structured query design and motion-aware aggregation modules.

[75] Improving Generalized Visual Grounding with Instance-aware Joint Learning

Ming Dai,Wenxuan Cheng,Jiang-Jiang Liu,Lingfeng Yang,Zhenhua Feng,Wankou Yang,Jingdong Wang

Main category: cs.CV

TL;DR: This paper proposes InstanceVG, an instance-aware multi-task framework for generalized visual grounding that is the first to tackle GREC and GRES together; instance queries yield consistent predictions of coarse boxes and fine-grained masks, and the framework achieves state-of-the-art performance on ten datasets.

Details Motivation: Existing methods usually handle GREC and GRES independently and treat GRES as semantic segmentation, overlooking instance awareness and the need for consistency across multi-granularity predictions. Method: InstanceVG introduces instance queries with prior reference points to unify the joint prediction of instance-level boxes and masks; the reference point also assists target matching, giving consistent predictions of points, boxes, and masks. Result: In experiments on four tasks across ten datasets, InstanceVG significantly outperforms existing methods on multiple evaluation metrics and reaches the state of the art. Conclusion: InstanceVG is the first framework to bring instance awareness to generalized visual grounding while jointly handling multi-granularity tasks, improving both performance and prediction consistency for GREC and GRES. Abstract: Generalized visual grounding tasks, including Generalized Referring Expression Comprehension (GREC) and Segmentation (GRES), extend the classical visual grounding paradigm by accommodating multi-target and non-target scenarios. Specifically, GREC focuses on accurately identifying all referential objects at the coarse bounding box level, while GRES aims to achieve fine-grained pixel-level perception. However, existing approaches typically treat these tasks independently, overlooking the benefits of jointly training GREC and GRES to ensure consistent multi-granularity predictions and streamline the overall process. Moreover, current methods often treat GRES as a semantic segmentation task, neglecting the crucial role of instance-aware capabilities and the necessity of ensuring consistent predictions between instance-level boxes and masks. To address these limitations, we propose InstanceVG, a multi-task generalized visual grounding framework equipped with instance-aware capabilities, which leverages instance queries to unify the joint and consistent predictions of instance-level boxes and masks. To the best of our knowledge, InstanceVG is the first framework to simultaneously tackle both GREC and GRES while incorporating instance-aware capabilities into generalized visual grounding. To instantiate the framework, we assign each instance query a prior reference point, which also serves as an additional basis for target matching. This design facilitates consistent predictions of points, boxes, and masks for the same instance. 
Extensive experiments on ten datasets across four tasks demonstrate that InstanceVG achieves state-of-the-art performance, significantly surpassing the existing methods in various evaluation metrics. The code and model will be publicly available at https://github.com/Dmmm1997/InstanceVG.

[76] Cross-modal Full-mode Fine-grained Alignment for Text-to-Image Person Retrieval

Hao Yin,Xin Man,Feiyu Chen,Jie Shao,Heng Tao Shen

Main category: cs.CV

TL;DR: This paper proposes FMFA, a full-mode fine-grained alignment framework for text-to-image person retrieval (TIPR) that improves cross-modal matching through explicit and implicit alignment and achieves the best results on several public datasets.

Details Motivation: Existing methods cannot verify whether local features are correctly matched during cross-modal alignment and neglect incorrectly matched positive pairs, which limits global matching accuracy. Method: FMFA includes an Adaptive Similarity Distribution Matching (A-SDM) module that rectifies unmatched positive pairs and an Explicit Fine-grained Alignment (EFA) module that performs local alignment by sparsifying the similarity matrix and applying hard coding. Result: Experiments on three public datasets show that FMFA achieves state-of-the-art performance among global matching methods. Conclusion: By combining explicit fine-grained alignment with implicit relational reasoning, FMFA improves cross-modal alignment for text-to-image person retrieval and refines global matching without additional supervision. Abstract: Text-to-Image Person Retrieval (TIPR) is a cross-modal matching task that aims to retrieve the most relevant person images based on a given text query. The key challenge in TIPR lies in achieving effective alignment between textual and visual modalities within a common latent space. To address this challenge, prior approaches incorporate attention mechanisms for implicit cross-modal local alignment. However, they lack the ability to verify whether all local features are correctly aligned. Moreover, existing methods primarily focus on hard negative samples during model updates, with the goal of refining distinctions between positive and negative pairs, often neglecting incorrectly matched positive pairs. To alleviate these issues, we propose FMFA, a cross-modal Full-Mode Fine-grained Alignment framework, which enhances global matching through explicit fine-grained alignment and existing implicit relational reasoning (hence the term "full-mode") without requiring additional supervision. Specifically, we design an Adaptive Similarity Distribution Matching (A-SDM) module to rectify unmatched positive sample pairs. A-SDM adaptively pulls the unmatched positive pairs closer in the joint embedding space, thereby achieving more precise global alignment. Additionally, we introduce an Explicit Fine-grained Alignment (EFA) module, which makes up for the lack of verification capability of implicit relational reasoning. EFA strengthens explicit cross-modal fine-grained interactions by sparsifying the similarity matrix and employs a hard coding method for local alignment. 
Our proposed method is evaluated on three public datasets, achieving state-of-the-art performance among all global matching methods. Our code is available at https://github.com/yinhao1102/FMFA.

[77] Controllable-Continuous Color Editing in Diffusion Model via Color Mapping

Yuqi Yang,Dongliang Chang,Yuanchen Fang,Yi-Zhe Song,Zhanyu Ma,Jun Guo

Main category: cs.CV

TL;DR: Proposes a text-driven image editing method built on a color mapping module that enables precise, continuous, and controllable editing of the colors in generated images.

Details Motivation: Text-driven image editing offers limited precision and little continuous control over color, mainly because natural language is ambiguous and discrete and because interpolating text embeddings gives no precise control over the range of color change. Method: A color mapping module explicitly models the correspondence between the text embedding space and image RGB values; given an RGB value, it predicts the corresponding text embedding vector, giving precise color control while maintaining semantic consistency. Users can specify a target RGB range to generate images with continuous color variation. Result: Experiments show that the method performs well in color continuity and controllability and achieves fine-grained, continuous, and controllable color editing. Conclusion: By establishing an explicit mapping between text embeddings and RGB values, the method addresses imprecise and discontinuous color control in text-driven image editing and improves the controllability and granularity of color edits. Abstract: In recent years, text-driven image editing has made significant progress. However, due to the inherent ambiguity and discreteness of natural language, color editing still faces challenges such as insufficient precision and difficulty in achieving continuous control. Although linearly interpolating the embedding vectors of different textual descriptions can guide the model to generate a sequence of images with varying colors, this approach lacks precise control over the range of color changes in the output images. Moreover, the relationship between the interpolation coefficient and the resulting image color is unknown and uncontrollable. To address these issues, we introduce a color mapping module that explicitly models the correspondence between the text embedding space and image RGB values. This module predicts the corresponding embedding vector based on a given RGB value, enabling precise color control of the generated images while maintaining semantic consistency. Users can specify a target RGB range to generate images with continuous color variations within the desired range, thereby achieving finer-grained, continuous, and controllable color editing. Experimental results demonstrate that our method performs well in terms of color continuity and controllability.

[78] Iterative Prompt Refinement for Safer Text-to-Image Generation

Jinwoo Jeon,JunHyeok Oh,Hayeong Lee,Byung-Jun Lee

Main category: cs.CV

TL;DR: Proposes an iterative prompt refinement algorithm for text-to-image generation based on vision language models (VLMs); by incorporating visual feedback from generated images, it improves the safety of outputs while staying aligned with user intent.

Details Motivation: Existing safety methods refine text prompts with large language models (LLMs) alone and ignore the generated images, which can lead to unsafe outputs or to excessive changes to prompts that were already safe. Method: An iterative prompt refinement algorithm uses a VLM to analyze both the input prompt and the generated image and refines the prompt with visual feedback; a new dataset with textual and visual safety labels is built for supervised fine-tuning. Result: Experiments show that the method significantly improves the safety of text-to-image generation while staying consistent with user intent, with results better than or comparable to existing text-only LLM-based methods. Conclusion: Multimodal prompt refinement with visual feedback is an effective way to make text-to-image generation safer, balancing safety against preservation of user intent. Abstract: Text-to-Image (T2I) models have made remarkable progress in generating images from text prompts, but their output quality and safety still depend heavily on how prompts are phrased. Existing safety methods typically refine prompts using large language models (LLMs), but they overlook the images produced, which can result in unsafe outputs or unnecessary changes to already safe prompts. To address this, we propose an iterative prompt refinement algorithm that uses Vision Language Models (VLMs) to analyze both the input prompts and the generated images. By leveraging visual feedback, our method refines prompts more effectively, improving safety while maintaining user intent and reliability comparable to existing LLM-based approaches. Additionally, we introduce a new dataset labeled with both textual and visual safety signals using an off-the-shelf multi-modal LLM, enabling supervised fine-tuning. Experimental results demonstrate that our approach produces safer outputs without compromising alignment with user intent, offering a practical solution for generating safer T2I content. Our code is available at https://github.com/ku-dmlab/IPR. WARNING: This paper contains examples of harmful or inappropriate images generated by models.
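The refinement loop described in the abstract can be sketched as follows. The three callables (`generate`, `judge`, `rewrite`) are hypothetical stand-ins for a T2I model, a VLM safety check, and a VLM-driven prompt rewrite; the stop condition and iteration cap are assumptions, not details taken from the paper.

```python
def refine_prompt(prompt, generate, judge, rewrite, max_iters=3):
    """Regenerate until the judge accepts the (prompt, image) pair.

    generate(prompt) -> image
    judge(prompt, image) -> True if the pair is considered safe
    rewrite(prompt, image) -> revised prompt informed by the image
    The loop gives up after max_iters rewrites and returns the last attempt.
    """
    image = generate(prompt)
    for _ in range(max_iters):
        if judge(prompt, image):
            break
        prompt = rewrite(prompt, image)
        image = generate(prompt)
    return prompt, image

# Toy stand-ins: strings play the role of images, and the token "UNSAFE"
# marks content the judge should reject.
generate = lambda p: "image of " + p
judge = lambda p, img: "UNSAFE" not in img
rewrite = lambda p, img: " ".join(w for w in p.split() if w != "UNSAFE")

refined, image = refine_prompt("a castle UNSAFE at dusk", generate, judge, rewrite)
untouched, _ = refine_prompt("a cat", generate, judge, rewrite)
```

Because the judge looks at the generated image before any rewrite happens, a prompt whose output is already acceptable is returned unchanged, which mirrors the abstract's point about avoiding unnecessary edits to safe prompts.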

[79] Task-Aware Image Signal Processor for Advanced Visual Perception

Kai Chen,Jin Xiao,Leheng Zhang,Kexuan Shi,Shuhang Gu

Main category: cs.CV

TL;DR: Proposes TA-ISP, a compact task-oriented image signal processing framework that applies multi-scale modulation operators to RAW data to improve visual perception performance efficiently and at low computational cost.

Details Motivation: Existing RAW-based approaches to vision tasks either incur heavy computational overhead or have limited representational capacity, so a more efficient ISP that adapts to downstream tasks is needed. Method: TA-ISP replaces the conventional dense convolutional ISP pipeline with lightweight, multi-scale modulation operators that adjust image statistics at global, regional, and pixel levels. Result: On several daytime and nighttime RAW-domain detection and segmentation benchmarks, TA-ISP consistently improves downstream accuracy while markedly reducing parameter count and inference time. Conclusion: TA-ISP improves RAW-domain vision performance at low computational cost and is well suited to resource-constrained devices. Abstract: In recent years, there has been a growing trend in computer vision towards exploiting RAW sensor data, which preserves richer information compared to conventional low-bit RGB images. Early studies mainly focused on enhancing visual quality, while more recent efforts aim to leverage the abundant information in RAW data to improve the performance of visual perception tasks such as object detection and segmentation. However, existing approaches still face two key limitations: large-scale ISP networks impose heavy computational overhead, while methods based on tuning traditional ISP pipelines are restricted by limited representational capacity. To address these issues, we propose Task-Aware Image Signal Processing (TA-ISP), a compact RAW-to-RGB framework that produces task-oriented representations for pretrained vision models. Instead of heavy dense convolutional pipelines, TA-ISP predicts a small set of lightweight, multi-scale modulation operators that act at global, regional, and pixel scales to reshape image statistics across different spatial extents. This factorized control significantly expands the range of spatially varying transforms that can be represented while keeping memory usage, computation, and latency tightly constrained. Evaluated on several RAW-domain detection and segmentation benchmarks under both daytime and nighttime conditions, TA-ISP consistently improves downstream accuracy while markedly reducing parameter count and inference time, making it well suited for deployment on resource-constrained devices.

[80] NDLPNet: A Location-Aware Nighttime Deraining Network and a Real-World Benchmark Dataset

Huichun Liu,Xiaosong Li,Yang Liu,Xiaoqi Cheng,Haishu Tan

Main category: cs.CV

TL;DR: Proposes NDLPNet, a network for removing rain streaks under low-light nighttime conditions; its Position Perception Module (PPM) captures the spatial distribution and density of rain streaks. The authors also build the NSR dataset of 900 real nighttime image pairs, and experiments show the method outperforms existing SOTA methods in both qualitative and quantitative evaluation.

Details Motivation: Existing deraining methods target daytime scenes and perform poorly in low-light nighttime conditions because of the spatial heterogeneity of rain distribution and the light-dependent visibility of streaks, so they fall short of the needs of nighttime surveillance and autonomous driving. Method: NDLPNet introduces a Position Perception Module (PPM) that captures spatial contextual information from the input and improves the model's ability to identify and recalibrate the importance of different feature channels, removing nighttime rain streaks while preserving background detail. Result: Experiments on several existing datasets and on the newly built NSR dataset show, both qualitatively and quantitatively, that the method outperforms current state-of-the-art methods on nighttime deraining. Conclusion: By incorporating position awareness, NDLPNet markedly improves deraining under low-light nighttime conditions, and the NSR dataset provides a new benchmark that supports research on real-world scenarios. Abstract: Visual degradation caused by rain streak artifacts in low-light conditions significantly hampers the performance of nighttime surveillance and autonomous navigation. Existing image deraining techniques are primarily designed for daytime conditions and perform poorly under nighttime illumination due to the spatial heterogeneity of rain distribution and the impact of light-dependent stripe visibility. In this paper, we propose a novel Nighttime Deraining Location-enhanced Perceptual Network (NDLPNet) that effectively captures the spatial positional information and density distribution of rain streaks in low-light environments. Specifically, we introduce a Position Perception Module (PPM) to capture and leverage spatial contextual information from input data, enhancing the model's capability to identify and recalibrate the importance of different feature channels. The proposed nighttime deraining network can effectively remove the rain streaks as well as preserve the crucial background information. Furthermore, we construct a night scene rainy (NSR) dataset comprising 900 image pairs, all based on real-world nighttime scenes, providing a new benchmark for nighttime deraining research. Extensive qualitative and quantitative experimental evaluations on both existing datasets and the NSR dataset consistently demonstrate that our method outperforms the state-of-the-art (SOTA) methods in nighttime deraining tasks. The source code and dataset are available at https://github.com/Feecuin/NDLPNet.

[81] VocSegMRI: Multimodal Learning for Precise Vocal Tract Segmentation in Real-time MRI

Daiqi Liu,Tomás Arias-Vergara,Johannes Enk,Fangxu Xing,Maureen Stone,Jerry L. Prince,Jana Hutter,Andreas Maier,Jonghye Woo,Paula Andrea Pérez-Toro

Main category: cs.CV

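TL;DR: This paper proposes VocSegMRI, a multimodal framework that combines video, audio, and phonological signals through cross-attention fusion and contrastive learning and achieves state-of-the-art performance on vocal tract segmentation in real-time MRI.

Details Motivation: Existing methods rely mainly on visual information to segment articulators in real-time MRI, which limits precision, whereas synchronized acoustic and phonological signals provide useful complementary information. Method: VocSegMRI fuses video, audio, and phonological signals with a cross-attention mechanism and adds a contrastive learning objective that strengthens cross-modal representations and improves robustness when audio input is missing. Result: On a subset of the USC-75 rtMRI dataset, the method reaches a Dice score of 0.95 and an HD_95 of 4.20 mm, outperforming unimodal and multimodal baselines; ablations show that both cross-attention and contrastive learning improve segmentation precision and robustness. Conclusion: Integrating visual, acoustic, and phonological information improves the accuracy and robustness of vocal tract segmentation and demonstrates the potential of cross-modal modeling for real-time MRI analysis. Abstract: Accurately segmenting articulatory structures in real-time magnetic resonance imaging (rtMRI) remains challenging, as most existing methods rely almost entirely on visual cues. Yet synchronized acoustic and phonological signals provide complementary context that can enrich visual information and improve precision. In this paper, we introduce VocSegMRI, a multimodal framework that integrates video, audio, and phonological inputs through cross-attention fusion for dynamic feature alignment. To further enhance cross-modal representation, we incorporate a contrastive learning objective that improves segmentation performance even when the audio modality is unavailable at inference. Evaluated on a subset of the USC-75 rtMRI dataset, our approach achieves state-of-the-art performance, with a Dice score of 0.95 and a 95th percentile Hausdorff Distance (HD_95) of 4.20 mm, outperforming both unimodal and multimodal baselines. Ablation studies confirm the contributions of cross-attention and contrastive learning to segmentation precision and robustness. These results highlight the value of integrative multimodal modeling for accurate vocal tract analysis.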
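For readers unfamiliar with the reported metric, the Dice score measures overlap between a predicted and a reference mask as twice the intersection divided by the sum of the two mask sizes. The sketch below computes it on flat binary masks; returning 1.0 when both masks are empty is a common convention assumed here, and the example masks are made up for illustration.

```python
def dice_score(pred, target):
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for binary masks."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2.0 * intersection / total

pred = [1, 1, 1, 0]
target = [1, 1, 0, 0]
score = dice_score(pred, target)   # 2 * 2 / (3 + 2) = 0.8
```

A score of 1.0 means perfect overlap and 0.0 means none, so the 0.95 reported in the abstract indicates that predicted and reference masks agree on nearly all pixels.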

[82] Generative Image Coding with Diffusion Prior

Jianhui Chang

Main category: cs.CV

TL;DR: Proposes a generative coding framework based on diffusion priors that improves compression performance at low bitrates, outperforming traditional codecs and existing generative methods in visual fidelity and adaptability.

Details Motivation: As generative technology advances, image content has become more complex; traditional compression struggles to preserve subjective quality at high compression ratios, and existing generative methods fall short in visual fidelity and generalization. Method: A pre-optimized encoder produces generalized compressed-domain representations, which are combined with the internal features of a pretrained diffusion model through a lightweight adapter and an attentive fusion module; a distribution renormalization method further improves reconstruction quality. Result: Experiments show better visual fidelity than existing methods at low bitrates, compression efficiency gains of up to 79% over H.266/VVC, and effective adaptation to AI-generated and other content types. Conclusion: The framework makes effective use of pretrained diffusion models to deliver high-performance, low-bitrate image compression with good generalization and practicality. Abstract: As generative technologies advance, visual content has evolved into a complex mix of natural and AI-generated images, driving the need for more efficient coding techniques that prioritize perceptual quality. Traditional codecs and learned methods struggle to maintain subjective quality at high compression ratios, while existing generative approaches face challenges in visual fidelity and generalization. To this end, we propose a novel generative coding framework leveraging diffusion priors to enhance compression performance at low bitrates. Our approach employs a pre-optimized encoder to generate generalized compressed-domain representations, integrated with the pretrained model's internal features via a lightweight adapter and an attentive fusion module. This framework effectively leverages existing pretrained diffusion models and enables efficient adaptation to different pretrained models for new requirements with minimal retraining costs. We also introduce a distribution renormalization method to further enhance reconstruction fidelity. Extensive experiments show that our method (1) outperforms existing methods in visual fidelity across low bitrates, (2) improves compression performance by up to 79% over H.266/VVC, and (3) offers an efficient solution for AI-generated content while being adaptable to broader content types.

[83] AdaThinkDrive: Adaptive Thinking via Reinforcement Learning for Autonomous Driving

Yuechen Luo,Fang Li,Shaoqing Xu,Zhiyi Lai,Lei Yang,Qimao Chen,Ziang Luo,Zixun Xie,Shengyin Jiang,Jiaxin Liu,Long Chen,Bing Wang,Zhi-xin Yang

Main category: cs.CV

TL;DR: Proposes AdaThinkDrive, a vision-language-action (VLA) framework with a dual-mode reasoning mechanism that adaptively decides whether to use chain-of-thought (CoT) reasoning, improving autonomous driving decision quality while reducing computational overhead.

Details Motivation: Existing methods that apply CoT reasoning to autonomous driving add unnecessary computation in simple scenarios without improving decision quality and cannot choose the reasoning mode adaptively. Method: A dual-mode mechanism is used (fast answering without CoT, slow thinking with CoT); the model is pretrained on large-scale autonomous driving data, and a dataset containing both modes is introduced during supervised fine-tuning; the GRPO algorithm is combined with an Adaptive Think Reward that uses differences in trajectory quality to encourage the model to pick the appropriate mode. Result: On the Navsim benchmark the method reaches a PDMS of 90.3, exceeding the best vision-only baseline by 1.7 points; it improves PDMS by 2.0 and 1.4 over the never-think and always-think baselines, respectively, and reduces inference time by 14%. Conclusion: AdaThinkDrive balances accuracy and inference efficiency in autonomous driving by enabling CoT adaptively, trading off performance against computational cost. Abstract: Reasoning techniques such as Chain of Thought (CoT) have been widely adopted in Vision Language Action (VLA) models and demonstrate promising capabilities in end-to-end autonomous driving. However, recent efforts to integrate CoT reasoning often fall short in simple scenarios, introducing unnecessary computational overhead without improving decision quality. To address this, we propose AdaThinkDrive, a novel VLA framework with a dual mode reasoning mechanism inspired by fast and slow thinking. First, our framework is pretrained on large scale autonomous driving (AD) scenarios using both question answering (QA) and trajectory datasets to acquire world knowledge and driving commonsense. During supervised fine tuning (SFT), we introduce a two mode dataset, fast answering (w/o CoT) and slow thinking (with CoT), enabling the model to distinguish between scenarios that require reasoning. Furthermore, an Adaptive Think Reward strategy is proposed in conjunction with the Group Relative Policy Optimization (GRPO), which rewards the model for selectively applying CoT by comparing trajectory quality across different reasoning modes. Extensive experiments on the Navsim benchmark show that AdaThinkDrive achieves a PDMS of 90.3, surpassing the best vision only baseline by 1.7 points. Moreover, ablations show that AdaThinkDrive surpasses both the never Think and always Think baselines, improving PDMS by 2.0 and 1.4, respectively. It also reduces inference time by 14% compared to the always Think baseline, demonstrating its ability to balance accuracy and efficiency through adaptive reasoning.
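GRPO scores each sampled response relative to the other responses drawn for the same input, which removes the need for a learned value function. A minimal sketch of that group-relative normalization is shown below; the reward values are invented, the use of the population standard deviation is an assumption (implementations differ), and the paper's Adaptive Think Reward itself is not reproduced here.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)   # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]

# Four hypothetical rollouts for the same driving scene, e.g. two with CoT
# and two without, each scored by trajectory quality.
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
```

Rollouts that beat the group average receive positive advantages and the rest negative ones, so a reward that favors the cheaper reasoning mode whenever trajectory quality is equal would push the policy toward skipping CoT in simple scenes.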

[84] Diving into Mitigating Hallucinations from a Vision Perspective for Large Vision-Language Models

Weihang Wang,Xinhao Li,Ziyue Wang,Yan Pang,Jielei Zhang,Peiyi Li,Qiang Zhang,Longwen Gao

Main category: cs.CV

TL;DR: This paper introduces VHBench-10, a new benchmark for evaluating fine-grained object hallucination in large vision-language models (LVLMs), and proposes VisionWeaver, which uses a context-aware routing network to dynamically aggregate visual features from multiple experts, significantly reducing hallucinations and improving model performance.

Details Motivation: Visual encoders trained under different paradigms carry different inductive biases and therefore show different hallucination behavior in LVLMs; existing benchmarks fail to capture this diversity, so finer-grained evaluation and improved methods are needed. Method: The VHBench-10 benchmark contains roughly 10,000 samples covering ten fine-grained hallucination categories; VisionWeaver uses global visual features to generate routing signals and dynamically fuses visual features from multiple specialized experts. Result: Experiments confirm that different visual encoders have distinct hallucination characteristics; VisionWeaver reduces hallucinations and improves overall model performance. Conclusion: The choice of visual encoder strongly affects LVLM hallucination; VHBench-10 provides an effective tool for fine-grained hallucination analysis, and VisionWeaver improves reliability through context-aware multi-expert feature fusion. Abstract: Object hallucination in Large Vision-Language Models (LVLMs) significantly impedes their real-world applicability. As the primary component for accurately interpreting visual information, the choice of visual encoder is pivotal. We hypothesize that the diverse training paradigms employed by different visual encoders instill them with distinct inductive biases, which leads to their diverse hallucination performances. Existing benchmarks typically focus on coarse-grained hallucination detection and fail to capture the diverse hallucinations elaborated in our hypothesis. To systematically analyze these effects, we introduce VHBench-10, a comprehensive benchmark with approximately 10,000 samples for evaluating LVLMs across ten fine-grained hallucination categories. Our evaluations confirm that encoders exhibit unique hallucination characteristics. Building on these insights and the suboptimality of simple feature fusion, we propose VisionWeaver, a novel Context-Aware Routing Network. It employs global visual features to generate routing signals, dynamically aggregating visual features from multiple specialized experts. Comprehensive experiments confirm the effectiveness of VisionWeaver in significantly reducing hallucinations and improving overall model performance.
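The routing step can be pictured as a softmax-weighted sum of expert features. In the sketch below the routing logits are passed in directly, whereas in the paper they are derived from global visual features; the function names and the soft (rather than top-k) routing are assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_and_fuse(routing_logits, expert_features):
    """Weight each expert's feature vector by its routing probability and sum."""
    weights = softmax(routing_logits)
    dim = len(expert_features[0])
    fused = [
        sum(w * feat[d] for w, feat in zip(weights, expert_features))
        for d in range(dim)
    ]
    return weights, fused

# Two hypothetical visual experts with equal routing logits.
weights, fused = route_and_fuse([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Unlike fixed concatenation or averaging, the weights here change per input, which is what lets a router lean on whichever encoder is less prone to a given kind of hallucination.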

[85] Morphology-optimized Multi-Scale Fusion: Combining Local Artifacts and Mesoscopic Semantics for Deepfake Detection and Localization

Chao Shuai,Gaojian Wang,Kun Pan,Tong Wu,Fanli Jin,Haohan Tan,Mengxiang Li,Zhenguang Liu,Feng Lin,Kui Ren

Main category: cs.CV

TL;DR: Proposes a novel deepfake localization method that predicts manipulated regions independently from local and global perspectives and fuses the results with morphological operations, suppressing noise and improving spatial coherence.

Details Motivation: Existing deepfake detection methods often ignore the complementarity of local detail and global semantic context when localizing forged regions, and naive fusion of multi-branch outputs tends to amplify noise and errors, leading to poor localization. Method: Forged regions are predicted independently from local and global perspectives, and the two outputs are fused with morphological operations to suppress noise and improve spatial coherence. Result: Extensive experiments show clear gains in the accuracy and robustness of forgery localization, with each module contributing to overall performance. Conclusion: Separating local and global prediction and designing an effective fusion strategy significantly improves deepfake localization accuracy and offers a new direction for future work. Abstract: While the pursuit of higher accuracy in deepfake detection remains a central goal, there is an increasing demand for precise localization of manipulated regions. Despite the remarkable progress made in classification-based detection, accurately localizing forged areas remains a significant challenge. A common strategy is to incorporate forged region annotations during model training alongside manipulated images. However, such approaches often neglect the complementary nature of local detail and global semantic context, resulting in suboptimal localization performance. Moreover, an often-overlooked aspect is the fusion strategy between local and global predictions. Naively combining the outputs from both branches can amplify noise and errors, thereby undermining the effectiveness of the localization. To address these issues, we propose a novel approach that independently predicts manipulated regions using both local and global perspectives. We employ morphological operations to fuse the outputs, effectively suppressing noise while enhancing spatial coherence. Extensive experiments reveal the effectiveness of each module in improving the accuracy and robustness of forgery localization.
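One way morphological operations can clean up a fused mask is sketched below: the local and global binary masks are combined by union and then passed through a morphological opening (erosion followed by dilation) with a 3x3 window. The abstract does not say which operations or fusion rule the authors use, so this particular recipe is an assumption chosen to show the noise-suppression effect.

```python
def _morph(mask, op):
    """Apply `op` (min or max) over each pixel's 3x3 neighborhood, clipped at borders."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                mask[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
            ]
            out[y][x] = op(window)
    return out

def erode(mask):
    return _morph(mask, min)

def dilate(mask):
    return _morph(mask, max)

def fuse(local_mask, global_mask):
    """Union of the two masks followed by an opening to drop isolated pixels."""
    union = [[a | b for a, b in zip(r1, r2)] for r1, r2 in zip(local_mask, global_mask)]
    return dilate(erode(union))

# The local branch has a hole and a stray pixel; the global branch is a clean block.
local_mask = [
    [0, 0, 0, 0, 1],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
global_mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
fused = fuse(local_mask, global_mask)
```

In this toy case the union alone would keep the stray pixel in the top-right corner, while the opening removes it and leaves the coherent 3x3 region intact.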

[86] Dense Video Understanding with Gated Residual Tokenization

Haichao Zhang,Wenhao Chai,Shwai He,Ang Li,Yun Fu

Main category: cs.CV

TL;DR: Introduces Dense Video Understanding (DVU) and the DIVE benchmark for efficient high-frame-rate video understanding, using Gated Residual Tokenization (GRT) to reduce redundant computation and token overhead.

Details Motivation: Existing video large language models and benchmarks mostly sample at low frame rates, discarding dense temporal information and struggling with tasks such as lecture comprehension that require precise temporal alignment. Method: Proposes the GRT framework, comprising motion-compensated inter-gated tokenization and semantic-scene intra-tokenization merging, which respectively skip and merge tokens in static regions to reduce computation and token count. Result: Experiments on the DIVE benchmark show that GRT outperforms larger VLLM baselines and improves as FPS increases. Conclusion: Dense temporal information is crucial for video understanding, and GRT enables efficient, scalable high-FPS video understanding. Abstract: High temporal resolution is essential for capturing fine-grained details in video understanding. However, current video large language models (VLLMs) and benchmarks mostly rely on low-frame-rate sampling, such as uniform sampling or keyframe selection, discarding dense temporal information. This compromise avoids the high cost of tokenizing every frame, which otherwise leads to redundant computation and linear token growth as video length increases. While this trade-off works for slowly changing content, it fails for tasks like lecture comprehension, where information appears in nearly every frame and requires precise temporal alignment. To address this gap, we introduce Dense Video Understanding (DVU), which enables high-FPS video comprehension by reducing both tokenization time and token overhead. Existing benchmarks are also limited, as their QA pairs focus on coarse content changes. We therefore propose DIVE (Dense Information Video Evaluation), the first benchmark designed for dense temporal reasoning. To make DVU practical, we present Gated Residual Tokenization (GRT), a two-stage framework: (1) Motion-Compensated Inter-Gated Tokenization uses pixel-level motion estimation to skip static regions during tokenization, achieving sub-linear growth in token count and compute. (2) Semantic-Scene Intra-Tokenization Merging fuses tokens across static regions within a scene, further reducing redundancy while preserving dynamic semantics. Experiments on DIVE show that GRT outperforms larger VLLM baselines and scales positively with FPS. These results highlight the importance of dense temporal information and demonstrate that GRT enables efficient, scalable high-FPS video understanding.
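
As a rough illustration of the gating idea in stage (1), the sketch below selects which patches of a frame would need to be tokenized. It is a simplification under stated assumptions: the mean absolute difference between consecutive frames stands in for the paper's pixel-level motion estimation, and `changed_patches`, `patch`, and `tau` are hypothetical names and parameters, not taken from the GRT implementation.

```python
from typing import List, Tuple

Frame = List[List[float]]  # grayscale frame as rows of pixel intensities


def changed_patches(prev: Frame, curr: Frame, patch: int, tau: float) -> List[Tuple[int, int]]:
    """Return (patch_row, patch_col) indices whose mean absolute change exceeds tau.

    Patches not returned are treated as static and would be skipped
    during tokenization in this simplified model.
    """
    h, w = len(curr), len(curr[0])
    keep: List[Tuple[int, int]] = []
    for pr in range(0, h, patch):
        for pc in range(0, w, patch):
            diffs = [
                abs(curr[r][c] - prev[r][c])
                for r in range(pr, min(pr + patch, h))
                for c in range(pc, min(pc + patch, w))
            ]
            if sum(diffs) / len(diffs) > tau:
                keep.append((pr // patch, pc // patch))
    return keep
```

In this toy model the number of patches tokenized per frame depends on how much of the frame changed rather than on the frame size, which is the intuition behind token counts growing more slowly than the number of frames.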

[87] CETUS: Causal Event-Driven Temporal Modeling With Unified Variable-Rate Scheduling

Hanfang Liang,Bing Wang,Shizhen Zhang,Wen Jiang,Yizhuo Yang,Weixiang Guo,Shenghai Yuan

Main category: cs.CV

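TL;DR: Proposes a new architecture, Variable-Rate Spatial Event Mamba, that processes raw event streams directly without intermediate representations, achieving low latency and high efficiency for highly dynamic vision tasks.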

Details Motivation: Existing methods usually convert event streams into intermediate representations such as frames, voxel grids, or point clouds, which require predefined time windows and introduce latency; pointwise detection methods are computationally expensive and struggle to run in real time. Method: Designs a lightweight causal spatial neighborhood encoder to capture local geometric relations and uses Mamba-based state space models for temporal modeling with linear complexity; at inference, a controller adapts the processing speed to the event rate. Result: Achieves efficient direct processing of raw event streams, markedly reducing window latency and inference latency while remaining real-time. Conclusion: The proposed method outperforms conventional approaches on highly dynamic vision tasks, combining efficiency, scalability, and low latency. Abstract: Event cameras capture asynchronous pixel-level brightness changes with microsecond temporal resolution, offering unique advantages for high-speed vision tasks. Existing methods often convert event streams into intermediate representations such as frames, voxel grids, or point clouds, which inevitably require predefined time windows and thus introduce window latency. Meanwhile, the high computational cost of pointwise detection methods prevents real-time efficiency. To overcome these limitations, we propose the Variable-Rate Spatial Event Mamba, a novel architecture that directly processes raw event streams without intermediate representations. Our method introduces a lightweight causal spatial neighborhood encoder to efficiently capture local geometric relations, followed by Mamba-based state space models for scalable temporal modeling with linear complexity. During inference, a controller adaptively adjusts the processing speed according to the event rate, achieving an optimal balance between window latency and inference latency.

[88] BWCache: Accelerating Video Diffusion Transformers through Block-Wise Caching

Hanshuai Cui,Zhiqing Tang,Zhifei Xu,Zhi Yao,Wenyi Zeng,Weijia Jia

Main category: cs.CV

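TL;DR: Proposes Block-Wise Caching (BWCache), a training-free method that accelerates video generation by dynamically caching and reusing DiT block features across diffusion timesteps, achieving up to 2.24× speedup while preserving visual quality.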

Details Motivation: Diffusion Transformers (DiTs) excel at video generation, but their sequential denoising process causes latency, and existing acceleration methods either degrade quality or reuse features at an unsuitable granularity. Method: Analysis shows that DiT blocks are the main source of inference latency and that their feature variations follow a U-shaped pattern, with high similarity at intermediate timesteps. Building on this, BWCache dynamically caches and reuses DiT block features and introduces a similarity indicator that triggers reuse when the feature difference between adjacent timesteps falls below a threshold, reducing redundant computation. Result: Experiments on several video diffusion models show that BWCache achieves up to 2.24× speedup with visual quality comparable to the original models. Conclusion: BWCache is an effective training-free acceleration method that markedly improves the inference efficiency of DiT-based video generation without sacrificing visual quality. Abstract: Recent advancements in Diffusion Transformers (DiTs) have established them as the state-of-the-art method for video generation. However, their inherently sequential denoising process results in inevitable latency, limiting real-world applicability. Existing acceleration methods either compromise visual quality due to architectural modifications or fail to reuse intermediate features at proper granularity. Our analysis reveals that DiT blocks are the primary contributors to inference latency. Across diffusion timesteps, the feature variations of DiT blocks exhibit a U-shaped pattern with high similarity during intermediate timesteps, which suggests substantial computational redundancy. In this paper, we propose Block-Wise Caching (BWCache), a training-free method to accelerate DiT-based video generation. BWCache dynamically caches and reuses features from DiT blocks across diffusion timesteps. Furthermore, we introduce a similarity indicator that triggers feature reuse only when the differences between block features at adjacent timesteps fall below a threshold, thereby minimizing redundant computations while maintaining visual fidelity. Extensive experiments on several video diffusion models demonstrate that BWCache achieves up to 2.24× speedup with comparable visual quality.
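
The caching rule described in this entry can be sketched with a toy denoising loop. Everything below is an assumption for illustration: the blocks are plain Python callables rather than DiT blocks, the similarity indicator is a relative L1 difference, a reused block is refreshed at the following timestep, and the names `relative_l1` and `run_with_block_cache` are hypothetical. The abstract does not specify these details.

```python
from typing import Callable, List, Optional, Tuple

Vector = List[float]
Block = Callable[[Vector, int], Vector]  # (features, timestep) -> features


def relative_l1(a: Vector, b: Vector) -> float:
    """Sum of absolute differences, normalized by the magnitude of b."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(abs(y) for y in b) or 1e-12
    return num / den


def run_with_block_cache(blocks: List[Block], x: Vector, num_steps: int,
                         threshold: float, step_size: float = 0.1) -> Tuple[Vector, int]:
    """Toy sampler; returns the final latent and the number of skipped block calls."""
    cached: List[Optional[Vector]] = [None] * len(blocks)  # last computed output per block
    reuse_next = [False] * len(blocks)
    skipped = 0
    for t in range(num_steps):
        h = x
        for i, block in enumerate(blocks):
            if reuse_next[i] and cached[i] is not None:
                h = cached[i]           # reuse cached features, skip the computation
                reuse_next[i] = False   # assumed policy: recompute at the next timestep
                skipped += 1
                continue
            out = block(h, t)
            if cached[i] is not None:
                # Indicator: small change between adjacent timesteps permits reuse next step.
                reuse_next[i] = relative_l1(out, cached[i]) < threshold
            cached[i] = out
            h = out
        x = [xi - step_size * hi for xi, hi in zip(x, h)]  # toy latent update
    return x, skipped
```

With `threshold=0.0` the indicator never fires and the loop matches an uncached run; raising the threshold trades fidelity for fewer block evaluations, which mirrors the speed-quality balance the entry describes.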

[89] Bridging the Synthetic-Real Gap: Supervised Domain Adaptation for Robust Spacecraft 6-DoF Pose Estimation

Inder Pal Singh,Nidhal Eddine Chenni,Abd El Rahman Shabayek,Arunkumar Rathinam,Djamila Aouada

Main category: cs.CV

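TL;DR: Proposes the first supervised domain adaptation (SDA) framework for keypoint regression in spacecraft pose estimation (SPE); built on the LIRR paradigm, it jointly optimizes domain-invariant representations and task-specific risk using synthetic data and a small amount of labeled real data, and markedly outperforms existing methods on the SPEED+ benchmark.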

Details Motivation: Existing unsupervised domain adaptation methods underperform when only a small number of labeled target samples are available, and the synthetic-to-real domain gap causes hybrid pipelines to degrade sharply on real images. Method: Building on the LIRR paradigm, proposes a supervised domain adaptation framework that jointly optimizes domain-invariant feature representations and task-specific risk, training on labeled synthetic data together with limited labeled real data. Result: Experiments on the SPEED+ benchmark show that with only 5% labeled real data the method matches or exceeds oracle baselines trained on larger fractions of labeled data, and outperforms source-only training and fine-tuning. Conclusion: The proposed lightweight, backbone-agnostic, and computationally efficient SDA framework narrows the synthetic-to-real domain gap, offering a practical path toward robust, deployable spacecraft pose estimation in real space environments. Abstract: Spacecraft Pose Estimation (SPE) is a fundamental capability for autonomous space operations such as rendezvous, docking, and in-orbit servicing. Hybrid pipelines that combine object detection, keypoint regression, and Perspective-n-Point (PnP) solvers have recently achieved strong results on synthetic datasets, yet their performance deteriorates sharply on real or lab-generated imagery due to the persistent synthetic-to-real domain gap. Existing unsupervised domain adaptation approaches aim to mitigate this issue but often underperform when a modest number of labeled target samples are available. In this work, we propose the first Supervised Domain Adaptation (SDA) framework tailored for SPE keypoint regression. Building on the Learning Invariant Representation and Risk (LIRR) paradigm, our method jointly optimizes domain-invariant representations and task-specific risk using both labeled synthetic and limited labeled real data, thereby reducing generalization error under domain shift. Extensive experiments on the SPEED+ benchmark demonstrate that our approach consistently outperforms source-only, fine-tuning, and oracle baselines. Notably, with only 5% labeled target data, our method matches or surpasses oracle performance trained on larger fractions of labeled data. The framework is lightweight, backbone-agnostic, and computationally efficient, offering a practical pathway toward robust and deployable spacecraft pose estimation in real-world space environments.

[90] SWA-PF: Semantic-Weighted Adaptive Particle Filter for Memory-Efficient 4-DoF UAV Localization in GNSS-Denied Environments

Jiayu Yuan,Ming Dai,Enhui Zheng,Chao Su,Nanxing Chen,Qiming Hu,Shibo Zhu,Yibin Cao

Main category: cs.CV

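TL;DR: Proposes a UAV localization method based on a Semantic-Weighted Adaptive Particle Filter (SWA-PF) which, together with the multi-altitude flight dataset MAFS, achieves efficient and robust 4-DoF pose estimation in GNSS-denied environments.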

Details Motivation: Existing retrieval-based visual UAV localization methods are limited in dataset availability, real-time performance, environmental adaptability, and generalization, and perform poorly in dynamic or temporally varying environments in particular. Method: Proposes a semantic weighting mechanism and an optimized particle filtering architecture that fuse semantic features from UAV images and satellite imagery, and builds the large-scale Multi-Altitude Flight Segments dataset (MAFS) for training and validation. Result: Delivers a 10x gain in computational efficiency over conventional feature extraction methods, keeps global positioning error below 10 meters, and completes 4-DoF pose estimation within seconds, including with low-resolution satellite maps. Conclusion: SWA-PF combined with the MAFS dataset markedly improves UAV localization accuracy and efficiency in complex, dynamic, and GNSS-denied environments, with good prospects for practical use. Abstract: Vision-based Unmanned Aerial Vehicle (UAV) localization systems have been extensively investigated for Global Navigation Satellite System (GNSS)-denied environments. However, existing retrieval-based approaches face limitations in dataset availability and persistent challenges including suboptimal real-time performance, environmental sensitivity, and limited generalization capability, particularly in dynamic or temporally varying environments. To overcome these limitations, we present a large-scale Multi-Altitude Flight Segments dataset (MAFS) for variable altitude scenarios and propose a novel Semantic-Weighted Adaptive Particle Filter (SWA-PF) method. This approach integrates robust semantic features from both UAV-captured images and satellite imagery through two key innovations: a semantic weighting mechanism and an optimized particle filtering architecture. Evaluated using our dataset, the proposed method achieves a 10x computational efficiency gain over feature extraction methods, maintains global positioning errors below 10 meters, and enables rapid 4-degree-of-freedom (4-DoF) pose estimation within seconds using accessible low-resolution satellite maps. Code and dataset will be available at https://github.com/YuanJiayuuu/SWA-PF.

[91] Masked Feature Modeling Enhances Adaptive Segmentation

Wenlve Zhou,Zhiheng Zhou,Tiantao Xian,Yikui Zhai,Weibin Wu,Biyun Ma

Main category: cs.CV

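TL;DR: Proposes Masked Feature Modeling (MFM), a new auxiliary task for unsupervised domain-adaptive semantic segmentation that masks and reconstructs features in feature space to improve performance on the target domain.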

Details Motivation: Existing masked modeling methods are underexplored in unsupervised domain-adaptive semantic segmentation because of architectural incompatibility and misaligned optimization objectives, so an auxiliary task is needed that is compatible with standard architectures and aligned with the main objective. Method: Proposes MFM, which masks and reconstructs features directly in feature space and introduces a lightweight Rebuilder module to assist reconstruction; the segmentation decoder classifies the reconstructed features, tightly coupling the auxiliary task with the main task. Result: Experiments across multiple architectures and UDA benchmarks show that MFM consistently improves semantic segmentation performance with no extra computation at inference. Conclusion: MFM is a simple, efficient, and generalizable auxiliary training strategy for unsupervised domain-adaptive semantic segmentation. Abstract: Unsupervised domain adaptation (UDA) for semantic segmentation aims to transfer models from a labeled source domain to an unlabeled target domain. While auxiliary self-supervised tasks, particularly contrastive learning, have improved feature discriminability, masked modeling approaches remain underexplored in this setting, largely due to architectural incompatibility and misaligned optimization objectives. We propose Masked Feature Modeling (MFM), a novel auxiliary task that performs feature masking and reconstruction directly in the feature space. Unlike existing masked modeling methods that reconstruct low-level inputs or perceptual features (e.g., HOG or visual tokens), MFM aligns its learning target with the main segmentation task, ensuring compatibility with standard architectures like DeepLab and DAFormer without modifying the inference pipeline. To facilitate effective reconstruction, we introduce a lightweight auxiliary module, Rebuilder, which is trained jointly but discarded during inference, adding zero computational overhead at test time. Crucially, MFM leverages the segmentation decoder to classify the reconstructed features, tightly coupling the auxiliary objective with the pixel-wise prediction task to avoid interference with the primary task. Extensive experiments across various architectures and UDA benchmarks demonstrate that MFM consistently enhances segmentation performance, offering a simple, efficient, and generalizable strategy for unsupervised domain-adaptive semantic segmentation.

[92] Data-Efficient Spectral Classification of Hyperspectral Data Using MiniROCKET and HDC-MiniROCKET

Nick Theisen,Kenny Schlegel,Dietrich Paulus,Peer Neubert

Main category: cs.CV

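TL;DR: Studies the performance of MiniROCKET and HDC-MiniROCKET for purely spectral classification of hyperspectral images when training data is limited, and finds that they outperform the current state-of-the-art model, 1D-Justo-LiuNet.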

Details Motivation: Because the existing efficient model 1D-Justo-LiuNet degrades when training data is scarce, the paper seeks a more robust spectral classification method. Method: Applies MiniROCKET and HDC-MiniROCKET, whose feature extraction has no trainable parameters, to purely spectral classification and compares them with 1D-Justo-LiuNet. Result: MiniROCKET performs better in limited-data scenarios and is mostly on par in the general case. Conclusion: MiniROCKET is a spectral classification method for hyperspectral data that is more robust to small sample sizes and can help improve future spatial-spectral approaches. Abstract: The classification of pixel spectra of hyperspectral images, i.e., spectral classification, is used in many fields, ranging from agricultural and medical to remote sensing applications, and is currently also expanding to areas such as autonomous driving. Even though for full hyperspectral images the best-performing methods exploit spatial-spectral information, performing classification solely on spectral information has its own advantages, e.g., smaller model size and thus less data required for training. Moreover, spectral information is complementary to spatial information, and improvements on either part can be used to improve spatial-spectral approaches in the future. Recently, 1D-Justo-LiuNet was proposed as a particularly efficient model with very few parameters, which currently defines the state of the art in spectral classification. However, we show that with limited training data the model performance deteriorates. Therefore, we investigate MiniROCKET and HDC-MiniROCKET for spectral classification to mitigate that problem. The model extracts well-engineered features without trainable parameters in the feature extraction part and is therefore less vulnerable to limited training data. We show that even though MiniROCKET has more parameters, it outperforms 1D-Justo-LiuNet in limited data scenarios and is mostly on par with it in the general case.

[93] Semi-MoE: Mixture-of-Experts meets Semi-Supervised Histopathology Segmentation

Nguyen Lan Vi Vu,Thanh-Huy Nguyen,Thien Nguyen,Daisuke Kihara,Tianyang Wang,Xingjian Li,Min Xu

Main category: cs.CV

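TL;DR: Proposes Semi-MOE, the first multi-task Mixture-of-Experts framework for semi-supervised histopathology image segmentation; with three specialized expert networks and a dynamic pseudo-labeling mechanism, it markedly outperforms existing methods in low-label settings.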

Details Motivation: Existing semi-supervised methods handle poorly the noisy pseudo-labels caused by ambiguous gland boundaries and morphological misclassification, making it hard to exploit limited labeled data for precise segmentation. Method: Proposes the Semi-MOE framework, comprising three specialized expert networks (main segmentation, signed distance field regression, and boundary prediction) whose features are dynamically aggregated by a Multi-Gating Pseudo-labeling module, together with an Adaptive Multi-Objective Loss that balances the losses dynamically without manual tuning. Result: Experiments on the GlaS and CRAG benchmarks show that the method outperforms current state-of-the-art approaches in low-label settings. Conclusion: MoE-based architectures show great potential for semi-supervised histopathology image segmentation; Semi-MOE markedly improves pseudo-label quality and model robustness through multi-task collaboration and dynamic optimization. Abstract: Semi-supervised learning has been employed to alleviate the need for extensive labeled data for histopathology image segmentation, but existing methods struggle with noisy pseudo-labels due to ambiguous gland boundaries and morphological misclassification. This paper introduces Semi-MOE, to the best of our knowledge, the first multi-task Mixture-of-Experts framework for semi-supervised histopathology image segmentation. Our approach leverages three specialized expert networks: a main segmentation expert, a signed distance field regression expert, and a boundary prediction expert, each dedicated to capturing distinct morphological features. Subsequently, the Multi-Gating Pseudo-labeling module dynamically aggregates expert features, enabling a robust fuse-and-refine pseudo-labeling mechanism. Furthermore, to eliminate manual tuning while dynamically balancing multiple learning objectives, we propose an Adaptive Multi-Objective Loss. Extensive experiments on GlaS and CRAG benchmarks show that our method outperforms state-of-the-art approaches in low-label settings, highlighting the potential of MoE-based architectures in advancing semi-supervised segmentation. Our code is available at https://github.com/vnlvi2k3/Semi-MoE.

[94] Consistent View Alignment Improves Foundation Models for 3D Medical Image Segmentation

Puru Vaish,Felix Meister,Tobias Heimann,Christoph Brune,Jelmer M. Wolterink

Main category: cs.CV

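TL;DR: Proposes Consistent View Alignment, a self-supervised learning method that learns effective representations by explicitly aligning different views of the data, and achieves strong results in the MICCAI 2025 SSL3D challenge.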

Details Motivation: Many existing representation learning methods assume that uncorrelated views of a data point suffice to learn meaningful representations, but the authors argue that meaningful latent structure does not emerge naturally and must be explicitly induced. Method: Proposes Consistent View Alignment, which aligns representations from different views of the data to integrate complementary information while avoiding false positives. Result: Experiments show improved downstream performance; in the MICCAI 2025 SSL3D challenge the method took first and second place using the Primus ViT and ResEnc networks, respectively. Conclusion: Effective representation learning requires explicit, structured view alignment rather than reliance on uncorrelated views alone. Abstract: Many recent approaches in representation learning implicitly assume that uncorrelated views of a data point are sufficient to learn meaningful representations for various downstream tasks. In this work, we challenge this assumption and demonstrate that meaningful structure in the latent space does not emerge naturally. Instead, it must be explicitly induced. We propose a method that aligns representations from different views of the data to align complementary information without inducing false positives. Our experiments show that our proposed self-supervised learning method, Consistent View Alignment, improves performance for downstream tasks, highlighting the critical role of structured view alignment in learning effective representations. Our method achieved first and second place in the MICCAI 2025 SSL3D challenge when using a Primus vision transformer and ResEnc convolutional neural network, respectively. The code and pretrained model weights are released at https://github.com/Tenbatsu24/LatentCampus.

[95] SpecDiff: Accelerating Diffusion Model Inference with Self-Speculation

Jiayi Pan,Jiaming Xu,Yongkang Zhou,Guohao Dai

Main category: cs.CV

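TL;DR: Proposes SpecDiff, a novel training-free multi-level feature caching strategy that introduces future information through a self-speculation mechanism and combines it with historical information to improve diffusion model inference efficiency, achieving substantial speedups on Stable Diffusion 3, 3.5, and FLUX with minimal quality loss.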

Details Motivation: Existing feature caching methods rely only on historical information, creating a speed-accuracy trade-off bottleneck that limits diffusion model inference efficiency. Method: Proposes a feature selection algorithm based on self-speculative information and a multi-level feature classification algorithm based on importance scores, using the similarity of the same time step across different iteration counts to introduce future information, enabling dynamic importance scoring and tiered computation. Result: On an NVIDIA A800-80GB GPU, SpecDiff achieves average speedups of 2.80×, 2.74×, and 3.17× over RFlow on Stable Diffusion 3, 3.5, and FLUX, respectively, with negligible image quality loss. Conclusion: By fusing speculative and historical information, SpecDiff breaks the speed-accuracy trade-off bottleneck and advances the Pareto frontier of efficient diffusion model inference. Abstract: Feature caching has recently emerged as a promising method for diffusion model acceleration. It effectively alleviates the inefficiency problem caused by high computational requirements by caching similar features in the inference process of the diffusion model. In this paper, we analyze existing feature caching methods from the perspective of information utilization, and point out that relying solely on historical information will lead to constrained accuracy and speed performance. We propose a novel paradigm that introduces future information via self-speculation based on the information similarity at the same time step across different iteration times. Based on this paradigm, we present SpecDiff, a training-free multi-level feature caching strategy including a cached feature selection algorithm and a multi-level feature classification algorithm. (1) Feature selection algorithm based on self-speculative information. SpecDiff determines a dynamic importance score for each token based on self-speculative information and historical information, and performs cached feature selection through the importance score. (2) Multi-level feature classification algorithm based on feature importance scores. SpecDiff classifies tokens by leveraging the differences in feature importance scores and introduces a multi-level feature calculation strategy. Extensive experiments show that SpecDiff achieves average 2.80×, 2.74×, and 3.17× speedup with negligible quality loss in Stable Diffusion 3, 3.5, and FLUX compared to RFlow on NVIDIA A800-80GB GPU. By merging speculative and historical information, SpecDiff overcomes the speedup-accuracy trade-off bottleneck, pushing the Pareto frontier of speedup and accuracy in the efficient diffusion model inference.

[96] EDITS: Enhancing Dataset Distillation with Implicit Textual Semantics

Qianxin Xia,Jiawei Du,Guoming Lu,Zhiyong Shu,Jielei Wang

Main category: cs.CV

TL;DR: Proposes EDITS, a new dataset distillation framework that exploits the implicit textual semantics in images to improve distillation.

Details Motivation: Conventional dataset distillation methods mainly capture low-level visual features and neglect the high-level semantic and structural information in images. Method: External texts generated by a vision-language model are fused with image features through a Global Semantic Query module to form a prior clustered buffer; Local Semantic Awareness selects representative samples to build image and text prototypes, the latter produced by guiding a large language model with carefully crafted prompts; finally, a Dual Prototype Guidance strategy generates the final synthetic dataset through a diffusion model. Result: Extensive experiments confirm the method's effectiveness. Conclusion: The EDITS framework effectively exploits implicit textual semantics in images, markedly improving the quality and efficiency of dataset distillation. Abstract: Dataset distillation aims to synthesize a compact dataset from the original large-scale one, enabling highly efficient learning while preserving competitive model performance. However, traditional techniques primarily capture low-level visual features, neglecting the high-level semantic and structural information inherent in images. In this paper, we propose EDITS, a novel framework that exploits the implicit textual semantics within the image data to achieve enhanced distillation. First, external texts generated by a Vision Language Model (VLM) are fused with image features through a Global Semantic Query module, forming the prior clustered buffer. Local Semantic Awareness then selects representative samples from the buffer to construct image and text prototypes, with the latter produced by guiding a Large Language Model (LLM) with a meticulously crafted prompt. Ultimately, a Dual Prototype Guidance strategy generates the final synthetic dataset through a diffusion model. Extensive experiments confirm the effectiveness of our method. Source code is available at https://github.com/einsteinxia/EDITS.

[97] LamiGauss: Pitching Radiative Gaussian for Sparse-View X-ray Laminography Reconstruction

Chu Chen,Ander Biguri,Jean-Michel Morel,Raymond H. Chan,Carola-Bibiane Schönlieb,Jizhou Li

Main category: cs.CV

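TL;DR: Proposes LamiGauss, an X-ray computed laminography reconstruction algorithm that combines Gaussian splatting with dedicated geometric modeling to reconstruct plate-like structures efficiently and accurately under extremely sparse projections, outperforming full-data iterative methods with only 3% of the views.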

Details Motivation: Conventional CT is limited by geometric constraints when imaging plate-like structures, while X-ray computed laminography (CL) reconstructs poorly under sparse views, so an efficient, high-quality reconstruction algorithm is needed. Method: Proposes LamiGauss, which combines Gaussian splatting radiative rasterization with a detector-to-world transformation model that accounts for the laminographic tilt angle, and uses an initialization strategy that filters out common artifacts from the preliminary reconstruction, avoiding redundant Gaussians and concentrating capacity on genuine structure. Result: Performs well on both synthetic and real data; using only 3% of the full views, it surpasses an iterative method optimized on the full dataset. Conclusion: LamiGauss markedly improves the quality and efficiency of sparse-view CL reconstruction, providing an advanced solution for non-destructive inspection of plate-like structures such as microchips and composite battery materials. Abstract: X-ray Computed Laminography (CL) is essential for non-destructive inspection of plate-like structures in applications such as microchips and composite battery materials, where traditional computed tomography (CT) struggles due to geometric constraints. However, reconstructing high-quality volumes from laminographic projections remains challenging, particularly under highly sparse-view acquisition conditions. In this paper, we propose a reconstruction algorithm, namely LamiGauss, that combines Gaussian Splatting radiative rasterization with a dedicated detector-to-world transformation model incorporating the laminographic tilt angle. LamiGauss leverages an initialization strategy that explicitly filters out common laminographic artifacts from the preliminary reconstruction, preventing redundant Gaussians from being allocated to false structures and thereby concentrating model capacity on representing the genuine object. Our approach effectively optimizes directly from sparse projections, enabling accurate and efficient reconstruction with limited data. Extensive experiments on both synthetic and real datasets demonstrate the effectiveness and superiority of the proposed method over existing techniques. LamiGauss uses only 3% of full views to achieve superior performance over the iterative method optimized on a full dataset.

[98] Distractor-Aware Memory-Based Visual Object Tracking

Jovana Videnovic,Matej Kristan,Alan Lukezic

Main category: cs.CV

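TL;DR: Proposes a distractor-aware memory module and introspection-based management method for SAM2 (DAM4SAM) that effectively reduces drift toward distractors in visual object tracking and improves redetection after occlusion.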

Details Motivation: Existing memory-based video segmentation methods such as SAM2 struggle with tracking drift when visually similar distractors are present. Method: Designs a drop-in distractor-aware memory module and an introspection-based memory management mechanism, and constructs the DiDi dataset for analyzing tracking performance in the presence of distractors. Result: DAM4SAM outperforms SAM2.1 on thirteen benchmarks and sets state-of-the-art results on ten; integrated into EfficientTAM it yields an 11% improvement, matching the non-real-time SAM2.1-L; integrated into EdgeTAM it yields a 4% improvement, showing good generalization across architectures. Conclusion: The proposed distractor-aware memory module markedly improves the robustness and redetection capability of video object tracking in the presence of distractors, with broad applicability and potential for practical deployment. Abstract: The recent emergence of memory-based video segmentation methods such as SAM2 has led to models with excellent performance in segmentation tasks, achieving leading results on numerous benchmarks. However, these models are not fully adjusted for visual object tracking, where distractors (i.e., objects visually similar to the target) pose a key challenge. In this paper we propose a distractor-aware drop-in memory module and introspection-based management method for SAM2, leading to DAM4SAM. Our design effectively reduces the tracking drift toward distractors and improves redetection capability after object occlusion. To facilitate the analysis of tracking in the presence of distractors, we construct DiDi, a Distractor-Distilled dataset. DAM4SAM outperforms SAM2.1 on thirteen benchmarks and sets new state-of-the-art results on ten. Furthermore, integrating the proposed distractor-aware memory into a real-time tracker EfficientTAM leads to an 11% improvement and matches tracking quality of the non-real-time SAM2.1-L on multiple tracking and segmentation benchmarks, while integration with edge-based tracker EdgeTAM delivers a 4% performance boost, demonstrating a very good generalization across architectures.

[99] Invisible Yet Detected: PelFANet with Attention-Guided Anatomical Fusion for Pelvic Fracture Diagnosis

Siam Tahsin Bhuiyan,Rashedur Rahman,Sefatul Wasi,Naomi Yagi,Syoji Kobashi,Ashraful Islam,Saadia Binte Alam

Main category: cs.CV

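TL;DR: Proposes PelFANet, a dual-stream attention network that improves pelvic fracture classification, especially when fracture signs are subtle on radiographs. The network fuses raw X-ray images with segmented bone images and uses Fused Attention Blocks (FABlocks) to iteratively exchange and refine features, capturing global context and local anatomical detail. Trained in a two-stage pipeline, PelFANet outperforms conventional methods: on the AMERI dataset it reaches 88.68% accuracy and 0.9334 AUC on visible fractures, and 82.29% accuracy and 0.8688 AUC on invisible fractures, showing its potential for clinical use.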

Details Motivation: Pelvic fractures can appear very subtle or be hard to see on standard radiographs, making diagnosis difficult, so a more effective technique is needed to improve the accuracy and reliability of fracture detection. Method: Proposes PelFANet, a dual-stream attention network that combines raw pelvic X-ray images with segmented bone images and uses Fused Attention Blocks for feature interaction and refinement to better identify fracture regions. The model is trained in two stages: segmentation-guided pretraining followed by fine-tuning for fracture classification. Result: On the AMERI dataset, PelFANet reaches 88.68% accuracy and 0.9334 AUC on visible fractures and 82.29% accuracy and 0.8688 AUC on invisible fractures, indicating good generalization and clinical promise. Conclusion: As an anatomy-aware dual-input architecture, PelFANet shows clear advantages in pelvic fracture detection, particularly for subtle fractures that are hard to find in routine imaging, and has significant clinical value. Abstract: Pelvic fractures pose significant diagnostic challenges, particularly in cases where fracture signs are subtle or invisible on standard radiographs. To address this, we introduce PelFANet, a dual-stream attention network that fuses raw pelvic X-rays with segmented bone images to improve fracture classification. The network employs Fused Attention Blocks (FABlocks) to iteratively exchange and refine features from both inputs, capturing global context and localized anatomical detail. Trained in a two-stage pipeline with a segmentation-guided approach, PelFANet demonstrates superior performance over conventional methods. On the AMERI dataset, it achieves 88.68% accuracy and 0.9334 AUC on visible fractures, while generalizing effectively to invisible fracture cases with 82.29% accuracy and 0.8688 AUC, despite not being trained on them. These results highlight the clinical potential of anatomy-aware dual-input architectures for robust fracture detection, especially in scenarios with subtle radiographic presentations.

[100] EvHand-FPV: Efficient Event-Based 3D Hand Tracking from First-Person View

Zhen Xu,Guorui Lu,Chang Gao,Qinyu Chen

Main category: cs.CV

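TL;DR: Proposes EvHand-FPV, a lightweight framework for egocentric 3D hand tracking from a single event camera; it builds a dataset from synthetic and real data and uses a wrist-based region of interest, end-to-end mapping, and multi-task learning to improve 2D and 3D tracking performance while substantially cutting computation, making it suitable for real-time use on XR devices.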

Details Motivation: Frame-based hand tracking methods are limited in accuracy, latency, and energy use, especially on resource-constrained Extended Reality (XR) devices. Event cameras offer microsecond-level response and low power consumption, but egocentric hand tracking frameworks and datasets for them are lacking, so an efficient and accurate event-based hand tracking solution is needed. Method: Proposes the EvHand-FPV framework and builds an event-camera egocentric dataset with synthetic training data carrying 3D labels and real test data carrying 2D labels; introduces a wrist-based region of interest (ROI) that localizes the hand via geometric cues and embeds ROI offsets into the network for end-to-end mapping to reduce computation; and adopts a multi-task learning strategy with an auxiliary geometric feature head that strengthens representations without adding inference overhead. Result: On the real egocentric test set, 2D-AUCp improves from 0.77 to 0.85, parameters drop by 89% (from 11.2M to 1.2M), and FLOPs per inference fall by 89% (from 1.648G to 0.185G); 3D-AUCp holds at 0.84 on synthetic data, confirming the method's efficiency and competitiveness. Conclusion: EvHand-FPV achieves accurate, low-compute egocentric hand tracking from a single event camera, is suited to deployment on resource-constrained platforms such as XR devices, and advances event-camera-based interaction. Abstract: Hand tracking holds great promise for intuitive interaction paradigms, but frame-based methods often struggle to meet the requirements of accuracy, low latency, and energy efficiency, especially in resource-constrained settings such as Extended Reality (XR) devices. Event cameras provide µs-level temporal resolution at mW-level power by asynchronously sensing brightness changes. In this work, we present EvHand-FPV, a lightweight framework for egocentric First-Person-View 3D hand tracking from a single event camera. We construct an event-based FPV dataset that couples synthetic training data with 3D labels and real event data with 2D labels for evaluation to address the scarcity of egocentric benchmarks. EvHand-FPV also introduces a wrist-based region of interest (ROI) that localizes the hand region via geometric cues, combined with an end-to-end mapping strategy that embeds ROI offsets into the network to reduce computation without explicit reconstruction, and a multi-task learning strategy with an auxiliary geometric feature head that improves representations without test-time overhead. On our real FPV test set, EvHand-FPV improves 2D-AUCp from 0.77 to 0.85 while reducing parameters by 89% (from 11.2M to 1.2M) and FLOPs per inference by 89% (from 1.648G to 0.185G). It also maintains a competitive 3D-AUCp of 0.84 on synthetic data. These results demonstrate accurate and efficient egocentric event-based hand tracking suitable for on-device XR applications.
The dataset and code are available at https://github.com/zen5x5/EvHand-FPV.
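
The two 89% figures quoted in this entry can be checked directly from the reported before and after values. The snippet below is only a worked arithmetic check; `percent_reduction` is a hypothetical helper name.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# Values as reported in the abstract.
params_reduction = percent_reduction(11.2, 1.2)     # parameters, in millions
flops_reduction = percent_reduction(1.648, 0.185)   # FLOPs per inference, in G

print(f"parameters: {params_reduction:.1f}%  FLOPs: {flops_reduction:.1f}%")
```

Both values round to 89%, consistent with the abstract; expressed as ratios, the model is roughly 9.3 times smaller and needs roughly 8.9 times fewer FLOPs per inference.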

[101] White Aggregation and Restoration for Few-shot 3D Point Cloud Semantic Segmentation

Jiyun Im,SuBeen Lee,Miso Lee,Jae-Pil Heo

Main category: cs.CV

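TL;DR: Proposes an advanced attention-based prototype generation method for few-shot 3D point cloud segmentation (FS-PCS) that uses whitening and coloring transformations to close the distributional gap between learnable prototypical tokens and support features, markedly improving performance.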

Details Motivation: In existing methods, the initial randomness of prototype generation affects performance and the process remains underexplored, so a more robust prototype generation method is needed. Method: Proposes the White Aggregation and Restoration Module (WARM), which applies whitening and coloring transformations before and after cross-attention to align the distributions of support features and prototypical tokens. Result: Achieves performance well above existing methods on multiple FS-PCS benchmarks, validating the approach. Conclusion: By aligning distributions, WARM stabilizes the attention mechanism and produces more representative prototypes, advancing few-shot 3D point cloud segmentation. Abstract: Few-Shot 3D Point Cloud Segmentation (FS-PCS) aims to predict per-point labels for an unlabeled point cloud, given only a few labeled examples. To extract discriminative representations from the limited support set, existing methods have constructed prototypes using conventional algorithms such as farthest point sampling. However, we point out that its initial randomness significantly affects FS-PCS performance and that the prototype generation process remains underexplored despite its prevalence. This motivates us to investigate an advanced prototype generation method based on the attention mechanism. Despite its potential, we found that the vanilla module suffers from the distributional gap between learnable prototypical tokens and support features. To overcome this, we propose the White Aggregation and Restoration Module (WARM), which resolves the misalignment by sandwiching cross-attention between whitening and coloring transformations. Specifically, whitening aligns the support features to prototypical tokens before the attention process, and subsequently coloring restores the original distribution to the attended tokens. This simple yet effective design enables robust attention, thereby generating representative prototypes by capturing the semantic relationships among support features. Our method achieves state-of-the-art performance with a significant margin on multiple FS-PCS benchmarks, demonstrating its effectiveness through extensive experiments.
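
The whitening and coloring pair around cross-attention can be sketched in a simplified form. The version below is an assumption-laden illustration, not the WARM implementation: it uses per-channel statistics (a diagonal approximation, whereas full whitening would use the feature covariance), it takes the coloring statistics from the support features, and `channel_stats`, `whiten`, and `color` are hypothetical names. The cross-attention step that would sit between the two calls is omitted.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]  # rows = points or tokens, columns = feature channels


def channel_stats(x: Matrix, eps: float = 1e-8) -> Tuple[List[float], List[float]]:
    """Per-channel mean and standard deviation (eps keeps the std positive)."""
    n, d = len(x), len(x[0])
    mean = [sum(row[j] for row in x) / n for j in range(d)]
    std = [math.sqrt(sum((row[j] - mean[j]) ** 2 for row in x) / n + eps)
           for j in range(d)]
    return mean, std


def whiten(x: Matrix, mean: List[float], std: List[float]) -> Matrix:
    """Map each channel to roughly zero mean and unit variance."""
    return [[(v - m) / s for v, m, s in zip(row, mean, std)] for row in x]


def color(x: Matrix, mean: List[float], std: List[float]) -> Matrix:
    """Inverse of whiten: restore the original per-channel scale and offset."""
    return [[v * s + m for v, m, s in zip(row, mean, std)] for row in x]
```

In this simplified picture, support features would be whitened before attending with the prototypical tokens, and the attended output would then be passed through `color` with the same statistics so that the resulting prototypes live in the original feature distribution.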

[102] Towards Rationale-Answer Alignment of LVLMs via Self-Rationale Calibration

Yuanchen Wu,Ke Yan,Shouhong Ding,Ziyin Zhou,Xiaoqiang Li

Main category: cs.CV

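TL;DR: Proposes the Self-Rationale Calibration (SRC) framework, which iteratively calibrates the consistency between rationales and answers to improve large vision-language models (LVLMs) on visual question answering.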

Details Motivation: When existing LVLMs generate answers, their rationales and answers are often inconsistent, leading to incorrect responses; the alignment between reasoning and output needs strengthening. Method: First applies lightweight "rationale fine-tuning" that makes the model produce a rationale before its answer; then searches for diverse candidate responses from the fine-tuned model and scores them pairwise with a tailored R-Scorer model that assesses rationale quality and factual consistency; finally performs preference fine-tuning on a confidence-weighted curated preference set to complete the alignment calibration. Result: SRC markedly improves LVLMs' perception, reasoning, and generalization across multiple benchmarks, validating the rationale-centered alignment strategy. Conclusion: By iteratively calibrating rationale-answer alignment, SRC improves reasoning consistency and answer accuracy in LVLMs, underscoring the importance of rationale-oriented alignment for unlocking LVLM potential. Abstract: Large Vision-Language Models (LVLMs) have manifested strong visual question answering capability. However, they still struggle with aligning the rationale and the generated answer, leading to inconsistent reasoning and incorrect responses. To this end, this paper introduces the Self-Rationale Calibration (SRC) framework to iteratively calibrate the alignment between rationales and answers. SRC begins by employing a lightweight "rationale fine-tuning" approach, which modifies the model's response format to require a rationale before deriving an answer without explicit prompts. Next, SRC searches for a diverse set of candidate responses from the fine-tuned LVLMs for each sample, followed by a proposed pairwise scoring strategy using a tailored scoring model, R-Scorer, to evaluate both rationale quality and factual consistency of candidates. Based on a confidence-weighted preference curation process, SRC decouples the alignment calibration into a preference fine-tuning manner, leading to significant improvements of LVLMs in perception, reasoning, and generalization across multiple benchmarks. Our results emphasize the rationale-oriented alignment in exploring the potential of LVLMs.

[103] Towards Robust Defense against Customization via Protective Perturbation Resistant to Diffusion-based Purification

Wenkui Yang,Jie Cao,Junxian Duan,Ran He

Main category: cs.CV

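TL;DR: This paper proposes AntiPure, a purification-resistant protective perturbation that uses Patch-wise Frequency Guidance and Erroneous Timestep Guidance to embed imperceptible perturbations that purification struggles to remove, countering the risk of image misuse.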

Details Motivation: Existing protective perturbations are easily removed by purification, exposing images again to malicious forgery, so a new protection method that resists purification is needed. Method: Proposes AntiPure, which uses Patch-wise Frequency Guidance to reduce the model's influence over high-frequency components in the purified image and Erroneous Timestep Guidance to disrupt the model's denoising strategy, so that the perturbation survives purification and still causes pronounced distortion after customization. Result: Experiments show that AntiPure retains its effect under representative purification settings, achieving minimal perceptual discrepancy and maximal distortion and outperforming other protective perturbation methods. Conclusion: AntiPure exposes vulnerabilities in the purification-customization workflow, offers stronger protection for image security, and can serve as a stress test for evaluating the robustness of purification methods. Abstract: Diffusion models like Stable Diffusion have become prominent in visual synthesis tasks due to their powerful customization capabilities, which also introduce significant security risks, including deepfakes and copyright infringement. In response, a class of methods known as protective perturbation emerged, which mitigates image misuse by injecting imperceptible adversarial noise. However, purification can remove protective perturbations, thereby exposing images again to the risk of malicious forgery. In this work, we formalize the anti-purification task, highlighting challenges that hinder existing approaches, and propose a simple diagnostic protective perturbation named AntiPure. AntiPure exposes vulnerabilities of purification within the "purification-customization" workflow, owing to two guidance mechanisms: 1) Patch-wise Frequency Guidance, which reduces the model's influence over high-frequency components in the purified image, and 2) Erroneous Timestep Guidance, which disrupts the model's denoising strategy across different timesteps. With additional guidance, AntiPure embeds imperceptible perturbations that persist under representative purification settings, achieving effective post-customization distortion. Experiments show that, as a stress test for purification, AntiPure achieves minimal perceptual discrepancy and maximal distortion, outperforming other protective perturbation methods within the purification-customization workflow.

[104] Noise-Level Diffusion Guidance: Well Begun is Half Done

Harvey Mannering,Zhiwu Huang,Adam Prugel-Bennett

Main category: cs.CV

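TL;DR: Proposes NLG, a simple, efficient, and general noise-level optimization method that requires no additional data, auxiliary networks, or backpropagation and improves the generation quality and condition adherence of diffusion models.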

Details Motivation: Existing noise-level optimization methods rely on extra datasets, auxiliary networks, or backpropagation-based optimization, which limits their practicality. Method: Refines the initial noise by increasing the likelihood of its alignment with general guidance, without training data, extra networks, or backpropagation; applies to both conditional and unconditional diffusion models. Result: Experiments on five standard benchmarks show that NLG improves generated image quality and adherence to input conditions while remaining computationally efficient. Conclusion: NLG is a practical, scalable enhancement for diffusion models that integrates seamlessly with existing guidance techniques. Abstract: Diffusion models have achieved state-of-the-art image generation. However, the random Gaussian noise used to start the diffusion process influences the final output, causing variations in image quality and prompt adherence. Existing noise-level optimization approaches generally rely on extra dataset construction, additional networks, or backpropagation-based optimization, limiting their practicality. In this paper, we propose Noise Level Guidance (NLG), a simple, efficient, and general noise-level optimization approach that refines initial noise by increasing the likelihood of its alignment with general guidance - requiring no additional training data, auxiliary networks, or backpropagation. The proposed NLG approach provides a unified framework generalizable to both conditional and unconditional diffusion models, accommodating various forms of diffusion-level guidance. Extensive experiments on five standard benchmarks demonstrate that our approach enhances output generation quality and input condition adherence. By seamlessly integrating with existing guidance methods while maintaining computational efficiency, our method establishes NLG as a practical and scalable enhancement to diffusion models. Code can be found at https://github.com/harveymannering/NoiseLevelGuidance.

[105] Can Current AI Models Count What We Mean, Not What They See? A Benchmark and Systematic Evaluation

Gia Khanh Nguyen,Yifeng Huang,Minh Hoai

Main category: cs.CV

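TL;DR: This paper introduces PairTally, a new benchmark dataset for evaluating fine-grained visual counting. It contains 681 high-resolution images covering inter-category and intra-category (subcategory) counting tasks. Experiments show that, despite recent progress, existing models still struggle to understand user intent and count accurately in fine-grained and visually ambiguous scenes.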

Details Motivation: Existing visual counting models perform well on coarse-grained tasks, but their ability on fine-grained, intent-driven counting is unclear, and there is no suitable benchmark for evaluating how well models distinguish and count objects that differ subtly in shape, size, color, or semantics. Method: Introduces the PairTally dataset, in which each image contains two object categories, with inter-category and intra-category settings to test selective counting; systematically evaluates a range of state-of-the-art models, including exemplar-based methods, language-prompted models, and large vision-language models. Result: State-of-the-art models struggle to count accurately in fine-grained and visually ambiguous cases and show clear shortcomings in understanding user intent. Conclusion: PairTally provides a new basis for evaluating fine-grained visual counting, exposes the limitations of existing models, and points to directions for improving such systems. Abstract: Visual counting is a fundamental yet challenging task, especially when users need to count objects of a specific type in complex scenes. While recent models, including class-agnostic counting models and large vision-language models (VLMs), show promise in counting tasks, their ability to perform fine-grained, intent-driven counting remains unclear. In this paper, we introduce PairTally, a benchmark dataset specifically designed to evaluate fine-grained visual counting. Each of the 681 high-resolution images in PairTally contains two object categories, requiring models to distinguish and count based on subtle differences in shape, size, color, or semantics. The dataset includes both inter-category (distinct categories) and intra-category (closely related subcategories) settings, making it suitable for rigorous evaluation of selective counting capabilities. We benchmark a variety of state-of-the-art models, including exemplar-based methods, language-prompted models, and large VLMs. Our results show that despite recent advances, current models struggle to reliably count what users intend, especially in fine-grained and visually ambiguous cases. PairTally provides a new foundation for diagnosing and improving fine-grained visual counting systems.

[106] MOCHA: Multi-modal Objects-aware Cross-arcHitecture Alignment

Elena Camuffo,Francesco Barbato,Mete Ozay,Simone Milani,Umberto Michieli

Main category: cs.CV

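TL;DR: MOCHA is a knowledge distillation method that transfers region-level multimodal semantics from a vision-language teacher to a lightweight vision-only detector student, using object-level alignment for efficient semantic transfer.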

Details Motivation: Existing knowledge distillation methods focus on dense or global alignment, transfer fine-grained multimodal semantics poorly, and usually depend on textual input or modifications to the teacher, which limits their use in lightweight vision tasks. Method: Proposes the MOCHA framework, which introduces a translation module that maps student features into a joint space and a dual-objective loss that jointly optimizes local alignment and global relational consistency, achieving object-level semantic transfer without textual input at inference or changes to the teacher. Result: Validated on four few-shot personalized detection benchmarks, with an average score improvement of +10.1 and performance on par with larger multimodal models. Conclusion: MOCHA provides efficient, plug-and-play cross-architecture knowledge distillation suited to practical deployment in resource-constrained settings. Abstract: We introduce MOCHA (Multi-modal Objects-aware Cross-arcHitecture Alignment), a knowledge distillation approach that transfers region-level multimodal semantics from a large vision-language teacher (e.g., LLaVa) into a lightweight vision-only object detector student (e.g., YOLO). A translation module maps student features into a joint space, where the training of the student and translator is guided by a dual-objective loss that enforces both local alignment and global relational consistency. Unlike prior approaches focused on dense or global alignment, MOCHA operates at the object level, enabling efficient transfer of semantics without modifying the teacher or requiring textual input at inference. We validate our method across four personalized detection benchmarks under few-shot regimes. Results show consistent gains over baselines, with a +10.1 average score improvement. Despite its compact architecture, MOCHA reaches performance on par with larger multimodal models, proving its suitability for real-world deployment.

[107] Performance Optimization of YOLO-FEDER FusionNet for Robust Drone Detection in Visually Complex Environments

Tamara R. Lenhard,Andreas Weinmann,Tobias Koch

Main category: cs.CV

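TL;DR: This paper presents an enhanced YOLO-FEDER FusionNet framework that combines generic object detection with camouflage object detection techniques to improve drone detection in visually complex environments.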

Details Motivation: Background clutter, small object scale, and camouflage effects make drone detection in complex environments challenging, and generic detectors that perform well in low-texture scenes degrade in cluttered scenes with low object-background separability. Method: Building on the original YOLO-FEDER FusionNet, improves the training data composition (large-scale photo-realistic synthetic data plus a small set of real-world samples), the feature fusion strategy, and the backbone design, and systematically evaluates the contribution of intermediate multi-scale FEDER features. Result: Combining intermediate FEDER features with backbone upgrades, the configuration with a YOLOv8l backbone and FEDER features derived from the DWD module reduces the false negative rate (FNR) by up to 39.1 percentage points and raises mAP at an IoU threshold of 0.5 by up to 62.8 percentage points relative to the initial baseline. Conclusion: The proposed improvements substantially increase the robustness and accuracy of drone detection in complex environments, supporting the combined optimization of synthetic data, feature fusion, and backbone design. Abstract: Drone detection in visually complex environments remains challenging due to background clutter, small object scale, and camouflage effects. While generic object detectors like YOLO exhibit strong performance in low-texture scenes, their effectiveness degrades in cluttered environments with low object-background separability. To address these limitations, this work presents an enhanced iteration of YOLO-FEDER FusionNet -- a detection framework that integrates generic object detection with camouflage object detection techniques. Building upon the original architecture, the proposed iteration introduces systematic advancements in training data composition, feature fusion strategies, and backbone design. Specifically, the training process leverages large-scale, photo-realistic synthetic data, complemented by a small set of real-world samples, to enhance robustness under visually complex conditions. The contribution of intermediate multi-scale FEDER features is systematically evaluated, and detection performance is comprehensively benchmarked across multiple YOLO-based backbone configurations. Empirical results indicate that integrating intermediate FEDER features, in combination with backbone upgrades, contributes to notable performance improvements. In the most promising configuration -- YOLO-FEDER FusionNet with a YOLOv8l backbone and FEDER features derived from the DWD module -- these enhancements lead to a FNR reduction of up to 39.1 percentage points and a mAP increase of up to 62.8 percentage points at an IoU threshold of 0.5, compared to the initial baseline.
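As a reading aid for the reported gains, the short sketch below shows how a false negative rate is computed and why a change in percentage points differs from a relative change. The counts are hypothetical and are not taken from the paper, and the function names are illustrative only.

```python
def false_negative_rate(true_positives, false_negatives):
    """FNR = missed objects / all objects that were actually present."""
    return false_negatives / (true_positives + false_negatives)


def percentage_point_change(old_rate, new_rate):
    """Absolute difference between two rates, in percentage points."""
    return (new_rate - old_rate) * 100.0


def relative_change_percent(old_rate, new_rate):
    """Change expressed relative to the old rate, in percent."""
    return (new_rate - old_rate) / old_rate * 100.0


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
baseline_fnr = false_negative_rate(true_positives=50, false_negatives=50)
improved_fnr = false_negative_rate(true_positives=90, false_negatives=10)
```

With these made-up counts the FNR falls from 50% to 10%, which is a drop of 40 percentage points but an 80% relative reduction; the abstract reports its gains in percentage points.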

[108] SAIL-VL2 Technical Report

Weijie Yin,Yongjie Ye,Fangxun Shu,Yue Liao,Zijian Kang,Hongyuan Dong,Haiyang Yu,Dingkang Yang,Jiacong Wang,Han Wang,Wenzhuo Liu,Xiao Liang,Shuicheng Yan,Chao Feng

Main category: cs.CV

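TL;DR: SAIL-VL2 is an open-source vision-language foundation model with leading performance at the 2B and 8B parameter scales; through a high-quality data pipeline, a progressive training framework, and a sparse MoE architecture, it reaches state-of-the-art results on multimodal understanding and complex reasoning tasks.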

Details Motivation: To improve the comprehensive understanding of vision-language models on tasks ranging from fine-grained perception to complex reasoning, and to advance open-source multimodal models. Method: Uses a large-scale data curation pipeline with scoring and filtering, progressive multi-stage training built on SAIL-ViT (multimodal pre-training followed by a thinking-fusion SFT-RL hybrid stage), and sparse Mixture-of-Experts architectures. Result: Performs competitively on 106 datasets and reaches state-of-the-art results on challenging reasoning benchmarks such as MMMU and MathVista; SAIL-VL2-2B ranks first on the OpenCompass leaderboard among officially released open-source models under the 4B parameter scale. Conclusion: Through systematic optimization, SAIL-VL2 balances performance, efficiency, and extensibility, serving as an efficient and extensible foundation model for the open-source multimodal community. Abstract: We introduce SAIL-VL2, an open-suite vision-language foundation model (LVM) for comprehensive multimodal understanding and reasoning. As the successor to SAIL-VL, SAIL-VL2 achieves state-of-the-art performance at the 2B and 8B parameter scales across diverse image and video benchmarks, demonstrating strong capabilities from fine-grained perception to complex reasoning. Three core innovations drive its effectiveness. First, a large-scale data curation pipeline with scoring and filtering strategies enhances both quality and distribution across captioning, OCR, QA, and video data, improving training efficiency. Second, a progressive training framework begins with a powerful pre-trained vision encoder (SAIL-ViT), advances through multimodal pre-training, and culminates in a thinking-fusion SFT-RL hybrid paradigm that systematically strengthens model capabilities. Third, architectural advances extend beyond dense LLMs to efficient sparse Mixture-of-Experts (MoE) designs. With these contributions, SAIL-VL2 demonstrates competitive performance across 106 datasets and achieves state-of-the-art results on challenging reasoning benchmarks such as MMMU and MathVista. Furthermore, on the OpenCompass leaderboard, SAIL-VL2-2B ranks first among officially released open-source models under the 4B parameter scale, while serving as an efficient and extensible foundation for the open-source multimodal community.

[109] PROFUSEme: PROstate Cancer Biochemical Recurrence Prediction via FUSEd Multi-modal Embeddings

Suhang You,Carla Pitarch-Abaigar,Sanket Kachole,Sumedh Sonawane,Juhyung Ha,Anish Sudarshan Gada,David Crandall,Rakesh Shiradkar,Spyridon Bakas

Main category: cs.CV

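TL;DR: This paper proposes PROFUSEme, which predicts biochemical recurrence (BCR) after radical prostatectomy by fusing multi-modal embeddings of clinical, radiology, and pathology data. It uses an intermediate fusion strategy combined with Cox proportional hazards regression and performs strongly on both internal validation and external challenge data.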

Details Motivation: About 30% of prostate cancer patients experience biochemical recurrence after radical prostatectomy, and accurate early prediction can improve clinical decision-making and patient outcomes. Method: Proposes PROFUSEme, which uses an intermediate fusion configuration to integrate clinical, radiology, and pathology data and Cox proportional hazards regressors to learn cross-modal feature interactions for BCR prediction. Result: Achieves a mean C-index of 0.861 (σ=0.112) under internal 5-fold nested cross-validation and a C-index of 0.7103 on the hold-out data of the CHIMERA 2025 challenge validation leaderboard, outperforming late fusion configurations. Conclusion: PROFUSEme makes effective use of multi-modal data to improve prediction of BCR after prostate cancer surgery and shows potential for clinical use. Abstract: Almost 30% of prostate cancer (PCa) patients undergoing radical prostatectomy (RP) experience biochemical recurrence (BCR), characterized by increased prostate specific antigen (PSA) and associated with increased mortality. Accurate early prediction of BCR, at the time of RP, would contribute to prompt adaptive clinical decision-making and improved patient outcomes. In this work, we propose prostate cancer BCR prediction via fused multi-modal embeddings (PROFUSEme), which learns cross-modal interactions of clinical, radiology, and pathology data, following an intermediate fusion configuration in combination with Cox Proportional Hazard regressors. Quantitative evaluation of our proposed approach reveals superior performance, when compared with late fusion configurations, yielding a mean C-index of 0.861 ($\sigma=0.112$) on the internal 5-fold nested cross-validation framework, and a C-index of 0.7103 on the hold out data of CHIMERA 2025 challenge validation leaderboard.
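The C-index reported above is a ranking metric for censored survival data. A minimal sketch of Harrell's concordance index follows, assuming that a higher predicted risk should correspond to an earlier recurrence. The function and argument names are illustrative, pairs with tied follow-up times are skipped for simplicity, and this is not the challenge's evaluation code.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored data.

    times  : observed follow-up time per patient
    events : 1 if the event (e.g. BCR) was observed, 0 if censored
    risks  : predicted risk score (higher means earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` has the shorter follow-up time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the shorter time is an observed event.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0   # the earlier event received the higher risk
        elif risks[a] == risks[b]:
            concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ordering of patients by time to event.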

[110] Wan-Animate: Unified Character Animation and Replacement with Holistic Replication

Gang Cheng,Xin Gao,Li Hu,Siqi Hu,Mingyang Huang,Chaonan Ji,Ju Li,Dechao Meng,Jinwei Qi,Penchong Qiao,Zhen Shen,Yafei Song,Ke Sun,Linrui Tian,Feng Wang,Guangyuan Wang,Qi Wang,Zhongjian Wang,Jiayu Xiao,Sheng Xu,Bang Zhang,Peng Zhang,Xindi Zhang,Zhe Zhang,Jingren Zhou,Lian Zhuo

Main category: cs.CV

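TL;DR: Wan-Animate is a unified framework for character animation and replacement that replicates expressions and movements with high fidelity and blends seamlessly with the scene's lighting and color tone.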

Details Motivation: To achieve high-quality character animation and replacement while preserving environmental consistency; existing methods still fall short in motion control, expression reproduction, and scene integration. Method: Built on the Wan model with a modified input paradigm; uses spatially-aligned skeleton signals to control body motion, implicit facial features extracted from source images to reenact expressions, and an auxiliary Relighting LoRA module to match scene lighting and color tone. Result: Experiments show that Wan-Animate reaches state-of-the-art performance on character animation and replacement, with high controllability and expressiveness. Conclusion: Wan-Animate unifies character animation and replacement, producing high-fidelity, naturally integrated video through multimodal conditioning and an environment-adaptation module; the authors commit to open-sourcing the model weights and code. Abstract: We introduce Wan-Animate, a unified framework for character animation and replacement. Given a character image and a reference video, Wan-Animate can animate the character by precisely replicating the expressions and movements of the character in the video to generate high-fidelity character videos. Alternatively, it can integrate the animated character into the reference video to replace the original character, replicating the scene's lighting and color tone to achieve seamless environmental integration. Wan-Animate is built upon the Wan model. To adapt it for character animation tasks, we employ a modified input paradigm to differentiate between reference conditions and regions for generation. This design unifies multiple tasks into a common symbolic representation. We use spatially-aligned skeleton signals to replicate body motion and implicit facial features extracted from source images to reenact expressions, enabling the generation of character videos with high controllability and expressiveness. Furthermore, to enhance environmental integration during character replacement, we develop an auxiliary Relighting LoRA. This module preserves the character's appearance consistency while applying the appropriate environmental lighting and color tone. Experimental results demonstrate that Wan-Animate achieves state-of-the-art performance. We are committed to open-sourcing the model weights and its source code.

[111] VSE-MOT: Multi-Object Tracking in Low-Quality Video Scenes Guided by Visual Semantic Enhancement

Jun Du,Weiwei Xing,Ming Li,Fei Richard Yu

Main category: cs.CV

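TL;DR: This paper proposes VSE-MOT, a multi-object tracking framework guided by visual semantic enhancement. By introducing a vision-language model and a tri-branch architecture, it substantially improves tracking in low-quality video, outperforming existing methods by approximately 8% to 20% while remaining robust in conventional scenarios.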

Details Motivation: Current multi-object tracking algorithms degrade significantly on low-quality video and cope poorly with real-world image deterioration, so their robustness and accuracy in such conditions need to improve. Method: Designs a tri-branch architecture that uses a vision-language model to extract global visual semantic information from images and fuse it with query vectors; introduces the MOT-Adapter to adapt the semantic information to multi-object tracking and the Visual Semantic Fusion Module (VSFM) to improve feature fusion. Result: In real-world low-quality video scenarios, tracking performance metrics exceed those of existing methods by approximately 8% to 20%, while performance in conventional scenarios remains robust. Conclusion: By effectively fusing visual semantic information, VSE-MOT markedly improves multi-object tracking in low-quality video and has strong practical value. Abstract: Current multi-object tracking (MOT) algorithms typically overlook issues inherent in low-quality videos, leading to significant degradation in tracking performance when confronted with real-world image deterioration. Therefore, advancing the application of MOT algorithms in real-world low-quality video scenarios represents a critical and meaningful endeavor. To address the challenges posed by low-quality scenarios, inspired by vision-language models, this paper proposes a Visual Semantic Enhancement-guided Multi-Object Tracking framework (VSE-MOT). Specifically, we first design a tri-branch architecture that leverages a vision-language model to extract global visual semantic information from images and fuse it with query vectors. Subsequently, to further enhance the utilization of visual semantic information, we introduce the Multi-Object Tracking Adapter (MOT-Adapter) and the Visual Semantic Fusion Module (VSFM). The MOT-Adapter adapts the extracted global visual semantic information to suit multi-object tracking tasks, while the VSFM improves the efficacy of feature fusion. Through extensive experiments, we validate the effectiveness and superiority of the proposed method in real-world low-quality video scenarios. Its tracking performance metrics outperform those of existing methods by approximately 8% to 20%, while maintaining robust performance in conventional scenarios.

[112] AD-DINOv3: Enhancing DINOv3 for Zero-Shot Anomaly Detection with Anomaly-Aware Calibration

Jingyi Yuan,Jianxiong Ye,Wenkang Chen,Chenqiang Gao

Main category: cs.CV

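TL;DR: This paper proposes AD-DINOv3, the first multimodal framework to adapt DINOv3 for zero-shot anomaly detection (ZSAD). Lightweight adapters and an anomaly-aware calibration module address domain bias and the recognition of subtle anomalies, and the method performs strongly on eight industrial and medical benchmarks.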

Details Motivation: Most ZSAD methods are based on CLIP; vision foundation models such as DINOv3 offer strong transferable representations but have not yet been adapted for ZSAD, and adapting them raises two issues: domain bias between pretraining data and anomaly detection tasks, and a bias toward global semantics that causes subtle anomalies to be missed. Method: Formulates anomaly detection as a multimodal contrastive learning problem: DINOv3 serves as the visual backbone to extract patch tokens and a CLS token, and the CLIP text encoder produces embeddings for normal and abnormal prompts. Lightweight adapters in both modalities align the cross-modal representations, and an Anomaly-Aware Calibration Module (AACM) guides the CLS token to attend to anomalous regions. Result: Experiments on eight industrial and medical anomaly detection benchmarks show that AD-DINOv3 consistently matches or surpasses state-of-the-art methods. Conclusion: AD-DINOv3 is an effective general zero-shot anomaly detection framework whose domain alignment and attention guidance substantially improve DINOv3 on ZSAD. Abstract: Zero-Shot Anomaly Detection (ZSAD) seeks to identify anomalies from arbitrary novel categories, offering a scalable and annotation-efficient solution. Traditionally, most ZSAD works have been based on the CLIP model, which performs anomaly detection by calculating the similarity between visual and text embeddings. Recently, vision foundation models such as DINOv3 have demonstrated strong transferable representation capabilities. In this work, we are the first to adapt DINOv3 for ZSAD. However, this adaptation presents two key challenges: (i) the domain bias between large-scale pretraining data and anomaly detection tasks leads to feature misalignment; and (ii) the inherent bias toward global semantics in pretrained representations often leads to subtle anomalies being misinterpreted as part of the normal foreground objects, rather than being distinguished as abnormal regions. To overcome these challenges, we introduce AD-DINOv3, a novel vision-language multimodal framework designed for ZSAD. Specifically, we formulate anomaly detection as a multimodal contrastive learning problem, where DINOv3 is employed as the visual backbone to extract patch tokens and a CLS token, and the CLIP text encoder provides embeddings for both normal and abnormal prompts. To bridge the domain gap, lightweight adapters are introduced in both modalities, enabling their representations to be recalibrated for the anomaly detection task.
Beyond this baseline alignment, we further design an Anomaly-Aware Calibration Module (AACM), which explicitly guides the CLS token to attend to anomalous regions rather than generic foreground semantics, thereby enhancing discriminability. Extensive experiments on eight industrial and medical benchmarks demonstrate that AD-DINOv3 consistently matches or surpasses state-of-the-art methods, verifying its superiority as a general zero-shot anomaly detection framework.
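The similarity-based scoring that the abstract attributes to CLIP-style ZSAD can be sketched as follows: each patch token is compared with one text embedding for a normal prompt and one for an abnormal prompt, and a two-way softmax turns the two cosine similarities into an anomaly probability. The temperature value, the function names, and the toy vectors are assumptions for illustration; the adapters and the AACM are not modeled here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def anomaly_map(patch_tokens, normal_text, abnormal_text, temperature=0.07):
    """Return one anomaly probability per patch token.

    The temperature is a hypothetical value, not one taken from the paper.
    """
    scores = []
    for patch in patch_tokens:
        s_normal = cosine(patch, normal_text) / temperature
        s_abnormal = cosine(patch, abnormal_text) / temperature
        # Two-way softmax, shifted by the maximum for numerical stability.
        peak = max(s_normal, s_abnormal)
        e_normal = math.exp(s_normal - peak)
        e_abnormal = math.exp(s_abnormal - peak)
        scores.append(e_abnormal / (e_normal + e_abnormal))
    return scores


# Toy 2-D embeddings: the first patch looks normal, the second abnormal,
# and the third is equally similar to both prompts.
scores = anomaly_map([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                     normal_text=[1.0, 0.0], abnormal_text=[0.0, 1.0])
```

Reshaping the per-patch scores to the patch grid gives a coarse anomaly map; an image-level score could come from the CLS token in the same way.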

[113] Teacher-Guided Pseudo Supervision and Cross-Modal Alignment for Audio-Visual Video Parsing

Yaru Chen,Ruohao Guo,Liting Gao,Yang Xiang,Qingyu Luo,Zhenbo Li,Wenwu Wang

Main category: cs.CV

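TL;DR: This paper proposes a new method for weakly-supervised audio-visual video parsing (AVVP) that introduces an EMA-guided pseudo supervision framework and a class-aware cross-modal agreement loss, achieving strong event detection without temporal annotations.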

Details Motivation: Existing methods focus on refining global predictions and neglect stable segment-level supervision and class-aware cross-modal alignment, which limits weakly-supervised AVVP performance. Method: Proposes two strategies: (1) a pseudo supervision framework guided by an exponential moving average (EMA) that generates reliable segment-level masks; and (2) a class-aware cross-modal agreement (CMA) loss that aligns audio and visual embeddings at reliable segment-class pairs. Result: Experiments on the LLP and UnAV-100 datasets show state-of-the-art performance across multiple metrics. Conclusion: The proposed segment-level pseudo supervision and class-aware cross-modal alignment improve detection accuracy and cross-modal consistency in weakly-supervised AVVP. Abstract: Weakly-supervised audio-visual video parsing (AVVP) seeks to detect audible, visible, and audio-visual events without temporal annotations. Previous work has emphasized refining global predictions through contrastive or collaborative learning, but neglected stable segment-level supervision and class-aware cross-modal alignment. To address this, we propose two strategies: (1) an exponential moving average (EMA)-guided pseudo supervision framework that generates reliable segment-level masks via adaptive thresholds or top-k selection, offering stable temporal guidance beyond video-level labels; and (2) a class-aware cross-modal agreement (CMA) loss that aligns audio and visual embeddings at reliable segment-class pairs, ensuring consistency across modalities while preserving temporal structure. Evaluations on LLP and UnAV-100 datasets show that our method achieves state-of-the-art (SOTA) performance across multiple metrics.
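The EMA-guided pseudo supervision described above can be illustrated with a small sketch: a teacher whose weights are an exponential moving average of the student's, and two ways of turning teacher probabilities into segment-level masks. The decay value, the fixed threshold (the abstract mentions adaptive thresholds, which are not modeled), and all function names are assumptions for illustration only.

```python
def ema_update(teacher, student, decay=0.99):
    """Move each teacher weight a small step toward the student weight."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def threshold_mask(probs, threshold=0.5):
    """probs[t][c] is the teacher probability of class c in segment t.

    A fixed threshold stands in for the adaptive one named in the abstract.
    """
    return [[1 if p >= threshold else 0 for p in row] for row in probs]


def topk_mask(probs, k=1):
    """For each class, keep the k segments with the highest teacher probability."""
    num_segments, num_classes = len(probs), len(probs[0])
    mask = [[0] * num_classes for _ in range(num_segments)]
    for c in range(num_classes):
        ranked = sorted(range(num_segments),
                        key=lambda t: probs[t][c], reverse=True)
        for t in ranked[:k]:
            mask[t][c] = 1
    return mask


# Toy teacher output: 3 segments, 2 event classes.
teacher_probs = [[0.9, 0.2], [0.4, 0.6], [0.7, 0.1]]
by_threshold = threshold_mask(teacher_probs)
by_topk = topk_mask(teacher_probs, k=1)
```

In training, the resulting mask would select which segment-class pairs contribute to the segment-level loss, while the video-level labels still supervise the global prediction.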

[114] CSMoE: An Efficient Remote Sensing Foundation Model with Soft Mixture-of-Experts

Leonard Hackel,Tom Burgert,Begüm Demir

Main category: cs.CV

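TL;DR: This paper proposes a way to improve the efficiency of remote sensing foundation models (RS FMs) using the Soft mixture-of-experts (Soft MoE) mechanism. Applied to the Cross-Sensor Masked Autoencoder (CSMAE), it yields the CSMoE model, trained on a diverse set built with a thematic-climatic descriptor-driven sampling strategy. Experiments show that the approach substantially reduces computational cost while maintaining or improving representational performance, with more than twice the average computational efficiency of existing models.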

Details Motivation: Existing remote sensing foundation models either incur high computational complexity during training and inference or have limited representational capacity, which restricts their practical use; a method is needed that improves computational efficiency while retaining strong representations. Method: Integrates the Soft mixture-of-experts (Soft MoE) mechanism into a remote sensing foundation model, combining modality-specific expert specialization with shared cross-sensor representation learning, and builds the CSMoE model on top of CSMAE; also proposes a thematic-climatic descriptor-driven sampling strategy to construct a more representative and diverse training set. Result: Extensive experiments on scene classification, semantic segmentation, and content-based image retrieval show that CSMoE maintains or improves representational performance at substantially lower computational cost, achieving a better performance-efficiency trade-off than state-of-the-art RS FMs and more than twice their computational efficiency on average. Conclusion: The proposed Soft MoE integration and sampling strategy improve the computational efficiency and practicality of remote sensing foundation models and offer a new direction for efficient remote sensing representation learning. Abstract: Self-supervised learning through masked autoencoders has attracted great attention for remote sensing (RS) foundation model (FM) development, enabling improved representation learning across diverse sensors and downstream tasks. However, existing RS FMs often either suffer from substantial computational complexity during both training and inference or exhibit limited representational capacity. These issues restrict their practical applicability in RS. To address this limitation, we propose an adaptation for enhancing the efficiency of RS FMs by integrating the Soft mixture-of-experts (MoE) mechanism into the FM. The integration of Soft MoEs into the FM allows modality-specific expert specialization alongside shared cross-sensor representation learning. To demonstrate the effectiveness of our adaptation, we apply it on the Cross-Sensor Masked Autoencoder (CSMAE) model, resulting in the Cross-Sensor Mixture-of-Experts (CSMoE) model. In addition, we introduce a thematic-climatic descriptor-driven sampling strategy for the construction of a representative and diverse training set to train our CSMoE model. Extensive experiments on scene classification, semantic segmentation, and content-based image retrieval demonstrate that our adaptation yields a reduction in computational requirements while maintaining or improving representational performance. Compared to state-of-the-art RS FMs, CSMoE achieves a superior trade-off between representational capacity, accuracy, and computational efficiency.
On average, CSMoE achieves more than twice the computational efficiency of existing RS FMs, while maintaining competitive performance across all experiments. These results show the effectiveness of the proposed adaptation for creating computationally efficient RS FMs. The code for the model, the training set creation, and the model weights will be available at https://git.tu-berlin.de/rsim/csmoe.
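For readers unfamiliar with Soft MoE, the sketch below shows the general mechanism in its commonly described form: every expert slot receives a softmax-weighted mix of all tokens (dispatch), and every output token is a softmax-weighted mix of all slot outputs (combine). It uses one slot per expert and plain Python lists; the names are illustrative, and how CSMoE assigns modality-specific experts is not stated in the abstract and is not modeled.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def soft_moe_layer(tokens, slot_params, experts):
    """Soft MoE forward pass with one slot per expert.

    tokens      : n token vectors, each of length d
    slot_params : m learnable slot vectors, each of length d
    experts     : m callables, each mapping a length-d list to a length-d list
    """
    n, m, d = len(tokens), len(slot_params), len(tokens[0])
    # logits[i][j]: affinity between token i and slot j.
    logits = [[sum(x * p for x, p in zip(tok, phi)) for phi in slot_params]
              for tok in tokens]
    # Dispatch weights: softmax over tokens, separately for each slot.
    dispatch = [softmax([logits[i][j] for i in range(n)]) for j in range(m)]
    # Each slot input is a convex combination of all tokens.
    slot_inputs = [[sum(dispatch[j][i] * tokens[i][k] for i in range(n))
                    for k in range(d)] for j in range(m)]
    # Every expert processes its own slot; no token is dropped or hard-routed.
    slot_outputs = [experts[j](slot_inputs[j]) for j in range(m)]
    # Combine weights: softmax over slots, separately for each token.
    combine = [softmax(logits[i]) for i in range(n)]
    return [[sum(combine[i][j] * slot_outputs[j][k] for j in range(m))
             for k in range(d)] for i in range(n)]
```

Because each expert runs on a fixed number of slots rather than on every token, the expert cost depends on the slot count rather than the sequence length, which is the usual efficiency argument for this design.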

[115] Generative AI for Misalignment-Resistant Virtual Staining to Accelerate Histopathology Workflows

Jiabo MA,Wenqiang Li,Jinbang Li,Ziyi Liu,Linshan Wu,Fengtao Zhou,Li Liang,Ronald Cheong Kin Chan,Terence T. W. Wong,Hao Chen

Main category: cs.CV

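TL;DR: Proposes a robust virtual staining framework with cascaded registration mechanisms that resolves spatial misalignment between generated outputs and ground truth, significantly outperforming existing methods on multiple datasets.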

Details Motivation: Existing virtual staining methods rely on well-aligned paired data, but chemical staining distorts tissue and a single section cannot be stained multiple times, so precisely paired data is hard to obtain, which limits clinical use. Method: Designs a virtual staining framework with cascaded registration mechanisms that progressively correct spatial misalignment, enabling effective learning from unpaired or roughly paired data. Result: Across five datasets, the method improves on state-of-the-art models by an average of 3.2% on internal datasets and 10.1% on external datasets, and improves peak signal-to-noise ratio by 23.8% over baseline models on datasets with substantial misalignment. Conclusion: The method is highly robust, simplifies data acquisition for virtual staining, and advances its clinical application. Abstract: Accurate histopathological diagnosis often requires multiple differently stained tissue sections, a process that is time-consuming, labor-intensive, and environmentally taxing due to the use of multiple chemical stains. Recently, virtual staining has emerged as a promising alternative that is faster, tissue-conserving, and environmentally friendly. However, existing virtual staining methods face significant challenges in clinical applications, primarily due to their reliance on well-aligned paired data. Obtaining such data is inherently difficult because chemical staining processes can distort tissue structures, and a single tissue section cannot undergo multiple staining procedures without damage or loss of information. As a result, most available virtual staining datasets are either unpaired or roughly paired, making it difficult for existing methods to achieve accurate pixel-level supervision. To address this challenge, we propose a robust virtual staining framework featuring cascaded registration mechanisms to resolve spatial mismatches between generated outputs and their corresponding ground truth. Experimental results demonstrate that our method significantly outperforms state-of-the-art models across five datasets, achieving an average improvement of 3.2% on internal datasets and 10.1% on external datasets. Moreover, in datasets with substantial misalignment, our approach achieves a remarkable 23.8% improvement in peak signal-to-noise ratio compared to baseline models. The exceptional robustness of the proposed method across diverse datasets simplifies the data acquisition process for virtual staining and offers new insights for advancing its development.
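Peak signal-to-noise ratio is a pixel-wise metric, which is why spatial misalignment hurts it so much. The sketch below computes PSNR for flat lists of 8-bit pixel values and contrasts mild intensity noise with a one-pixel shift of an otherwise identical signal. The pixel values are toy data, not results from the paper, and the function name is illustrative.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(generated):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


reference = [0, 255, 0, 255]          # toy striped "image"
slightly_noisy = [10, 245, 10, 245]   # same structure, small intensity error
shifted = [255, 0, 255, 0]            # same stripes, misaligned by one pixel

noisy_psnr = psnr(reference, slightly_noisy)
shifted_psnr = psnr(reference, shifted)
```

The shifted copy contains exactly the same structure as the reference yet scores far lower than the noisy one, which illustrates why registration between output and ground truth matters before pixel-level losses or metrics are applied.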

[116] Deceptive Beauty: Evaluating the Impact of Beauty Filters on Deepfake and Morphing Attack Detection

Sara Concas,Simone Maurizio La Cava,Andrea Panzino,Ester Masala,Giulia Orrù,Gian Luca Marcialis

Main category: cs.CV

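TL;DR: This study examines how social media beauty filters affect deepfake and morphing attack detectors and finds that the filters degrade detection performance, exposing the vulnerability of current detectors to facial beautification.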

Details Motivation: As beauty filters spread across social media, the reliability of facial images and videos is called into question, particularly for deepfake and morphing attack detection, so the filters' effect on existing detectors needs to be evaluated. Method: Applies various smoothing filters to multiple benchmark datasets and systematically evaluates several state-of-the-art deepfake and morphing attack detectors before and after filtering. Result: Detector performance generally degrades after beauty filtering, indicating that facial beautification makes detection harder and weakens existing methods. Conclusion: Beauty filters significantly affect the reliability of digital manipulation detection, and more robust detection models are needed to withstand such common facial enhancements. Abstract: Digital beautification through social media filters has become increasingly popular, raising concerns about the reliability of facial images and videos and the effectiveness of automated face analysis. This issue is particularly critical for digital manipulation detectors, systems aiming at distinguishing between genuine and manipulated data, especially in cases involving deepfakes and morphing attacks designed to deceive humans and automated facial recognition. This study examines whether beauty filters impact the performance of deepfake and morphing attack detectors. We perform a comprehensive analysis, evaluating multiple state-of-the-art detectors on benchmark datasets before and after applying various smoothing filters. Our findings reveal performance degradation, highlighting vulnerabilities introduced by facial enhancements and underscoring the need for robust detection models resilient to such alterations.

[117] MARS2 2025 Challenge on Multimodal Reasoning: Datasets, Methods, Results, Discussion, and Outlook

Peng Xu,Shengwu Xiong,Jiajun Zhang,Yaxiong Chen,Bowen Zhou,Chen Change Loy,David A. Clifton,Kyoung Mu Lee,Luc Van Gool,Ruiming He,Ruilin Yao,Xinwei Long,Jirui Huang,Kai Tian,Sa Yang,Yihua Shao,Jin Feng,Yue Zhong,Jiakai Zhou,Cheng Tang,Tianyu Zou,Yifang Zhang,Junming Liang,Guoyou Li,Zhaoxiang Wang,Qiang Zhou,Yichen Zhao,Shili Xiong,Hyeongjin Nam,Jaerin Lee,Jaeyoung Chung,JoonKyu Park,Junghun Oh,Kanggeon Lee,Wooseok Lee,Juneyoung Ro,Turghun Osman,Can Hu,Chaoyang Liao,Cheng Chen,Chengcheng Han,Chenhao Qiu,Chong Peng,Cong Xu,Dailin Li,Feiyu Wang,Feng Gao,Guibo Zhu,Guopeng Tang,Haibo Lu,Han Fang,Han Qi,Hanxiao Wu,Haobo Cheng,Hongbo Sun,Hongyao Chen,Huayong Hu,Hui Li,Jiaheng Ma,Jiang Yu,Jianing Wang,Jie Yang,Jing He,Jinglin Zhou,Jingxuan Li,Josef Kittler,Lihao Zheng,Linnan Zhao,Mengxi Jia,Muyang Yan,Nguyen Thanh Thien,Pu Luo,Qi Li,Shien Song,Shijie Dong,Shuai Shao,Shutao Li,Taofeng Xue,Tianyang Xu,Tianyi Gao,Tingting Li,Wei Zhang,Weiyang Su,Xiaodong Dong,Xiao-Jun Wu,Xiaopeng Zhou,Xin Chen,Xin Wei,Xinyi You,Xudong Kang,Xujie Zhou,Xusheng Liu,Yanan Wang,Yanbin Huang,Yang Liu,Yang Yang,Yanglin Deng,Yashu Kang,Ye Yuan,Yi Wen,Yicen Tian,Yilin Tao,Yin Tang,Yipeng Lin,Yiqing Wang,Yiting Xi,Yongkang Yu,Yumei Li,Yuxin Qin,Yuying Chen,Yuzhe Cen,Zhaofan Zou,Zhaohong Liu,Zhehao Shen,Zhenglin Du,Zhengyang Li,Zhenni Huang,Zhenwei Shao,Zhilong Song,Zhiyong Feng,Zhiyu Wang,Zhou Yu,Ziang Li,Zihan Zhai,Zijian Zhang,Ziyang Peng,Ziyun Xiao,Zongshu Li

Main category: cs.CV

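TL;DR: This paper reviews the MARS2 2025 Challenge, which aims to advance multimodal machine learning and large language models through a large benchmark, focusing on multimodal reasoning in real-world and specialized scenarios.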

Details Motivation: To bring together different approaches in multimodal machine learning and large language models and to advance the state of the art, especially for real-world and specialized scenarios. Method: Released two tailored datasets, Lens and AdsQA; opened three competition tracks, VG-RS, VQA-SA, and VR-Ads; and evaluated more than 40 baselines. Result: 76 teams from academic and industrial institutions registered, and 40+ valid submissions (out of 1200+) were included in the rankings; all datasets, code, and rankings are publicly available. Conclusion: MARS2 2025 establishes a platform for multimodal reasoning research, advances MLLMs in practical scenarios, and provides rich resources for future work. Abstract: This paper reviews the MARS2 2025 Challenge on Multimodal Reasoning. We aim to bring together different approaches in multimodal machine learning and LLMs via a large benchmark. We hope it better allows researchers to follow the state-of-the-art in this very dynamic area. Meanwhile, a growing number of testbeds have boosted the evolution of general-purpose large language models. Thus, this year's MARS2 focuses on real-world and specialized scenarios to broaden the multimodal reasoning applications of MLLMs. Our organizing team released two tailored datasets Lens and AdsQA as test sets, which support general reasoning in 12 daily scenarios and domain-specific reasoning in advertisement videos, respectively. We evaluated 40+ baselines that include both generalist MLLMs and task-specific models, and opened up three competition tracks, i.e., Visual Grounding in Real-world Scenarios (VG-RS), Visual Question Answering with Spatial Awareness (VQA-SA), and Visual Reasoning in Creative Advertisement Videos (VR-Ads). Finally, 76 teams from the renowned academic and industrial institutions have registered and 40+ valid submissions (out of 1200+) have been included in our ranking lists. Our datasets, code sets (40+ baselines and 15+ participants' methods), and rankings are publicly available on the MARS2 workshop website and our GitHub organization page https://github.com/mars2workshop/, where our updates and announcements of upcoming events will be continuously provided.

[118] An Exploratory Study on Abstract Images and Visual Representations Learned from Them

Haotian Li,Jianbo Jiao

Main category: cs.CV

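TL;DR: This paper studies how well abstract images (built from primitive shapes) capture high-level semantic information at different abstraction levels, introduces the Hierarchical Abstraction Image Dataset (HAID), and evaluates how their representations differ from those of traditional raster images across several vision tasks.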

Details Motivation: To understand why abstract images underperform traditional raster images in deep learning models and how much high-level semantic content is retained at different abstraction levels. Method: Builds HAID, a dataset of abstract images at multiple abstraction levels, and trains and evaluates conventional vision systems on classification, segmentation, and object detection. Result: Experiments analyze the performance gap between abstract and raster images on vision tasks and show how the abstraction level affects the semantic information retained. Conclusion: Abstract images can convey visual semantic information to some extent and have potential as an input format for vision tasks, but their performance depends strongly on the abstraction level. Abstract: Imagine living in a world composed solely of primitive shapes, could you still recognise familiar objects? Recent studies have shown that abstract images-constructed by primitive shapes-can indeed convey visual semantic information to deep learning models. However, representations obtained from such images often fall short compared to those derived from traditional raster images. In this paper, we study the reasons behind this performance gap and investigate how much high-level semantic content can be captured at different abstraction levels. To this end, we introduce the Hierarchical Abstraction Image Dataset (HAID), a novel data collection that comprises abstract images generated from normal raster images at multiple levels of abstraction. We then train and evaluate conventional vision systems on HAID across various tasks including classification, segmentation, and object detection, providing a comprehensive study between rasterised and abstract image representations. We also discuss if the abstract image can be considered as a potentially effective format for conveying visual semantic information and contributing to vision tasks.

[119] BEVUDA++: Geometric-aware Unsupervised Domain Adaptation for Multi-View 3D Object Detection

Rongyu Zhang,Jiaming Liu,Xiaoqi Li,Xiaowei Chi,Dan Wang,Li Du,Yuan Du,Shanghang Zhang

Main category: cs.CV

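TL;DR: This paper proposes BEVUDA++, a geometric-aware teacher-student framework for domain adaptation in multi-view 3D object detection, which uses a Reliable Depth Teacher and a Geometric Consistent Student to reduce performance degradation in cross-domain scenarios.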

Details Motivation: Existing vision-centric BEV perception methods suffer from significant domain shift when applied across domains, which degrades performance, yet current research has overlooked this challenge. Method: Proposes the BEVUDA++ framework, comprising a Reliable Depth Teacher (RDT) and a Geometric Consistent Student (GCS). RDT blends LiDAR with uncertainty-estimated depth predictions to produce depth-aware features; GCS maps features from multiple spaces into a unified geometric embedding space; an Uncertainty-guided Exponential Moving Average (UEMA) reduces error accumulation. Result: Experiments in four cross-domain scenarios achieve state-of-the-art performance, with gains of 12.9% NDS and 9.5% mAP on Day-Night adaptation. Conclusion: BEVUDA++ effectively mitigates the accumulation of domain shift across multiple geometric spaces and significantly improves cross-domain BEV 3D object detection. Abstract: Vision-centric Bird's Eye View (BEV) perception holds considerable promise for autonomous driving. Recent studies have prioritized efficiency or accuracy enhancements, yet the issue of domain shift has been overlooked, leading to substantial performance degradation upon transfer. We identify major domain gaps in real-world cross-domain scenarios and initiate the first effort to address the Domain Adaptation (DA) challenge in multi-view 3D object detection for BEV perception. Given the complexity of BEV perception approaches with their multiple components, domain shift accumulation across multi-geometric spaces (e.g., 2D, 3D Voxel, BEV) poses a significant challenge for BEV domain adaptation. In this paper, we introduce an innovative geometric-aware teacher-student framework, BEVUDA++, to diminish this issue, comprising a Reliable Depth Teacher (RDT) and a Geometric Consistent Student (GCS) model. Specifically, RDT effectively blends target LiDAR with dependable depth predictions to generate depth-aware information based on uncertainty estimation, enhancing the extraction of Voxel and BEV features that are essential for understanding the target domain. To collaboratively reduce the domain shift, GCS maps features from multiple spaces into a unified geometric embedding space, thereby narrowing the gap in data distribution between the two domains. Additionally, we introduce a novel Uncertainty-guided Exponential Moving Average (UEMA) to further reduce error accumulation due to domain shifts informed by previously obtained uncertainty guidance. To demonstrate the superiority of our proposed method, we execute comprehensive experiments in four cross-domain scenarios, securing state-of-the-art performance in BEV 3D object detection tasks, e.g., 12.9% NDS and 9.5% mAP enhancement on Day-Night adaptation.
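
The abstract names an Uncertainty-guided Exponential Moving Average (UEMA) but does not give its formula. The sketch below shows a standard EMA teacher update plus one possible way uncertainty could modulate it. The interpolation rule in `uncertainty_guided_momentum`, the function names, and the default momentum are all assumptions made for illustration, not the authors' formulation.

```python
def ema_update(teacher, student, momentum):
    """Standard EMA: teacher <- m * teacher + (1 - m) * student."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


def uncertainty_guided_momentum(base_momentum, uncertainty):
    """Hypothetical rule (an assumption, not taken from the paper).

    Interpolates from base_momentum (uncertainty = 0) to 1.0
    (uncertainty = 1), so that uncertain student weights move the
    teacher less.
    """
    u = min(max(uncertainty, 0.0), 1.0)  # clamp to [0, 1]
    return base_momentum + (1.0 - base_momentum) * u


def uema_update(teacher, student, uncertainty, base_momentum=0.99):
    """EMA update whose momentum depends on a scalar uncertainty."""
    m = uncertainty_guided_momentum(base_momentum, uncertainty)
    return ema_update(teacher, student, m)


if __name__ == "__main__":
    teacher = [1.0, 2.0]
    student = [3.0, 4.0]
    # Fully uncertain student: the teacher is left unchanged.
    print(uema_update(teacher, student, uncertainty=1.0, base_momentum=0.5))
    # Fully certain student: plain EMA with the base momentum.
    print(uema_update(teacher, student, uncertainty=0.0, base_momentum=0.5))
```

Under this assumed rule, high uncertainty freezes the teacher and low uncertainty falls back to ordinary EMA, which is one way to limit error accumulation from unreliable updates.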

[120] Where Do Tokens Go? Understanding Pruning Behaviors in STEP at High Resolutions

Michal Szczepanski,Martyna Poreba,Karim Haroun

Main category: cs.CV

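TL;DR: Proposes STEP, a framework that combines dynamic patch merging and token pruning to improve the efficiency of Vision Transformers for high-resolution semantic segmentation.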

Details Motivation: Vision Transformers perform strongly in semantic segmentation but incur high computational and memory costs, so their efficiency needs to improve. Method: Designs dCTS, a lightweight CNN-based policy network that merges patches into superpatches, and adds early-exit mechanisms in the encoder to prune high-confidence supertokens. Result: Compared with standard patching, dCTS alone reduces the token count by a factor of 2.5; the full STEP framework lowers computational complexity by up to 4x and raises inference speed by up to 1.7x. Conclusion: STEP significantly lowers the computational cost of ViTs and increases throughput with an accuracy drop of no more than 2.0%, effectively balancing efficiency and performance. Abstract: Vision Transformers (ViTs) achieve state-of-the-art performance in semantic segmentation but are hindered by high computational and memory costs. To address this, we propose STEP (SuperToken and Early-Pruning), a hybrid token-reduction framework that combines dynamic patch merging and token pruning to enhance efficiency without significantly compromising accuracy. At the core of STEP is dCTS, a lightweight CNN-based policy network that enables flexible merging into superpatches. Encoder blocks also integrate early exits to remove high-confidence supertokens, lowering computational load. We evaluate our method on high-resolution semantic segmentation benchmarks, including images up to 1024 x 1024, and show that when dCTS is applied alone, the token count can be reduced by a factor of 2.5 compared to the standard 16 x 16 pixel patching scheme. This yields a 2.6x reduction in computational cost and a 3.4x increase in throughput when using ViT-Large as the backbone. Applying the full STEP framework further improves efficiency, reaching up to a 4x reduction in computational complexity and a 1.7x gain in inference speed, with a maximum accuracy drop of no more than 2.0%. With the proposed STEP configurations, up to 40% of tokens can be confidently predicted and halted before reaching the final encoder layer.
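
To make the reported token reduction concrete, the following sketch works through the arithmetic for a 1024 x 1024 input with 16 x 16 patches and the 2.5x reduction quoted in the abstract. The helper names are illustrative, and the sketch assumes the reduction factor applies uniformly to the full token grid; in practice it depends on image content.

```python
def patch_tokens(height, width, patch=16):
    """Number of tokens produced by non-overlapping square patches."""
    return (height // patch) * (width // patch)


def tokens_after_merging(num_tokens, reduction_factor):
    """Token count after merging, given an overall reduction factor."""
    return num_tokens / reduction_factor


if __name__ == "__main__":
    baseline = patch_tokens(1024, 1024)            # 64 * 64 = 4096 tokens
    merged = tokens_after_merging(baseline, 2.5)   # roughly 1638 tokens
    print(f"baseline tokens: {baseline}")
    print(f"tokens after dCTS-style merging: {round(merged)}")
```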

[121] Cinéaste: A Fine-grained Contextual Movie Question Answering Benchmark

Nisarg A. Shah,Amir Ziai,Chaitanya Ekanadham,Vishal M. Patel

Main category: cs.CV

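TL;DR: This paper introduces Cinéaste, a comprehensive benchmark for long-form movie understanding with 3,119 multiple-choice question-answer pairs from 200 movies, designed to evaluate fine-grained contextual reasoning. Questions are generated with GPT-4o, and a two-stage filter ensures that they depend on video context and are factually accurate. Experiments show that existing MLLMs perform poorly on the benchmark, with long-range temporal reasoning a particular bottleneck.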

Details Motivation: Existing video understanding benchmarks are mostly limited to short-clip recognition or template-based questions and do not evaluate fine-grained reasoning over long-form narrative content, so a more comprehensive benchmark is needed to diagnose deep narrative understanding. Method: Builds a multimodal dataset that includes visual descriptions, captions, scene titles, and summaries, uses GPT-4o to generate context-rich questions, and designs a two-stage filtering mechanism (Context-Independence filtering and Contextual Veracity filtering) to ensure question quality and factual consistency. Result: Existing MLLMs perform poorly on Cinéaste; the best open-source model reaches only 63.15% accuracy, indicating that long-range temporal reasoning is the main bottleneck. Conclusion: Cinéaste exposes significant shortcomings of current multimodal large models in long-form movie understanding, especially fine-grained contextual reasoning, and highlights the need for further research in this area. Abstract: While recent advancements in vision-language models have improved video understanding, diagnosing their capacity for deep, narrative comprehension remains a challenge. Existing benchmarks often test short-clip recognition or use template-based questions, leaving a critical gap in evaluating fine-grained reasoning over long-form narrative content. To address these gaps, we introduce Cinéaste, a comprehensive benchmark for long-form movie understanding. Our dataset comprises 3,119 multiple-choice question-answer pairs derived from 1,805 scenes across 200 diverse movies, spanning five novel fine-grained contextual reasoning categories. We use GPT-4o to generate diverse, context-rich questions by integrating visual descriptions, captions, scene titles, and summaries, which require deep narrative understanding. To ensure high-quality evaluation, our pipeline incorporates a two-stage filtering process: Context-Independence filtering ensures questions require video context, while Contextual Veracity filtering validates factual consistency against the movie content, mitigating hallucinations. Experiments show that existing MLLMs struggle on Cinéaste; our analysis reveals that long-range temporal reasoning is a primary bottleneck, with the top open-source model achieving only 63.15% accuracy. This underscores significant challenges in fine-grained contextual understanding and the need for advancements in long-form movie comprehension.
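
The two-stage filtering described in the abstract can be sketched as a simple pipeline. The abstract does not say how each check is implemented, so both predicates below (`answerable_without_context` and `consistent_with_context`) are hypothetical stand-ins supplied by the caller, and the question format is an assumption made for illustration.

```python
from typing import Callable, Dict, Iterable, List

Question = Dict[str, str]  # hypothetical record, e.g. {"text": ..., "answer": ...}


def two_stage_filter(
    questions: Iterable[Question],
    answerable_without_context: Callable[[Question], bool],
    consistent_with_context: Callable[[Question], bool],
) -> List[Question]:
    """Keep questions that need the video context and are factually consistent."""
    kept = []
    for question in questions:
        # Stage 1 (Context-Independence): drop questions that can be
        # answered without any video context.
        if answerable_without_context(question):
            continue
        # Stage 2 (Contextual Veracity): drop questions whose content
        # conflicts with the movie.
        if not consistent_with_context(question):
            continue
        kept.append(question)
    return kept


if __name__ == "__main__":
    toy = [
        {"text": "q1", "needs_context": "yes", "factual": "yes"},
        {"text": "q2", "needs_context": "no", "factual": "yes"},
        {"text": "q3", "needs_context": "yes", "factual": "no"},
    ]
    result = two_stage_filter(
        toy,
        answerable_without_context=lambda q: q["needs_context"] == "no",
        consistent_with_context=lambda q: q["factual"] == "yes",
    )
    print([q["text"] for q in result])
```

In the toy run only the first question passes both stages; the real pipeline would replace the lambda predicates with whatever model-based or manual checks the benchmark actually uses.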

[122] GenExam: A Multidisciplinary Text-to-Image Exam

Zhaokai Wang,Penghao Yin,Xiangyu Zhao,Changyao Tian,Yu Qiao,Wenhai Wang,Jifeng Dai,Gen Luo

Main category: cs.CV

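TL;DR: GenExam is the first benchmark for multidisciplinary text-to-image exams, with 1,000 samples and a fine-grained scoring scheme. It shows that current models perform very poorly on semantic correctness and visual plausibility, underscoring its value for evaluating progress toward AGI.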

Details Motivation: Existing benchmarks mostly focus on understanding and reasoning or on illustrating world knowledge, and lack evaluation of rigorous drawing exams, so a new benchmark is needed that jointly tests knowledge integration, reasoning, and generation. Method: Proposes GenExam, a dataset of 1,000 samples across 10 subjects with exam-style prompts organized under a four-level taxonomy; each problem comes with ground-truth images and fine-grained scoring points to quantify the semantic correctness and visual plausibility of generated results. Result: Even state-of-the-art models such as GPT-Image-1 and Gemini-2.5-Flash-Image achieve strict scores below 15%, and most models are close to 0%, showing that the benchmark is highly challenging. Conclusion: GenExam provides a rigorous multidisciplinary exam framework for text-to-image models, measures their combined ability to integrate knowledge, reason, and generate, and offers an evaluation tool on the path toward AGI. Abstract: Exams are a fundamental test of expert-level intelligence and require integrated understanding, reasoning, and generation. Existing exam-style benchmarks mainly focus on understanding and reasoning tasks, and current generation benchmarks emphasize the illustration of world knowledge and visual concepts, neglecting the evaluation of rigorous drawing exams. We introduce GenExam, the first benchmark for multidisciplinary text-to-image exams, featuring 1,000 samples across 10 subjects with exam-style prompts organized under a four-level taxonomy. Each problem is equipped with ground-truth images and fine-grained scoring points to enable a precise evaluation of semantic correctness and visual plausibility. Experiments show that even state-of-the-art models such as GPT-Image-1 and Gemini-2.5-Flash-Image achieve less than 15% strict scores, and most models yield almost 0%, suggesting the great challenge of our benchmark. By framing image generation as an exam, GenExam offers a rigorous assessment of models' ability to integrate knowledge, reasoning, and generation, providing insights on the path to general AGI.