cs.CL

[1] PlotCraft: Pushing the Limits of LLMs for Complex and Interactive Data Visualization

Jiajun Zhang,Jianke Zhang,Zeyu Cui,Jiaxi Yang,Lei Zhang,Binyuan Hui,Qiang Liu,Zilei Wang,Liang Wang,Junyang Lin

Main category: cs.CL

TL;DR: This paper introduces PlotCraft, a new benchmark of 1,000 complex visualization tasks, and releases the SynthVis-30K dataset and the PlotCraftor model, which substantially improve LLM performance on complex data visualization generation.

Details Motivation: Existing LLMs excel at code generation, but their ability to generate visualizations for complex, structured data still lacks systematic evaluation and improvement, so a dedicated benchmark and model are needed to fill this gap. Method: Builds the PlotCraft benchmark, covering 7 high-level task categories and 48 chart types, with a multi-turn interactive evaluation protocol; generates SynthVis-30K, a dataset of high-quality visualization code, through a collaborative agent framework, and trains the lightweight PlotCraftor model on it. Result: On PlotCraft, VisEval, and PandasPlotBench, PlotCraftor performs comparably to leading closed-source models, with over 50% improvement on hard tasks. Conclusion: PlotCraft provides a new evaluation standard for complex data visualization; together with SynthVis-30K and PlotCraftor, it shows the potential of small models trained on synthetic data in this area and advances LLM-based visualization generation. Abstract: Recent Large Language Models (LLMs) have demonstrated remarkable proficiency in code generation. However, their ability to create complex visualizations for scaled and structured data remains largely unevaluated and underdeveloped. To address this gap, we introduce PlotCraft, a new benchmark featuring 1k challenging visualization tasks that cover a wide range of topics, such as finance, scientific research, and sociology. The benchmark is structured around seven high-level visualization tasks and encompasses 48 distinct chart types. Crucially, it is the first to systematically evaluate both single-turn generation and multi-turn refinement across a diverse spectrum of task complexities. Our comprehensive evaluation of 23 leading LLMs on PlotCraft reveals obvious performance deficiencies in handling sophisticated visualization tasks. To bridge this performance gap, we develop SynthVis-30K, a large-scale, high-quality dataset of complex visualization code synthesized via a collaborative agent framework. Building upon this dataset, we develop PlotCraftor, a novel code generation model that achieves strong capabilities in complex data visualization with a remarkably small size. Across VisEval, PandasPlotBench, and our proposed PlotCraft, PlotCraftor shows performance comparable to that of leading proprietary approaches. Notably, on hard tasks, our model achieves over 50% performance improvement. We will release the benchmark, dataset, and code at https://github.com/Speakn0w/PlotCraft-Benchmark.

[2] Cognitive Alignment in Personality Reasoning: Leveraging Prototype Theory for MBTI Inference

Haoyuan Li,Yuanbo Tong,Yuchen Li,Zirui Wang,Chunhou Liu,Jiamou Liu

Main category: cs.CL

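TL;DR: This paper proposes ProtoMBTI, an MBTI personality inference framework based on LLMs and prototype theory, which uses a retrieve, reuse, revise, and retain mechanism to improve the accuracy, interpretability, and cross-dataset generalization of personality recognition from text.

Details Motivation: Traditional personality recognition methods mostly use hard-label classification, which ignores the graded, prototype-like nature of human personality judgments; a modeling approach closer to cognitive mechanisms is therefore needed. Method: Proposes the ProtoMBTI framework: first, a high-quality corpus is built via LLM-guided multi-dimensional augmentation; next, a lightweight encoder is fine-tuned with LoRA to learn discriminative embeddings and establish a standardized bank of personality prototypes; at inference, predictions are made through a retrieve-reuse-revise-retain cycle. Result: On the Kaggle and Pandora datasets, ProtoMBTI outperforms baselines on both the four MBTI dichotomies and the 16-type classification task, and shows good cross-dataset generalization. Conclusion: Aligning the inference process with prototype theory from psychology effectively improves accuracy, interpretability, and transferability in text-based personality modeling. Abstract: Personality recognition from text is typically cast as hard-label classification, which obscures the graded, prototype-like nature of human personality judgments. We present ProtoMBTI, a cognitively aligned framework for MBTI inference that operationalizes prototype theory within an LLM-based pipeline. First, we construct a balanced, quality-controlled corpus via LLM-guided multi-dimensional augmentation (semantic, linguistic, sentiment). Next, we LoRA-fine-tune a lightweight (<=2B) encoder to learn discriminative embeddings and to standardize a bank of personality prototypes. At inference, we retrieve top-k prototypes for a query post and perform a retrieve--reuse--revise--retain cycle: the model aggregates prototype evidence via prompt-based voting, revises when inconsistencies arise, and, upon correct prediction, retains the sample to continually enrich the prototype library. Across Kaggle and Pandora benchmarks, ProtoMBTI improves over baselines on both the four MBTI dichotomies and the full 16-type task, and exhibits robust cross-dataset generalization. Our results indicate that aligning the inference process with psychological prototype reasoning yields gains in accuracy, interpretability, and transfer for text-based personality modeling.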
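
The retrieve, vote, and retain steps described in the ProtoMBTI abstract can be pictured with a small sketch. This is only an illustration under stated assumptions, not the authors' implementation: plain cosine similarity over toy vectors stands in for the fine-tuned encoder, a simple majority vote stands in for the paper's prompt-based voting, the revise step is omitted, and the function names (`retrieve_top_k`, `vote`, `predict_and_retain`) are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(query, prototypes, k=3):
    """Return the k (embedding, label) prototypes most similar to the query."""
    ranked = sorted(prototypes, key=lambda p: cosine(query, p[0]), reverse=True)
    return ranked[:k]


def vote(neighbors):
    """Majority vote over neighbor labels (an assumed stand-in for prompt-based voting)."""
    counts = Counter(label for _, label in neighbors)
    return counts.most_common(1)[0][0]


def predict_and_retain(query, prototypes, k=3, gold=None):
    """Predict a label; if it matches a known gold label, add the query to the bank."""
    pred = vote(retrieve_top_k(query, prototypes, k))
    if gold is not None and pred == gold:
        prototypes.append((query, pred))  # "retain": enrich the prototype library
    return pred


# Toy usage: two-dimensional embeddings with labels for one MBTI dichotomy (I vs. E).
bank = [([1.0, 0.0], "I"), ([0.9, 0.1], "I"), ([0.0, 1.0], "E")]
label = predict_and_retain([1.0, 0.05], bank, k=3, gold="I")
```

Retaining only correctly predicted samples, as the abstract describes, requires a gold label, so in this sketch the bank grows only when `gold` is supplied.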

[3] ParaScopes: What do Language Models Activations Encode About Future Text?

Nicky Pochinkov,Yulia Volkova,Anna Vasileva,Sai V R Chereddy

Main category: cs.CL

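TL;DR: Proposes a Residual Stream Decoder framework for probing paragraph- and document-level planning information in language models; in small models it can decode information equivalent to 5+ tokens of future context.

Details Motivation: Existing methods for understanding language model activations are mostly limited to specific concepts or tokens and struggle to capture planning representations in long-horizon tasks. Method: Develops a Residual Stream Decoder framework that probes model activations to extract paragraph-scale and document-scale plan information. Result: In small models, information equivalent to 5+ tokens of future context can be decoded. Conclusion: The approach lays groundwork for monitoring language models and for understanding how they encode longer-term planning information. Abstract: Interpretability studies in language models often investigate forward-looking representations of activations. However, as language models become capable of doing ever longer time-horizon tasks, methods for understanding activations often remain limited to testing specific concepts or tokens. We develop a framework of Residual Stream Decoders as a method of probing model activations for paragraph-scale and document-scale plans. We test several methods and find information can be decoded equivalent to 5+ tokens of future context in small models. These results lay the groundwork for better monitoring of language models and better understanding how they might encode longer-term planning information.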

[4] Training LLMs Beyond Next Token Prediction - Filling the Mutual Information Gap

Chun-Hao Yang,Bo-Han Feng,Tzu-Yuan Lai,Yan Yu Chen,Yin-Kai Dean Huang,Shou-De Lin

Main category: cs.CL

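TL;DR: Proposes a new method for optimizing LLM training by predicting information-rich tokens, which improves model performance more effectively than conventional next-token prediction.

Details Motivation: Conventional next-token prediction is relatively inefficient for training LLMs and makes it hard to substantially improve performance while holding computational cost constant. Method: Selects information-rich target tokens for prediction in place of standard next-token prediction, and validates the approach on arithmetic, multi-label text classification, and natural-language generation tasks. Result: The method improves model performance across multiple tasks and provides a theoretical understanding of target-token selection strategies. Conclusion: Selectively predicting information-rich tokens allows LLMs to be trained more efficiently and offers a new, principled path for training optimization. Abstract: Optimizing training performance in large language models (LLMs) remains an essential challenge, particularly in improving model performance while maintaining computational costs. This work challenges the conventional approach of training LLMs using next-token prediction (NTP), arguing that by predicting information-rich tokens during training, there is a more effective way to train LLMs. We investigate the impact of the proposed solution in three kinds of tasks for LLMs: arithmetic, multi-label classification of text, and natural-language generation. This work offers a principled approach to optimizing LLM training, advancing both model performance and theoretical understanding of the target-token selection strategies.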

[5] Consistently Simulating Human Personas with Multi-Turn Reinforcement Learning

Marwa Abdulhai,Ryan Cheng,Donovan Clay,Tim Althoff,Sergey Levine,Natasha Jaques

Main category: cs.CL

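TL;DR: This paper proposes a unified framework for evaluating and improving the persona consistency of LLM-generated dialogue; using three automatic metrics combined with multi-turn reinforcement learning, it substantially reduces persona drift and makes persona simulation more coherent and faithful.

Details Motivation: Off-the-shelf LLMs tend to drift from their assigned persona when simulating human users, leading to self-contradiction or inappropriate behavior, which limits their usefulness in interactive settings such as therapy and education. Method: Defines three automatic evaluation metrics (prompt-to-line consistency, line-to-line consistency, and Q&A consistency) and uses them as reward signals to fine-tune LLMs with multi-turn reinforcement learning. Result: In experiments on three roles (a patient, a student, and a social chat partner), the method reduces inconsistency by over 55%. Conclusion: The proposed framework effectively improves persona consistency when LLMs simulate user roles, increasing their reliability and practicality in interactive applications. Abstract: Large Language Models (LLMs) are increasingly used to simulate human users in interactive settings such as therapy, education, and social role-play. While these simulations enable scalable training and evaluation of AI agents, off-the-shelf LLMs often drift from their assigned personas, contradict earlier statements, or abandon role-appropriate behavior. We introduce a unified framework for evaluating and improving persona consistency in LLM-generated dialogue. We define three automatic metrics (prompt-to-line consistency, line-to-line consistency, and Q&A consistency) that capture different types of persona drift and validate each against human annotations. Using these metrics as reward signals, we apply multi-turn reinforcement learning to fine-tune LLMs for three user roles: a patient, a student, and a social chat partner. Our method reduces inconsistency by over 55%, resulting in more coherent and faithful simulated users.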

[6] AgentBnB: A Browser-Based Cybersecurity Tabletop Exercise with Large Language Model Support and Retrieval-Aligned Scaffolding

Arman Anwar,Zefang Liu

Main category: cs.CL

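TL;DR: AgentBnB is a browser-based cybersecurity tabletop exercise system that uses large language models and a retrieval-augmented assistant to provide a scalable, lightweight training experience.

Details Motivation: Traditional cybersecurity tabletop exercises (TTXs) are usually scripted, resource-intensive, and hard to scale, so a more flexible, lower-cost alternative is needed. Method: Designs and implements the AgentBnB system, which combines LLM teammates with a Bloom-aligned, retrieval-augmented copilot (C2D2); the knowledge base is broken down into factual, conceptual, procedural, and metacognitive snippets, and prompt engineering is used to implement progressively fading scaffolding. Result: In a solo-player pilot with four graduate students, participants preferred the agent-based version and considered it more scalable, but a ceiling effect appeared on a simple knowledge quiz. Conclusion: Despite limitations such as a small sample size, a single-player-only mode, and a narrow corpus, the results suggest that LLM-augmented TTXs can provide lightweight, repeatable practice and reduce the logistical burden of traditional exercises. Planned future work includes multi-player modes, telemetry-driven coaching, and larger comparative studies. Abstract: Traditional cybersecurity tabletop exercises (TTXs) provide valuable training but are often scripted, resource-intensive, and difficult to scale. We introduce AgentBnB, a browser-based re-imagining of the Backdoors & Breaches game that integrates large language model teammates with a Bloom-aligned, retrieval-augmented copilot (C2D2). The system expands a curated corpus into factual, conceptual, procedural, and metacognitive snippets, delivering on-demand, cognitively targeted hints. Prompt-engineered agents employ a scaffolding ladder that gradually fades as learner confidence grows. In a solo-player pilot with four graduate students, participants reported greater intention to use the agent-based version compared to the physical card deck and viewed it as more scalable, though a ceiling effect emerged on a simple knowledge quiz. Despite limitations of small sample size, single-player focus, and narrow corpus, these early findings suggest that large language model augmented TTXs can provide lightweight, repeatable practice without the logistical burden of traditional exercises. Planned extensions include multi-player modes, telemetry-driven coaching, and comparative studies with larger cohorts.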

Shounak Paul,Dhananjay Ghumare,Pawan Goyal,Saptarshi Ghosh,Ashutosh Modi

Main category: cs.CL

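TL;DR: This paper proposes IL-PCR, a unified corpus for Indian legal precedent and statute retrieval that supports developing models which exploit the dependence between the two tasks, and presents an LLM-based re-ranking method that achieves the best performance.

Details Motivation: Existing research treats statute retrieval and precedent retrieval as independent tasks, overlooking their inherent connection (for example, similar cases tend to cite similar statutes); this paper aims to fill that gap. Method: Builds a unified corpus, IL-PCR, that supports both statute and precedent retrieval; experimentally evaluates several baseline models (lexical models, semantic models, and GNN-based ensembles), and designs an LLM-based re-ranking method to exploit the dependence between the tasks. Result: The LLM-based re-ranking method achieves the best performance on both retrieval tasks, supporting the value of exploiting the inter-task dependence. Conclusion: Building a unified corpus and introducing LLM re-ranking makes it possible to exploit the inherent link between statute and precedent retrieval and improves legal information retrieval overall. Abstract: Identifying/retrieving relevant statutes and prior cases/precedents for a given legal situation are common tasks exercised by law practitioners. Researchers to date have addressed the two tasks independently, thus developing completely different datasets and models for each task; however, both retrieval tasks are inherently related, e.g., similar cases tend to cite similar statutes (due to similar factual situations). In this paper, we address this gap. We propose IL-PCR (Indian Legal corpus for Prior Case and Statute Retrieval), which is a unique corpus that provides a common testbed for developing models for both tasks (Statute Retrieval and Precedent Retrieval) that can exploit the dependence between the two. We experiment extensively with several baseline models on the tasks, including lexical models, semantic models, and ensembles based on GNNs. Further, to exploit the dependence between the two tasks, we develop an LLM-based re-ranking approach that gives the best performance.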

[8] POSESTITCH-SLT: Linguistically Inspired Pose-Stitching for End-to-End Sign Language Translation

Abhinav Joshi,Vaibhav Sharma,Sanjeet Singh,Ashutosh Modi

Main category: cs.CL

TL;DR: Proposes POSESTITCH-SLT, a pre-training method that uses linguistic templates to generate sentence pairs and substantially improves performance in low-resource sign language translation.

Details Motivation: Sign language translation is challenging because large-scale aligned datasets are scarce; existing methods mostly focus on feature extraction and architectural improvements. Method: Proposes POSESTITCH-SLT, a pre-training scheme based on linguistic templates, using a simple Transformer encoder-decoder architecture trained with template-generated synthetic sentence pairs. Result: BLEU-4 improves from 1.97 to 4.56 on How2Sign and from 0.55 to 3.43 on iSign, outperforming prior methods. Conclusion: Template-driven synthetic supervision effectively improves pose-based sign language translation in low-resource settings. Abstract: Sign language translation remains a challenging task due to the scarcity of large-scale, sentence-aligned datasets. Prior work has focused on various feature extraction and architectural changes to support neural machine translation for sign languages. We propose POSESTITCH-SLT, a novel pre-training scheme inspired by a linguistic-template-based sentence generation technique. With translation comparison on two sign language datasets, How2Sign and iSign, we show that a simple transformer-based encoder-decoder architecture outperforms the prior art when considering template-generated sentence pairs in training. We achieve BLEU-4 score improvements from 1.97 to 4.56 on How2Sign and from 0.55 to 3.43 on iSign, surpassing prior state-of-the-art methods for pose-based gloss-free translation. The results demonstrate the effectiveness of template-driven synthetic supervision in low-resource sign language settings.

[9] Language Modeling With Factorization Memory

Lee Xiong,Maksim Tkachenko,Johanes Effendi,Ting Cai

Main category: cs.CL

TL;DR: Proposes Factorization Memory, an efficient RNN architecture that matches Transformer performance on short-context tasks and generalizes better in long-context scenarios.

Details Motivation: To design an RNN architecture that combines efficient training, low inference complexity, and strong modeling of both short and long contexts. Method: Builds Factorization Memory on top of Mamba-2 and introduces a sparse variant to improve efficiency; the model uses parallel computation during training and keeps constant computational and memory complexity during inference. Result: Factorization Memory reaches performance comparable to Transformers on short-context language modeling tasks and performs better in long-context scenarios; the sparse variant reduces state updates while preserving the performance of the dense version. Conclusion: According to the authors, this is the first RNN architecture to combine sparse memory activation with competitive performance in both short- and long-context settings, offering a new direction for efficient sequence modeling. Abstract: We propose Factorization Memory, an efficient recurrent neural network (RNN) architecture that achieves performance comparable to Transformer models on short-context language modeling tasks while also demonstrating superior generalization in long-context scenarios. Our model builds upon Mamba-2, enabling Factorization Memory to exploit parallel computations during training while preserving constant computational and memory complexity during inference. To further optimize model efficiency and representational capacity, we develop a sparse formulation of Factorization Memory that updates only a subset of recurrent states at each step while preserving the strong performance of its dense counterpart. To our knowledge, this represents the first RNN architecture that successfully combines sparse memory activation with competitive performance across both short and long-context settings. This work provides a systematic empirical analysis of Factorization Memory in comparison to Transformer and Mamba-2 architectures.

[10] Reversal Invariance in Autoregressive Language Models

Mihir Sahasrabudhe

Main category: cs.CL

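TL;DR: This paper identifies a structural property of the causal language modeling (CLM) objective, reversal invariance: the next-token prediction loss assigns the same likelihood to a text and its reversal, so standard pretraining is direction-blind. The author argues that this limits a model's ability to capture time-asymmetric dependencies in language and calls for future work on objectives and architectures that explicitly model the arrow of language.

Details Motivation: Although human language is asymmetric in time, existing CLM pretraining objectives behave similarly on a text and its reversed form, which raises the question of whether the current objective can effectively capture directional dependencies in language. Method: Formally analyzes the structural properties of the CLM objective, introduces and defines the notion of "reversal invariance", and discusses its theoretical and empirical implications. Result: Shows that the standard CLM next-token prediction loss assigns identical likelihood to the original and reversed corpus, explains why models trained on reversed text can still perform well, and indicates that the current pretraining objective may fail to capture directional dependencies in language. Conclusion: Reversal invariance reflects a limitation of current CLM pretraining objectives; future work should move toward language modeling paradigms that explicitly model temporal asymmetry, to better capture causal, morphological, or phonological directional dependencies. Abstract: We formalize a structural property of the causal (autoregressive) language modeling (CLM) objective: reversal invariance. Formally, the next-token prediction loss assigns identical likelihood to a corpus and its reversal, implying that standard CLM pretraining is direction-blind. This symmetry explains why models trained on reversed text can achieve comparable performance to those trained on forward text, despite the inherently time-asymmetric nature of human language and reasoning. We argue that this invariance represents a limitation of current pretraining objectives rather than a benign artifact. If natural language encodes directional dependencies - phonological, morphological, or causal - a symmetric objective may fail to capture them. We therefore propose viewing pretraining through the lens of temporal asymmetry, motivating future work on loss functions and architectures that explicitly model the arrow of language while retaining standard language modeling capacity.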
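
One way to see why a likelihood objective has no built-in direction is the chain rule of probability, which factorizes the same joint distribution in either order. The identity below is a standard fact shown only as a hedged illustration; it is not taken from the paper, whose formal statement may differ, and the symbols $p$ and $x_1,\dots,x_n$ are notation assumed here for a sequence of $n$ tokens.

```latex
\begin{align}
p(x_1,\dots,x_n)
  &= \prod_{t=1}^{n} p(x_t \mid x_1,\dots,x_{t-1})
  && \text{(left-to-right factorization)} \\
  &= \prod_{t=1}^{n} p(x_t \mid x_{t+1},\dots,x_n)
  && \text{(right-to-left factorization)}
\end{align}
```

Both products equal the same joint probability, so the negative log-likelihood $-\log p(x_1,\dots,x_n)$ of a sequence is the same whichever order is used to decompose it. Whether a finite model trained in one direction actually attains the same loss as one trained in the other is an empirical question that this identity does not settle.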

[11] LingGym: How Far Are LLMs from Thinking Like Field Linguists?

Changbing Yang,Franklin Ma,Freda Shi,Jian Zhu

Main category: cs.CL

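TL;DR: This paper introduces LingGym, a new benchmark that evaluates the meta-linguistic reasoning of LLMs using Interlinear Glossed Text (IGT) and grammatical descriptions extracted from 18 typologically diverse reference grammars. With a Word-Gloss Inference task, the study finds that structured linguistic cues consistently improve model reasoning on low-resource languages and unseen structures, revealing both the potential and the limits of LLMs for typological analysis and low-resource language documentation.

Details Motivation: Existing work mostly focuses on specific downstream tasks and does not evaluate whether LLMs can generalize linguistic reasoning to unseen structures and low-resource languages; a new benchmark that measures meta-linguistic reasoning is therefore needed. Method: Extracts Interlinear Glossed Text (IGT) and grammatical descriptions from 18 typologically diverse reference grammars to build the LingGym benchmark; designs the Word-Gloss Inference task, which controls the level of linguistic information in the input (e.g., glosses, grammatical explanations, translations) to evaluate reasoning under different information conditions. Result: Adding structured linguistic cues (such as grammatical explanations and glosses) consistently improves reasoning performance across all models, including on low-resource languages and unseen structures. Conclusion: LingGym offers an effective framework for evaluating meta-linguistic reasoning in LLMs; structured linguistic information improves model performance, but the results also expose the limits of current models on complex linguistic analysis and point to the need for better modeling of typological diversity. Abstract: This paper introduces LingGym, a new benchmark that evaluates LLMs' capacity for meta-linguistic reasoning using Interlinear Glossed Text (IGT) and grammatical descriptions extracted from 18 typologically diverse reference grammars. Unlike previous work that focuses on specific downstream tasks, we assess whether LLMs can generalize linguistic inference across low-resource languages and structures not seen during training. We present a controlled evaluation task: Word-Gloss Inference, in which the model must infer a missing word and gloss from context using varying levels of linguistic information (e.g., glosses, grammatical explanations, translations). Our results show that incorporating structured linguistic cues leads to consistent improvements in reasoning performance across all models. This work highlights both the promise and current limitations of using LLMs for typologically informed linguistic analysis and low-resource language documentation.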

[12] Reasoning Trajectories for Socratic Debugging of Student Code: From Misconceptions to Contradictions and Updated Beliefs

Erfan Al-Hossami,Razvan Bunescu

Main category: cs.CL

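TL;DR: This paper introduces the task of reasoning trajectory generation and a dataset annotated with Socratic debugging reasoning trajectories, and uses LLMs to generate reasoning trajectories and conversations anchored on them; evaluation shows that frontier models can generate up to 91% correct reasoning trajectories and 98.7% valid conversation turns.

Details Motivation: Bugs written by novice programmers are often caused by programming misconceptions; Socratic guidance aims to help students find and correct these false beliefs on their own, thereby improving learning. Method: Introduces the concept of a Reasoning Trajectory (RT), builds a dataset of debugging problems annotated with RTs, and uses LLMs to generate RTs and RT-anchored Socratic conversations. Result: A large-scale LLM-as-judge evaluation shows that frontier models can generate up to 91% correct reasoning trajectories and 98.7% valid conversation turns. Conclusion: LLM-based generation of Socratic debugging reasoning trajectories is feasible and effective, and is promising for personalized guidance in programming education. Abstract: In Socratic debugging, instructors guide students towards identifying and fixing a bug on their own, instead of providing the bug fix directly. Most novice programmer bugs are caused by programming misconceptions, namely false beliefs about a programming concept. In this context, Socratic debugging can be formulated as a guided Reasoning Trajectory (RT) leading to a statement about the program behavior that contradicts the bug-causing misconception. Upon reaching this statement, the ensuing cognitive dissonance leads the student to first identify and then update their false belief. In this paper, we introduce the task of reasoning trajectory generation, together with a dataset of debugging problems manually annotated with RTs. We then describe LLM-based solutions for generating RTs and Socratic conversations that are anchored on them. A large-scale LLM-as-judge evaluation shows that frontier models can generate up to 91% correct reasoning trajectories and 98.7% valid conversation turns.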

[13] PADBen: A Comprehensive Benchmark for Evaluating AI Text Detectors Against Paraphrase Attacks

Yiwei Zha,Rui Min,Shanu Sushmita

Main category: cs.CL

TL;DR: This paper studies how iteratively paraphrased text evades AI-generated text detectors and proposes the PADBen benchmark to systematically evaluate detector robustness against two kinds of paraphrase attack (authorship obfuscation and plagiarism evasion), finding that existing detection methods perform poorly in the intermediate "laundering" region.

Details Motivation: Existing AI-generated text detectors perform well on direct LLM outputs but degrade sharply on iteratively paraphrased text, so the underlying mechanism needs to be understood and detector robustness improved. Method: Through intrinsic mechanism analysis, proposes the PADBen benchmark, which comprises a five-type text taxonomy and five progressive detection tasks, and uses it to systematically evaluate 11 state-of-the-art detectors under the two paraphrase attack scenarios. Result: Current detectors handle plagiarism evasion effectively but fail on authorship obfuscation, exposing a blind spot in the "intermediate laundering region", where semantic displacement coexists with preserved generation patterns. Conclusion: Existing detection methods based on semantic and stylistic differences are insufficient for iteratively paraphrased text; more fundamental advances in detection architectures are needed to counter text laundering attacks. Abstract: While AI-generated text (AIGT) detectors achieve over 90% accuracy on direct LLM outputs, they fail catastrophically against iteratively-paraphrased content. We investigate why iteratively-paraphrased text -- itself AI-generated -- evades detection systems designed for AIGT identification. Through intrinsic mechanism analysis, we reveal that iterative paraphrasing creates an intermediate laundering region characterized by semantic displacement with preserved generation patterns, which brings up two attack categories: paraphrasing human-authored text (authorship obfuscation) and paraphrasing LLM-generated text (plagiarism evasion). To address these vulnerabilities, we introduce PADBen, the first benchmark systematically evaluating detector robustness against both paraphrase attack scenarios. PADBen comprises a five-type text taxonomy capturing the full trajectory from original content to deeply laundered text, and five progressive detection tasks across sentence-pair and single-sentence challenges. We evaluate 11 state-of-the-art detectors, revealing critical asymmetry: detectors successfully identify the plagiarism evasion problem but fail for the case of authorship obfuscation. Our findings demonstrate that current detection approaches cannot effectively handle the intermediate laundering region, necessitating fundamental advances in detection architectures beyond existing semantic and stylistic discrimination methods. For detailed code implementation, please see https://github.com/JonathanZha47/PadBen-Paraphrase-Attack-Benchmark.

[14] MedRECT: A Medical Reasoning Benchmark for Error Correction in Clinical Texts

Naoto Iwase,Hiroki Okuyama,Junichiro Iwasawa

Main category: cs.CL

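TL;DR: MedRECT is a new cross-lingual (Japanese/English) medical error handling benchmark for evaluating how well LLMs detect, localize, and correct errors in clinical texts; it finds that reasoning models perform better and that a fine-tuned model exceeds human experts on structured tasks.

Details Motivation: The safety of LLMs in medical applications depends on their ability to detect and correct errors in clinical texts, but this ability has not been systematically evaluated outside English; a cross-lingual benchmark is therefore needed to measure medical error handling in multilingual settings. Method: Proposes the MedRECT benchmark, with Japanese and English versions built from the Japanese Medical Licensing Examinations and a corresponding English dataset; it covers three subtasks (error detection, error localization via sentence extraction, and error correction), and the datasets are generated with an automated pipeline. Nine mainstream LLMs are evaluated, and LoRA fine-tuning is applied to some models to improve performance. Result: (1) Reasoning models substantially outperform standard architectures, with up to 13.5% improvement in error detection and up to 51.0% in sentence extraction; (2) there is a 5-10% performance gap from English to Japanese, which is smaller for reasoning models; (3) LoRA fine-tuning yields asymmetric gains (Japanese +0.078, English +0.168) while preserving reasoning ability; (4) the fine-tuned model exceeds human experts on structured error correction tasks. Conclusion: MedRECT is the first comprehensive cross-lingual benchmark for medical error correction; it provides a reproducible framework and resources for developing safer multilingual medical LLMs and supports more reliable deployment in real clinical settings. Abstract: Large language models (LLMs) show increasing promise in medical applications, but their ability to detect and correct errors in clinical texts -- a prerequisite for safe deployment -- remains under-evaluated, particularly beyond English. We introduce MedRECT, a cross-lingual benchmark (Japanese/English) that formulates medical error handling as three subtasks: error detection, error localization (sentence extraction), and error correction. MedRECT is built with a scalable, automated pipeline from the Japanese Medical Licensing Examinations (JMLE) and a curated English counterpart, yielding MedRECT-ja (663 texts) and MedRECT-en (458 texts) with comparable error/no-error balance. We evaluate 9 contemporary LLMs spanning proprietary, open-weight, and reasoning families. Key findings: (i) reasoning models substantially outperform standard architectures, with up to 13.5% relative improvement in error detection and 51.0% in sentence extraction; (ii) cross-lingual evaluation reveals 5-10% performance gaps from English to Japanese, with smaller disparities for reasoning models; (iii) targeted LoRA fine-tuning yields asymmetric improvements in error correction performance (Japanese: +0.078, English: +0.168) while preserving reasoning capabilities; and (iv) our fine-tuned model exceeds human expert performance on structured medical error correction tasks. To our knowledge, MedRECT is the first comprehensive cross-lingual benchmark for medical error correction, providing a reproducible framework and resources for developing safer medical LLMs across languages.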

[15] G2: Guided Generation for Enhanced Output Diversity in LLMs

Zhiwen Ruan,Yixia Li,Yefeng Liu,Yun Chen,Weihua Luo,Peng Li,Yang Liu,Guanhua Chen

Main category: cs.CL

TL;DR: Proposes Guide-to-Generation (G2), a training-free, plug-and-play method that improves the output diversity of LLMs through decoding interventions while preserving generation quality.

Details Motivation: LLMs often produce highly similar content across repeated generations, which limits their use in tasks that need diversity, and existing methods tend to sacrifice generation quality when increasing diversity. Method: G2 uses a base generator and two guide modules; through decoding-based interventions it steers the generation process while keeping it conditioned on the original query, thereby encouraging more diverse outputs. Result: Extensive experiments show that G2 effectively improves output diversity and maintains a good balance between diversity and generation quality. Conclusion: G2 is a training-free, plug-and-play method for enhancing generation diversity that substantially increases the output diversity of LLMs without harming generation quality. Abstract: Large Language Models (LLMs) have demonstrated exceptional performance across diverse natural language processing tasks. However, these models exhibit a critical limitation in output diversity, often generating highly similar content across multiple attempts. This limitation significantly affects tasks requiring diverse outputs, from creative writing to reasoning. Existing solutions, like temperature scaling, enhance diversity by modifying probability distributions but compromise output quality. We propose Guide-to-Generation (G2), a training-free plug-and-play method that enhances output diversity while preserving generation quality. G2 employs a base generator alongside dual Guides, which guide the generation process through decoding-based interventions to encourage more diverse outputs conditioned on the original query. Comprehensive experiments demonstrate that G2 effectively improves output diversity while maintaining an optimal balance between diversity and quality.

[16] Remembering Unequally: Global and Disciplinary Bias in LLM-Generated Co-Authorship Networks

Ghazal Kalhor,Afra Mashhadi

Main category: cs.CL

TL;DR: This study examines how LLM memorization affects co-authorship networks, analyzing three prominent models (DeepSeek R1, Llama 4 Scout, and Mixtral 8x7B) across disciplines and regions; it finds a widespread bias toward highly cited researchers, with more balanced representation in Clinical Medicine and parts of Africa.

Details Motivation: As LLMs become more deeply integrated into scholarly search and recommendation systems, the content they generate from memorized data may introduce fairness and bias problems that affect the integrity of scientometric results, so their effect on co-authorship network construction needs to be assessed. Method: Analyzes the co-authorship networks generated by three prominent LLMs across academic disciplines and geographic regions to assess the degree of memorization and the bias patterns in the outputs. Result: Globally, LLMs consistently favor highly cited researchers, but Clinical Medicine and parts of Africa show more balanced representation, suggesting that training data may be more equitable in some disciplines or regions. Conclusion: LLM memorization may amplify existing biases in the research community, but there is also room for improvement; training data should be refined to make scholarly discovery systems fairer. Abstract: Ongoing breakthroughs in Large Language Models (LLMs) are reshaping search and recommendation platforms at their core. While this shift unlocks powerful new scientometric tools, it also exposes critical fairness and bias issues that could erode the integrity of the information ecosystem. Additionally, as LLMs become more integrated into web-based searches for scholarly tools, their ability to generate summarized research work based on memorized data introduces new dimensions to these challenges. The extent of memorization in LLMs can impact the accuracy and fairness of the co-authorship networks they produce, potentially reflecting and amplifying existing biases within the scientific community and across different regions. This study critically examines the impact of LLM memorization on the co-authorship networks. To this end, we assess memorization effects across three prominent models, DeepSeek R1, Llama 4 Scout, and Mixtral 8x7B, analyzing how memorization-driven outputs vary across academic disciplines and world regions. While our global analysis reveals a consistent bias favoring highly cited researchers, this pattern is not uniformly observed. Certain disciplines, such as Clinical Medicine, and regions, including parts of Africa, show more balanced representation, pointing to areas where LLM training data may reflect greater equity. These findings underscore both the risks and opportunities in deploying LLMs for scholarly discovery.

[17] Leveraging the Cross-Domain & Cross-Linguistic Corpus for Low Resource NMT: A Case Study On Bhili-Hindi-English Parallel Corpus

Pooja Singh,Shashwat Bhardwaj,Vaibhav Sharma,Sandeep Kumar

Main category: cs.CL

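TL;DR: This paper introduces the Bhili-Hindi-English Parallel Corpus (BHEPC), the first large-scale corpus of its kind, containing 110,000 carefully curated sentences. It fills a resource gap for low-resource machine translation, establishes a Bhili machine translation benchmark, and evaluates a range of multilingual LLMs; the fine-tuned NLLB-200 model performs best, and the paper also examines the generative translation ability of LLMs under cross-domain generalization and distributional divergence.

Details Motivation: Indian tribal languages such as Bhili lack high-quality linguistic resources, which makes machine translation research difficult; a dedicated corpus is needed to advance natural language processing for low-resource languages. Method: Builds the BHEPC parallel corpus with expert human translators, covering domains such as education, administration, and news; on this basis, evaluates a range of proprietary and open-source multilingual LLMs on bidirectional translation tasks, using fine-tuning and in-context learning to test performance under cross-domain generalization and distributional divergence. Result: Produces the first large-scale Bhili-Hindi-English parallel corpus (BHEPC); experiments show that the fine-tuned NLLB-200 distilled 600M model performs best on translation, that multilingual models show promise in low-resource conditions, and that cross-domain generalization is related to distributional divergence. Conclusion: This work fills a key resource gap for machine translation of low-resource languages such as Bhili, supports the effectiveness of multilingual models in low-resource settings, and provides a foundation for inclusive natural language processing technologies for marginalized languages worldwide. Abstract: The linguistic diversity of India poses significant machine translation challenges, especially for underrepresented tribal languages like Bhili, which lack high-quality linguistic resources. This paper addresses the gap by introducing Bhili-Hindi-English Parallel Corpus (BHEPC), the first and largest parallel corpus worldwide comprising 110,000 meticulously curated sentences across Bhili, Hindi, and English. The corpus was created with the assistance of expert human translators. BHEPC spans critical domains such as education, administration, and news, establishing a valuable benchmark for research in low resource machine translation. To establish a comprehensive Bhili Machine Translation benchmark, we evaluated a wide range of proprietary and open-source Multilingual Large Language Models (MLLMs) on bidirectional translation tasks between English/Hindi and Bhili. Comprehensive evaluation demonstrates that the fine-tuned NLLB-200 distilled 600M variant model outperforms others, highlighting the potential of multilingual models in low resource scenarios. Furthermore, we investigated the generative translation capabilities of multilingual LLMs on BHEPC using in-context learning, assessing performance under cross-domain generalization and quantifying distributional divergence. This work bridges a critical resource gap and promotes inclusive natural language processing technologies for low-resource and marginalized languages globally.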

[18] With Privacy, Size Matters: On the Importance of Dataset Size in Differentially Private Text Rewriting

Stephen Meisenbacher,Florian Matthes

Main category: cs.CL

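TL;DR: This study is the first to examine how dataset size affects the utility and privacy protection of differentially private text rewriting mechanisms. Experiments on large-scale datasets with dynamic splits show that dataset size significantly affects the privacy-utility trade-off; the paper calls for more rigorous evaluation and offers insights for applying DP NLP at scale.

Details Motivation: Existing work on differential privacy in natural language processing (DP NLP) tends to overlook the effect of dataset size when evaluating text rewriting mechanisms and lacks a systematic analysis of privacy and utility at different data scales. Method: Designs utility and privacy tests for large-scale datasets with dynamic split sizes, and evaluates how scale affects the performance of DP text rewriting mechanisms on datasets of up to one million texts. Result: Experiments show that dataset size has an important effect on the privacy-utility trade-off. Conclusion: Dataset size is a key factor in evaluating DP text rewriting mechanisms; future work should adopt stricter evaluation procedures to support applying DP NLP at scale in practice. Abstract: Recent work in Differential Privacy with Natural Language Processing (DP NLP) has proposed numerous promising techniques in the form of text rewriting mechanisms. In the evaluation of these mechanisms, an often-ignored aspect is that of dataset size, or rather, the effect of dataset size on a mechanism's efficacy for utility and privacy preservation. In this work, we are the first to introduce this factor in the evaluation of DP text privatization, where we design utility and privacy tests on large-scale datasets with dynamic split sizes. We run these tests on datasets of varying size with up to one million texts, and we focus on quantifying the effect of increasing dataset size on the privacy-utility trade-off. Our findings reveal that dataset size plays an integral part in evaluating DP text rewriting mechanisms; additionally, these findings call for more rigorous evaluation procedures in DP NLP, as well as shed light on the future of DP NLP in practice and at scale.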

[19] ToM: Leveraging Tree-oriented MapReduce for Long-Context Reasoning in Large Language Models

Jiani Guo,Zuchao Li,Jie Wu,Qianren Wang,Yun Li,Lefei Zhang,Hai Zhao,Yujiu Yang

Main category: cs.CL

TL;DR: This paper proposes ToM, a tree-oriented MapReduce framework that addresses the logical incoherence and missing long-range dependencies that arise in LLM long-context reasoning from context window limits and chunked processing.

Details Motivation: Existing retrieval-augmented generation (RAG) and divide-and-conquer frameworks (DCF) rely on similarity-based ranking or process chunks in isolation when handling long texts, so they tend to sacrifice logical coherence and miss long-range dependencies. Method: ToM builds a document tree (DocTree) through hierarchical semantic parsing and performs bottom-up recursive reasoning with a tree MapReduce framework: the Map step generates rationales at child nodes, and the Reduce step aggregates the rationales of sibling nodes to resolve conflicts or reach consensus. Result: Experiments on 70B+ LLMs show that ToM significantly outperforms existing DCF and RAG methods, with better logical coherence and long-context reasoning. Conclusion: By exploiting the inherent hierarchical structure of documents, ToM effectively improves long-context reasoning and is a promising framework for it. Abstract: Large Language Models (LLMs), constrained by limited context windows, often face significant performance degradation when reasoning over long contexts. To address this, Retrieval-Augmented Generation (RAG) retrieves and reasons over chunks but frequently sacrifices logical coherence due to its reliance on similarity-based rankings. Similarly, divide-and-conquer frameworks (DCF) split documents into small chunks for independent reasoning and aggregation. While effective for local reasoning, DCF struggles to capture long-range dependencies and risks inducing conflicts by processing chunks in isolation. To overcome these limitations, we propose ToM, a novel Tree-oriented MapReduce framework for long-context reasoning. ToM leverages the inherent hierarchical structure of long documents (e.g., main headings and subheadings) by constructing a DocTree through hierarchical semantic parsing and performing bottom-up aggregation. Using a Tree MapReduce approach, ToM enables recursive reasoning: in the Map step, rationales are generated at child nodes; in the Reduce step, these rationales are aggregated across sibling nodes to resolve conflicts or reach consensus at parent nodes. Experimental results on 70B+ LLMs show that ToM significantly outperforms existing divide-and-conquer frameworks and retrieval-augmented generation methods, achieving better logical coherence and long-context reasoning. Our code is available at https://github.com/gjn12-31/ToM.
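
The bottom-up Map and Reduce steps described in the ToM abstract can be sketched as a recursion over a document tree. This is a minimal illustration under stated assumptions rather than the released implementation: the `Node` class and `tree_map_reduce` function are hypothetical names, the tree is built by hand instead of by hierarchical semantic parsing, and simple string functions stand in for the LLM calls that would generate and aggregate rationales.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """One section of a document tree (hypothetical stand-in for a DocTree node)."""
    title: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def tree_map_reduce(
    node: Node,
    map_fn: Callable[[Node], str],
    reduce_fn: Callable[[Node, List[str]], str],
) -> str:
    """Apply map_fn at leaves, then combine sibling results at each parent, bottom-up."""
    if not node.children:
        return map_fn(node)  # Map: produce a rationale for a leaf section
    child_results = [tree_map_reduce(c, map_fn, reduce_fn) for c in node.children]
    return reduce_fn(node, child_results)  # Reduce: aggregate sibling rationales


# Placeholder functions; in practice each would call an LLM (assumption).
def toy_map(node: Node) -> str:
    return f"{node.title}={len(node.text.split())}"


def toy_reduce(node: Node, results: List[str]) -> str:
    return f"{node.title}[" + "; ".join(results) + "]"


doc = Node("Report", children=[
    Node("Intro", "a b c"),
    Node("Methods", children=[Node("Data", "d e"), Node("Model", "f")]),
])
summary = tree_map_reduce(doc, toy_map, toy_reduce)
```

Because each parent only sees its own children's results, the amount of text handled at any single step is bounded by one level of the tree, which is the property that makes this shape attractive for limited context windows.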

[20] Zero-RAG: Towards Retrieval-Augmented Generation with Zero Redundant Knowledge

Qi Luo,Xiaonan Li,Junqi Dai,Shuang Cheng,Xipeng Qiu

Main category: cs.CL

TL;DR: This paper proposes Zero-RAG, a framework that improves RAG efficiency by identifying and pruning redundant knowledge in the retrieval corpus, cutting the Wikipedia corpus by 30% and retrieval time by 22% while preserving performance.

Details Motivation: As the internal knowledge of large models grows, the external corpus used by conventional RAG overlaps substantially with what the model already knows, which increases retrieval overhead and can hurt the model's use of its own knowledge. Method: The Mastery-Score metric identifies redundant knowledge in the RAG corpus so that it can be pruned; a Query Router and Noise-Tolerant Tuning then strengthen the model's use of its internal knowledge. Result: Zero-RAG prunes the Wikipedia corpus by 30% and speeds up the retrieval stage by 22% without sacrificing overall RAG performance. Conclusion: By removing redundant knowledge from the external corpus, Zero-RAG balances the use of internal and external knowledge, lowering retrieval cost while maintaining generation quality. Abstract: Retrieval-Augmented Generation (RAG), which usually uses a large external corpus to supplement the knowledge of LLMs, has shown remarkable results in addressing Large Language Models' hallucinations. However, with the development of LLMs, the internal knowledge of LLMs has expanded significantly, thus causing significant knowledge redundancy between the external corpus and LLMs. On the one hand, the indexing cost of dense retrieval is highly related to the corpus size, and thus significant redundant knowledge intensifies the dense retrieval's workload. On the other hand, the redundant knowledge in the external corpus is not helpful to LLMs, and our exploratory analysis shows that it instead hurts the RAG performance on those questions which the LLM can answer by itself. To address these issues, we propose Zero-RAG. Specifically, we first propose the Mastery-Score metric to identify redundant knowledge in the RAG corpus in order to prune it. After pruning, answers to "mastered" questions rely primarily on the internal knowledge of the LLM. To better harness this internal capacity, we propose Query Router and Noise-Tolerant Tuning to avoid distraction from irrelevant documents and thus further improve the LLM's utilization of internal knowledge with the pruned corpus. Experimental results show that Zero-RAG prunes the Wikipedia corpus by 30% and accelerates the retrieval stage by 22%, without compromising RAG's performance.

[21] Fine-Tuning DialoGPT on Common Diseases in Rural Nepal for Medical Conversations

Birat Poudel,Satyam Ghimire,Er. Prakash Chandra Prasad

Main category: cs.CL

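TL;DR: This study fine-tunes DialoGPT, a lightweight offline dialogue model, on a synthetic doctor-patient dataset covering ten common diseases in resource-constrained rural Nepal. The fine-tuned model generates coherent, relevant, and medically appropriate responses, shows symptom understanding and empathy, and illustrates the potential of small offline models in remote healthcare settings.

Details Motivation: Large conversational models are hard to deploy in rural areas that lack internet and cloud infrastructure, so primary care there needs a medical dialogue system that runs offline and covers locally common diseases. Method: DialoGPT, a lightweight generative dialogue model, is fine-tuned on a synthetically constructed doctor-patient dialogue dataset covering ten diseases common in rural Nepal, and its responses are evaluated for relevance and medical appropriateness. Result: Despite limited, domain-specific training data, the fine-tuned model produces coherent, contextually relevant, and medically appropriate responses, showing understanding of symptoms and disease context as well as empathetic communication. Conclusion: Compact, offline-capable dialogue models combined with targeted domain datasets can be adapted effectively to low-resource healthcare environments, offering a viable path for rural medical conversational AI. Abstract: Conversational agents are increasingly being explored to support healthcare delivery, particularly in resource-constrained settings such as rural Nepal. Large-scale conversational models typically rely on internet connectivity and cloud infrastructure, which may not be accessible in rural areas. In this study, we fine-tuned DialoGPT, a lightweight generative dialogue model that can operate offline, on a synthetically constructed dataset of doctor-patient interactions covering ten common diseases prevalent in rural Nepal, including common cold, seasonal fever, diarrhea, typhoid fever, gastritis, food poisoning, malaria, dengue fever, tuberculosis, and pneumonia. Despite being trained on a limited, domain-specific dataset, the fine-tuned model produced coherent, contextually relevant, and medically appropriate responses, demonstrating an understanding of symptoms, disease context, and empathetic communication. These results highlight the adaptability of compact, offline-capable dialogue models and the effectiveness of targeted datasets for domain adaptation in low-resource healthcare environments, offering promising directions for future rural medical conversational AI.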

[22] Exploring and Mitigating Gender Bias in Encoder-Based Transformer Models

Ariyan Hossain,Khondokar Mohammad Ahanaf Hannan,Rakinul Haque,Nowreen Tarannum Rafa,Humayra Musarrat,Shoaib Ahmed Dipu,Farig Yousuf Sadeque

Main category: cs.CL

TL;DR: This paper studies gender bias in the contextualized word embeddings of encoder-based Transformer models (such as BERT and ALBERT), proposes MALoR, a new metric based on masked-language-model probabilities, and mitigates bias through continued pre-training on a gender-balanced dataset built with Counterfactual Data Augmentation. Experiments show that the approach markedly reduces gender bias across several models without hurting downstream performance.

Details Motivation: Language models inherit gender bias from their training data, which can lead to unfair outcomes in real applications, so the bias needs to be quantified and reduced, especially in high-performing Transformer models. Method: A new metric, MALoR, measures gender bias in masked language models, and Counterfactual Data Augmentation is used to generate a gender-balanced dataset on which the models undergo continued pre-training. Result: In BERT-base, the "he-she" bias score drops from 1.27 to 0.08 and "his-her" from 2.51 to 0.36; in BERT-large, "male-female" bias drops from 1.82 to 0.10; other models improve similarly, and downstream performance is unaffected. Conclusion: MALoR quantifies gender bias effectively, and continued pre-training with Counterfactual Data Augmentation markedly reduces gender bias across several Transformer models while preserving model performance. Abstract: Gender bias in language models has gained increasing attention in the field of natural language processing. Encoder-based transformer models, which have achieved state-of-the-art performance in various language tasks, have been shown to exhibit strong gender biases inherited from their training data. This paper investigates gender bias in contextualized word embeddings, a crucial component of transformer-based models. We focus on prominent architectures such as BERT, ALBERT, RoBERTa, and DistilBERT to examine their vulnerability to gender bias. To quantify the degree of bias, we introduce a novel metric, MALoR, which assesses bias based on model probabilities for filling masked tokens. We further propose a mitigation approach involving continued pre-training on a gender-balanced dataset generated via Counterfactual Data Augmentation. Our experiments reveal significant reductions in gender bias scores across different pronoun pairs. For instance, in BERT-base, bias scores for "he-she" dropped from 1.27 to 0.08, and "his-her" from 2.51 to 0.36 following our mitigation approach. We also observed similar improvements across other models, with "male-female" bias decreasing from 1.82 to 0.10 in BERT-large. Our approach effectively reduces gender bias without compromising model performance on downstream tasks.
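
The abstract does not spell out how MALoR is computed, only that it is based on the probabilities a model assigns when filling masked tokens. As a rough illustration of that general idea, the sketch below scores a word pair by the mean absolute log ratio of the two fill probabilities over a set of templates. The function name `abs_log_ratio_bias` and the formula itself are assumptions made for this example, not the paper's definition, and the probabilities are made-up inputs rather than model outputs.

```python
import math
from typing import List, Tuple


def abs_log_ratio_bias(fill_probs: List[Tuple[float, float]]) -> float:
    """Mean |log(p_a / p_b)| over templates (hypothetical score, not the paper's MALoR).

    Each tuple holds the probabilities a masked language model assigns to the two
    words of a pair (for example "he" and "she") at the same masked position.
    A score of 0 means both words are equally likely in every template.
    """
    if not fill_probs:
        raise ValueError("need at least one template")
    return sum(abs(math.log(p_a / p_b)) for p_a, p_b in fill_probs) / len(fill_probs)


# Made-up probabilities for two templates such as "[MASK] is a doctor."
balanced = abs_log_ratio_bias([(0.30, 0.30), (0.10, 0.10)])
skewed = abs_log_ratio_bias([(0.60, 0.20), (0.20, 0.60)])
```

Taking the absolute value keeps opposite skews in different templates from cancelling out, which is one plausible design choice for a bias magnitude.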

[23] Word Salad Chopper: Reasoning Models Waste A Ton Of Decoding Budget On Useless Repetitions, Self-Knowingly

Wenya Xie,Shaochen Zhong,Hoang Anh Duy Le,Zhaozhuo Xu,Jianwen Xie,Zirui Liu

Main category: cs.CL

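TL;DR: This paper proposes WordSaladChopper (WSC), a lightweight component that detects and cuts meaningless repetitive output ("word salad") in large reasoning models (LRMs). It detects the pattern on the fly from hidden states and follows the cut with a simple regeneration prompt, substantially shortening decoding with almost no loss in quality.

Details Motivation: Large reasoning models often produce long stretches of meaningless self-repetition (word salad) that consume the decoding budget without adding semantic value, making inference costly and inefficient, so an efficient, minimally invasive way to detect and remove this redundancy in real time is needed. Method: A single-layer linear classifier is trained on the hidden states of the <\n\n> tokens to detect word salad behavior in real time; once detected, generation is truncated and a regeneration prompt is appended so that generation can continue, saving output length. Result: The method identifies word salad effectively and yields substantial length savings (fewer output tokens) with minimal effect on final output quality, at low overhead and in a plug-and-play manner. Conclusion: WordSaladChopper is an efficient, lightweight, minimally invasive add-on for LRMs; given its low cost and high payoff, the authors argue it should be a standard module in LRM systems built with user experience in mind. Abstract: Large Reasoning Models (LRMs) are often bottlenecked by the high cost of output tokens. We show that a significant portion of these tokens are useless self-repetitions - what we call "word salad" - that exhaust the decoding budget without adding value. Interestingly, we observe that LRMs are self-aware when trapped in these loops: the hidden states of <\n\n> tokens trailing each reasoning chunk exhibit patterns that allow us to detect word salad behavior on-the-fly via a single-layer linear classifier. Once detected, a simple chop appended by a straightforward regeneration prompt yields substantial length savings with minimal quality loss. Our work offers WordSaladChopper (WSC) - a lightweight, turnkey component for LRM that is minimally invasive to its reasoning trajectory by only removing semantically redundant tokens. Given its low overhead, strong savings, and the lack of semantic value of word salad tokens, we believe it is not too far-fetched to argue that WSC - or a similar component - is a must-have for all LRM applications with user experience in mind. Our code is publicly available at https://github.com/wenyaxie023/WordSaladChopper.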

[24] Multi-refined Feature Enhanced Sentiment Analysis Using Contextual Instruction

Peter Atandoh,Jie Zou,Weikang Guo,Jiwei Wei,Zheng Wang

Main category: cs.CL

TL;DR: This paper proposes CISEA-MRFE, a new sentiment analysis framework built on pre-trained language models that combines contextual instruction, semantic enhancement, and multi-level feature extraction, and that markedly improves classification performance on several benchmark datasets.

Details Motivation: Existing methods perform poorly on subtle emotional cues, domain shifts, and imbalanced sentiment distributions, mainly because of inadequate semantic grounding, weak generalization, and bias toward dominant sentiment classes. Method: The CISEA-MRFE framework combines Contextual Instruction (CI) to guide sentiment disambiguation, Semantic Enhancement Augmentation (SEA) to improve robustness, and Multi-Refined Feature Extraction (MRFE), which comprises SADE for multi-scale feature modeling and EECE for affect-aware sequence modeling. Result: On the four benchmark datasets IMDb, Yelp, Twitter, and Amazon, relative accuracy improves by up to 4.6% (IMDb), 6.5% (Yelp), 30.3% (Twitter), and 4.1% (Amazon). Conclusion: CISEA-MRFE strengthens the semantic understanding and generalization of sentiment analysis and performs well on cross-domain sentiment classification. Abstract: Sentiment analysis using deep learning and pre-trained language models (PLMs) has gained significant traction due to their ability to capture rich contextual representations. However, existing approaches often underperform in scenarios involving nuanced emotional cues, domain shifts, and imbalanced sentiment distributions. We argue that these limitations stem from inadequate semantic grounding, poor generalization to diverse linguistic patterns, and biases toward dominant sentiment classes. To overcome these challenges, we propose CISEA-MRFE, a novel PLM-based framework integrating Contextual Instruction (CI), Semantic Enhancement Augmentation (SEA), and Multi-Refined Feature Extraction (MRFE). CI injects domain-aware directives to guide sentiment disambiguation; SEA improves robustness through sentiment-consistent paraphrastic augmentation; and MRFE combines a Scale-Adaptive Depthwise Encoder (SADE) for multi-scale feature specialization with an Emotion Evaluator Context Encoder (EECE) for affect-aware sequence modeling. Experimental results on four benchmark datasets demonstrate that CISEA-MRFE consistently outperforms strong baselines, achieving relative improvements in accuracy of up to 4.6% on IMDb, 6.5% on Yelp, 30.3% on Twitter, and 4.1% on Amazon. These results validate the effectiveness and generalization ability of our approach for sentiment classification across varied domains.

[25] Friend or Foe: How LLMs' Safety Mind Gets Fooled by Intent Shift Attack

Peng Ding,Jun Kuang,Wen Sun,Zongyu Wang,Xuezhi Cao,Xunliang Cai,Jiajun Chen,Shujian Huang

Main category: cs.CL

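TL;DR: This paper proposes ISA (Intent Shift Attack), which uses a taxonomy of intent transformations to generate natural-language attack prompts that appear harmless. It markedly raises jailbreak success rates against large language models and exposes the shortcomings of existing safety defenses against this kind of attack.

Details Motivation: Despite their capabilities, large language models remain vulnerable to jailbreak attacks. Existing methods mostly distract the model with additional context or adversarial tokens while leaving the core malicious intent unchanged. This paper explores a stealthier, more natural form of attack in order to expose fundamental weaknesses in how models understand intent and to motivate more robust safety mechanisms. Method: ISA builds a taxonomy of intent transformations and applies minimal edits to the original harmful request to produce prompts that read as harmless requests for information. It does not rely on complex tokens or long context; slight modifications are enough to obscure the intent. Result: Experiments on open-source and commercial LLMs show that ISA improves attack success rate by more than 70% over direct harmful prompts; after fine-tuning models only on benign data reformulated with ISA, attack success rates approach 100%. The evaluation shows that existing defenses struggle against ISA. Conclusion: ISA reveals a fundamental safety challenge in LLM intent inference; existing defenses are clearly inadequate, and more effective defenses against natural-language attacks that obscure intent are needed. Abstract: Large language models (LLMs) remain vulnerable to jailbreaking attacks despite their impressive capabilities. Investigating these weaknesses is crucial for robust safety mechanisms. Existing attacks primarily distract LLMs by introducing additional context or adversarial tokens, leaving the core harmful intent unchanged. In this paper, we introduce ISA (Intent Shift Attack), which obscures the intent of the attack from LLMs. More specifically, we establish a taxonomy of intent transformations and leverage them to generate attacks that may be misperceived by LLMs as benign requests for information. Unlike prior methods relying on complex tokens or lengthy context, our approach only needs minimal edits to the original request, and yields natural, human-readable, and seemingly harmless prompts. Extensive experiments on both open-source and commercial LLMs show that ISA achieves over 70% improvement in attack success rate compared to direct harmful prompts. More critically, fine-tuning models on only benign data reformulated with ISA templates elevates success rates to nearly 100%. For defense, we evaluate existing methods and demonstrate their inadequacy against ISA, while exploring both training-free and training-based mitigation strategies. Our findings reveal fundamental challenges in intent inference for LLM safety and underscore the need for more effective defenses. Our code and datasets are available at https://github.com/NJUNLP/ISA.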

[26] FlashEVA: Accelerating LLM inference via Efficient Attention

Juan Gabriel Kostelec,Qinghai Guo

Main category: cs.CL

TL;DR: This paper proposes FlashEVA, an efficient implementation of an attention mechanism based on control variates that substantially raises inference throughput and lowers GPU memory use while preserving model quality.

Details Motivation: Transformer models perform very well in natural language processing, but keeping the full context in memory during inference consumes a great deal of memory and limits practical efficiency. Method: FlashEVA improves the implementation of EVA (Efficient Attention via Control Variates), adds a fine-tuning procedure that adapts Transformer models to FlashEVA attention, and exposes hyperparameters for trading off efficiency against accuracy. Result: Compared with standard Transformer implementations, FlashEVA achieves up to 6.7x higher throughput and 5x lower peak GPU memory during inference, and fine-tuning on as few as 1.5B tokens preserves downstream performance, although performance is limited on retrieval-focused tasks. Conclusion: FlashEVA offers an effective solution for efficient Transformer inference and is a significant step toward more efficient and adaptable models. Abstract: Transformer models have revolutionized natural language processing, achieving state-of-the-art performance and demonstrating remarkable scalability. However, their memory demands, particularly due to maintaining full context in memory, pose significant challenges for inference. In this paper, we present FlashEVA, an efficient implementation of EVA (Efficient Attention via Control Variates), and demonstrate how to finetune transformers to adapt to FlashEVA attention. Our method enables fine-tuning of Transformer models with as few as 1.5B tokens while preserving effectiveness across various downstream tasks. Notably, FlashEVA achieves up to 6.7x higher throughput and 5x lower peak GPU memory usage during inference compared to standard Transformer implementations. Despite these improvements, we observe limitations in retrieval-focused tasks. Our implementation offers control over the trade-off between throughput and accuracy through adjustable hyperparameters, providing flexibility for diverse use cases. This work represents a significant step towards more efficient and adaptable Transformer-based models for inference.

[27] OpenSIR: Open-Ended Self-Improving Reasoner

Wai-Chung Kwan,Joshua Ong Jun Leang,Pavlos Vougiouklis,Jeff Z. Pan,Marco Valentino,Pasquale Minervini

Main category: cs.CL

TL;DR: OpenSIR is a self-play framework that needs no external supervision: by alternating between teacher and student roles, a large language model generates and solves new problems on its own, enabling open-ended mathematical discovery.

Details Motivation: Existing reinforcement-learning approaches to LLM reasoning depend on human-annotated data or external verifiers, which limits a model's ability to surpass human-level performance and makes open-ended learning difficult. Method: In the OpenSIR framework, a single model switches between a teacher role (generating problems) and a student role (solving them) and evolves by optimizing problems for difficulty and diversity, without external supervision. Result: Performance improves substantially on math tasks such as GSM8K and College Math; for example, Gemma-2-2B-Instruct rises from 38.5 to 58.7 on GSM8K, and the model progresses autonomously from simple problems to complex mathematical problem solving. Conclusion: OpenSIR achieves open-ended self-improvement without external supervision, using co-evolving teacher and student roles to drive continued growth in mathematical reasoning. Abstract: Recent advances in large language model (LLM) reasoning through reinforcement learning rely on annotated datasets for verifiable rewards, which may limit models' ability to surpass human-level performance. While self-play offers a promising alternative, existing approaches depend on external verifiers or cannot learn open-endedly. We present Open-Ended Self-Improving Reasoner (OpenSIR), a self-play framework where an LLM learns to generate and solve novel problems by alternating teacher and student roles without external supervision. To generate novel problems, OpenSIR optimises for both difficulty and diversity, rewarding problems that challenge appropriately while exploring distinct concepts, enabling open-ended mathematical discovery. Starting from a single trivial seed problem, OpenSIR substantially improves instruction models: Llama-3.2-3B-Instruct advances from 73.9 to 78.3 on GSM8K, and from 28.8 to 34.4 on College Math, while Gemma-2-2B-Instruct rises from 38.5 to 58.7 on GSM8K. Our analyses reveal that OpenSIR achieves open-ended learning through co-evolving teacher-student roles that adaptively calibrate difficulty and drive diverse exploration, progressing autonomously from basic to advanced mathematics.

[28] SpecDiff-2: Scaling Diffusion Drafter Alignment For Faster Speculative Decoding

Jameson Sandler,Jacob K. Christopher,Thomas Hartvigsen,Nando Fioretto

Main category: cs.CL

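TL;DR: This paper proposes SpecDiff-2, a non-autoregressive speculative decoding framework based on discrete diffusion. It addresses two bottlenecks of existing speculative decoding, the autoregressive dependency during drafting and the frequent rejection of draft tokens, and substantially speeds up LLM inference without loss of accuracy.

Details Motivation: Existing speculative decoding methods are held back by the autoregressive dependency of draft generation and by high rejection rates caused by mismatch between the draft and verify models, which limits the achievable speed-up. Method: SpecDiff-2 uses a discrete diffusion model as a non-autoregressive drafter to increase parallelism and introduces new techniques for calibrating the alignment between discrete diffusion drafters and autoregressive verifiers. Result: Across reasoning, coding, and mathematical benchmarks, SpecDiff-2 improves tokens-per-second by up to an average of 55% over previous baselines and reaches up to a 5.5x average speed-up over standard decoding, with no loss of accuracy. Conclusion: By combining non-autoregressive diffusion drafting with effective calibration, SpecDiff-2 markedly improves LLM inference efficiency and sets a new state of the art. Abstract: Speculative decoding has become the standard approach for accelerating Large Language Model (LLM) inference. It exploits a lossless draft-then-verify procedure to circumvent the latency of autoregressive decoding, achieving impressive speed-ups. Yet, current speculative decoding approaches remain limited by two fundamental bottlenecks: (1) the autoregressive dependency during drafting which limits parallelism, and (2) frequent rejections of draft tokens caused by misalignment between the draft and verify models. This paper proposes SpecDiff-2, a novel framework to jointly address these two bottlenecks. It leverages discrete diffusion as a non-autoregressive drafter to address bottleneck (1) and develops novel techniques to calibrate discrete diffusion drafters with autoregressive verifiers, addressing bottleneck (2). Experimental results across a comprehensive benchmark suite show that SpecDiff-2 achieves a new state-of-the-art across reasoning, coding, and mathematical benchmarks, improving tokens-per-second by up to an average of +55% over previous baselines and obtaining up to 5.5x average speed-up over standard decoding, without any loss of accuracy.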
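
To make the draft-then-verify step and the notion of a "rejected draft token" concrete, the sketch below shows the standard speculative sampling acceptance rule for a single token. It illustrates generic speculative decoding, not SpecDiff-2's diffusion drafter or its alignment techniques; the function names and the toy distributions are assumptions made for this example.

```python
import random
from typing import List, Tuple


def accept_prob(p_target: float, q_draft: float) -> float:
    """Probability of keeping a draft token: min(1, p(x) / q(x))."""
    return min(1.0, p_target / q_draft)


def residual_distribution(p: List[float], q: List[float]) -> List[float]:
    """Distribution to resample from after a rejection: normalized max(0, p - q)."""
    resid = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(resid)
    if total == 0.0:  # p and q are identical, so fall back to p itself
        return list(p)
    return [r / total for r in resid]


def verify_token(token: int, p: List[float], q: List[float],
                 rng: random.Random) -> Tuple[int, bool]:
    """Verify one draft token; returns (final_token, was_accepted).

    `p` is the verifier (target) distribution and `q` the drafter distribution
    over the same toy vocabulary; `token` is assumed to have been sampled from `q`.
    """
    if rng.random() < accept_prob(p[token], q[token]):
        return token, True
    replacement = rng.choices(range(len(p)), weights=residual_distribution(p, q))[0]
    return replacement, False


rng = random.Random(0)
same = verify_token(1, [0.3, 0.7], [0.3, 0.7], rng)  # drafter matches verifier exactly
```

The closer the drafter distribution `q` is to the verifier distribution `p`, the more often `accept_prob` is near 1, which is why drafter-verifier misalignment shows up as frequent rejections.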

[29] Certain but not Probable? Differentiating Certainty from Probability in LLM Token Outputs for Probabilistic Scenarios

Autumn Toney-Wails,Ryan Wails

Main category: cs.CL

TL;DR: This study examines whether token-level confidence in large language models (GPT-4.1 and DeepSeek-Chat) agrees with theoretical probability distributions in probabilistic scenarios, and finds that although the models' outputs are correct, their token-level probabilities and entropy values consistently deviate from the theoretical distributions.

Details Motivation: Reliable uncertainty quantification is essential for the trustworthy use of large language models in decision support and other knowledge-intensive applications, yet existing confidence estimates based on token logits may fall short on probability alignment. Method: Ten prompts involving probability (for example, rolling a six-sided die) are posed to GPT-4.1 and DeepSeek-Chat with and without explicit probability cues, and the responses are evaluated for validity and for how well token-level output probabilities align with the theoretical distribution. Result: Both models achieve perfect response accuracy on all prompts, but their token-level probabilities and entropy values consistently deviate from the theoretical probability distributions. Conclusion: Correct output alone does not reflect a model's true uncertainty; current logit-based UQ methods are limited in probability alignment and need further improvement to provide reliable uncertainty quantification. Abstract: Reliable uncertainty quantification (UQ) is essential for ensuring trustworthy downstream use of large language models, especially when they are deployed in decision-support and other knowledge-intensive applications. Model certainty can be estimated from token logits, with derived probability and entropy values offering insight into performance on the prompt task. However, this approach may be inadequate for probabilistic scenarios, where the probabilities of token outputs are expected to align with the theoretical probabilities of the possible outcomes. We investigate the relationship between token certainty and alignment with theoretical probability distributions in well-defined probabilistic scenarios. Using GPT-4.1 and DeepSeek-Chat, we evaluate model responses to ten prompts involving probability (e.g., roll a six-sided die), both with and without explicit probability cues in the prompt (e.g., roll a fair six-sided die). We measure two dimensions: (1) response validity with respect to scenario constraints, and (2) alignment between token-level output probabilities and theoretical probabilities. Our results indicate that, while both models achieve perfect in-domain response accuracy across all prompt scenarios, their token-level probability and entropy values consistently diverge from the corresponding theoretical distributions.
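
One way to picture the gap between a valid answer and a well-aligned token distribution is to compare a model's probabilities over the six die faces with the uniform distribution. The sketch below does this with Shannon entropy and KL divergence. The abstract does not say which alignment measure the authors use, so KL divergence is an assumption here, and the `peaked` distribution is made-up data rather than a reported result.

```python
import math
from typing import List


def entropy_bits(probs: List[float]) -> float:
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def kl_divergence_bits(p: List[float], q: List[float]) -> float:
    """KL(p || q) in bits; assumes q is positive wherever p is positive."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


fair_die = [1 / 6] * 6                          # theoretical distribution
peaked = [0.05, 0.05, 0.60, 0.20, 0.05, 0.05]   # hypothetical token probabilities

theoretical_entropy = entropy_bits(fair_die)    # log2(6), about 2.585 bits
observed_entropy = entropy_bits(peaked)
divergence = kl_divergence_bits(peaked, fair_die)
```

Every face in `peaked` is a valid die roll, so a sampled answer would pass a validity check, yet the entropy is well below log2(6) and the divergence from the fair die is positive.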

[30] Modeling the Construction of a Literary Archetype: The Case of the Detective Figure in French Literature

Jean Barré,Olga Seminck,Antoine Bourgois,Thierry Poibeau

Main category: cs.CL

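TL;DR: This study uses computational analysis to trace the evolution of the detective archetype in French detective fiction over 150 years, showing how the detective moves from a secondary role to the center of the reasoning and becomes more complex after the Second World War under the influence of the hardboiled tradition.

Details Motivation: To investigate how the detective character in French detective fiction has evolved historically and how its literary function has changed. Method: Quantitative methods and character-level embeddings are used, with a supervised model applied to texts spanning 150 years. Result: The model captures both the unity and the evolutionary trajectory of the detective archetype from M. Lecoq (1866) to Commissaire Adamsberg (2017). Conclusion: The detective archetype evolves from a "reasoning machine" into a more complex character with greater social depth, reflecting the genre's turn toward moral ambiguity and social violence. Abstract: This research explores the evolution of the detective archetype in French detective fiction through computational analysis. Using quantitative methods and character-level embeddings, we show that a supervised model is able to capture the unity of the detective archetype across 150 years of literature, from M. Lecoq (1866) to Commissaire Adamsberg (2017). Building on this finding, the study demonstrates how the detective figure evolves from a secondary narrative role to become the central character and the "reasoning machine" of the classical detective story. In the aftermath of the Second World War, with the importation of the hardboiled tradition into France, the archetype becomes more complex, navigating the genre's turn toward social violence and moral ambiguity.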

[31] Do You Know About My Nation? Investigating Multilingual Language Models' Cultural Literacy Through Factual Knowledge

Eshaan Tanwar,Anwoy Chatterjee,Michael Saxon,Alon Albalak,William Yang Wang,Tanmoy Chakraborty

Main category: cs.CL

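TL;DR: XNationQA is a new benchmark for evaluating the cultural literacy of multilingual large models. It covers the geography, culture, and history of nine countries with 49,280 questions in seven languages. The study finds that models perform better in Western languages, yet are not more knowledgeable about Western countries, and that their ability to transfer knowledge across languages is limited.

Details Motivation: Existing multilingual question-answering benchmarks tend to be Western-centric and do not account for regional diversity, so they cannot fairly evaluate how well multilingual models understand facts about non-Western cultures. Method: The XNationQA dataset is built with 49,280 questions covering nine countries in seven languages, and eight mainstream multilingual LLMs are evaluated with two new cross-lingual transference metrics. Result: Models perform better in Western languages, but often know more about a culture in English than in that culture's own dominant language; they are not more knowledgeable about Western countries than about non-Western ones; and cross-lingual knowledge transfer is especially weak in open-source models. Conclusion: Current multilingual LLMs have unevenly distributed cultural knowledge and struggle to transfer knowledge across languages, so training data and model architectures need to improve for fairer cultural understanding. Abstract: Most multilingual question-answering benchmarks, while covering a diverse pool of languages, do not factor in regional diversity in the information they capture and tend to be Western-centric. This introduces a significant gap in fairly evaluating multilingual models' comprehension of factual information from diverse geographical locations. To address this, we introduce XNationQA for investigating the cultural literacy of multilingual LLMs. XNationQA encompasses a total of 49,280 questions on the geography, culture, and history of nine countries, presented in seven languages. We benchmark eight standard multilingual LLMs on XNationQA and evaluate them using two novel transference metrics. Our analyses uncover a considerable discrepancy in the models' accessibility to culturally specific facts across languages. Notably, we often find that a model demonstrates greater knowledge of cultural information in English than in the dominant language of the respective culture. The models exhibit better performance in Western languages, although this does not necessarily translate to being more literate for Western countries, which is counterintuitive. Furthermore, we observe that models have a very limited ability to transfer knowledge across languages, particularly evident in open-source models.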

[32] Do Methods to Jailbreak and Defend LLMs Generalize Across Languages?

Berk Atil,Rebecca J. Passonneau,Fred Morstatter

Main category: cs.CL

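TL;DR: This paper presents the first systematic evaluation of jailbreak attacks and defenses for large language models across ten languages. Attack success and defense robustness vary by language: high-resource languages are safer under standard queries but more vulnerable under adversarial ones. The results call for language-aware and cross-lingual safety benchmarks.

Details Motivation: Although jailbreak attacks and defenses for LLMs have been studied extensively, how well they generalize across languages remains underexplored, so safety in multilingual settings needs systematic evaluation. Method: Six LLMs are evaluated on HarmBench and AdvBench with two jailbreak types, logical-expression-based and adversarial-prompt-based, across ten languages spanning high-, medium-, and low-resource settings. Result: Attack success and defense robustness vary by language; high-resource languages are safer under standard queries but more vulnerable under adversarial queries; simple defenses can be effective, but their effect depends on the language and the model. Conclusion: Safety benchmarks that account for language differences and provide cross-lingual coverage should be developed to improve the safety alignment of LLMs worldwide. Abstract: Large language models (LLMs) undergo safety alignment after training and tuning, yet recent work shows that safety can be bypassed through jailbreak attacks. While many jailbreaks and defenses exist, their cross-lingual generalization remains underexplored. This paper presents the first systematic multilingual evaluation of jailbreaks and defenses across ten languages, spanning high-, medium-, and low-resource languages, using six LLMs on HarmBench and AdvBench. We assess two jailbreak types: logical-expression-based and adversarial-prompt-based. For both types, attack success and defense robustness vary across languages: high-resource languages are safer under standard queries but more vulnerable to adversarial ones. Simple defenses can be effective, but are language- and model-dependent. These findings call for language-aware and cross-lingual safety benchmarks for LLMs.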

[33] Optimizing Native Sparse Attention with Latent Attention and Local Global Alternating Strategies

Yuxuan Hu,Jianchao Tan,Jiaqi Zhang,Wen Zan,Pingwei Sun,Yifan Lu,Yerui Sun,Yuchen Xie,Xunliang Cai,Jing Zhang

Main category: cs.CL

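TL;DR: This paper proposes an improved Native Sparse Attention mechanism that alternates local and global attention across layers and adds latent attention, reducing the KV cache while improving long-sequence modeling.

Details Motivation: Native Sparse Attention propagates information insufficiently when handling long-range dependencies, and its KV-cache overhead is large, which limits its efficiency and effectiveness on long-sequence tasks. Method: Local (sliding-window) and global (compression, selective) attention are alternated across layers, and the branches are refined with Multi-head Latent Attention (MLA) and Group-head Latent Attention (GLA). Result: Compared with Native Sparse Attention, the KV cache shrinks by 50% and performance on common-sense reasoning and long-text understanding improves markedly; on models from 340M to 1.3B parameters, the method matches or exceeds both full attention and Native Sparse Attention. Conclusion: An alternating sparse-attention structure combined with latent attention balances efficiency and performance and is an efficient solution for long-context modeling. Abstract: In this work, we conduct a systematic analysis of Native Sparse Attention (NSA) and propose targeted improvements that enhance long-context modeling. A key insight is that alternating between local (sliding-window) and global (compression, selective) attention across layers, rather than using fixed patterns, enables more effective propagation of long-range dependencies and substantially boosts performance on long-sequence tasks. Meanwhile, we further refine NSA's branches with Latent Attention: the sliding-window branch is enhanced with Multi-head Latent Attention (MLA), while the compression and selective branches adopt Group-head Latent Attention (GLA). These changes reduce KV-cache memory by 50% versus NSA while improving the model's common-sense reasoning and long-text understanding capabilities. Experiments on models from 340M to 1.3B parameters (trained on 15B and 100B tokens) show our method matches or exceeds full attention and native sparse attention in both common-sense reasoning and long-context understanding tasks.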

[34] TriCon-Fair: Triplet Contrastive Learning for Mitigating Social Bias in Pre-trained Language Models

Chong Lyu,Lin Li,Shiqing Wu,Jingling Yuan

Main category: cs.CL

TL;DR: This paper proposes TriCon-Fair, a contrastive learning framework whose decoupled loss (combining triplet and language modeling terms) removes the coupling between positive and negative samples, reducing the propagation of social bias in large language models while preserving downstream performance.

Details Motivation: Existing debiasing methods treat biased and unbiased samples independently and ignore the relationship between them, which causes positive-negative coupling: improvements for one group can harm the other, so residual bias is hard to eliminate. Method: The TriCon-Fair framework uses a decoupled loss that explicitly assigns each anchor a biased negative and an unbiased positive, separating the push and pull dynamics, and jointly optimizes a language modeling objective to preserve general capability. Result: Experiments show that TriCon-Fair reduces discriminatory output more than existing debiasing baselines while maintaining strong downstream performance. Conclusion: TriCon-Fair offers a practical and ethical solution for sensitive natural language processing applications and effectively mitigates social bias in large models. Abstract: The increasing utilization of large language models raises significant concerns about the propagation of social biases, which may result in harmful and unfair outcomes. However, existing debiasing methods treat the biased and unbiased samples independently, thus ignoring their mutual relationship. This oversight enables a hidden negative-positive coupling, where improvements for one group inadvertently compromise the other, allowing residual social bias to persist. In this paper, we introduce TriCon-Fair, a contrastive learning framework that employs a decoupled loss that combines triplet and language modeling terms to eliminate positive-negative coupling. Our TriCon-Fair assigns each anchor an explicitly biased negative and an unbiased positive, decoupling the push-pull dynamics and avoiding positive-negative coupling, and jointly optimizes a language modeling (LM) objective to preserve general capability. Experimental results demonstrate that TriCon-Fair reduces discriminatory output beyond existing debiasing baselines while maintaining strong downstream performance. This suggests that our proposed TriCon-Fair offers a practical and ethical solution for sensitive NLP applications.
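
The anchor, positive, and negative roles described in the abstract can be pictured with a textbook triplet margin loss added to a language modeling term. The sketch below shows only that generic combination: the paper's decoupled formulation is not specified in the abstract and is not reproduced here, and `triplet_loss`, `combined_loss`, the margin, and the weighting are all assumptions of this example.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor: Sequence[float], positive: Sequence[float],
                 negative: Sequence[float], margin: float = 1.0) -> float:
    """Standard triplet margin loss: pull the positive closer, push the negative away."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def combined_loss(triplet: float, lm: float, lm_weight: float = 1.0) -> float:
    """Weighted sum of a triplet term and a language modeling term (illustrative only)."""
    return triplet + lm_weight * lm


# Toy 2-D embeddings: the unbiased positive sits on the anchor, the biased negative is far away.
well_separated = triplet_loss([0.0, 0.0], [0.0, 0.0], [3.0, 4.0])
# Roles swapped: the biased sample is close and the unbiased one is far, so the loss is large.
badly_separated = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 0.0])
total = combined_loss(badly_separated, lm=2.0, lm_weight=0.5)
```

The language modeling term is what keeps the encoder from drifting away from general language ability while the triplet term reshapes the embedding space.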

[35] Assessing LLM Reasoning Steps via Principal Knowledge Grounding

Hyeon Hwang,Yewon Cho,Chanwoong Yoon,Yein Park,Minju Song,Kyungjae Lee,Gangwoo Kim,Jaewoo Kang

Main category: cs.CL

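TL;DR: This paper proposes a new evaluation framework that systematically assesses the knowledge grounding of intermediate reasoning steps in large language models. It consists of a knowledge collection, evaluation metrics, and a lightweight evaluator model, and the paper shows its use for uncovering reasoning deficiencies and for preference optimization.

Details Motivation: To verify whether the reasoning of large language models is actually grounded in accurate knowledge, addressing the concern that existing reasoning approaches may lack knowledge support. Method: The evaluation framework comprises a repository of atomic knowledge, knowledge-grounded evaluation metrics, and a lightweight evaluator LLM, and measures how well a model recalls and applies prerequisite knowledge during reasoning. Result: The evaluation suite effectively identifies missing or misapplied knowledge elements, reveals reasoning deficiencies in LLMs, and can be integrated into preference optimization. Conclusion: The proposed knowledge-grounded evaluation framework assesses the knowledge basis of LLM reasoning reliably and at low cost, offering a new route to improving model reasoning. Abstract: Step-by-step reasoning has become a standard approach for large language models (LLMs) to tackle complex tasks. While this paradigm has proven effective, it raises a fundamental question: How can we verify that an LLM's reasoning is accurately grounded in knowledge? To address this question, we introduce a novel evaluation suite that systematically assesses the knowledge grounding of intermediate reasoning. Our framework comprises three key components. (1) Principal Knowledge Collection, a large-scale repository of atomic knowledge essential for reasoning. Based on the collection, we propose (2) knowledge-grounded evaluation metrics designed to measure how well models recall and apply prerequisite knowledge in reasoning. These metrics are computed by our (3) evaluator LLM, a lightweight model optimized for cost-effective and reliable metric computation. Our evaluation suite demonstrates remarkable effectiveness in identifying missing or misapplied knowledge elements, providing crucial insights for uncovering fundamental reasoning deficiencies in LLMs. Beyond evaluation, we demonstrate how these metrics can be integrated into preference optimization, showcasing further applications of knowledge-grounded evaluation.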

[36] ColMate: Contrastive Late Interaction and Masked Text for Multimodal Document Retrieval

Ahmed Masry,Megh Thakkar,Patrice Bechard,Sathwik Tejaswi Madhusudhan,Rabiul Awal,Shambhavi Mishra,Akshay Kalkunte Suresh,Srivatsava Daruru,Enamul Hoque,Spandana Gella,Torsten Scholak,Sai Rajeswar

Main category: cs.CL

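TL;DR: ColMate is a new model for multimodal document retrieval. Using OCR-based pretraining, self-supervised contrastive learning, and a late interaction mechanism, it improves on existing models by 3.61% on the ViDoRe V2 benchmark and generalizes better across domains.

Details Motivation: Existing multimodal document retrieval methods mostly reuse techniques from text-only retrieval and do not adequately account for the structure and visual characteristics of multimodal documents, which limits retrieval performance. Method: The ColMate model uses an OCR-based pretraining objective, a self-supervised masked contrastive learning objective, and a late interaction scoring mechanism suited to multimodal document structure. Result: ColMate improves on existing models by 3.61% on the ViDoRe V2 benchmark and shows stronger generalization on out-of-domain benchmarks. Conclusion: ColMate effectively combines multimodal representation learning with document retrieval, markedly improving the performance and adaptability of multimodal document retrieval. Abstract: Retrieval-augmented generation has proven practical when models require specialized knowledge or access to the latest data. However, existing methods for multimodal document retrieval often replicate techniques developed for text-only retrieval, whether in how they encode documents, define training objectives, or compute similarity scores. To address these limitations, we present ColMate, a document retrieval model that bridges the gap between multimodal representation learning and document retrieval. ColMate utilizes a novel OCR-based pretraining objective, a self-supervised masked contrastive learning objective, and a late interaction scoring mechanism more relevant to multimodal document structures and visual characteristics. ColMate obtains a 3.61% improvement over existing retrieval models on the ViDoRe V2 benchmark, demonstrating stronger generalization to out-of-domain benchmarks.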
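
For readers unfamiliar with late interaction, the sketch below shows the common ColBERT-style MaxSim score: each query token embedding is matched to its best document embedding, and the maxima are summed. ColMate's own scoring mechanism is described only as "more relevant to multimodal document structures", so this is the generic baseline rather than the paper's variant, and the vectors are toy values.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two embedding vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs: List[Sequence[float]],
                 doc_vecs: List[Sequence[float]]) -> float:
    """ColBERT-style late interaction: sum over query tokens of the best document match."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy 2-D embeddings: two query tokens, two document tokens (or image patches).
query = [[1.0, 0.0], [0.0, 1.0]]
document = [[1.0, 0.0], [0.5, 0.5]]
score = maxsim_score(query, document)  # 1.0 (first token) + 0.5 (second token)
```

Because the interaction happens at the token level and only at scoring time, document embeddings can be computed and indexed ahead of the query.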

[37] The Biased Oracle: Assessing LLMs' Understandability and Empathy in Medical Diagnoses

Jianzhou Yao,Shunchang Liu,Guillaume Drui,Rikard Pettersson,Alessandro Blasimme,Sara Kijewski

Main category: cs.CL

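TL;DR: This study evaluates how well two leading large language models (LLMs) generate understandable and empathetic explanations for patients in medical diagnostic scenarios. The models adapt content to patient characteristics, but their output is often overly complex and shows biased affective empathy, which can make communication unequal and calls for systematic calibration to improve fairness.

Details Motivation: To examine the ability of large language models to generate patient explanations that are both understandable and empathetic in clinical diagnostic communication, and to identify potential bias and accessibility problems. Method: Understandability is assessed with readability metrics, and empathy is measured with an LLM-as-a-Judge approach compared against human evaluation, analyzing model behavior across socio-demographic variables and patient conditions. Result: The LLMs adapt explanations to patient characteristics but often generate overly complex content and show biased affective empathy toward particular groups, which affects the accessibility and fairness of communication. Conclusion: Large language models show promise for medical communication but need systematic calibration to remove excess complexity and bias and to provide fair, accessible support for all patients. Abstract: Large language models (LLMs) show promise for supporting clinicians in diagnostic communication by generating explanations and guidance for patients. Yet their ability to produce outputs that are both understandable and empathetic remains uncertain. We evaluate two leading LLMs on medical diagnostic scenarios, assessing understandability using readability metrics as a proxy and empathy through LLM-as-a-Judge ratings compared to human evaluations. The results indicate that LLMs adapt explanations to socio-demographic variables and patient conditions. However, they also generate overly complex content and display biased affective empathy, leading to uneven accessibility and support. These patterns underscore the need for systematic calibration to ensure equitable patient communication. The code and data are released: https://github.com/Jeffateth/Biased_Oracle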

[38] The Riddle of Reflection: Evaluating Reasoning and Self-Awareness in Multilingual LLMs using Indian Riddles

Abhinav P M,Ojasva Saxena,Oswald C,Parameswari Krishnamurthy

Main category: cs.CL

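TL;DR: This paper studies the culturally grounded reasoning of large language models in seven major Indian languages. It introduces a multilingual riddle dataset and evaluates five models under different prompting strategies. Gemini 2.5 Pro performs best overall, but few-shot methods bring only limited gains and accuracy differs markedly across languages. In addition, a model's initial accuracy is inversely correlated with its ability to correct itself: high-performing models tend to be overconfident, while lower-performing models are more self-aware.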

Details Motivation: To explore how well LLMs perform culturally grounded reasoning in non-English languages; in particular, the consistency between a model's reasoning and its self-assessment in multilingual settings has not been studied sufficiently. Method: Builds a multilingual riddle dataset containing traditional riddles and context-reconstructed variants, and evaluates five LLMs (Gemini 2.5 Pro, Gemini 2.5 Flash, Mistral-Saba, LLaMA 4 Scout, LLaMA 4 Maverick) under seven prompting strategies in two stages: the first measures riddle-solving accuracy, and the second measures each model's ability to recognize its own errors. Result: Gemini 2.5 Pro performs best overall, but few-shot prompting yields limited gains and performance differs clearly across languages. In the self-evaluation stage, the higher a model's initial accuracy, the weaker its ability to identify its own mistakes: high-accuracy models are more overconfident (True Negative Rate of only 4.34%), while lower-accuracy models such as LLaMA 4 Scout are more self-aware (True Negative Rate of 42.09%). Conclusion: Current multilingual reasoning models have clear shortcomings: performance is uneven across languages, and high-performing models lack awareness of their own errors. Future models need to reason effectively and also recognize their own limitations. Abstract: The extent to which large language models (LLMs) can perform culturally grounded reasoning across non-English languages remains underexplored. This paper examines the reasoning and self-assessment abilities of LLMs across seven major Indian languages: Bengali, Gujarati, Hindi, Kannada, Malayalam, Tamil, and Telugu. We introduce a multilingual riddle dataset combining traditional riddles with context-reconstructed variants and evaluate five LLMs (Gemini 2.5 Pro, Gemini 2.5 Flash, Mistral-Saba, LLaMA 4 Scout, and LLaMA 4 Maverick) under seven prompting strategies. In the first stage, we assess riddle-solving performance and find that while Gemini 2.5 Pro performs best overall, few-shot methods yield only marginal gains, and accuracy varies notably across languages. In the second stage, we conduct a self-evaluation experiment to measure reasoning consistency. The results reveal a key finding: a model's initial accuracy is inversely correlated with its ability to identify its own mistakes. Top-performing models such as Gemini 2.5 Pro are overconfident (4.34% True Negative Rate), whereas lower-performing models like LLaMA 4 Scout are substantially more self-aware (42.09% True Negative Rate). These results point to clear gaps in multilingual reasoning and highlight the need for models that not only reason effectively but also recognize their own limitations.
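The abstract reports a True Negative Rate for the self-evaluation stage without defining it. One plausible reading, assumed here, is the share of a model's incorrect answers that the model itself flags as incorrect when asked to review them. The function and argument names below are hypothetical.

```python
def true_negative_rate(was_correct, self_judged_correct):
    """Share of wrong answers that the model itself judged to be wrong.

    was_correct[i]         -- True if answer i matched the gold answer
    self_judged_correct[i] -- True if the model, on review, claimed answer i was right
    This definition is an assumption; the paper may compute the rate differently.
    """
    judgments_on_wrong = [
        judged for ok, judged in zip(was_correct, self_judged_correct) if not ok
    ]
    if not judgments_on_wrong:
        return float("nan")  # undefined when the model made no mistakes
    caught = sum(1 for judged in judgments_on_wrong if not judged)
    return caught / len(judgments_on_wrong)
```

Under this reading, a low rate means the model defends most of its wrong answers, which matches the overconfidence the abstract describes.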

[39] Advancing Machine-Generated Text Detection from an Easy to Hard Supervision Perspective

Chenwang Wu,Yiu-ming Cheung,Bo Han,Defu Lian

Main category: cs.CL

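TL;DR: Proposes an easy-to-hard enhancement framework that provides reliable supervision under inexact labels and improves machine-generated text detection.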

Details Motivation: Existing machine-generated text detection methods assume labels are the "golden standard", yet class boundaries are ambiguous, and the limits of human cognition together with the superintelligence of detectors make inexact learning widespread. Method: An easy-to-hard enhancement framework uses an easy supervisor that targets longer texts (weaker in capability but more stable) to enhance the target detector; by structurally incorporating the detector into the supervisor, the supervisor is modeled as a lower performance bound for the detector, so optimizing it indirectly optimizes the detector. Result: Experiments across practical scenarios, including cross-LLM, cross-domain, mixed text, and paraphrase attacks, show that the framework significantly improves detection performance. Conclusion: The framework effectively addresses inexact learning in machine-generated text detection; through theoretical modeling and structural design it approximates the underlying "golden" labels and provides more reliable supervision. Abstract: Existing machine-generated text (MGT) detection methods implicitly assume labels as the "golden standard". However, we reveal boundary ambiguity in MGT detection, implying that traditional training paradigms are inexact. Moreover, limitations of human cognition and the superintelligence of detectors make inexact learning widespread and inevitable. To this end, we propose an easy-to-hard enhancement framework to provide reliable supervision under such inexact conditions. Distinct from knowledge distillation, our framework employs an easy supervisor targeting relatively simple longer-text detection tasks (despite weaker capabilities), to enhance the more challenging target detector. Firstly, longer texts targeted by supervisors theoretically alleviate the impact of inexact labels, laying the foundation for reliable supervision. Secondly, by structurally incorporating the detector into the supervisor, we theoretically model the supervisor as a lower performance bound for the detector. Thus, optimizing the supervisor indirectly optimizes the detector, ultimately approximating the underlying "golden" labels. Extensive experiments across diverse practical scenarios, including cross-LLM, cross-domain, mixed text, and paraphrase attacks, demonstrate the framework's significant detection effectiveness. The code is available at: https://github.com/tmlr-group/Easy2Hard.

[40] MARS-SQL: A multi-agent reinforcement learning framework for Text-to-SQL

Haolin Yang,Jipeng Zhang,Zhitao He,Yi R. Fung

Main category: cs.CL

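TL;DR: MARS-SQL is a multi-agent framework that combines task decomposition with interactive reinforcement learning to improve the accuracy of natural-language-to-SQL translation, and it performs especially well on complex queries.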

Details Motivation: Complex natural-language-to-SQL tasks usually require environmental interaction and self-correction, which existing methods handle poorly. Method: Proposes the MARS-SQL framework with three specialized agents: a Grounding Agent for schema linking, a Generation Agent for query generation (trained with multi-turn reinforcement learning and a Think-Act-Observe loop), and a Validation Agent for final selection. At inference time, multiple interaction trajectories are generated and the Validation Agent selects the best one based on generation probability. Result: Achieves 77.84% execution accuracy on the BIRD dev set and 89.75% on the Spider test set, both state of the art. Conclusion: By combining interactive reinforcement learning with generative verification, MARS-SQL significantly improves the accuracy and robustness of complex SQL query generation. Abstract: Translating natural language to SQL remains difficult for complex queries. Such queries often need environmental interaction and self-correction. To address this, we introduce MARS-SQL, a novel multi-agent framework that combines principled task decomposition and interactive reinforcement learning (RL). Our system comprises three specialized agents: a Grounding Agent for schema linking, a Generation Agent for query generation, and a Validation Agent for final selection. The core of our framework is the Generation Agent, which is trained via a multi-turn RL policy. Adopting a ReAct-style Think-Act-Observe loop, the agent iteratively generates thoughts, executes SQL actions against a live database, and revises its strategy based on execution feedback, enabling dynamic, stateful reasoning and self-correction. At inference time, we generate multiple interaction trajectories to explore diverse reasoning paths. The Validation Agent then selects the optimal trajectory by modeling verification as a next-token prediction task and choosing the solution with the highest generation probability. This structured workflow pipelines specialized agents. It combines interactive RL for generation with generative modeling for verification. The approach proves highly effective for robust and accurate SQL generation. Experiments show that MARS-SQL achieves state-of-the-art Execution Accuracy of 77.84% on the BIRD dev set and 89.75% on the Spider test set. Our code is available at https://github.com/YangHaolin0526/MARS-SQL.
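The selection step described in the abstract can be sketched roughly as follows, assuming each candidate trajectory has already been scored by a verifier with the log-probability of a positive verdict token. The `Trajectory` class and its `verdict_logprob` field are hypothetical names for illustration; the abstract does not say how MARS-SQL represents or scores trajectories internally.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    sql: str                # final SQL produced by one interaction trajectory
    verdict_logprob: float  # hypothetical: log P(positive verdict | question, trajectory)


def select_trajectory(candidates):
    """Return the candidate whose verifier score is highest (best-of-N selection)."""
    if not candidates:
        raise ValueError("no candidate trajectories to choose from")
    return max(candidates, key=lambda t: t.verdict_logprob)


if __name__ == "__main__":
    pool = [
        Trajectory("SELECT name FROM users", -1.20),
        Trajectory("SELECT name FROM users WHERE active = 1", -0.15),
        Trajectory("SELECT * FROM users", -2.40),
    ]
    print(select_trajectory(pool).sql)
```

Scoring with a single verdict token keeps verification cheap, since it needs one forward pass per candidate rather than another round of generation.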

[41] IF-CRITIC: Towards a Fine-Grained LLM Critic for Instruction-Following Evaluation

Bosi Wen,Yilin Niu,Cunxiang Wang,Pei Ke,Xiaoying Ling,Ying Zhang,Aohan Zeng,Hongning Wang,Minlie Huang

Main category: cs.CL

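TL;DR: This paper proposes IF-CRITIC, an efficient LLM critic for evaluating instruction following. It is trained with generated constraint checklists and a multi-stage filtering mechanism, and it markedly improves evaluation efficiency and reliability.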

Details Motivation: Existing LLM-as-a-Judge methods for evaluating instruction following are costly and unreliable, so a more efficient and accurate evaluation approach is needed. Method: A checklist generator decomposes instructions and produces constraint checklists; a multi-stage critique filtering mechanism collects high-quality training data; and IF-CRITIC is trained with a constraint-level preference optimization method. Result: Experiments show that IF-CRITIC outperforms strong baselines such as Deepseek-R1 and o4-mini in evaluation performance, and that it provides scalable reward signals at lower computational cost, yielding substantial gains in instruction-following optimization. Conclusion: IF-CRITIC evaluates LLM instruction following efficiently and reliably, providing a low-cost, high-performance evaluation mechanism for subsequent model optimization. Abstract: Instruction following is a fundamental ability of Large Language Models (LLMs), requiring their generated outputs to follow multiple constraints imposed in input instructions. Numerous studies have attempted to enhance this ability through preference optimization or reinforcement learning based on reward signals from LLM-as-a-Judge. However, existing evaluation models for instruction following still possess many deficiencies, such as substantial costs and unreliable assessments. To this end, we propose IF-CRITIC, an LLM critic that can provide efficient and reliable assessments of constraint following in the instructions. We first develop a checklist generator to decompose instructions and generate constraint checklists. With the assistance of the checklists, we collect high-quality critique training data through a multi-stage critique filtering mechanism and employ a constraint-level preference optimization method to train IF-CRITIC. Extensive experiments demonstrate that the evaluation performance of IF-CRITIC can beat strong LLM-as-a-Judge baselines, including Deepseek-R1 and o4-mini. With the scalable reward signals provided by IF-CRITIC, LLMs can achieve substantial performance gains in instruction-following optimization under lower computational overhead compared to strong LLM critic baselines.

[42] Prompt-R1: Collaborative Automatic Prompting Framework via End-to-end Reinforcement Learning

Wenjin Liu,Haoran Luo,Xueyuan Lin,Haoming Liu,Tiesunlong Shen,Jiapu Wang,Rui Mao,Erik Cambria

Main category: cs.CL

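TL;DR: Proposes Prompt-R1, an end-to-end reinforcement learning framework in which a small-scale language model collaborates with large-scale language models through multi-turn prompt interaction to improve problem solving.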

Details Motivation: Users facing complex problems often cannot provide effective prompts, which limits the performance of large models, so an automated prompting mechanism is needed. Method: A small-scale LLM generates prompts and a large-scale LLM performs the reasoning; a dual-constrained reward optimizes for correctness, generation quality, and reasoning accuracy. Result: Significantly outperforms baseline models on multiple public datasets and supports inference and training with a variety of large models. Conclusion: Prompt-R1 offers an effective plug-and-play framework that improves the performance of large models on complex tasks. Abstract: Recently, advanced large language models (LLMs) have emerged at an increasingly rapid pace. However, when faced with complex problems, most users are often unable to provide accurate and effective prompts to interact with LLMs, thus limiting the performance of LLMs. To address this challenge, we propose Prompt-R1, an end-to-end reinforcement learning framework that uses a small-scale LLM to collaborate with large-scale LLMs, replacing user interaction to solve problems better. This collaboration is cast as a multi-turn prompt interaction, where the small-scale LLM thinks and generates prompts, and the large-scale LLM performs complex reasoning. A dual-constrained reward is designed to optimize for correctness, generation quality, and reasoning accuracy. Prompt-R1 provides a plug-and-play framework that supports both inference and training with various large-scale LLMs. Experiments on multiple public datasets show that Prompt-R1 significantly outperforms baseline models across tasks. Our code is publicly available at https://github.com/QwenQKing/Prompt-R1.

[43] OceanAI: A Conversational Platform for Accurate, Transparent, Near-Real-Time Oceanographic Insights

Bowen Chen,Jayesh Gajbhar,Gregory Dusek,Rob Redmon,Patrick Hogan,Paul Liu,DelWayne Bohnenstiehl,Dongkuan Xu,Ruoying He

Main category: cs.CL

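TL;DR: OceanAI is a conversational platform that integrates open-source large language models with real-time ocean data streams from the U.S. National Oceanic and Atmospheric Administration (NOAA). It uses real-time API calls to produce verifiable, reproducible natural-language answers and data visualizations, improving the transparency, reproducibility, and trustworthiness of scientific AI.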

Details Motivation: General-purpose AI systems used in science often produce unverified "hallucinations" and do not cite authoritative data, which undermines scientific rigor. A system is therefore needed that combines natural-language interaction with real scientific data. Method: Develops the OceanAI platform, which integrates open-source LLMs with several authoritative NOAA ocean data APIs; each user query dynamically triggers real-time data requests, and the results are parsed and synthesized into answers and visualizations that cite the original data. Result: In a blind comparison with three mainstream AI products, only OceanAI provided accurate NOAA water-level values with original data sources; the other systems either declined to answer or gave unsupported results. The platform already supports multiple NOAA data products and applies to marine hazard forecasting, ecosystem assessment, and water-quality monitoring. Conclusion: By combining LLMs with authoritative observational data, OceanAI delivers a scientifically reliable, verifiable conversational AI system and offers a scalable framework for AI-assisted decision support in the ocean domain. Abstract: Artificial intelligence is transforming the sciences, yet general conversational AI systems often generate unverified "hallucinations" undermining scientific rigor. We present OceanAI, a conversational platform that integrates the natural-language fluency of open-source large language models (LLMs) with real-time, parameterized access to authoritative oceanographic data streams hosted by the National Oceanic and Atmospheric Administration (NOAA). Each query such as "What was Boston Harbor's highest water level in 2024?" triggers real-time API calls that identify, parse, and synthesize relevant datasets into reproducible natural-language responses and data visualizations. In a blind comparison with three widely used AI chat-interface products, only OceanAI produced NOAA-sourced values with original data references; others either declined to answer or provided unsupported results. Designed for extensibility, OceanAI connects to multiple NOAA data products and variables, supporting applications in marine hazard forecasting, ecosystem assessment, and water-quality monitoring. By grounding outputs in verifiable observations, OceanAI advances transparency, reproducibility, and trust, offering a scalable framework for AI-enabled decision support within the oceans. A public demonstration is available at https://oceanai.ai4ocean.xyz.

[44] VayuChat: An LLM-Powered Conversational Interface for Air Quality Data Analytics

Vedant Acharya,Abhay Pisharodi,Rishabh Mondal,Mohammad Rafiuddin,Nipun Batra

Main category: cs.CL

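TL;DR: VayuChat is an LLM-based conversational system that lets users ask natural-language questions about air quality, meteorology, and policy programs, and returns executable code and interactive visualizations, making environmental data analysis more accessible.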

Details Motivation: Air pollution causes about 1.6 million premature deaths in India each year, yet decision makers struggle to turn dispersed data into effective decisions. Existing tools require expertise and offer only static dashboards, so they cannot answer key policy questions. Method: VayuChat integrates data from Central Pollution Control Board (CPCB) monitoring stations, state-level demographics, and National Clean Air Programme (NCAP) funding records into a unified interface, uses LLMs for natural-language question answering, and outputs Python code and interactive charts. Result: The system lets users carry out complex environmental analytics through simple conversation; it is publicly deployed on Hugging Face with an accompanying demo video, demonstrating its usability for policymakers, researchers, and the public. Conclusion: VayuChat lowers the barrier to using environmental data and shows the potential of conversational AI for decision support in sustainability. Abstract: Air pollution causes about 1.6 million premature deaths each year in India, yet decision makers struggle to turn dispersed data into decisions. Existing tools require expertise and provide static dashboards, leaving key policy questions unresolved. We present VayuChat, a conversational system that answers natural language questions on air quality, meteorology, and policy programs, and responds with both executable Python code and interactive visualizations. VayuChat integrates data from Central Pollution Control Board (CPCB) monitoring stations, state-level demographics, and National Clean Air Programme (NCAP) funding records into a unified interface powered by large language models. Our live demonstration will show how users can perform complex environmental analytics through simple conversations, making data science accessible to policymakers, researchers, and citizens. The platform is publicly deployed at https://huggingface.co/spaces/SustainabilityLabIITGN/VayuChat. For further information, see the video at https://www.youtube.com/watch?v=d6rklL05cs4.

[45] Building a Silver-Standard Dataset from NICE Guidelines for Clinical LLMs

Qing Ding,Eric Hua Qing Zhang,Felix Jozsa,Julia Ive

Main category: cs.CL

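TL;DR: This study introduces a validated dataset built from publicly available guidelines for evaluating large language models on clinical reasoning.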

Details Motivation: There is currently no standardized benchmark for evaluating LLMs on guideline-based clinical reasoning. Method: Uses GPT to build a dataset of realistic patient scenarios and clinical questions, and benchmarks a range of popular LLMs on it. Result: The dataset supports systematic evaluation of LLMs' clinical utility and guideline adherence. Conclusion: The proposed dataset and framework provide a reliable tool for evaluating LLM applications in healthcare. Abstract: Large language models (LLMs) are increasingly used in healthcare, yet standardised benchmarks for evaluating guideline-based clinical reasoning are missing. This study introduces a validated dataset derived from publicly available guidelines across multiple diagnoses. The dataset was created with the help of GPT and contains realistic patient scenarios, as well as clinical questions. We benchmark a range of recent popular LLMs to showcase the validity of our dataset. The framework supports systematic evaluation of LLMs' clinical utility and guideline adherence.

[46] HPLT 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models

Stephan Oepen,Nikolay Arefev,Mikko Aulamo,Marta Bañón,Maja Buljan,Laurie Burchell,Lucas Charpentier,Pinzhen Chen,Mariya Fedorova,Ona de Gibert,Barry Haddow,Jan Hajič,Jindrič Helcl,Andrey Kutuzov,Zihao Li,Risto Luukkonen,Bhavitvya Malik,Vladislav Mikhailov,Amanda Myntti,Dayyán O'Brien,Lucie Poláková,Sampo Pyysalo,Gema Ramírez Sánchez,Janine Siewert,Pavel Stepachev,Jörg Tiedemann,Teemu Vahtola,Fedor Vitiugin,Tea Vojtěchová,Jaume Zaragoza

Main category: cs.CL

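TL;DR: This paper presents an initiative to provide open, large-scale, high-quality, richly annotated text datasets for almost 200 languages. At 30 trillion tokens, it is likely the largest multilingual LLM pre-training dataset currently available, and it comes with a fully open-source data processing pipeline and multilingual evaluation benchmarks.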

Details Motivation: To advance multilingual large models by addressing the small scale, low quality, and sparse annotation of existing datasets with a large, high-quality, openly available multilingual pre-training resource. Method: Starting from web crawls from multiple sources, the authors build a complete open-source pipeline covering document selection, text extraction from HTML, language identification, deduplication, quality estimation, and annotation of sensitive information, and they automatically mine and synthesize large-scale parallel corpora. Result: Releases a 30-trillion-token multilingual dataset, provides a comprehensive evaluation benchmark for nine European languages, and trains and evaluates 57 monolingual encoder-decoder models plus several GPT-like monolingual models to validate data quality and model performance. Conclusion: The dataset is among the largest publicly available multilingual pre-training corpora, with a complete toolchain and evaluation suite, and it substantially supports research and development of multilingual large models. Abstract: We present an ongoing initiative to provide open, very large, high-quality, and richly annotated textual datasets for almost 200 languages. At 30 trillion tokens, this is likely the largest generally available multilingual collection of LLM pre-training data. These datasets are derived from web crawls from different sources and accompanied by a complete, open-source pipeline for document selection from web archives, text extraction from HTML, language identification for noisy texts, exact and near-deduplication, annotation with, among others, register labels, text quality estimates, and personally identifiable information; and final selection and filtering. We report on data quality probes through contrastive and analytical statistics, through manual inspection of samples for 24 languages, and through end-to-end evaluation of various language model architectures trained on this data. For multilingual LLM evaluation, we provide a comprehensive collection of benchmarks for nine European languages, with special emphasis on natively created tasks, mechanisms to mitigate prompt sensitivity, and refined normalization and aggregation of scores. Additionally, we train and evaluate a family of 57 monolingual encoder-decoder models, as well as a handful of monolingual GPT-like reference models. Besides the monolingual data and models, we also present a very large collection of parallel texts automatically mined from this data, together with a novel parallel corpus synthesized via machine translation.

[47] Improving Romanian LLM Pretraining Data using Diversity and Quality Filtering

Vlad Negoita,Mihai Masala,Traian Rebedea

Main category: cs.CL

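TL;DR: This paper studies the characteristics and coverage of Romanian pretraining corpora and applies multi-level filtering with a lightweight multitask model trained on LLM-annotated text, improving both data quality and model performance.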

Details Motivation: Under-represented languages such as Romanian lack high-quality corpora, so effective methods are needed to raise the quality of their pretraining data. Method: Trains a lightweight multitask model on LLM-annotated Romanian texts to filter data along several dimensions, including educational value, topic, and format. Result: Finds notable differences in topic distribution between Romanian and English data, and shows that filtering improves LLM pretraining performance across multiple benchmarks. Conclusion: Multi-level filtering effectively improves pretraining data quality for under-represented languages and, in turn, the performance of large language models. Abstract: Large Language Models (LLMs) have recently exploded in popularity, often matching or outperforming human abilities on many tasks. One of the key factors in training LLMs is the availability and curation of high-quality data. Data quality is especially crucial for under-represented languages, where high-quality corpora are scarce. In this work we study the characteristics and coverage of Romanian pretraining corpora and we examine how they differ from English data. By training a lightweight multitask model on carefully LLM-annotated Romanian texts, we are able to analyze and perform multi-level filtering (e.g., educational value, topic, format) to generate high-quality pretraining datasets. Our experiments show noteworthy trends in the topics present in Romanian and English data, while also proving the effectiveness of filtering data through improved LLM pretraining performance across multiple benchmarks.

[48] TSVer: A Benchmark for Fact Verification Against Time-Series Evidence

Marek Strong,Andreas Vlachos

Main category: cs.CL

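TL;DR: This paper presents TSVer, a new fact-verification benchmark focused on temporal and numerical reasoning over time-series evidence. It contains 287 real-world claims and 400 time series, annotated with time frames, verdicts, and justifications, with annotation quality supported by an LLM-assisted process. Experiments show that existing models still find the task challenging.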

Details Motivation: Existing fact-checking datasets often lack structured evidence, give insufficient justifications for verdicts, or rely on synthetic claims, which makes it hard to evaluate temporal and numerical reasoning. A time-series fact-verification benchmark based on real claims with high-quality annotations is therefore needed. Method: Builds TSVer, a dataset of real-world claims and diverse time series, using an LLM-assisted multi-step annotation process that labels each claim with relevant time frames, a verdict, and detailed justifications; inter-annotator agreement (kappa=0.745) is reported as a measure of annotation quality. Result: A baseline for verifying claims against time-series evidence is developed; even state-of-the-art reasoning models such as Gemini-2.5-Pro perform modestly on TSVer, with a verdict accuracy of only 63.37 and an Ev2R score of 48.63. Conclusion: TSVer provides a high-quality, realistic, and challenging benchmark for time-series fact verification, exposes the weaknesses of current models in temporal and numerical reasoning, and motivates research into stronger reasoning methods. Abstract: Reasoning over temporal and numerical data, such as time series, is a crucial aspect of fact-checking. While many systems have recently been developed to handle this form of evidence, their evaluation remains limited by existing datasets, which often lack structured evidence, provide insufficient justifications for verdicts, or rely on synthetic claims. In this paper, we introduce TSVer, a new benchmark dataset for fact verification focusing on temporal and numerical reasoning with time-series evidence. TSVer contains 287 real-world claims sourced from 38 fact-checking organizations and a curated database of 400 time series covering diverse domains. Each claim is annotated with time frames across all pertinent time series, along with a verdict and justifications reflecting how the evidence is used to reach the verdict. Using an LLM-assisted multi-step annotation process, we improve the quality of our annotations and achieve an inter-annotator agreement of kappa=0.745 on verdicts. We also develop a baseline for verifying claims against time-series evidence and show that even state-of-the-art reasoning models like Gemini-2.5-Pro are challenged by time series, achieving a 63.37 accuracy score on verdicts and an Ev2R score of 48.63 on verdict justifications.

[49] MicroRemed: Benchmarking LLMs in Microservices Remediation

Lingzhe Zhang,Yunpeng Zhai,Tong Jia,Chiming Duan,Minghua He,Leyi Pan,Zhaoyang Liu,Bolin Ding,Ying Li

Main category: cs.CL

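TL;DR: This paper presents MicroRemed, the first benchmark for evaluating large language models on end-to-end microservice remediation, and ThinkRemed, a multi-agent framework that emulates the reflective and perceptive reasoning of Site Reliability Engineers. Experiments show that current LLMs face substantial challenges on MicroRemed, while ThinkRemed improves remediation performance through iterative reasoning and system reflection.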

Details Motivation: Existing microservice remediation approaches rely on human-written prompts, which limits automation and efficiency; a benchmark is needed to evaluate and advance LLMs in this area. Method: Proposes the MicroRemed benchmark, which requires models to generate executable Ansible playbooks directly from diagnosis reports, and designs the ThinkRemed multi-agent framework to reproduce SRE-style reflective and perceptive reasoning. Result: MicroRemed poses significant challenges to existing LLMs; ThinkRemed improves end-to-end remediation performance through iterative reasoning and system reflection. Conclusion: Together, ThinkRemed and MicroRemed provide an important foundation for future research on applying LLMs to automated microservice remediation. Abstract: Large Language Models (LLMs) integrated with agent-based reasoning frameworks have recently shown strong potential for autonomous decision-making and system-level operations. One promising yet underexplored direction is microservice remediation, where the goal is to automatically recover faulty microservice systems. Existing approaches, however, still rely on human-crafted prompts from Site Reliability Engineers (SREs), with LLMs merely converting textual instructions into executable code. To advance research in this area, we introduce MicroRemed, the first benchmark for evaluating LLMs in end-to-end microservice remediation, where models must directly generate executable Ansible playbooks from diagnosis reports to restore system functionality. We further propose ThinkRemed, a multi-agent framework that emulates the reflective and perceptive reasoning of SREs. Experimental results show that MicroRemed presents substantial challenges to current LLMs, while ThinkRemed improves end-to-end remediation performance through iterative reasoning and system reflection. The benchmark is available at https://github.com/LLM4AIOps/MicroRemed.

[50] Learning When to Quit in Sales Conversations

Emaad Manzoor,Eva Ascarza,Oded Netzer

Main category: cs.CL

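TL;DR: This paper studies salespeople's dynamic screening decisions in high-volume outbound sales and proposes a generative language model-based "stopping agent" that learns when to end a conversation by imitating an optimal stopping policy. The method substantially reduces time spent on failed calls and improves sales efficiency.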

Details Motivation: Salespeople facing many leads must repeatedly decide whether to continue or abandon the current conversation, yet little is known about how effective these decisions are or how to improve them. This paper aims to understand and optimize this dynamic screening process. Method: Formalizes the dynamic screening decision as an optimal stopping problem, builds a sequential decision agent (the stopping agent) on a generative language model, trains it to imitate a retrospectively inferred optimal stopping policy, and applies it to real outbound sales conversations. Result: On outbound call data from a large European telecommunications firm, the stopping agent reduces time spent on failed calls by 54% while preserving nearly all sales; reallocating the saved time increases expected sales by up to 37%. The analysis finds that salespeople overweight a few salient expressions of consumer disinterest and mispredict call failure risk. Conclusion: AI algorithms can correct the cognitive limits of human real-time conversational decisions and substantially improve salesforce decision efficiency and overall performance. Abstract: Salespeople frequently face the dynamic screening decision of whether to persist in a conversation or abandon it to pursue the next lead. Yet, little is known about how these decisions are made, whether they are efficient, or how to improve them. We study these decisions in the context of high-volume outbound sales where leads are ample, but time is scarce and failure is common. We formalize the dynamic screening decision as an optimal stopping problem and develop a generative language model-based sequential decision agent - a stopping agent - that learns whether and when to quit conversations by imitating a retrospectively-inferred optimal stopping policy. Our approach handles high-dimensional textual states, scales to large language models, and works with both open-source and proprietary language models. When applied to calls from a large European telecommunications firm, our stopping agent reduces the time spent on failed calls by 54% while preserving nearly all sales; reallocating the time saved increases expected sales by up to 37%. Upon examining the linguistic cues that drive salespeople's quitting decisions, we find that they tend to overweight a few salient expressions of consumer disinterest and mispredict call failure risk, suggesting cognitive bounds on their ability to make real-time conversational decisions. Our findings highlight the potential of artificial intelligence algorithms to correct cognitively-bounded human decisions and improve salesforce efficiency.

[51] Surfacing Subtle Stereotypes: A Multilingual, Debate-Oriented Evaluation of Modern LLMs

Muhammed Saeed,Muhammad Abdul-mageed,Shady Shehata

Main category: cs.CL

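TL;DR: This paper presents DebateBias-8K, a multilingual, debate-style benchmark for exposing bias in generative large models in realistic narrative settings. The study covers four sensitive domains and seven languages and finds that mainstream models widely reproduce stereotypes, with stronger bias in low-resource languages, indicating that current English-based safety alignment does not generalize globally.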

Details Motivation: Existing bias evaluations mostly rely on English classification tasks, which do not reflect how generative models exhibit bias in open-ended dialogue, so a multilingual benchmark closer to real use is needed. Method: Builds the DebateBias-8K dataset of 8,400 structured debate prompts covering four sensitive topics and seven languages; generates more than 100,000 responses with GPT-4o, Claude 3, DeepSeek, and LLaMA 3 and classifies them automatically. Result: All models reproduce entrenched stereotypes: Arabs are strongly associated with terrorism and religion (≥95%), Africans with socioeconomic "backwardness" (up to 77%), and Western groups are framed as modern or progressive. Bias intensifies in low-resource languages, showing that English-dominated alignment training does not generalize across languages. Conclusion: A significant gap in multilingual fairness remains: current alignment methods reduce explicit toxicity but still produce systematic bias in open-ended generation, and more culturally inclusive alignment strategies are needed. Abstract: Large language models (LLMs) are widely deployed for open-ended communication, yet most bias evaluations still rely on English, classification-style tasks. We introduce DebateBias-8K, a new multilingual, debate-style benchmark designed to reveal how narrative bias appears in realistic generative settings. Our dataset includes 8,400 structured debate prompts spanning four sensitive domains: women's rights, socioeconomic development, terrorism, and religion, across seven languages ranging from high-resource (English, Chinese) to low-resource (Swahili, Nigerian Pidgin). Using four flagship models (GPT-4o, Claude 3, DeepSeek, and LLaMA 3), we generate and automatically classify over 100,000 responses. Results show that all models reproduce entrenched stereotypes despite safety alignment: Arabs are overwhelmingly linked to terrorism and religion (≥95%), Africans to socioeconomic "backwardness" (up to 77%), and Western groups are consistently framed as modern or progressive. Biases grow sharply in lower-resource languages, revealing that alignment trained primarily in English does not generalize globally. Our findings highlight a persistent divide in multilingual fairness: current alignment methods reduce explicit toxicity but fail to prevent biased outputs in open-ended contexts. We release our DebateBias-8K benchmark and analysis framework to support the next generation of multilingual bias evaluation and safer, culturally inclusive model alignment.

[52] ZoFia: Zero-Shot Fake News Detection with Entity-Guided Retrieval and Multi-LLM Interaction

Lvhua Wu,Xuefeng Jiang,Sheng Sun,Tian Wen,Yuwei Wang,Min Liu

Main category: cs.CL

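TL;DR: Proposes ZoFia, a two-stage zero-shot fake news detection framework that improves detection performance through Hierarchical Salience and a multi-LLM interactive system.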

Details Motivation: LLMs handling fast-evolving news streams suffer from time-bounded knowledge and hallucination, and models trained on static datasets generalize poorly. Method: Introduces Hierarchical Salience and the SC-MMR algorithm to select keywords for retrieving up-to-date external evidence, and builds a multi-role LLM system for collaborative analysis and adversarial debate. Result: Significantly outperforms existing zero-shot baselines and most few-shot methods on two public datasets. Conclusion: ZoFia effectively improves the accuracy and interpretability of zero-shot fake news detection and is robust. Abstract: The rapid spread of fake news threatens social stability and public trust, rendering its detection an imperative research priority. Although large language models (LLMs) excel at numerous natural language processing tasks with their remarkable contextual understanding and extensive prior knowledge, their time-bounded knowledge coverage and tendency to generate hallucinated content reduce their reliability when handling fast-evolving news streams. Furthermore, models trained on existing static datasets often lack the generalization needed for emerging news topics. To address these challenges, we propose ZoFia, a novel two-stage zero-shot fake news detection framework. First, we introduce Hierarchical Salience to quantify the importance of entities in the news content, and propose the SC-MMR algorithm to effectively select an informative and diverse set of keywords that serve as queries for retrieving up-to-date external evidence. Subsequently, a multi-LLM interactive system, in which each agent assumes a distinct role, performs multi-view collaborative analysis and adversarial debate over the news text and its related information, and finally produces an interpretable and robust judgment. Comprehensive experiments on two public datasets demonstrate that ZoFia clearly outperforms existing zero-shot baselines and most few-shot methods. Our code will be open-sourced to facilitate related communities.
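The abstract does not describe how SC-MMR differs from classical Maximal Marginal Relevance, so the sketch below shows only standard MMR as a point of reference for selecting keywords that are both informative and diverse. The relevance scores, the `similarity` callback, and the `lam` trade-off weight are illustrative assumptions; in ZoFia the relevance would presumably come from Hierarchical Salience.

```python
def mmr_select(relevance, similarity, k, lam=0.7):
    """Classical MMR (not the paper's SC-MMR): greedily pick k diverse, relevant items.

    relevance  -- dict mapping each candidate keyword to a relevance score
    similarity -- function (a, b) -> similarity in [0, 1] between two keywords
    lam        -- weight on relevance versus redundancy with already-selected items
    """
    selected = []
    remaining = set(relevance)
    while remaining and len(selected) < k:

        def score(candidate):
            # Redundancy is the closest match among keywords chosen so far.
            redundancy = max((similarity(candidate, s) for s in selected), default=0.0)
            return lam * relevance[candidate] - (1 - lam) * redundancy

        # Sorting first makes tie-breaking deterministic.
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With a redundancy penalty, a near-duplicate of an already selected keyword loses out to a less relevant but novel one, which is the behavior needed when each keyword becomes a separate retrieval query.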

[53] Self-Harmony: Learning to Harmonize Self-Supervision and Self-Play in Test-Time Reinforcement Learning

Ru Wang,Wei Huang,Qi Cao,Yusuke Iwasawa,Yutaka Matsuo,Jiaxian Guo

Main category: cs.CL

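TL;DR: Proposes Self-Harmony, a framework in which one model acts as both Solver and Reframer and derives reliable learning signals from the consistency between answers to an original question and its paraphrase, achieving state-of-the-art performance in label-free test-time reinforcement learning.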

Details Motivation: Existing TTRL methods rely on synthetic signals such as majority voting, which tend to collapse to popular but wrong answers and do not provide a stable, reliable learning signal. Method: A single model serves as both Solver and Reframer; pseudo-labels are generated by aggregating answer frequencies across the original and reframed questions with the harmonic mean, rewarding answers that are consistent under both. Result: Ranks first in 28 of 30 experimental settings, clearly outperforming existing methods, with no training failures in any experiment, showing strong robustness. Conclusion: By selecting for stability, Self-Harmony avoids reliance on spurious answers and offers an efficient, reliable approach to unsupervised test-time learning. Abstract: Test-time reinforcement learning (TTRL) offers a label-free paradigm for adapting models using only synthetic signals at inference, but its success hinges on constructing reliable learning signals. Standard approaches such as majority voting often collapse to spurious yet popular answers. We introduce Self-Harmony, a framework built on a simple intuition: the correct answer should remain stable across both an original question and its paraphrase. Self-Harmony operationalizes this by employing a single model in two complementary roles: a Solver to produce answers and a Reframer to rephrase the input. Based on this, we further propose a pseudo-label method: instead of majority voting, it aggregates answer frequencies across these original and reframed views using the harmonic mean. This is a process that naturally selects for solutions stable under reframing, thereby avoiding the common trap of favoring view-dependent, spurious answers. Crucially, this requires no human supervision or auxiliary models. Across diverse reasoning benchmarks, Self-Harmony achieves state-of-the-art results in the label-free test-time setting, ranking first in 28 of 30 settings across multiple methods. Beyond accuracy, it demonstrates unprecedented robustness, with zero training failures in all experiments, underscoring its stability and reliability.
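The harmonic-mean aggregation can be illustrated with a small sketch, assuming answers have already been sampled for the original question and for its reframed version. The function name, the use of relative frequencies, and the tie-breaking rule are assumptions made for illustration; the paper may normalize or weight the two views differently.

```python
from collections import Counter


def harmonic_pseudo_label(original_answers, reframed_answers):
    """Pick the answer with the highest harmonic mean of its two view frequencies.

    Both lists are assumed non-empty. An answer that is frequent in only one
    view scores low, because the harmonic mean is dominated by the smaller value.
    """
    freq_orig = Counter(original_answers)
    freq_ref = Counter(reframed_answers)
    n_orig, n_ref = len(original_answers), len(reframed_answers)

    scores = {}
    for answer in set(freq_orig) | set(freq_ref):
        p = freq_orig[answer] / n_orig  # relative frequency under the original question
        q = freq_ref[answer] / n_ref    # relative frequency under the reframed question
        scores[answer] = 0.0 if p + q == 0 else 2 * p * q / (p + q)

    # Sorting first makes tie-breaking deterministic.
    label = max(sorted(scores), key=scores.get)
    return label, scores
```

For example, if answer A has frequencies 0.8 and 0.1 across the two views while answer B has 0.2 and 0.5, pooled majority voting prefers A, whereas the harmonic mean prefers B (about 0.29 versus 0.18) because B is supported under both phrasings.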

[54] DEER: Disentangled Mixture of Experts with Instance-Adaptive Routing for Generalizable Machine-Generated Text Detection

Guoxin Ma,Xiaoming Liu,Zhanhan Zhang,Chengzhengxu Li,Shengchao Liu,Yu Lan

Main category: cs.CL

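TL;DR: Proposes DEER, a two-stage disentangled mixture-of-experts framework that improves machine-generated text detection both in-domain and out-of-domain.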

Details Motivation: Existing machine-generated text detection methods degrade significantly under domain shift, so a robust model that captures both domain-specific and domain-general features is needed. Method: Designs a disentangled mixture-of-experts architecture (DEER) with domain-specific experts and shared experts, and uses a reinforcement learning-based routing mechanism to select experts dynamically, addressing the absence of domain labels at inference time. Result: On five in-domain and five out-of-domain datasets, DEER outperforms state-of-the-art methods in both F1-score and accuracy, with average F1 gains of 1.39% in-domain and 5.32% out-of-domain. Ablation studies confirm the contributions of the disentangled experts and adaptive routing. Conclusion: By separating domain-specific from domain-general features and routing experts dynamically, DEER improves the robustness and generalization of machine-generated text detection. Abstract: Detecting machine-generated text (MGT) has emerged as a critical challenge, driven by the rapid advancement of large language models (LLMs) capable of producing highly realistic, human-like content. However, the performance of current approaches often degrades significantly under domain shift. To address this challenge, we propose a novel framework designed to capture both domain-specific and domain-general MGT patterns through a two-stage Disentangled mixturE-of-ExpeRts (DEER) architecture. First, we introduce a disentangled mixture-of-experts module, in which domain-specific experts learn fine-grained, domain-local distinctions between human and machine-generated text, while shared experts extract transferable, cross-domain features. Second, to mitigate the practical limitation of unavailable domain labels during inference, we design a reinforcement learning-based routing mechanism that dynamically selects the appropriate experts for each input instance, effectively bridging the train-inference gap caused by domain uncertainty. Extensive experiments on five in-domain and five out-of-domain benchmark datasets demonstrate that DEER consistently outperforms state-of-the-art methods, achieving average F1-score improvements of 1.39% and 5.32% on in-domain and out-of-domain datasets respectively, along with accuracy gains of 1.35% and 3.61% respectively. Ablation studies confirm the critical contributions of both disentangled expert specialization and adaptive routing to model performance.

[55] AraFinNews: Arabic Financial Summarisation with Domain-Adapted LLMs

Mo El-Haj,Paul Rayson

Main category: cs.CL

TL;DR: Introduces AraFinNews, the largest publicly available Arabic financial news dataset to date, and studies how domain specificity affects LLM-based summarisation of Arabic financial text.

Details Motivation: Improving the factual accuracy and numerical reliability of Arabic financial summarisation requires dedicated domain adaptation and a high-quality dataset. Method: The authors build AraFinNews, a large Arabic financial news dataset of 212,500 article-headline pairs, and evaluate transformer models (mT5, AraT5, and the domain-adapted FinAraT5) to measure the effect of financial-domain pretraining. Result: Domain-adapted models such as FinAraT5 produce more faithful and coherent summaries, especially when handling quantitative information and entities. Conclusion: Domain-specific pretraining is important for improving factual consistency and narrative fluency in Arabic financial summarisation. Abstract: This paper investigates the impact of domain specificity on abstractive summarisation of Arabic financial texts using large language models (LLMs). We introduce AraFinNews, the largest publicly available Arabic financial news dataset to date, comprising 212,500 article-headline pairs spanning nearly a decade of reporting from October 2015 to July 2025. Designed as the Arabic equivalent of major English summarisation corpora such as CNN/DailyMail, AraFinNews provides a robust benchmark for evaluating domain-specific language understanding and generation in financial contexts. Using this resource, we evaluate transformer-based models, including mT5, AraT5, and the domain-adapted FinAraT5, to examine how financial-domain pretraining influences factual accuracy, numerical reliability, and stylistic alignment with professional reporting. Experimental results show that domain-adapted models generate more faithful and coherent summaries, particularly in handling quantitative and entity-centric information. The findings highlight the importance of domain-specific adaptation for improving factual consistency and narrative fluency in Arabic financial summarisation. The dataset is freely available for non-commercial research at https://github.com/ArabicNLP-UK/AraFinNews.

[56] When, What, and How: Rethinking Retrieval-Enhanced Speculative Decoding

Min Fang,Zhihui Fu,Qibin Zhao,Jun Wang

Main category: cs.CL

TL;DR: Proposes ReSpec, a retrieval-enhanced speculative decoding framework that uses an entropy-guided trigger, feedback-driven candidate selection, and source-aware relaxed verification to speed up LLM inference while maintaining output quality.

Details Motivation: Existing speculative decoding methods are limited by drafter effectiveness: model-based drafters are accurate but costly, while retrieval-enhanced drafters rely on heuristic switching that triggers unnecessary retrievals. A more efficient adaptive mechanism is needed. Method: 1) An entropy-guided adaptive trigger that starts retrieval only when contextual uncertainty is low; 2) candidate selection based on historical feedback, which organizes multiple high-quality candidates for parallel verification; 3) a source-aware relaxed verification strategy that checks model-generated drafts strictly and retrieved drafts more leniently. Result: On Spec-Bench, ReSpec outperforms EAGLE-2 and SAM-Decoding by over 33% and 25% respectively while maintaining output quality. Conclusion: By turning heuristic drafter switching into adaptive decision-making, ReSpec strikes a better balance between efficiency and accuracy and achieves state-of-the-art speculative decoding acceleration. Abstract: Speculative decoding (SD) has emerged as an effective technique to accelerate large language model (LLM) inference without compromising output quality. However, the achievable speedup largely depends on the effectiveness of the drafting model. Model-based methods like EAGLE-2 are accurate but costly, whereas retrieval-enhanced methods like SAM-Decoding rely on heuristic switching strategies that often trigger unnecessary retrievals. To address this, we propose ReSpec (Retrieval-enhanced Speculative Decoding), a novel framework that transforms heuristic drafter switching into adaptive decision-making. ReSpec features three core innovations: 1) An entropy-guided adaptive trigger quantifies contextual predictability to initiate retrieval only when uncertainty is low, avoiding costly low-quality speculations. 2) A feedback-driven candidate selection leverages historical feedback to organize multiple high-quality candidates for parallel verification, maximizing retrieval utility. 3) A source-aware relaxed verification strategy applies strict checks to model-generated drafts while using a relaxed verification for retrieved drafts, achieving a better balance between accuracy and efficiency. Extensive experiments on Spec-Bench demonstrate that ReSpec achieves state-of-the-art acceleration, outperforming EAGLE-2 and SAM-Decoding by over 33% and 25%, respectively, while maintaining output quality.
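The entropy-guided trigger described in the abstract above can be illustrated with a minimal sketch. It assumes the trigger compares the Shannon entropy of the next-token distribution against a fixed threshold; the function names and the threshold value are hypothetical and are not taken from the ReSpec paper.

```python
import math


def shannon_entropy(probs):
    """Entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_retrieve(probs, threshold=1.0):
    """Hypothetical trigger: retrieve only when the context looks predictable.

    Low entropy means the model is confident about the next token, which is
    the regime where the abstract says retrieval is initiated.
    """
    return shannon_entropy(probs) < threshold


# Toy distributions (assumed values, for illustration only).
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
```

With these toy inputs, `should_retrieve(confident)` is true and `should_retrieve(uncertain)` is false, since a uniform distribution over four tokens has entropy ln 4, which exceeds the assumed threshold of 1.0.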

[57] "Give a Positive Review Only": An Early Investigation Into In-Paper Prompt Injection Attacks and Defenses for AI Reviewers

Qin Zhou,Zhexin Zhang,Zhi Li,Limin Sun

Main category: cs.CL

TL;DR: Studies how prompt injections hidden in papers affect AI reviewers in AI-assisted peer review, proposes static and iterative attacks, shows that they are effective and robust across settings, and examines a defense and its limitations.

Details Motivation: As AI models are increasingly used to review scientific papers, some papers may contain malicious prompts that manipulate the AI reviewer's verdict, so this security threat needs systematic study. Method: Two classes of attack are proposed: a static attack that uses a fixed injection prompt, and an iterative attack that optimizes the prompt against a simulated reviewer model. Experiments evaluate both the attacks and the defenses. Result: Both attacks strongly influence frontier AI reviewers, frequently inducing full scores, and are robust across settings. A detection-based defense lowers the attack success rate, but an adaptive attacker can still partially bypass it. Conclusion: AI-assisted peer review faces a serious prompt-injection threat that calls for more attention and more rigorous safeguards. Abstract: With the rapid advancement of AI models, their deployment across diverse tasks has become increasingly widespread. A notable emerging application is leveraging AI models to assist in reviewing scientific papers. However, recent reports have revealed that some papers contain hidden, injected prompts designed to manipulate AI reviewers into providing overly favorable evaluations. In this work, we present an early systematic investigation into this emerging threat. We propose two classes of attacks: (1) static attack, which employs a fixed injection prompt, and (2) iterative attack, which optimizes the injection prompt against a simulated reviewer model to maximize its effectiveness. Both attacks achieve striking performance, frequently inducing full evaluation scores when targeting frontier AI reviewers. Furthermore, we show that these attacks are robust across various settings. To counter this threat, we explore a simple detection-based defense. While it substantially reduces the attack success rate, we demonstrate that an adaptive attacker can partially circumvent this defense. Our findings underscore the need for greater attention and rigorous safeguards against prompt-injection threats in AI-assisted peer review.

[58] FirstAidQA: A Synthetic Dataset for First Aid and Emergency Response in Low-Connectivity Settings

Saiyma Sittul Muna,Rezwan Islam Salvi,Mushfiqur Rahman Mushfique,Ajwad Abrar

Main category: cs.CL

TL;DR: Presents FirstAidQA, a synthetic dataset of 5,500 high-quality question-answer pairs on first aid and emergency response, aimed at the limitations of deploying LLMs in low- or zero-connectivity environments.

Details Motivation: Current LLMs are computationally expensive and hard to run on low-end devices, and the lack of high-quality first-aid datasets limits their use in emergencies. Method: QA pairs are generated with ChatGPT-4o-mini from the text of the Vital First Aid Book (2019) using prompt-based in-context learning, followed by text cleaning, contextual chunking, filtering, and human validation to ensure quality. Result: A public dataset, FirstAidQA, with 5,500 high-quality QA pairs, suitable for instruction tuning and for optimizing small language models, supporting offline, fast, and reliable emergency AI systems. Conclusion: FirstAidQA fills a gap in training data for lightweight first-aid AI models and advances research on safety-critical AI in resource-constrained settings. Abstract: In emergency situations, every second counts. The deployment of Large Language Models (LLMs) in time-sensitive, low or zero-connectivity environments remains limited. Current models are computationally intensive and unsuitable for low-tier devices often used by first responders or civilians. A major barrier to developing lightweight, domain-specific solutions is the lack of high-quality datasets tailored to first aid and emergency response. To address this gap, we introduce FirstAidQA, a synthetic dataset containing 5,500 high-quality question-answer pairs that encompass a wide range of first aid and emergency response scenarios. The dataset was generated using a Large Language Model, ChatGPT-4o-mini, with prompt-based in-context learning, using texts from the Vital First Aid Book (2019). We applied preprocessing steps such as text cleaning, contextual chunking, and filtering, followed by human validation to ensure accuracy, safety, and practical relevance of the QA pairs. FirstAidQA is designed to support instruction-tuning and fine-tuning of LLMs and Small Language Models (SLMs), enabling faster, more reliable, and offline-capable systems for emergency settings. We publicly release the dataset to advance research on safety-critical and resource-constrained AI applications in first aid and emergency response. The dataset is available on Hugging Face at https://huggingface.co/datasets/i-am-mushfiq/FirstAidQA.

[59] DeepSpecs: Expert-Level Questions Answering in 5G

Aman Ganapathy Manvattira,Yifei Xu,Ziyue Dang,Songwu Lu

Main category: cs.CL

TL;DR: Presents DeepSpecs, a retrieval-augmented generation (RAG) system that uses structural and temporal reasoning to answer questions about 5G standards, outperforming existing methods at resolving cross-references and tracing specification evolution.

Details Motivation: Existing RAG systems rely on semantic similarity and cannot reliably resolve cross-references or version changes in 5G standards, so they fall short on expert-level questions. Method: Three metadata-rich databases are built (SpecDB, ChangeDB, TDocDB); cross-references are resolved explicitly through recursive retrieval, and specification evolution is traced through version diffs and Change Requests. Result: Across multiple LLM backends, DeepSpecs outperforms base models and state-of-the-art telecom RAG systems. Ablations show that explicit cross-reference resolution and evolution-aware retrieval substantially improve answer quality. Conclusion: Modeling the structural and temporal properties of 5G standards improves question answering over technical documents, and DeepSpecs offers a workable approach. Abstract: 5G technology enables mobile Internet access for billions of users. Answering expert-level questions about 5G specifications requires navigating thousands of pages of cross-referenced standards that evolve across releases. Existing retrieval-augmented generation (RAG) frameworks, including telecom-specific approaches, rely on semantic similarity and cannot reliably resolve cross-references or reason about specification evolution. We present DeepSpecs, a RAG system enhanced by structural and temporal reasoning via three metadata-rich databases: SpecDB (clause-aligned specification text), ChangeDB (line-level version diffs), and TDocDB (standardization meeting documents). DeepSpecs explicitly resolves cross-references by recursively retrieving referenced clauses through metadata lookup, and traces specification evolution by mining changes and linking them to Change Requests that document design rationale. We curate two 5G QA datasets: 573 expert-annotated real-world questions from practitioner forums and educational resources, and 350 evolution-focused questions derived from approved Change Requests. Across multiple LLM backends, DeepSpecs outperforms base models and state-of-the-art telecom RAG systems; ablations confirm that explicit cross-reference resolution and evolution-aware retrieval substantially improve answer quality, underscoring the value of modeling the structural and temporal properties of 5G standards.
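The recursive cross-reference resolution mentioned in the abstract above could look roughly like the sketch below. The clause store, its contents, the reference pattern, and the depth limit are all hypothetical placeholders; they do not reproduce DeepSpecs' SpecDB schema or any real 3GPP text.

```python
import re

# Hypothetical clause store: clause id -> clause text (not real 3GPP content).
SPEC_DB = {
    "5.1": "Registration follows clause 5.2 and clause 6.1.",
    "5.2": "Authentication is described in clause 6.1.",
    "6.1": "Security context setup.",
}

# Assumed reference syntax: the word "clause" followed by a dotted number.
REF = re.compile(r"clause (\d+(?:\.\d+)*)")


def resolve(clause_id, db, max_depth=3, seen=None):
    """Collect a clause plus every clause it references, depth-first.

    `seen` doubles as the result and as a visited set, so reference cycles
    terminate; `max_depth` bounds how far references are followed.
    """
    seen = {} if seen is None else seen
    if clause_id in seen or clause_id not in db or max_depth < 0:
        return seen
    seen[clause_id] = db[clause_id]
    for ref in REF.findall(db[clause_id]):
        resolve(ref, db, max_depth - 1, seen)
    return seen
```

Calling `resolve("5.1", SPEC_DB)` gathers clauses 5.1, 5.2, and 6.1 in that order, which is the context a generator would then receive instead of only the semantically closest chunk.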

[60] DEEPAMBIGQA: Ambiguous Multi-hop Questions for Benchmarking LLM Answer Completeness

Jiabao Ji,Min Li,Priyanshu Kumar,Shiyu Chang,Saloni Potdar

Main category: cs.CL

TL;DR: Introduces the DeepAmbigQAGen generation pipeline and the DeepAmbigQA dataset for evaluating open-domain QA under name ambiguity and multi-step reasoning. Experiments show that current models, including GPT-5, still fall well short on such questions.

Details Motivation: Existing QA benchmarks rarely evaluate name ambiguity and multi-step reasoning together, although complex real-world questions often require both, so a more realistic evaluation task is needed. Method: DeepAmbigQAGen automatically generates natural, verifiable questions that embed name ambiguity and multi-hop reasoning, and DeepAmbigQA, a dataset of 3,600 questions, is built on text corpora and a linked knowledge graph. Result: Even the state-of-the-art GPT-5 reaches an exact match of only 0.13 on ambiguous questions and 0.21 on non-ambiguous questions, indicating poor answer completeness. Conclusion: Current LLMs have significant weaknesses on questions that combine name ambiguity with multi-step reasoning, and more robust QA systems are needed to improve information gathering and answer completeness. Abstract: Large language models (LLMs) with integrated search tools show strong promise in open-domain question answering (QA), yet they often struggle to produce a complete answer set for complex questions such as Which actor from the film Heat won at least one Academy Award?, which requires (1) distinguishing between multiple films sharing the same title and (2) reasoning across a large set of actors to gather and integrate evidence. Existing QA benchmarks rarely evaluate both challenges jointly. To address this, we introduce DeepAmbigQAGen, an automatic data generation pipeline that constructs QA tasks grounded in text corpora and a linked knowledge graph, generating natural and verifiable questions that systematically embed name ambiguity and multi-step reasoning. Based on this, we build DeepAmbigQA, a dataset of 3,600 questions requiring multi-hop reasoning, half of which also require explicit name-ambiguity resolution. Experiments reveal that even the state-of-the-art GPT-5 gives incomplete answers, achieving only 0.13 exact match on ambiguous questions and 0.21 on non-ambiguous questions. These findings highlight the need for more robust QA systems aimed at information gathering and answer completeness.
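To make the notion of answer completeness concrete, the sketch below scores a prediction as correct only if its normalized answer set equals the gold set. This is an assumed reading of "exact match" for set-valued answers; the abstract does not spell out the metric, and the normalization rule and placeholder answers are illustrative only.

```python
def normalize(answer):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(answer.lower().split())


def answer_set_exact_match(predicted, gold):
    """Return True only when the predicted answers cover the gold set exactly.

    A missing answer (incomplete) or an extra answer (spurious) both fail.
    """
    return {normalize(a) for a in predicted} == {normalize(a) for a in gold}


# Placeholder answers, not real benchmark data.
gold = ["Actor A", "Actor B"]
```

Under this reading, returning only `["Actor A"]` scores zero even though the single answer is correct, which is why partial retrieval is penalized so heavily.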

[61] Thinking with DistilQwen: A Tale of Four Distilled Reasoning and Reward Model Series

Wenrui Cai,Chengyu Wang,Junbing Yan,Jun Huang,Xiangzhong Fang

Main category: cs.CL

TL;DR: Extends the DistilQwen model family with four series of small, efficient reasoning models built for industrial needs: slow-thinking models, two adaptive-thinking series, and distilled reward models. They show high inference efficiency and strong reasoning performance on multiple benchmarks and support scalable training and inference on the Alibaba Cloud PAI platform.

Details Motivation: Real-world applications need small, efficient reasoning models that balance reasoning performance against inference speed, which in turn drives progress in knowledge distillation. Method: Starting from Qwen models, knowledge distillation is used to build four model series (slow-thinking models, two adaptive-thinking series, and distilled reward models), with scalable training and inference on the Alibaba Cloud PAI platform. Result: Across multiple benchmarks, the models combine high inference efficiency with strong reasoning performance, and the distilled reward models effectively support reinforcement learning of reasoning models. Conclusion: The extended DistilQwen family meets industrial demand for efficient, scalable reasoning models and has practical deployment value. Abstract: Recently, the demand for small and efficient reasoning models to support real-world applications has driven the development of knowledge distillation techniques that balance reasoning performance and inference speed. In this paper, we further extend the DistilQwen model family, initialized from the Qwen models, by introducing four model series specifically designed to meet industrial requirements. The distilled model collection comprises: (1) slow-thinking models, optimized for reasoning tasks that require high accuracy; (2) two series of adaptive-thinking models, which dynamically adjust reasoning strategies based on input tasks to maximize efficiency across diverse scenarios; and (3) distilled reward models, which enable further reinforcement learning of reasoning models using distilled knowledge. Comprehensive evaluations across multiple benchmarks demonstrate both high inference efficiency and strong reasoning performance for these models, as well as the practical utility of distilled reward models. We further show that these models support industry practitioners by providing scalable training and inference functionalities on the Alibaba Cloud PAI (Platform for Artificial Intelligence) platform.

[62] PrefixNLI: Detecting Factual Inconsistencies as Soon as They Arise

Sapir Harary,Eran Hirsch,Aviv Slobodkin,David Wan,Mohit Bansal,Ido Dagan

Main category: cs.CL

TL;DR: Proposes running entailment detection over text prefixes to improve the factual consistency of LLM output. The authors build dedicated datasets and train a model called MiniTruePrefixes, which clearly outperforms existing NLI models on prefix-level entailment and is validated on abstractive summarization.

Details Motivation: Existing NLI models judge the factuality of complete sentences, but autoregressive generation makes decisions over a growing text prefix, creating a mismatch. A method that detects entailment effectively over prefixes is needed to make generation more factual. Method: The entailment detection task is generalized to arbitrary text prefixes, training and evaluation datasets are built for it, and a lightweight specialized model, MiniTruePrefixes, is trained and integrated into a controlled decoding framework that steers the LLM toward content faithful to the evidence. Result: MiniTruePrefixes beats comparable NLI baselines by 5-14 F1 points on prefix-level entailment. In abstractive summarization, LLaMA-3.2-3B-Instruct guided by MiniTruePrefixes matches the faithfulness and runtime of the 8B model from the same family while using only half the memory. Conclusion: Extending entailment detection to the prefix level improves the factuality of LLM output, and MiniTruePrefixes, as a small specialized model, is a practical candidate for deployment in real generation systems. Abstract: Natural Language Inference (NLI) models have been used in various ways to improve the factuality of LLM outputs. This is typically done by applying an NLI model to judge whether the model output is entailed by the supposed evidence, triggering some corrective actions, such as beam reranking at inference time or RL rewards during training. While NLI models are trained to detect factual inconsistencies over complete sentences, decisions in the common autoregressive generation architecture are made for each evolving text prefix, during decoding. Addressing this setting, we generalize the entailment detection task to apply over arbitrary text prefixes, and suggest its utility for improving generation faithfulness. Providing suitable evaluation and training datasets for this task, we train MiniTruePrefixes, a novel specialized model that better detects factual inconsistencies over text prefixes, outperforming comparable baseline NLI models by 5-14 F1 points in prefix-level entailment. We further demonstrate that integrating MiniTruePrefixes into a controlled decoding framework substantially improves factual consistency in abstractive summarization. When guided by MiniTruePrefixes, LLaMA-3.2-3B-Instruct matches the faithfulness and runtime of the 8B model from the same model family, while using only half the memory.

[63] Safer in Translation? Presupposition Robustness in Indic Languages

Aadi Palnitkar,Arjun Suresh,Rishi Rajesh,Puneet Puli

Main category: cs.CL

TL;DR: Introduces Cancer-Myth-Indic, an Indic-language benchmark of 2,500 translated items for evaluating how LLMs handle cancer-related false presuppositions in multilingual settings.

Details Motivation: Existing medical benchmarks are almost all in English, leaving multilingual LLM evaluation largely unaddressed, so non-English benchmarks are needed. Method: A 500-item subset of Cancer-Myth, sampled evenly across its categories, is translated into five widely used but under-served Indic languages (500 items per language, 2,500 in total). Native-speaker translators followed a style guide designed to preserve implicit presuppositions. Result: The resulting Cancer-Myth-Indic benchmark contains questions with false cancer-related presuppositions and is used to evaluate several popular LLMs. Conclusion: Cancer-Myth-Indic helps close the gap in multilingual LLM evaluation, particularly for Indic languages. Abstract: More and more people are turning to large language models (LLMs) for healthcare advice and consultation, making it important to gauge the efficacy and accuracy of the responses of LLMs to such queries. While existing literature on medical benchmarks seeks to accomplish this very task, these benchmarks are almost universally in English, which has led to a notable gap in existing literature pertaining to multilingual LLM evaluation. Within this work, we seek to aid in addressing this gap with Cancer-Myth-Indic, an Indic language benchmark built by translating a 500-item subset of Cancer-Myth, sampled evenly across its original categories, into five under-served but widely used languages from the subcontinent (500 per language; 2,500 translated items total). Native-speaker translators followed a style guide for preserving implicit presuppositions in translation; items feature false presuppositions relating to cancer. We evaluate several popular LLMs under this presupposition stress.

[64] The Ouroboros of Benchmarking: Reasoning Evaluation in an Era of Saturation

İbrahim Ethem Deveci,Duygu Ataman

Main category: cs.CL

TL;DR: Examines the saturation of existing benchmarks for LLMs and reasoning models, questions whether surpassing a benchmark truly reflects reasoning ability, and analyzes how the OpenAI, Anthropic, and Google model families have evolved across reasoning tasks, as a reference for future work on reasoning evaluation.

Details Motivation: As models improve and training data may include benchmark data, existing reasoning benchmarks are saturating and no longer reflect reasoning ability faithfully, so their validity needs re-examination. Method: The paper analyzes performance trends of the OpenAI, Anthropic, and Google model families on multiple reasoning benchmarks over several years and compares how results change across reasoning tasks. Result: Performance is saturating on many benchmarks, and the gains may not fully represent stronger reasoning. Current benchmarking faces challenges and needs more rigorous design to assess reasoning accurately. Conclusion: Surpassing existing benchmarks does not necessarily mean genuine gains in reasoning. Future work needs more challenging, contamination-resistant evaluation methods. Abstract: The rapid rise of Large Language Models (LLMs) and Large Reasoning Models (LRMs) has been accompanied by an equally rapid increase of benchmarks used to assess them. However, because model competence has improved through scaling and novel training advances, and because many of these datasets have likely been included in pre- or post-training data, results become saturated, driving a continuous need for new and more challenging replacements. In this paper, we discuss whether surpassing a benchmark truly demonstrates reasoning ability, or whether we are simply tracking numbers divorced from the capabilities we claim to measure. We present an investigation focused on three model families, OpenAI, Anthropic, and Google, and how their reasoning capabilities across different benchmarks evolve over the years. We also analyze performance trends over the years across different reasoning tasks and discuss the current situation of benchmarking and remaining challenges. By offering a comprehensive overview of benchmarks and reasoning tasks, our work aims to serve as a first reference to ground future research in reasoning evaluation and model development.

[65] Confounding Factors in Relating Model Performance to Morphology

Wessel Poelman,Thomas Bauwens,Miryam de Lhoneux

Main category: cs.CL

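TL;DR: Examines how morphological characteristics affect language modeling and tokenization, and argues that conflicting evidence in prior work stems from confounding factors in experimental design. The authors re-assess three hypotheses about why agglutinative languages are harder to model than fusional ones and propose token bigram metrics as gradient proxies for morphological complexity that predict causal language modeling difficulty without expert annotation.

Details Motivation: Studies of how morphology affects language modeling contain confounds in their experimental setups and reach conflicting conclusions, so a more reliable way to disentangle the relationship is needed. Method: The paper identifies confounding factors in existing studies, re-assesses the three hypotheses of Arnett & Bergen (2025), and introduces intrinsic metrics based on token bigrams to predict language modeling difficulty. Result: Each of the original hypotheses is affected by confounding factors, while token bigram metrics serve as effective proxies for morphological complexity and predict modeling difficulty without annotation. Conclusion: Reliably answering how morphology affects language modeling requires controlling for confounds and using more robust intrinsic metrics such as token bigrams. Abstract: The extent to which individual language characteristics influence tokenization and language modeling is an open question. Differences in morphological systems have been suggested as both unimportant and crucial to consider (Cotterell et al., 2018; Gerz et al., 2018a; Park et al., 2021, inter alia). We argue this conflicting evidence is due to confounding factors in experimental setups, making it hard to compare results and draw conclusions. We identify confounding factors in analyses trying to answer the question of whether, and how, morphology relates to language modeling. Next, we re-assess three hypotheses by Arnett & Bergen (2025) for why modeling agglutinative languages results in higher perplexities than fusional languages: they look at morphological alignment of tokenization, tokenization efficiency, and dataset size. We show that each conclusion includes confounding factors. Finally, we introduce token bigram metrics as an intrinsic way to predict the difficulty of causal language modeling, and find that they are gradient proxies for morphological complexity that do not require expert annotation. Ultimately, we outline necessities to reliably answer whether, and how, morphology relates to language modeling.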

[66] RAGSmith: A Framework for Finding the Optimal Composition of Retrieval-Augmented Generation Methods Across Datasets

Muhammed Yusuf Kartal,Suha Kagan Kose,Korhan Sevinç,Burak Aktas

Main category: cs.CL

TL;DR: RAGSmith is a modular framework that optimizes retrieval-augmented generation (RAG) systems through end-to-end architecture search and consistently outperforms a naive RAG baseline across several domains.

Details Motivation: RAG performance depends on complex interactions among many components, and optimizing modules in isolation has limited effect, so a holistic optimization method is needed. Method: RAGSmith treats RAG design as an end-to-end architecture search over nine technique families and 46,080 feasible configurations, using a genetic algorithm to jointly optimize retrieval and generation metrics. Result: On six Wikipedia-derived domains, RAGSmith improves on the baseline by 3.8% on average (range 1.2% to 6.9%), with gains of up to 12.5% in retrieval and 7.5% in generation. The search explores only about 0.2% of the configuration space and finds a robust backbone: vector retrieval plus post-generation reflection/revision, with other modules chosen per domain. Passage compression is never selected. Conclusion: RAGSmith offers practical, domain-aware guidance for building RAG systems and demonstrates the value of evolutionary search for full-pipeline optimization. Abstract: Retrieval-Augmented Generation (RAG) quality depends on many interacting choices across retrieval, ranking, augmentation, prompting, and generation, so optimizing modules in isolation is brittle. We introduce RAGSmith, a modular framework that treats RAG design as an end-to-end architecture search over nine technique families and 46,080 feasible pipeline configurations. A genetic search optimizes a scalar objective that jointly aggregates retrieval metrics (recall@k, mAP, nDCG, MRR) and generation metrics (LLM-Judge and semantic similarity). We evaluate on six Wikipedia-derived domains (Mathematics, Law, Finance, Medicine, Defense Industry, Computer Science), each with 100 questions spanning factual, interpretation, and long-answer types. RAGSmith finds configurations that consistently outperform a naive RAG baseline by +3.8% on average (range +1.2% to +6.9% across domains), with gains up to +12.5% in retrieval and +7.5% in generation. The search typically explores approximately 0.2% of the space (about 100 candidates) and discovers a robust backbone (vector retrieval plus post-generation reflection/revision) augmented by domain-dependent choices in expansion, reranking, augmentation, and prompt reordering; passage compression is never selected. Improvement magnitude correlates with question type, with larger gains on factual/long-answer mixes than interpretation-heavy sets. These results provide practical, domain-aware guidance for assembling effective RAG systems and demonstrate the utility of evolutionary search for full-pipeline optimization.
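The idea of a genetic search over pipeline configurations can be sketched as follows. Everything here is an illustrative assumption: the three technique families and their options are hypothetical (the abstract does not enumerate the nine families), `toy_score` stands in for the real objective that aggregates retrieval and generation metrics, and the elitist selection scheme is one common choice rather than RAGSmith's actual algorithm.

```python
import random

# Hypothetical technique families and options (not the paper's nine families).
SPACE = {
    "retrieval": ["vector", "bm25", "hybrid"],
    "reranking": ["none", "cross_encoder"],
    "post_generation": ["none", "reflection"],
}


def toy_score(cfg):
    """Stand-in objective; a real run would evaluate the assembled pipeline."""
    return (
        (cfg["retrieval"] == "vector")
        + (cfg["post_generation"] == "reflection")
        + 0.5 * (cfg["reranking"] == "cross_encoder")
    )


def random_config(rng):
    return {family: rng.choice(options) for family, options in SPACE.items()}


def crossover(a, b, rng):
    """Pick each family's option from one of the two parents."""
    return {family: rng.choice([a[family], b[family]]) for family in SPACE}


def mutate(cfg, rng, rate=0.2):
    """Resample each family's option with probability `rate`."""
    return {
        family: rng.choice(SPACE[family]) if rng.random() < rate else option
        for family, option in cfg.items()
    }


def genetic_search(score, generations=10, pop_size=8, seed=0):
    rng = random.Random(seed)
    population = [random_config(rng) for _ in range(pop_size)]
    history = []  # best score seen at the start of each generation
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        history.append(score(population[0]))
        elite = population[: pop_size // 2]  # elitism: keep the top half
        children = [
            mutate(crossover(*rng.sample(elite, 2), rng), rng)
            for _ in range(pop_size - len(elite))
        ]
        population = elite + children
    return max(population, key=score), history


best, history = genetic_search(toy_score)
```

Because the top half of each generation is carried over unchanged, the best score recorded in `history` never decreases. In a real setting each call to the objective is an expensive pipeline evaluation, which is why exploring only a small fraction of the space matters.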

[67] LiveSearchBench: An Automatically Constructed Benchmark for Retrieval and Reasoning over Dynamic Knowledge

Heng Zhou,Ao Yu,Yuchen Fan,Jianing Shi,Li Kang,Hejia Geng,Yongting Zhang,Yutao Fan,Yuhao Wu,Tiancheng He,Yiran Qin,Lei Bai,Zhenfei Yin

Main category: cs.CL

TL;DR: LiveSearchBench is an automated pipeline for building retrieval-dependent benchmarks that evaluate how LLMs handle new knowledge, emphasizing retrieval and reasoning over memorization.

Details Motivation: Traditional static benchmarks reward memorization and understate the role of retrieval, so they fail to reflect how real-world knowledge changes. Method: The pipeline computes deltas between Wikidata snapshots, filters for high-quality triples, and generates natural-language questions at three levels of reasoning difficulty, using SPARQL validation to guarantee a unique, verifiable answer. Result: Model performance drops markedly on facts that post-date pretraining, especially on multi-hop queries. Retrieval augmentation and larger models only partially close the gap. Conclusion: LiveSearchBench shifts evaluation from static memorization toward up-to-date retrieval and reasoning, providing a systematic basis for assessing LLMs under evolving knowledge. Abstract: Evaluating large language models (LLMs) on question answering often relies on static benchmarks that reward memorization and understate the role of retrieval, failing to capture the dynamic nature of world knowledge. We present LiveSearchBench, an automated pipeline for constructing retrieval-dependent benchmarks from recent knowledge updates. Our method computes deltas between successive Wikidata snapshots, filters candidate triples for quality, and synthesizes natural-language questions at three levels of reasoning difficulty, each guaranteed to admit a unique, verifiable answer through SPARQL validation. The pipeline is fully automated, scalable across time, and minimizes human intervention, enabling continual regeneration of temporally grounded benchmarks. Experiments show a pronounced performance drop when models confront facts that post-date pretraining, with the gap most salient on multi-hop queries. Retrieval-augmented methods and larger, instruction-tuned models provide partial gains but fail to close this recency gap. By design, LiveSearchBench shifts evaluation from static memorization toward tasks that require up-to-date retrieval and reasoning, offering a foundation for systematic, long-term assessment of LLMs under evolving knowledge.
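The snapshot-delta step in the abstract above can be sketched with plain set operations. This is a simplified illustration: snapshots are modeled as in-memory sets of (subject, predicate, object) tuples with made-up entity names, and the in-memory uniqueness check is a stand-in for the SPARQL validation the paper describes.

```python
def snapshot_delta(old, new):
    """Return (added, removed) triples between two snapshots."""
    return new - old, old - new


def has_unique_answer(triple, snapshot):
    """True if the triple's (subject, predicate) pair has exactly one object.

    Assumed stand-in for SPARQL validation: a question built from the triple
    is only kept when its answer is unambiguous in the newer snapshot.
    """
    subject, predicate, _ = triple
    matches = sum(1 for (s, p, _) in snapshot if (s, p) == (subject, predicate))
    return matches == 1


# Toy snapshots with placeholder entities (not real Wikidata content).
old_snapshot = {("A", "head_of_state", "X"), ("B", "member_of", "G1")}
new_snapshot = {
    ("A", "head_of_state", "Y"),
    ("B", "member_of", "G1"),
    ("B", "member_of", "G2"),
    ("C", "capital", "Z"),
}

added, removed = snapshot_delta(old_snapshot, new_snapshot)
candidates = {t for t in added if has_unique_answer(t, new_snapshot)}
```

Here the new fact about `B` is dropped because `B` has two `member_of` values in the newer snapshot, so a question about it would not have a single verifiable answer.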

[68] "Don't Teach Minerva": Guiding LLMs Through Complex Syntax for Faithful Latin Translation with RAG

Sergio Torres Aguilar

Main category: cs.CL

TL;DR: Presents a reproducible draft-and-refine pipeline that uses open-source LLMs for Latin translation and performs on par with top-tier proprietary systems.

Details Motivation: Translating low-resource, morphologically complex languages such as Latin is very challenging, and existing methods struggle to combine quality with reproducibility. Method: A fine-tuned NLLB-1.3B model first produces a structurally accurate draft, which a zero-shot LLM (Llama-3.3 or Qwen3) then polishes, with retrieval-augmented generation (RAG) adding context. Result: On an in-domain test set and a new out-of-domain set of 12th-century Latin letters, the method is statistically indistinguishable from the GPT-5 baseline. Conclusion: Without task-specific LLM fine-tuning, an open-source RAG system can reach the translation quality of top proprietary models, supporting reproducible and open research. Abstract: Translating a morphology-rich, low-resource language like Latin poses significant challenges. This paper introduces a reproducible draft-based refinement pipeline that elevates open-source Large Language Models (LLMs) to a performance level statistically comparable to top-tier proprietary systems. Our method first uses a fine-tuned NLLB-1.3B model to generate a high-quality, structurally faithful draft. A zero-shot LLM (Llama-3.3 or Qwen3) then polishes this draft, a process that can be further enhanced by augmenting the context with retrieved out-context examples (RAG). We demonstrate the robustness of this approach on two distinct benchmarks: a standard in-domain test set (Rosenthal, 2023) and a new, challenging out-of-domain (OOD) set of 12th-century Latin letters (2025). Our central finding is that this open-source RAG system achieves performance statistically comparable to the GPT-5 baseline, without any task-specific LLM fine-tuning. We release the pipeline, the Chartres OOD set, and evaluation scripts and models to facilitate replicability and further research.

[69] BARD: budget-aware reasoning distillation

Lujie Niu,Lei Shen,Yi Jiang,Caixia Yuan,Xiaojie Wang,Wenbo Su,Bo zheng

Main category: cs.CL

TL;DR: Proposes Budget-Aware Reasoning Distillation (BARD), a framework that controls reasoning length during knowledge distillation, using two-phase training so that a small model keeps strong performance while allowing fine-grained control of compute.

Details Motivation: Long chain-of-thought (CoT) distillation is effective but redundant, and its compute cost is hard to control, which wastes resources. A method is needed that transfers reasoning ability while keeping reasoning length controllable. Method: BARD trains in two phases. The first is supervised fine-tuning on teacher-generated long CoT data compressed to different budget levels, so that the model learns the budget constraint. The second applies reinforcement learning that jointly optimizes reasoning performance and budget fidelity. Result: An 8B student model performs strongly on hard benchmarks such as AIME24, AIME25, and GPQA and controls its reasoning length precisely and adaptively over a wide range of budgets. Conclusion: BARD distills reasoning ability effectively while giving fine-grained control over reasoning length, making small models more practical and flexible in resource-constrained settings. Abstract: While long Chain-of-Thought (CoT) distillation effectively transfers reasoning capability to smaller language models, the reasoning process often remains redundant and the computational budget uncontrollable, leading to inefficient resource usage. To address this limitation, we propose Budget-Aware Reasoning Distillation (BARD), a novel framework that simultaneously distills reasoning capability and enables fine-grained control over the reasoning length. BARD uses the thinking budget as a user-specified control signal, allowing the model to dynamically balance reasoning performance and computational efficiency. To achieve this concept, BARD introduces a two-phase training regimen. The first phase is Supervised Fine-Tuning (SFT) on teacher-generated long CoT data compressed to various budget levels, which bootstraps the model's understanding of budget constraints. The second phase leverages Reinforcement Learning (RL) from a reward signal in consideration of reasoning performance and budget fidelity simultaneously. Incorporating the two-phase regimen is crucial to avoiding policy degradation and ensuring that both objectives are optimized jointly. Extensive experiments demonstrate that our method empowers an 8B student model to achieve strong performance on challenging reasoning benchmarks (AIME24, AIME25, GPQA) while providing precise and adaptive control over its reasoning length across a wide range of budgets.

[70] Towards Consistent Detection of Cognitive Distortions: LLM-Based Annotation and Dataset-Agnostic Evaluation

Neha Sharma,Navneet Agarwal,Kairit Sirts

Main category: cs.CL

TL;DR: Proposes using LLMs as consistent, reliable annotators for detecting cognitive distortions in text and introduces a cross-dataset evaluation framework based on Cohen's kappa. Experiments show that GPT-4 annotations are highly consistent and that models trained on them outperform models trained on human labels.

Details Motivation: Cognitive distortion detection is highly subjective, agreement among human annotators is low, and the resulting labels are unreliable, so a more stable and scalable annotation method is needed. Method: An LLM such as GPT-4 annotates the data in multiple independent runs, annotation consistency is analyzed, and a dataset-agnostic evaluation framework based on Cohen's kappa enables fair comparison across datasets and studies. Result: GPT-4 achieves high annotation consistency (Fleiss's Kappa = 0.78), and models trained on LLM-annotated data perform better on the test set than models trained on human-annotated data. Conclusion: LLMs can serve as a scalable, internally consistent alternative for annotation in subjective NLP tasks and improve downstream performance. Abstract: Text-based automated Cognitive Distortion detection is a challenging task due to its subjective nature, with low agreement scores observed even among expert human annotators, leading to unreliable annotations. We explore the use of Large Language Models (LLMs) as consistent and reliable annotators, and propose that multiple independent LLM runs can reveal stable labeling patterns despite the inherent subjectivity of the task. Furthermore, to fairly compare models trained on datasets with different characteristics, we introduce a dataset-agnostic evaluation framework using Cohen's kappa as an effect size measure. This methodology allows for fair cross-dataset and cross-study comparisons where traditional metrics like F1 score fall short. Our results show that GPT-4 can produce consistent annotations (Fleiss's Kappa = 0.78), resulting in improved test set performance for models trained on these annotations compared to those trained on human-labeled data. Our findings suggest that LLMs can offer a scalable and internally consistent alternative for generating training data that supports strong downstream performance in subjective NLP tasks.
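Since the evaluation framework above rests on Cohen's kappa, a short reference implementation may help. It follows the standard definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance from each rater's label frequencies. How the paper applies kappa (for example, model predictions versus gold labels) is not detailed in the abstract, so the example labels are purely illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two raters' marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Illustrative labels (1 = distortion present, 0 = absent).
rater_1 = [1, 1, 0, 0]
rater_2 = [1, 0, 0, 0]
kappa = cohens_kappa(rater_1, rater_2)  # p_o = 0.75, p_e = 0.5 -> 0.5
```

Because kappa corrects for chance agreement, it is less sensitive than raw accuracy or F1 to differences in label distribution between datasets, which is the property the abstract relies on for cross-dataset comparison.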

[71] Synthetic Eggs in Many Baskets: The Impact of Synthetic Data Diversity on LLM Fine-Tuning

Max Schaffelder,Albert Gatt

Main category: cs.CL

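TL;DR: Studies how the diversity of synthetic data sources affects fine-tuned LLMs, focusing on distribution collapse, adversarial robustness, and self-preference bias.

Details Motivation: As synthetic data becomes widely used in language model development, understanding its effect on model behavior is essential. Method: LLMs are fine-tuned on synthetic data from different sources and analyzed for output distribution, adversarial robustness, and self-preference bias. Result: Synthetic data from diverse sources mitigates distribution collapse and preserves output diversity and quality. Synthetic data weakens safeguards while retaining higher output quality. Fine-tuning reduces self-preference bias, with human data most effective and multi-source synthetic data next. Conclusion: Source diversity in synthetic data helps model performance and output quality but can also introduce risks, so its use involves trade-offs. Abstract: As synthetic data becomes widely used in language model development, understanding its impact on model behavior is crucial. This paper investigates the impact of the diversity of sources of synthetic data on fine-tuned large language models. We focus on three key dimensions: distribution collapse, adversarial robustness, and self-preference bias. Our findings reveal that fine-tuning models on synthetic data from diverse sources can mitigate distribution collapse, preserving the breadth of the output distribution and the diversity of the output text. Furthermore, while both human and synthetic fine-tuning data can remove safeguards, the latter preserves higher output quality, thus making outputs potentially more usable and dangerous. Finally, fine-tuning reduces self-preference bias, with human data being the most effective, followed by multi-source synthetic data.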

[72] BanglaNirTox: A Large-scale Parallel Corpus for Explainable AI in Bengali Text Detoxification

Ayesha Afroza Mohsin,Mashrur Ahsan,Nafisa Maliyat,Shanta Maria,Syed Rifat Raiyan,Hasan Mahmud,Md Kamrul Hasan

Main category: cs.CL

TL;DR: Proposes a new method for Bengali text detoxification that combines Pareto-optimized LLMs with chain-of-thought prompting, and builds the BanglaNirTox parallel corpus for model fine-tuning.

Details Motivation: Toxic language in Bengali is a serious problem, but detoxification research is scarce because resources are limited. Method: Pareto class-optimized LLMs with Chain-of-Thought (CoT) prompting generate detoxified sentences, and the BanglaNirTox parallel corpus is built to fine-tune models. Result: Combining Pareto-optimized LLMs with CoT prompting significantly improves the quality and consistency of Bengali text detoxification. Conclusion: The method advances text detoxification for low-resource languages such as Bengali and shows the potential of combining data generation with model optimization. Abstract: Toxic language in Bengali remains prevalent, especially in online environments, with few effective precautions against it. Although text detoxification has seen progress in high-resource languages, Bengali remains underexplored due to limited resources. In this paper, we propose a novel pipeline for Bengali text detoxification that combines Pareto class-optimized large language models (LLMs) and Chain-of-Thought (CoT) prompting to generate detoxified sentences. To support this effort, we construct BanglaNirTox, an artificially generated parallel corpus of 68,041 toxic Bengali sentences with class-wise toxicity labels, reasonings, and detoxified paraphrases, using Pareto-optimized LLMs evaluated on random samples. The resulting BanglaNirTox dataset is used to fine-tune language models to produce better detoxified versions of Bengali sentences. Our findings show that Pareto-optimized LLMs with CoT prompting significantly enhance the quality and consistency of Bengali text detoxification.

[73] Difficulty-Controllable Cloze Question Distractor Generation

Seokhoon Kang,Yejin Jeon,Seonjeong Hwang,Gary Geunbae Lee

Main category: cs.CL

TL;DR: Proposes a new framework that generates distractors of controllable difficulty through data augmentation and multitask learning, addressing the difficulty of producing high-quality distractors for multiple-choice cloze questions.

Details Motivation: Existing methods lack adaptability and difficulty control when generating high-quality distractors, and the absence of difficulty-annotated datasets limits progress. Method: A two-way distractor generation process builds a high-quality, difficulty-annotated dataset, with an ensemble QA system classifying candidate distractors by difficulty; a difficulty-controllable generation model is then trained on this dataset via multitask learning, with auxiliary tasks designed to strengthen the model's semantic understanding of distractors and its ability to estimate their difficulty. Result: Experiments show that the method generates high-quality distractors at every difficulty level and substantially outperforms GPT-4o in aligning distractor difficulty with human perception. Conclusion: The framework achieves controllable, higher-quality distractor generation and offers a new approach to automatic item generation in language assessment. Abstract: Multiple-choice cloze questions are commonly used to assess linguistic proficiency and comprehension. However, generating high-quality distractors remains challenging, as existing methods often lack adaptability and control over difficulty levels, and the absence of difficulty-annotated datasets further hinders progress. To address these issues, we propose a novel framework for generating distractors with controllable difficulty by leveraging both data augmentation and a multitask learning strategy. First, to create a high-quality, difficulty-annotated dataset, we introduce a two-way distractor generation process in order to produce diverse and plausible distractors. These candidates are subsequently refined through filtering and then categorized by difficulty using an ensemble QA system. Second, this newly created dataset is leveraged to train a difficulty-controllable generation model via multitask learning. The framework includes carefully designed auxiliary tasks that enhance the model's semantic understanding of distractors and its ability to estimate their difficulty. Experimental results demonstrate that our method generates high-quality distractors across difficulty levels and substantially outperforms GPT-4o in aligning distractor difficulty with human perception.
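
The abstract says candidates are "categorized by difficulty using an ensemble QA system" but gives no rule. One plausible reading, sketched below, is to label a distractor by the share of QA models it fools. The function name and both thresholds are assumptions made for illustration, not values from the paper.

```python
def distractor_difficulty(fooled_flags, easy_max=0.34, hard_min=0.67):
    """Label a distractor from ensemble votes.

    fooled_flags: one bool per QA model, True if that model picked the
    distractor instead of the correct answer.
    easy_max / hard_min: hypothetical thresholds on the fooled fraction.
    """
    if not fooled_flags:
        raise ValueError("need at least one QA model vote")
    rate = sum(fooled_flags) / len(fooled_flags)
    if rate >= hard_min:
        return "hard"      # most models were misled
    if rate > easy_max:
        return "medium"
    return "easy"          # few or no models were misled


if __name__ == "__main__":
    print(distractor_difficulty([True, True, False, False]))
```

A fraction-based label keeps the rule independent of ensemble size; a real pipeline would likely also weight models by their own accuracy.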

[74] Math anxiety and associative knowledge structure are entwined in psychology students but not in Large Language Models like GPT-3.5 and GPT-4o

Luciana Ciringione,Emma Franchino,Simone Reigl,Isaia D'Onofrio,Anna Serbati,Oleksandra Poquet,Florence Gabriel,Massimo Stella

Main category: cs.CL

TL;DR: Using the behavioural forma mentis network framework, this study examines how psychology undergraduates cognitively and emotionally associate concepts related to math and anxiety, and compares real students with GPT-simulated students on math-anxiety prediction.

Details Motivation: Understand the causes of math anxiety and its effect on students' cognitive structure, and explore individual and group differences in how math and anxiety concepts are associated, in order to improve students' well-being and career development. Method: Across four experiments, behavioural forma mentis networks are used to analyze the concept-association patterns of real psychology undergraduates (n1 = 70, n2 = 57) and GPT-simulated students (GPT-3.5: n = 300; GPT-4o: n = 300); individual network features are used to predict Math Anxiety Scale scores, and group-level concept perceptions are compared. Result: In real students, higher network degree and positive valence ratings for "anxiety", together with negative ratings for "math", predict higher total and evaluative math anxiety; the model does not transfer to GPT-simulated data. High math-anxiety students hold an emotionally polarised representation of "anxiety", and "math" is viewed more negatively than "science". Conclusion: How individuals perceive and emotionally associate the concepts of math and anxiety plays a key role in math anxiety; GPT models cannot yet accurately simulate human emotional and semantic structure, and a cognitive network perspective is needed to intervene in and manage students' math anxiety. Abstract: Math anxiety poses significant challenges for university psychology students, affecting their career choices and overall well-being. This study employs a framework based on behavioural forma mentis networks (i.e. cognitive models that map how individuals structure their associative knowledge and emotional perceptions of concepts) to explore individual and group differences in the perception and association of concepts related to math and anxiety. We conducted 4 experiments involving psychology undergraduates from 2 samples (n1 = 70, n2 = 57) compared against GPT-simulated students (GPT-3.5: n2 = 300; GPT-4o: n4 = 300). Experiments 1, 2, and 3 employ individual-level network features to predict psychometric scores for math anxiety and its facets (observational, social and evaluational) from the Math Anxiety Scale. Experiment 4 focuses on group-level perceptions extracted from human students, GPT-3.5 and GPT-4o's networks. Results indicate that, in students, positive valence ratings and higher network degree for "anxiety", together with negative ratings for "math", can predict higher total and evaluative math anxiety. In contrast, these models do not work on GPT-based data because of differences in simulated networks and psychometric scores compared to humans. These results were also reconciled with differences found in the ways that high/low subgroups of simulated and real students semantically and emotionally framed STEM concepts. High math-anxiety students collectively framed "anxiety" in an emotionally polarising way, absent in the negative perception of low math-anxiety students. "Science" was rated positively, but contrasted against the negative perception of "math". These findings underscore the importance of understanding concept perception and associations in managing students' math anxiety.

[75] ECO Decoding: Entropy-Based Control for Controllability and Fluency in Controllable Dialogue Generation

Seungmin Shin,Dooyoung Kim,Youngjoong Ko

Main category: cs.CL

TL;DR: Proposes ECO, an entropy-based dynamic control decoding method that improves controllability in controllable dialogue generation while preserving fluency.

Details Motivation: A fixed, constant control strength makes it hard to balance controllability against fluency. Method: Dynamically adjust the control strength at each generation step according to the entropy of the language model and attribute classifier probability distributions. Result: On the DailyDialog and MultiWOZ datasets, ECO decoding outperforms existing methods in both single- and multi-attribute settings and alleviates probability interpolation issues in multi-attribute generation. Conclusion: ECO decoding adjusts control strength dynamically and effectively, improving controllability while maintaining generation quality, with good generality and robustness. Abstract: Controllable Dialogue Generation (CDG) enables chatbots to generate responses with desired attributes, and weighted decoding methods have achieved significant success in the CDG task. However, using a fixed constant value to manage the bias of attribute probabilities makes it challenging to find an ideal control strength that satisfies both controllability and fluency. To address this issue, we propose ECO decoding (Entropy-based COntrol), which dynamically adjusts the control strength at each generation step according to the model's entropy in both the language model and attribute classifier probability distributions. Experiments on the DailyDialog and MultiWOZ datasets demonstrate that ECO decoding consistently improves controllability while maintaining fluency and grammaticality, outperforming prior decoding methods across various models and settings. Furthermore, ECO decoding alleviates probability interpolation issues in multi-attribute generation and consequently demonstrates strong performance in both single and multi-attribute scenarios.
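
To make the idea of entropy-dependent control strength concrete, here is a minimal sketch of one weighted-decoding step. The abstract does not give ECO's actual formula, so the rule in `control_strength` (push harder when the language model is uncertain and the classifier is confident) is a hypothetical stand-in, and all function and parameter names are assumptions.

```python
import math

EPS = 1e-12  # guards log(0); an implementation detail of this sketch


def normalized_entropy(probs):
    """Shannon entropy scaled to [0, 1] by dividing by log(len(probs))."""
    if len(probs) < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


def control_strength(lm_probs, clf_probs, base=2.0):
    """Hypothetical rule, not the paper's: stronger control when the LM has
    many fluent options (high entropy) and the attribute classifier is
    confident (low entropy)."""
    return base * normalized_entropy(lm_probs) * (1.0 - normalized_entropy(clf_probs))


def weighted_decode_step(lm_probs, attr_given_token, clf_probs, base=2.0):
    """Score each candidate as log p_LM + lambda * log p(attribute | token)
    and return (best_token_index, lambda)."""
    lam = control_strength(lm_probs, clf_probs, base)
    scores = [
        math.log(p + EPS) + lam * math.log(a + EPS)
        for p, a in zip(lm_probs, attr_given_token)
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, lam
```

With a flat language-model distribution the attribute term decides the token; with a sharply peaked one the control weight shrinks and the fluent token wins, which is the trade-off a fixed constant cannot express.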

[76] BIRD: Bronze Inscription Restoration and Dating

Wenjie Hua,Hoang H. Nguyen,Gangyan Ge

Main category: cs.CL

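TL;DR: Presents the BIRD dataset and an allograph-aware masked language model combined with a Glyph Net to improve the restoration and dating of bronze inscriptions.

Details Motivation: Bronze inscriptions are heavily fragmented and hard to date, and standardized datasets and effective computational methods are lacking. Method: Build the BIRD dataset from standard scholarly transcriptions and chronological labels; propose an allograph-aware masked language model that integrates domain- and task-adaptive pretraining with a Glyph Net (GN), together with a glyph-biased sampling strategy. Result: Experiments show that the GN improves inscription restoration and that glyph-biased sampling improves dating. Conclusion: The method offers an effective approach to automated restoration and dating of bronze inscriptions and advances the digitization of paleography research. Abstract: Bronze inscriptions from early China are fragmentary and difficult to date. We introduce BIRD (Bronze Inscription Restoration and Dating), a fully encoded dataset grounded in standard scholarly transcriptions and chronological labels. We further propose an allograph-aware masked language modeling framework that integrates domain- and task-adaptive pretraining with a Glyph Net (GN), which links graphemes and allographs. Experiments show that GN improves restoration, while glyph-biased sampling yields gains in dating.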

[77] Imperfect Language, Artificial Intelligence, and the Human Mind: An Interdisciplinary Approach to Linguistic Errors in Native Spanish Speakers

Francisco Portillo López

Main category: cs.CL

TL;DR: By analyzing linguistic errors made by native Spanish speakers and combining theoretical linguistics, neurolinguistics, and natural language processing, this study assesses how well large language models interpret and reproduce human linguistic errors.

Details Motivation: Linguistic errors reveal the cognitive structure of human language and expose the limits of current AI systems in modeling real language use. Method: Build a corpus of more than 500 authentic Spanish errors, classify and analyze them from the perspectives of theoretical linguistics, neurolinguistics, and NLP, and test how models such as GPT and Gemini interpret and correct them. Result: The study is expected to deepen understanding of the properties of Spanish and to advance NLP systems that better match human cognition. Conclusion: Integrating methods from several disciplines can support the design of large language models that are closer to real human language use, so that AI handles imperfect, variable, and ambiguous input better. Abstract: Linguistic errors are not merely deviations from normative grammar; they offer a unique window into the cognitive architecture of language and expose the current limitations of artificial systems that seek to replicate them. This project proposes an interdisciplinary study of linguistic errors produced by native Spanish speakers, with the aim of analyzing how current large language models (LLMs) interpret, reproduce, or correct them. The research integrates three core perspectives: theoretical linguistics, to classify and understand the nature of the errors; neurolinguistics, to contextualize them within real-time language processing in the brain; and natural language processing (NLP), to evaluate how models interpret these errors. A purpose-built corpus of more than 500 authentic errors from native Spanish speakers will serve as the foundation for empirical analysis. These errors will be tested against AI models such as GPT or Gemini to assess their interpretative accuracy and their ability to generalize patterns of human linguistic behavior. The project contributes not only to the understanding of Spanish as a native language but also to the development of NLP systems that are more cognitively informed and capable of engaging with the imperfect, variable, and often ambiguous nature of real human language.

[78] ParlaSpeech 3.0: Richly Annotated Spoken Parliamentary Corpora of Croatian, Czech, Polish, and Serbian

Nikola Ljubešić,Peter Rupnik,Ivan Porupski,Taja Kuzman Pungeršek

Main category: cs.CL

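TL;DR: ParlaSpeech is a collection of spoken parliamentary corpora for Croatian, Czech, Polish, and Serbian totalling 6,000 hours, with rich automatic annotation layers (linguistic annotation, sentiment predictions, filled pauses, speech alignment, and stress position) that substantially increase its value for downstream research across disciplines.

Details Motivation: Make spoken parliamentary corpora of Slavic languages more usable for research across disciplines by providing high-quality speech-text aligned data with several layers of automatic annotation. Method: Starting from the ParlaMint transcripts and their metadata, automatically align them with each parliament's speech recordings to build the corpora, then add automatic annotation layers: linguistic annotation and sentiment predictions for the textual modality, and filled pauses, word- and grapheme-level alignments, and primary stress position for the spoken modality. Result: Large annotated spoken corpora for four Slavic languages (6,000 hours in total) were built and are available for download in JSONL and TextGrid formats and for online search; an analysis of acoustic correlates of sentiment demonstrates their use. Conclusion: ParlaSpeech substantially increases the research utility of existing parliamentary corpora and provides a valuable resource for speech, linguistics, sentiment analysis, and other fields. Abstract: ParlaSpeech is a collection of spoken parliamentary corpora currently spanning four Slavic languages - Croatian, Czech, Polish and Serbian - all together 6 thousand hours in size. The corpora were built in an automatic fashion from the ParlaMint transcripts and their corresponding metadata, which were aligned to the speech recordings of each corresponding parliament. In this release of the dataset, each of the corpora is significantly enriched with various automatic annotation layers. The textual modality of all four corpora has been enriched with linguistic annotations and sentiment predictions. Similarly, their spoken modality has been automatically enriched with occurrences of filled pauses, the most frequent disfluency in typical speech. Two out of the four languages have been additionally enriched with detailed word- and grapheme-level alignments, and the automatic annotation of the position of primary stress in multisyllabic words. With these enrichments, the usefulness of the underlying corpora has been drastically increased for downstream research across multiple disciplines, which we showcase through an analysis of acoustic correlates of sentiment. All the corpora are made available for download in JSONL and TextGrid formats, as well as for search through a concordancer.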

[79] A Graph-based RAG for Energy Efficiency Question Answering

Riccardo Campi,Nicolò Oreste Pinciroli Vago,Mathyas Giudici,Pablo Barrachina Rodriguez-Guisado,Marco Brambilla,Piero Fraternali

Main category: cs.CL

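TL;DR: This paper investigates the potential of Large Language Models (LLMs) within a graph-based Retrieval Augmented Generation (RAG) architecture for Energy Efficiency (EE) question answering. The system first automatically extracts a Knowledge Graph (KG) from guidance and regulatory documents in the energy field, then navigates and reasons over the graph to give users accurate answers in multiple languages. Validation is human-based, using the RAGAs framework, a dataset of 101 question-answer pairs, and domain experts. Results indicate that the architecture is promising: it answers correctly in about three out of four cases (75.2 ± 2.7%), performs better on more general EE questions (up to 81.0 ± 4.1%), and shows good multilingual ability (only a 4.4% accuracy loss due to translation).

Details Motivation: Guidance and regulatory documents in the energy efficiency field are often complex and scattered, so users struggle to find accurate information quickly. Existing question answering systems are limited in handling multilingual needs and complex semantic reasoning. A technique that automatically integrates knowledge and supports accurate cross-lingual question answering is therefore needed. Method: A graph-based Retrieval Augmented Generation (RAG) architecture combined with Large Language Models (LLMs). An LLM first builds a Knowledge Graph (KG) automatically from energy-domain documents; path reasoning and information retrieval are then performed over the graph to generate multilingual answers. Evaluation uses the RAGAs framework together with a dataset of 101 question-answer pairs and human validation by domain experts. Result: Overall accuracy is 75.2 ± 2.7%, rising to 81.0 ± 4.1% on more general energy efficiency questions. Multilingual performance is good, with only a 4.4% loss due to translation. Validation shows solid semantic understanding and reasoning, with room for improvement on highly specialized questions or those with implicit logic. Conclusion: A graph-based RAG architecture combined with LLMs shows good potential for energy efficiency question answering, with particular strengths in knowledge organization, multilingual support, and interpretability. Future work can improve the precision of knowledge graph construction and the search strategy for reasoning paths to handle complex questions better. Abstract: In this work, we investigate the use of Large Language Models (LLMs) within a graph-based Retrieval Augmented Generation (RAG) architecture for Energy Efficiency (EE) Question Answering. First, the system automatically extracts a Knowledge Graph (KG) from guidance and regulatory documents in the energy field. Then, the generated graph is navigated and reasoned upon to provide users with accurate answers in multiple languages. We implement a human-based validation using the RAGAs framework properties, a validation dataset comprising 101 question-answer pairs, and domain experts. Results confirm the potential of this architecture and identify its strengths and weaknesses. Validation results show that the system answers correctly in about three out of four cases (75.2 ± 2.7%), with higher results on questions related to more general EE answers (up to 81.0 ± 4.1%), and featuring promising multilingual abilities (4.4% accuracy loss due to translation).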

[80] Evaluating Cultural Knowledge Processing in Large Language Models: A Cognitive Benchmarking Framework Integrating Retrieval-Augmented Generation

Hung-Shin Lee,Chen-Chi Chang,Ching-Yuan Chen,Yun-Hsiang Hsu

Main category: cs.CL

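TL;DR: Proposes a cognitive benchmarking framework that combines Bloom's Taxonomy with Retrieval-Augmented Generation (RAG) to evaluate how large language models process and apply culturally specific knowledge, with a Taiwanese Hakka digital cultural archive as the testbed.

Details Motivation: Evaluate the cognitive ability of large language models when processing and applying culturally specific knowledge, addressing the shortcomings of existing evaluation methods on cultural relevance and hierarchical cognitive tasks. Method: Combine Bloom's Taxonomy with Retrieval-Augmented Generation (RAG) to build a cognitive evaluation framework with six levels (Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating), and test it on a Taiwanese Hakka digital cultural archive. Result: The framework measures the semantic accuracy and cultural relevance of LLM-generated responses and gives a fine-grained assessment of cultural knowledge processing across cognitive levels. Conclusion: The proposed framework offers a systematic, multi-level way to evaluate large language models on culturally specific knowledge tasks and can help improve their cultural sensitivity and cognitive ability. Abstract: This study proposes a cognitive benchmarking framework to evaluate how large language models (LLMs) process and apply culturally specific knowledge. The framework integrates Bloom's Taxonomy with Retrieval-Augmented Generation (RAG) to assess model performance across six hierarchical cognitive domains: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. Using a curated Taiwanese Hakka digital cultural archive as the primary testbed, the evaluation measures LLM-generated responses' semantic accuracy and cultural relevance.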

[81] EngChain: A Symbolic Benchmark for Verifiable Multi-Step Reasoning in Engineering

Ayesha Gull,Muhammad Usman Safder,Rania Elbadry,Preslav Nakov,Zhuohan Xie

Main category: cs.CL

TL;DR: Introduces EngChain, a benchmark for verifiable multi-step engineering problem-solving that evaluates how large language models perform on complex engineering reasoning tasks.

Details Motivation: Existing benchmarks do not capture the integrative reasoning of engineering, where scientific principles, quantitative modeling, and practical constraints come together, so a new evaluation tool is needed. Method: EngChain contains 90 problems covering three engineering branches, 9 domains, and 20 distinct areas; problems are generated from symbolic templates with a high degree of randomization. Evaluation has two stages: first verify the numerical and semantic validity of each reasoning step, then classify reasoning errors with an LLM-As-A-Judge system. Result: The benchmark evaluates the reasoning ability of large language models on multi-step engineering problems and identifies and categorizes error types automatically. Conclusion: EngChain fills the gap left by existing benchmarks in evaluating integrative engineering reasoning and offers a new way to evaluate language models in high-stakes professional domains. Abstract: Large Language Models (LLMs) are increasingly being applied to specialized, high-stakes domains like engineering, which demands rigorous evaluation of their complex reasoning capabilities. While current benchmarks assess language understanding, factual recall, mathematics or code generation, none capture the integrative reasoning central to engineering where scientific principles, quantitative modeling and practical constraints must converge. To address this gap, we introduce EngChain, a benchmark for verifiable multi-step engineering problem-solving. EngChain contains 90 problems spanning three engineering branches, organized into 9 domains and 20 distinct areas. The problems are generated from symbolic templates with a high degree of randomization to ensure diversity and eliminate the risk of contamination. With this benchmark, we move beyond final answer accuracy with a two-stage evaluation: we first quantitatively verify the numerical and semantic validity of each reasoning step and then introduce LLM-As-A-Judge, an automated system to qualitatively categorize the identified reasoning errors.
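
The combination of randomized symbolic templates and per-step numerical verification can be sketched as follows. The series-circuit template, the step names, and the 1% tolerance are all invented for illustration; the benchmark's real templates and checks are not described in the abstract, and semantic validation is not attempted here.

```python
import math
import random


def make_series_circuit_problem(rng):
    """Instantiate a hypothetical template with random parameters and
    return the question text plus the reference value of each step."""
    v = rng.randint(5, 48)      # source voltage in volts
    r1 = rng.randint(10, 100)   # resistances in ohms
    r2 = rng.randint(10, 100)
    question = (
        f"A {v} V source drives two series resistors of {r1} ohm and "
        f"{r2} ohm. Find the total power dissipated."
    )
    r_total = r1 + r2
    current = v / r_total
    power = v * current
    steps = [
        ("total_resistance_ohm", float(r_total)),
        ("current_a", current),
        ("power_w", power),
    ]
    return question, steps


def verify_steps(reference, candidate_values, rel_tol=1e-2):
    """Check each candidate step value against the reference; a missing
    step counts as a failure."""
    results = []
    for i, (name, ref) in enumerate(reference):
        ok = i < len(candidate_values) and math.isclose(
            ref, candidate_values[i], rel_tol=rel_tol
        )
        results.append((name, ok))
    return results
```

Seeding the generator (for example `random.Random(7)`) makes an instance reproducible, while fresh seeds give new numbers for the same template, which is what makes memorized answers useless.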

[82] SeaLLMs-Audio: Large Audio-Language Models for Southeast Asia

Chaoqun Liu,Mahani Aljunied,Guizhen Chen,Hou Pong Chan,Weiwen Xu,Yu Rong,Wenxuan Zhang

Main category: cs.CL

TL;DR: SeaLLMs-Audio is the first large audio-language model tailored for multiple Southeast Asian languages. It supports Indonesian, Thai, Vietnamese, English, and Chinese, is multilingual, multimodal, and multi-task, and performs well across a range of audio tasks.

Details Motivation: Advance large audio models for Southeast Asia, fill the gap in multilingual audio-language models for the region, and support diverse speech interaction and understanding tasks. Method: Train SeaLLMs-Audio on a large-scale audio corpus, designed to accept several input modalities (audio only, text only, audio with text) and to support many tasks (such as speech recognition, speech translation, emotion recognition, and speech question answering). The SeaBench-Audio benchmark is built for automated evaluation. Result: SeaLLMs-Audio is competitive on multiple audio tasks in Southeast Asian languages and supports fine-grained audio understanding and voice dialogue; SeaBench-Audio provides a standardized testbed for evaluating LALMs in the region. Conclusion: SeaLLMs-Audio is a significant step towards multilingual audio-language models for Southeast Asia and is expected to benefit both academic research and industry in the region. Abstract: We introduce SeaLLMs-Audio, the first large audio-language model (LALM) tailored for multiple Southeast Asian (SEA) languages, namely Indonesian (id), Thai (th), and Vietnamese (vi), alongside English (en) and Chinese (zh). Trained on a large-scale audio corpus, SeaLLMs-Audio exhibits strong performance across diverse audio-centric tasks, spanning fine-grained audio understanding and voice-based interaction. Its key features include: 1) Multilingual: the model primarily supports 5 languages, namely Indonesian, Thai, Vietnamese, English, and Chinese; 2) Multimodal: the model accepts flexible input modalities, including audio only, text only, as well as audio with text; 3) Multi-task: the model supports a wide range of tasks, including audio analysis tasks such as Audio Captioning, Automatic Speech Recognition, Speech-to-Text Translation, Speech Emotion Recognition, Speech Question Answering, and Speech Summarization. It also enables voice-based dialogue, including answering factual, mathematical, and general knowledge queries. As a significant step towards advancing audio LLMs in Southeast Asia, we expect SeaLLMs-Audio to benefit both the regional research community and industry. To automate LALM evaluation for Southeast Asia, we introduce SeaBench-Audio, a benchmark spanning multiple tasks. Experiments show that SeaLLMs-Audio achieves competitive performance compared with other LALMs on SEA languages.

[83] Open Character Training: Shaping the Persona of AI Assistants through Constitutional AI

Sharan Maiya,Henning Bartsch,Nathan Lambert,Evan Hubinger

Main category: cs.CL

TL;DR: Introduces the first open-source implementation of character training, which uses Constitutional AI and a synthetic introspective data pipeline to shape a chatbot's assistant persona more effectively; compared with constraining system prompts or activation steering, it is more robust and yields more coherent generations without affecting general capabilities.

Details Motivation: The "AI assistant" persona of modern chatbots affects interaction quality, perceived intelligence, and value alignment, yet character training remains understudied in academia. Method: Fine-tune three popular open-weights models with Constitutional AI and a new synthetic introspective data pipeline, using 11 example personas (such as humorous, caring, or malevolent), and introduce a revealed-preferences analysis to assess changes in character. Result: The method is more robust to adversarial prompting, produces more coherent and realistic generations, and has little to no effect on performance on common benchmarks. Conclusion: The character training method shapes an AI assistant's persona effectively and controllably, improves interaction quality while preserving existing capabilities, and is useful for both practical applications and further research. Abstract: The character of the "AI assistant" persona generated by modern chatbot large language models influences both surface-level behavior and apparent values, beliefs, and ethics. These all affect interaction quality, perceived intelligence, and alignment with both developer and user intentions. The shaping of this persona, known as character training, is a critical component of industry post-training, yet remains effectively unstudied in the academic literature. We introduce the first open implementation of character training, leveraging Constitutional AI and a new data pipeline using synthetic introspective data to shape the assistant persona in a more effective and controlled manner than alternatives such as constraining system prompts or activation steering. Specifically, we fine-tune three popular open-weights models using 11 example personas, such as humorous, deeply caring, or even malevolent. To track the effects of our approach, we introduce a method which analyzes revealed preferences, uncovering clear and holistic changes in character. We find these changes are more robust to adversarial prompting than the above two alternatives, while also leading to more coherent and realistic generations. Finally, we demonstrate this fine-tuning has little to no effect on general capabilities as measured by common benchmarks. We describe and open-source our full post-training method, the implementation of which can be found at https://github.com/maiush/OpenCharacterTraining.

[84] Multi-Step Knowledge Interaction Analysis via Rank-2 Subspace Disentanglement

Sekh Mainul Islam,Pepa Atanasova,Isabelle Augenstein

Main category: cs.CL

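TL;DR: Proposes a new rank-2 projection subspace method that more accurately separates the contributions of parametric knowledge (PK) and context knowledge (CK) in large language models, and enables the first multi-step analysis of knowledge interactions in natural language explanations (NLEs). Experiments show that the method captures different knowledge-interaction patterns: hallucinated NLEs rely more on PK, context-faithful NLEs balance PK and CK, and Chain-of-Thought prompting reduces reliance on PK and increases use of CK.

Details Motivation: Understanding how large language models combine parametric and context knowledge when generating natural language explanations is key to assessing how reliable those explanations are. Existing work treats the interaction as a binary choice and is limited to single-step analysis, which overlooks richer forms of interaction. A finer-grained method is needed to study knowledge dynamics across multi-step reasoning. Method: Propose a new rank-2 projection subspace that disentangles the contributions of parametric knowledge (PK) and context knowledge (CK), and apply it to multi-step NLE sequences to identify the role of each knowledge source during generation. Result: Experiments on four QA datasets and three open-weight instruction-tuned LLMs show that a rank-1 subspace represents diverse knowledge interactions poorly, while the proposed rank-2 method captures them; hallucinated explanations align mainly with the PK direction, faithful explanations balance PK and CK, and Chain-of-Thought prompting shifts generations toward CK by reducing reliance on PK. Conclusion: The paper provides the first framework for systematic study of multi-step knowledge interactions in LLMs via rank-2 subspace disentanglement, reveals the dynamic relation between PK and CK in NLE generation, and offers a new tool for making explanations more trustworthy and controllable. Abstract: Natural Language Explanations (NLEs) describe how Large Language Models (LLMs) make decisions, drawing on both external Context Knowledge (CK) and Parametric Knowledge (PK) stored in model weights. Understanding their interaction is key to assessing the grounding of NLEs, yet it remains underexplored. Prior work has largely examined only single-step generation, typically the final answer, and has modelled PK and CK interaction only as a binary choice in a rank-1 subspace. This overlooks richer forms of interaction, such as complementary or supportive knowledge. We propose a novel rank-2 projection subspace that disentangles PK and CK contributions more accurately and use it for the first multi-step analysis of knowledge interactions across longer NLE sequences. Experiments on four QA datasets and three open-weight instruction-tuned LLMs show that diverse knowledge interactions are poorly represented in a rank-1 subspace but are effectively captured in our rank-2 formulation. Our multi-step analysis reveals that hallucinated NLEs align strongly with the PK direction, context-faithful ones balance PK and CK, and Chain-of-Thought prompting for NLEs shifts generated NLEs toward CK by reducing PK reliance. This work provides the first framework for systematic studies of multi-step knowledge interactions in LLMs through a richer rank-2 subspace disentanglement. Code and data: https://github.com/copenlu/pk-ck-knowledge-disentanglement.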
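
A small numeric sketch helps show why a single axis can miss "supportive" interactions that a two-direction subspace keeps apart. It assumes two given direction vectors, `u_pk` and `u_ck` (hypothetical names; how the paper obtains its directions is not stated in the abstract), and solves a least-squares problem for the coefficients of a hidden state in their span.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def rank2_coefficients(h, u_pk, u_ck):
    """Least-squares coefficients (a, b) with h ~ a * u_pk + b * u_ck.
    Solves the 2x2 normal equations, so the directions need not be
    orthogonal, only linearly independent."""
    g11, g12, g22 = dot(u_pk, u_pk), dot(u_pk, u_ck), dot(u_ck, u_ck)
    det = g11 * g22 - g12 * g12
    if abs(det) < 1e-12:
        raise ValueError("PK and CK directions are (nearly) collinear")
    b1, b2 = dot(h, u_pk), dot(h, u_ck)
    return (g22 * b1 - g12 * b2) / det, (g11 * b2 - g12 * b1) / det


def rank1_projection(h, u_pk, u_ck):
    """Rank-1 baseline for contrast: scalar position of h along the single
    axis u_ck - u_pk, so PK and CK can only trade off against each other."""
    d = [c - p for p, c in zip(u_pk, u_ck)]
    return dot(h, d) / dot(d, d)
```

For orthogonal unit directions, a state using both sources equally and a state using neither both land at 0 on the rank-1 axis, while the rank-2 coefficients are (2, 2) and (0, 0) respectively in the test case used to check this sketch.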

[85] Efficient Tool-Calling Multi-Expert NPC Agent for Commonsense Persona-Grounded Dialogue

Mahammad Nuriyev

Main category: cs.CL

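TL;DR: Proposes a multi-expert system based on Qwen3 and LoRA for building NPCs capable of natural dialogue and contextual action execution in interactive environments; it ranked second in the CPDC 2025 challenge.

Details Motivation: Make NPC dialogue more natural and NPC behavior more sensible in complex interactive environments while meeting real-time and resource-efficiency requirements. Method: Use Qwen3 as the base model with LoRA adapters to instantiate three expert modules: tool calling, tool-response interpretation, and direct dialogue. Result: The system runs efficiently on L40S GPUs, responds quickly with modest resource usage, and ranked second overall in the CPDC 2025 challenge. Conclusion: A multi-expert architecture combined with LoRA balances performance and efficiency well and suits NPCs that need both natural dialogue and action execution. Abstract: We present a multi-expert system for creating Non-Player Characters (NPCs) capable of both natural dialogue and contextual action execution in interactive environments. Using Qwen3 as the base model and Low-Rank Adaptation (LoRA) adapters, we instantiate three specialists: tool calling, tool-response interpretation, and direct dialogue. Our system comfortably meets the computational efficiency requirements, delivering fast responses and maintaining modest resource usage on L40S GPUs. In the Commonsense Persona-Grounded Dialogue Challenge 2025, our method ranked second overall. Code available at: https://github.com/MahammadNuriyev62/CPDC-challenge-2025-solution/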

[86] Accumulating Context Changes the Beliefs of Language Models

Jiayi Geng,Howard Chen,Ryan Liu,Manoel Horta Ribeiro,Robb Willer,Graham Neubig,Thomas L. Griffiths

Main category: cs.CL

TL;DR: Studies the risk that language models' beliefs change as context accumulates during extended conversation or reading, and finds that the beliefs and behavior of models such as GPT-5 and Grok 4 shift substantially, exposing potential unreliability in autonomous agent systems.

Details Motivation: As language models are widely used for brainstorming, research, and similar applications, their context keeps accumulating, and a model's understanding of the world (its beliefs) may quietly change without user intervention, leading to inconsistent behavior or drift from the original alignment. Method: Have models take part in multi-round discussions of moral dilemmas or read texts from the opposing political position, and observe changes in their stated beliefs; also design tasks that require tool use to check whether the implicit beliefs behind their behavior change in step. Result: GPT-5's stated beliefs shift by 54.7% after 10 rounds of discussion, and Grok 4's shift by 27.2% after reading opposing political texts; changes in tool-use behavior match the shifts in stated beliefs, indicating that belief shifts carry over into actual behavior. Conclusion: Language model beliefs are highly malleable and can shift substantially after long interactions or reading, which poses a risk for applications that depend on consistency and reliability. Abstract: Language model (LM) assistants are increasingly used in applications such as brainstorming and research. Improvements in memory and context size have allowed these models to become more autonomous, which has also resulted in more text accumulation in their context windows without explicit user intervention. This comes with a latent risk: the belief profiles of models -- their understanding of the world as manifested in their responses or actions -- may silently change as context accumulates. This can lead to subtly inconsistent user experiences, or shifts in behavior that deviate from the original alignment of the models. In this paper, we explore how accumulating context by engaging in interactions and processing text -- talking and reading -- can change the beliefs of language models, as manifested in their responses and behaviors. Our results reveal that models' belief profiles are highly malleable: GPT-5 exhibits a 54.7% shift in its stated beliefs after 10 rounds of discussion about moral dilemmas and queries about safety, while Grok 4 shows a 27.2% shift on political issues after reading texts from the opposing position. We also examine models' behavioral changes by designing tasks that require tool use, where each tool selection corresponds to an implicit belief. We find that these changes align with stated belief shifts, suggesting that belief shifts will be reflected in actual behavior in agentic systems. Our analysis exposes the hidden risk of belief shift as models undergo extended sessions of talking or reading, rendering their opinions and actions unreliable.

[87] Plan-and-Write: Structure-Guided Length Control for LLMs without Model Retraining

Adewale Akinfaderin,Shreyas Subramanian,Akarsha Sehwag

Main category: cs.CL

TL;DR: Proposes a prompt engineering method that needs no model retraining and uses structural guidance to control the output length of large language models precisely; it significantly improves length adherence on several state-of-the-art models and is especially suited to production settings with strict response-length requirements.

Details Motivation: Existing length control methods usually require expensive model retraining or complex inference-time tooling, and an efficient, low-cost alternative is missing. Method: A structure-guided prompt engineering method that adds explicit planning and word-counting mechanisms to the prompt so that the model tracks and keeps to the specified length limit. Result: Experiments on six state-of-the-art LLMs show that the method significantly improves length adherence on document summarization, with some models improving by up to 37.6% under shorter-to-medium length constraints, while output quality is maintained or improved. Conclusion: The method is a plug-and-play, low-cost length control solution, particularly suited to applications where model retraining is not possible. Abstract: Length control in Large Language Models (LLMs) is a crucial but under-addressed challenge, with applications ranging from voice interfaces requiring concise responses to research summaries needing comprehensive outputs. Current approaches to length control, including Regularized DPO, Length-Instruction Fine Tuning, and tool-augmented methods, typically require expensive model retraining or complex inference-time tooling. This paper presents a prompt engineering methodology that enables precise length control without model retraining. Our structure-guided approach implements deliberate planning and word counting mechanisms within the prompt, encouraging the model to carefully track and adhere to specified length constraints. Comprehensive evaluations across six state-of-the-art LLMs demonstrate that our method significantly improves length fidelity for several models compared to standard prompting when applied to document summarization tasks, particularly for shorter-to-medium length constraints. The proposed technique shows varying benefits across different model architectures, with some models demonstrating up to 37.6% improvement in length adherence. Quality evaluations further reveal that our approach maintains or enhances overall output quality compared to standard prompting techniques. Our approach provides an immediately deployable solution for applications requiring precise length control, particularly valuable for production environments where model retraining is impractical or cost-prohibitive.
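
As a rough illustration of "planning and word counting mechanisms within the prompt", the sketch below builds such a prompt and measures how far an output misses its target. The prompt wording, the equal split of the budget across parts, and the deviation metric are assumptions for this example; the paper's actual prompt and its adherence metric are not given in the abstract.

```python
def split_budget(target_words, n_sections):
    """Split a word budget into near-equal integer parts that sum exactly."""
    base, extra = divmod(target_words, n_sections)
    return [base + (1 if i < extra else 0) for i in range(n_sections)]


def build_length_prompt(document, target_words, n_sections=3):
    """Build a hypothetical plan-then-write prompt with explicit counting."""
    budgets = split_budget(target_words, n_sections)
    plan = "\n".join(
        f"{i + 1}. Part {i + 1}: about {b} words" for i, b in enumerate(budgets)
    )
    return (
        f"Summarize the document below in exactly {target_words} words.\n"
        "First write a plan with a word budget per part:\n"
        f"{plan}\n"
        "Then write the summary part by part, and after each part note the "
        "running word count in square brackets, e.g. [34 words so far].\n"
        "Finally, output the summary without the counts.\n\n"
        f"Document:\n{document}"
    )


def length_deviation(text, target_words):
    """Relative miss of the whitespace-token count from the target."""
    return abs(len(text.split()) - target_words) / target_words
```

Because everything lives in the prompt string, the same builder can be pointed at any chat model; only `length_deviation` needs the model's reply.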

[88] KV Cache Transform Coding for Compact Storage in LLM Inference

Konrad Staniszewski,Adrian Łańcucki

Main category: cs.CL

TL;DR: Proposes KVTC, a lightweight transform coder that compresses the key-value cache of large language models, reaching up to 20x compression and outperforming existing methods on multiple benchmarks.

Details Motivation: Serving large language models at scale requires efficient key-value cache management, and conventional cache reuse suffers from high memory use and the need for frequent offloading or recomputation. Method: KVTC combines PCA-based feature decorrelation, adaptive quantization, and entropy coding; it compresses the KV cache after only a brief calibration and leaves model parameters unchanged. Result: Tests on Llama 3, Mistral NeMo, and R1-Qwen 2.5 show up to 20x compression, and more than 40x in specific use cases, while maintaining reasoning and long-context accuracy. Conclusion: KVTC is a practical building block for memory-efficient LLM serving with reusable KV caches. Abstract: Serving large language models (LLMs) at scale necessitates efficient key-value (KV) cache management. KV caches can be reused across conversation turns via shared-prefix prompts that are common in iterative code editing and chat. However, stale caches consume scarce GPU memory, require offloading, or force recomputation. We present KVTC, a lightweight transform coder that compresses KV caches for compact on-GPU and off-GPU storage. Drawing on classical media compression, KVTC combines PCA-based feature decorrelation, adaptive quantization, and entropy coding. It requires only a brief initial calibration and leaves model parameters unchanged. By exploiting redundancies in KV caches, KVTC achieves up to 20$\times$ compression while maintaining reasoning and long-context accuracy, and 40$\times$ or higher for specific use cases. We test KVTC with Llama 3, Mistral NeMo, and R1-Qwen 2.5 models across benchmarks including AIME25, LiveCodeBench, GSM8K, MMLU, Qasper, RULER, and MATH-500. It consistently outperforms inference-time baselines such as token eviction, quantization, and SVD-based methods, while achieving higher compression ratios. These results support KVTC as a practical building block for memory-efficient LLM serving with reusable KV caches.
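
The three stages named in the abstract (decorrelate, quantize, entropy-code) follow the classical transform-coding recipe, which can be shown on a toy 2-D example. This is not KVTC itself: the closed-form 2-D PCA, the caller-chosen step sizes (standing in for adaptive quantization), and the use of empirical entropy as a proxy for the coded size are simplifications assumed here.

```python
import math
from collections import Counter


def fit_pca_2d(points):
    """Calibration: mean and principal-axis angle of 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    cxx = sum((x - mx) ** 2 for x, _ in points) / n
    cyy = sum((y - my) ** 2 for _, y in points) / n
    cxy = sum((x - mx) * (y - my) for x, y in points) / n
    theta = 0.5 * math.atan2(2.0 * cxy, cxx - cyy)
    return (mx, my), theta


def encode(points, mean, theta, steps):
    """Rotate into the PCA basis, then quantize each component uniformly
    with its own step size."""
    c, s = math.cos(theta), math.sin(theta)
    symbols = []
    for x, y in points:
        dx, dy = x - mean[0], y - mean[1]
        u = dx * c + dy * s       # high-variance component
        v = -dx * s + dy * c      # low-variance component
        symbols.append((round(u / steps[0]), round(v / steps[1])))
    return symbols


def decode(symbols, mean, theta, steps):
    """Dequantize and rotate back to the original coordinates."""
    c, s = math.cos(theta), math.sin(theta)
    out = []
    for qu, qv in symbols:
        u, v = qu * steps[0], qv * steps[1]
        out.append((mean[0] + u * c - v * s, mean[1] + u * s + v * c))
    return out


def entropy_bits(symbols):
    """Empirical entropy in bits per symbol, a lower bound on what an
    ideal entropy coder would spend."""
    n = len(symbols)
    return -sum((k / n) * math.log2(k / n) for k in Counter(symbols).values())
```

On strongly correlated inputs almost all variation ends up in the first component, so the second quantizes to a constant and costs close to zero bits; that redundancy is what the decorrelation step exposes.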

[89] Towards Robust Mathematical Reasoning

Thang Luong,Dawsen Hwang,Hoang H. Nguyen,Golnaz Ghiasi,Yuri Chervonyi,Insuk Seo,Junsu Kim,Garrett Bingham,Jonathan Lee,Swaroop Mishra,Alex Zhai,Clara Huiyi Hu,Henryk Michalewski,Jimin Kim,Jeonghyun Ahn,Junhwi Bae,Xingyou Song,Trieu H. Trinh,Quoc V. Le,Junehyuk Jung

Main category: cs.CL

TL;DR: Introduces IMO-Bench, a suite of advanced reasoning benchmarks at International Mathematical Olympiad (IMO) level, comprising IMO-AnswerBench and IMO-Proof Bench, which evaluate short-answer and proof-writing ability and aim to advance the mathematical reasoning of foundation models.

Details Motivation: Existing mathematical reasoning benchmarks are either too easy or focus only on short-answer correctness and lack a systematic evaluation of high-level proof ability, so stricter north-star metrics are needed to drive model progress. Method: Design IMO-Bench, which includes 400 diverse Olympiad problems (IMO-AnswerBench) and a proof evaluation set with detailed grading guidelines (IMO-Proof Bench), and validate automatic grading against human grading; experiments use the Gemini Deep Think model. Result: Gemini Deep Think reaches 80.0% on IMO-AnswerBench and 65.7% on the advanced IMO-Proof Bench, surpassing the best non-Gemini models by 6.9% and 42.4% respectively; the autograders correlate well with human grading, and IMO-GradingBench with 1000 human gradings is released. Conclusion: IMO-Bench provides a more challenging and systematic standard for mathematical reasoning, supports progress on advanced problem solving and proof generation, and may become an important benchmark for future research. Abstract: Finding the right north-star metrics is highly critical for advancing the mathematical reasoning capabilities of foundation models, especially given that existing evaluations are either too easy or only focus on getting correct short answers. To address these issues, we present IMO-Bench, a suite of advanced reasoning benchmarks, vetted by a panel of top specialists, that specifically targets the level of the International Mathematical Olympiad (IMO), the most prestigious venue for young mathematicians. IMO-AnswerBench first tests models on 400 diverse Olympiad problems with verifiable short answers. IMO-Proof Bench is the next-level evaluation for proof-writing capabilities, which includes both basic and advanced IMO level problems as well as detailed grading guidelines to facilitate automatic grading. These benchmarks played a crucial role in our historic achievement of the gold-level performance at IMO 2025 with Gemini Deep Think (Luong and Lockhart, 2025). Our model achieved 80.0% on IMO-AnswerBench and 65.7% on the advanced IMO-Proof Bench, surpassing the best non-Gemini models by large margins of 6.9% and 42.4% respectively. We also show that autograders built with Gemini reasoning correlate well with human evaluations and construct IMO-GradingBench, with 1000 human gradings on proofs, to enable further progress in automatic evaluation of long-form answers. We hope that IMO-Bench will help the community towards advancing robust mathematical reasoning and release it at https://imobench.github.io/.

[90] Tool-to-Agent Retrieval: Bridging Tools and Agents for Scalable LLM Multi-Agent Systems

Elias Lumer,Faheem Nizar,Anmol Gulati,Pradeep Honaganahalli Basavaraju,Vamse Kumar Subbiah

Main category: cs.CL

TL;DR: 本文提出了一种名为“工具到代理检索”(Tool-to-Agent Retrieval)的新框架,通过将工具及其父代理嵌入共享向量空间并利用元数据关联,实现细粒度的工具级或代理级检索,显著提升了召回率和排序效果。

Details Motivation: Existing retrieval methods typically match queries against coarse agent-level descriptions, overlooking the fine-grained functionality of individual tools and leading to poor agent selection. The authors aim to close this semantic gap between tools and agents and improve retrieval precision. Method: The Tool-to-Agent Retrieval framework embeds tools and the agents they belong to in the same vector space and connects them through metadata relationships, supporting flexible retrieval at either the tool level or the agent level and avoiding the context dilution caused by aggregating many tools. Result: Experiments across eight embedding models on the LiveMCPBench benchmark show gains of 19.4% in Recall@5 and 17.7% in nDCG@5 over the previous state-of-the-art agent retrievers. Conclusion: By jointly modeling tool and agent representations, Tool-to-Agent Retrieval achieves more precise retrieval and offers an effective solution for tool routing in large-scale multi-agent systems. Abstract: Recent advances in LLM Multi-Agent Systems enable scalable orchestration of sub-agents, each coordinating hundreds or thousands of tools or Model Context Protocol (MCP) servers. However, existing retrieval methods typically match queries against coarse agent-level descriptions before routing, which obscures fine-grained tool functionality and often results in suboptimal agent selection. We introduce Tool-to-Agent Retrieval, a unified framework that embeds both tools and their parent agents in a shared vector space and connects them through metadata relationships. By explicitly representing tool capabilities and traversing metadata to the agent level, Tool-to-Agent Retrieval enables granular tool-level or agent-level retrieval, ensuring that agents and their underlying tools or MCP servers are equally represented without the context dilution that arises from chunking many tools together. Evaluating Tool-to-Agent Retrieval across eight embedding models, our approach achieves consistent improvements of 19.4% in Recall@5 and 17.7% in nDCG@5 over previous state-of-the-art agent retrievers on the LiveMCPBench benchmark.
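
The shared-space idea and the two reported metrics can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not the authors' implementation: the corpus entries, the 3-dimensional vectors, and the names `CORPUS`, `retrieve_agents`, `recall_at_k`, and `ndcg_at_k` are all hypothetical, and a real system would obtain the vectors from an embedding model. Tools and agents sit in one index; each hit is mapped to its parent agent through a metadata field, and duplicates are dropped.

```python
import math

# Hypothetical toy corpus: tools and agents share one vector space.
# The vectors are made up; "parent" is the metadata link to the owning agent.
CORPUS = [
    {"id": "agent:finance", "parent": "agent:finance", "vec": [0.9, 0.1, 0.0]},
    {"id": "tool:stock_quote", "parent": "agent:finance", "vec": [1.0, 0.0, 0.1]},
    {"id": "agent:devops", "parent": "agent:devops", "vec": [0.0, 0.9, 0.1]},
    {"id": "tool:restart_pod", "parent": "agent:devops", "vec": [0.1, 1.0, 0.0]},
    {"id": "tool:read_logs", "parent": "agent:devops", "vec": [0.0, 0.7, 0.7]},
]


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_agents(query_vec, k=2, n_candidates=4):
    """Rank tools and agents together, then follow metadata up to agents."""
    ranked = sorted(CORPUS, key=lambda e: cosine(query_vec, e["vec"]), reverse=True)
    agents = []
    for entry in ranked[:n_candidates]:
        if entry["parent"] not in agents:  # deduplicate parent agents
            agents.append(entry["parent"])
        if len(agents) == k:
            break
    return agents


def recall_at_k(ranked, relevant, k):
    """Fraction of relevant items found in the top k (relevant must be non-empty)."""
    return sum(1 for item in ranked[:k] if item in relevant) / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance nDCG@k (relevant must be non-empty)."""
    dcg = sum(1 / math.log2(i + 2) for i, item in enumerate(ranked[:k]) if item in relevant)
    ideal = sum(1 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal
```

In this toy setup a query vector close to `tool:read_logs` surfaces `agent:devops` first even though the agent-level vector alone is a weaker match, which is the effect the abstract attributes to representing tools explicitly.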

cs.CV [Back]

[91] Generative human motion mimicking through feature extraction in denoising diffusion settings

Alexander Okupnik,Johannes Schneider,Kyriakos Flouris

Main category: cs.CV

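TL;DR: This paper presents an interactive model built on motion capture data for creative dance interaction between humans and AI. By combining diffusion models, motion inpainting, and motion style transfer, it generates dance movements that are both temporally coherent and responsive to the input motion, and it is the first model to simulate creative dance interaction using only single-person motion data and high-level features.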

Details Motivation: Current large language models lack the embodied nature of human interaction; dance, as a primal form of human expression, can help fill this gap and make human-AI interaction more creative and natural. Method: Using single-person motion capture data, the approach combines ideas from two diffusion models with motion inpainting and motion style transfer, extracting high-level features to generate creative dance motion sequences without relying on low-level human-human interaction data. Result: The feature distribution of the generated motions converges toward that of the test set; the generations are realistic and diverse, creatively enhance the input motion, and remain temporally coherent. Conclusion: The model offers a feasible path toward humans and AI dancing creatively together and is an important first step toward embodied, expressive human-AI collaboration. Abstract: Recent success with large language models has sparked a new wave of verbal human-AI interaction. While such models support users in a variety of creative tasks, they lack the embodied nature of human interaction. Dance, as a primal form of human expression, is predestined to complement this experience. To explore creative human-AI interaction exemplified by dance, we build an interactive model based on motion capture (MoCap) data. It generates an artificial other by partially mimicking and also "creatively" enhancing an incoming sequence of movement data. It is the first model, which leverages single-person motion data and high level features in order to do so and, thus, it does not rely on low level human-human interaction data. It combines ideas of two diffusion models, motion inpainting, and motion style transfer to generate movement representations that are both temporally coherent and responsive to a chosen movement reference. The success of the model is demonstrated by quantitatively assessing the convergence of the feature distribution of the generated samples and the test set which serves as simulating the human performer. We show that our generations are first steps to creative dancing with AI as they are both diverse showing various deviations from the human partner while appearing realistic.

[92] Deep Learning Models for Coral Bleaching Classification in Multi-Condition Underwater Image Datasets

Julio Jerison E. Macrohon,Gordon Hung

Main category: cs.CV

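TL;DR: This study proposes a machine-learning-based coral bleaching classification system, trains and compares three models (ResNet, ViT, and CNN) on a diverse global dataset, and finds that the CNN performs best with 88% accuracy.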

Details Motivation: Coral reefs face growing threats from pollution, ocean acidification, and sea temperature anomalies, so efficient protection and monitoring tools are urgently needed. Method: Using a diverse global dataset of healthy and bleached coral samples, the study benchmarks three computer vision models (ResNet, ViT, and CNN) and improves their performance through comprehensive hyperparameter tuning. Result: After tuning, the CNN model reached the highest accuracy, 88%, outperforming existing benchmarks. Conclusion: The study offers an effective approach to autonomous coral monitoring and a comprehensive analysis of how mainstream computer vision models perform on coral classification. Abstract: Coral reefs support numerous marine organisms and are an important source of coastal protection from storms and floods, representing a major part of marine ecosystems. However coral reefs face increasing threats from pollution, ocean acidification, and sea temperature anomalies, making efficient protection and monitoring heavily urgent. Therefore, this study presents a novel machine-learning-based coral bleaching classification system based on a diverse global dataset with samples of healthy and bleached corals under varying environmental conditions, including deep seas, marshes, and coastal zones. We benchmarked and compared three state-of-the-art models: Residual Neural Network (ResNet), Vision Transformer (ViT), and Convolutional Neural Network (CNN). After comprehensive hyperparameter tuning, the CNN model achieved the highest accuracy of 88%, outperforming existing benchmarks. Our findings offer important insights into autonomous coral monitoring and present a comprehensive analysis of the most widely used computer vision models.

[93] Automating Coral Reef Fish Family Identification on Video Transects Using a YOLOv8-Based Deep Learning Pipeline

Jules Gerard,Leandro Di Bella,Filip Huyghe,Marc Kochzius

Main category: cs.CV

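TL;DR: This study evaluates a YOLOv8-based deep learning pipeline for automated family-level fish identification from video transects collected in Kenya and Tanzania, establishing the first region-specific benchmark for the Western Indian Ocean with an mAP@0.5 of 0.52; the results suggest deep learning can serve as an effective complement to traditional reef monitoring methods.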

Details Motivation: Because underwater visual censuses are labor-intensive, coral reef monitoring in the Western Indian Ocean is limited, and automated tools are needed to improve monitoring efficiency. Method: A YOLOv8 deep learning model is trained and tested on 24 fish families using video transect data collected in Kenya and Tanzania, and its performance is evaluated under different configurations. Result: The best model achieved an mAP@0.5 of 0.52, with high accuracy for abundant families but weaker detection of rare or morphologically complex taxa. Conclusion: Deep learning models show potential as a scalable complement to traditional reef fish monitoring, particularly for large-scale, long-term monitoring programs. Abstract: Coral reef monitoring in the Western Indian Ocean is limited by the labor demands of underwater visual censuses. This work evaluates a YOLOv8-based deep learning pipeline for automating family-level fish identification from video transects collected in Kenya and Tanzania. A curated dataset of 24 families was tested under different configurations, providing the first region-specific benchmark for automated reef fish monitoring in the Western Indian Ocean. The best model achieved mAP@0.5 of 0.52, with high accuracy for abundant families but weaker detection of rare or complex taxa. Results demonstrate the potential of deep learning as a scalable complement to traditional monitoring methods.
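
For readers unfamiliar with the metric: the "@0.5" in mAP@0.5 means a predicted box counts as a true positive only when its intersection-over-union (IoU) with a ground-truth box is at least 0.5. The sketch below shows just that matching criterion; it is not the full mAP computation and not code from the paper, and the function names and box coordinates are illustrative.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2), x1 < x2 and y1 < y2."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, gt_box, threshold=0.5):
    """Matching rule behind the '@0.5' suffix: IoU must reach the threshold."""
    return box_iou(pred_box, gt_box) >= threshold


# A prediction shifted by half the box width overlaps too little (IoU = 1/3),
# while a small shift still counts as a hit (IoU = 2/3).
half_shift = is_true_positive((0, 0, 10, 10), (5, 0, 15, 10))
small_shift = is_true_positive((0, 0, 10, 10), (2, 0, 12, 10))
```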

[94] Mutual Information guided Visual Contrastive Learning

Hanyang Chen,Yanchao Yang

Main category: cs.CV

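TL;DR: This paper proposes a mutual-information-based data augmentation method that uses natural perturbations in the real-world distribution (such as color changes and motion) to select image patches with high mutual information as positive samples, improving the generalization of features learned by contrastive learning.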

Details Motivation: Data augmentation for contrastive learning currently relies on human hypotheses or engineering experience, which may be suboptimal; the paper aims to construct more effective training pairs automatically by computing mutual information from real-world distributions, improving generalization in open environments. Method: A mutual-information-based data selection method treats patches in a scene that exhibit high mutual information under natural perturbations as positive samples, and is combined with several state-of-the-art representation learning frameworks for contrastive learning. Result: The method is validated on several benchmarks and mainstream representation learning frameworks, where it improves feature learning more than conventional augmentation strategies. Conclusion: Mutual-information-based data selection is a promising approach that reduces reliance on hand-designed augmentation strategies and opens a new direction for data pairing in self-supervised learning. Abstract: Representation learning methods utilizing the InfoNCE loss have demonstrated considerable capacity in reducing human annotation effort by training invariant neural feature extractors. Although different variants of the training objective adhere to the information maximization principle between the data and learned features, data selection and augmentation still rely on human hypotheses or engineering, which may be suboptimal. For instance, data augmentation in contrastive learning primarily focuses on color jittering, aiming to emulate real-world illumination changes. In this work, we investigate the potential of selecting training data based on their mutual information computed from real-world distributions, which, in principle, should endow the learned features with better generalization when applied in open environments. Specifically, we consider patches attached to scenes that exhibit high mutual information under natural perturbations, such as color changes and motion, as positive samples for learning with contrastive loss. We evaluate the proposed mutual-information-informed data augmentation method on several benchmarks across multiple state-of-the-art representation learning frameworks, demonstrating its effectiveness and establishing it as a promising direction for future research.
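
As background for the InfoNCE loss mentioned in the abstract, here is a minimal single-anchor sketch in plain Python. It shows only the standard loss, in which the choice of positive sample enters; it does not implement the paper's mutual-information-based patch selection. The feature vectors, the temperature value, and the function names are illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE for one anchor: -log( exp(s_pos / T) / sum_j exp(s_j / T) )."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]


# A well-chosen positive (aligned with the anchor) gives a small loss;
# a poorly chosen one (orthogonal to the anchor) gives a large loss.
good_pair_loss = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
bad_pair_loss = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

The loss depends directly on which sample is declared the positive, which is why the paper's question of how positives are selected matters for the learned features.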

[95] Benchmarking Federated Learning Frameworks for Medical Imaging Deployment: A Comparative Study of NVIDIA FLARE, Flower, and Owkin Substra

Riya Gupta,Alexander Chowdhury,Sahil Nalawade

Main category: cs.CV

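TL;DR: This study evaluates three mainstream federated learning frameworks (NVIDIA FLARE, Flower, and Owkin Substra) for medical imaging applications and finds that each has distinct strengths in performance, scalability, and privacy protection.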

Details Motivation: Federated learning holds great promise for medical AI, but how different frameworks behave in real medical settings is unclear, so a systematic evaluation is needed to guide deployment. Method: Using the PathMNIST dataset, the study benchmarks NVIDIA FLARE, Flower, and Owkin Substra along five dimensions: model performance, convergence efficiency, communication overhead, scalability, and developer experience. Result: NVIDIA FLARE performs best on production scalability, Flower suits prototyping and academic research, and Owkin Substra stands out for privacy protection and compliance. Conclusion: Each of the three frameworks has its own strengths and fits different medical use cases, so the choice should follow the specific requirements. Abstract: Federated Learning (FL) has emerged as a transformative paradigm in medical AI, enabling collaborative model training across institutions without direct data sharing. This study benchmarks three prominent FL frameworks NVIDIA FLARE, Flower, and Owkin Substra to evaluate their suitability for medical imaging applications in real-world settings. Using the PathMNIST dataset, we assess model performance, convergence efficiency, communication overhead, scalability, and developer experience. Results indicate that NVIDIA FLARE offers superior production scalability, Flower provides flexibility for prototyping and academic research, and Owkin Substra demonstrates exceptional privacy and compliance features. Each framework exhibits strengths optimized for distinct use cases, emphasizing their relevance to practical deployment in healthcare environments.

[96] Enhancing rice leaf images: An overview of image denoising techniques

Rupjyoti Chutia,Dibya Jyoti Bora

Main category: cs.CV

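TL;DR: This paper presents a systematic comparative study of image-denoising methods combined with CLAHE for enhancing rice leaf images, with the goal of improving image quality for downstream tasks such as disease detection and growth analysis.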

Details Motivation: Image enhancement is a key preprocessing step in digital image processing and matters particularly in agriculture, for example in rice leaf analysis, but denoising and contrast enhancement need to be combined effectively to improve overall results. Method: Several classical denoising methods are combined with CLAHE (Contrast Limited Adaptive Histogram Equalization), tested on a rice leaf image dataset, and assessed with multiple evaluation metrics. Result: Experiments indicate that combining denoising with CLAHE effectively improves rice leaf image quality; the denoising algorithms differ in how well they preserve detail versus remove noise, and overall the reliability of downstream analysis improves. Conclusion: The study provides a solid basis for assessing digital image processing methods and a useful reference for applying image enhancement in agricultural research and other domains. Abstract: Digital image processing involves the systematic handling of images using advanced computer algorithms, and has gained significant attention in both academic and practical fields. Image enhancement is a crucial preprocessing stage in the image-processing chain, improving image quality and emphasizing features. This makes subsequent tasks (segmentation, feature extraction, classification) more reliable. Image enhancement is essential for rice leaf analysis, aiding in disease detection, nutrient deficiency evaluation, and growth analysis. Denoising followed by contrast enhancement are the primary steps. Image filters, generally employed for denoising, transform or enhance visual characteristics like brightness, contrast, and sharpness, playing a crucial role in improving overall image quality and enabling the extraction of useful information. This work provides an extensive comparative study of well-known image-denoising methods combined with CLAHE (Contrast Limited Adaptive Histogram Equalization) for efficient denoising of rice leaf images. The experiments were performed on a rice leaf image dataset to ensure the data is relevant and representative. Results were examined using various metrics to comprehensively test enhancement methods. This approach provides a strong basis for assessing the effectiveness of methodologies in digital image processing and reveals insights useful for future adaptation in agricultural research and other domains.

[97] Which LiDAR scanning pattern is better for roadside perception: Repetitive or Non-repetitive?

Zhiqi Qi,Runxin Zhao,Hanyang Zhuang,Chunxiang Wang,Ming Yang

Main category: cs.CV

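TL;DR: This paper introduces a new dataset, the "InfraLiDARs' Benchmark," for systematically studying how different LiDAR scanning patterns affect roadside perception. It finds that non-repetitive scanning LiDAR matches 128-line repetitive LiDAR in detection performance at lower cost, making it suitable for practical deployment.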

Details Motivation: Little existing work examines how different LiDAR scanning patterns affect infrastructure-based perception, in particular how repetitive and non-repetitive scanning differ in point cloud distribution and object detection; a systematic analysis is needed to guide the deployment of roadside perception systems. Method: The authors build a roadside LiDAR setup in the CARLA simulation environment that runs repetitive and non-repetitive scanning patterns concurrently, collect and release the "InfraLiDARs' Benchmark" dataset, and statistically analyze how several 3D object detection algorithms perform under each scanning pattern. Result: Non-repetitive scanning LiDAR and 128-line repetitive LiDAR show comparable detection performance across scenarios; although non-repetitive LiDAR has a shorter perception range, its low cost makes it attractive for real deployments. Conclusion: The study clarifies how LiDAR scanning patterns affect perception performance, provides a basis for choosing LiDARs and matching algorithms in roadside perception systems, and releases the dataset to support follow-up research. Abstract: LiDAR-based roadside perception is a cornerstone of advanced Intelligent Transportation Systems (ITS). While considerable research has addressed optimal LiDAR placement for infrastructure, the profound impact of differing LiDAR scanning patterns on perceptual performance remains comparatively under-investigated. The inherent nature of various scanning modes - such as traditional repetitive (mechanical/solid-state) versus emerging non-repetitive (e.g. prism-based) systems - leads to distinct point cloud distributions at varying distances, critically dictating the efficacy of object detection and overall environmental understanding. To systematically investigate these differences in infrastructure-based contexts, we introduce the "InfraLiDARs' Benchmark," a novel dataset meticulously collected in the CARLA simulation environment using concurrently operating infrastructure-based LiDARs exhibiting both scanning paradigms. Leveraging this benchmark, we conduct a comprehensive statistical analysis of the respective LiDAR scanning abilities and evaluate the impact of these distinct patterns on the performance of various leading 3D object detection algorithms. Our findings reveal that non-repetitive scanning LiDAR and the 128-line repetitive LiDAR were found to exhibit comparable detection performance across various scenarios. Despite non-repetitive LiDAR's limited perception range, it's a cost-effective option considering its low price. Ultimately, this study provides insights for setting up roadside perception system with optimal LiDAR scanning patterns and compatible algorithms for diverse roadside applications, and publicly releases the "InfraLiDARs' Benchmark" dataset to foster further research.

[98] World Simulation with Video Foundation Models for Physical AI

NVIDIA,:,Arslan Ali,Junjie Bai,Maciej Bala,Yogesh Balaji,Aaron Blakeman,Tiffany Cai,Jiaxin Cao,Tianshi Cao,Elizabeth Cha,Yu-Wei Chao,Prithvijit Chattopadhyay,Mike Chen,Yongxin Chen,Yu Chen,Shuai Cheng,Yin Cui,Jenna Diamond,Yifan Ding,Jiaojiao Fan,Linxi Fan,Liang Feng,Francesco Ferroni,Sanja Fidler,Xiao Fu,Ruiyuan Gao,Yunhao Ge,Jinwei Gu,Aryaman Gupta,Siddharth Gururani,Imad El Hanafi,Ali Hassani,Zekun Hao,Jacob Huffman,Joel Jang,Pooya Jannaty,Jan Kautz,Grace Lam,Xuan Li,Zhaoshuo Li,Maosheng Liao,Chen-Hsuan Lin,Tsung-Yi Lin,Yen-Chen Lin,Huan Ling,Ming-Yu Liu,Xian Liu,Yifan Lu,Alice Luo,Qianli Ma,Hanzi Mao,Kaichun Mo,Seungjun Nah,Yashraj Narang,Abhijeet Panaskar,Lindsey Pavao,Trung Pham,Morteza Ramezanali,Fitsum Reda,Scott Reed,Xuanchi Ren,Haonan Shao,Yue Shen,Stella Shi,Shuran Song,Bartosz Stefaniak,Shangkun Sun,Shitao Tang,Sameena Tasmeen,Lyne Tchapmi,Wei-Cheng Tseng,Jibin Varghese,Andrew Z. Wang,Hao Wang,Haoxiang Wang,Heng Wang,Ting-Chun Wang,Fangyin Wei,Jiashu Xu,Dinghao Yang,Xiaodong Yang,Haotian Ye,Seonghyeon Ye,Xiaohui Zeng,Jing Zhang,Qinsheng Zhang,Kaiwen Zheng,Andrew Zhu,Yuke Zhu

Main category: cs.CV

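TL;DR: Cosmos-Predict2.5 is a unified world generation model built on a flow-based architecture that supports text-, image-, and video-to-world generation and uses Cosmos-Reason1 for better text grounding and control. With large-scale training and reinforcement-learning-based post-training, it clearly outperforms its predecessor in video quality and instruction alignment, while the companion Cosmos-Transfer2.5 framework provides efficient, high-fidelity Sim2Real and Real2Real translation; together they advance embodied intelligence.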

Details Motivation: To improve the accuracy and controllability of world simulation for Physical AI, to unify generation of dynamic worlds from multimodal inputs (text, image, video) in a single model, and to support the high-quality synthetic data and closed-loop simulation needed by robotics and autonomous systems. Method: The model uses a flow-based architecture that integrates Text2World, Image2World, and Video2World generation and relies on the vision-language model Cosmos-Reason1 for stronger text understanding and control; it is trained on 200M video clips and refined with reinforcement-learning-based post-training; the authors also introduce the lightweight Cosmos-Transfer2.5 framework for Sim2Real and Real2Real translation. Result: Cosmos-Predict2.5, at 2B and 14B scales, substantially improves video generation quality and instruction alignment; Cosmos-Transfer2.5, although 3.5× smaller than its predecessor, produces higher-fidelity and more robust long-horizon video; models, code, and benchmarks are released openly. Conclusion: Cosmos-Predict2.5 and Cosmos-Transfer2.5 form a capable toolset for scaling embodied intelligence, and the open resources are intended to support research and deployment in Physical AI. Abstract: We introduce [Cosmos-Predict2.5], the latest generation of the Cosmos World Foundation Models for Physical AI. Built on a flow-based architecture, [Cosmos-Predict2.5] unifies Text2World, Image2World, and Video2World generation in a single model and leverages [Cosmos-Reason1], a Physical AI vision-language model, to provide richer text grounding and finer control of world simulation. Trained on 200M curated video clips and refined with reinforcement learning-based post-training, [Cosmos-Predict2.5] achieves substantial improvements over [Cosmos-Predict1] in video quality and instruction alignment, with models released at 2B and 14B scales. These capabilities enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems. We further extend the family with [Cosmos-Transfer2.5], a control-net style framework for Sim2Real and Real2Real world translation. Despite being 3.5× smaller than [Cosmos-Transfer1], it delivers higher fidelity and robust long-horizon video generation. Together, these advances establish [Cosmos-Predict2.5] and [Cosmos-Transfer2.5] as versatile tools for scaling embodied intelligence. To accelerate research and deployment in Physical AI, we release source code, pretrained checkpoints, and curated benchmarks under the NVIDIA Open Model License at https://github.com/nvidia-cosmos/cosmos-predict2.5 and https://github.com/nvidia-cosmos/cosmos-transfer2.5. We hope these open resources lower the barrier to adoption and foster innovation in building the next generation of embodied intelligence.

[99] Habitat and Land Cover Change Detection in Alpine Protected Areas: A Comparison of AI Architectures

Harald Kristen,Daniel Kulmer,Manuela Hirschmugl

Main category: cs.CV

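TL;DR: This study applies deep learning to change detection in alpine habitats of Gesaeuse National Park, Austria, comparing post-classification and direct change detection and evaluating several models in a complex, class-imbalanced setting; Clay v1.0 and LiDAR data markedly improve detection accuracy.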

Details Motivation: Alpine ecosystems face rapid climate change and other disturbances and need frequent monitoring, but manual mapping is expensive, and applying existing geospatial foundation models to complex natural environments is hampered by fuzzy class boundaries and class imbalance. Method: The study compares post-classification change detection (using Prithvi-EO-2.0, Clay v1.0, and U-Net) with direct change detection (using ChangeViT and U-Net) on high-resolution multimodal data (RGB, NIR, LiDAR, terrain attributes). Result: Clay v1.0 reaches 51% accuracy for multi-class change detection (U-Net: 41%), and both reach 67% for binary detection; direct change detection gives better binary IoU (0.53 vs 0.35) but only 28% multi-class accuracy; adding LiDAR raises semantic segmentation accuracy from 30% to 50%; cross-temporal validation shows the Clay model is more robust. Conclusion: Geospatial foundation models, Clay v1.0 in particular, combined with LiDAR data have an advantage for change detection in complex alpine environments; overall accuracy is lower than in homogeneous landscapes but closer to realistic conditions, and object-based post-processing and physical constraints could further improve practical use. Abstract: Rapid climate change and other disturbances in alpine ecosystems demand frequent habitat monitoring, yet manual mapping remains prohibitively expensive for the required temporal resolution. We employ deep learning for change detection using long-term alpine habitat data from Gesaeuse National Park, Austria, addressing a major gap in applying geospatial foundation models (GFMs) to complex natural environments with fuzzy class boundaries and highly imbalanced classes. We compare two paradigms: post-classification change detection (CD) versus direct CD. For post-classification CD, we evaluate GFMs Prithvi-EO-2.0 and Clay v1.0 against U-Net CNNs; for direct CD, we test the transformer ChangeViT against U-Net baselines. Using high-resolution multimodal data (RGB, NIR, LiDAR, terrain attributes) covering 4,480 documented changes over 15.3 km², results show Clay v1.0 achieves 51% overall accuracy versus U-Net's 41% for multi-class habitat change, while both reach 67% for binary change detection. Direct CD yields superior IoU (0.53 vs 0.35) for binary but only 28% accuracy for multi-class detection. Cross-temporal evaluation reveals GFM robustness, with Clay maintaining 33% accuracy on 2020 data versus U-Net's 23%. Integrating LiDAR improves semantic segmentation from 30% to 50% accuracy. Although overall accuracies are lower than in more homogeneous landscapes, they reflect realistic performance for complex alpine habitats. Future work will integrate object-based post-processing and physical constraints to enhance applicability.
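
The post-classification paradigm compared above can be illustrated with a toy example: classify each date independently, flag pixels whose predicted class differs, and score the resulting binary change mask with IoU. This is a generic sketch of the idea, not the study's pipeline; the tiny class maps and the function names are hypothetical.

```python
def post_classification_change(map_t0, map_t1):
    """Binary change mask: 1 where the per-date class labels disagree."""
    return [[int(a != b) for a, b in zip(row0, row1)]
            for row0, row1 in zip(map_t0, map_t1)]


def binary_iou(pred, truth):
    """IoU of the 'change' class between two binary masks of equal shape."""
    inter = union = 0
    for row_p, row_t in zip(pred, truth):
        for p, t in zip(row_p, row_t):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter / union if union else 1.0  # both masks empty: perfect agreement


# Hypothetical 2x2 habitat class maps predicted for two dates.
classes_t0 = [[1, 1], [2, 2]]
classes_t1 = [[1, 3], [2, 3]]
predicted_change = post_classification_change(classes_t0, classes_t1)

# Hypothetical reference change mask; one changed pixel is missed.
reference_change = [[0, 1], [1, 1]]
score = binary_iou(predicted_change, reference_change)
```

Because errors from both per-date classifications propagate into the mask, this paradigm can trail a model trained directly on change labels for the binary task, which is consistent with the IoU gap reported in the abstract.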

[100] LeMiCa: Lexicographic Minimax Path Caching for Efficient Diffusion-Based Video Generation

Huanlin Gao,Ping Chen,Fuyuan Shi,Chao Tan,Zhaoxiang Liu,Fang Zhao,Kai Wang,Shiguo Lian

Main category: cs.CV

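TL;DR: LeMiCa is a training-free, efficient acceleration framework for diffusion-based video generation. It models cache scheduling as a directed graph with error-weighted edges and proposes a lexicographic minimax path optimization strategy that controls global error accumulation and markedly improves the consistency of global content and style in generated video.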

Details Motivation: Existing caching strategies mostly reduce local heuristic errors and overlook the accumulation of global error, which degrades content quality after acceleration; a cache scheduling method that controls worst-case path error is needed to improve overall consistency. Method: Cache scheduling is modeled as a directed graph with error-weighted edges, and a Lexicographic Minimax Path Optimization strategy explicitly bounds the worst-case path error to improve global generation consistency. Result: Experiments on multiple text-to-video benchmarks show a 2.9x speedup on the Latte model and an LPIPS of 0.05 on Open-Sora, outperforming prior caching techniques with minimal loss in perceptual quality. Conclusion: LeMiCa is a robust, general paradigm for accelerating diffusion-based video generation that markedly increases inference speed while preserving generation quality, providing a solid basis for efficient and reliable video synthesis. Abstract: We present LeMiCa, a training-free and efficient acceleration framework for diffusion-based video generation. While existing caching strategies primarily focus on reducing local heuristic errors, they often overlook the accumulation of global errors, leading to noticeable content degradation between accelerated and original videos. To address this issue, we formulate cache scheduling as a directed graph with error-weighted edges and introduce a Lexicographic Minimax Path Optimization strategy that explicitly bounds the worst-case path error. This approach substantially improves the consistency of global content and style across generated frames. Extensive experiments on multiple text-to-video benchmarks demonstrate that LeMiCa delivers dual improvements in both inference speed and generation quality. Notably, our method achieves a 2.9x speedup on the Latte model and reaches an LPIPS score of 0.05 on Open-Sora, outperforming prior caching techniques. Importantly, these gains come with minimal perceptual quality degradation, making LeMiCa a robust and generalizable paradigm for accelerating diffusion-based video generation. We believe this approach can serve as a strong foundation for future research on efficient and reliable video synthesis. Our code is available at: https://github.com/UnicomAI/LeMiCa
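
To make "lexicographic minimax path" concrete, the sketch below treats denoising steps as nodes and an edge (a, b) as "compute fully at step a, reuse the cache until step b", weighted by a reuse error. Among all paths with a fixed number of full computations, it picks the one whose edge errors, sorted from largest to smallest, are lexicographically smallest, so the worst edge is minimized first, then the second worst, and so on. This is a brute-force toy under loud assumptions: the abstract does not give the paper's graph construction, error measurement, or solver, and `edge_error`, `budget`, and the quadratic toy error are hypothetical.

```python
from itertools import combinations


def lexi_minimax_path(num_steps, edge_error, budget):
    """Brute-force lexicographic minimax path from step 0 to step num_steps - 1.

    The path must use exactly `budget` edges. Returns (path, sorted_errors),
    or (None, None) if no such path exists.
    """
    last = num_steps - 1
    best_key, best_path = None, None
    for inner in combinations(range(1, last), budget - 1):
        path = (0, *inner, last)
        errors = [edge_error(a, b) for a, b in zip(path, path[1:])]
        key = sorted(errors, reverse=True)  # worst error first
        if best_key is None or key < best_key:  # lexicographic list comparison
            best_key, best_path = key, path
    return best_path, best_key


def toy_error(a, b):
    """Hypothetical reuse error: grows quadratically with the number of skipped steps."""
    return (b - a) ** 2


# Seven denoising steps (0..6) covered by three cache segments.
path, errors = lexi_minimax_path(7, toy_error, budget=3)
```

With this toy error the evenly spaced schedule wins, because any uneven split produces a larger worst-case edge. A real error profile would vary across timesteps, and the enumeration here is exponential, so a practical solver would need something more efficient.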

[101] Self-Improving Vision-Language-Action Models with Data Generation via Residual RL

Wenli Xiao,Haotian Lin,Andy Peng,Haoru Xue,Tairan He,Yuqi Xie,Fengyuan Hu,Jimmy Wu,Zhengyi Luo,Linxi "Jim" Fan,Guanya Shi,Yuke Zhu

Main category: cs.CV

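TL;DR: The paper proposes PLD, a three-stage framework that improves vision-language-action models through residual reinforcement learning and distribution-aware data collection, substantially raising task success rates and generalizing well.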

Details Motivation: Supervised fine-tuning depends on costly human demonstrations, which limits the scalability and generalization of large models, so a more efficient and scalable self-improvement method is needed. Method: A three-stage framework: Stage 1 trains lightweight residual actors to probe the failure regions of the VLA generalist; Stage 2 uses a hybrid rollout scheme to collect data that matches the deployment distribution and includes recovery behaviors; Stage 3 distills the curated trajectories back into the generalist with standard SFT. Result: PLD reaches 99% task success on LIBERO, gains of over 50% in SimplerEnv, and 100% success on real-world Franka and YAM arm manipulation tasks. Ablations show that residual probing and distribution-aware replay are key to improving performance on both seen and unseen tasks. Conclusion: PLD offers a scalable path to self-improvement for vision-language-action models, reducing reliance on human-annotated data while improving performance on seen and unseen tasks. Abstract: Supervised fine-tuning (SFT) has become the de facto post-training strategy for large vision-language-action (VLA) models, but its reliance on costly human demonstrations limits scalability and generalization. We propose Probe, Learn, Distill (PLD), a three-stage plug-and-play framework that improves VLAs through residual reinforcement learning (RL) and distribution-aware data collection. In Stage 1, we train lightweight residual actors to probe failure regions of the VLA generalist. In Stage 2, we use a hybrid rollout scheme that aligns collected trajectories with the generalist's deployment distribution while capturing recovery behaviors. In Stage 3, we distill the curated trajectories back into the generalist with standard SFT. PLD achieves near-saturated 99% task success on LIBERO, over 50% gains in SimplerEnv, and 100% success on real-world Franka and YAM arm manipulation tasks. Ablations show that residual probing and distribution-aware replay are key to collecting deployment-aligned data that improves both seen and unseen tasks, offering a scalable path toward self-improving VLA models.

[102] SpinalSAM-R1: A Vision-Language Multimodal Interactive System for Spine CT Segmentation

Jiaming Liu,Dingwei Fan,Junyong Zhao,Chunlin Li,Haipeng Si,Liang Sun

Main category: cs.CV

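TL;DR: The paper proposes SpinalSAM-R1, a multimodal vision-language interactive system that combines a fine-tuned SAM with DeepSeek-R1 for spine CT image segmentation. An anatomy-guided attention mechanism and a semantics-driven interaction protocol improve performance, enabling efficient and accurate segmentation with low annotation requirements, and the authors develop interactive software supporting several prompt types.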

Details Motivation: Existing segmentation models such as SAM are limited in spine CT segmentation by high annotation requirements and poor domain adaptability, and the task is made harder by low CT contrast and complex vertebral boundaries. Method: SpinalSAM-R1 combines a fine-tuned SAM with the large language model DeepSeek-R1, introduces an anatomy-guided attention mechanism to improve segmentation accuracy, and defines a semantics-driven interaction protocol for natural-language-guided refinement; it is fine-tuned efficiently with LoRA, and the authors build PyQt5-based interactive software supporting point, box, and text prompts. Result: Validated on spine CT images, the model shows superior segmentation performance; the interactive software supports 11 clinical operations with 94.3% command parsing accuracy and response times under 800 ms. Conclusion: SpinalSAM-R1 improves both segmentation performance and interactivity for spine CT images and shows good potential for clinical use, and the released software supports further research and application. Abstract: The anatomical structure segmentation of the spine and adjacent structures from computed tomography (CT) images is a key step for spinal disease diagnosis and treatment. However, the segmentation of CT images is impeded by low contrast and complex vertebral boundaries. Although advanced models such as the Segment Anything Model (SAM) have shown promise in various segmentation tasks, their performance in spinal CT imaging is limited by high annotation requirements and poor domain adaptability. To address these limitations, we propose SpinalSAM-R1, a multimodal vision-language interactive system that integrates a fine-tuned SAM with DeepSeek-R1, for spine CT image segmentation. Specifically, our SpinalSAM-R1 introduces an anatomy-guided attention mechanism to improve spine segmentation performance, and a semantics-driven interaction protocol powered by DeepSeek-R1, enabling natural language-guided refinement. The SpinalSAM-R1 is fine-tuned using Low-Rank Adaptation (LoRA) for efficient adaptation. We validate our SpinalSAM-R1 on the spine anatomical structure with CT images. Experimental results suggest that our method achieves superior segmentation performance. Meanwhile, we develop a PyQt5-based interactive software, which supports point, box, and text-based prompts. The system supports 11 clinical operations with 94.3% parsing accuracy and sub-800 ms response times. The software is released on https://github.com/6jm233333/spinalsam-r1.

[103] A filtering scheme for confocal laser endomicroscopy (CLE)-video sequences for self-supervised learning

Nils Porsche,Flurin Müller-Diesing,Sweta Banerjee,Miguel Goncalves,Marc Aubreville

Main category: cs.CV

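TL;DR: The paper proposes a filtering method for confocal laser endomicroscopy (CLE) video sequences to improve the efficiency and performance of self-supervised learning (SSL) pretraining; on two medical datasets it markedly improves classification accuracy and cuts training time by 67%.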

Details Motivation: CLE images are hard to interpret and labeled data is scarce, so machine learning models tend to overfit; existing SSL methods suffer from a non-stratified data distribution caused by high inter-frame correlation in the videos, which hurts training efficiency. Method: The authors design a filter for CLE video sequences that reduces data redundancy in SSL training, and evaluate four state-of-the-art baseline networks and an SSL teacher-student network with a vision transformer small backbone on downstream tasks for a sinonasal tumor dataset and a squamous cell carcinoma of the skin dataset. Result: The filtered SSL-pretrained model achieves test accuracies of 67.48% and 73.52% on the two datasets, considerably outperforming the non-SSL baselines, and reduces training time by 67%. Conclusion: SSL is an effective pretraining method for CLE, and the proposed video filtering strategy markedly improves SSL training efficiency and model performance, supporting automated diagnosis from CLE images. Abstract: Confocal laser endomicroscopy (CLE) is a non-invasive, real-time imaging modality that can be used for in-situ, in-vivo imaging and the microstructural analysis of mucous structures. The diagnosis using CLE is, however, complicated by images being hard to interpret for non-experienced physicians. Utilizing machine learning as an augmentative tool would hence be beneficial, but is complicated by the shortage of histopathology-correlated CLE imaging sequences with respect to the plurality of patterns in this domain, leading to overfitting of machine learning models. To overcome this, self-supervised learning (SSL) can be employed on larger unlabeled datasets. CLE is a video-based modality with high inter-frame correlation, leading to a non-stratified data distribution for SSL training. In this work, we propose a filter functionality on CLE video sequences to reduce the dataset redundancy in SSL training and improve SSL training convergence and training efficiency. We use four state-of-the-art baseline networks and a SSL teacher-student network with a vision transformer small backbone for the evaluation. These networks were evaluated on downstream tasks for a sinonasal tumor dataset and a squamous cell carcinoma of the skin dataset. On both datasets, we found the highest test accuracy on the filtered SSL-pretrained model, with 67.48% and 73.52%, both considerably outperforming their non-SSL baselines. Our results show that SSL is an effective method for CLE pretraining. Further, we show that our proposed CLE video filter can be utilized to improve training efficiency in self-supervised scenarios, resulting in a reduction of 67% in training time.
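
The abstract does not describe how the CLE filter works, so the following is only a generic sketch of redundancy filtering for highly correlated video frames, not the authors' method. It keeps a frame only when it differs enough from the last kept frame; the mean-absolute-difference criterion, the threshold, and the flattened toy frames are assumptions made for illustration.

```python
def mean_abs_diff(frame_a, frame_b):
    """Mean absolute pixel difference between two flattened frames of equal size."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


def filter_redundant_frames(frames, min_diff=10.0):
    """Return indices of frames that differ from the last kept frame by >= min_diff."""
    kept = []
    for idx, frame in enumerate(frames):
        if not kept or mean_abs_diff(frames[kept[-1]], frame) >= min_diff:
            kept.append(idx)
    return kept


# Hypothetical 4-pixel grayscale frames: frames 1 and 3 are near-duplicates
# of their predecessors and are dropped.
toy_video = [
    [0, 0, 0, 0],
    [1, 0, 1, 0],
    [20, 20, 20, 20],
    [21, 20, 19, 20],
    [50, 50, 50, 50],
]
kept_indices = filter_redundant_frames(toy_video)
```

Dropping near-duplicate frames shrinks the pretraining set without removing much visual variety, which is the kind of redundancy reduction the abstract credits for the shorter training time.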

[104] FreeSliders: Training-Free, Modality-Agnostic Concept Sliders for Fine-Grained Diffusion Control in Images, Audio, and Video

Rotem Ezra,Hedi Zisling,Nimrod Berman,Ilan Naiman,Alexey Gorkor,Liran Nochumsohn,Eliya Nachmani,Omri Azencot

Main category: cs.CV

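TL;DR: This paper proposes FreeSliders, a training-free, modality-agnostic method for fine-grained concept control in generation. By partially estimating the Concept Sliders formula at inference time, it enables plug-and-play semantic editing across images, video, and audio; the authors also extend the benchmark to multiple modalities and propose new evaluation metrics and a two-stage automatic calibration procedure to improve edit quality.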

Details Motivation: Existing Concept Sliders methods require per-concept training and model fine-tuning, which limits their scalability to new modalities and makes truly flexible fine-grained controllable generation difficult. Method: FreeSliders partially estimates the Concept Sliders formula during inference and avoids training entirely; the authors introduce an extended multimodal (image, video, audio) evaluation benchmark, propose three evaluation properties with corresponding metrics, and design a two-stage procedure that automatically detects saturation points and reparameterizes the traversal for perceptually uniform, semantically meaningful edits. Result: Experiments show that FreeSliders improves over existing baselines across modalities, provides training-free plug-and-play concept control with high-quality fine-grained semantic editing, and establishes the first suite for evaluating multimodal controllable generation. Conclusion: FreeSliders offers an effective solution for training-free, cross-modal fine-grained controllable generation and advances the generality and practicality of controllable generative models. Abstract: Diffusion models have become state-of-the-art generative models for images, audio, and video, yet enabling fine-grained controllable generation, i.e., continuously steering specific concepts without disturbing unrelated content, remains challenging. Concept Sliders (CS) offer a promising direction by discovering semantic directions through textual contrasts, but they require per-concept training and architecture-specific fine-tuning (e.g., LoRA), limiting scalability to new modalities. In this work we introduce FreeSliders, a simple yet effective approach that is fully training-free and modality-agnostic, achieved by partially estimating the CS formula during inference. To support modality-agnostic evaluation, we extend the CS benchmark to include both video and audio, establishing the first suite for fine-grained concept generation control with multiple modalities. We further propose three evaluation properties along with new metrics to improve evaluation quality. Finally, we identify an open problem of scale selection and non-linear traversals and introduce a two-stage procedure that automatically detects saturation points and reparameterizes traversal for perceptually uniform, semantically meaningful edits. Extensive experiments demonstrate that our method enables plug-and-play, training-free concept control across modalities, improves over existing baselines, and establishes new tools for principled controllable generation. An interactive presentation of our benchmark and method is available at: https://azencot-group.github.io/FreeSliders/

[105] AI Powered High Quality Text to Video Generation with Enhanced Temporal Consistency

Piyushkumar Patel

Main category: cs.CV

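TL;DR: MOVAI is a new hierarchical framework that integrates scene understanding with temporally aware diffusion models, markedly improving the quality and temporal consistency of text-to-video generation.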

Details Motivation: Existing text-to-video methods fall short in temporal consistency, compositional understanding, and fine-grained control, which limits generation quality for complex scenes. Method: The MOVAI framework has three core modules: a Compositional Scene Parser (CSP), a Temporal-Spatial Attention Mechanism (TSAM), and a Progressive Video Refinement (PVR) module, which together provide hierarchical modeling of complex scenes and multi-scale temporal refinement. Result: On standard benchmarks, MOVAI improves LPIPS by 15.3%, FVD by 12.7%, and user preference by 18.9% over existing methods, and performs especially well on complex multi-object scenes. Conclusion: By introducing hierarchical semantic parsing and joint spatio-temporal modeling, MOVAI addresses temporal coherence and fine-grained control in text-to-video generation and advances high-quality video synthesis. Abstract: Text to video generation has emerged as a critical frontier in generative artificial intelligence, yet existing approaches struggle with maintaining temporal consistency, compositional understanding, and fine grained control over visual narratives. We present MOVAI (Multimodal Original Video AI), a novel hierarchical framework that integrates compositional scene understanding with temporal aware diffusion models for high fidelity text to video synthesis. Our approach introduces three key innovations: (1) a Compositional Scene Parser (CSP) that decomposes textual descriptions into hierarchical scene graphs with temporal annotations, (2) a Temporal-Spatial Attention Mechanism (TSAM) that ensures coherent motion dynamics across frames while preserving spatial details, and (3) a Progressive Video Refinement (PVR) module that iteratively enhances video quality through multi-scale temporal reasoning. Extensive experiments on standard benchmarks demonstrate that MOVAI achieves state-of-the-art performance, improving video quality metrics by 15.3% in LPIPS, 12.7% in FVD, and 18.9% in user preference studies compared to existing methods. Our framework shows particular strength in generating complex multi-object scenes with realistic temporal dynamics and fine-grained semantic control.

[106] Chain of Time: In-Context Physical Simulation with Image Generation Models

YingQiao Wang,Eric Bigelow,Boyi Li,Tomer Ullman

Main category: cs.CV

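TL;DR: Proposes a cognitively inspired "Chain of Time" method for improving and interpreting physical simulation in vision-language models. It requires no additional fine-tuning, substantially improves performance in both synthetic and real-world settings, and reveals the dynamics and limitations of the model's temporal physical reasoning.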

Details Motivation: Inspired by in-context reasoning in machine learning and mental simulation in humans, the work aims to improve the performance of vision-language models on physical simulation and make them more interpretable. Method: The "Chain of Time" method performs physical simulation by generating a sequence of intermediate images at inference time, with no additional fine-tuning. Result: Validated across domains including 2-D graphics simulations and natural 3-D videos, it substantially improves the performance of a state-of-the-art image generation model; the analysis reveals the model's ability to simulate temporal physical properties such as velocity, gravity, and collisions, as well as its difficulty in inferring certain physical parameters. Conclusion: Chain of Time effectively strengthens the physical simulation ability of vision-language models and provides insight into the model's internal dynamics, exposing strengths and limitations that traditional evaluations miss. Abstract: We propose a novel cognitively-inspired method to improve and interpret physical simulation in vision-language models. Our "Chain of Time" method involves generating a series of intermediate images during a simulation, and it is motivated by in-context reasoning in machine learning, as well as mental simulation in humans. Chain of Time is used at inference time, and requires no additional fine-tuning. We apply the Chain-of-Time method to synthetic and real-world domains, including 2-D graphics simulations and natural 3-D videos. These domains test a variety of particular physical properties, including velocity, acceleration, fluid dynamics, and conservation of momentum. We found that using Chain-of-Time simulation substantially improves the performance of a state-of-the-art image generation model. Beyond examining performance, we also analyzed the specific states of the world simulated by an image model at each time step, which sheds light on the dynamics underlying these simulations. This analysis reveals insights that are hidden from traditional evaluations of physical reasoning, including cases where an image generation model is able to simulate physical properties that unfold over time, such as velocity, gravity, and collisions. Our analysis also highlights particular cases where the image generation model struggles to infer particular physical parameters from input images, despite being capable of simulating relevant physical processes.

[107] End-to-End Framework Integrating Generative AI and Deep Reinforcement Learning for Autonomous Ultrasound Scanning

Hanae Elmekki,Amanda Spilkin,Ehsan Zakeri,Antonela Mariel Zanuttini,Ahmed Alagha,Hani Sami,Jamal Bentahar,Lyes Kadem,Wen-Fang Xie,Philippe Pibarot,Rabeb Mizouni,Hadi Otrok,Azzam Mourad,Sami Muhaidat

Main category: cs.CV

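TL;DR: This paper proposes an end-to-end framework that combines generative AI with deep reinforcement learning (DRL) to enable reproducible, automated cardiac ultrasound scanning.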

Details Motivation: Existing DRL methods for cardiac ultrasound scanning lack reproducibility, rely on proprietary data, and use simplified models, which limits their practical use. Operator dependence and a shortage of trained professionals further increase the need for automated solutions. Method: The framework has two parts: a conditional generative simulator that combines GANs with VAEs to produce realistic action-conditioned ultrasound images, and a DRL module trained in this simulator to learn autonomous, accurate scanning policies. A public dataset of real cardiac ultrasound scans is also released to ensure reproducibility. Result: Experiments show that the proposed VAE-GAN outperforms existing GAN variants in both qualitative and quantitative evaluations, that the DRL scanning system is effective under varying configurations, and that the framework supports image-type classification, quality assessment, and conditional generation. Conclusion: This work presents the first reproducible, end-to-end automated cardiac ultrasound scanning framework based on generative AI and DRL; it is extensible and can be generalized to automated imaging of other organs. Abstract: Cardiac ultrasound (US) is among the most widely used diagnostic tools in cardiology for assessing heart health, but its effectiveness is limited by operator dependence, time constraints, and human error. The shortage of trained professionals, especially in remote areas, further restricts access. These issues underscore the need for automated solutions that can ensure consistent and accessible cardiac imaging regardless of operator skill or location. Recent progress in artificial intelligence (AI), especially in deep reinforcement learning (DRL), has gained attention for enabling autonomous decision-making. However, existing DRL-based approaches to cardiac US scanning lack reproducibility, rely on proprietary data, and use simplified models. Motivated by these gaps, we present the first end-to-end framework that integrates generative AI and DRL to enable autonomous and reproducible cardiac US scanning. The framework comprises two components: (i) a conditional generative simulator combining Generative Adversarial Networks (GANs) with Variational Autoencoders (VAEs), that models the cardiac US environment producing realistic action-conditioned images; and (ii) a DRL module that leverages this simulator to learn autonomous, accurate scanning policies. The proposed framework delivers AI-driven guidance through expert-validated models that classify image type and assess quality, supports conditional generation of realistic US images, and establishes a reproducible foundation extendable to other organs. To ensure reproducibility, a publicly available dataset of real cardiac US scans is released. The solution is validated through several experiments. The VAE-GAN is benchmarked against existing GAN variants, with performance assessed using qualitative and quantitative approaches, while the DRL-based scanning system is evaluated under varying configurations to demonstrate effectiveness.

[108] VLM6D: VLM based 6Dof Pose Estimation based on RGB-D Images

Md Selim Sarowar,Sungho Kim

Main category: cs.CV

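TL;DR: Proposes VLM6D, a dual-stream architecture that combines the strengths of RGB and depth data for robust and precise 6D object pose estimation, reaching SOTA performance on the Occluded-LineMOD dataset.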

Details Motivation: When generalizing from synthetic data to real scenes, existing methods remain fragile under lighting changes, textureless objects, and severe occlusion, and struggle to deliver robust and precise 6D pose estimation. Method: The dual-stream VLM6D architecture processes RGB images with a self-supervised Vision Transformer (DINOv2), whose prior knowledge of visual structure improves robustness to lighting and texture variations, and processes the point cloud derived from depth with PointNet++, which improves reasoning over occluded and sparse geometric data; the two feature streams are then fused for multi-task prediction. Result: The method sets a new SOTA on the Occluded-LineMOD dataset, validating its robustness and accuracy in complex real-world scenes. Conclusion: By effectively fusing visual and geometric feature streams, VLM6D markedly improves the generalization and robustness of 6D pose estimation in real scenes, especially under severe occlusion, lighting changes, and textureless conditions. Abstract: A primary challenge in computer vision is precisely estimating the 6D pose of objects; however, many current approaches are still fragile and have trouble generalizing from synthetic data to real-world situations with fluctuating lighting, textureless objects, and significant occlusions. To address these limitations, we propose VLM6D, a novel dual-stream architecture that leverages the distinct strengths of visual and geometric data from RGB-D input for robust and precise pose estimation. Our framework uniquely integrates two specialized encoders: a powerful, self-supervised Vision Transformer (DINOv2) processes the RGB modality, harnessing its rich, pre-trained understanding of visual grammar to achieve remarkable resilience against texture and lighting variations. Concurrently, a PointNet++ encoder processes the 3D point cloud derived from depth data, enabling robust geometric reasoning that excels even with the sparse, fragmented data typical of severe occlusion. These complementary feature streams are effectively fused to inform a multi-task prediction head. We demonstrate through comprehensive experiments that VLM6D achieves new SOTA performance on the challenging Occluded-LineMOD dataset, validating its superior robustness and accuracy.

[109] Integrating ConvNeXt and Vision Transformers for Enhancing Facial Age Estimation

Gaby Maroun,Salah Eddine Bekhouche,Fadi Dornaika

Main category: cs.CV

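TL;DR: Proposes a hybrid architecture combining ConvNeXt and Vision Transformers for facial age estimation. It achieves lower mean absolute error (MAE) on several benchmark datasets, and ablation studies confirm the contribution of its key components.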

Details Motivation: To improve age estimation accuracy by exploiting both the local feature extraction of CNNs and the global attention of Transformers, addressing the limitations of traditional methods on complex visual tasks. Method: A ConvNeXt-ViT hybrid architecture built on pre-trained models, optimized with linear layers and advanced regularization techniques; different configurations are explored systematically, and adapted attention mechanisms are introduced to sharpen the focus on age-relevant facial features. Result: The model achieves strong performance on the MORPH II, CACD, and AFAD datasets with notably lower MAE; ablation experiments show that each component, and the adapted attention mechanism in particular, is important to the gains. Conclusion: The ConvNeXt-ViT hybrid not only outperforms traditional methods but also provides a strong framework for age estimation and other visual tasks, demonstrating the potential of fusing CNNs and Transformers for complex computer vision problems. Abstract: Age estimation from facial images is a complex and multifaceted challenge in computer vision. In this study, we present a novel hybrid architecture that combines ConvNeXt, a state-of-the-art advancement of convolutional neural networks (CNNs), with Vision Transformers (ViT). While each model independently delivers excellent performance on a variety of tasks, their integration leverages the complementary strengths of the CNN's localized feature extraction capabilities and the Transformer's global attention mechanisms. Our proposed ConvNeXt-ViT hybrid solution was thoroughly evaluated on benchmark age estimation datasets, including MORPH II, CACD, and AFAD, and achieved superior performance in terms of mean absolute error (MAE). To address computational constraints, we leverage pre-trained models and systematically explore different configurations, using linear layers and advanced regularization techniques to optimize the architecture. Comprehensive ablation studies highlight the critical role of individual components and training strategies, and in particular emphasize the importance of adapted attention mechanisms within the CNN framework to improve the model's focus on age-relevant facial features. The results show that the ConvNeXt-ViT hybrid not only outperforms traditional methods, but also provides a robust foundation for future advances in age estimation and related visual tasks. This work underscores the transformative potential of hybrid architectures and represents a promising direction for the seamless integration of CNNs and transformers to address complex computer vision challenges.

[110] FLoC: Facility Location-Based Efficient Visual Token Compression for Long Video Understanding

Janghoon Cho,Jungsoo Lee,Munawar Hayat,Kyuwoong Hwang,Fatih Porikli,Sungha Choi

Main category: cs.CV

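TL;DR: This paper proposes FLoC, an efficient visual token compression framework based on the facility location function, to address the scalability problem caused by the excessive number of visual tokens in long video understanding. The method is training-free, model-agnostic, and query-agnostic; it quickly selects a compact, representative, and diverse subset of tokens and outperforms existing compression techniques on several large-scale benchmarks.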

Details Motivation: Long videos produce a very large number of visual tokens, so existing video LMMs scale poorly to long inputs, which limits their use. An efficient, general visual token compression method is therefore needed. Method: The FLoC framework uses the facility location function with the lazy greedy algorithm to quickly select the most representative and diverse subset of visual tokens within a predefined token budget; it requires no training and applies to a variety of video LMMs. Result: On large-scale benchmarks including Video-MME, MLVU, and LongVideoBench, FLoC clearly outperforms recent compression methods, drastically reducing the number of visual tokens and improving processing speed while maintaining near-optimal performance. Conclusion: FLoC is an efficient, general, training-free visual token compression method that addresses the scalability bottleneck in long video understanding, with good robustness and practical value. Abstract: Recent studies in long video understanding have harnessed the advanced visual-language reasoning capabilities of Large Multimodal Models (LMMs), driving the evolution of video-LMMs specialized for processing extended video sequences. However, the scalability of these models is severely limited by the overwhelming volume of visual tokens generated from extended video sequences. To address this challenge, this paper proposes FLoC, an efficient visual token compression framework based on the facility location function, a principled approach that swiftly selects a compact yet highly representative and diverse subset of visual tokens within a predefined budget on the number of visual tokens. By integrating the lazy greedy algorithm, our method achieves remarkable efficiency gains by swiftly selecting a compact subset of tokens, drastically reducing the number of visual tokens while guaranteeing near-optimal performance. Notably, our approach is training-free, model-agnostic, and query-agnostic, providing a versatile solution that seamlessly integrates with diverse video-LLMs and existing workflows. Extensive evaluations on large-scale benchmarks, such as Video-MME, MLVU, and LongVideoBench, demonstrate that our framework consistently surpasses recent compression techniques, highlighting not only its effectiveness and robustness in addressing the critical challenges of long video understanding, but also its efficiency in processing speed.
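
The selection step described in the FLoC abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes cosine similarity (clipped at zero) between token embeddings as the similarity inside the facility location objective F(S) = sum_i max_{j in S} sim(i, j), and the names `cosine` and `lazy_greedy_facility_location` are hypothetical. Because the objective is monotone submodular, marginal gains can only shrink as the selected set grows; lazy greedy exploits this by re-evaluating a candidate only when it reaches the top of a priority queue.

```python
import heapq


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0


def lazy_greedy_facility_location(tokens, budget):
    """Pick up to `budget` token indices maximizing sum_i max_{j in S} sim(i, j).

    Assumption: similarity is cosine clipped at 0, which keeps the objective
    monotone submodular so that stale gains are valid upper bounds.
    """
    n = len(tokens)
    sim = [[max(0.0, cosine(tokens[i], tokens[j])) for j in range(n)]
           for i in range(n)]
    best = [0.0] * n  # best[i]: similarity of token i to its closest selected token

    def gain(j):
        # Marginal increase of the objective if token j were added now.
        return sum(max(sim[i][j] - best[i], 0.0) for i in range(n))

    # Max-heap (via negated gains) of possibly stale upper bounds.
    heap = [(-gain(j), j) for j in range(n)]
    heapq.heapify(heap)

    selected = []
    while heap and len(selected) < budget:
        _, j = heapq.heappop(heap)
        fresh = gain(j)  # refresh only the current top candidate
        if not heap or fresh >= -heap[0][0]:
            # The fresh gain beats every other upper bound: j is the greedy choice.
            selected.append(j)
            for i in range(n):
                best[i] = max(best[i], sim[i][j])
        else:
            heapq.heappush(heap, (-fresh, j))
    return selected


if __name__ == "__main__":
    # Two near-duplicate pairs of toy "token embeddings".
    toy = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]
    print(lazy_greedy_facility_location(toy, budget=2))
```

With a budget of 2 on four tokens that form two near-duplicate pairs, the sketch keeps one token from each pair, which is the representative-and-diverse behaviour the abstract describes.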

[111] BlurGuard: A Simple Approach for Robustifying Image Protection Against AI-Powered Editing

Jinsu Kim,Yunhun Nam,Minseon Kim,Sangpil Kim,Jongheon Jeong

Main category: cs.CV

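TL;DR: This paper proposes a simple method that applies adaptive per-region Gaussian blur to make protective adversarial noise harder to reverse, improving the robustness of existing image protection techniques against a range of noise reversal methods.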

Details Motivation: Existing adversarial-noise protection methods are easily reversed by simple techniques such as JPEG compression and therefore offer little practical security; protection needs to be harder to reverse and harder to detect. Method: Apply an adaptive per-region Gaussian blur to the adversarial noise to adjust its frequency spectrum, making the noise harder to detect and remove. Result: Experiments show that, across diverse image editing scenarios and reversal techniques, the method markedly improves the worst-case protection performance of existing methods while reducing the perceptual quality degradation caused by the noise. Conclusion: The work introduces irreversibility as an important criterion for image protection and shows that frequency-domain modulation can strengthen the practical protective power of adversarial noise. Abstract: Recent advances in text-to-image models have increased the exposure of powerful image editing techniques as a tool, raising concerns about their potential for malicious use. An emerging line of research to address such threats focuses on implanting "protective" adversarial noise into images before their public release, so future attempts to edit them using text-to-image models can be impeded. However, subsequent works have shown that these adversarial noises are often easily "reversed," e.g., with techniques as simple as JPEG compression, casting doubt on the practicality of the approach. In this paper, we argue that adversarial noise for image protection should not only be imperceptible, as has been a primary focus of prior work, but also irreversible, viz., it should be difficult to detect as noise provided that the original image is hidden. We propose a surprisingly simple method to enhance the robustness of image protection methods against noise reversal techniques. Specifically, it applies an adaptive per-region Gaussian blur on the noise to adjust the overall frequency spectrum. Through extensive experiments, we show that our method consistently improves the per-sample worst-case protection performance of existing methods against a wide range of reversal techniques on diverse image editing scenarios, while also reducing quality degradation due to noise in terms of perceptual metrics. Code is available at https://github.com/jsu-kim/BlurGuard.

[112] CompAgent: An Agentic Framework for Visual Compliance Verification

Rahul Ghosh,Baishali Chaudhury,Hari Prasanna Das,Meghana Ashok,Ryan Razkenari,Sungmin Hong,Chun-Hao Liu

Main category: cs.CV

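TL;DR: This paper proposes CompAgent, the first agentic framework for visual compliance verification. By combining multi-modal large language models with visual tools such as object detectors and face analyzers, it performs fine-grained review of visual content against complex policy rules and outperforms existing methods on several benchmarks.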

Details Motivation: Existing methods depend on manually labeled data and task-specific models, which are costly and generalize poorly; current multi-modal large models have broad knowledge but fall short on fine-grained visual reasoning and on applying structured rules. Method: The CompAgent framework contains a planning agent and a verification agent: the planning agent dynamically selects suitable visual tools (such as NSFW detectors and captioning models) based on the compliance policy, and the verification agent integrates the image, tool outputs, and policy context to perform multi-modal reasoning. Result: On public benchmarks, CompAgent outperforms specialized classifiers, direct MLLM prompting, and curated routing baselines, reaching up to 76% F1 on the UnsafeBench dataset, a 10% improvement over the state of the art. Conclusion: Agentic planning and tool-augmented reasoning offer better scalability, accuracy, and adaptability for visual compliance verification and provide an effective new paradigm for content moderation. Abstract: Visual compliance verification is a critical yet underexplored problem in computer vision, especially in domains such as media, entertainment, and advertising where content must adhere to complex and evolving policy rules. Existing methods often rely on task-specific deep learning models trained on manually labeled datasets, which are costly to build and limited in generalizability. While recent multi-modal large language models (MLLMs) offer broad real-world knowledge and policy understanding, they struggle to reason over fine-grained visual details and apply structured compliance rules effectively on their own. In this paper, we propose CompAgent, the first agentic framework for visual compliance verification. CompAgent augments MLLMs with a suite of visual tools - such as object detectors, face analyzers, NSFW detectors, and captioning models - and introduces a planning agent that dynamically selects appropriate tools based on the compliance policy. A verification agent then integrates image, tool outputs, and policy context to perform multi-modal reasoning. Experiments on public benchmarks show that CompAgent outperforms specialized classifiers, direct MLLM prompting, and curated routing baselines, achieving up to 76% F1 score and a 10% improvement over the state-of-the-art on the UnsafeBench dataset. Our results demonstrate the effectiveness of agentic planning and tool-augmented reasoning for scalable, accurate, and adaptable visual compliance verification.

[113] From Evidence to Verdict: An Agent-Based Forensic Framework for AI-Generated Image Detection

Mengfei Liang,Yiting Qu,Yukun Jiang,Michael Backes,Yang Zhang

Main category: cs.CV

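TL;DR: This paper proposes AIFo, a multi-agent framework for detecting AI-generated images. By combining multiple forensic tools with a structured debate mechanism, it reasons over cross-source evidence to deliver accurate (97.05%) and interpretable detection.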

Details Motivation: Existing AI-generated image detection methods are limited in interpretability, generalization, and integration of multi-source information, and struggle with rapidly evolving generative models and complex real-world scenarios. Method: A training-free multi-agent framework in which LLM-driven agents coordinate forensic tools including reverse image search, metadata extraction, pre-trained classifiers, and vision-language models, with a multi-agent debate mechanism and a memory-augmented reasoning module for evidence fusion and decision-making. Result: Experiments on 6,000 images show that AIFo substantially outperforms traditional classifiers and state-of-the-art vision-language models in both laboratory and real-world settings, reaching 97.05% accuracy. Conclusion: Agent-based procedural reasoning offers a more robust, interpretable, and adaptable paradigm for AI-generated image detection. Abstract: The rapid evolution of AI-generated images poses unprecedented challenges to information integrity and media authenticity. Existing detection approaches suffer from fundamental limitations: traditional classifiers lack interpretability and fail to generalize across evolving generative models, while vision-language models (VLMs), despite their promise, remain constrained to single-shot analysis and pixel-level reasoning. To address these challenges, we introduce AIFo (Agent-based Image Forensics), a novel training-free framework that emulates human forensic investigation through multi-agent collaboration. Unlike conventional methods, our framework employs a set of forensic tools, including reverse image search, metadata extraction, pre-trained classifiers, and VLM analysis, coordinated by specialized LLM-based agents that collect, synthesize, and reason over cross-source evidence. When evidence is conflicting or insufficient, a structured multi-agent debate mechanism allows agents to exchange arguments and reach a reliable conclusion. Furthermore, we enhance the framework with a memory-augmented reasoning module that learns from historical cases to improve future detection accuracy. Our comprehensive evaluation spans 6,000 images across both controlled laboratory settings and challenging real-world scenarios, including images from modern generative platforms and diverse online sources. AIFo achieves 97.05% accuracy, substantially outperforming traditional classifiers and state-of-the-art VLMs. These results demonstrate that agent-based procedural reasoning offers a new paradigm for more robust, interpretable, and adaptable AI-generated image detection.

[114] A Retrospect to Multi-prompt Learning across Vision and Language

Ziliang Chen,Xin Huang,Quanlong Guan,Liang Lin,Weiqi Luo

Main category: cs.CV

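TL;DR: This paper revisits vision-language multi-prompt learning and proposes Energy-based Multi-prompt Learning (EMPL), which generates multiple prompt embeddings by sampling from an energy-based distribution implicitly defined by the vision-language pretraining model. EMPL is parameter-efficient and balances in-domain and out-of-domain open-vocabulary generalization.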

Details Motivation: Existing work concentrates on the single-prompt paradigm and has not explored the potential of multi-prompt learning in depth; this paper aims to systematically revisit vision-language multi-prompt learning and uncover its technical potential. Method: Energy-based Multi-prompt Learning (EMPL) uses an energy-based model to generate multiple learnable prompt embeddings, with theoretical analysis and empirical validation of its advantages for vision-language transfer. Result: Experiments show that EMPL performs well on a range of downstream tasks, improves open-vocabulary recognition, and achieves a good balance between in-domain and out-of-domain generalization. Conclusion: Multi-prompt learning outperforms the single-prompt paradigm, and EMPL provides a theoretically grounded, parameter-efficient approach to adapting vision-language models. Abstract: The vision community is undergoing unprecedented progress with the emergence of Vision-Language Pretraining Models (VLMs). Prompt learning serves as the holy grail of accessing VLMs since it enables their fast adaptation to downstream tasks with limited resources. However, existing research centers on single-prompt paradigms and rarely investigates the technical potential of their multi-prompt learning counterparts. This paper aims to provide a principled retrospect for vision-language multi-prompt learning. We extend the recent constant modality gap phenomenon to learnable prompts and then justify the superiority of vision-language transfer with multi-prompt augmentation, empirically and theoretically. Based on this observation, we propose Energy-based Multi-prompt Learning (EMPL), which generates multiple prompt embeddings by drawing instances from an energy-based distribution that is implicitly defined by VLMs. EMPL is therefore not only parameter-efficient but also rigorously leads to a balance between in-domain and out-of-domain open-vocabulary generalization. Comprehensive experiments have been conducted to justify our claims and the excellence of EMPL.

[115] An Efficient and Generalizable Transfer Learning Method for Weather Condition Detection on Ground Terminals

Wenxuan Zhang,Peng Hu

Main category: cs.CV

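TL;DR: This paper proposes an efficient transfer learning method for local detection of fine-grained weather conditions on ground terminal components, aimed at improving the reliability of LEO satellite Internet in adverse weather.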

Details Motivation: Adverse weather such as rain and snow seriously degrades the performance and reliability of LEO satellite Internet ground terminals, and existing methods lack effective fine-grained weather detection to support fault diagnostics and mitigation. Method: An efficient transfer learning (TL) method lets ground terminal components locally detect representative conditions such as snow and wetness caused by typical and adverse weather. Result: The method outperforms typical deep learning models including YOLOv7, YOLOv9, Faster R-CNN, and R-YOLO in detection performance and generalizes better across scenarios. Conclusion: The transfer learning method gives satellite Internet ground terminals an efficient, generalizable way to detect weather-related conditions, helping improve reliability and operations under complex weather. Abstract: The increasing adoption of satellite Internet with low-Earth-orbit (LEO) satellites in mega-constellations allows ubiquitous connectivity to rural and remote areas. However, weather events have a significant impact on the performance and reliability of satellite Internet. Adverse weather events such as snow and rain can disturb the performance and operations of satellite Internet's essential ground terminal components, such as satellite antennas, significantly disrupting the space-ground link conditions between LEO satellites and ground stations. This challenge calls for not only region-based weather forecasts but also fine-grained detection of weather conditions on ground terminal components. Such a capability can assist in fault diagnostics and mitigation for reliable satellite Internet, but solutions are lacking, not to mention the effectiveness and generalization that are essential in real-world deployments. This paper discusses an efficient transfer learning (TL) method that can enable a ground component to locally detect representative weather-related conditions. The proposed method can detect snow, wet, and other conditions resulting from adverse and typical weather events and shows superior performance compared to the typical deep learning methods, such as YOLOv7, YOLOv9, Faster R-CNN, and R-YOLO. Our TL method also shows the advantage of being generalizable to various scenarios.

[116] DM-QPMNET: Dual-modality fusion network for cell segmentation in quantitative phase microscopy

Rajatsubhra Chakraborty,Ana Espinosa-Momox,Riley Haskin,Depeng Xu,Rosario Porras-Aguilar

Main category: cs.CV

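TL;DR: Proposes DM-QPMNet, a dual-encoder network for cell segmentation in single-shot quantitative phase microscopy that fuses modality-specific features from polarized intensity images and phase maps via multi-head attention for more robust segmentation.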

Details Motivation: Traditional thresholding methods are sensitive to noise and cell density, while deep learning approaches that simply concatenate channels fail to exploit the complementarity between polarized intensity images and phase maps. Method: DM-QPMNet uses a dual-encoder structure to process the two modalities separately, fuses features at intermediate depth with a content-aware multi-head attention mechanism, and adds dual-source skip connections and per-modality normalization to improve stability and performance. Result: It substantially outperforms monolithic concatenation and single-modality baselines on multiple benchmarks, confirming the value of modality-specific encoding with learnable fusion. Conclusion: Through a principled multi-modal fusion strategy, DM-QPMNet effectively uses the complementary illumination and phase cues captured simultaneously in ssQPM to achieve robust cell segmentation. Abstract: Cell segmentation in single-shot quantitative phase microscopy (ssQPM) faces challenges from traditional thresholding methods that are sensitive to noise and cell density, while deep learning approaches using simple channel concatenation fail to exploit the complementary nature of polarized intensity images and phase maps. We introduce DM-QPMNet, a dual-encoder network that treats these as distinct modalities with separate encoding streams. Our architecture fuses modality-specific features at intermediate depth via multi-head attention, enabling polarized edge and texture representations to selectively integrate complementary phase information. This content-aware fusion preserves training stability while adding principled multi-modal integration through dual-source skip connections and per-modality normalization at minimal overhead. Our approach demonstrates substantial improvements over monolithic concatenation and single-modality baselines, showing that modality-specific encoding with learnable fusion effectively exploits ssQPM's simultaneous capture of complementary illumination and phase cues for robust cell segmentation.

[117] Towards 1000-fold Electron Microscopy Image Compression for Connectomics via VQ-VAE with Transformer Prior

Fuming Yang,Yicong Li,Hanspeter Pfister,Jeff W. Lichtman,Yaron Meirovitch

Main category: cs.CV

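TL;DR: Proposes a VQ-VAE-based compression framework for electron microscopy data that supports compression ratios from 16x to 1024x and combines a Transformer prior with FiLM to enable pay-as-you-decode usage and ROI-driven high-resolution reconstruction.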

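Details Motivation: To address the storage, transfer, and analysis challenges posed by petascale electron microscopy data and make large-scale EM data processing more efficient. Method: A VQ-VAE architecture performs the compression; a Transformer prior predicts bottom tokens, and texture is restored via FiLM and concatenation; an ROI-driven workflow reconstructs high-resolution images from 1024x-compressed latents only in the regions where they are needed. Result: The framework provides variable compression ratios from 16x to 1024x, supports top-only decoding for extreme compression, can restore detailed texture in key regions, and significantly reduces decoding computation and storage overhead. Conclusion: The framework balances high compression ratios against the need for local high-fidelity reconstruction and is suited to efficient storage and on-demand analysis of large-scale electron microscopy data. Abstract: Petascale electron microscopy (EM) datasets push storage, transfer, and downstream analysis toward their current limits. We present a vector-quantized variational autoencoder-based (VQ-VAE) compression framework for EM that spans 16x to 1024x and enables pay-as-you-decode usage: top-only decoding for extreme compression, with an optional Transformer prior that predicts bottom tokens (without changing the compression ratio) to restore texture via feature-wise linear modulation (FiLM) and concatenation; we further introduce an ROI-driven workflow that performs selective high-resolution reconstruction from 1024x-compressed latents only where needed.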
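
Two building blocks named in this abstract, vector quantization and feature-wise linear modulation (FiLM), can be illustrated with a small sketch. It is written under stated assumptions: the toy codebook, the per-channel `gamma` and `beta` values, and both function names are hypothetical, and the abstract does not specify how the paper derives the modulation parameters from the Transformer prior.

```python
def quantize(vectors, codebook):
    """Vector quantization: map each vector to the index of its nearest codebook entry."""
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return [min(range(len(codebook)), key=lambda k: sqdist(v, codebook[k]))
            for v in vectors]


def film(features, gamma, beta):
    """FiLM: scale and shift each channel, y_c = gamma_c * x_c + beta_c.

    `features` is a list of channels (each a list of values); `gamma` and
    `beta` hold one scalar per channel. In a real model they would be
    predicted from conditioning information (an assumption here).
    """
    return [[g * x + b for x in channel]
            for channel, g, b in zip(features, gamma, beta)]


if __name__ == "__main__":
    codebook = [[0.0, 0.0], [1.0, 1.0]]              # hypothetical 2-entry codebook
    print(quantize([[0.1, -0.1], [0.9, 1.2]], codebook))
    print(film([[1.0, 2.0], [3.0, 4.0]], gamma=[2.0, 0.5], beta=[0.0, 1.0]))
```

Storing only codebook indices is what yields the compression; the FiLM step shows how a decoder's coarse features could be modulated per channel by additional predicted information to recover texture.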

[118] Hyperbolic Optimal Transport

Yan Bin Ng,Xianfeng Gu

Main category: cs.CV

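TL;DR: This paper proposes a new algorithm for computing optimal transport maps in hyperbolic space, extending methods from Euclidean and spherical geometry to the hyperbolic setting and solving the problem efficiently with a geometric variational technique.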

Details Motivation: Existing optimal transport methods mainly target Euclidean spaces and the sphere and are limited when dealing with problems that involve hyperbolic space, such as hierarchical data, networks, and multi-genus surfaces; algorithms suited to hyperbolic space are therefore needed. Method: Using a geometric variational technique, existing Euclidean and spherical optimal transport methods are generalized to hyperbolic space, yielding a novel and efficient algorithm. Result: Experiments on synthetic data and multi-genus surface models validate the method, which accurately computes optimal transport maps in hyperbolic space. Conclusion: The method extends the reach of optimal transport to hyperbolic space and provides a new tool for data with complex geometric structure. Abstract: The optimal transport (OT) problem aims to find the most efficient mapping between two probability distributions under a given cost function, and has diverse applications in many fields such as machine learning, computer vision and computer graphics. However, existing methods for computing optimal transport maps are primarily developed for Euclidean spaces and the sphere. In this paper, we explore the problem of computing the optimal transport map in hyperbolic space, which naturally arises in contexts involving hierarchical data, networks, and multi-genus Riemann surfaces. We propose a novel and efficient algorithm for computing the optimal transport map in hyperbolic space using a geometric variational technique by extending methods for Euclidean and spherical geometry to the hyperbolic setting. We also perform experiments on synthetic data and multi-genus surface models to validate the efficacy of the proposed method.
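
For intuition about the geometry involved, the sketch below computes geodesic distance in the Poincaré disk, one standard model of hyperbolic space, using d(u, v) = arcosh(1 + 2‖u − v‖² / ((1 − ‖u‖²)(1 − ‖v‖²))). The abstract does not say which model or cost function the authors use, so treating this distance as an ingredient of a transport cost is an assumption, and `poincare_distance` is a hypothetical helper rather than part of the paper's algorithm.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit disk/ball."""
    nu = sum(a * a for a in u)   # squared Euclidean norm of u
    nv = sum(b * b for b in v)   # squared Euclidean norm of v
    if nu >= 1.0 or nv >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * diff / ((1.0 - nu) * (1.0 - nv)))


if __name__ == "__main__":
    # From the origin, the distance to radius r is ln((1 + r) / (1 - r)).
    print(poincare_distance((0.0, 0.0), (0.5, 0.0)), math.log(3.0))
    # The same Euclidean gap costs far more near the boundary.
    print(poincare_distance((0.0, 0.0), (0.1, 0.0)))
    print(poincare_distance((0.85, 0.0), (0.95, 0.0)))
```

The last two calls use the same Euclidean separation of 0.1, yet the pair near the boundary is much farther apart in the hyperbolic metric, which suggests why solvers built for Euclidean costs do not carry over unchanged.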

[119] Object-Aware 4D Human Motion Generation

Shurui Gui,Deep Anil Patel,Xiner Li,Martin Renqiang Min

Main category: cs.CV

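TL;DR: Proposes MSDI, a zero-shot 4D human motion generation framework based on 3D Gaussian representations and motion diffusion priors, which combines large language models with motion optimization to generate natural and physically plausible motion.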

Details Motivation: Videos generated by existing video diffusion models suffer from unrealistic deformations, semantic errors, and physical inconsistencies, largely because they lack 3D physical priors. This work introduces object-aware 3D physical constraints to improve the realism and plausibility of 4D human motion generation. Method: The Motion Score Distilled Interaction (MSDI) framework starts from pre-generated 3D humans and objects and combines spatial and semantic information from large language models (LLMs) with motion priors extracted through Motion Diffusion Score Distillation Sampling (MSDS); it performs spatial-aware motion optimization by distilling score gradients from a pre-trained motion diffusion model, refining human motion without retraining. Result: Experiments show that the method generates natural and physically plausible 4D human motion that respects 3D spatial and semantic constraints, generalizes well to unseen object and motion combinations, and outperforms existing methods that require joint training. Conclusion: MSDI provides a scalable zero-shot solution that, by combining 3D Gaussian representations, motion diffusion priors, and the semantic understanding of LLMs, markedly improves the realism and physical consistency of 4D human motion generation. Abstract: Recent advances in video diffusion models have enabled the generation of high-quality videos. However, these videos still suffer from unrealistic deformations, semantic violations, and physical inconsistencies that are largely rooted in the absence of 3D physical priors. To address these challenges, we propose an object-aware 4D human motion generation framework grounded in 3D Gaussian representations and motion diffusion priors. With pre-generated 3D humans and objects, our method, Motion Score Distilled Interaction (MSDI), employs the spatial and prompt semantic information in large language models (LLMs) and motion priors through the proposed Motion Diffusion Score Distillation Sampling (MSDS). The combination of MSDS and LLMs enables our spatial-aware motion optimization, which distills score gradients from pre-trained motion diffusion models, to refine human motion while respecting object and semantic constraints. Unlike prior methods requiring joint training on limited interaction datasets, our zero-shot approach avoids retraining and generalizes to out-of-distribution object-aware human motions. Experiments demonstrate that our framework produces natural and physically plausible human motions that respect 3D spatial context, offering a scalable solution for realistic 4D generation.

[120] Merlin L48 Spectrogram Dataset

Aaron Sun,Subhransu Maji,Grant Van Horn

Main category: cs.CV

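TL;DR: This paper introduces L48, a fine-grained, real-world multi-label dataset derived from bird sound recordings, for single-positive multi-label (SPML) learning. Unlike prior work evaluated on synthetic data, L48 offers a more realistic and more challenging benchmark and exposes the performance gaps and weaknesses of existing SPML methods in real-world settings.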

Details Motivation: Existing SPML methods are mainly evaluated on synthetic data produced by randomly sampling labels from fully annotated datasets, which does not reflect the fine-grained complexity and difficult misclassifications of real scenarios; a more realistic and challenging benchmark is needed to evaluate and improve SPML methods. Method: The L48 dataset is built from real bird sound recordings and provides a natural single-positive multi-label setting, plus two extended settings in which domain priors supply additional negative labels; existing SPML methods are benchmarked and analyzed on it. Result: Experiments on L48 show that existing SPML methods behave significantly differently than on synthetic data and expose their weaknesses on real, fine-grained data, confirming the difficulty and necessity of the new dataset. Conclusion: L48 gives SPML research a more realistic and challenging benchmark, highlights the limitations of current methods, and encourages algorithm development for more complex real-world scenarios. Abstract: In the single-positive multi-label (SPML) setting, each image in a dataset is labeled with the presence of a single class, while the true presence of other classes remains unknown. The challenge is to narrow the performance gap between this partially-labeled setting and fully-supervised learning, which often requires a significant annotation budget. Prior SPML methods were developed and benchmarked on synthetic datasets created by randomly sampling single positive labels from fully-annotated datasets like Pascal VOC, COCO, NUS-WIDE, and CUB200. However, this synthetic approach does not reflect real-world scenarios and fails to capture the fine-grained complexities that can lead to difficult misclassifications. In this work, we introduce the L48 dataset, a fine-grained, real-world multi-label dataset derived from recordings of bird sounds. L48 provides a natural SPML setting with single-positive annotations on a challenging, fine-grained domain, as well as two extended settings in which domain priors give access to additional negative labels. We benchmark existing SPML methods on L48 and observe significant performance differences compared to synthetic datasets and analyze method weaknesses, underscoring the need for more realistic and difficult benchmarks.
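
The synthetic benchmark construction that this abstract contrasts L48 with, randomly keeping one positive label per fully annotated example, can be sketched as below. The function name `sample_single_positive` and the use of `None` to mark unknown labels are illustrative assumptions, not the exact protocol of any specific prior paper.

```python
import random


def sample_single_positive(full_labels, seed=0):
    """Turn fully annotated multi-label rows into single-positive rows.

    full_labels: list of rows, each a list of 0/1 values per class.
    Returns rows where exactly one true positive is kept as 1 and every
    other class is None (unknown), mimicking the SPML setting.
    """
    rng = random.Random(seed)  # fixed seed for reproducible sampling
    observed = []
    for row in full_labels:
        positives = [c for c, y in enumerate(row) if y == 1]
        keep = rng.choice(positives) if positives else None
        observed.append([1 if c == keep else None for c in range(len(row))])
    return observed


if __name__ == "__main__":
    full = [[1, 0, 1], [0, 1, 0], [1, 1, 1]]
    for before, after in zip(full, sample_single_positive(full)):
        print(before, "->", after)
```

Because the kept label is drawn uniformly at random, such synthetic data has no systematic labeling bias, whereas naturally collected single-positive annotations may not share that property.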

[121] BeetleFlow: An Integrative Deep Learning Pipeline for Beetle Image Processing

Fangxun Liu,S M Rayeed,Samuel Stevens,Alyson East,Cheng Hsuan Chiang,Colin Lee,Daniel Yi,Junke Yang,Tejas Naik,Ziyi Wang,Connor Kilrain,Elijah H Buckwalter,Jiacheng Hou,Saul Ibaven Bueno,Shuheng Wang,Xinyue Ma,Yifan Liu,Zhiyuan Tao,Ziheng Zhang,Eric Sokol,Michael Belitz,Sydne Record,Charles V. Stewart,Wei-Lun Chao

Main category: cs.CV

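TL;DR: Proposes a three-stage automated pipeline for processing large-scale beetle image data that combines transformer-based detection and segmentation models, substantially improving the efficiency of entomological research.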

Details Motivation: Biologists doing ecological research must process large numbers of beetle images, and manual curation is inefficient; an automated method is needed to handle image data from thousands of trays. Method: A three-stage pipeline: first, a transformer-based open-vocabulary object detector and a vision-language model iteratively detect all beetles on a tray; next, each beetle image is sorted and cropped; finally, two transformer-based segmentation models are fine-tuned on 670 manually labeled images to produce fine-grained morphological segmentation. Result: The pipeline performs automatic detection, cropping, and accurate fine-grained morphological segmentation of beetles, integrating multiple deep learning methods and specializing them for beetle image processing. Conclusion: The automated pipeline makes processing large-scale beetle image data much more efficient and helps accelerate downstream biological research. Abstract: In entomology and ecology research, biologists often need to collect a large number of insects, among which beetles are the most common species. A common practice for biologists to organize beetles is to place them on trays and take a picture of each tray. Given the images of thousands of such trays, it is important to have an automated pipeline to process the large-scale data for further research. Therefore, we develop a 3-stage pipeline to detect all the beetles on each tray, sort and crop the image of each beetle, and do morphological segmentation on the cropped beetles. For detection, we design an iterative process utilizing a transformer-based open-vocabulary object detector and a vision-language model. For segmentation, we manually labeled 670 beetle images and fine-tuned two variants of a transformer-based segmentation model to achieve fine-grained segmentation of beetles with relatively high accuracy. The pipeline integrates multiple deep learning methods and is specialized for beetle image processing, which can greatly improve the efficiency of processing large-scale beetle data and accelerate biological research.

[122] MambaNetLK: Enhancing Colonoscopy Point Cloud Registration with Mamba

Linzhe Jiang,Jiayuan Huang,Sophia Bano,Matthew J. Clarkson,Zhehua Mao,Mobarak I. Hoque

Main category: cs.CV

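TL;DR: Proposes MambaNetLK, a correspondence-free 3D registration framework for endoscopic navigation, and builds the clinical benchmark dataset C3VD-Raycasting-10k, significantly improving registration accuracy and robustness for colonoscopy guidance.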

Details Motivation: 解决生物组织特征退化和术前-术中域偏移导致的点云配准不稳定问题,提升图像引导结肠镜手术中的定位精度与安全性。 Method: 基于PointNetLK架构,引入Mamba状态空间模型(SSM)作为跨模态特征提取器,结合光线投射生成大规模对齐点云数据集C3VD-Raycasting-10k,并采用Lucas-Kanade算法迭代优化配准。 Result: 在C3VD-Raycasting-10k上优于现有方法,旋转误差中位数降低56.04%,平移误差RMSE降低26.19%;在ModelNet40上表现出强泛化能力,对初始位姿扰动更具鲁棒性。 Conclusion: MambaNetLK结合全局感知的SSM特征提取器与大规模临床数据集,为微创手术中的3D配准提供了更准确、可靠的解决方案。 Abstract: Accurate 3D point cloud registration underpins reliable image-guided colonoscopy, directly affecting lesion localization, margin assessment, and navigation safety. However, biological tissue exhibits repetitive textures and locally homogeneous geometry that cause feature degeneracy, while substantial domain shifts between pre-operative anatomy and intra-operative observations further degrade alignment stability. To address these clinically critical challenges, we introduce a novel 3D registration method tailored for endoscopic navigation and a high-quality, clinically grounded dataset to support rigorous and reproducible benchmarking. We introduce C3VD-Raycasting-10k, a large-scale benchmark dataset with 10,014 geometrically aligned point cloud pairs derived from clinical CT data. We propose MambaNetLK, a novel correspondence-free registration framework, which enhances the PointNetLK architecture by integrating a Mamba State Space Model (SSM) as a cross-modal feature extractor. As a result, the proposed framework efficiently captures long-range dependencies with linear-time complexity. The alignment is achieved iteratively using the Lucas-Kanade algorithm. On the clinical dataset, C3VD-Raycasting-10k, MambaNetLK achieves the best performance compared with the state-of-the-art methods, reducing median rotation error by 56.04% and RMSE translation error by 26.19% over the second-best method. The model also demonstrates strong generalization on ModelNet40 and superior robustness to initial pose perturbations. MambaNetLK provides a robust foundation for 3D registration in surgical navigation. 
The combination of a globally expressive SSM-based feature extractor and a large-scale clinical dataset enables more accurate and reliable guidance systems in minimally invasive procedures like colonoscopy.
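
The abstract reports a median rotation error and an RMSE translation error. As a rough illustration of how such registration metrics are commonly computed, here is a minimal sketch; the exact definitions used in the paper are not given in this digest, so the geodesic-angle rotation error and the per-pair Euclidean RMSE below are assumptions, and all function names are hypothetical.

```python
import math
import statistics

def rotation_error_deg(R_est, R_gt):
    """Geodesic angle (degrees) between two 3x3 rotation matrices."""
    # trace(R_est^T R_gt) equals the sum of elementwise products.
    trace = sum(R_est[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against floating-point drift outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))

def median_rotation_error(pairs):
    """Median rotation error over (R_est, R_gt) pairs."""
    return statistics.median(rotation_error_deg(a, b) for a, b in pairs)

def translation_rmse(t_est_list, t_gt_list):
    """Root-mean-square Euclidean distance between translation vectors."""
    sq = [sum((a - b) ** 2 for a, b in zip(te, tg))
          for te, tg in zip(t_est_list, t_gt_list)]
    return math.sqrt(sum(sq) / len(sq))

# Toy check: identity versus a 90-degree rotation about the z axis.
I3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
Rz90 = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
err = rotation_error_deg(I3, Rz90)
rmse = translation_rmse([[0, 0, 0]], [[3, 4, 0]])
```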

[123] Spot The Ball: A Benchmark for Visual Social Inference

Neha Balamurugan,Sarah Wu,Adam Chun,Gabe Gaw,Cristobal Eyzaguirre,Tobias Gerstenberg

Main category: cs.CV

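TL;DR: Introduces Spot The Ball, a new benchmark that evaluates whether vision-language models can infer the location of a removed ball in sports images from social cues such as gaze and pose. Humans significantly outperform current models, exposing a gap in visual social inference.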

Details Motivation: Improving the reasoning of AI systems in social settings requires models that understand subtle human behavioral cues such as gaze and pose. Localizing a removed ball in sports images is a practical test domain for measuring the social reasoning of vision-language models. Method: Builds a dataset of soccer, basketball, and volleyball images with the ball removed and asks models to infer its position from social cues in the scene; evaluates four state-of-the-art vision-language models (Gemini, GPT, LLaMA, Qwen) under three prompting strategies and compares them with human performance. Result: Humans reach 20-34% accuracy on the task, consistently two to three times higher than the best models (model accuracy ≤ 17%). Analysis shows that models rely on superficial spatial heuristics such as the image center or nearby players, whereas humans reason from social cues such as gaze direction and body pose. Conclusion: Current vision-language models still lag well behind humans in visual social reasoning; future architectures need to explicitly encode structured behavioral cues to achieve more robust, human-like inference. Abstract: Humans excel at visual social inference, the ability to infer hidden elements of a scene from subtle behavioral cues such as other people's gaze, pose, and orientation. This ability drives everyday social reasoning in humans and is critical for developing more human-like AI agents. We introduce Spot The Ball, a challenging benchmark for evaluating visual social inference in vision-language models (VLMs) using sports as a test domain. The task is to localize a removed sports ball from soccer, basketball, and volleyball images. We present a curated evaluation set with human baselines and a scalable pipeline for generating additional test items. We evaluate four state-of-the-art VLMs (Gemini, GPT, LLaMA, Qwen) using three prompting strategies, finding that humans are consistently two to three times more accurate (20-34%) than models (≤ 17%) across all sports. Our analyses show that models rely on superficial spatial heuristics--such as guessing near the image center or nearby players--while humans leverage social cues like gaze direction and body pose. These findings reveal a persistent human-model gap in visual social reasoning and underscore the need for architectures that explicitly encode structured behavioral cues to achieve robust, human-like inference.

[124] FedReplay: A Feature Replay Assisted Federated Transfer Learning Framework for Efficient and Privacy-Preserving Smart Agriculture

Long Li,Jiajia Li,Dong Chen,Lina Pu,Haibo Yao,Yanbo Huang

Main category: cs.CV

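TL;DR: Proposes a federated learning framework that combines a frozen CLIP vision transformer with a lightweight classifier to address privacy, non-IID data, and communication overhead in smart agriculture.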

Details Motivation: Conventional centralized training poses privacy risks, while standard federated learning performs poorly on non-IID data and incurs high communication costs, so a more efficient and privacy-preserving approach to agricultural image classification is needed. Method: Uses a pre-trained, frozen CLIP ViT to extract features and updates only a lightweight transformer classifier on the clients; shares 1% of the CLIP feature representations to mitigate the non-IID problem while preserving privacy. Result: Reaches 86.6% accuracy on agricultural classification tasks, more than 4 times higher than baseline federated learning approaches. Conclusion: The method effectively combines a vision-language model with federated learning, improving both accuracy and communication efficiency, and suits privacy-sensitive, heterogeneous-data smart agriculture scenarios. Abstract: Accurate classification plays a pivotal role in smart agriculture, enabling applications such as crop monitoring, fruit recognition, and pest detection. However, conventional centralized training often requires large-scale data collection, which raises privacy concerns, while standard federated learning struggles with non-independent and identically distributed (non-IID) data and incurs high communication costs. To address these challenges, we propose a federated learning framework that integrates a frozen Contrastive Language-Image Pre-training (CLIP) vision transformer (ViT) with a lightweight transformer classifier. By leveraging the strong feature extraction capability of the pre-trained CLIP ViT, the framework avoids training large-scale models from scratch and restricts federated updates to a compact classifier, thereby reducing transmission overhead significantly. Furthermore, to mitigate performance degradation caused by non-IID data distribution, a small subset (1%) of CLIP-extracted feature representations from all classes is shared across clients. These shared features are non-reversible to raw images, ensuring privacy preservation while aligning class representation across participants. Experimental results on agricultural classification tasks show that the proposed method achieves 86.6% accuracy, which is more than 4 times higher than baseline federated learning approaches. This demonstrates the effectiveness and efficiency of combining vision-language model features with federated learning for privacy-preserving and scalable agricultural intelligence.
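
The two mechanisms described above (federated updates restricted to a compact classifier, and a 1% per-class subset of shared features) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the digest does not name the aggregation rule, so sample-weighted federated averaging is assumed here, classifier weights are flattened into plain lists, and the function names `fedavg` and `select_shared_features` are hypothetical.

```python
import random

def fedavg(client_weights, client_sizes):
    """Sample-weighted average of flattened classifier weights (assumed rule)."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[k] * n for w, n in zip(client_weights, client_sizes)) / total
        for k in range(dim)
    ]

def select_shared_features(features_by_class, fraction=0.01, seed=0):
    """Pick a small per-class subset of extracted features to share.

    At least one feature per class is kept so every class is represented.
    """
    rng = random.Random(seed)
    shared = {}
    for label, feats in features_by_class.items():
        k = max(1, round(len(feats) * fraction))
        shared[label] = rng.sample(feats, k)
    return shared

# Toy usage: two clients, the second holding three times as much data.
global_weights = fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3])
shared = select_shared_features({"apple": list(range(200)), "pear": list(range(10))})
```

Only the compact classifier weights travel between clients and server in this sketch; the frozen feature extractor never needs to be transmitted.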

[125] Multi-View Consistent Human Image Customization via In-Context Learning

Hengjia Li,Jianjin Xu,Keli Cheng,Lei Wang,Ning Bi,Boxi Wu,Fernando De la Torre,Deng Cai

Main category: cs.CV

TL;DR: Proposes PersonalView, a method that gives personalized generative models multi-view generation capability from a small number of samples.

Details Motivation: Existing personalized generative models can neither control the viewpoint of the generated image nor produce consistent multi-view images of a person. Method: Designs a lightweight adaptation method comprising a conditioning architecture that exploits the in-context learning ability of a pre-trained diffusion transformer, and a Semantic Correspondence Alignment Loss that preserves the original generative ability. Result: With only 100 training samples, PersonalView significantly outperforms baselines trained on large-scale multi-view data in multi-view consistency, text alignment, identity similarity, and visual quality. Conclusion: PersonalView effectively equips existing generative models with multi-view personalized generation from very few samples, making it efficient and practical. Abstract: Recent advances in personalized generative models demonstrate impressive results in creating identity-consistent images of the same person under diverse settings. Yet, we note that most methods cannot control the viewpoint of the generated image, nor generate consistent multiple views of the person. To address this problem, we propose a lightweight adaptation method, PersonalView, capable of enabling an existing model to acquire multi-view generation capability with as few as 100 training samples. PersonalView consists of two key components: First, we design a conditioning architecture to take advantage of the in-context learning ability of the pre-trained diffusion transformer. Second, we preserve the original generative ability of the pretrained model with a new Semantic Correspondence Alignment Loss. We evaluate the multi-view consistency, text alignment, identity similarity, and visual quality of PersonalView and compare it to recent baselines with potential capability of multi-view customization. PersonalView significantly outperforms baselines trained on a large corpus of multi-view data with only 100 training samples.

[126] Towards Automated Petrography

Isai Daniel Chacón,Paola Ruiz Puentes,Jillian Pearse,Pablo Arbeláez

Main category: cs.CV

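TL;DR: Introduces LITHOS, a large-scale dataset of polarized-light images of rock thin sections, to advance automated petrography. It contains more than 210,000 high-resolution images and more than 100,000 expert-annotated mineral grains across 25 mineral categories. The authors also propose a dual-encoder transformer that fuses both polarization modalities and improves mineral classification over single-polarization models.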

Details Motivation: Traditional petrographic analysis relies on experts making time-consuming, labor-intensive observations through polarizing microscopes, which is hard to scale, so automated methods are needed to improve efficiency and scalability. Method: Builds the large-scale public dataset LITHOS, containing multi-modal polarized-light images and finely annotated mineral grains; designs a dual-encoder transformer architecture that fuses both polarization modalities for mineral classification. Result: Experiments show that the proposed dual-encoder model outperforms single-modality models on mineral classification, confirming the value of fusing polarization modalities; LITHOS is currently the largest and most diverse benchmark for automated petrography research. Conclusion: LITHOS provides an important foundational resource for automated petrography, and the proposed model demonstrates the potential of multi-modal deep learning for mineral identification, which may help automate and standardize geological analysis. Abstract: Petrography is a branch of geology that analyzes the mineralogical composition of rocks from microscopical thin section samples. It is essential for understanding rock properties across geology, archaeology, engineering, mineral exploration, and the oil industry. However, petrography is a labor-intensive task requiring experts to conduct detailed visual examinations of thin section samples through optical polarization microscopes, thus hampering scalability and highlighting the need for automated techniques. To address this challenge, we introduce the Large-scale Imaging and Thin section Optical-polarization Set (LITHOS), the largest and most diverse publicly available experimental framework for automated petrography. LITHOS includes 211,604 high-resolution RGB patches of polarized light and 105,802 expert-annotated grains across 25 mineral categories. Each annotation consists of the mineral class, spatial coordinates, and expert-defined major and minor axes represented as intersecting vector paths, capturing grain geometry and orientation. We evaluate multiple deep learning techniques for mineral classification in LITHOS and propose a dual-encoder transformer architecture that integrates both polarization modalities as a strong baseline for future reference. Our method consistently outperforms single-polarization models, demonstrating the value of polarization synergy in mineral classification. We have made the LITHOS Benchmark publicly available, comprising our dataset, code, and pretrained models, to foster reproducibility and further research in automated petrographic analysis.

[127] Beyond ImageNet: Understanding Cross-Dataset Robustness of Lightweight Vision Models

Weidong Zhang,Pak Lun Kevin Ding,Huan Liu

Main category: cs.CV

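TL;DR: Presents the first systematic evaluation of the cross-domain generalization of 11 lightweight vision models across 7 datasets, and proposes xScore, a new metric for cross-dataset performance. ImageNet accuracy does not reliably predict performance in other domains such as fine-grained or medical imagery; certain convolutional design choices generalize better, while Transformer blocks bring limited gains at higher parameter cost.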

Details Motivation: Lightweight vision models are evaluated mainly on ImageNet, their generalization to other domains is unknown, and there is no systematic analysis or quantitative measure of cross-dataset robustness. Method: Evaluates 11 lightweight models on 7 diverse datasets under fixed training conditions, proposes the Cross-Dataset Score (xScore) as a unified metric of cross-domain consistency and robustness, and analyzes how different architectural components affect generalization. Result: (1) ImageNet accuracy does not reliably predict performance on fine-grained or medical datasets; (2) xScore is an effective predictor of cross-domain performance for mobile models and can be estimated from just four datasets; (3) isotropic convolutions, higher spatial resolution, and channel-wise attention help generalization, whereas Transformer blocks yield limited gains at higher parameter cost. Conclusion: The study offers a reproducible framework for evaluating lightweight models beyond ImageNet, identifies general design principles for mobile-oriented architectures, and gives guidance for developing lightweight models that are robust across domains. Abstract: Lightweight vision classification models such as MobileNet, ShuffleNet, and EfficientNet are increasingly deployed in mobile and embedded systems, yet their performance has been predominantly benchmarked on ImageNet. This raises critical questions: Do models that excel on ImageNet also generalize across other domains? How can cross-dataset robustness be systematically quantified? And which architectural elements consistently drive generalization under tight resource constraints? Here, we present the first systematic evaluation of 11 lightweight vision models (2.5M parameters), trained under a fixed 100-epoch schedule across 7 diverse datasets. We introduce the Cross-Dataset Score (xScore), a unified metric that quantifies the consistency and robustness of model performance across diverse visual domains. Our results show that (1) ImageNet accuracy does not reliably predict performance on fine-grained or medical datasets, (2) xScore provides a scalable predictor of mobile model performance that can be estimated from just four datasets, and (3) certain architectural components--such as isotropic convolutions with higher spatial resolution and channel-wise attention--promote broader generalization, while Transformer-based blocks yield little additional benefit, despite incurring higher parameter overhead. 
This study provides a reproducible framework for evaluating lightweight vision models beyond ImageNet, highlights key design principles for mobile-friendly architectures, and guides the development of future models that generalize robustly across diverse application domains.

[128] A DeepONet joint Neural Tangent Kernel Hybrid Framework for Physics-Informed Inverse Source Problems and Robust Image Reconstruction

Yuhao Fang,Zijian Wang,Yao Lu,Ye Zhang,Chun Li

Main category: cs.CV

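TL;DR: Proposes a hybrid method that combines DeepONet with the Neural Tangent Kernel (NTK) to solve complex inverse problems such as source localization governed by the Navier-Stokes equations and image reconstruction.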

Details Motivation: To overcome the challenges posed by nonlinearity, sparsity, and noisy data, and to improve the accuracy and robustness of solutions to physical-field inverse problems. Method: Combines DeepONet with the NTK and adds physics-informed constraints and task-specific regularization terms to the loss function. Result: Validation on a range of synthetic and real datasets demonstrates the robustness, scalability, and precision of the method. Conclusion: The framework produces physically consistent and accurate solutions and has broad potential applications in computational physics and imaging science. Abstract: This work presents a novel hybrid approach that integrates Deep Operator Networks (DeepONet) with the Neural Tangent Kernel (NTK) to solve complex inverse problems. The method effectively addresses tasks such as source localization governed by the Navier-Stokes equations and image reconstruction, overcoming challenges related to nonlinearity, sparsity, and noisy data. By incorporating physics-informed constraints and task-specific regularization into the loss function, the framework ensures solutions that are both physically consistent and accurate. Validation on diverse synthetic and real datasets demonstrates its robustness, scalability, and precision, showcasing its broad potential applications in computational physics and imaging sciences.

[129] Federated Dialogue-Semantic Diffusion for Emotion Recognition under Incomplete Modalities

Xihang Qiu,Jiarong Cheng,Yuhao Fang,Wanpeng Zhang,Yao Lu,Ye Zhang,Chun Li

Main category: cs.CV

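TL;DR: Proposes the Federated Dialogue-guided and Semantic-Consistent Diffusion (FedDISC) framework to handle missing modalities in multimodal emotion recognition. It uses federated learning and diffusion models to recover modalities across devices and achieves superior emotion classification performance on several datasets.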

Details Motivation: Unpredictable absence of multimodal signals in real-world scenarios severely degrades existing methods, and conventional modality recovery approaches tend to produce semantic distortion under extreme data distributions. Method: Proposes the FedDISC framework, which combines federated learning with diffusion models, uses a Dialogue Graph Network to capture conversational dependencies and a Semantic Conditioning Network to keep recovered modalities semantically consistent, and adopts an Alternating Frozen Aggregation strategy to jointly optimize the recovery and classification modules. Result: Experiments on the IEMOCAP, CMUMOSI, and CMUMOSEI datasets show that FedDISC outperforms existing methods across a variety of missing-modality patterns and significantly improves emotion classification performance. Conclusion: FedDISC effectively addresses multimodal emotion recognition under missing modalities, achieving robust and scalable modality recovery through federated learning and a semantic-consistency design. Abstract: Multimodal Emotion Recognition in Conversations (MERC) enhances emotional understanding through the fusion of multimodal signals. However, unpredictable modality absence in real-world scenarios significantly degrades the performance of existing methods. Conventional missing-modality recovery approaches, which depend on training with complete multimodal data, often suffer from semantic distortion under extreme data distributions, such as fixed-modality absence. To address this, we propose the Federated Dialogue-guided and Semantic-Consistent Diffusion (FedDISC) framework, pioneering the integration of federated learning into missing-modality recovery. By federated aggregation of modality-specific diffusion models trained on clients and broadcasting them to clients missing corresponding modalities, FedDISC overcomes single-client reliance on modality completeness. Additionally, the DISC-Diffusion module ensures consistency in context, speaker identity, and semantics between recovered and available modalities, using a Dialogue Graph Network to capture conversational dependencies and a Semantic Conditioning Network to enforce semantic alignment. We further introduce a novel Alternating Frozen Aggregation strategy, which cyclically freezes recovery and classifier modules to facilitate collaborative optimization. Extensive experiments on the IEMOCAP, CMUMOSI, and CMUMOSEI datasets demonstrate that FedDISC achieves superior emotion classification performance across diverse missing modality patterns, outperforming existing approaches.

[130] OSMGen: Highly Controllable Satellite Image Synthesis using OpenStreetMap Data

Amir Ziashahabi,Narges Ghasemi,Sajjad Shahabi,John Krumm,Salman Avestimehr,Cyrus Shahabi

Main category: cs.CV

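TL;DR: OSMGen is a framework that generates realistic satellite imagery from raw OpenStreetMap data. It supports consistent before-after image pairs and can be used for urban monitoring, training data generation, and planning previews.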

Details Motivation: Automated urban monitoring is difficult because annotated data on specific urban features and their changes are scarce, so a method is needed that uses open map data to generate high-quality geographic imagery. Method: Proposes the OSMGen framework, which generates satellite images directly from OSM JSON data (including vector geometries, semantic tags, location, and time) and produces the corresponding changed image when a user edits the OSM data. Result: Achieves high-fidelity satellite image generation with consistent before-after image pairs, supporting training data augmentation, class balancing, and urban planning visualization. Conclusion: OSMGen provides a new tool for monitoring urban dynamics and advances closed-loop systems in which imagery drives structured map updates. Abstract: Accurate and up-to-date geospatial data are essential for urban planning, infrastructure monitoring, and environmental management. Yet, automating urban monitoring remains difficult because curated datasets of specific urban features and their changes are scarce. We introduce OSMGen, a generative framework that creates realistic satellite imagery directly from raw OpenStreetMap (OSM) data. Unlike prior work that relies on raster tiles, OSMGen uses the full richness of OSM JSON, including vector geometries, semantic tags, location, and time, giving fine-grained control over how scenes are generated. A central feature of the framework is the ability to produce consistent before-after image pairs: user edits to OSM inputs translate into targeted visual changes, while the rest of the scene is preserved. This makes it possible to generate training data that addresses scarcity and class imbalance, and to give planners a simple way to preview proposed interventions by editing map data. More broadly, OSMGen produces paired (JSON, image) data for both static and changed states, paving the way toward a closed-loop system where satellite imagery can automatically drive structured OSM updates. Source code is available at https://github.com/amir-zsh/OSMGen.

[131] Detecting AI-Generated Images via Diffusion Snap-Back Reconstruction: A Forensic Approach

Mohd Ruhul Ameen,Akif Islam

Main category: cs.CV

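TL;DR: Proposes a forensic framework based on diffusion-model reconstruction dynamics that distinguishes real from synthetic images by how reconstruction metrics change across multiple noise strengths. It reaches 0.993 AUROC on 4,000 images and shows strong robustness and generalization.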

Details Motivation: Traditional deepfake detection methods fail against modern text-to-image systems such as Stable Diffusion and DALL-E, whose outputs show no obvious frequency or pixel-level artifacts, so a new detection mechanism is needed. Method: Proposes the "diffusion snap-back" approach, which analyzes how reconstruction metrics such as LPIPS, SSIM, and PSNR change across noise strengths and extracts manifold-based features to identify AI-generated images. Result: Achieves 0.993 AUROC under cross-validation on a dataset of 4,000 images, is robust to common distortions such as compression and noise, and shows good generalization and interpretability. Conclusion: The method offers a workable foundation for scalable, model-agnostic synthetic media forensics, achieving strong detection performance even with limited data and a single diffusion model. Abstract: The rapid rise of generative diffusion models has made distinguishing authentic visual content from synthetic imagery increasingly challenging. Traditional deepfake detection methods, which rely on frequency or pixel-level artifacts, fail against modern text-to-image systems such as Stable Diffusion and DALL-E that produce photorealistic and artifact-free results. This paper introduces a diffusion-based forensic framework that leverages multi-strength image reconstruction dynamics, termed diffusion snap-back, to identify AI-generated images. By analysing how reconstruction metrics (LPIPS, SSIM, and PSNR) evolve across varying noise strengths, we extract interpretable manifold-based features that differentiate real and synthetic images. Evaluated on a balanced dataset of 4,000 images, our approach achieves 0.993 AUROC under cross-validation and remains robust to common distortions such as compression and noise. Despite using limited data and a single diffusion backbone (Stable Diffusion v1.5), the proposed method demonstrates strong generalization and interpretability, offering a foundation for scalable, model-agnostic synthetic media forensics.
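
The idea of tracking a reconstruction metric across noise strengths can be sketched as below. This is an illustrative toy, not the paper's pipeline: only PSNR is computed (LPIPS and SSIM are omitted), images are flat lists of floats in [0, 1], the `reconstruct` callable is a hypothetical stand-in for a diffusion-based noise-and-denoise step, and the mean and slope summary features are assumptions about what a curve descriptor might look like.

```python
import math

def psnr(a, b, max_val=1.0):
    """Peak signal-to-noise ratio between two equally sized flat images."""
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    if mse == 0:
        return float("inf")
    return 10 * math.log10(max_val ** 2 / mse)

def snapback_curve(image, reconstruct, strengths):
    """PSNR between the image and its reconstruction at each noise strength."""
    return [psnr(image, reconstruct(image, s)) for s in strengths]

def curve_features(strengths, curve):
    """Summarize the curve by its mean and least-squares slope."""
    n = len(curve)
    mean_s = sum(strengths) / n
    mean_c = sum(curve) / n
    num = sum((s - mean_s) * (c - mean_c) for s, c in zip(strengths, curve))
    den = sum((s - mean_s) ** 2 for s in strengths)
    return {"mean": mean_c, "slope": num / den}

# Toy stand-in: reconstruction error grows linearly with noise strength.
toy_reconstruct = lambda img, s: [x + 0.1 * s for x in img]
strengths = [0.5, 1.0]
curve = snapback_curve([0.5, 0.5], toy_reconstruct, strengths)
features = curve_features(strengths, curve)
```

In a real setting the resulting feature vectors for real and synthetic images would feed a simple classifier; that step is left out here.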

[132] Transfer Learning for Onboard Cloud Segmentation in Thermal Earth Observation: From Landsat to a CubeSat Constellation

Niklas Wölki,Lukas Kondmann,Christian Mollière,Martin Langer,Julia Gottfriedsen,Martin Werner

Main category: cs.CV

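TL;DR: Addresses cloud segmentation for thermal-infrared Earth observation on CubeSat missions with transfer learning and a lightweight UNet architecture, pretraining on a public dataset and fine-tuning on a small number of mission-specific samples to enable efficient onboard cloud-mask inference.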

Details Motivation: CubeSats are constrained by hardware resources and spectral information, so conventional cloud masking methods are hard to apply, and labeled data are insufficient; an efficient and accurate thermal-infrared cloud segmentation method is needed. Method: Uses a UNet with a lightweight MobileNet encoder, pretrained on the Landsat-7 Cloud Cover Assessment Dataset and fine-tuned jointly with a small set of FOREST-2 mission samples; the model is then converted to a TensorRT engine to speed up inference. Result: On FOREST-2 data, macro F1 improves from 0.850 to 0.877, and full-image inference runs in under 5 seconds on an NVIDIA Jetson Nano. Conclusion: A transfer learning strategy that pairs public datasets with lightweight models can support real-time, accurate thermal-infrared cloud masking for resource-constrained Earth observation missions. Abstract: Onboard cloud segmentation is a critical yet underexplored task in thermal Earth observation (EO), particularly for CubeSat missions constrained by limited hardware and spectral information. CubeSats often rely on a single thermal band and lack sufficient labeled data, making conventional cloud masking techniques infeasible. This work addresses these challenges by applying transfer learning to thermal cloud segmentation for the FOREST-2 CubeSat, using a UNet with a lightweight MobileNet encoder. We pretrain the model on the public Landsat-7 Cloud Cover Assessment Dataset and fine-tune it with a small set of mission-specific samples in a joint-training setup, improving the macro F1 from 0.850 to 0.877 over FOREST-2-only baselines. We convert the model to a TensorRT engine and demonstrate full-image inference in under 5 seconds on an NVIDIA Jetson Nano. These results show that leveraging public datasets and lightweight architectures can enable accurate, efficient thermal-only cloud masking on-orbit, supporting real-time decision-making in data-limited EO missions.
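
The reported metric is macro F1, which averages per-class F1 scores with equal weight regardless of class frequency. A minimal sketch of the standard definition over flattened pixel labels follows; the function name and the toy labels (0 = clear, 1 = cloud) are illustrative only and do not come from the paper.

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 over flattened pixel labels."""
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted pixels contributes 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy mask: one clear pixel is mislabeled as cloud.
score = macro_f1([0, 0, 1, 1], [0, 1, 1, 1], classes=[0, 1])
```

Because each class counts equally, macro F1 penalizes a model that does well on the dominant class but poorly on the rarer one, which matters when cloud cover is unevenly distributed across scenes.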

[133] Oitijjo-3D: Generative AI Framework for Rapid 3D Heritage Reconstruction from Street View Imagery

Momen Khandoker Ope,Akif Islam,Mohd Ruhul Ameen,Abu Saleh Musa Miah,Md Rashedul Islam,Jungpil Shin

Main category: cs.CV

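TL;DR: Proposes Oitijjo-3D, a cost-free generative AI framework that uses Google Street View imagery, multimodal visual reasoning, and neural image-to-3D generation to reconstruct Bangladesh's cultural heritage in 3D quickly and at low cost.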

Details Motivation: Cultural heritage restoration in Bangladesh faces the dual challenge of limited resources and scarce technical expertise, and traditional 3D digitization methods are hard to deploy in developing countries because they are costly and complex to operate. Method: Uses publicly available Google Street View imagery in a two-stage 3D reconstruction pipeline: structure-texture synthesis with Gemini 2.5 Flash Image, followed by geometry recovery with Hexagen. Result: Experiments on sites such as Ahsan Manzil, Choto Sona Mosque, and Paharpur show that the method produces photorealistic, metrically coherent 3D models in seconds, with significant speedups over conventional Structure-from-Motion (SfM) pipelines. Conclusion: By turning open imagery into digital heritage, Oitijjo-3D lowers economic and technical barriers, makes heritage preservation a community-driven, AI-assisted act of cultural continuity, and offers a workable option for resource-limited nations. Abstract: Cultural heritage restoration in Bangladesh faces a dual challenge of limited resources and scarce technical expertise. Traditional 3D digitization methods, such as photogrammetry or LiDAR scanning, require expensive hardware, expert operators, and extensive on-site access, which are often infeasible in developing contexts. As a result, many of Bangladesh's architectural treasures, from the Paharpur Buddhist Monastery to Ahsan Manzil, remain vulnerable to decay and inaccessible in digital form. This paper introduces Oitijjo-3D, a cost-free generative AI framework that democratizes 3D cultural preservation. By using publicly available Google Street View imagery, Oitijjo-3D reconstructs faithful 3D models of heritage structures through a two-stage pipeline - multimodal visual reasoning with Gemini 2.5 Flash Image for structure-texture synthesis, and neural image-to-3D generation through Hexagen for geometry recovery. The system produces photorealistic, metrically coherent reconstructions in seconds, achieving significant speedups compared to conventional Structure-from-Motion pipelines, without requiring any specialized hardware or expert supervision. Experiments on landmarks such as Ahsan Manzil, Choto Sona Mosque, and Paharpur demonstrate that Oitijjo-3D preserves both visual and structural fidelity while drastically lowering economic and technical barriers. By turning open imagery into digital heritage, this work reframes preservation as a community-driven, AI-assisted act of cultural continuity for resource-limited nations.

[134] Who Can We Trust? Scope-Aware Video Moment Retrieval with Multi-Agent Conflict

Chaochen Wu,Guan Luo,Meiyun Zuo,Zhitao Fan

Main category: cs.CV

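TL;DR: Proposes a reinforcement learning-based video moment retrieval model and a multi-agent framework that uses evidential learning to resolve localization conflicts between models, improving retrieval performance and also detecting queries that have no corresponding moment in the video.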

Details Motivation: Existing methods ignore conflicts between the localization results of different models, so they cannot combine the strengths of multiple models and struggle with queries that have no corresponding moment in the video. Method: Designs a reinforcement learning-based video moment retrieval model that scans the whole video once and outputs the moment boundary together with its localization evidence; proposes a multi-agent system framework that uses evidential learning to integrate the agents' localization results and resolve conflicts. Result: Experiments on benchmark datasets show that the proposed method outperforms current state-of-the-art approaches and confirm that modeling competition and conflict in a multi-agent system improves reinforcement learning performance on moment retrieval. Conclusion: The evidential multi-agent framework not only fuses the localization results of multiple models and resolves their conflicts, but also identifies out-of-scope queries without additional training, making the model more practical in real-world settings. Abstract: Video moment retrieval uses a text query to locate a moment from a given untrimmed video reference. Locating corresponding video moments with text queries helps people interact with videos efficiently. Current solutions for this task have not considered conflict within location results from different models, so various models cannot integrate correctly to produce better results. This study introduces a reinforcement learning-based video moment retrieval model that can scan the whole video once to find the moment's boundary while producing its locational evidence. Moreover, we propose a multi-agent system framework that can use evidential learning to resolve conflicts between agents' localization output. As a side product of observing and dealing with conflicts between agents, we can decide whether a query has no corresponding moment in a video (out-of-scope) without additional training, which is suitable for real-world applications. Extensive experiments on benchmark datasets show the effectiveness of our proposed methods compared with state-of-the-art approaches. Furthermore, the results of our study reveal that modeling competition and conflict of the multi-agent system is an effective way to improve RL performance in moment retrieval and show the new role of evidential learning in the multi-agent framework.

[135] VisionCAD: An Integration-Free Radiology Copilot Framework

Jiaming Li,Junlei Wu,Sheng Wang,Honglin Xiong,Jiangdong Cai,Zihao Zhao,Yitao Zhu,Yuan Yin,Dinggang Shen,Qian Wang

Main category: cs.CV

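TL;DR: Proposes VisionCAD, a vision-based radiology assistance framework that captures on-screen medical images with a camera, enabling AI-assisted diagnosis without changing existing hospital IT infrastructure.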

Details Motivation: Computer-aided diagnosis (CAD) systems are hard to integrate into existing hospital IT infrastructure, which limits their widespread adoption. Method: Develops an automated pipeline that captures on-screen medical images with a camera, then detects, restores, and analyzes them, converting the visual data into diagnostic-quality images suitable for automated analysis and report generation. Result: Validated on a range of medical imaging datasets, VisionCAD achieves diagnostic performance comparable to conventional CAD systems, with F1-score degradation typically below 2% on classification tasks and natural language generation metrics for automated reports within 1% of those obtained from the original images. Conclusion: Requiring only a camera and standard computing resources, VisionCAD offers an accessible way to deploy AI diagnostic capabilities in diverse clinical settings without modifying existing infrastructure. Abstract: Widespread clinical deployment of computer-aided diagnosis (CAD) systems is hindered by the challenge of integrating with existing hospital IT infrastructure. Here, we introduce VisionCAD, a vision-based radiological assistance framework that circumvents this barrier by capturing medical images directly from displays using a camera system. The framework operates through an automated pipeline that detects, restores, and analyzes on-screen medical images, transforming camera-captured visual data into diagnostic-quality images suitable for automated analysis and report generation. We validated VisionCAD across diverse medical imaging datasets, demonstrating that our modular architecture can flexibly utilize state-of-the-art diagnostic models for specific tasks. The system achieves diagnostic performance comparable to conventional CAD systems operating on original digital images, with an F1-score degradation typically less than 2% across classification tasks, while natural language generation metrics for automated reports remain within 1% of those derived from original images. By requiring only a camera device and standard computing resources, VisionCAD offers an accessible approach for AI-assisted diagnosis, enabling the deployment of diagnostic capabilities in diverse clinical settings without modifications to existing infrastructure.

[136] Rethinking Facial Expression Recognition in the Era of Multimodal Large Language Models: Benchmark, Datasets, and Beyond

Fan Zhang,Haoxuan Li,Shengju Qian,Xin Wang,Zheng Lian,Hao Wu,Zhihong Zhu,Yuan Gao,Qiankun Li,Yefeng Zheng,Zhouchen Lin,Pheng-Ann Heng

Main category: cs.CV

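TL;DR: Presents FERBench, a systematic benchmark that evaluates 20 state-of-the-art multimodal large language models on facial expression recognition (FER), and introduces UniFER-7B, a model whose post-training strategies improve reasoning and interpretability.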

Details Motivation: Although multimodal large language models perform well on many tasks, their performance on facial expression recognition has not been fully explored, and unified, interpretable FER benchmarks and dedicated models are lacking. Method: Converts conventional FER datasets into visual question-answering (VQA) format to build the FERBench benchmark; proposes two post-training strategies and develops UniFER-7B from two large-scale, high-quality datasets, UniFER-CoT-230K (for cold-start initialization) and UniFER-RLVR-360K (for reinforcement learning with verifiable rewards). Result: Experiments show that existing MLLMs classify well but fall short in reasoning and interpretability; UniFER-7B outperforms many open-source and closed-source generalist MLLMs (such as Gemini-2.5-Pro and Qwen2.5-VL-72B). Conclusion: Structured post-training strategies can effectively improve the reasoning and interpretability of MLLMs on facial expression recognition, and UniFER-7B points to a new direction for building unified, interpretable FER foundation models. Abstract: Multimodal Large Language Models (MLLMs) have revolutionized numerous research fields, including computer vision and affective computing. As a pivotal challenge in this interdisciplinary domain, facial expression recognition (FER) has evolved from separate, domain-specific models to more unified approaches. One promising avenue to unify FER tasks is converting conventional FER datasets into visual question-answering (VQA) formats, enabling the direct application of powerful generalist MLLMs for inference. However, despite the success of cutting-edge MLLMs in various tasks, their performance on FER tasks remains largely unexplored. To address this gap, we provide FERBench, a systematic benchmark that incorporates 20 state-of-the-art MLLMs across four widely used FER datasets. Our results reveal that, while MLLMs exhibit good classification performance, they still face significant limitations in reasoning and interpretability. To this end, we introduce post-training strategies aimed at enhancing the facial expression reasoning capabilities of MLLMs. Specifically, we curate two high-quality and large-scale datasets: UniFER-CoT-230K for cold-start initialization and UniFER-RLVR-360K for reinforcement learning with verifiable rewards (RLVR), respectively. Building upon them, we develop a unified and interpretable FER foundation model termed UniFER-7B, which outperforms many open-source and closed-source generalist MLLMs (e.g., Gemini-2.5-Pro and Qwen2.5-VL-72B).

[137] VinciCoder: Unifying Multimodal Code Generation via Coarse-to-fine Visual Reinforcement Learning

Xuanle Zhao,Deyang Jiang,Zhixiong Zeng,Lei Chen,Haibo Qiu,Jing Huang,Yufeng Zhong,Liming Zheng,Yilin Cao,Lin Ma

Main category: cs.CV

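TL;DR: Proposes VinciCoder, a unified model that tackles the limited generalization of multimodal code generation with a two-stage training framework, combining large-scale supervised fine-tuning with visual reinforcement learning based on a coarse-to-fine reward, and reaches state-of-the-art performance on several benchmarks.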

Details Motivation: Existing vision-language models rely on single-task training for multimodal code generation, which limits generalization and hinders genuine vision-code intelligence. Method: Proposes a two-stage training framework: first builds a large-scale supervised fine-tuning corpus of 1.6M image-code pairs, then introduces a Visual Reinforcement Learning (ViRL) strategy with a coarse-to-fine reward that improves the visual fidelity of generated code by computing visual similarity over local and global image patches. Result: Experiments on several multimodal code generation benchmarks show that VinciCoder achieves state-of-the-art performance, supporting the effectiveness of the proposed ViRL strategy. Conclusion: With a unified framework and the ViRL training strategy, VinciCoder markedly improves the generalization and output quality of multimodal code generation and advances vision-code intelligence. Abstract: Multimodal code generation has garnered significant interest within the research community. Despite the notable success of recent vision-language models (VLMs) on specialized tasks like Chart-to-code generation, their reliance on single-task training regimens fosters a narrow paradigm that hinders the development of generalized VIsioN Code Intelligence. In this work, we introduce VinciCoder, a unified multimodal code generation model that addresses this limitation via a two-stage training framework. We begin by constructing a large-scale Supervised Finetuning (SFT) corpus comprising 1.6M image-code pairs for tasks involving direct code generation and visual-based code refinement. Subsequently, we introduce a Visual Reinforcement Learning (ViRL) strategy, which employs a coarse-to-fine reward mechanism to improve visual fidelity by calculating visual similarity across local and global image patches. Extensive experiments on various multimodal code generation benchmarks demonstrate that VinciCoder achieves state-of-the-art performance, underscoring the effectiveness of our coarse-to-fine ViRL strategy. The code and model will be available at https://github.com/DocTron-hub/VinciCoder.
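
A coarse-to-fine reward that mixes global and local patch similarity might look roughly like the sketch below. Everything specific here is an assumption made for illustration: raw-pixel cosine similarity stands in for whatever visual similarity measure the authors use, images are small 2D lists of equal size, and the patch size, the equal weighting, and the function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two flat vectors; all-zero vectors handled explicitly."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0 if na == nb else 0.0
    return dot / (na * nb)

def patches(img, size):
    """Split a 2D image into flattened, non-overlapping size x size patches."""
    h, w = len(img), len(img[0])
    return [
        [img[i][j] for i in range(r, min(r + size, h)) for j in range(c, min(c + size, w))]
        for r in range(0, h, size)
        for c in range(0, w, size)
    ]

def coarse_to_fine_reward(rendered, target, patch_size=2, w_global=0.5):
    """Weighted mix of whole-image (coarse) and mean per-patch (fine) similarity."""
    flat = lambda img: [v for row in img for v in row]
    coarse = cosine(flat(rendered), flat(target))
    local = [cosine(p, q) for p, q in zip(patches(rendered, patch_size),
                                          patches(target, patch_size))]
    fine = sum(local) / len(local)
    return w_global * coarse + (1 - w_global) * fine

# Toy usage: the image rendered from generated code versus the target image.
same = coarse_to_fine_reward([[1, 2], [3, 4]], [[1, 2], [3, 4]])
opposite = coarse_to_fine_reward([[0, 1], [1, 0]], [[1, 0], [0, 1]], patch_size=1)
```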

[138] CoT-Saliency: Unified Chain-of-Thought Reasoning for Heterogeneous Saliency Tasks

Long Li,Shuichen Ji,Ziyang Luo,Nian Liu,Dingwen Zhang,Junwei Han

Main category: cs.CV

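TL;DR: Proposes a unified framework that handles three heterogeneous saliency tasks (SOD, CoSOD, SIS) jointly by casting each as a Chain-of-Thought (CoT) reasoning process in a vision-language model. Training has two stages, supervised fine-tuning and reinforcement learning, and a lightweight single-sample algorithm, CGPO, improves CoT quality while addressing the confidence-agnostic learning, signal dilution, and computational overhead of existing methods. Experiments show that the method matches or exceeds specialized SOTA methods and closed-source VLMs across tasks while using less training data.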

Details Motivation: Because saliency tasks differ operationally, existing methods usually need a separate model for each task and lack generality and efficiency, so a framework that handles multiple saliency tasks in a unified way is needed to improve generalization and practicality. Method: Casts the SOD, CoSOD, and SIS tasks as a Chain-of-Thought (CoT) reasoning process in a vision-language model; trains in two stages, Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL); proposes Confidence-Guided Policy Optimization (CGPO), which uses the discrepancy between reward and model confidence as a per-sample advantage signal; introduces an "output-to-reasoning" strategy to build high-fidelity SFT data that is logically consistent with ground-truth masks. Result: The model matches or outperforms specialized SOTA methods and strong closed-source VLMs across the saliency tasks; on CoCA for CoSOD it reaches an S-measure of 0.899, surpassing the prior best by 8.0 percentage points while using far less training data than competing methods. Conclusion: The CoT-based unified framework bridges the heterogeneity between saliency tasks and, combined with CGPO and the output-to-reasoning strategy, achieves strong performance with less data, showing the potential of VLMs for saliency detection. Abstract: We present the first unified framework that jointly handles three operationally heterogeneous saliency tasks, namely SOD, CoSOD, and SIS, by casting each as a Chain-of-Thought (CoT) reasoning process in a Vision-Language Model (VLM) to bridge task heterogeneity. CoT training follows a two-stage paradigm: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). To enhance CoT quality in RL, we propose Confidence-Guided Policy Optimization (CGPO), a lightweight single-sample algorithm that leverages the discrepancy between reward and model confidence as a per-sample advantage signal. This design naturally focuses updates on informative responses while eliminating group sampling, thereby addressing GRPO's key limitations: confidence-agnostic learning, signal dilution, and prohibitive computational overhead. We also introduce an "output-to-reasoning" strategy to construct high-fidelity SFT data that ensures logical consistency with ground-truth masks. Experiments show our model matches or outperforms specialized SOTA methods and strong closed-source VLMs across all tasks, especially achieving an S-measure of 0.899 on CoCA for CoSOD, surpassing the prior best by 8.0 percentage points, despite using far less training data.
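
The abstract describes CGPO's advantage as the discrepancy between reward and model confidence, computed per sample without group sampling. A minimal sketch of that idea follows. The confidence definition is not given in this digest, so the geometric-mean token probability used below is an assumption, as are the reward range of [0, 1] and the function names.

```python
import math

def sequence_confidence(token_logprobs):
    """Assumed confidence: geometric mean of the sampled tokens' probabilities."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def cgpo_advantage(reward, token_logprobs):
    """Per-sample advantage as reward minus model confidence.

    Positive when the response did better than the model expected,
    negative when the model was confident but the reward was low.
    """
    return reward - sequence_confidence(token_logprobs)

# Toy usage: a response whose tokens each had probability 0.5.
logps = [math.log(0.5), math.log(0.5)]
adv_good = cgpo_advantage(1.0, logps)  # rewarded response
adv_bad = cgpo_advantage(0.0, logps)   # unrewarded response
```

Under this formulation a single sampled response is enough to produce a learning signal, which is what removes the need for the group of samples that GRPO uses to form a baseline.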

[139] LGCA: Enhancing Semantic Representation via Progressive Expansion

Thanh Hieu Cao,Trung Khang Tran,Gia Thinh Pham,Tuong Nghiem Diep,Thanh Binh Nguyen

Main category: cs.CV

TL;DR: Proposes Localized-Globalized Cross-Alignment (LGCA), a framework that combines local and global image features to improve zero-shot image classification while keeping time complexity low.

Details Motivation: Because models such as CLIP are sensitive to random crops, small image regions can introduce misleading information and bias, so a more robust way to exploit local detail without sacrificing global consistency is needed. Method: LGCA first captures local image features, then iteratively selects the most salient regions and expands them; the similarity score combines the original and expanded images to fuse local and global information. Result: Experiments show that LGCA substantially improves zero-shot classification across multiple datasets and outperforms state-of-the-art methods, and the theoretical analysis shows its time complexity matches that of the original model. Conclusion: LGCA balances the modeling of local detail and global structure, improving the accuracy and robustness of vision-language models on zero-shot tasks with good efficiency and scalability. Abstract: Recent advancements in large-scale pretraining in natural language processing have enabled pretrained vision-language models such as CLIP to effectively align images and text, significantly improving performance in zero-shot image classification tasks. Subsequent studies have further demonstrated that cropping images into smaller regions and using large language models to generate multiple descriptions for each caption can further enhance model performance. However, due to the inherent sensitivity of CLIP, random image crops can introduce misinformation and bias, as many images share similar features at small scales. To address this issue, we propose Localized-Globalized Cross-Alignment (LGCA), a framework that first captures the local features of an image and then repeatedly selects the most salient regions and expands them. The similarity score is designed to incorporate both the original and expanded images, enabling the model to capture both local and global features while minimizing misinformation. Additionally, we provide a theoretical analysis demonstrating that the time complexity of LGCA remains the same as that of the original model prior to the repeated expansion process, highlighting its efficiency and scalability. Extensive experiments demonstrate that our method substantially improves zero-shot performance across diverse datasets, outperforming state-of-the-art baselines.

[140] Leveraging Hierarchical Image-Text Misalignment for Universal Fake Image Detection

Daichi Zhang,Tong Zhang,Jianmin Bao,Shiming Ge,Sabine Süsstrunk

Main category: cs.CV

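TL;DR: This paper proposes ITEM, a multimodal method based on image-text misalignment for detecting generated fake images; it uses semantic inconsistency in the pre-trained CLIP space as a discriminative cue and shows good generalization and robustness.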

Details Motivation: Existing methods rely only on visual cues for fake image detection, so they tend to overfit to specific models and generalize poorly to unseen generators. This paper takes a multimodal view and explores the inconsistency between an image and its text description as a new detection signal. Method: The ITEM detector measures the misalignment between an image and its corresponding text in the joint vision-language space of pre-trained CLIP, adds a hierarchical misalignment scheme that captures global inconsistency at the whole-image level and fine-grained inconsistency at the semantic-object level, and uses an MLP head for detection. Result: Extensive experiments show that the method outperforms existing state-of-the-art detectors on a variety of recent generative models, with strong generalization and robustness. Conclusion: Image-text misalignment is an effective and generalizable cue for fake image detection, and ITEM shows potential for practical use. Abstract: With the rapid development of generative models, detecting generated fake images to prevent their malicious use has become a critical issue recently. Existing methods frame this challenge as a naive binary image classification task. However, such methods focus only on visual clues, yielding trained detectors susceptible to overfitting specific image patterns and incapable of generalizing to unseen models. In this paper, we address this issue from a multi-modal perspective and find that fake images cannot be properly aligned with corresponding captions compared to real images. Upon this observation, we propose a simple yet effective detector termed ITEM by leveraging the image-text misalignment in a joint visual-language space as discriminative clues. Specifically, we first measure the misalignment of the images and captions in pre-trained CLIP's space, and then tune an MLP head to perform the usual detection task. Furthermore, we propose a hierarchical misalignment scheme that first focuses on the whole image and then each semantic object described in the caption, which can explore both global and fine-grained local semantic misalignment as clues. Extensive experiments demonstrate the superiority of our method against other state-of-the-art competitors with impressive generalization and robustness on various recent generative models.

[141] Enhancing Frequency Forgery Clues for Diffusion-Generated Image Detection

Daichi Zhang,Tong Zhang,Shiming Ge,Sabine Süsstrunk

Main category: cs.CV

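TL;DR: Proposes an image detection method that enhances frequency forgery clues, improving generalization to images from unseen diffusion models and robustness to perturbations.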

Details Motivation: Existing detectors struggle to capture discriminative features across different diffusion models and settings, so their generalization and robustness fall short. Method: Based on an analysis of frequency-domain differences between natural and diffusion-generated images, the authors design a frequency-selective function that acts as a weighted filter on the Fourier spectrum and enhances forgery clues in the key frequency bands. Result: Experiments on multiple diffusion-generated image datasets show that the method outperforms state-of-the-art detectors with stronger generalization and robustness. Conclusion: The proposed F^2C representation detects images from unseen diffusion models and remains stable under various perturbations. Abstract: Diffusion models have achieved remarkable success in image synthesis, but the generated high-quality images raise concerns about potential malicious use. Existing detectors often struggle to capture discriminative clues across different models and settings, limiting their generalization to unseen diffusion models and robustness to various perturbations. To address this issue, we observe that diffusion-generated images exhibit progressively larger differences from natural real images across low- to high-frequency bands. Based on this insight, we propose a simple yet effective representation by enhancing the Frequency Forgery Clue (F^2C) across all frequency bands. Specifically, we introduce a frequency-selective function which serves as a weighted filter to the Fourier spectrum, suppressing less discriminative bands while enhancing more informative ones. This approach, grounded in a comprehensive analysis of frequency-based differences between natural real and diffusion-generated images, enables general detection of images from unseen diffusion models and provides robust resilience to various perturbations. Extensive experiments on various diffusion-generated image datasets demonstrate that our method outperforms state-of-the-art detectors with superior generalization and robustness.
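
The frequency-selective weighting described in the abstract can be illustrated with a small sketch. The abstract does not give the actual function, so `band_weight` below is a hypothetical choice (weight grows with radial frequency, following the observation that differences increase toward high-frequency bands). A naive DFT is used so that the example depends only on the standard library and is only practical for tiny grids.

```python
import cmath
import math

def dft2(grid, inverse=False):
    """Naive 2-D discrete Fourier transform; adequate for tiny illustrative grids."""
    h, w = len(grid), len(grid[0])
    sign = 1.0 if inverse else -1.0
    out = [[0j] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            total = 0j
            for y in range(h):
                for x in range(w):
                    angle = sign * 2.0 * math.pi * (u * y / h + v * x / w)
                    total += grid[y][x] * cmath.exp(1j * angle)
            out[u][v] = total / (h * w) if inverse else total
    return out

def radial_frequency(u, v, h, w):
    """Distance of bin (u, v) from the zero frequency, normalized to [0, 1]."""
    fu = min(u, h - u) / h
    fv = min(v, w - v) / w
    return math.hypot(fu, fv) / math.hypot(0.5, 0.5)

def band_weight(r, gamma=1.0):
    """Hypothetical frequency-selective function: suppress low bands, keep high bands."""
    return r ** gamma

def enhance_frequency_clues(image, gamma=1.0):
    """Weight the Fourier spectrum by band, then transform back to pixel space."""
    h, w = len(image), len(image[0])
    spectrum = dft2(image)
    weighted = [
        [spectrum[u][v] * band_weight(radial_frequency(u, v, h, w), gamma) for v in range(w)]
        for u in range(h)
    ]
    return [[z.real for z in row] for row in dft2(weighted, inverse=True)]
```

With this particular weight, a flat image (zero frequency only) is mapped to zeros, while a pixel-level checkerboard (highest frequency only) passes through unchanged. A learned or analytically derived weight would replace `band_weight` in a real detector.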

[142] ToxicTextCLIP: Text-Based Poisoning and Backdoor Attacks on CLIP Pre-training

Xin Yao,Haiyang Zhao,Yimin Chen,Jiawei Guo,Kecheng Huang,Ming Zhao

Main category: cs.CV

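TL;DR: This paper proposes ToxicTextCLIP, a framework that generates high-quality adversarial texts targeting the CLIP pre-training phase, exposing data poisoning and backdoor risks in the text modality.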

Details Motivation: Existing studies focus mostly on image-side attacks and overlook the equally important text modality; the main challenges are background inconsistency and the scarcity of background-consistent texts. Method: ToxicTextCLIP iterates two steps: 1) a background-aware selector picks texts whose background is consistent with the target class; 2) a background-driven augmenter generates semantically coherent and diverse poisoned samples. Result: On classification and retrieval tasks the method reaches up to 95.83% poisoning success and 98.68% backdoor Hit@1, and it bypasses defenses such as RoCLIP, CleanCLIP, and SafeCLIP. Conclusion: ToxicTextCLIP exposes CLIP's security vulnerabilities in the text modality and underlines the need for defenses against text-side attacks. Abstract: The Contrastive Language-Image Pretraining (CLIP) model has significantly advanced vision-language modeling by aligning image-text pairs from large-scale web data through self-supervised contrastive learning. Yet, its reliance on uncurated Internet-sourced data exposes it to data poisoning and backdoor risks. While existing studies primarily investigate image-based attacks, the text modality, which is equally central to CLIP's training, remains underexplored. In this work, we introduce ToxicTextCLIP, a framework for generating high-quality adversarial texts that target CLIP during the pre-training phase. The framework addresses two key challenges: semantic misalignment caused by background inconsistency with the target class, and the scarcity of background-consistent texts. To this end, ToxicTextCLIP iteratively applies: 1) a background-aware selector that prioritizes texts with background content aligned to the target class, and 2) a background-driven augmenter that generates semantically coherent and diverse poisoned samples. Extensive experiments on classification and retrieval tasks show that ToxicTextCLIP achieves up to 95.83% poisoning success and 98.68% backdoor Hit@1, while bypassing RoCLIP, CleanCLIP and SafeCLIP defenses. The source code can be accessed via https://github.com/xinyaocse/ToxicTextCLIP/.

[143] Weakly Supervised Pneumonia Localization from Chest X-Rays Using Deep Neural Network and Grad-CAM Explanations

Kiran Shahi,Anup Bagale

Main category: cs.CV

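TL;DR: Proposes a Grad-CAM-based weakly supervised deep learning framework for pneumonia classification and localization in chest X-rays that produces clinically meaningful heatmaps from image-level labels alone. Several pre-trained models perform well on the Kermany CXR dataset, with ResNet-18 and EfficientNet-B0 reaching 98% accuracy, and Grad-CAM visualizations confirm the clinical relevance of the regions the models attend to.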

Details Motivation: Avoid costly pixel-level annotation by using weak supervision for efficient pneumonia classification and localization, and improve the interpretability of, and clinical trust in, AI for medical imaging diagnosis. Method: Seven ImageNet-pretrained networks (ResNet, DenseNet, EfficientNet, and others) are trained under identical conditions with focal loss and patient-wise splits, and Grad-CAM generates explanatory heatmaps, giving classification and localization under weak supervision. Result: On the Kermany CXR dataset, ResNet-18 and EfficientNet-B0 reach 98% accuracy, ROC-AUC 0.997, and F1 0.987; MobileNet-V2 offers the best trade-off between accuracy and computational cost; Grad-CAM heatmaps show that the models focus on affected lung regions. Conclusion: Weakly supervised, explainable models can effectively support pneumonia screening and improve the transparency and clinical trustworthiness of AI-assisted diagnosis, with potential for practical use. Abstract: This study proposes a weakly supervised deep learning framework for pneumonia classification and localization from chest X-rays, utilizing Grad-CAM explanations. Instead of costly pixel-level annotations, our approach utilizes image-level labels to generate clinically meaningful heatmaps that highlight regions affected by pneumonia. We evaluate seven ImageNet-pretrained architectures (ResNet-18/50, DenseNet-121, EfficientNet-B0, MobileNet-V2/V3, and ViT-B16) under identical training conditions with focal loss and patient-wise splits to prevent data leakage. Experimental results on the Kermany CXR dataset demonstrate that ResNet-18 and EfficientNet-B0 achieve the best overall test accuracy of 98%, ROC-AUC = 0.997, and F1 = 0.987, while MobileNet-V2 provides an optimal trade-off between accuracy and computational cost. Grad-CAM visualizations confirm that the proposed models focus on clinically relevant lung regions, supporting the use of interpretable AI for radiological diagnostics. This work highlights the potential of weakly supervised explainable models that enhance pneumonia screening transparency and clinical trust in AI-assisted medical imaging. https://github.com/kiranshahi/pneumonia-analysis
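
The abstract mentions training with focal loss but gives no hyperparameters. As a reminder of what that loss computes, here is a minimal sketch of the standard binary focal loss, FL = -alpha_t * (1 - p_t)^gamma * log(p_t). The values gamma = 2.0 and alpha = 0.25 are common defaults assumed for illustration, not settings taken from this paper.

```python
import math

def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Focal loss for one prediction.

    p: predicted probability of the positive class (e.g. pneumonia)
    y: ground-truth label, 1 or 0
    gamma, alpha: assumed defaults; the paper's settings are not stated.
    """
    p_t = p if y == 1 else 1.0 - p
    p_t = min(max(p_t, eps), 1.0 - eps)  # guard against log(0)
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# Well-classified examples are down-weighted relative to hard ones.
print(binary_focal_loss(0.9, 1))  # easy positive
print(binary_focal_loss(0.1, 1))  # hard positive
```

Setting gamma to 0 removes the modulating factor and leaves a class-weighted cross-entropy, which is a convenient sanity check.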

[144] HumanCrafter: Synergizing Generalizable Human Reconstruction and Semantic 3D Segmentation

Panwang Pan,Tingting Shen,Chenxin Li,Yunlong Lin,Kairun Wen,Jingjing Zhao,Yixuan Yuan

Main category: cs.CV

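TL;DR: Proposes HumanCrafter, a unified feed-forward framework that combines geometric and semantic priors to achieve high-quality 3D human reconstruction and human-part segmentation from a single image.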

Details Motivation: Existing generative models achieve high fidelity in 3D human reconstruction, but their use in specific tasks such as 3D human segmentation is limited, and labeled 3D human data is scarce. Method: Human geometric priors are introduced for reconstruction and self-supervised semantic priors for segmentation, and an interactive annotation procedure generates high-quality labeled data; pixel-aligned aggregation and a multi-task objective optimize texture fidelity and semantic consistency. Result: The method outperforms the current state of the art on both single-image 3D human reconstruction and human-part segmentation. Conclusion: Through cross-task synergy and joint modeling, HumanCrafter achieves high-fidelity 3D human reconstruction and accurate semantic segmentation without complex post-processing. Abstract: Recent advances in generative models have achieved high fidelity in 3D human reconstruction, yet their utility for specific tasks (e.g., human 3D segmentation) remains constrained. We propose HumanCrafter, a unified framework that enables the joint modeling of appearance and human-part semantics from a single image in a feed-forward manner. Specifically, we integrate human geometric priors in the reconstruction stage and self-supervised semantic priors in the segmentation stage. To address the scarcity of labeled 3D human datasets, we further develop an interactive annotation procedure for generating high-quality data-label pairs. Our pixel-aligned aggregation enables cross-task synergy, while the multi-task objective simultaneously optimizes texture modeling fidelity and semantic consistency. Extensive experiments demonstrate that HumanCrafter surpasses existing state-of-the-art methods in both 3D human-part segmentation and 3D human reconstruction from a single image.

[145] Longitudinal Vestibular Schwannoma Dataset with Consensus-based Human-in-the-loop Annotations

Navodini Wijethilake,Marina Ivory,Oscar MacCormac,Siddhant Kumar,Aaron Kujawa,Lorena Garcia-Foncillas Macias,Rebecca Burger,Amanda Hitchings,Suki Thomson,Sinan Barazi,Eleni Maratos,Rupert Obholzer,Dan Jiang,Fiona McClenaghan,Kazumi Chia,Omar Al-Salihi,Nick Thomas,Steve Connor,Tom Vercauteren,Jonathan Shapey

Main category: cs.CV

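TL;DR: Proposes a deep-learning-based framework for iterative segmentation and quality refinement of vestibular schwannoma (VS) on MRI. By combining multi-centre data with expert-consensus annotations, it improves segmentation accuracy (DSC from 0.9125 to 0.9670), generalizes well on internal and external datasets, and is estimated to be about 37.4% more efficient than conventional manual annotation.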

Details Motivation: Accurate VS segmentation is essential for patient management but depends on time-consuming manual annotation; existing deep learning methods still face robustness challenges across datasets and in complex clinical cases. Method: A bootstrapped, iterative deep learning framework combines multi-centre MRI data with expert consensus to refine annotations, so that automated segmentation models generalize efficiently to the target data distribution. Result: DSC rises from 0.9125 to 0.9670 on the target internal validation dataset while performance stays stable on external datasets; expert evaluation of 143 scans identified complex cases that required manual intervention; efficiency improves by about 37.4% compared with conventional manual annotation. Conclusion: The human-in-the-loop training approach delivers accurate, generalizable automated VS segmentation with potential for use in diverse clinical settings, and the dataset is publicly available on TCIA. Abstract: Accurate segmentation of vestibular schwannoma (VS) on Magnetic Resonance Imaging (MRI) is essential for patient management but often requires time-intensive manual annotations by experts. While recent advances in deep learning (DL) have facilitated automated segmentation, challenges remain in achieving robust performance across diverse datasets and complex clinical cases. We present an annotated dataset stemming from a bootstrapped DL-based framework for iterative segmentation and quality refinement of VS in MRI. We combine data from multiple centres and rely on expert consensus for trustworthiness of the annotations. We show that our approach enables effective and resource-efficient generalisation of automated segmentation models to a target data distribution. The framework achieved a significant improvement in segmentation accuracy with a Dice Similarity Coefficient (DSC) increase from 0.9125 to 0.9670 on our target internal validation dataset, while maintaining stable performance on representative external datasets. Expert evaluation on 143 scans further highlighted areas for model refinement, revealing nuanced cases where segmentation required expert intervention. The proposed approach is estimated to enhance efficiency by approximately 37.4% compared to the conventional manual annotation process. Overall, our human-in-the-loop model training approach achieved high segmentation accuracy, highlighting its potential as a clinically adaptable and generalisable strategy for automated VS segmentation in diverse clinical settings.
The dataset includes 190 patients, with tumour annotations available for 534 longitudinal contrast-enhanced T1-weighted (T1CE) scans from 184 patients, and non-annotated T2-weighted scans from 6 patients. This dataset is publicly accessible on The Cancer Imaging Archive (TCIA) (https://doi.org/10.7937/bq0z-xa62).
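
The reported accuracy figures are Dice Similarity Coefficients. For readers unfamiliar with the metric, the sketch below computes DSC = 2|A ∩ B| / (|A| + |B|) for two binary masks given as flattened 0/1 lists. Treating two empty masks as a perfect match is a convention chosen here for illustration, not something stated in the abstract.

```python
def dice_coefficient(pred, target):
    """Dice Similarity Coefficient between two flattened binary masks (0/1 values)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention (an assumption): two empty masks count as perfect agreement.
    return 1.0 if total == 0 else 2.0 * intersection / total

predicted = [1, 1, 0, 0]
reference = [1, 0, 0, 0]
print(dice_coefficient(predicted, reference))
```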

[146] FedMGP: Personalized Federated Learning with Multi-Group Text-Visual Prompts

Weihao Bo,Yanpeng Sun,Yu Wang,Xinyu Zhang,Zechao Li

Main category: cs.CV

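TL;DR: This paper proposes FedMGP, a new method for personalized federated prompt learning in vision-language models. It captures fine-grained semantics with multiple groups of textual and visual prompts and uses a similarity-guided dynamic aggregation strategy, reaching state-of-the-art performance while staying parameter-efficient.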

Details Motivation: In federated learning it is hard to achieve model personalization and good domain generalization while protecting data privacy; existing methods struggle to balance knowledge sharing across clients with preserving local characteristics, and their communication overhead is high. Method: FedMGP gives each client multiple groups of paired textual and visual prompts and adds a diversity loss so that each group specializes in a different semantic aspect; during communication it aggregates prompts dynamically through softmax-weighted sampling based on cosine similarity, a soft selection mechanism, and it stays parameter-efficient by redistributing a fixed prompt capacity across groups. Result: Theoretical analysis shows that the dynamic aggregation strategy reinforces shared semantics and suppresses noise; experiments show that FedMGP outperforms prior methods on multiple federated vision-language benchmarks, particularly in personalization and domain generalization, with the fewest communicated parameters. Conclusion: With diversified prompt groups and dynamic aggregation, FedMGP balances global knowledge sharing against client-specific needs and reaches state-of-the-art performance at low communication cost, offering a new paradigm for federated prompt learning. Abstract: In this paper, we introduce FedMGP, a new paradigm for personalized federated prompt learning in vision-language models. FedMGP equips each client with multiple groups of paired textual and visual prompts, enabling the model to capture diverse, fine-grained semantic and instance-level cues. A diversity loss is introduced to drive each prompt group to specialize in distinct and complementary semantic aspects, ensuring that the groups collectively cover a broader range of local characteristics. During communication, FedMGP employs a dynamic prompt aggregation strategy based on similarity-guided probabilistic sampling: each client computes the cosine similarity between its prompt groups and the global prompts from the previous round, then samples s groups via a softmax-weighted distribution. This soft selection mechanism preferentially aggregates semantically aligned knowledge while still enabling exploration of underrepresented patterns, effectively balancing the preservation of common knowledge with client-specific features. Notably, FedMGP maintains parameter efficiency by redistributing a fixed prompt capacity across multiple groups, achieving state-of-the-art performance with the lowest communication parameters among all federated prompt learning methods. Theoretical analysis shows that our dynamic aggregation strategy promotes robust global representation learning by reinforcing shared semantics while suppressing client-specific noise.
Extensive experiments demonstrate that FedMGP consistently outperforms prior approaches in both personalization and domain generalization across diverse federated vision-language benchmarks. The code will be released on https://github.com/weihao-bo/FedMGP.git.
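
The similarity-guided sampling step in the abstract (cosine similarity to the previous round's global prompts, then a softmax-weighted draw of s groups) can be sketched as follows. Several details are assumptions made for illustration: each prompt group is flattened to one vector, the global prompts are summarized by a single vector, sampling is without replacement, and a `temperature` parameter is added. The function names are hypothetical.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sample_groups(local_groups, global_prompt, s, rng, temperature=1.0):
    """Pick s distinct prompt groups, favoring those aligned with the global prompt."""
    sims = [cosine(group, global_prompt) for group in local_groups]
    probs = softmax(sims, temperature)
    remaining = list(range(len(local_groups)))
    chosen = []
    for _ in range(s):
        weights = [probs[i] for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen, probs

rng = random.Random(0)
groups = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
chosen, probs = sample_groups(groups, [1.0, 0.0], s=2, rng=rng)
print(chosen, probs)
```

Because the draw is probabilistic and not a hard top-s selection, weakly aligned groups still have a non-zero chance of being uploaded, which matches the abstract's point about exploring underrepresented patterns.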

[147] Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models

Panwang Pan,Chenguo Lin,Jingjing Zhao,Chenxin Li,Yuchen Lin,Haopeng Li,Honglei Yan,Kairun Wen,Yunlong Lin,Yixuan Yuan,Yadong Mu

Main category: cs.CV

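TL;DR: Diff4Splat is a feed-forward method that generates controllable, explicit 4D scenes from a single image. It combines video diffusion models with geometry and motion constraints learned from 4D datasets and synthesizes high-quality results within 30 seconds.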

Details Motivation: Existing dynamic scene generation methods usually depend on test-time optimization or post-processing, which makes them inefficient. The goal is efficient, optimization-free, end-to-end 4D scene generation. Method: Diff4Splat augments a video diffusion model with a video latent transformer, unifies generative priors with geometry and motion constraints, and directly predicts a deformable 3D Gaussian field that encodes appearance, geometry, and motion. Result: On video generation, novel view synthesis, and geometry extraction, Diff4Splat matches or surpasses optimization-based methods in quality while needing only 30 seconds for inference, which is significantly more efficient. Conclusion: Diff4Splat enables fast, efficient single-image 4D scene generation without test-time optimization, performs well on several downstream tasks, and makes dynamic scene modeling more practical. Abstract: We introduce Diff4Splat, a feed-forward method that synthesizes controllable and explicit 4D scenes from a single image. Our approach unifies the generative priors of video diffusion models with geometry and motion constraints learned from large-scale 4D datasets. Given a single input image, a camera trajectory, and an optional text prompt, Diff4Splat directly predicts a deformable 3D Gaussian field that encodes appearance, geometry, and motion, all in a single forward pass, without test-time optimization or post-hoc refinement. At the core of our framework lies a video latent transformer, which augments video diffusion models to jointly capture spatio-temporal dependencies and predict time-varying 3D Gaussian primitives. Training is guided by objectives on appearance fidelity, geometric accuracy, and motion consistency, enabling Diff4Splat to synthesize high-quality 4D scenes in 30 seconds. We demonstrate the effectiveness of Diff4Splat across video generation, novel view synthesis, and geometry extraction, where it matches or surpasses optimization-based methods for dynamic scene synthesis while being significantly more efficient.

[148] VinDr-CXR-VQA: A Visual Question Answering Dataset for Explainable Chest X-Ray Analysis with Multi-Task Learning

Hai-Dang Nguyen,Ha-Hieu Pham,Hao T. Nguyen,Huy-Hieu Pham

Main category: cs.CV

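TL;DR: VinDr-CXR-VQA is a large-scale chest X-ray medical visual question answering dataset with spatial grounding and clinical explanations. It contains 17,597 question-answer pairs and aims to support reproducible, clinically reliable Med-VQA research.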

Details Motivation: Medical visual question answering lacks spatial annotations and clinical explanations; the dataset addresses this gap to support more reliable and explainable Med-VQA models. Method: The authors build a dataset with radiologist-verified bounding boxes and reasoning explanations, design six types of diagnostic questions, and balance the positive and negative samples to reduce hallucinations. Result: MedGemma-4B reaches F1 = 0.624, an 11.8% improvement over the baseline, and supports lesion localization. Conclusion: VinDr-CXR-VQA helps advance clinically trustworthy and reproducible medical visual question answering research. Abstract: We present VinDr-CXR-VQA, a large-scale chest X-ray dataset for explainable Medical Visual Question Answering (Med-VQA) with spatial grounding. The dataset contains 17,597 question-answer pairs across 4,394 images, each annotated with radiologist-verified bounding boxes and clinical reasoning explanations. Our question taxonomy spans six diagnostic types (Where, What, Is there, How many, Which, and Yes/No), capturing diverse clinical intents. To improve reliability, we construct a balanced distribution of 41.7% positive and 58.3% negative samples, mitigating hallucinations in normal cases. Benchmarking with MedGemma-4B-it demonstrates improved performance (F1 = 0.624, +11.8% over baseline) while enabling lesion localization. VinDr-CXR-VQA aims to advance reproducible and clinically grounded Med-VQA research. The dataset and evaluation tools are publicly available at huggingface.co/datasets/Dangindev/VinDR-CXR-VQA.

[149] OmniTrack++: Omnidirectional Multi-Object Tracking by Learning Large-FoV Trajectory Feedback

Kai Luo,Hao Shi,Kunyu Peng,Fei Teng,Sheng Wu,Kaiwei Wang,Kailun Yang

Main category: cs.CV

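TL;DR: This paper proposes OmniTrack++, a feedback-driven framework for Multi-Object Tracking (MOT) in panoramic imagery. Through dynamic feature stabilization, trajectory-guided instance matching, an expert memory mechanism, and adaptive tracklet management, it substantially improves tracking under a 360-degree field of view, and it releases the EmboTrack benchmark to advance panoramic MOT research.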

Details Motivation: Conventional MOT methods work well with narrow-FoV pinhole cameras, but 360-degree panoramic images bring a wide field of view, resolution dilution, and severe geometric distortion that degrade performance, so a tracking framework designed for panoramic imagery is needed. Method: OmniTrack++ uses a feedback-driven, progressively refining framework: 1) a DynamicSSM block stabilizes panoramic features to alleviate distortion; 2) FlexiTrack Instances use trajectory feedback for flexible localization and short-term association; 3) an ExpertTrack Memory consolidates appearance cues with a Mixture-of-Experts design for long-term robustness; 4) a Tracklet Management module adaptively switches between end-to-end and tracking-by-detection modes. The authors also build the EmboTrack benchmark, which contains two real-world datasets, QuadTrack and BipTrack. Result: Experiments on JRDB and EmboTrack show that OmniTrack++ improves HOTA over the original OmniTrack by 25.5% on JRDB and 43.07% on QuadTrack, clearly outperforming existing methods and confirming its effectiveness and robustness for panoramic MOT. Conclusion: By introducing trajectory feedback, feature normalization, and a memory mechanism, OmniTrack++ addresses distortion, identity drift, and long-term tracking in panoramic MOT, and together with the newly released EmboTrack benchmark it supports panoramic perception in real-world scenes. Abstract: This paper investigates Multi-Object Tracking (MOT) in panoramic imagery, which introduces unique challenges including a 360° Field of View (FoV), resolution dilution, and severe view-dependent distortions. Conventional MOT methods designed for narrow-FoV pinhole cameras generalize unsatisfactorily under these conditions. To address panoramic distortion, large search space, and identity ambiguity under a 360° FoV, OmniTrack++ adopts a feedback-driven framework that progressively refines perception with trajectory cues. A DynamicSSM block first stabilizes panoramic features, implicitly alleviating geometric distortion. On top of normalized representations, FlexiTrack Instances use trajectory-informed feedback for flexible localization and reliable short-term association. To ensure long-term robustness, an ExpertTrack Memory consolidates appearance cues via a Mixture-of-Experts design, enabling recovery from fragmented tracks and reducing identity drift. Finally, a Tracklet Management module adaptively switches between end-to-end and tracking-by-detection modes according to scene dynamics, offering a balanced and scalable solution for panoramic MOT. To support rigorous evaluation, we establish the EmboTrack benchmark, a comprehensive dataset for panoramic MOT that includes QuadTrack, captured with a quadruped robot, and BipTrack, collected with a bipedal wheel-legged robot.
Together, these datasets span wide-angle environments and diverse motion patterns, providing a challenging testbed for real-world panoramic perception. Extensive experiments on JRDB and EmboTrack demonstrate that OmniTrack++ achieves state-of-the-art performance, yielding substantial HOTA improvements of +25.5% on JRDB and +43.07% on QuadTrack over the original OmniTrack. Datasets and code will be made publicly available at https://github.com/xifen523/OmniTrack.

[150] ID-Composer: Multi-Subject Video Synthesis with Hierarchical Identity Preservation

Panwang Pan,Jingjing Zhao,Yuchen Lin,Chenguo Lin,Chenxin Li,Haopeng Li,Honglei Yan,Tingting Shen,Yadong Mu

Main category: cs.CV

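TL;DR: Proposes ID-Composer, a framework for multi-subject video generation from a text prompt and reference images. Using a hierarchical identity-preserving attention mechanism, semantic understanding from a vision-language model, and online reinforcement learning, it outperforms existing methods in identity preservation, temporal consistency, and video quality.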

Details Motivation: Existing video generation models are mostly conditioned on text or a single image and lack control over multi-subject identity preservation and semantic integration, which limits their controllability and range of use. Method: ID-Composer uses a hierarchical identity-preserving attention mechanism to fuse multi-subject and multimodal features, adds a pretrained vision-language model to strengthen semantic understanding, and introduces an online reinforcement learning phase to optimize the training objective. Result: Experiments show that ID-Composer outperforms existing methods in identity preservation, temporal consistency, and generated video quality. Conclusion: ID-Composer addresses identity preservation and semantic control in multi-subject video generation and markedly improves generation quality and alignment with user intent. Abstract: Video generative models pretrained on large-scale datasets can produce high-quality videos, but are often conditioned on text or a single image, limiting controllability and applicability. We introduce ID-Composer, a novel framework that addresses this gap by tackling multi-subject video generation from a text prompt and reference images. This task is challenging as it requires preserving subject identities, integrating semantics across subjects and modalities, and maintaining temporal consistency. To faithfully preserve the subject consistency and textual information in synthesized videos, ID-Composer designs a hierarchical identity-preserving attention mechanism, which effectively aggregates features within and across subjects and modalities. To follow the semantics of user intent, we introduce semantic understanding via a pretrained vision-language model (VLM), leveraging the VLM's superior semantic understanding to provide fine-grained guidance and capture complex interactions between multiple subjects. Considering that standard diffusion loss often fails to align critical concepts like subject ID, we employ an online reinforcement learning phase to drive the overall training objective of ID-Composer into RLVR. Extensive experiments demonstrate that our model surpasses existing methods in identity preservation, temporal consistency, and video quality.

[151] SegDebias: Test-Time Bias Mitigation for ViT-Based CLIP via Segmentation

Fangyu Wu,Yujun Cai

Main category: cs.CV

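TL;DR: Proposes a test-time debiasing method that needs no additional training or bias annotations. It uses a pretrained segmentation model to isolate the target visual attribute and adjusts the embeddings of non-target regions to remove bias signals from confounding regions.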

Details Motivation: Existing debiasing methods usually need training data and explicit group labels, which limits practical use, while test-time methods depend on prior knowledge of dataset-specific biases and generalize poorly. Method: A pretrained segmentation model separates the target visual attribute; the image embeddings of non-target regions are adjusted to be uniformly similar to all class text prompts, which removes bias signals from confounding regions while preserving the target attribute. Result: On Waterbirds and CelebA the method outperforms existing test-time debiasing methods in both group robustness metrics and Attention IoU. Conclusion: Segmentation-guided intervention enables scalable, annotation-free debiasing of vision-language models and has good prospects for practical use. Abstract: Vision-language models such as CLIP have shown remarkable performance in zero-shot classification, but remain susceptible to spurious correlations, where irrelevant visual features influence predictions. Existing debiasing methods often require access to training data and explicit group labels to perform fine-tuning or adjust embeddings, which limits their practicality in real-world settings. Test-time methods attempt to avoid this constraint, but many still depend on prior knowledge of dataset-specific biases, limiting their generalizability in open-set settings. In this work, we propose a test-time debiasing method for ViT-based CLIP models that requires no additional training or assumptions of bias annotations. Our approach uses a pretrained segmentation model to isolate the target visual attribute, then adjusts the non-target regions so that their embeddings are uniformly similar to all class-specific text prompts. This procedure removes unintended bias signals from confounding visual regions while preserving the target attribute. Experiments on Waterbirds and CelebA show that our method outperforms existing test-time debiasing approaches in both group robustness metrics and Attention IoU. These results demonstrate the effectiveness of segmentation-guided interventions for scalable and annotation-free bias mitigation in vision-language models.

[152] Text-guided Fine-Grained Video Anomaly Detection

Jihao Gu,Kun Li,He Wang,Kaan Akşit

Main category: cs.CV

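TL;DR: This paper proposes T-VAD, a fine-grained video anomaly detection framework built on a large vision-language model. With an anomaly heatmap decoder and a region-aware encoder it provides pixel-level anomaly localization and textual descriptions, and it reaches SOTA performance on multiple datasets.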

Details Motivation: Existing video anomaly detection methods are mostly semi-automated and produce coarse outputs; they lack granularity and interpretability and fall short of practical needs. Method: The T-VAD framework builds on a large vision-language model. An Anomaly Heatmap Decoder (AHD) aligns visual and textual features pixel by pixel to generate heatmaps, and a Region-aware Anomaly Encoder (RAE) converts the heatmaps into learnable textual embeddings that guide the model to identify and localize anomalous events. Result: On the UBnormal dataset the method reaches 94.8% AUC and 67.8%/76.7% heatmap accuracy (RBDC/TBDC); on the ShanghaiTech-based and UBnormal datasets the generated textual descriptions achieve higher BLEU-4 scores and Yes/No accuracy, giving strong quantitative and qualitative results. Conclusion: T-VAD markedly improves the granularity and interactivity of video anomaly detection, providing both precise anomaly localization and natural-language descriptions, and moves anomaly detection toward more intelligent and interpretable systems. Abstract: Video Anomaly Detection (VAD) aims to identify anomalous events within video segments. In scenarios such as surveillance or industrial process monitoring, anomaly detection is of critical importance. While existing approaches are semi-automated, requiring human assessment for anomaly detection, traditional VADs offer limited output as either normal or anomalous. We propose Text-guided Fine-Grained Video Anomaly Detection (T-VAD), a framework built upon a Large Vision-Language Model (LVLM). T-VAD introduces an Anomaly Heatmap Decoder (AHD) that performs pixel-wise visual-textual feature alignment to generate fine-grained anomaly heatmaps. Furthermore, we design a Region-aware Anomaly Encoder (RAE) that transforms the heatmaps into learnable textual embeddings, guiding the LVLM to accurately identify and localize anomalous events in videos. This significantly enhances both the granularity and interactivity of anomaly detection. The proposed method achieves SOTA performance, with 94.8% Area Under the Curve (AUC, specifically micro-AUC) and 67.8%/76.7% accuracy in anomaly heatmaps (RBDC/TBDC) on the UBnormal dataset, and produces textual descriptions that were subjectively verified as preferable on the ShanghaiTech-based dataset (BLEU-4: 62.67 for targets, 88.84 for trajectories; Yes/No accuracy: 97.67%) and on the UBnormal dataset (BLEU-4: 50.32 for targets, 78.10 for trajectories; Yes/No accuracy: 89.73%).

[153] Real-IAD Variety: Pushing Industrial Anomaly Detection Dataset to a Modern Era

Wenbing Zhu,Chengjie Wang,Bin-Bin Gao,Jiangning Zhang,Guannan Jiang,Jie Hu,Zhenye Gan,Lidong Wang,Ziqing Zhou,Linjie Cheng,Yurui Pan,Bo Peng,Mingmin Chi,Lizhuang Ma

Main category: cs.CV

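TL;DR: This paper presents Real-IAD Variety, currently the largest and most diverse Industrial Anomaly Detection (IAD) benchmark, with 198,960 high-resolution images across 160 object categories covering 28 industries, 24 material types, and 22 color variations. Experiments show that state-of-the-art multi-class unsupervised anomaly detection methods degrade significantly when scaled to 160 categories, whereas vision-language models are more robust and generalize better. The dataset is a key resource for training and evaluating next-generation anomaly detection foundation models, supports rigorous evaluation in multi-class unsupervised, multi-view, and zero-/few-shot settings, and aims to advance scalable, general-purpose anomaly detection systems.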

Details Motivation: Existing Industrial Anomaly Detection (IAD) datasets have limited category diversity and scale, which leads to metric saturation during evaluation and models that transfer poorly to real scenarios, holding back general-purpose anomaly detection. A larger and richer benchmark is needed. Method: The authors build a new benchmark, Real-IAD Variety, with 198,960 high-resolution images covering 160 object categories, 28 industries, 24 material types, and 22 color variations. They systematically evaluate existing methods in multi-class unsupervised, multi-view, and zero-/few-shot settings, analyze performance at different category scales, and compare against vision-language models. Result: State-of-the-art multi-class unsupervised anomaly detection methods degrade significantly as the number of categories grows from 30 to 160, whereas vision-language models stay stable across category counts and show strong robustness and cross-category generalization. Its scale and complexity make Real-IAD Variety a more challenging benchmark. Conclusion: Real-IAD Variety is currently the largest and most comprehensive IAD benchmark and can assess the scalability and generalization of models in complex industrial environments. The work supports the shift from domain-specific models to general, scalable foundation models, and the dataset will be released publicly. Abstract: Industrial Anomaly Detection (IAD) is critical for enhancing operational safety, ensuring product quality, and optimizing manufacturing efficiency across global industries. However, IAD algorithms are severely constrained by the limitations of existing public benchmarks. Current datasets exhibit restricted category diversity and insufficient scale, frequently resulting in metric saturation and limited model transferability to real-world scenarios. To address this gap, we introduce Real-IAD Variety, the largest and most diverse IAD benchmark, comprising 198,960 high-resolution images across 160 distinct object categories. Its diversity is ensured through comprehensive coverage of 28 industries, 24 material types, and 22 color variations. Our comprehensive experimental analysis validates the benchmark's substantial challenge: state-of-the-art multi-class unsupervised anomaly detection methods experience significant performance degradation when scaled from 30 to 160 categories. Crucially, we demonstrate that vision-language models exhibit remarkable robustness to category scale-up, with minimal performance variation across different category counts, significantly enhancing generalization capabilities in diverse industrial contexts. The unprecedented scale and complexity of Real-IAD Variety position it as an essential resource for training and evaluating next-generation foundation models for anomaly detection.
By providing this comprehensive benchmark with rigorous evaluation protocols across multi-class unsupervised, multi-view, and zero-/few-shot settings, we aim to accelerate research beyond domain-specific constraints, enabling the development of scalable, general-purpose anomaly detection systems. Real-IAD Variety will be made publicly available to facilitate innovation in this critical field.

[154] MIFO: Learning and Synthesizing Multi-Instance from One Image

Kailun Su,Ziqi He,Xi Wang,Yang Zhou

Main category: cs.CV

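TL;DR: This paper proposes a method for precisely learning and synthesizing multi-instance semantics from a single image. Penalty-based attention optimization and optimized box control in the attention layers address semantic entanglement and layout control.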

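Details Motivation: Training data is limited and instances can be similar in semantics or appearance, which makes multi-instance semantic learning and synthesis difficult, so a method is needed that can disentangle similar semantics and control layout precisely. Method: Penalty-based attention optimization disentangles similar semantics during learning, and box control is introduced and optimized in the attention layers during synthesis to reduce semantic leakage and control the output layout precisely. Result: Experiments show that the method strikes a good balance among semantic disentanglement, generation quality, editability, and instance consistency, and stays robust for semantically or visually similar instances and for rare objects. Conclusion: The method learns and synthesizes multi-instance semantics from a single image and markedly improves editing precision and consistency in complex scenes. Abstract: This paper proposes a method for precisely learning and synthesizing multi-instance semantics from a single image. The difficulty of this problem lies in the limited training data, and it becomes even more challenging when the instances to be learned have similar semantics or appearance. To address this, we propose a penalty-based attention optimization to disentangle similar semantics during the learning stage. Then, in the synthesis, we introduce and optimize box control in attention layers to further mitigate semantic leakage while precisely controlling the output layout. Experimental results demonstrate that our method achieves disentangled and high-quality semantic learning and synthesis, strikingly balancing editability and instance consistency. Our method remains robust when dealing with semantically or visually similar instances or rarely seen objects. The code is publicly available at https://github.com/Kareneveve/MIFO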

[155] 4D Neural Voxel Splatting: Dynamic Scene Rendering with Voxelized Guassian Splatting

Chun-Tin Wu,Jun-Cheng Chen

Main category: cs.CV

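TL;DR: Proposes 4D Neural Voxel Splatting (4D-NVS), which combines voxel representations with neural Gaussian splatting for efficient dynamic scene modeling, substantially reducing memory consumption and speeding up training while maintaining high-quality rendering.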

Details Motivation: 3D Gaussian Splatting incurs large memory overhead in dynamic scenes because Gaussians are replicated for every frame, so a more efficient dynamic modeling method is needed. Method: A compact set of neural voxels with learned deformation fields models temporal dynamics, avoiding a separate Gaussian set per timestamp; a novel view refinement stage optimizes hard-to-render viewpoints. Result: Experiments show that the method outperforms state-of-the-art approaches in memory footprint, training speed, and rendering quality, supporting real-time rendering with higher visual fidelity. Conclusion: 4D-NVS uses neural voxels and deformation fields to achieve efficient, high-quality novel view synthesis for dynamic scenes, offering a practical solution for large-scale dynamic scenes. Abstract: Although 3D Gaussian Splatting (3D-GS) achieves efficient rendering for novel view synthesis, extending it to dynamic scenes still results in substantial memory overhead from replicating Gaussians across frames. To address this challenge, we propose 4D Neural Voxel Splatting (4D-NVS), which combines voxel-based representations with neural Gaussian splatting for efficient dynamic scene modeling. Instead of generating separate Gaussian sets per timestamp, our method employs a compact set of neural voxels with learned deformation fields to model temporal dynamics. The design greatly reduces memory consumption and accelerates training while preserving high image quality. We further introduce a novel view refinement stage that selectively improves challenging viewpoints through targeted optimization, maintaining global efficiency while enhancing rendering quality for difficult viewing angles. Experiments demonstrate that our method outperforms state-of-the-art approaches with significant memory reduction and faster training, enabling real-time rendering with superior visual fidelity.

[156] Generalized Category Discovery under Domain Shift: A Frequency Domain Perspective

Wei Feng,Zongyuan Ge

Main category: cs.CV

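TL;DR: This paper introduces a new task, Domain-Shifted Generalized Category Discovery (DS_GCD), and designs FREE, a frequency-domain framework that combines frequency-based domain separation and perturbation strategies, extended contrastive and clustering losses, and hard-sample resampling to improve the discovery of known and unknown categories under distribution shift.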

Details Motivation: Existing generalized category discovery methods perform well under standard conditions but degrade under distribution shift. This paper targets a more realistic setting in which the unlabeled data contains not only unknown categories but also samples from unknown domains, i.e., Domain-Shifted Generalized Category Discovery (DS_GCD). Method: The frequency-guided FREE framework is proposed: 1) known/unknown domain separation based on differences in frequency-domain amplitude; 2) cross-domain and intra-domain frequency perturbation strategies to improve generalization and robustness; 3) an extended self-supervised contrastive objective and semantic clustering loss; 4) a clustering-difficulty-aware resampling mechanism. Result: Experiments on multiple benchmark datasets show that the method mitigates the impact of distribution shift and outperforms existing methods in discovering both known and unknown categories. Conclusion: By introducing frequency-domain information and domain-aware strategies, FREE markedly improves generalized category discovery in complex real-world scenarios and offers a new way to handle distribution shift. Abstract: Generalized Category Discovery (GCD) aims to leverage labeled samples from known categories to cluster unlabeled data that may include both known and unknown categories. While existing methods have achieved impressive results under standard conditions, their performance often deteriorates in the presence of distribution shifts. In this paper, we explore a more realistic task: Domain-Shifted Generalized Category Discovery (DS_GCD), where the unlabeled data includes not only unknown categories but also samples from unknown domains. To tackle this challenge, we propose a Frequency-guided Generalized Category Discovery framework (FREE) that enhances the model's ability to discover categories under distributional shift by leveraging frequency-domain information. Specifically, we first propose a frequency-based domain separation strategy that partitions samples into known and unknown domains by measuring their amplitude differences. We then propose two types of frequency-domain perturbation strategies: a cross-domain strategy, which adapts to new distributions by exchanging amplitude components across domains, and an intra-domain strategy, which enhances robustness to intra-domain variations within the unknown domain. Furthermore, we extend the self-supervised contrastive objective and semantic clustering loss to better guide the training process. Finally, we introduce a clustering-difficulty-aware resampling technique to adaptively focus on harder-to-cluster categories, further enhancing model performance. 
Extensive experiments demonstrate that our method effectively mitigates the impact of distributional shifts across various benchmark datasets and achieves superior performance in discovering both known and unknown categories.
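
The cross-domain strategy described above exchanges amplitude components across domains. The following is a minimal sketch of that general idea, not the authors' implementation: it assumes single-channel 2D images and blends the whole amplitude spectrum with a hypothetical mixing weight `alpha`, while keeping the phase of the source image. The function name and parameters are illustrative only.

```python
import numpy as np


def exchange_amplitude(source: np.ndarray, target: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Return `source` with its Fourier amplitude blended toward that of `target`.

    `alpha` is a hypothetical mixing weight (not from the paper):
    0.0 keeps the source amplitude, 1.0 replaces it with the target amplitude.
    The phase of `source` is always preserved.
    """
    src_fft = np.fft.fft2(source)
    tgt_fft = np.fft.fft2(target)

    # Split the source spectrum into amplitude and phase.
    src_amp, src_phase = np.abs(src_fft), np.angle(src_fft)
    tgt_amp = np.abs(tgt_fft)

    # Blend amplitudes, then recombine with the original source phase.
    mixed_amp = (1.0 - alpha) * src_amp + alpha * tgt_amp
    mixed_fft = mixed_amp * np.exp(1j * src_phase)

    # For real inputs the result is real up to floating-point error.
    return np.real(np.fft.ifft2(mixed_fft))
```

Keeping the phase and swapping only the amplitude is a common choice because phase tends to carry spatial structure while amplitude carries more of the global appearance; whether FREE swaps the full spectrum or only selected frequency bands is not stated in the abstract.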

[157] TRACES: Temporal Recall with Contextual Embeddings for Real-Time Video Anomaly Detection

Yousuf Ahmed Siddiqui,Sufiyaan Usmani,Umer Tariq,Jawwad Ahmed Shamsi,Muhammad Burhan Khan

Main category: cs.CV

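TL;DR: Proposes a memory-augmented, context-aware zero-shot video anomaly detection method that fuses temporal and visual features with cross-attention and uses contextual similarity scoring for accurate real-time anomaly detection.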

Details Motivation: Existing anomaly detection methods usually ignore contextual information and therefore generalize poorly to new scenes; a zero-shot detection model is needed that adapts to new events and understands context. Method: A memory-augmented detection pipeline is built that uses cross-attention to correlate temporal signals with visual embeddings and combines them with textual memory cues for real-time zero-shot anomaly classification, with detection driven by contextual similarity scoring. Result: The method reaches 90.4% AUC on UCF-Crime and 83.67% AP on XD-Violence, both state of the art among zero-shot models, while supporting real-time inference and high explainability. Conclusion: Combining cross-attention temporal modeling with contextual memory effectively improves zero-shot anomaly detection and advances its use in real-world surveillance and infrastructure monitoring. Abstract: Video anomalies often depend on the available contextual information and on temporal evolution. An action that is non-anomalous in one context can be anomalous in another. Most anomaly detectors, however, do not notice this type of context, which seriously limits their capability to generalize to new, real-life situations. Our work addresses the context-aware zero-shot anomaly detection challenge, in which systems need to learn adaptively to detect new events by correlating temporal and appearance features with textual traces of memory in real time. Our approach defines a memory-augmented pipeline, correlating temporal signals with visual embeddings using cross-attention, and real-time zero-shot anomaly classification by contextual similarity scoring. We achieve 90.4% AUC on UCF-Crime and 83.67% AP on XD-Violence, a new state-of-the-art among zero-shot models. Our model achieves real-time inference with high precision and explainability for deployment. We show that, by fusing cross-attention temporal fusion and contextual memory, we achieve high fidelity anomaly detection, a step towards the applicability of zero-shot models in real-world surveillance and infrastructure monitoring.

[158] CueBench: Advancing Unified Understanding of Context-Aware Video Anomalies in Real-World

Yating Yu,Congqi Cao,Zhaoying Wang,Weihua Meng,Jie Li,Yuxin Li,Zihao Wei,Zhongpei Shen,Jiajun Zhang

Main category: cs.CV

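TL;DR: This paper introduces CueBench, the first unified evaluation benchmark for context-aware video anomaly understanding (VAU), builds an event-centric hierarchical taxonomy, and proposes Cue-R1, a model fine-tuned with reinforcement learning that substantially outperforms existing methods.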

Details Motivation: Existing video anomaly understanding methods stop at surface-level detection or simple explanation and lack a deep grasp of complex situations and subtle contextual differences, so they struggle with real-world anomaly recognition. Method: The CueBench benchmark is proposed, with a hierarchical taxonomy covering conditional and absolute anomaly events; a unified evaluation framework spans recognition, temporal grounding, detection, and anticipation tasks; and Cue-R1, a generative model based on R1-style reinforcement fine-tuning, is developed with verifiable, task-aligned, and hierarchy-refined rewards. Result: Experiments on CueBench show that existing vision-language models perform poorly, while Cue-R1 surpasses state-of-the-art methods by over 24% on average. Conclusion: Current deep models still fall well short of real-world video anomaly understanding; CueBench provides a more rigorous and comprehensive evaluation platform and advances context-aware VAU. Abstract: How far are deep models from real-world video anomaly understanding (VAU)? Current works typically emphasize detecting unexpected occurrences that deviate from normal patterns or comprehending anomalous events with interpretable descriptions. However, they exhibit only a superficial comprehension of real-world anomalies, with limited breadth in complex principles and subtle context that distinguish the anomalies from normalities, e.g., climbing cliffs with safety gear vs. without it. To this end, we introduce CueBench, the first-of-its-kind Benchmark devoted to Context-aware video anomalies within a Unified Evaluation framework. We comprehensively establish an event-centric hierarchical taxonomy that anchors two core event types: 14 conditional and 18 absolute anomaly events, defined by their refined semantics from diverse contexts across 174 scenes and 198 attributes. Based on this, we propose to unify and benchmark context-aware VAU with various challenging tasks across recognition, temporal grounding, detection, and anticipation. This also serves as a rigorous and fair probing evaluation suite for generative-discriminative as well as generalized-specialized vision-language models (VLMs). To address the challenges underlying CueBench, we further develop Cue-R1 based on R1-style reinforcement fine-tuning with verifiable, task-aligned, and hierarchy-refined rewards in a unified generative manner. Extensive results on CueBench reveal that existing VLMs are still far from satisfactory real-world anomaly understanding, while our Cue-R1 surpasses these state-of-the-art approaches by over 24% on average.

[159] Grounding Surgical Action Triplets with Instrument Instance Segmentation: A Dataset and Target-Aware Fusion Approach

Oluwatosin Alabi,Meng Wei,Charlie Budd,Tom Vercauteren,Miaojing Shi

Main category: cs.CV

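TL;DR: This paper proposes grounding surgical action triplets with instrument instance segmentation (triplet segmentation); with the newly built CholecTriplet-Seg dataset and the proposed TargetFusionNet model, it achieves precise spatial localization of <instrument, action, target> triplets in surgical scenes.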

Details Motivation: Existing surgical action recognition methods are limited to frame-level classification and cannot reliably associate actions with specific instrument instances, and the class activation maps they rely on lack the precision and robustness needed for fine-grained analysis of instrument-tissue interaction. Method: The task of "triplet segmentation" is proposed, the CholecTriplet-Seg dataset with over 30,000 annotated frames is built, and TargetFusionNet is designed, extending Mask2Former with a target-aware fusion mechanism that combines weak anatomy priors with instrument instance queries to improve anatomical target prediction. Result: TargetFusionNet outperforms existing baselines on recognition, detection, and triplet segmentation metrics, confirming that strong instance supervision combined with weak target priors markedly improves the accuracy and robustness of surgical action understanding. Conclusion: Triplet segmentation provides a unified spatial grounding framework for surgical action understanding, and the proposed dataset, task, and model advance more interpretable surgical scene understanding. Abstract: Understanding surgical instrument-tissue interactions requires not only identifying which instrument performs which action on which anatomical target, but also grounding these interactions spatially within the surgical scene. Existing surgical action triplet recognition methods are limited to learning from frame-level classification, failing to reliably link actions to specific instrument instances. Previous attempts at spatial grounding have primarily relied on class activation maps, which lack the precision and robustness required for detailed instrument-tissue interaction analysis. To address this gap, we propose grounding surgical action triplets with instrument instance segmentation, or triplet segmentation for short, a new unified task which produces spatially grounded outputs. We start by presenting CholecTriplet-Seg, a large-scale dataset containing over 30,000 annotated frames, linking instrument instance masks with action verb and anatomical target annotations, and establishing the first benchmark for strongly supervised, instance-level triplet grounding and evaluation. To learn triplet segmentation, we propose TargetFusionNet, a novel architecture that extends Mask2Former with a target-aware fusion mechanism to address the challenge of accurate anatomical target prediction by fusing weak anatomy priors with instrument instance queries. Evaluated across recognition, detection, and triplet segmentation metrics, TargetFusionNet consistently improves performance over existing baselines, demonstrating that strong instance supervision combined with weak target priors significantly enhances the accuracy and robustness of surgical action understanding. Triplet segmentation establishes a unified framework for spatially grounding surgical action triplets. The proposed benchmark and architecture pave the way for more interpretable surgical scene understanding.

[160] Benchmarking individual tree segmentation using multispectral airborne laser scanning data: the FGI-EMIT dataset

Lassi Ruoppa,Tarmo Hietala,Verneri Seppänen,Josef Taher,Teemu Hakala,Xiaowei Yu,Antero Kukko,Harri Kaartinen,Juha Hyyppä

Main category: cs.CV

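TL;DR: This study presents FGI-EMIT, the first large-scale multispectral airborne laser scanning benchmark dataset for individual tree segmentation (ITS) in point clouds, and compares conventional unsupervised algorithms with deep learning methods. Deep learning models such as ForestFormer3D clearly outperform conventional methods, especially for understory trees, but current deep learning methods do not yet make effective use of multispectral reflectance.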

Details Motivation: 由于缺乏大规模基准数据集和多光谱LiDAR数据的有限可用性,单木分割方法的发展受到限制,因此需要一个包含多光谱信息的大规模标注数据集来推动该领域发展。 Method: 提出FGI-EMIT数据集,包含1,561棵人工标注的树木,覆盖532、905和1,550 nm波长;对比四种无监督算法和四种深度学习模型,使用贝叶斯优化无监督方法超参数,深度学习模型从零训练,并进行消融实验和点密度分析。 Result: 无监督方法中Treeiso取得最高F1分数52.7%;深度学习方法整体表现更好,ForestFormer3D达到73.3%的F1分数,在understory树木上比Treeiso高25.9个百分点;当前深度学习模型未能有效利用多光谱反射率信息,但单通道反射率可略微提升精度;即使在低至10点/m²的情况下,深度学习仍优于无监督方法。 Conclusion: FGI-EMIT为单木分割提供了重要的多光谱基准数据集,深度学习方法在ITS任务中显著优于传统方法,尤其是在复杂林下环境中,未来需探索如何更好地融合多光谱反射率信息以进一步提升性能。 Abstract: Individual tree segmentation (ITS) from LiDAR point clouds is fundamental for applications such as forest inventory, carbon monitoring and biodiversity assessment. Traditionally, ITS has been achieved with unsupervised geometry-based algorithms, while more recent advances have shifted toward supervised deep learning (DL). In the past, progress in method development was hindered by the lack of large-scale benchmark datasets, and the availability of novel data formats, particularly multispectral (MS) LiDAR, remains limited to this day, despite evidence that MS reflectance can improve the accuracy of ITS. This study introduces FGI-EMIT, the first large-scale MS airborne laser scanning benchmark dataset for ITS. Captured at wavelengths 532, 905, and 1,550 nm, the dataset consists of 1,561 manually annotated trees, with a particular focus on small understory trees. Using FGI-EMIT, we comprehensively benchmarked four conventional unsupervised algorithms and four supervised DL approaches. Hyperparameters of unsupervised methods were optimized using a Bayesian approach, while DL models were trained from scratch. Among the unsupervised methods, Treeiso achieved the highest test set F1-score of 52.7%. The DL approaches performed significantly better overall, with the best model, ForestFormer3D, attaining an F1-score of 73.3%. The most significant difference was observed in understory trees, where ForestFormer3D exceeded Treeiso by 25.9 percentage points. 
An ablation study demonstrated that current DL-based approaches generally fail to leverage MS reflectance information when it is provided as additional input features, although single channel reflectance can improve accuracy marginally, especially for understory trees. A performance analysis across point densities further showed that DL methods consistently remain superior to unsupervised algorithms, even at densities as low as 10 points/m$^2$.

[161] Metadata-Aligned 3D MRI Representations for Contrast Understanding and Quality Control

Mehmet Yigit Avci,Pedro Borges,Virginia Fernandez,Paul Wright,Mehmet Yigitsoy,Sebastien Ourselin,Jorge Cardoso

Main category: cs.CV

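TL;DR: MR-CLIP is a metadata-guided framework that learns a unified representation of MRI contrast by aligning volumetric MRI images with their DICOM acquisition parameters; it is suited to few-shot classification and unsupervised quality control.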

Details Motivation: MRI data is heterogeneous and lacks standardized contrast labels across scanners, protocols, and institutions, which limits large-scale automated analysis; a unified MRI contrast representation that does not depend on manual annotation is therefore needed. Method: The MR-CLIP framework uses DICOM acquisition parameters as the supervisory signal and applies contrastive learning to align volumetric MRI images with their metadata, learning contrast representations. Result: MR-CLIP outperforms supervised 3D baselines in few-shot sequence classification, its embeddings form distinct clusters by sequence, and image-metadata embedding distances enable unsupervised data quality control. Conclusion: By using routinely available acquisition metadata as a supervisory signal, MR-CLIP provides a scalable foundation for label-efficient MRI analysis across diverse clinical datasets. Abstract: Magnetic Resonance Imaging suffers from substantial data heterogeneity and the absence of standardized contrast labels across scanners, protocols, and institutions, which severely limits large-scale automated analysis. A unified representation of MRI contrast would enable a wide range of downstream utilities, from automatic sequence recognition to harmonization and quality control, without relying on manual annotations. To this end, we introduce MR-CLIP, a metadata-guided framework that learns MRI contrast representations by aligning volumetric images with their DICOM acquisition parameters. The resulting embeddings show distinct clusters of MRI sequences and outperform supervised 3D baselines under data scarcity in few-shot sequence classification. Moreover, MR-CLIP enables unsupervised data quality control by identifying corrupted or inconsistent metadata through image-metadata embedding distances. By transforming routinely available acquisition metadata into a supervisory signal, MR-CLIP provides a scalable foundation for label-efficient MRI analysis across diverse clinical datasets.
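
Quality control via image-metadata embedding distances can be illustrated with a small sketch. It assumes that each scan already has an image embedding and a metadata embedding in a shared space, uses cosine distance, and flags pairs above a hypothetical `threshold`; the distance measure, threshold, and function names are assumptions and are not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance (1 - cosine similarity) between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def flag_inconsistent(
    pairs: List[Tuple[Sequence[float], Sequence[float]]],
    threshold: float = 0.5,  # hypothetical cut-off, to be tuned on held-out data
) -> List[int]:
    """Return indices of (image_embedding, metadata_embedding) pairs that disagree."""
    return [
        i
        for i, (image_emb, meta_emb) in enumerate(pairs)
        if cosine_distance(image_emb, meta_emb) > threshold
    ]
```

A scan whose recorded acquisition parameters do not match what the image looks like would land far from its own metadata embedding and be returned for manual review.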

[162] Outlier-Aware Post-Training Quantization for Image Super-Resolution

Hailing Wang,jianglin Lu,Yitian Zhang,Yun Fu

Main category: cs.CV

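TL;DR: Proposes a dual-region quantization strategy and sensitivity-aware finetuning for image super-resolution networks, addressing activation outliers in post-training quantization and markedly improving performance without retraining.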

Details Motivation: Existing post-training quantization methods overlook outliers in activations when applied to image super-resolution networks, which leads to poor performance, and removing the outliers directly also degrades performance. Method: A dual-region quantization strategy partitions activations into an outlier region and a dense region and applies uniform quantization to each; sensitivity-aware finetuning optimizes for the differing quantization sensitivity of individual layers. Result: The method outperforms existing PTQ methods across multiple super-resolution networks and datasets, with performance close to QAT methods and a speedup of at least 75×. Conclusion: The proposed dual-region quantization and sensitivity-aware finetuning improve post-training quantization for image super-resolution, balancing efficiency and accuracy. Abstract: Quantization techniques, including quantization-aware training (QAT) and post-training quantization (PTQ), have become essential for inference acceleration of image super-resolution (SR) networks. Compared to QAT, PTQ has garnered significant attention as it eliminates the need for ground truth and model retraining. However, existing PTQ methods for SR often fail to achieve satisfactory performance as they overlook the impact of outliers in activation. Our empirical analysis reveals that these prevalent activation outliers are strongly correlated with image color information, and directly removing them leads to significant performance degradation. Motivated by this, we propose a dual-region quantization strategy that partitions activations into an outlier region and a dense region, applying uniform quantization to each region independently to better balance bit-width allocation. Furthermore, we observe that different network layers exhibit varying sensitivities to quantization, leading to different levels of performance degradation. To address this, we introduce sensitivity-aware finetuning that encourages the model to focus more on highly sensitive layers, further enhancing quantization performance. Extensive experiments demonstrate that our method outperforms existing PTQ approaches across various SR networks and datasets, while achieving performance comparable to QAT methods in most scenarios with at least a 75× speedup.
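
To make the dual-region idea concrete, here is a minimal sketch under stated assumptions: activations are split by a magnitude threshold taken from a hypothetical `outlier_fraction`, and each region is quantized and dequantized uniformly over its own range. How the paper chooses the split and allocates bit-widths is not given in the abstract, so the names and defaults below are illustrative.

```python
import numpy as np


def uniform_quantize(x: np.ndarray, bits: int) -> np.ndarray:
    """Quantize-dequantize x onto 2**bits evenly spaced levels over [x.min(), x.max()]."""
    if x.size == 0:
        return x.copy()
    lo, hi = float(x.min()), float(x.max())
    if hi == lo:
        return x.copy()
    scale = (hi - lo) / (2 ** bits - 1)
    return np.round((x - lo) / scale) * scale + lo


def dual_region_quantize(x, bits: int = 4, outlier_fraction: float = 0.01) -> np.ndarray:
    """Quantize the dense region and the outlier region independently.

    `outlier_fraction` is a hypothetical knob: the share of largest-magnitude
    activations treated as outliers.
    """
    x = np.asarray(x, dtype=np.float64)
    threshold = np.quantile(np.abs(x), 1.0 - outlier_fraction)
    outlier_mask = np.abs(x) > threshold

    out = np.empty_like(x)
    # Each region gets its own uniform grid, so outliers no longer stretch
    # the step size used for the bulk of the activations.
    out[~outlier_mask] = uniform_quantize(x[~outlier_mask], bits)
    out[outlier_mask] = uniform_quantize(x[outlier_mask], bits)
    return out
```

With a single grid, a handful of large activations forces a coarse step size on every value; giving the dense region its own range keeps its quantization error small while the outliers are still represented rather than clipped.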

[163] Evolve to Inspire: Novelty Search for Diverse Image Generation

Alex Inch,Passawis Chaiyapattanaporn,Yuchen Zhu,Yuan Lu,Ting-Wen Ko,Davide Paglieri

Main category: cs.CV

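TL;DR: This paper proposes WANDER, a novelty-search-based text-to-image generation method that uses a large language model to semantically evolve natural language prompts and CLIP embeddings to quantify novelty, markedly increasing the diversity of generated images.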

Details Motivation: Existing text-to-image diffusion models generate high-quality images but with limited output diversity, and existing prompt optimization techniques do not meet the needs of creative visual tasks. Method: WANDER operates directly on natural language prompts, using a large language model (LLM) for semantic evolution of prompts, CLIP embeddings to measure novelty, and emitters to steer the search into different regions of the prompt space. Result: In experiments using FLUX-DEV for generation and GPT-4o-mini for mutation, WANDER significantly outperforms existing evolutionary prompt optimization baselines on diversity metrics, and ablation studies confirm the effectiveness of emitters. Conclusion: WANDER effectively increases the diversity of text-to-image generation, suits exploratory and creative tasks, and offers a new direction for prompt optimization. Abstract: Text-to-image diffusion models, while proficient at generating high-fidelity images, often suffer from limited output diversity, hindering their application in exploratory and ideation tasks. Existing prompt optimization techniques typically target aesthetic fitness or are ill-suited to the creative visual domain. To address this shortcoming, we introduce WANDER, a novelty search-based approach to generating diverse sets of images from a single input prompt. WANDER operates directly on natural language prompts, employing a Large Language Model (LLM) for semantic evolution of diverse sets of images, and using CLIP embeddings to quantify novelty. We additionally apply emitters to guide the search into distinct regions of the prompt space, and demonstrate that they boost the diversity of the generated images. Empirical evaluations using FLUX-DEV for generation and GPT-4o-mini for mutation demonstrate that WANDER significantly outperforms existing evolutionary prompt optimization baselines in diversity metrics. Ablation studies confirm the efficacy of emitters.
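
The abstract says novelty is quantified with CLIP embeddings but does not give the formula. A common choice in novelty search is the mean distance to the k nearest neighbours in an archive of previously generated items; the sketch below assumes that definition with cosine distance. The embeddings are plain vectors here, and `novelty_score`, `archive`, and `k` are illustrative names rather than WANDER's API.

```python
import numpy as np


def novelty_score(candidate, archive, k: int = 3) -> float:
    """Mean cosine distance from `candidate` to its k nearest embeddings in `archive`.

    Assumes all embeddings are non-zero vectors of the same dimension.
    """
    candidate = np.asarray(candidate, dtype=np.float64)
    archive = np.asarray(archive, dtype=np.float64)

    # Normalise so that the dot product equals cosine similarity.
    cand_unit = candidate / np.linalg.norm(candidate)
    arch_unit = archive / np.linalg.norm(archive, axis=1, keepdims=True)

    distances = 1.0 - arch_unit @ cand_unit
    k = min(k, len(distances))
    # Smallest distances correspond to the nearest neighbours.
    return float(np.sort(distances)[:k].mean())
```

Under this definition, an image whose embedding sits close to earlier images scores low, so a search that selects for high scores is pushed toward regions of the embedding space it has not yet covered.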

[164] Toward Better Optimization of Low-Dose CT Enhancement: A Critical Analysis of Loss Functions and Image Quality Assessment Metrics

Taifour Yousra,Beghdadi Azeddine,Marie Luong,Zuheng Ming

Main category: cs.CV

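TL;DR: This paper studies how different deep learning loss functions perform in low-dose CT image quality enhancement, finds inconsistencies between existing loss functions and image quality metrics (such as PSNR and SSIM), and stresses that perceptual quality metrics should be considered when designing new loss functions.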

Details Motivation: Noise and artifacts in low-dose CT images affect diagnostic accuracy. Many loss functions have been used in deep learning models to improve image quality, but conventional metrics such as PSNR and SSIM do not reflect perceptual quality well, so a systematic assessment of the consistency between loss functions and image quality metrics is needed. Method: The performance of several classical and customized loss functions on the LDCT image enhancement task is analyzed objectively, and their correlation with different image quality assessment metrics is compared. Result: Commonly used loss functions are clearly inconsistent with image quality metrics, indicating a mismatch between current training objectives and perceptual quality. Conclusion: Loss functions for image quality enhancement should be designed together with quality metrics that better match human visual perception or clinical needs, in order to improve the practical effectiveness of models. Abstract: Low-dose CT (LDCT) imaging is widely used to reduce radiation exposure and mitigate the side effects of high exposure, but often suffers from noise and artifacts that affect diagnostic accuracy. To tackle this issue, deep learning models have been developed to enhance LDCT images. Various loss functions have been employed, including classical approaches such as Mean Square Error and adversarial losses, as well as customized loss functions (LFs) designed for specific architectures. Although these models achieve remarkable performance in terms of PSNR and SSIM, these metrics are limited in their ability to reflect perceptual quality, especially for medical images. In this paper, we focus on one of the most critical elements of DL-based architectures, namely the loss function. We conduct an objective analysis of the relevance of different loss functions for LDCT image quality enhancement and their consistency with image quality metrics. Our findings reveal inconsistencies between LFs and quality metrics, and highlight the need to consider image quality metrics when developing a new loss function for image quality enhancement.

[165] Validating Deep Models for Alzheimer's 18F-FDG PET Diagnosis Across Populations: A Study with Latin American Data

Hugo Massaroli,Hernan Chaves,Pilar Anania,Mauricio Farez,Emmanuel Iarussi,Viviana Siless

Main category: cs.CV

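TL;DR: This study evaluates how well deep learning models trained on the ADNI dataset generalize to the Latin American FLENI cohort and finds a marked drop in performance, revealing a clear domain shift.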

Details Motivation: Existing Alzheimer's disease (AD) diagnosis models are trained mainly on North American populations, and how well they generalize to underrepresented populations is unclear, so their cross-population applicability needs to be validated. Method: Convolutional neural networks and Transformer models are trained on the ADNI dataset and tested on the FLENI Latin American clinical cohort; ablation studies and occlusion sensitivity analysis are used to identify the key factors affecting generalization. Result: The models perform very well on ADNI (AUC up to 0.96 and 0.97) but drop clearly on FLENI (down to 0.82 and 0.80); the architectures perform similarly, with no advantage for Transformers; normalization and sampling strategy are critical for generalization; models trained on ADNI attend to canonical hypometabolic regions, but their attention becomes diffuse for the other classes and for FLENI images. Conclusion: Diagnostic AI models need to be validated on diverse populations; the work underlines the importance of population awareness and motivates future research on domain adaptation and cohort diversification. Abstract: Deep learning models have shown strong performance in diagnosing Alzheimer's disease (AD) using neuroimaging data, particularly 18F-FDG PET scans, with training datasets largely composed of North American cohorts such as those in the Alzheimer's Disease Neuroimaging Initiative (ADNI). However, their generalization to underrepresented populations remains underexplored. In this study, we benchmark convolutional and Transformer-based models on the ADNI dataset and assess their generalization performance on a novel Latin American clinical cohort from the FLENI Institute in Buenos Aires, Argentina. We show that while all models achieve high AUCs on ADNI (up to 0.96 and 0.97), their performance drops substantially on FLENI (down to 0.82 and 0.80, respectively), revealing a significant domain shift. The tested architectures demonstrated similar performance, calling into question the supposed advantages of transformers for this specific task. Through ablation studies, we identify per-image normalization and a correct sampling selection as key factors for generalization. Occlusion sensitivity analysis further reveals that models trained on ADNI generally attend to canonical hypometabolic regions for the AD class, but their focus becomes unclear for the other classes and for FLENI scans. These findings highlight the need for population-aware validation of diagnostic AI models and motivate future work on domain adaptation and cohort diversification.

[166] Towards classification-based representation learning for place recognition on LiDAR scans

Dmitrii Khizbullin,Maksim Konoplia

Main category: cs.CV

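TL;DR: Proposes a new approach that treats place recognition as a multi-class classification problem, classifying locations directly from LiDAR scans; on the NuScenes dataset it performs comparably to contrastive learning methods while training more efficiently and stably.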

Details Motivation: Most existing methods rely on contrastive learning, which has training efficiency and stability problems, so an alternative is explored. Method: Discrete location labels are assigned to LiDAR scans, and an encoder-decoder model is trained end to end to classify location. Result: On the NuScenes dataset the method performs comparably to contrastive learning methods while improving training efficiency and stability. Conclusion: Treating place recognition as a classification task is feasible and effective, with practical advantages over contrastive learning. Abstract: Place recognition is a crucial task in autonomous driving, allowing vehicles to determine their position using sensor data. While most existing methods rely on contrastive learning, we explore an alternative approach by framing place recognition as a multi-class classification problem. Our method assigns discrete location labels to LiDAR scans and trains an encoder-decoder model to classify each scan's position directly. We evaluate this approach on the NuScenes dataset and show that it achieves competitive performance compared to contrastive learning-based methods while offering advantages in training efficiency and stability.
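
The abstract does not say how discrete location labels are assigned to scans. One simple possibility, shown here purely as an assumption, is to bin the 2D pose of each scan into a regular grid and use the cell index as the class label; `cell_size`, `grid_width`, and `origin` are hypothetical parameters.

```python
import math
from typing import Tuple


def location_label(
    x: float,
    y: float,
    origin: Tuple[float, float] = (0.0, 0.0),
    cell_size: float = 10.0,  # hypothetical cell edge length in metres
    grid_width: int = 100,    # hypothetical number of cells per row
) -> int:
    """Map a 2D position to the index of the grid cell that contains it."""
    col = math.floor((x - origin[0]) / cell_size)
    row = math.floor((y - origin[1]) / cell_size)
    if not 0 <= col < grid_width or row < 0:
        raise ValueError("position lies outside the labelled grid")
    # Row-major flattening turns the (row, col) pair into a single class id.
    return row * grid_width + col
```

A classifier trained on such labels predicts a cell rather than a continuous pose, so the cell size directly trades localisation resolution against the number of classes.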

[167] Erasing 'Ugly' from the Internet: Propagation of the Beauty Myth in Text-Image Models

Tanvi Dinkar,Aiqi Jiang,Gavin Abercrombie,Ioannis Konstas

Main category: cs.CV

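TL;DR: This study examines how generative AI models encode 'beauty' and erase 'ugliness', revealing Western aesthetic biases in these models and their negative effects on society.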

Details Motivation: 社交媒体加剧了西方审美标准的传播,导致女性和女孩出现负面自我形象及身体畸形恐惧等问题。随着人工智能生成内容的增多,人们担忧这种审美偏见被进一步放大。因此,研究旨在分析生成式AI在‘美’与‘丑’表达中的偏见问题。 Method: 构建了两个图像生成流程:文本到图像模型和文本到语言模型再到图像模型。设计了一个结构化的美学分类体系,用于提示三种语言模型和两种文本到图像模型,共生成5984张图像。通过李克特量表对其中1200张图像进行用户评价实验,参与者为女性和非二元性别社交媒体用户。 Result: 86.5%的生成图像具有较浅肤色,22%包含非安全内容(尽管经过SFW训练),74%被归类为更年轻年龄段。非二元性别个体的图像被评价为更年轻且更具性化。带有‘负面’或‘丑陋’特征提示(如“宽鼻子”)的图像普遍获得更高的NSFW评分,且不受性别影响。 Conclusion: 生成式AI模型中存在显著的审美偏见,这些偏见由开发者通过负向提示等方式持续强化,导致不符合主流审美的特征被系统性抹除,并污染数据流,可能进一步加剧社会不平等。 Abstract: Social media has exacerbated the promotion of Western beauty norms, leading to negative self-image, particularly in women and girls, and causing harm such as body dysmorphia. Increasingly content on the internet has been artificially generated, leading to concerns that these norms are being exaggerated. The aim of this work is to study how generative AI models may encode 'beauty' and erase 'ugliness', and discuss the implications of this for society. To investigate these aims, we create two image generation pipelines: a text-to-image model and a text-to-language model-to image model. We develop a structured beauty taxonomy which we use to prompt three language models (LMs) and two text-to-image models to cumulatively generate 5984 images using our two pipelines. We then recruit women and non-binary social media users to evaluate 1200 of the images through a Likert-scale within-subjects study. Participants show high agreement in their ratings. Our results show that 86.5% of generated images depicted people with lighter skin tones, 22% contained explicit content despite Safe for Work (SFW) training, and 74% were rated as being in a younger age demographic. In particular, the images of non-binary individuals were rated as both younger and more hypersexualised, indicating troubling intersectional effects. Notably, prompts encoded with 'negative' or 'ugly' beauty traits (such as "a wide nose") consistently produced higher Not SFW (NSFW) ratings regardless of gender. 
This work sheds light on the pervasive demographic biases related to beauty standards present in generative AI models -- biases that are actively perpetuated by model developers, such as via negative prompting. We conclude by discussing the implications of this on society, which include pollution of the data streams and active erasure of features that do not fall inside the stereotype of what is considered beautiful by developers.

[168] A Hybrid YOLOv5-SSD IoT-Based Animal Detection System for Durian Plantation Protection

Anis Suttan Shahrir,Zakiah Ayop,Syarulnaziah Anawar,Norulzahrah Mohd Zainudin

Main category: cs.CV

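TL;DR: This study proposes an IoT-based animal intrusion detection system for durian plantations that combines the YOLOv5 and SSD algorithms to improve detection accuracy, notifies farmers via Telegram, and triggers a sound-based deterrent.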

Details Motivation: Traditional farming methods cannot effectively prevent animal intrusions when no one is monitoring, leading to crop damage and financial loss. Existing systems are limited by reliance on a single detection algorithm, inconvenient notification platforms, and weak deterrent mechanisms. Method: An IoT system integrating the YOLOv5 and SSD object detection algorithms is developed for real-time monitoring of elephants, wild boars, and monkeys; alerts are sent via Telegram and an automatic sound deterrent (such as a tiger roar) is triggered. Result: The YOLO+SSD model performs best in daytime, with detection accuracy of 90%, 85%, and 70% for elephants, wild boars, and monkeys, respectively; performance drops at night, and the same trend holds for still images and video. Conclusion: The system provides a practical framework that combines detection, notification, and deterrence, offering a workable approach to automated protection in smart agriculture. Abstract: Durian plantations suffer from animal intrusions that cause crop damage and financial loss. Traditional farming practices prove ineffective because monitoring is not possible without human intervention. The fast growth of machine learning and Internet of Things (IoT) technology has led to new ways to detect animals. However, current systems are limited by dependence on single object detection algorithms, less accessible notification platforms, and limited deterrent mechanisms. This research proposes an IoT-enabled animal detection system for durian crops. The system integrates YOLOv5 and SSD object detection algorithms to improve detection accuracy. The system provides real-time monitoring, with detected intrusions automatically reported to farmers via Telegram notifications for rapid response. An automated sound mechanism (e.g., tiger roar) is triggered once the animal is detected. The YOLO+SSD model achieved accuracy rates of 90%, 85%, and 70% for elephants, boars, and monkeys, respectively. The system shows the highest accuracy in daytime and decreases at night, regardless of whether the input is a still image or a video. Overall, this study contributes a comprehensive and practical framework that combines detection, notification, and deterrence, paving the way for future innovations in automated farming solutions.

[169] Class-agnostic 3D Segmentation by Granularity-Consistent Automatic 2D Mask Tracking

Juan Wang,Yasutomo Kawanishi,Tomo Miyazaki,Zhijie Wang,Shinichiro Omachi

Main category: cs.CV

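TL;DR: Proposes a granularity-consistent automatic 2D mask tracking method that, combined with a three-stage curriculum learning framework, moves from fragmented to globally consistent 3D instance segmentation, markedly improving pseudo-label consistency and segmentation accuracy.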

Details Motivation: Existing methods transfer 2D masks to 3D independently, which causes inconsistent segmentation granularity and conflicting pseudo labels and hurts 3D instance segmentation performance. Method: A Granularity-Consistent automatic 2D Mask Tracking method maintains temporal correspondences across frames, and a three-stage curriculum learning framework trains progressively from fragmented single-view data to unified multi-view annotations and finally to consistent full-scene supervision. Result: Experiments show that the method produces more consistent and accurate 3D segmentations, reaches state-of-the-art results on standard benchmarks, and has open-vocabulary ability. Conclusion: By maintaining temporal consistency and learning progressively, a robust 3D representation can be distilled from inconsistent 2D priors, markedly improving unsupervised 3D instance segmentation. Abstract: 3D instance segmentation is an important task for real-world applications. To avoid costly manual annotations, existing methods have explored generating pseudo labels by transferring 2D masks from foundation models to 3D. However, this approach is often suboptimal since the video frames are processed independently. This causes inconsistent segmentation granularity and conflicting 3D pseudo labels, which degrades the accuracy of final segmentation. To address this, we introduce a Granularity-Consistent automatic 2D Mask Tracking approach that maintains temporal correspondences across frames, eliminating conflicting pseudo labels. Combined with a three-stage curriculum learning framework, our approach progressively trains from fragmented single-view data to unified multi-view annotations, and ultimately to globally coherent full-scene supervision. This structured learning pipeline progressively exposes the model to pseudo-labels of increasing consistency. Thus, we can robustly distill a consistent 3D representation from initially fragmented and contradictory 2D priors. Experimental results demonstrated that our method effectively generated consistent and accurate 3D segmentations. Furthermore, the proposed method achieved state-of-the-art results on standard benchmarks and open-vocabulary ability.

[170] FedOnco-Bench: A Reproducible Benchmark for Privacy-Aware Federated Tumor Segmentation with Synthetic CT Data

Viswa Chaitanya Marella,Suhasnadh Reddy Veluru,Sai Teja Erukude

Main category: cs.CV

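TL;DR: This paper presents FedOnco-Bench, a reproducible federated learning benchmark platform based on synthetic tumor CT images, for evaluating the trade-off between privacy protection and model performance.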

Details Motivation: Federated learning has important applications in privacy-sensitive fields such as healthcare, but it faces membership-inference attacks and data heterogeneity, so a standardized benchmark is needed to assess the balance between privacy and performance. Method: A dataset of synthetic tumor CT images with tumor annotations is built; federated segmentation experiments are run with FedAvg, FedProx, FedBN, and FedAvg combined with DP-SGD; and the segmentation performance and privacy leakage of each method are evaluated. Result: FedAvg achieves the best performance (Dice around 0.85) but with higher privacy leakage (attack AUC about 0.72); DP-SGD gives the strongest privacy protection (AUC around 0.25) at reduced accuracy (Dice about 0.79); FedProx and FedBN are more balanced under non-IID data. Conclusion: FedOnco-Bench provides a standardized, open-source platform for evaluating and developing privacy-preserving federated learning methods for medical image segmentation. Abstract: Federated Learning (FL) allows multiple institutions to cooperatively train machine learning models while retaining sensitive data at the source, which has great utility in privacy-sensitive environments. However, FL systems remain vulnerable to membership-inference attacks and data heterogeneity. This paper presents FedOnco-Bench, a reproducible benchmark for privacy-aware FL using synthetic oncologic CT scans with tumor annotations. It evaluates segmentation performance and privacy leakage across FL methods: FedAvg, FedProx, FedBN, and FedAvg with DP-SGD. Results show a distinct trade-off between privacy and utility: FedAvg achieves high performance (Dice around 0.85) with more privacy leakage (attack AUC about 0.72), while DP-SGD provides a higher level of privacy (AUC around 0.25) at the cost of accuracy (Dice about 0.79). FedProx and FedBN offer balanced performance under heterogeneous data, especially with non-identically distributed client data. FedOnco-Bench serves as a standardized, open-source platform for benchmarking and developing privacy-preserving FL methods for medical image segmentation.
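
For readers unfamiliar with the FedAvg baseline named above, its aggregation step is a weighted average of client parameters, with weights proportional to each client's number of local samples. The sketch below shows only that server-side step; it represents model parameters as a dictionary of NumPy arrays, which is an illustrative choice and not the benchmark's actual code.

```python
from typing import Dict, List

import numpy as np


def fedavg(
    client_weights: List[Dict[str, np.ndarray]],
    client_sizes: List[int],
) -> Dict[str, np.ndarray]:
    """Aggregate client parameters, weighting each client by its local sample count."""
    total = float(sum(client_sizes))
    return {
        name: sum(
            weights[name] * (size / total)
            for weights, size in zip(client_weights, client_sizes)
        )
        for name in client_weights[0]
    }
```

For example, two clients holding 1 and 3 samples with parameters `[1, 2]` and `[3, 4]` would aggregate to `0.25 * [1, 2] + 0.75 * [3, 4]`. FedProx, FedBN, and DP-SGD change what happens on the clients or which parameters are shared, not this basic averaging rule.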

[171] Med-Banana-50K: A Cross-modality Large-Scale Dataset for Text-guided Medical Image Editing

Zhihui Chen,Mengling Feng

Main category: cs.CV

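TL;DR: Med-Banana-50K is a large-scale dataset of 50K medical images for instruction-based medical image editing, spanning three modalities and 23 disease types. Bidirectional edits are generated with a Gemini model, LLM-as-Judge provides medical quality control, and logs of a large number of failed attempts are included, with the aim of advancing medical image editing models.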

Details Motivation: Medical image editing currently lacks large-scale, high-quality, openly available datasets, especially ones that satisfy strict anatomical and clinical constraints, which limits research on multimodal large models in this area. Method: Uses Gemini-2.5-Flash-Image to generate bidirectional edits (lesion addition and removal) from real medical images to build Med-Banana-50K; applies LLM-as-Judge with a medically grounded rubric (instruction compliance, structural plausibility, realism, and fidelity preservation) and history-aware iterative refinement of up to five rounds. Result: Builds a 50K-scale medical image editing dataset covering chest X-ray, brain MRI, and fundus photography across 23 disease types; includes full conversation logs for 37K failed edits; and establishes a systematic medical quality-control pipeline so that edits are clinically plausible. Conclusion: Med-Banana-50K provides the first large-scale, medically validated, fully reproducible data resource for medical image editing, supporting model training, preference learning, and alignment research, and laying a foundation for the next generation of medical image editing models. Abstract: Recent advances in multimodal large language models have enabled remarkable medical image editing capabilities. However, the research community's progress remains constrained by the absence of large-scale, high-quality, and openly accessible datasets built specifically for medical image editing with strict anatomical and clinical constraints. We introduce Med-Banana-50K, a comprehensive 50K-image dataset for instruction-based medical image editing spanning three modalities (chest X-ray, brain MRI, fundus photography) and 23 disease types. Our dataset is constructed by leveraging Gemini-2.5-Flash-Image to generate bidirectional edits (lesion addition and removal) from real medical images. What distinguishes Med-Banana-50K from general-domain editing datasets is our systematic approach to medical quality control: we employ LLM-as-Judge with a medically grounded rubric (instruction compliance, structural plausibility, realism, and fidelity preservation) and history-aware iterative refinement up to five rounds. Beyond single-turn editing, Med-Banana-50K includes 37K failed attempts with full conversation logs for preference learning and alignment research. By providing this large-scale, medically validated, and fully documented resource, Med-Banana-50K establishes a foundation for training and evaluating the next generation of medical image editing models. Our dataset and code are publicly available at https://github.com/richardChenzhihui/med-banana-50k.

[172] GUI-AIMA: Aligning Intrinsic Multimodal Attention with a Context Anchor for GUI Grounding

Shijie Zhou,Viet Dac Lai,Hao Tan,Jihyung Kil,Wanrong Zhu,Changyou Chen,Ruiyi Zhang

Main category: cs.CV

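TL;DR: This paper proposes GUI-AIMA, an attention-based, coordinate-free GUI grounding framework that adaptively generates patch-wise grounding signals through multi-head aggregation on simplified query-visual attention matrices, achieving efficient and data-efficient GUI grounding.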

Details Motivation: Existing GUI grounding methods based on multimodal large language models (MLLMs) struggle to generate precise coordinates directly from visual input, with high computational cost and insufficient accuracy. A more efficient and accurate way to map natural-language instructions to screen regions is therefore needed. Method: Proposes the GUI-AIMA framework, which exploits the multimodal attention inside MLLMs: multi-head aggregation over simplified query-visual attention matrices adaptively produces patch-wise grounding signals for different user instructions, and the coordinate-free design is combined with a plug-and-play zoom-in stage to locate the click position precisely. Result: Trained on only 85k screenshots, GUI-AIMA-3B reaches an average accuracy of 58.6% on ScreenSpot-Pro and 62.2% on OSWorld-G, state of the art among 3B-scale models. Conclusion: GUI-AIMA shows that light training is enough to trigger the intrinsic grounding capability of MLLMs, and that its attention alignment and coordinate-free design yield efficient, accurate, and data-efficient GUI grounding. Abstract: Graphical user interface (GUI) grounding is a key function of computer-use agents, which maps natural-language instructions to actionable screen regions. Existing approaches based on Multimodal Large Language Models (MLLMs) typically formulate it as a text-based coordinate generation task, yet directly generating precise coordinates from visual inputs remains challenging and computationally intensive. An intuitive way to implement GUI grounding is to first select visual patches relevant to the instructions and then determine the precise click location within those patches. Based on the observation that general MLLMs have some native grounding capability nested within their attention, we propose GUI-AIMA, an attention-based and coordinate-free supervised fine-tuning framework for efficient GUI grounding. GUI-AIMA aligns the intrinsic multimodal attention of MLLMs with patch-wise grounding signals. These signals are calculated adaptively for diverse user instructions by multi-head aggregation on simplified query-visual attention matrices. In addition, its coordinate-free design can easily integrate a plug-and-play zoom-in stage. GUI-AIMA-3B was trained with only 85k screenshots, demonstrating exceptional data efficiency and verifying that light training can trigger the native grounding capability of MLLMs. It achieves state-of-the-art performance among 3B models, attaining an average accuracy of 58.6% on ScreenSpot-Pro and 62.2% on OSWorld-G. Project page: https://github.com/sjz5202/GUI-AIMA

[173] TA-LSDiff: Topology-Aware Diffusion Guided by a Level Set Energy for Pancreas Segmentation

Yue Gou,Fanghui Song,Yuming Xing,Shengzhu Shi,Zhichang Guo,Boying Wu

Main category: cs.CV

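TL;DR: Proposes TA-LSDiff, a new pancreas segmentation method that combines a topology-aware diffusion probabilistic model with a level set energy and achieves state-of-the-art accuracy on several public datasets.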

Details Motivation: Pancreas segmentation is challenging because the organ is small, has low contrast, and varies widely in topology; traditional methods are limited in geometric evolution and detail preservation. Method: Proposes the TA-LSDiff model, which fuses a topology-aware diffusion model with a level set energy function and adds a pixel-adaptive refinement module that locally modulates the energy function through neighborhood affinity weighting. Result: The method is validated on four public pancreas datasets; ablation studies show that each component contributes significantly, and overall performance exceeds existing methods. Conclusion: TA-LSDiff is a high-accuracy pancreas segmentation approach that needs no explicit geometric evolution, balances structural detail with semantic features, and has potential for clinical use. Abstract: Pancreas segmentation in medical image processing is a persistent challenge due to the organ's small size, low contrast against adjacent tissues, and significant topological variations. Traditional level set methods drive boundary evolution using gradient flows, often ignoring pointwise topological effects. Conversely, deep learning-based segmentation networks extract rich semantic features but frequently sacrifice structural details. To bridge this gap, we propose a novel model named TA-LSDiff, which combines a topology-aware diffusion probabilistic model with a level set energy, achieving segmentation without explicit geometric evolution. This energy function guides implicit curve evolution by integrating the input image and deep features through four complementary terms. To further enhance boundary precision, we introduce a pixel-adaptive refinement module that locally modulates the energy function using affinity weighting from neighboring evidence. Ablation studies systematically quantify the contribution of each proposed component. Evaluations on four public pancreas datasets demonstrate that TA-LSDiff achieves state-of-the-art accuracy, outperforming existing methods. These results establish TA-LSDiff as a practical and accurate solution for pancreas segmentation.

[174] OMEGA: Optimized Multimodal Position Encoding Index Derivation with Global Adaptive Scaling for Vision-Language Models

Ruoxiang Huang,Xindian Ma,Rundong Kong,Zhen Yuan,Peng Zhang

Main category: cs.CV

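TL;DR: This paper proposes OMEGA, a new position encoding framework for vision-language models that improves multimodal task performance through Modality-Specific Position Encoding (MSPE) and Global Adaptive Encoding Step Scaling (GAESS).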

Details Motivation: Existing vision-language models use unified 1D or 2D positional indexing that does not account for the structural differences between text and vision, limiting how well the model captures sequential continuity and spatial coherence. Method: Proposes the OMEGA framework, which includes Modality-Specific Position Encoding (MSPE) to assign positional indices to each modality along separate coordinate dimensions so that its inherent structure is preserved, and Global Adaptive Encoding Step Scaling (GAESS), which adapts the position encoding step size of visual tokens to the embedding entropy of both modalities in order to align the information density of multimodal data. Result: Experiments show that OMEGA consistently improves VLM performance across architectures and VQA benchmarks; on visual-intensive tasks it improves over the baseline by up to 3.43% on Qwen2.5-VL-3B, with consistent gains on larger models such as Qwen2.5-VL-7B and LLaVA-v1.5-7B. Conclusion: With modality-specific, adaptive position encoding, OMEGA improves how vision-language models capture multimodal structural information, and it generalizes and scales well. Abstract: Vision-Language Models (VLMs) have demonstrated strong performance across various multimodal tasks, where position encoding plays a vital role in modeling both the sequential structure of textual information and the spatial structure of visual information. However, current VLMs commonly adopt modality-unified 1D or 2D positional indexing strategies, which treat textual and visual tokens uniformly without accounting for their distinct structural properties: sequential continuity for text and spatial coherence for vision. To address this limitation, we propose OMEGA, a novel position encoding framework that employs Modality-Specific Position Encoding (MSPE) to assign positional indices while preserving the inherent structures of each modality across separate coordinate dimensions. Additionally, to align the information density of multimodal data in the positional index space, OMEGA introduces Global Adaptive Encoding Step Scaling (GAESS), which adaptively adjusts the position encoding step size of visual tokens based on the embedding entropy of both modalities. Experimental results demonstrate that OMEGA consistently enhances VLM performance across diverse architectures and VQA benchmarks. On visual-intensive tasks, OMEGA achieves up to 3.43% improvement over baseline position encoding strategies on Qwen2.5-VL-3B, with consistent gains observed across larger models including Qwen2.5-VL-7B and LLaVA-v1.5-7B.

[175] Enhancing Adversarial Transferability in Visual-Language Pre-training Models via Local Shuffle and Sample-based Attack

Xin Liu,Aoyang Zhou,Aoyang Zhou

Main category: cs.CV

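TL;DR: Proposes LSSA, a new multimodal adversarial attack that increases input diversity through local image block shuffling and sampling, significantly improving the transferability of adversarial examples across visual-language pre-training models.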

Details Motivation: Existing cross-modal adversarial attacks rely too heavily on adversarial information from a single modality, which leaves input diversity insufficient, invites overfitting, and limits the transferability of adversarial examples. Method: Proposes the Local Shuffle and Sample-based Attack (LSSA): local blocks of the image are randomly shuffled to expand the image-text pairs, adversarial images are generated and sampled around, and both the original and the sampled images are used to generate the adversarial texts. Result: Experiments on multiple models and datasets show that LSSA significantly improves the transferability of multimodal adversarial examples across VLP models and downstream tasks, and outperforms other advanced attacks on Large Vision-Language Models. Conclusion: By increasing input diversity, LSSA mitigates overfitting in cross-modal adversarial attacks and significantly improves transferability, offering a new way to evaluate the robustness of vision-language models. Abstract: Visual-Language Pre-training (VLP) models have achieved significant performance across various downstream tasks. However, they remain vulnerable to adversarial examples. While prior efforts focus on improving the adversarial transferability of multimodal adversarial examples through cross-modal interactions, these approaches suffer from overfitting because they lack input diversity: they rely excessively on information from adversarial examples in one modality when crafting attacks in another. To address this issue, we draw inspiration from strategies in some adversarial training methods and propose a novel attack called Local Shuffle and Sample-based Attack (LSSA). LSSA randomly shuffles one of the local image blocks, thus expanding the original image-text pairs, generating adversarial images, and sampling around them. Then, it utilizes both the original and sampled images to generate the adversarial texts. Extensive experiments on multiple models and datasets demonstrate that LSSA significantly enhances the transferability of multimodal adversarial examples across diverse VLP models and downstream tasks. Moreover, LSSA outperforms other advanced attacks on Large Vision-Language Models.
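
The abstract does not specify how a local block is shuffled, so the following is only one possible reading, offered as a sketch: the image is divided into a `grid` x `grid` layout, one block is picked at random, and its four quadrants are permuted. The function name `shuffle_one_block`, the quadrant-level permutation, and the list-of-lists image format are all assumptions made for illustration, not the LSSA implementation.

```python
import random
from typing import List

Image = List[List[int]]


def shuffle_one_block(img: Image, grid: int, rng: random.Random) -> Image:
    """Return a copy of img in which one randomly chosen block has its
    four quadrants permuted. Pixels outside that block are untouched."""
    height, width = len(img), len(img[0])
    if height % (2 * grid) or width % (2 * grid):
        raise ValueError("image sides must be divisible by 2 * grid")
    block_h, block_w = height // grid, width // grid
    quad_h, quad_w = block_h // 2, block_w // 2

    # Top-left corner of the randomly selected block.
    top = rng.randrange(grid) * block_h
    left = rng.randrange(grid) * block_w

    # Top-left corners of the block's four quadrants.
    origins = [(top, left), (top, left + quad_w),
               (top + quad_h, left), (top + quad_h, left + quad_w)]
    quads = [[row[c:c + quad_w] for row in img[r:r + quad_h]]
             for r, c in origins]
    rng.shuffle(quads)

    out = [row[:] for row in img]
    for (r, c), quad in zip(origins, quads):
        for dr, quad_row in enumerate(quad):
            out[r + dr][c:c + quad_w] = quad_row
    return out


# Toy 8x8 "image" whose pixel values are 0..63.
toy = [[8 * r + c for c in range(8)] for r in range(8)]
augmented = shuffle_one_block(toy, grid=2, rng=random.Random(0))
```

Because the transform only rearranges pixels, it leaves global image statistics unchanged while altering local layout, which is the sense in which it adds input diversity.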

[176] Linear Differential Vision Transformer: Learning Visual Contrasts via Pairwise Differentials

Yifan Pu,Jixuan Ying,Qixiu Li,Tianzhu Ye,Dongchen Han,Xiaochen Wang,Ziyi Wang,Xinyu Shao,Gao Huang,Xiu Li

Main category: cs.CV

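TL;DR: This paper proposes Visual-Contrast Attention (VCA) as a replacement for multi-head self-attention in Vision Transformers; by introducing a discriminative mechanism, it improves image recognition and generation performance while lowering computational complexity.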

Details Motivation: Multi-head self-attention in existing Vision Transformers performs a quadratic computation over all token pairs and spends much of it on visually weak or redundant correlations, so a more efficient and discriminative attention mechanism is needed. Method: VCA first distils each attention head's dense query field into a small number of spatially pooled visual-contrast tokens, then splits them into learnable positive and negative streams whose differential interaction highlights what distinguishes one region from another. This reduces the theoretical complexity from O(N N C) to O(N n C), with n much smaller than N. Result: On ImageNet-1K, VCA raises DeiT-Tiny top-1 accuracy from 72.2% to 75.6% and improves several hierarchical ViTs by up to 3.1%; in image generation, FID-50K drops by 2.1 to 5.2 points on both DiT and SiT models. Ablations confirm the value of spatial pooling, dual positional embeddings, and combining the two in both stages. Conclusion: VCA is a simple and effective way to make Vision Transformers faster and more accurate while keeping parameter count and compute low. Abstract: Vision Transformers (ViTs) have become a universal backbone for both image recognition and image generation. Yet their Multi-Head Self-Attention (MHSA) layer still performs a quadratic query-key interaction for every token pair, spending the bulk of computation on visually weak or redundant correlations. We introduce Visual-Contrast Attention (VCA), a drop-in replacement for MHSA that injects an explicit notion of discrimination while reducing the theoretical complexity from O(N N C) to O(N n C) with n << N. VCA first distils each head's dense query field into a handful of spatially pooled visual-contrast tokens, then splits them into a learnable positive and negative stream whose differential interaction highlights what truly separates one region from another. The module adds fewer than 0.3M parameters to a DeiT-Tiny backbone, requires no extra FLOPs, and is wholly architecture-agnostic. Empirically, VCA lifts DeiT-Tiny top-1 accuracy on ImageNet-1K from 72.2% to 75.6% (+3.4) and improves three strong hierarchical ViTs by up to 3.1%, while in class-conditional ImageNet generation it lowers FID-50K by 2.1 to 5.2 points across both diffusion (DiT) and flow (SiT) models. Extensive ablations confirm that (i) spatial pooling supplies low-variance global cues, (ii) dual positional embeddings are indispensable for contrastive reasoning, and (iii) combining the two in both stages yields the strongest synergy. VCA therefore offers a simple path towards faster and sharper Vision Transformers.
The source code is available at https://github.com/LeapLabTHU/LinearDiff.
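
To give a feel for the complexity claim, the snippet below compares the O(N N C) and O(N n C) interaction counts for one concrete setting. It is simple arithmetic written for this digest, not code from the linked repository: N = 196 and C = 192 correspond to a DeiT-Tiny-sized input (14 x 14 patches, embedding width 192), while n = 49 is a purely hypothetical number of pooled tokens chosen for illustration.

```python
def dense_attention_cost(num_tokens: int, channels: int) -> int:
    """Multiply-accumulate count of a full query-key interaction: N * N * C."""
    return num_tokens * num_tokens * channels


def pooled_attention_cost(num_tokens: int, num_pooled: int, channels: int) -> int:
    """Count when N tokens interact with only n pooled tokens: N * n * C."""
    return num_tokens * num_pooled * channels


N, C = 196, 192   # DeiT-Tiny-sized setting
n = 49            # hypothetical pooled-token count (assumption)

dense = dense_attention_cost(N, C)
pooled = pooled_attention_cost(N, n, C)
reduction = dense / pooled  # equals N / n by construction
```

The reduction factor is exactly N / n, so the saving grows with resolution when the pooled-token count stays fixed.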

[177] Parameter Interpolation Adversarial Training for Robust Image Classification

Xin Liu,Yichen Yang,Kun He,John E. Hopcroft

Main category: cs.CV

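TL;DR: Proposes PIAT, a new adversarial training framework that interpolates model parameters between training epochs to reduce robustness oscillation and overfitting, and adds a Normalized Mean Square Error (NMSE) term to further improve robustness.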

Details Motivation: Existing adversarial training methods suffer from large oscillations in model robustness and from overfitting, which weakens the defense, so a more stable training framework is needed. Method: Proposes Parameter Interpolation Adversarial Training (PIAT), which interpolates the model parameters of the previous and current epochs, and introduces an NMSE loss that aligns the relative magnitude of logits between clean and adversarial examples. Result: Experiments on several benchmark datasets show that PIAT significantly improves the robustness of both CNNs and Vision Transformers. Conclusion: By smoothing changes in the decision boundary and alleviating overfitting, PIAT improves defense against adversarial attacks and applies to a range of network architectures. Abstract: Though deep neural networks exhibit superior performance on various tasks, they are still plagued by adversarial examples. Adversarial training has been demonstrated to be the most effective method to defend against adversarial attacks. However, under existing adversarial training methods, model robustness shows apparent oscillations and overfitting during training, degrading the defense efficacy. To address these issues, we propose a novel framework called Parameter Interpolation Adversarial Training (PIAT). PIAT tunes the model parameters between each epoch by interpolating the parameters of the previous and current epochs. This makes the model's decision boundary change more moderately and alleviates the overfitting issue, helping the model converge better and achieve higher robustness. In addition, we suggest using the Normalized Mean Square Error (NMSE) to further improve the robustness by aligning the relative magnitude of logits between clean and adversarial examples rather than the absolute magnitude. Extensive experiments conducted on several benchmark datasets demonstrate that our framework can markedly improve the robustness of both Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs).
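
The interpolation step can be pictured with a small sketch. The abstract says only that parameters of the previous and current epochs are interpolated; the convex-combination form, the coefficient name `lam`, and the dict-of-lists parameter layout below are assumptions made for illustration rather than the paper's exact rule.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def interpolate_params(prev: Params, cur: Params, lam: float) -> Params:
    """Blend two parameter snapshots: lam * prev + (1 - lam) * cur.

    lam = 0 keeps the current epoch's weights unchanged; larger lam pulls
    the model back towards the previous epoch, damping abrupt changes.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if prev.keys() != cur.keys():
        raise ValueError("snapshots must have the same parameter names")
    return {
        name: [lam * p + (1.0 - lam) * c for p, c in zip(prev[name], cur[name])]
        for name in cur
    }


# Hypothetical snapshots taken at the end of two consecutive epochs.
theta_prev = {"w": [0.0, 2.0]}
theta_cur = {"w": [2.0, 4.0]}
theta_next = interpolate_params(theta_prev, theta_cur, lam=0.5)
```

In a training loop this would run once per epoch, with the blended weights loaded back into the model before the next epoch begins.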

[178] OmniBrainBench: A Comprehensive Multimodal Benchmark for Brain Imaging Analysis Across Multi-stage Clinical Tasks

Zhihao Peng,Cheng Wang,Shengyuan Liu,Zhiying Liang,Yixuan Yuan

Main category: cs.CV

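TL;DR: OmniBrainBench is the first comprehensive multimodal visual question answering (VQA) benchmark for brain imaging analysis. It covers 15 brain imaging modalities, 31,706 images, and 9,527 validated VQA pairs, simulates real clinical workflows, and is used to comprehensively evaluate multimodal large language models (MLLMs) on brain imaging analysis.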

Details Motivation: Existing brain VQA benchmarks cover few imaging modalities and are mostly limited to coarse-grained pathological descriptions, so they cannot comprehensively assess MLLMs across the full clinical workflow. A more comprehensive benchmark that is closer to clinical practice is needed. Method: Builds the multimodal VQA benchmark OmniBrainBench, with data for 15 brain imaging modalities from 30 authoritative medical sources, 9,527 radiologist-validated VQA pairs, and 31,706 images, covering 15 multi-stage clinical tasks, and systematically evaluates 24 state-of-the-art MLLMs (open-source, medical, and proprietary). Result: Experiments show that (1) proprietary MLLMs (e.g., GPT-5) beat open-source and medical models but still lag physicians; (2) medical MLLMs vary widely in performance; (3) open-source models trail overall but excel in specific tasks; (4) MLLMs drop sharply on complex preoperative tasks, exposing a visual-to-clinical reasoning gap. Conclusion: OmniBrainBench sets a new standard for evaluating and advancing MLLMs in brain imaging analysis and reveals the gap between current models and expert clinical reasoning, which helps guide future research. Abstract: Brain imaging analysis is vital for diagnosing and treating brain disorders, and multimodal large language models (MLLMs) are increasingly assisting in that analysis. However, current brain-oriented visual question-answering (VQA) benchmarks either cover a few imaging modalities or are limited to coarse-grained pathological descriptions, hindering a comprehensive assessment of MLLMs throughout the full clinical continuum. To address these limitations, we introduce OmniBrainBench, the first comprehensive multimodal VQA benchmark specifically designed to assess the multimodal comprehension capabilities of MLLMs in brain imaging analysis. OmniBrainBench consists of 15 distinct brain imaging modalities collected from 30 verified medical sources, yielding 9,527 validated VQA pairs and 31,706 images. It simulates clinical workflows and encompasses 15 multi-stage clinical tasks rigorously validated by a professional radiologist. Evaluation of 24 state-of-the-art models, including open-source, medical, and proprietary MLLMs, highlights the substantial challenges posed by OmniBrainBench. Our experiments reveal: (1) proprietary MLLMs (e.g., GPT-5) beat open-source and medical models but lag physicians; (2) medical MLLMs vary widely in performance; (3) open-source MLLMs trail overall but excel in specific tasks; (4) MLLMs underperform sharply in complex preoperative tasks, revealing a visual-to-clinical reasoning gap.
OmniBrainBench sets a new standard for evaluating and advancing MLLMs in brain imaging analysis, highlighting gaps compared to expert clinical reasoning. We release the benchmark and code.

[179] Occlusion-Aware Diffusion Model for Pedestrian Intention Prediction

Yu Liu,Zhijie Liu,Zedong Yang,You-Fu Li,He Kong

Main category: cs.CV

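TL;DR: Proposes an Occlusion-Aware Diffusion Model (ODM) for predicting pedestrian crossing intention under occlusion; it reconstructs occluded motion patterns and uses an occlusion mask to guide the reverse process, improving the robustness and accuracy of prediction.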

Details Motivation: Existing deep learning-based pedestrian intention prediction models handle incomplete observations caused by occlusion poorly and do not effectively model contextual semantic relationships in occluded scenes. Method: Proposes the Occlusion-Aware Diffusion Model (ODM), which uses an occlusion-aware diffusion transformer to estimate noise features of occluded regions and introduces an occlusion mask-guided reverse diffusion process that makes better use of observed information and reduces the accumulation of prediction errors. Result: Extensive experiments on the two mainstream datasets PIE and JAAD show that the method outperforms existing methods under a range of occlusion scenarios, significantly improving the accuracy and robustness of motion feature reconstruction and intention prediction. Conclusion: By explicitly modeling motion patterns and contextual relationships in occluded regions, ODM delivers more reliable pedestrian intention prediction in complex occluded environments and has strong practical potential. Abstract: Predicting pedestrian crossing intentions is crucial for the navigation of mobile robots and intelligent vehicles. Although recent deep learning-based models have shown significant success in forecasting intentions, few consider incomplete observation under occlusion scenarios. To tackle this challenge, we propose an Occlusion-Aware Diffusion Model (ODM) that reconstructs occluded motion patterns and leverages them to guide future intention prediction. During the denoising stage, we introduce an occlusion-aware diffusion transformer architecture to estimate noise features associated with occluded patterns, thereby enhancing the model's ability to capture contextual relationships in occluded semantic scenarios. Furthermore, an occlusion mask-guided reverse process is introduced to effectively utilize observation information, reducing the accumulation of prediction errors and enhancing the accuracy of reconstructed motion features. The performance of the proposed method under various occlusion scenarios is comprehensively evaluated and compared with existing methods on popular benchmarks, namely PIE and JAAD. Extensive experimental results demonstrate that the proposed method achieves more robust performance than existing methods in the literature.

[180] Layer-Wise Modality Decomposition for Interpretable Multimodal Sensor Fusion

Jaehyun Park,Konyul Park,Daehun Kim,Junseo Park,Jun Won Choi

Main category: cs.CV

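TL;DR: This paper proposes Layer-Wise Modality Decomposition (LMD), a post-hoc, model-agnostic interpretability method that separates per-modality information in multi-sensor fusion perception models for autonomous driving, making it possible for the first time to attribute a fusion model's predictions to individual input modalities.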

Details Motivation: Transparency in perception model decisions is critical in autonomous driving, but multi-sensor information becomes entangled inside the fusion network, making it hard to judge how much each modality contributes to a prediction. Method: Proposes LMD, which decomposes the contributions of different modalities layer by layer in a fusion model, enabling interpretability analysis of pretrained fusion models under camera-radar, camera-LiDAR, and camera-radar-LiDAR settings. Result: LMD is validated across several sensor combinations; structured perturbation-based metrics and modality-wise visual decompositions show that it can effectively explain high-capacity multimodal models. Conclusion: LMD is the first method to attribute perception model predictions to individual input modalities in a sensor-fusion system for autonomous driving, and it is practical and extensible. Abstract: In autonomous driving, transparency in the decision-making of perception models is critical, as even a single misperception can be catastrophic. Yet with multi-sensor inputs, it is difficult to determine how each modality contributes to a prediction because sensor information becomes entangled within the fusion network. We introduce Layer-Wise Modality Decomposition (LMD), a post-hoc, model-agnostic interpretability method that disentangles modality-specific information across all layers of a pretrained fusion model. To our knowledge, LMD is the first approach to attribute the predictions of a perception model to individual input modalities in a sensor-fusion system for autonomous driving. We evaluate LMD on pretrained fusion models under camera-radar, camera-LiDAR, and camera-radar-LiDAR settings for autonomous driving. Its effectiveness is validated using structured perturbation-based metrics and modality-wise visual decompositions, demonstrating practical applicability to interpreting high-capacity multimodal architectures. Code is available at https://github.com/detxter-jvb/Layer-Wise-Modality-Decomposition.

[181] GraphGeo: Multi-Agent Debate Framework for Visual Geo-localization with Heterogeneous Graph Neural Networks

Heng Zheng,Yuling Shi,Xiaodong Gu,Haochen You,Zijian Zhang,Lubin Gan,Hao Zhang,Wenjun Huang,Jin Huang

Main category: cs.CV

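TL;DR: Proposes GraphGeo, a multi-agent debate framework based on heterogeneous graph neural networks for visual geo-localization, which significantly improves localization accuracy by modeling different types of debate relationships.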

Details Motivation: Existing methods perform poorly across diverse geographic regions and complex scenes, and multi-agent systems lack a mechanism for handling conflicting predictions effectively. Method: Builds the multi-agent debate framework GraphGeo, which uses a heterogeneous graph neural network to model relationships such as supportive collaboration, competitive argumentation, and knowledge transfer, and introduces a dual-level (node-level and edge-level) debate mechanism together with a cross-level topology refinement strategy. Result: Experiments on multiple benchmarks show that GraphGeo significantly outperforms state-of-the-art methods and can turn cognitive conflicts between agents into higher localization accuracy. Conclusion: Through a structured debate mechanism, GraphGeo improves visual geo-localization performance and offers a new approach to multi-agent collaboration. Abstract: Visual geo-localization requires extensive geographic knowledge and sophisticated reasoning to determine image locations without GPS metadata. Traditional retrieval methods are constrained by database coverage and quality. Recent Large Vision-Language Models (LVLMs) enable direct location reasoning from image content, yet individual models struggle with diverse geographic regions and complex scenes. Existing multi-agent systems improve performance through model collaboration but treat all agent interactions uniformly. They lack mechanisms to handle conflicting predictions effectively. We propose GraphGeo, a multi-agent debate framework using heterogeneous graph neural networks for visual geo-localization. Our approach models diverse debate relationships through typed edges, distinguishing supportive collaboration, competitive argumentation, and knowledge transfer. We introduce a dual-level debate mechanism combining node-level refinement and edge-level argumentation modeling. A cross-level topology refinement strategy enables co-evolution between graph structure and agent representations. Experiments on multiple benchmarks demonstrate GraphGeo significantly outperforms state-of-the-art methods. Our framework transforms cognitive conflicts between agents into enhanced geo-localization accuracy through structured debate.

[182] Fleming-VL: Towards Universal Medical Visual Reasoning with Multimodal LLMs

Yan Shu,Chi Liu,Robin Chen,Derek Li,Bryan Dai

Main category: cs.CV

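TL;DR: Proposes Fleming-VL, a unified end-to-end multimodal large language model framework for medical visual understanding across heterogeneous modalities (2D, 3D, video), reaching SOTA performance on multiple medical benchmarks.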

Details Motivation: Medical data is heterogeneous (2D images, 3D scans, video sequences), and the domain gap and format inconsistencies between modalities have hindered the development of unified medical MLLMs. Method: Takes a data-centric approach: (1) large-scale pretraining that integrates long-context data from natural and medical domains; (2) fine-tuning supplemented with rare medical data (such as ultrasound, dermoscopy, and video); (3) an extended evaluation framework that includes 3D and video understanding tasks; models at multiple scales are trained with SFT and GRPO. Result: Fleming-VL achieves state-of-the-art performance on multiple medical VQA, video QA, and 3D medical image understanding benchmarks. Conclusion: Fleming-VL is an effective unified framework that handles multiple medical modalities and performs well across a wide range of tasks; it is publicly released to support reproducible research in medical AI. Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable effectiveness in various general-domain scenarios, such as visual question answering and image captioning. Recently, researchers have increasingly focused on empowering MLLMs with medical conversational abilities, which hold significant promise for clinical applications. However, medical data presents unique challenges due to its heterogeneous nature, encompassing diverse modalities including 2D images, 3D volumetric scans, and temporal video sequences. The substantial domain gap and data format inconsistencies across these modalities have hindered the development of unified medical MLLMs. To address these challenges, we propose Fleming-VL, a unified end-to-end framework for comprehensive medical visual understanding across heterogeneous modalities. Fleming-VL tackles this problem from a data-centric perspective through three key strategies: (1) scaling up pretraining by integrating long-context data from both natural and medical-specific domains; (2) complementing fine-tuning with rare medical data, including holistic video analysis and underrepresented 2D modalities such as ultrasound and dermoscopy images; (3) extending existing evaluation frameworks to incorporate 3D volumetric and video understanding benchmarks. Through supervised fine-tuning (SFT) and group relative policy optimization (GRPO), we develop Fleming-VL in multiple model scales. Extensive experiments demonstrate that Fleming-VL achieves state-of-the-art performance across multiple benchmarks, including medical VQA, video QA, and 3D medical image understanding.
We publicly release Fleming-VL to promote transparent, reproducible, and auditable progress in medical AI.

[183] Dynamic Multi-level Weighted Alignment Network for Zero-shot Sketch-based Image Retrieval

Hanwen Su,Ge Song,Jiyan Wang,Yuanbo Zhu

Main category: cs.CV

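TL;DR: Proposes a Dynamic Multi-level Weighted Alignment Network for zero-shot sketch-based image retrieval that outperforms existing methods on several benchmark datasets.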

Details Motivation: Existing methods train on modality-imbalanced samples and low-quality information, which leads to poor performance. Method: Designs three modules, a Uni-modal Feature Extraction Module, a Cross-modal Multi-level Weighting Module, and a Weighted Quadruplet Loss Module, to improve alignment quality and domain balance. Result: Experiments on the Sketchy, TU-Berlin, and QuickDraw datasets show that the method outperforms current state-of-the-art ZS-SBIR methods. Conclusion: The proposed Dynamic Multi-level Weighted Alignment Network effectively improves zero-shot sketch-based image retrieval. Abstract: The problem of zero-shot sketch-based image retrieval (ZS-SBIR) has attracted increasing attention due to its wide applications, e.g., e-commerce. Despite progress made in this field, previous works suffer from using imbalanced samples of modalities and inconsistent low-quality information during training, resulting in sub-optimal performance. Therefore, in this paper, we introduce an approach called Dynamic Multi-level Weighted Alignment Network for ZS-SBIR. It consists of three components: (i) a Uni-modal Feature Extraction Module that includes a CLIP text encoder and a ViT for extracting textual and visual tokens, (ii) a Cross-modal Multi-level Weighting Module that produces an alignment weight list by the local and global aggregation blocks to measure the aligning quality of sketch and image samples, (iii) a Weighted Quadruplet Loss Module aiming to improve the balance of domains in the triplet loss. Experiments on three benchmark datasets, i.e., Sketchy, TU-Berlin, and QuickDraw, show our method delivers superior performance over the state-of-the-art ZS-SBIR methods.

[184] EVTAR: End-to-End Try on with Additional Unpaired Visual Reference

Liuzhuozheng Li,Yue Gong,Shanyuan Liu,Bo Cheng,Yuhang Ma,Liebucha Wu,Dengyang Jiang,Zanyi Wang,Dawei Leng,Yuhui Yin

Main category: cs.CV

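TL;DR: Proposes EVTAR, an end-to-end virtual try-on model that uses additional reference images to improve try-on results; it needs no complex inputs such as pose or segmentation maps, and produces high-quality, detail-preserving try-on from only the source image and the target garment.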

Details Motivation: Existing virtual try-on methods depend on complex inputs (human pose, densepose, keypoints, and so on), which makes annotation costly and the methods impractical for real-world use. An end-to-end model that simplifies the inputs while keeping generation quality high is therefore needed. Method: Uses a two-stage training strategy to build an end-to-end generative model that requires no masks, densepose, or segmentation maps; introduces additional reference images (different people wearing the same garment) to improve the fidelity of garment texture and detail, and adds unpaired person images to diversify the training data. Result: Evaluated on two widely used benchmarks and a variety of tasks, EVTAR outperforms existing methods in garment detail preservation, overall visual quality, and try-on accuracy, with simple and efficient inference. Conclusion: By introducing reference images and simplifying input requirements, EVTAR makes virtual try-on more practical and higher in quality, is more feasible in real application scenarios, and suggests new design directions for future virtual try-on systems. Abstract: We propose EVTAR, an End-to-End Virtual Try-on model with Additional Reference, that directly fits the target garment onto the person image while incorporating reference images to enhance try-on accuracy. Most existing virtual try-on approaches rely on complex inputs such as agnostic person images, human pose, densepose, or body keypoints, making them labor-intensive and impractical for real-world applications. In contrast, EVTAR adopts a two-stage training strategy, enabling simple inference with only the source image and the target garment inputs. Our model generates try-on results without masks, densepose, or segmentation maps. Moreover, EVTAR leverages additional reference images of different individuals wearing the same clothes to better preserve garment texture and fine-grained details. This mechanism is analogous to how humans consider reference models when choosing outfits, thereby simulating a more realistic and high-quality dressing effect. We enrich the training data with supplementary references and unpaired person images to support these capabilities. We evaluate EVTAR on two widely used benchmarks and diverse tasks, and the results consistently validate the effectiveness of our approach.

[185] A Unified Reasoning Framework for Holistic Zero-Shot Video Anomaly Analysis

Dongheng Lin,Mengxue Qu,Kunyang Han,Jianbo Jiao,Xiaojie Jin,Yunchao Wei

Main category: cs.CV

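TL;DR: Proposes a unified zero-shot video anomaly analysis framework that uses task-wise chained reasoning to jointly perform temporal detection, spatial localization, and textual explanation, reaching state-of-the-art performance without additional training.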

Details Motivation: Existing video anomaly analysis methods are usually limited to frame-level detection and do not explain why an event is anomalous; most also depend on specific data and tasks and generalize poorly. A general, interpretable zero-shot framework that provides both spatio-temporal localization and semantic explanation is needed. Method: Builds a unified framework based on chained test-time reasoning: intra-task reasoning refines temporal detection, and inter-task chaining yields spatial localization and semantic explanation, with no fine-tuning or additional training, relying on the reasoning ability of foundation models to coordinate the tasks. Result: Reaches state-of-the-art zero-shot performance on multiple video anomaly detection, localization, and explanation benchmarks, confirming strong interpretability and generalization without additional data or gradients. Conclusion: Carefully designed prompts and a task-chained structure can unlock the reasoning power of foundation models, enabling efficient, interpretable, fully zero-shot video anomaly analysis. Abstract: Most video-anomaly research stops at frame-wise detection, typically outputting only frame-wise anomaly scores without spatial or semantic context and offering little insight into why an event is abnormal. Recent video anomaly localization and video anomaly understanding methods improve explainability but remain data-dependent and task-specific. We propose a unified reasoning framework that bridges the gap between temporal detection, spatial localization, and textual explanation. Our approach is built upon a chained test-time reasoning process that sequentially connects these tasks, enabling holistic zero-shot anomaly analysis without any additional training. Specifically, our approach leverages intra-task reasoning to refine temporal detections and inter-task chaining for spatial and semantic understanding, yielding improved interpretability and generalization in a fully zero-shot manner. Without any additional data or gradients, our method achieves state-of-the-art zero-shot performance across multiple video anomaly detection, localization, and explanation benchmarks. The results demonstrate that careful prompt design with task-wise chaining can unlock the reasoning power of foundation models, enabling practical, interpretable video anomaly analysis in a fully zero-shot manner. Project Page: https://rathgrith.github.io/Unified_Frame_VAA/.

[186] VesSAM: Efficient Multi-Prompting for Segmenting Complex Vessel

Suzhong Fu,Rui Sun,Xuan Ding,Jingqi Dong,Yiming Yang,Yao Zhu,Min Chang Jordan Ren,Delin Deng,Angelica Aviles-Rivero,Shuguang Cui,Zhen Li

Main category: cs.CV

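TL;DR: Proposes VesSAM, an efficient framework for 2D vessel segmentation that combines a convolutional adapter, a multi-prompt encoder, and a lightweight decoder, significantly outperforming existing methods on several metrics.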

Details Motivation: Foundation models such as SAM perform well on generic segmentation but poorly on vascular structures, whose thin branches and low texture contrast make accurate segmentation difficult. Method: Designs the VesSAM framework, comprising a convolutional adapter that enhances local texture, a multi-prompt encoder that fuses anatomical prompts (such as skeletons and bifurcation points), and a lightweight mask decoder that reduces jagged artifacts; also builds an automated pipeline for generating multi-prompt annotations and a cross-modality benchmark dataset. Result: VesSAM outperforms existing PEFT-based SAM variants by over 10% Dice and 13% IoU, approaches fully fine-tuned methods with fewer parameters, and generalizes better in out-of-distribution settings. Conclusion: VesSAM is an efficient, robust 2D vessel segmentation framework; by introducing domain-specific structure and multi-prompt learning, it significantly improves segmentation performance and generalization on complex vascular structures. Abstract: Accurate vessel segmentation is critical for clinical applications such as disease diagnosis and surgical planning, yet remains challenging due to thin, branching structures and low texture contrast. While foundation models like the Segment Anything Model (SAM) have shown promise in generic segmentation, they perform sub-optimally on vascular structures. In this work, we present VesSAM, a powerful and efficient framework tailored for 2D vessel segmentation. VesSAM integrates (1) a convolutional adapter to enhance local texture features, (2) a multi-prompt encoder that fuses anatomical prompts, including skeletons, bifurcation points, and segment midpoints, via hierarchical cross-attention, and (3) a lightweight mask decoder to reduce jagged artifacts. We also introduce an automated pipeline to generate structured multi-prompt annotations, and curate a diverse benchmark dataset spanning 8 datasets across 5 imaging modalities. Experimental results demonstrate that VesSAM consistently outperforms state-of-the-art PEFT-based SAM variants by over 10% Dice and 13% IoU, and achieves competitive performance compared to fully fine-tuned methods, with significantly fewer parameters. VesSAM also generalizes well to out-of-distribution (OoD) settings, outperforming all baselines in average OoD Dice and IoU.
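
Since the headline numbers are reported in Dice and IoU, a short sketch of how the two overlap metrics are computed for binary masks may help. This uses the standard definitions and is not taken from the VesSAM code; the function name `dice_and_iou`, the flat-list mask format, and the convention of scoring two empty masks as a perfect match are choices made for this example.

```python
from typing import Sequence, Tuple


def dice_and_iou(pred: Sequence[int], target: Sequence[int]) -> Tuple[float, float]:
    """Compute Dice and IoU for two flattened binary masks.

    Dice = 2 * |A and B| / (|A| + |B|)
    IoU  =     |A and B| / |A or B|
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection
    if union == 0:
        # Both masks are empty; treated here as a perfect match (a convention).
        return 1.0, 1.0
    dice = 2.0 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


# Toy masks: the prediction covers two pixels, one of which is correct.
dice, iou = dice_and_iou([1, 1, 0, 0], [1, 0, 0, 0])
```

The two metrics are monotonically related through Dice = 2 * IoU / (1 + IoU), so Dice is never lower than IoU for the same pair of masks.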

[187] $\left|\,\circlearrowright\,\boxed{\text{BUS}}\,\right|$: A Large and Diverse Multimodal Benchmark for evaluating the ability of Vision-Language Models to understand Rebus Puzzles

Trishanu Das,Abhilash Nandy,Khush Bajaj,Deepiha S

Main category: cs.CV

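TL;DR: Presents a large-scale benchmark of 1,333 English Rebus puzzles and proposes RebusDescProgICE, a model-agnostic framework that combines unstructured descriptions with code-based structured reasoning to improve vision-language models on this task.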

Details Motivation: Rebus puzzles draw on image recognition, commonsense reasoning, multi-step reasoning, and other skills, which makes them hard for current vision-language models; more effective benchmarks and reasoning frameworks are needed to evaluate and improve model performance. Method: The authors build a large, diverse benchmark of Rebus puzzles and propose the RebusDescProgICE framework, which combines unstructured descriptions, code-based structured reasoning, and improved in-context example selection. Result: Compared with Chain-of-Thought reasoning, the method improves performance by 2.1-4.1% on closed-source models and 20-30% on open-source models. Conclusion: RebusDescProgICE effectively improves vision-language models on Rebus puzzle understanding and confirms the importance of structured reasoning and well-chosen in-context examples. Abstract: Understanding Rebus Puzzles (Rebus Puzzles use pictures, symbols, and letters to represent words or phrases creatively) requires a variety of skills such as image recognition, cognitive skills, commonsense reasoning, multi-step reasoning, image-based wordplay, etc., making this a challenging task for even current Vision-Language Models. In this paper, we present $\left|\,\circlearrowright\,\boxed{\text{BUS}}\,\right|$, a large and diverse benchmark of 1,333 English Rebus Puzzles containing different artistic styles and levels of difficulty, spread across 18 categories such as food, idioms, sports, finance, entertainment, etc. We also propose RebusDescProgICE, a model-agnostic framework which uses a combination of an unstructured description and code-based, structured reasoning, along with better, reasoning-based in-context example selection, improving the performance of Vision-Language Models on $\left|\,\circlearrowright\,\boxed{\text{BUS}}\,\right|$ by 2.1-4.1% and 20-30% using closed-source and open-source models respectively compared to Chain-of-Thought Reasoning.

[188] MID: A Self-supervised Multimodal Iterative Denoising Framework

Chang Nie,Tianchen Deng,Zhe Liu,Hesheng Wang

Main category: cs.CV

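TL;DR: Proposes MID, a self-supervised multimodal iterative denoising framework that learns from noisy data without paired samples, effectively removes complex non-linear noise, and performs well across several domains.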

Details Motivation: Traditional rule-based denoising methods struggle with complex non-linear noise and depend on paired clean-noisy data, which limits their use in real-world settings. Method: MID models noisy data as one state in a continuous process of non-linear noise accumulation. By iteratively adding noise, it trains two neural networks: one estimates the current noise step and the other predicts and subtracts the corresponding noise increment. A first-order Taylor expansion locally linearizes the noise process. Result: On four classic computer vision tasks, as well as tasks in the biomedical and bioinformatics domains, MID shows strong robustness, adaptability, and state-of-the-art performance. Conclusion: MID is a general denoising framework that needs no paired training data, handles complex non-linear noise effectively, and has broad potential across practical applications. Abstract: Data denoising is a persistent challenge across scientific and engineering domains. Real-world data is frequently corrupted by complex, non-linear noise, rendering traditional rule-based denoising methods inadequate. To overcome these obstacles, we propose a novel self-supervised multimodal iterative denoising (MID) framework. MID models the collected noisy data as a state within a continuous process of non-linear noise accumulation. By iteratively introducing further noise, MID learns two neural networks: one to estimate the current noise step and another to predict and subtract the corresponding noise increment. For complex non-linear contamination, MID employs a first-order Taylor expansion to locally linearize the noise process, enabling effective iterative removal. Crucially, MID does not require paired clean-noisy datasets, as it learns noise characteristics directly from the noisy inputs. Experiments across four classic computer vision tasks demonstrate MID's robustness, adaptability, and consistent state-of-the-art performance. Moreover, MID exhibits strong performance and adaptability in tasks within the biomedical and bioinformatics domains.

[189] Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models

Xiaoyu Zhan,Wenxuan Huang,Hao Sun,Xinyu Fu,Changfeng Ma,Shaosheng Cao,Bohan Jia,Shaohui Lin,Zhenfei Yin,Lei Bai,Wanli Ouyang,Yuanqi Li,Jie Guo,Yanwen Guo

Main category: cs.CV

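TL;DR: Introduces the Viewpoint Learning task and the Viewpoint-100K dataset, and uses a two-stage fine-tuning strategy to improve the spatial reasoning of multimodal large language models (MLLMs); experiments show a marked improvement in 3D visual understanding.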

Details Motivation: MLLMs have advanced in 2D visual understanding, but it is unclear how well they capture spatial information in complex 3D reasoning tasks, especially cross-view consistency; this ability needs to be evaluated and improved. Method: The authors build Viewpoint-100K, a dataset of 100K multi-view image pairs, and apply two-stage fine-tuning: the first stage injects foundational spatial knowledge through supervised fine-tuning, and the second improves generalization with the GRPO reinforcement learning algorithm. A hybrid cold-start initialization method lets the model learn viewpoint representations while keeping its reasoning coherent. Result: The method significantly activates the spatial reasoning ability of MLLMs and improves performance on both in-domain and out-of-domain 3D reasoning tasks, confirming the value of training foundational spatial skills. Conclusion: Developing foundational spatial reasoning in MLLMs is essential for real-world applications such as robotics, autonomous driving, and 3D scene understanding, and this work offers an effective path toward that goal. Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have significantly improved 2D visual understanding, prompting interest in their application to complex 3D reasoning tasks. However, it remains unclear whether these models can effectively capture the detailed spatial information required for robust real-world performance, especially cross-view consistency, a key requirement for accurate 3D reasoning. Considering this issue, we introduce Viewpoint Learning, a task designed to evaluate and improve the spatial reasoning capabilities of MLLMs. We present the Viewpoint-100K dataset, consisting of 100K object-centric image pairs with diverse viewpoints and corresponding question-answer pairs. Our approach employs a two-stage fine-tuning strategy: first, foundational knowledge is injected into the baseline MLLM via Supervised Fine-Tuning (SFT) on Viewpoint-100K, resulting in significant improvements across multiple tasks; second, generalization is enhanced through Reinforcement Learning using the Group Relative Policy Optimization (GRPO) algorithm on a broader set of questions. Additionally, we introduce a hybrid cold-start initialization method designed to simultaneously learn viewpoint representations and maintain coherent reasoning. Experimental results show that our approach significantly activates the spatial reasoning ability of MLLMs, improving performance on both in-domain and out-of-domain reasoning tasks. Our findings highlight the value of developing foundational spatial skills in MLLMs, supporting future progress in robotics, autonomous systems, and 3D scene understanding.
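
GRPO, used in the second stage above, scores each sampled response relative to the other responses drawn for the same prompt instead of relying on a learned value function. A minimal sketch of that group-relative normalization, assuming scalar rewards and a population standard deviation (the function name, `eps`, and the example rewards are illustrative; the abstract does not describe the paper's reward design):

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one viewpoint question: two correct, two wrong.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Responses that beat the group average receive positive advantages and the rest negative ones, so the policy update needs only relative quality within each group.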

[190] Integrating Visual and X-Ray Machine Learning Features in the Study of Paintings by Goya

Hassan Ugail,Ismail Lujain Jaleel

Main category: cs.CV

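TL;DR: Proposes a multimodal machine learning framework for authenticating Goya paintings that applies one unified feature extraction method to visual and X-ray images, reaching 97.8% classification accuracy with a low false positive rate on 24 authenticated works.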

Details Motivation: Goya's complex stylistic evolution and a long history of forgery make traditional art authentication unreliable, so a more dependable computational method is needed. Method: A unified feature extraction pipeline (Grey-Level Co-occurrence Matrix descriptors, Local Binary Patterns, entropy, energy, and colour distribution analysis) processes both visual and X-ray images, and a hyperparameter-tuned One-Class Support Vector Machine is used for training and classification. Result: With an 80/20 train-test split and 10-fold cross-validation, the model reaches 97.8% accuracy and a 0.022 false positive rate; the case study of "Un Gigante" yields 92.3% authentication confidence. Conclusion: The multimodal approach clearly outperforms single-modal approaches, showing that applying a consistent computational method across imaging modalities is effective for art authentication. Abstract: Art authentication of Francisco Goya's works presents complex computational challenges due to his heterogeneous stylistic evolution and extensive historical patterns of forgery. We introduce a novel multimodal machine learning framework that applies identical feature extraction techniques to both visual and X-ray radiographic images of Goya paintings. The unified feature extraction pipeline incorporates Grey-Level Co-occurrence Matrix descriptors, Local Binary Patterns, entropy measures, energy calculations, and colour distribution analysis applied consistently across both imaging modalities. The extracted features from both visual and X-ray images are processed through an optimised One-Class Support Vector Machine with hyperparameter tuning. Using a dataset of 24 authenticated Goya paintings with corresponding X-ray images, split into an 80/20 train-test configuration with 10-fold cross-validation, the framework achieves 97.8% classification accuracy with a 0.022 false positive rate. Case study analysis of "Un Gigante" demonstrates the practical efficacy of our pipeline, achieving 92.3% authentication confidence through unified multimodal feature analysis. Our results indicate substantial performance improvement over single-modal approaches, establishing the effectiveness of applying identical computational methods to both visual and radiographic imagery in art authentication applications.
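
To make the texture descriptors named above concrete, here is a minimal pure-Python sketch of a Grey-Level Co-occurrence Matrix with two statistics derived from it. It assumes a single horizontal offset of one pixel and an image already quantized to a few grey levels; the function names and the toy image are hypothetical, and the paper's actual offsets, quantization, and feature set are not given in the abstract:

```python
import math


def glcm_horizontal(image, levels):
    """Normalized co-occurrence matrix for horizontally adjacent pixel pairs."""
    counts = [[0] * levels for _ in range(levels)]
    for row in image:
        for left, right in zip(row, row[1:]):
            counts[left][right] += 1
    total = sum(map(sum, counts))
    if total == 0:
        raise ValueError("image needs at least two columns")
    return [[c / total for c in row] for row in counts]


def angular_second_moment(p):
    """Sum of squared probabilities; some libraries report its square root as 'energy'."""
    return sum(v * v for row in p for v in row)


def glcm_entropy(p):
    """Shannon entropy (in bits) of the co-occurrence distribution."""
    return -sum(v * math.log2(v) for row in p for v in row if v > 0)


# Toy 2-level image: a dark left half next to a bright right half.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1]]
p = glcm_horizontal(image, levels=2)
features = [angular_second_moment(p), glcm_entropy(p)]
print(features)
```

In a pipeline like the one described, the same function would be run on the visual image and on the X-ray image, and the two feature vectors concatenated before being passed to the one-class classifier.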

[191] HyFormer-Net: A Synergistic CNN-Transformer with Interpretable Multi-Scale Fusion for Breast Lesion Segmentation and Classification in Ultrasound Images

Mohammad Amanour Rahman

Main category: cs.CV

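TL;DR: Proposes HyFormer-Net, a hybrid CNN-Transformer network for simultaneous segmentation and classification of breast ultrasound images with intrinsic interpretability; it outperforms existing models on several metrics and generalizes well across datasets through progressive fine-tuning.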

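Details Motivation: B-mode ultrasound for breast cancer diagnosis suffers from speckle noise, operator dependency, and indistinct boundaries. Existing deep learning methods are limited by single-task learning, architectural weaknesses (CNNs lack global context, Transformers lack local features), and opaque decision-making, all of which hinder clinical use. Method: HyFormer-Net uses a dual-branch encoder built from EfficientNet-B3 and Swin Transformer, combines the two through multi-scale hierarchical fusion blocks, and adds an attention-gated decoder for precise segmentation and interpretability. It introduces dual-pipeline interpretability: (1) intrinsic attention validation (quantitative IoU verification, mean 0.86) and (2) Grad-CAM for classification reasoning. Result: On the BUSI dataset, HyFormer-Net achieves a Dice score of 0.761±0.072, accuracy of 93.2%, and malignant recall of 92.1±2.2%; the ensemble reaches Dice 90.2%, accuracy 99.5%, and malignant recall 100%. Ablations show multi-scale fusion contributes +16.8% Dice and attention gates +5.9%. In cross-dataset experiments, zero-shot transfer fails (Dice 0.058), but fine-tuning with only 10% of target-domain data recovers 92.5% of performance, and with 50% of the data Dice reaches 77.3%, exceeding source-domain performance. Conclusion: HyFormer-Net delivers strong performance and good interpretability for breast ultrasound segmentation and classification, achieves strong cross-domain generalization after fine-tuning on a small amount of target-domain data, and has considerable clinical potential. Abstract: B-mode ultrasound for breast cancer diagnosis faces challenges: speckle, operator dependency, and indistinct boundaries. Existing deep learning suffers from single-task learning, architectural constraints (CNNs lack global context, Transformers local features), and black-box decision-making. These gaps hinder clinical adoption. We propose HyFormer-Net, a hybrid CNN-Transformer for simultaneous segmentation and classification with intrinsic interpretability. Its dual-branch encoder integrates EfficientNet-B3 and Swin Transformer via multi-scale hierarchical fusion blocks. An attention-gated decoder provides precision and explainability. We introduce dual-pipeline interpretability: (1) intrinsic attention validation with quantitative IoU verification (mean: 0.86), and (2) Grad-CAM for classification reasoning. On the BUSI dataset, HyFormer-Net achieves Dice Score 0.761 +/- 0.072 and accuracy 93.2%, outperforming U-Net, Attention U-Net, and TransUNet. Malignant Recall of 92.1 +/- 2.2% ensures minimal false negatives. Ensemble modeling yields exceptional Dice 90.2%, accuracy 99.5%, and perfect 100% Malignant Recall, eliminating false negatives. Ablation studies confirm multi-scale fusion contributes +16.8% Dice and attention gates add +5.9%. 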
Crucially, we conduct the first cross-dataset generalization study for hybrid CNN-Transformers in breast ultrasound. Zero-shot transfer fails (Dice: 0.058), confirming domain shift. However, progressive fine-tuning with only 10% target-domain data (68 images) recovers 92.5% performance. With 50% data, our model achieves 77.3% Dice, exceeding source-domain performance (76.1%) and demonstrating true generalization.

[192] FastBoost: Progressive Attention with Dynamic Scaling for Efficient Deep Learning

JunXi Yuan

Main category: cs.CV

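TL;DR: FastBoost is a parameter-efficient neural architecture that uses a Dynamically Scaled Progressive Attention (DSPA) mechanism to reach state-of-the-art performance on CIFAR benchmarks, combining high accuracy with a low parameter count suited to resource-constrained edge devices.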

Details Motivation: Deploying models on resource-constrained edge devices requires sharply reducing parameter counts while keeping accuracy high, which calls for a more efficient attention mechanism and network architecture. Method: FastBoost is built around the Dynamically Scaled Progressive Attention (DSPA) mechanism, which comprises three innovations (adaptive fusion, phase scaling, and residual adaptation), and combines it with enhanced MBConv blocks for efficient feature extraction and attention control. Result: On CIFAR-10 it reaches 95.57% accuracy (0.85M parameters) and 93.80% (0.37M parameters); on CIFAR-100 it reaches 81.37% (0.92M parameters) and 74.85% (0.44M parameters). Compared with MobileNetV3 it uses 2.1 times fewer parameters while improving accuracy by 3.2 percentage points, with 0.28G FLOPs and a 12.7% increase in gradient flow. Conclusion: By co-optimizing DSPA and efficient convolutions, FastBoost achieves high accuracy at very low parameter and compute cost, improving the parameter-accuracy trade-off for efficient deep learning on edge devices. Abstract: We present FastBoost, a parameter-efficient neural architecture that achieves state-of-the-art performance on CIFAR benchmarks through a novel Dynamically Scaled Progressive Attention (DSPA) mechanism. Our design establishes new efficiency frontiers with CIFAR-10: 95.57% accuracy (0.85M parameters) and 93.80% (0.37M parameters); and CIFAR-100: 81.37% accuracy (0.92M parameters) and 74.85% (0.44M parameters). The breakthrough stems from three fundamental innovations in DSPA: (1) Adaptive Fusion: Learnt channel-spatial attention blending with dynamic weights. (2) Phase Scaling: Training-stage-aware intensity modulation (from 0.5 to 1.0). (3) Residual Adaptation: Self-optimized skip connections (gamma from 0.5 to 0.72). By integrating DSPA with enhanced MBConv blocks, FastBoost achieves a 2.1 times parameter reduction over MobileNetV3 while improving accuracy by +3.2 percentage points on CIFAR-10. The architecture features dual attention pathways with real-time weight adjustment, cascaded refinement layers (increasing gradient flow by 12.7%), and a hardware-friendly design (0.28G FLOPs). This co-optimization of dynamic attention and efficient convolution operations demonstrates unprecedented parameter-accuracy trade-offs, enabling deployment in resource-constrained edge devices without accuracy degradation.

[193] T-MLA: A Targeted Multiscale Log--Exponential Attack Framework for Neural Image Compression

Nikolay I. Kalmykov,Razan Dibo,Kaiyu Shen,Xu Zhonghan,Anh-Huy Phan,Yipeng Liu,Ivan Oseledets

Main category: cs.CV

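TL;DR: Proposes T-MLA, a multiscale log-exponential attack framework targeting neural image compression (NIC) that crafts adversarial perturbations in the wavelet domain, sharply degrading reconstruction quality while remaining visually imperceptible.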

Details Motivation: Existing adversarial attacks on NIC are mostly simple transfers of pixel-space methods that ignore the structure of the compression pipeline, so they lack specificity and effectiveness. Method: Adversarial perturbations are designed in the wavelet domain to directly target the quality of the attacked and reconstructed images, and are confined to specific wavelet subbands, enabling an offline, principled attack. Result: Evaluation on several state-of-the-art NIC models and standard datasets confirms that T-MLA causes a large drop in reconstruction quality while the perturbations remain visually imperceptible. Conclusion: The work exposes a core security flaw in neural image compression systems that poses a serious threat to generative models and content delivery pipelines. Abstract: Neural image compression (NIC) has become the state-of-the-art for rate-distortion performance, yet its security vulnerabilities remain significantly less understood than those of classifiers. Existing adversarial attacks on NICs are often naive adaptations of pixel-space methods, overlooking the unique, structured nature of the compression pipeline. In this work, we propose a more advanced class of vulnerabilities by introducing T-MLA, the first targeted multiscale log--exponential attack framework. Our approach crafts adversarial perturbations in the wavelet domain by directly targeting the quality of the attacked and reconstructed images. This allows for a principled, offline attack where perturbations are strategically confined to specific wavelet subbands, maximizing distortion while ensuring perceptual stealth. Extensive evaluation across multiple state-of-the-art NIC architectures on standard image compression benchmarks reveals a large drop in reconstruction quality while the perturbations remain visually imperceptible. Our findings reveal a critical security flaw at the core of generative and content delivery pipelines.
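
The idea of confining a perturbation to one wavelet subband can be illustrated with a single-level 1D Haar transform. This is only a sketch of the subband-confinement step under simplifying assumptions (1D signal, one decomposition level, a hand-picked perturbation); it does not implement T-MLA's log-exponential formulation or its optimization, and all names are hypothetical:

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_forward(x):
    """Single-level orthonormal Haar transform of an even-length signal."""
    if len(x) % 2:
        raise ValueError("signal length must be even")
    approx = [(a + b) / SQRT2 for a, b in zip(x[0::2], x[1::2])]
    detail = [(a - b) / SQRT2 for a, b in zip(x[0::2], x[1::2])]
    return approx, detail


def haar_inverse(approx, detail):
    """Invert haar_forward."""
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / SQRT2, (a - d) / SQRT2])
    return out


def perturb_detail_only(x, delta):
    """Add delta to the high-frequency (detail) subband and leave the approximation untouched."""
    approx, detail = haar_forward(x)
    detail = [d + e for d, e in zip(detail, delta)]
    return haar_inverse(approx, detail)


signal = [4.0, 2.0, 6.0, 6.0]
attacked = perturb_detail_only(signal, delta=[0.1, -0.2])
print(attacked)
```

Because only the detail coefficients change, the low-frequency content of `attacked` matches that of `signal` exactly, which is the kind of structure a subband-confined attack exploits to stay visually subtle.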

[194] GeoToken: Hierarchical Geolocalization of Images via Next Token Prediction

Narges Ghasemi,Amir Ziashahabi,Salman Avestimehr,Cyrus Shahabi

Main category: cs.CV

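TL;DR: Proposes a hierarchical sequence prediction method for image geolocalization based on S2 cells; by mirroring the autoregressive generation of language models, it outperforms existing methods on multiple datasets.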

Details Motivation: Image geolocalization is hard because distant regions can look alike and the search space is large, so a more effective strategy for progressively narrowing down the location is needed. Method: Locations are encoded as hierarchical tokens using the S2 multiresolution global grid; an autoregressive model predicts cells from coarse to fine, and decoding strategies such as beam search and multi-sample inference are used to improve performance. Result: On Im2GPS3k and YFCC4k, the MLLM-free model surpasses comparable baselines on nearly all metrics, with gains of up to 13.9%; when augmented with an MLLM, it reaches state of the art on all metrics. Conclusion: The hierarchical token prediction framework handles the large search space and ambiguity of image geolocalization effectively and shows the promise of language-model-style decoding strategies for spatial prediction. Abstract: Image geolocalization, the task of determining an image's geographic origin, poses significant challenges, largely due to visual similarities across disparate locations and the large search space. To address these issues, we propose a hierarchical sequence prediction approach inspired by how humans narrow down locations from broad regions to specific addresses. Analogously, our model predicts geographic tokens hierarchically, first identifying a general region and then sequentially refining predictions to increasingly precise locations. Rather than relying on explicit semantic partitions, our method uses S2 cells, a nested, multiresolution global grid, and sequentially predicts finer-level cells conditioned on visual inputs and previous predictions. This procedure mirrors autoregressive text generation in large language models. Much like in language modeling, final performance depends not only on training but also on inference-time strategy. We investigate multiple top-down traversal methods for autoregressive sampling, incorporating techniques from test-time compute scaling used in language models. Specifically, we integrate beam search and multi-sample inference while exploring various selection strategies to determine the final output. This enables the model to manage uncertainty by exploring multiple plausible paths through the hierarchy. We evaluate our method on the Im2GPS3k and YFCC4k datasets against two distinct sets of baselines: those that operate without a Multimodal Large Language Model (MLLM) and those that leverage one. 
In the MLLM-free setting, our model surpasses other comparable baselines on nearly all metrics, achieving state-of-the-art performance with accuracy gains of up to 13.9%. When augmented with an MLLM, our model outperforms all baselines, setting a new state-of-the-art across all metrics. The source code is available at https://github.com/NNargesNN/GeoToken.
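
The coarse-to-fine decoding described above can be sketched as beam search over a token hierarchy. The toy model below is a hand-written probability table standing in for the image-conditioned network, and the two-level hierarchy only loosely mimics S2 cells (a top-level face followed by one of four children); every name and number here is a hypothetical illustration, not the GeoToken implementation:

```python
import math


def beam_search(next_token_logprobs, depth, beam_width):
    """Keep the beam_width best token prefixes at each level of the hierarchy."""
    beams = [((), 0.0)]  # (prefix, cumulative log-probability)
    for _ in range(depth):
        candidates = []
        for prefix, score in beams:
            for token, logp in next_token_logprobs(prefix).items():
                candidates.append((prefix + (token,), score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


def toy_model(prefix):
    """Stand-in for the network: log-probabilities of child cells given a prefix."""
    table = {
        (): {2: 0.6, 4: 0.4},                        # coarse level: two plausible regions
        (2,): {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25},  # region 2: no clear refinement
        (4,): {0: 0.04, 1: 0.90, 2: 0.03, 3: 0.03},  # region 4: one confident child
    }
    return {tok: math.log(p) for tok, p in table[prefix].items()}


greedy = beam_search(toy_model, depth=2, beam_width=1)[0]
wide = beam_search(toy_model, depth=2, beam_width=2)[0]
print(greedy[0], math.exp(greedy[1]))
print(wide[0], math.exp(wide[1]))
```

In this toy table the greedy path commits to the more likely coarse region and ends with a lower joint probability than the path found with a beam of two, which illustrates why exploring several paths through the hierarchy can help under uncertainty.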

[195] SliceVision-F2I: A Synthetic Feature-to-Image Dataset for Visual Pattern Representation on Network Slices

Md. Abid Hasan Rafi,Mst. Fatematuj Johora,Pankaj Bhowmik

Main category: cs.CV

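TL;DR: SliceVision-F2I is a synthetic dataset for feature visualization in next-generation network slicing; it contains image samples produced by four encoding methods and suits tasks such as visual learning and network state classification.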

Details Motivation: Network slicing in 5G/6G demands refined identification methods, which in turn require high-quality, reusable datasets to advance image-based machine learning for network data analysis. Method: Multivariate KPI vectors are converted into low-resolution RGB images through four encoding methods (physically inspired mappings, Perlin noise, neural wallpapering, and fractal branching), with 30,000 samples generated per method. Result: The result is a public dataset of 120,000 samples that simulates realistic, noisy network conditions and supports network state classification, anomaly detection, and benchmarking of image-based learning algorithms. Conclusion: SliceVision-F2I provides a useful tool for research on feature visualization in network slicing and promotes image-driven machine learning in network management. Abstract: The emergence of 5G and 6G networks has established network slicing as a significant part of future service-oriented architectures, demanding refined identification methods supported by robust datasets. The article presents SliceVision-F2I, a dataset of synthetic samples for studying feature visualization in network slicing for next-generation networking systems. The dataset transforms multivariate Key Performance Indicator (KPI) vectors into visual representations through four distinct encoding methods: physically inspired mappings, Perlin noise, neural wallpapering, and fractal branching. For each encoding method, 30,000 samples are generated, each comprising a raw KPI vector and a corresponding low-resolution RGB image. The dataset simulates realistic and noisy network conditions to reflect operational uncertainties and measurement imperfections. SliceVision-F2I is suitable for tasks involving visual learning, network state classification, anomaly detection, and benchmarking of image-based machine learning techniques applied to network data. The dataset is publicly available and can be reused in various research contexts, including multivariate time series analysis, synthetic data generation, and feature-to-image transformations.

[196] Epanechnikov nonparametric kernel density estimation based feature-learning in respiratory disease chest X-ray images

Veronica Marsico,Antonio Quintero-Rincon,Hadj Batatia

Main category: cs.CV

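TL;DR: Proposes a method for diagnosing respiratory diseases that combines Epanechnikov kernel density estimation with a bimodal logistic regression classifier.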

Details Motivation: The aim is to improve the accuracy and reliability of respiratory disease diagnosis from medical images by using a non-parametric model that flexibly captures the distribution of image data. Method: Epanechnikov non-parametric kernel density estimation (EKDE) is combined with a bimodal logistic regression classifier in a statistical-model-based learning framework that extracts key features from chest X-ray images and classifies them. Result: Tested on 13,808 chest X-rays, the method reaches 70.14% accuracy, 59.26% sensitivity, and 74.18% specificity, indicating moderate detection performance with room to improve sensitivity. Conclusion: EKDE-based methods show potential for medical imaging diagnosis, but the model needs further refinement informed by clinical expertise. Abstract: This study presents a novel method for diagnosing respiratory diseases using image data. It combines Epanechnikov's non-parametric kernel density estimation (EKDE) with a bimodal logistic regression classifier in a statistical-model-based learning scheme. EKDE's flexibility in modeling data distributions without assuming specific shapes and its adaptability to pixel intensity variations make it valuable for extracting key features from medical images. The method was tested on 13,808 randomly selected chest X-rays from the COVID-19 Radiography Dataset and achieved an accuracy of 70.14%, a sensitivity of 59.26%, and a specificity of 74.18%, demonstrating moderate performance in detecting respiratory disease while showing room for improvement in sensitivity. While clinical expertise remains essential for further refining the model, this study highlights the potential of EKDE-based approaches to enhance diagnostic accuracy and reliability in medical imaging.
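
For reference, the Epanechnikov kernel is K(u) = 0.75·(1 − u²) for |u| ≤ 1 and 0 otherwise, and the density estimate at x averages K((x − x_i)/h)/h over the samples. A minimal sketch, assuming 1D pixel intensities and a fixed bandwidth (the function names, sample values, and bandwidth are illustrative; the abstract does not say how the paper selects the bandwidth or turns densities into features):

```python
def epanechnikov(u):
    """Epanechnikov kernel: a parabola on [-1, 1], zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def kde(x, samples, bandwidth):
    """Kernel density estimate at x from 1D samples."""
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    n = len(samples)
    return sum(epanechnikov((x - s) / bandwidth) for s in samples) / (n * bandwidth)


# Toy normalized pixel intensities from an image region.
intensities = [0.10, 0.15, 0.20, 0.60, 0.65]
density_curve = [kde(x / 10, intensities, bandwidth=0.2) for x in range(11)]
print(density_curve)
```

The kernel has compact support, so samples farther than one bandwidth from x contribute nothing, which keeps the estimate local to nearby intensity values.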

[197] Anatomically Constrained Transformers for Echocardiogram Analysis

Alexander Thorley,Agis Chartsias,Jordan Strom,Jeremy Slivnick,Dipak Kotecha,Alberto Gomez,Jinming Duan

Main category: cs.CV

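TL;DR: Proposes ViACT, a video transformer framework that incorporates anatomical priors to improve the accuracy and interpretability of echocardiogram analysis.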

Details Motivation: Existing video transformers tend to learn spurious correlations from non-diagnostic regions such as image backgrounds, which hurts their performance and reliability in echocardiogram analysis. Method: ViACT represents a deforming anatomical structure as a point set and encodes its spatial geometry and the corresponding image patches into transformer tokens; pre-training follows a masked autoencoding strategy that masks and reconstructs only anatomical patches, so representation learning stays focused on the anatomical region. Result: ViACT performs well on left ventricular ejection fraction regression and cardiac amyloidosis detection, produces attention maps aligned with regions of known pathology, and handles myocardium point tracking without task-specific components. Conclusion: By integrating anatomical priors, ViACT improves the performance and interpretability of video transformers for echocardiogram analysis and has broad application potential. Abstract: Video transformers have recently demonstrated strong potential for echocardiogram (echo) analysis, leveraging self-supervised pre-training and flexible adaptation across diverse tasks. However, like other models operating on videos, they are prone to learning spurious correlations from non-diagnostic regions such as image backgrounds. To overcome this limitation, we propose the Video Anatomically Constrained Transformer (ViACT), a novel framework that integrates anatomical priors directly into the transformer architecture. ViACT represents a deforming anatomical structure as a point set and encodes both its spatial geometry and corresponding image patches into transformer tokens. During pre-training, ViACT follows a masked autoencoding strategy that masks and reconstructs only anatomical patches, enforcing that representation learning is focused on the anatomical region. The pre-trained model can then be fine-tuned for tasks localized to this region. In this work we focus on the myocardium, demonstrating the framework on echo analysis tasks such as left ventricular ejection fraction (EF) regression and cardiac amyloidosis (CA) detection. The anatomical constraint focuses transformer attention within the myocardium, yielding interpretable attention maps aligned with regions of known CA pathology. Moreover, ViACT generalizes to myocardium point tracking without requiring task-specific components such as correlation volumes used in specialized tracking networks.
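
The masking rule described above, in which only anatomical patches are candidates for masking and reconstruction, can be sketched as follows. This assumes a flat list of patches with a boolean flag marking the ones that fall on the anatomy; ViACT itself derives its patches from a point set, so the data layout, function name, and mask ratio here are hypothetical:

```python
import random


def sample_anatomical_mask(is_anatomical, mask_ratio, seed=None):
    """Pick patch indices to mask, drawing only from patches flagged as anatomical."""
    rng = random.Random(seed)
    candidates = [i for i, flag in enumerate(is_anatomical) if flag]
    n_mask = round(mask_ratio * len(candidates))
    return sorted(rng.sample(candidates, n_mask))


# Six patches; indices 1, 2, 4, 5 lie on the myocardium, 0 and 3 are background.
flags = [False, True, True, False, True, True]
masked = sample_anatomical_mask(flags, mask_ratio=0.5, seed=0)
print(masked)
```

Background patches are never selected, so the reconstruction loss can only reward features learned from the anatomical region.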

[198] Boosting performance of computer vision applications through embedded GPUs on the edge

Fabio Diniz Rossi

Main category: cs.CV

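TL;DR: Proposes using GPU-equipped embedded devices to boost the performance of mobile computer vision applications in edge computing environments and thereby improve user experience.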

Details Motivation: Computer vision and augmented reality applications are resource-hungry while mobile devices have limited resources; edge computing can help, but edge devices are themselves constrained, which affects user experience. Method: Tasks are offloaded to GPU-equipped embedded devices, and experiments compare GPU and CPU performance. Result: The experiments show that using GPUs yields a performance gain over using CPUs alone. Conclusion: GPU-equipped embedded devices can ease the resource limits of edge computing and improve the experience of users running computer vision applications. Abstract: Computer vision applications, especially those using augmented reality technology, are becoming quite popular on mobile devices. However, this type of application is known to place significant demands on resources. To enable its use on devices with more modest resources, edge computing can be used to offload certain highly intensive tasks. Still, edge computing is usually composed of devices with limited capacity, which may impact users' quality of experience when using computer vision applications. This work proposes the use of embedded devices with graphics processing units (GPUs) to overcome this limitation. Experiments show that GPUs can attain a performance gain compared to using only CPUs, which provides a better experience to users of this kind of application.

[199] Weakly Supervised Concept Learning with Class-Level Priors for Interpretable Medical Diagnosis

Md Nahiduzzaman,Steven Korevaar,Alireza Bab-Hadiashar,Ruwan Tennakoon

Main category: cs.CV

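TL;DR: Proposes PCP, a weakly supervised framework that needs neither explicit supervision nor language models, and uses class-level concept priors with a refinement mechanism to improve interpretable prediction on medical images.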

Details Motivation: Existing interpretability methods either require costly concept annotations or fail to capture domain-specific medical features, which limits their clinical use. Method: The Prior-guided Concept Predictor (PCP) uses class-level concept priors as weak supervision and adds a refinement mechanism with KL divergence and entropy regularization to align predictions with clinical reasoning. Result: On PH2 and WBCatt, PCP improves concept-level F1-score by over 33% compared with zero-shot baselines, and on four medical datasets it achieves classification performance comparable to fully supervised concept bottleneck models. Conclusion: PCP is an effective weakly supervised interpretable method that improves concept prediction and classification in medical image analysis without concept annotations and shows good clinical potential. Abstract: Human-interpretable predictions are essential for deploying AI in medical imaging, yet most interpretable-by-design (IBD) frameworks require concept annotations for training data, which are costly and impractical to obtain in clinical contexts. Recent attempts to bypass annotation, such as zero-shot vision-language models or concept-generation frameworks, struggle to capture domain-specific medical features, leading to poor reliability. In this paper, we propose a novel Prior-guided Concept Predictor (PCP), a weakly supervised framework that enables concept answer prediction without explicit supervision or reliance on language models. PCP leverages class-level concept priors as weak supervision and incorporates a refinement mechanism with KL divergence and entropy regularization to align predictions with clinical reasoning. Experiments on PH2 (dermoscopy) and WBCatt (hematology) show that PCP improves concept-level F1-score by over 33% compared to zero-shot baselines, while delivering competitive classification performance on four medical datasets (PH2, WBCatt, HAM10000, and CXR4) relative to fully supervised concept bottleneck models (CBMs) and V-IP.
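
One way to read "KL divergence and entropy regularization" is as a loss that pulls the predicted concept distribution toward the class-level prior while discouraging overly uncertain predictions. The sketch below shows that reading only; the direction of the KL term, the sign and weight of the entropy term, and all names are assumptions, since the abstract does not give PCP's actual objective:

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same concept answers."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(pi * math.log(pi + eps) for pi in p if pi > 0)


def prior_guided_loss(pred, prior, entropy_weight=0.1):
    """Assumed form: match the class-level prior and penalize high-entropy predictions."""
    return kl_divergence(prior, pred) + entropy_weight * entropy(pred)


# Hypothetical prior for one concept given the class, e.g. "pigment network present".
prior = [0.9, 0.1]
loss_aligned = prior_guided_loss([0.9, 0.1], prior)
loss_uniform = prior_guided_loss([0.5, 0.5], prior)
print(loss_aligned, loss_uniform)
```

Under these assumptions a prediction that agrees with the prior incurs a much smaller loss than an uninformative uniform prediction, which is the behaviour a weak class-level supervision signal is meant to encourage.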

[200] Learning with Category-Equivariant Architectures for Human Activity Recognition

Yoshihiro Maruyama

Main category: cs.CV

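TL;DR: Proposes CatEquiv, a category-equivariant neural network for human activity recognition from inertial sensors that encodes temporal, amplitude, and structural symmetries and shows stronger robustness and generalization under out-of-distribution perturbations.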

Details Motivation: To improve the robustness and generalization of human activity recognition models under out-of-distribution perturbations, the temporal, amplitude, and structural symmetries of inertial sensor data need to be modeled systematically. Method: The paper introduces the categorical symmetry product, which combines cyclic time shifts, positive gains, and the sensor-hierarchy poset, and builds CatEquiv, a neural network equivariant with respect to this categorical symmetry structure. Result: On UCI-HAR under out-of-distribution perturbations, CatEquiv is markedly more robust than circularly padded CNNs and plain CNNs. Conclusion: Enforcing categorical symmetries yields strong invariance and generalization without additional model capacity. Abstract: We propose CatEquiv, a category-equivariant neural network for Human Activity Recognition (HAR) from inertial sensors that systematically encodes temporal, amplitude, and structural symmetries. In particular, we introduce the categorical symmetry product where cyclic time shifts, positive gains and the sensor-hierarchy poset together capture the categorical symmetry structure of the data. CatEquiv achieves equivariance with respect to the categorical symmetry product. On UCI-HAR under out-of-distribution perturbations, CatEquiv attains markedly higher robustness compared with circularly padded CNNs and plain CNNs. These results demonstrate that enforcing categorical symmetries yields strong invariance and generalization without additional model capacity.
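
Two of the symmetries named above are easy to check numerically: a circular convolution followed by ReLU commutes with cyclic time shifts and with multiplication by a positive gain. The sketch below verifies both on a toy integer signal; it does not model the sensor-hierarchy poset or CatEquiv's actual architecture, and the layer, kernel, and signal are hypothetical:

```python
def circular_conv(signal, kernel):
    """1D correlation with wrap-around (circular) indexing, no bias."""
    n = len(signal)
    return [sum(kernel[j] * signal[(i + j) % n] for j in range(len(kernel)))
            for i in range(n)]


def relu(values):
    return [max(v, 0) for v in values]


def layer(signal, kernel=(1, -1, 2)):
    """A bias-free circular convolution followed by ReLU."""
    return relu(circular_conv(signal, kernel))


def cyclic_shift(signal, k):
    """Shift the signal right by k steps with wrap-around."""
    n = len(signal)
    k %= n
    return signal[n - k:] + signal[:n - k]


x = [1, -2, 3, 0, -1, 4]

# Equivariance to cyclic time shifts: shifting the input shifts the output.
shift_ok = layer(cyclic_shift(x, 2)) == cyclic_shift(layer(x), 2)

# Equivariance to positive gains: scaling the input scales the output.
gain_ok = layer([3 * v for v in x]) == [3 * v for v in layer(x)]

print(shift_ok, gain_ok)
```

The gain property relies on the layer having no bias term and on the gain being positive; adding a bias or allowing negative gains breaks it, which is one reason such symmetries have to be designed into the architecture.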

[201] MicroAUNet: Boundary-Enhanced Multi-scale Fusion with Knowledge Distillation for Colonoscopy Polyp Image Segmentation

Ziyi Wang,Yuanmei Zhang,Dorna Esrafilzadeh,Ali R. Jalili,Suncheng Xiang

Main category: cs.CV

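TL;DR: Proposes MicroAUNet, a lightweight attention-based segmentation network that combines depthwise-separable dilated convolutions with a single-path, parameter-shared channel-spatial attention block to strengthen multi-scale boundary features, and uses two-stage knowledge distillation to transfer semantic and boundary information from a high-capacity teacher; it reaches state-of-the-art accuracy at very low complexity, making it suitable for real-time clinical polyp segmentation.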

Details Motivation: Existing deep learning models for colorectal polyp segmentation either produce ambiguous polyp margins that compromise clinical decisions or are too heavy to run fast enough for real-time endoscopy. Method: MicroAUNet is a lightweight network that uses depthwise-separable dilated convolutions and a single-path, parameter-shared channel-spatial attention block to enhance boundary features; a progressive two-stage knowledge distillation scheme transfers semantic and boundary cues from a high-capacity teacher. Result: Experiments on several benchmark datasets show that MicroAUNet reaches state-of-the-art segmentation accuracy at very low model complexity with efficient inference, making it suitable for real-time clinical use. Conclusion: MicroAUNet keeps accuracy high while sharply reducing computational cost, addressing the limits of existing models in clinical practicality and real-time operation, and offers an efficient solution for real-time colorectal polyp segmentation. Abstract: Early and accurate segmentation of colorectal polyps is critical for reducing colorectal cancer mortality, which has been extensively explored by academia and industry. However, current deep learning-based polyp segmentation models either compromise clinical decision-making by providing ambiguous polyp margins in segmentation outputs or rely on heavy architectures with high computational complexity, resulting in insufficient inference speeds for real-time colorectal endoscopic applications. To address this problem, we propose MicroAUNet, a lightweight attention-based segmentation network that combines depthwise-separable dilated convolutions with a single-path, parameter-shared channel-spatial attention block to strengthen multi-scale boundary features. On top of this, a progressive two-stage knowledge-distillation scheme is introduced to transfer semantic and boundary cues from a high-capacity teacher. Extensive experiments on benchmarks demonstrate state-of-the-art accuracy under extremely low model complexity, indicating that MicroAUNet is suitable for real-time clinical polyp segmentation. The code is publicly available at https://github.com/JeremyXSC/MicroAUNet.
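
Much of the complexity saving in a network of this kind comes from replacing standard convolutions with depthwise-separable ones. A short worked count, assuming square kernels and no bias terms (the channel sizes are illustrative and not taken from MicroAUNet):

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a k x k depthwise convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


# Illustrative layer: 64 input channels, 128 output channels, 3 x 3 kernel.
standard = standard_conv_params(64, 128, 3)
separable = depthwise_separable_params(64, 128, 3)
print(standard, separable, round(standard / separable, 2))
```

Dilation enlarges the receptive field by spacing out the kernel taps without adding weights, so a dilated depthwise-separable layer has the same parameter count as the undilated one.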

[202] ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation

Yongyuan Liang,Wei Chow,Feng Li,Ziqiao Ma,Xiyao Wang,Jiageng Mao,Jiuhai Chen,Jiatao Gu,Yue Wang,Furong Huang

Main category: cs.CV

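TL;DR: Presents ROVER, a human-annotated benchmark for evaluating reciprocal cross-modal reasoning in unified multimodal models. Existing evaluations mostly test text or image abilities in isolation, whereas ROVER targets a core ability: using one modality to guide, verify, or refine outputs in the other. The benchmark contains 1,312 tasks across two complementary settings: verbally-augmented reasoning to improve image generation, and visually-augmented reasoning to improve verbal answers. Experiments on 17 models show that cross-modal reasoning strongly affects generation quality, with interleaved models outperforming non-interleaved ones, and that models show a dissociation between physical and symbolic reasoning: they handle concrete concepts well but struggle to construct abstract visual representations.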

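Details Motivation: Most existing evaluations of multimodal models assess text and image abilities separately and score unimodal outputs, so they cannot measure reciprocal cross-modal reasoning on tasks with multimodal inputs and outputs. This ability, using one modality to guide or correct outputs in the other, is central to unified multimodal intelligence, and a dedicated benchmark is needed. Method: The ROVER benchmark contains 1,312 tasks grounded in 1,876 images and focuses on reciprocal cross-modal reasoning. It defines two complementary settings: verbally-augmented visual generation, which tests whether models can use verbal prompts and reasoning chains to generate more faithful images, and visually-augmented verbal generation, which tests whether models can generate intermediate visualizations that strengthen their own reasoning for question answering. All tasks are human-annotated with scoring criteria, and 17 mainstream unified multimodal models are evaluated systematically. Result: (1) Cross-modal reasoning strongly affects visual generation quality; interleaved models clearly outperform non-interleaved ones, and combining strong unimodal models does not achieve comparable reasoning. (2) Models show a dissociation between physical and symbolic reasoning: they interpret perceptual concepts literally and accurately but perform poorly on symbolic tasks that require abstract visual representations, where faulty internal reasoning harms final performance. Conclusion: Reciprocal cross-modal reasoning is a key frontier for true multimodal generative intelligence. ROVER provides an effective tool for evaluating this ability and exposes the limits of current models in fusing verbal and visual reasoning, especially in modeling symbolic abstraction; future models need stronger cross-modal reasoning mechanisms. Abstract: Unified multimodal models (UMMs) have emerged as a powerful paradigm for seamlessly unifying text and image understanding and generation. However, prevailing evaluations treat these abilities in isolation, such that tasks with multimodal inputs and outputs are scored primarily through unimodal reasoning, i.e., textual benchmarks emphasize language-based reasoning, while visual benchmarks emphasize reasoning outcomes manifested in the pixels. We introduce ROVER to address this pressing need to test reciprocal cross-modal reasoning, the use of one modality to guide, verify, or refine outputs in the other, an ability central to the vision of unified multimodal intelligence. ROVER is a human-annotated benchmark that explicitly targets reciprocal cross-modal reasoning, which contains 1312 tasks grounded in 1876 images, spanning two complementary settings. Verbally-augmented reasoning for visual generation evaluates whether models can use verbal prompts and reasoning chains to guide faithful image synthesis. Visually-augmented reasoning for verbal generation evaluates whether models can generate intermediate visualizations that strengthen their own reasoning processes for question answering. 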
Experiments on 17 unified models reveal two key findings: (i) Cross-modal reasoning determines visual generation quality, with interleaved models significantly outperforming non-interleaved ones; notably, combining strong unimodal models fails to achieve comparable reasoning. (ii) Models show dissociation between physical and symbolic reasoning: they succeed at interpreting perceptual concepts literally but fail to construct visual abstractions for symbolic tasks, where faulty reasoning harms performance. These results highlight reciprocal cross-modal reasoning as a critical frontier for enabling true omnimodal generation.

[203] Web-Scale Collection of Video Data for 4D Animal Reconstruction

Brian Nlong Zhao,Jiajun Wu,Shangzhe Wu

Main category: cs.CV

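TL;DR: Proposes a pipeline that automatically mines and processes animal videos from YouTube, and builds a large-scale dataset and the Animal-in-Motion (AiM) benchmark to advance markerless, in-the-wild 4D animal reconstruction.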

Details Motivation: Existing animal video datasets are small and sparsely annotated, cannot support animal-centric 3D/4D tasks, and rely on controlled capture setups, leaving non-invasive, single-view, real-world video underused. Method: An automated pipeline mines single-view animal videos from YouTube, extracts object-centric clips, and generates auxiliary annotations. Using it, the authors build a large-scale dataset of 30K videos (2M frames) and release the Animal-in-Motion (AiM) benchmark of 230 curated sequences; they then enhance a model-free method with sequence-level optimization to establish the first 4D animal reconstruction baseline. Result: The dataset is an order of magnitude larger than prior work. On AiM, model-based methods score well on 2D metrics but produce unrealistic 3D shapes, while model-free methods yield more natural reconstructions but score lower, revealing a limitation of current evaluation; the enhanced method provides the first 4D animal reconstruction baseline. Conclusion: Together, the pipeline, benchmark, and baseline advance large-scale, markerless 4D animal reconstruction and related tasks from in-the-wild videos. Abstract: Computer vision for animals holds great promise for wildlife research but often depends on large-scale data, while existing collection methods rely on controlled capture setups. Recent data-driven approaches show the potential of single-view, non-invasive analysis, yet current animal video datasets are limited--offering as few as 2.4K 15-frame clips and lacking key processing for animal-centric 3D/4D tasks. We introduce an automated pipeline that mines YouTube videos and processes them into object-centric clips, along with auxiliary annotations valuable for downstream tasks like pose estimation, tracking, and 3D/4D reconstruction. Using this pipeline, we amass 30K videos (2M frames)--an order of magnitude more than prior works. To demonstrate its utility, we focus on the 4D quadruped animal reconstruction task. To support this task, we present Animal-in-Motion (AiM), a benchmark of 230 manually filtered sequences with 11K frames showcasing clean, diverse animal motions. We evaluate state-of-the-art model-based and model-free methods on Animal-in-Motion, finding that 2D metrics favor the former despite unrealistic 3D shapes, while the latter yields more natural reconstructions but scores lower--revealing a gap in current evaluation. To address this, we enhance a recent model-free approach with sequence-level optimization, establishing the first 4D animal reconstruction baseline. Together, our pipeline, benchmark, and baseline aim to advance large-scale, markerless 4D animal reconstruction and related tasks from in-the-wild videos. 
Code and datasets are available at https://github.com/briannlongzhao/Animal-in-Motion.

[204] Diffusion Transformer meets Multi-level Wavelet Spectrum for Single Image Super-Resolution

Peng Du,Hui Li,Han Xu,Paul Barom Jeon,Dongwook Lee,Daehyun Ji,Ran Yang,Feng Zhu

Main category: cs.CV

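TL;DR: Proposes DTWSR, an image super-resolution method based on a diffusion transformer and wavelet spectra. A multi-level discrete wavelet transform and pyramid tokenization capture the interrelations among multiscale frequency sub-bands, significantly improving the quality of the reconstructed image.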

Details Motivation: Most existing super-resolution methods based on the discrete wavelet transform (DWT) ignore the interrelations among multiscale frequency sub-bands, which leads to inconsistencies and artifacts in the reconstructed image. Method: The DTWSR model decomposes images into wavelet spectra with a multi-level discrete wavelet transform (MDWT), converts the spectra into a token sequence for the transformer through pyramid tokenization, and uses a dual decoder that handles the low-frequency and high-frequency sub-bands separately while keeping them aligned during generation. Result: Experiments on multiple benchmark datasets show strong results in both perceptual quality and fidelity, outperforming existing methods. Conclusion: DTWSR models the dependencies among multiscale frequency sub-bands and produces more consistent and realistic super-resolved images, supporting the combination of diffusion models, transformers, and wavelet spectral analysis. Abstract: Discrete Wavelet Transform (DWT) has been widely explored to enhance the performance of image super-resolution (SR). Despite some DWT-based methods improving SR by capturing fine-grained frequency signals, most existing approaches neglect the interrelations among multiscale frequency sub-bands, resulting in inconsistencies and unnatural artifacts in the reconstructed images. To address this challenge, we propose a Diffusion Transformer model based on image Wavelet spectra for SR (DTWSR). DTWSR incorporates the superiority of diffusion models and transformers to capture the interrelations among multiscale frequency sub-bands, leading to a more consistent and realistic SR image. Specifically, we use a Multi-level Discrete Wavelet Transform (MDWT) to decompose images into wavelet spectra. A pyramid tokenization method is proposed which embeds the spectra into a sequence of tokens for the transformer model, facilitating the capture of features from both the spatial and frequency domains. A dual-decoder is designed elaborately to handle the distinct variances in low-frequency (LF) and high-frequency (HF) sub-bands, without omitting their alignment in image generation. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness of our method, with high performance on both perception quality and fidelity.
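To make the multi-level decomposition concrete, here is a minimal sketch of a multi-level 2D wavelet transform. It assumes the orthonormal Haar wavelet and plain nested lists; the abstract does not say which wavelet or how many levels DTWSR uses, so `haar_dwt2`, `multilevel_dwt2`, and the sub-band naming are illustrative only.

```python
def haar_dwt2(img):
    """One level of an orthonormal 2D Haar transform on an even-sized grid.

    Returns (LL, LH, HL, HH); each sub-band has half the height and width.
    """
    ll, lh, hl, hh = [], [], [], []
    for i in range(0, len(img), 2):
        r_ll, r_lh, r_hl, r_hh = [], [], [], []
        for j in range(0, len(img[0]), 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            r_ll.append((a + b + c + d) / 2)  # local average (low frequency)
            r_lh.append((a + b - c - d) / 2)  # difference between the two rows
            r_hl.append((a - b + c - d) / 2)  # difference between the two columns
            r_hh.append((a - b - c + d) / 2)  # diagonal difference
        ll.append(r_ll)
        lh.append(r_lh)
        hl.append(r_hl)
        hh.append(r_hh)
    return ll, lh, hl, hh


def multilevel_dwt2(img, levels):
    """Repeatedly decompose the LL band.

    Returns [LL_coarsest, details_coarsest, ..., details_finest], where each
    details entry is an (LH, HL, HH) tuple. Side lengths must be divisible
    by 2 ** levels.
    """
    details = []
    ll = img
    for _ in range(levels):
        ll, lh, hl, hh = haar_dwt2(ll)
        details.append((lh, hl, hh))
    return [ll] + details[::-1]


image = [[1.0, 2.0, 3.0, 4.0],
         [5.0, 6.0, 7.0, 8.0],
         [9.0, 10.0, 11.0, 12.0],
         [13.0, 14.0, 15.0, 16.0]]
spectrum = multilevel_dwt2(image, levels=2)
```

The coarse-to-fine ordering of the returned list mirrors the pyramid structure that a tokenizer could walk over, although how DTWSR actually builds its tokens is not described in the abstract.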

[205] A Topology-Aware Graph Convolutional Network for Human Pose Similarity and Action Quality Assessment

Minmin Zeng

Main category: cs.CV

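TL;DR: Proposes GCN-PSN, a topology-aware graph convolutional network framework for action quality assessment (AQA). It models the human skeleton as a graph to learn discriminative, topology-sensitive pose embeddings and performs well on the AQA-7 and FineDiving benchmarks.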

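Details Motivation: Action quality assessment requires fine-grained understanding of human motion and precise evaluation of pose similarity, and existing methods make insufficient use of skeletal topology. Method: A topology-aware graph convolutional network (GCN-PSN) models the human skeleton as a graph and is trained in a Siamese architecture with a contrastive regression objective to learn pose embeddings. Result: The method outperforms coordinate-based baselines on the AQA-7 and FineDiving datasets and achieves competitive performance; ablation studies confirm the value of skeletal topology. Conclusion: Exploiting skeletal topology improves the accuracy of pose similarity computation and action quality assessment, and the proposed GCN-PSN framework performs well on both tasks. Abstract: Action Quality Assessment (AQA) requires fine-grained understanding of human motion and precise evaluation of pose similarity. This paper proposes a topology-aware Graph Convolutional Network (GCN) framework, termed GCN-PSN, which models the human skeleton as a graph to learn discriminative, topology-sensitive pose embeddings. Using a Siamese architecture trained with a contrastive regression objective, our method outperforms coordinate-based baselines and achieves competitive performance on AQA-7 and FineDiving benchmarks. Experimental results and ablation studies validate the effectiveness of leveraging skeletal topology for pose similarity and action quality assessment.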
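As a rough illustration of what "modeling the skeleton as a graph" means, the sketch below applies one generic graph-convolution layer, ReLU(Â X W) with the symmetric normalization Â = D^-1/2 (A + I) D^-1/2, to a toy three-joint chain. This is the standard GCN layer, not necessarily the layer used in GCN-PSN; the joint layout, feature choice, and function names are assumptions.

```python
import math


def normalized_adjacency(num_joints, bones):
    """Return D^-1/2 (A + I) D^-1/2 for an undirected skeleton graph."""
    a = [[1.0 if i == j else 0.0 for j in range(num_joints)]
         for i in range(num_joints)]
    for i, j in bones:
        a[i][j] = a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_joints)]
            for i in range(num_joints)]


def graph_conv(features, adj, weight):
    """One layer ReLU(adj @ features @ weight) on nested lists.

    features: joints x in_dim, weight: in_dim x out_dim.
    """
    n, in_dim, out_dim = len(features), len(weight), len(weight[0])
    # Each joint first averages information from itself and its bone neighbors.
    mixed = [[sum(adj[i][k] * features[k][c] for k in range(n))
              for c in range(in_dim)] for i in range(n)]
    # Then a shared linear map and ReLU are applied per joint.
    return [[max(0.0, sum(mixed[i][c] * weight[c][o] for c in range(in_dim)))
             for o in range(out_dim)] for i in range(n)]


# Hypothetical 3-joint chain: shoulder (0) - elbow (1) - wrist (2).
bones = [(0, 1), (1, 2)]
adj = normalized_adjacency(3, bones)
pose = [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]  # 2D joint coordinates as features
weight = [[1.0, 0.0], [0.0, 1.0]]            # identity, so only neighbor mixing shows
embedding = graph_conv(pose, adj, weight)
```

Because the shoulder and wrist are not connected by a bone, `adj[0][2]` is zero and their features only interact through the elbow, which is the sense in which such an embedding is topology-sensitive.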

[206] MoSa: Motion Generation with Scalable Autoregressive Modeling

Mengyuan Liu,Sheng Yan,Yong Wang,Yingjie Li,Gui-Bin Bian,Hong Liu

Main category: cs.CV

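TL;DR: MoSa is a hierarchical framework for text-driven 3D human motion generation that reaches state-of-the-art generation quality and efficiency through a multi-scale token preservation strategy and scalable autoregressive modeling.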

Details Motivation: Existing methods for generating high-quality 3D human motion need many inference steps, lose detail, and suffer from degraded reconstruction quality, so a more efficient and finer-grained generation mechanism is needed. Method: MoSa combines a hierarchical residual vector quantization variational autoencoder (RQ-VAE) with a Multi-scale Token Preservation Strategy (MTPS) to enable coarse-to-fine scalable autoregressive generation, and introduces CAQ-VAE, which fuses convolution and attention to better model global dependencies. Result: On the Motion-X dataset MoSa reaches an FID of 0.06 (versus 0.20 for MoMask) while cutting inference time by 27%. It needs only 10 inference steps and generalizes to downstream tasks such as motion editing without fine-tuning. Conclusion: Through its hierarchical, multi-scale generation strategy, MoSa significantly improves the quality and efficiency of text-to-3D-motion generation and shows strong generalization and practical potential. Abstract: We introduce MoSa, a novel hierarchical motion generation framework for text-driven 3D human motion generation that enhances the Vector Quantization-guided Generative Transformers (VQ-GT) paradigm through a coarse-to-fine scalable generation process. In MoSa, we propose a Multi-scale Token Preservation Strategy (MTPS) integrated into a hierarchical residual vector quantization variational autoencoder (RQ-VAE). MTPS employs interpolation at each hierarchical quantization to effectively retain coarse-to-fine multi-scale tokens. With this, the generative transformer supports Scalable Autoregressive (SAR) modeling, which predicts scale tokens, unlike traditional methods that predict only one token at each step. Consequently, MoSa requires only 10 inference steps, matching the number of RQ-VAE quantization layers. To address potential reconstruction degradation from frequent interpolation, we propose CAQ-VAE, a lightweight yet expressive convolution-attention hybrid VQ-VAE. CAQ-VAE enhances residual block design and incorporates attention mechanisms to better capture global dependencies. Extensive experiments show that MoSa achieves state-of-the-art generation quality and efficiency, outperforming prior methods in both fidelity and speed. On the Motion-X dataset, MoSa achieves an FID of 0.06 (versus MoMask's 0.20) while reducing inference time by 27 percent. Moreover, MoSa generalizes well to downstream tasks such as motion editing, requiring no additional fine-tuning. The code is available at https://mosa-web.github.io/MoSa-web
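The residual quantization idea behind an RQ-VAE can be sketched in a few lines: each layer quantizes whatever the previous layers failed to represent, so the reconstruction sharpens layer by layer. The sketch below is plain residual vector quantization on a single vector with hand-written toy codebooks; it deliberately omits MoSa's interpolation-based MTPS and the learned encoder and decoder, and all names and values are illustrative assumptions.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda k: sum((c - v) ** 2 for c, v in zip(codebook[k], vec)))


def residual_quantize(vec, codebooks):
    """Quantize vec with one codebook per layer.

    Each layer encodes the residual left over by the previous layers.
    Returns (indices, reconstruction).
    """
    residual = list(vec)
    recon = [0.0] * len(vec)
    indices = []
    for codebook in codebooks:
        k = nearest(codebook, residual)
        indices.append(k)
        recon = [r + c for r, c in zip(recon, codebook[k])]
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return indices, recon


# Toy codebooks: a coarse layer followed by a finer correction layer.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, -0.25]],
]
indices, recon = residual_quantize([1.2, 0.8], codebooks)
```

With these toy values the first layer picks the coarse code `[1.0, 1.0]` and the second adds the correction `[0.25, -0.25]`, which is the coarse-to-fine behavior the abstract relies on when it ties the number of inference steps to the number of quantization layers.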

[207] OmniVLA: Unifiying Multi-Sensor Perception for Physically-Grounded Multimodal VLA

Heyu Guo,Shanmu Wang,Ruichun Ma,Shiqi Jiang,Yasaman Ghasempour,Omid Abari,Baining Guo,Lili Qi

Main category: cs.CV

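TL;DR: OmniVLA is a multimodal vision-language-action model that extends perception beyond RGB images by adding new sensing modalities: an infrared camera, a mmWave radar, and a microphone array. Its core is the sensor-masked image, which fuses physically grounded spatial information into the RGB image for unified and efficient multimodal learning.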

Details Motivation: Existing vision-language-action models rely mainly on RGB images, which limits their perception and makes complex real-world manipulation tasks hard to handle; additional physical sensing modalities are needed to improve spatial understanding and manipulation. Method: The sensor-masked image serves as a unified representation: spatially aligned, physically meaningful masks derived from infrared, mmWave radar, and microphone-array sensors are overlaid on the RGB image. A multisensory VLA architecture is built on an RGB-pretrained VLA backbone, with lightweight per-sensor projectors for data-efficient learning. Result: On real-world tasks that require multimodal perception, OmniVLA reaches an average task success rate of 84%, improving on the RGB-only and raw-sensor-input baselines by 59% and 28% respectively, while also learning more efficiently and generalizing better. Conclusion: By fusing multiple sensing modalities, OmniVLA significantly improves the perception and manipulation performance of vision-language-action models in real environments and shows the importance of multimodal physical sensing for an agent's spatial understanding. Abstract: Vision-language-action (VLA) models have shown strong generalization for action prediction through large-scale vision-language pretraining. However, most existing models rely solely on RGB cameras, limiting their perception and, consequently, manipulation capabilities. We present OmniVLA, an omni-modality VLA model that integrates novel sensing modalities for physically-grounded spatial intelligence beyond RGB perception. The core of our approach is the sensor-masked image, a unified representation that overlays spatially grounded and physically meaningful masks onto the RGB images, derived from sensors including an infrared camera, a mmWave radar, and a microphone array. This image-native unification keeps sensor input close to RGB statistics to facilitate training, provides a uniform interface across sensor hardware, and enables data-efficient learning with lightweight per-sensor projectors. Built on this, we present a multisensory vision-language-action model architecture and train the model based on an RGB-pretrained VLA backbone. We evaluate OmniVLA on challenging real-world tasks where sensor-modality perception is needed to guide the manipulation. OmniVLA achieves an average task success rate of 84%, significantly outperforming both RGB-only and raw-sensor-input baseline models by 59% and 28% respectively, while also showing higher learning efficiency and stronger generalization capability.
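One simple way to picture a sensor-masked image is as an RGB frame with a sensor-derived binary mask blended in, so the result stays image-like. The sketch below uses alpha blending with a fixed highlight color; the abstract does not specify how OmniVLA renders its masks, so the blending rule, the `overlay_mask` name, and the example values are assumptions.

```python
def overlay_mask(rgb, mask, color, alpha=0.5):
    """Blend `color` into `rgb` wherever `mask` is truthy.

    rgb:  H x W nested list of (r, g, b) integer tuples in 0..255.
    mask: H x W nested list of 0/1 values, already aligned to the image.
    """
    out = []
    for rgb_row, mask_row in zip(rgb, mask):
        row = []
        for pixel, m in zip(rgb_row, mask_row):
            if m:
                # Keep part of the original pixel so RGB context is preserved.
                pixel = tuple(round((1 - alpha) * p + alpha * c)
                              for p, c in zip(pixel, color))
            row.append(pixel)
        out.append(row)
    return out


# Hypothetical 1 x 2 frame where a thermal sensor flags only the first pixel.
frame = [[(100, 100, 100), (0, 0, 0)]]
thermal_mask = [[1, 0]]
masked = overlay_mask(frame, thermal_mask, color=(200, 0, 0), alpha=0.5)
```

Pixels outside the mask are passed through untouched, which is consistent with the abstract's point that the fused input stays close to RGB statistics.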

[208] Thought-For-Food: Reasoning Chain Induced Food Visual Question Answering

Riddhi Jain,Manasi Patwardhan,Parijat Deshpande,Venkataramana Runkana

Main category: cs.CV

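TL;DR: Proposes a multi-step reasoning approach based on reasoning chains to improve visual question answering (VQA) for Indian food. Fine-tuning smaller language and vision-language models and then training them with reinforcement learning yields an average accuracy gain of 10 percentage points over the baseline.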

Details Motivation: Existing VQA systems are biased toward Western food and poorly support India's diverse food culture, and the existing Indian food VQA dataset follows a two-step approach that lacks deep understanding of complex culinary context. Method: Auto-validated reasoning chains are constructed and incorporated into the question-answering process; smaller LLMs and VLMs are fine-tuned on them and then further trained with reinforcement learning to perform multi-step reasoning. Result: On the Indian food VQA task, adding reasoning chains improves accuracy by an average of 10 percentage points. Conclusion: Multi-step reasoning with reasoning chains effectively improves Indian food VQA accuracy and offers a new approach to visual question answering in culturally complex food settings. Abstract: The immense diversity in the culture and culinary traditions of Indian cuisines calls attention to the major shortcoming of the existing Visual Question Answering (VQA) systems, which are inclined towards foods from the Western region. A recent attempt at building a VQA dataset for Indian food is a step towards addressing this challenge. However, their approach to VQA follows a two-step process in which the answer is generated first, followed by the explanation of the expected answer. In this work, we claim that food VQA requires a multi-step reasoning process to arrive at an accurate answer, especially in the context of Indian food, which involves understanding complex culinary context and identifying relationships between various food items. With this hypothesis we create reasoning chains upon the QA with minimal human intervention. We fine-tune smaller LLMs and VLMs with auto-validated reasoning chains and further train them using reinforcement learning with larger data. With augmentation of reasoning chains, we observed an accuracy improvement of 10 percentage points on average over the baseline. We provide detailed analysis of the effect of adding reasoning chains for the Indian Food VQA task. Index Terms - FoodVQA, Reasoning Chains, Reinforcement Learning, Knowledge Graph.

[209] Saliency-Guided Domain Adaptation for Left-Hand Driving in Autonomous Steering

Zahra Mehraban,Sebastien Glaser,Michael Milford,Ronald Schroeter

Main category: cs.CV

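TL;DR: Studies four training methods for adapting the PilotNet model to Australian left-hand driving conditions and finds that pretraining on flipped data followed by fine-tuning significantly improves adaptation.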

Details Motivation: Automated driving models must adapt to different road conditions; this paper explores how to achieve effective domain adaptation from right-hand to left-hand driving environments. Method: Four training methods are evaluated: a baseline trained on U.S. data, a model trained on flipped U.S. data, a model pretrained on U.S. data and fine-tuned on Australian data, and a model pretrained on flipped U.S. data and then fine-tuned. Saliency analysis is used to compare steering prediction and attention distribution across the models. Result: Pretraining on flipped data alone reduces prediction stability, but when followed by fine-tuning it significantly lowers prediction error and strengthens the model's focus on left-side road cues; the result holds for both PilotNet and ResNet. Conclusion: Flipped-data pretraining followed by fine-tuning is an effective domain adaptation strategy that improves performance in the new environment with minimal retraining. Abstract: Domain adaptation is required for automated driving models to generalize well across diverse road conditions. This paper explores a training method for domain adaptation to adapt PilotNet, an end-to-end deep learning-based model, for left-hand driving conditions using real-world Australian highway data. Four training methods were evaluated: (1) a baseline model trained on U.S. right-hand driving data, (2) a model trained on flipped U.S. data, (3) a model pretrained on U.S. data and then fine-tuned on Australian highways, and (4) a model pretrained on flipped U.S. data and then fine-tuned on Australian highways. This setup examines whether incorporating flipped data enhances the model adaptation by providing an initial left-hand driving alignment. The paper compares model performance regarding steering prediction accuracy and attention, using saliency-based analysis to measure attention shifts across significant road regions. Results show that pretraining on flipped data alone worsens prediction stability due to misaligned feature representations, but significantly improves adaptation when followed by fine-tuning, leading to lower prediction error and stronger focus on left-side cues. To validate this approach across different architectures, the same experiments were done on ResNet, which confirmed similar adaptation trends. These findings emphasize the importance of preprocessing techniques, such as flipped-data pretraining, followed by fine-tuning to improve model adaptation with minimal retraining requirements.
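The "flipped U.S. data" in methods (2) and (4) can be pictured as mirroring each frame left to right. The sketch below also negates the steering label, which is the usual convention for this augmentation when steering is signed around zero; the abstract does not state how labels were handled, so that step and the `flip_sample` name are assumptions.

```python
def flip_sample(image, steering_angle):
    """Mirror a frame horizontally and negate its steering label.

    image: H x W nested list of pixel values.
    steering_angle: signed angle where 0 means straight ahead (assumed).
    """
    flipped = [row[::-1] for row in image]
    return flipped, -steering_angle


# A right-hand-traffic frame with a slight right turn becomes a
# left-hand-traffic-looking frame with a slight left turn.
frame = [[1, 2, 3],
         [4, 5, 6]]
flipped_frame, flipped_angle = flip_sample(frame, 0.1)
```

Flipping twice returns the original sample, which makes the transform easy to sanity-check before pretraining on the mirrored set.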

[210] Gesture Generation (Still) Needs Improved Human Evaluation Practices: Insights from a Community-Driven State-of-the-Art Benchmark

Rajmund Nagy,Hendric Voss,Thanh Hoang-Minh,Mihail Tsakov,Teodor Nikolov,Zeyi Zhang,Tenglong Ao,Sicheng Yang,Shaoli Huang,Yongkang Cheng,M. Hamza Mughal,Rishabh Dabral,Kiran Chhatre,Christian Theobalt,Libin Liu,Stefan Kopp,Rachel McDonnell,Michael Neff,Taras Kucherenko,Youngwoo Yoon,Gustav Eje Henter

Main category: cs.CV

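TL;DR: Reviews human evaluation practices in speech-driven 3D gesture generation, finds a lack of standardisation and common experimental design flaws, and proposes a detailed human evaluation protocol for the BEAT2 dataset. A large-scale crowdsourced evaluation of six recent models shows that newer models do not consistently outperform earlier ones and that published claims of high performance may not hold under rigorous evaluation, underlining the need to assess motion quality and multimodal alignment separately.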

Details Motivation: Human evaluation in speech-driven 3D gesture generation lacks standardisation and often uses flawed experimental designs, so methods cannot be compared accurately and the state of the art cannot be determined. Method: A detailed human evaluation protocol is proposed for the BEAT2 motion-capture dataset and used in a large-scale crowdsourced study that evaluates six recent gesture-generation models, each trained by its original authors, on two dimensions: motion realism and speech-gesture alignment. Result: 1) Newer models do not consistently outperform earlier approaches; 2) published claims of high realism or alignment may not hold under rigorous evaluation; 3) accurate benchmarking requires disentangled assessment of motion quality and multimodal alignment. Conclusion: To drive standardisation and support new evaluation research, the authors will release five hours of synthetic motion, more than 750 rendered videos, an open-source rendering script, and 16,000 pairwise human preference votes, enabling new evaluations without reimplementing models. Abstract: We review human evaluation practices in automated, speech-driven 3D gesture generation and find a lack of standardisation and frequent use of flawed experimental setups. This leads to a situation where it is impossible to know how different methods compare, or what the state of the art is. In order to address common shortcomings of evaluation design, and to standardise future user studies in gesture-generation works, we introduce a detailed human evaluation protocol for the widely-used BEAT2 motion-capture dataset. Using this protocol, we conduct large-scale crowdsourced evaluation to rank six recent gesture-generation models -- each trained by its original authors -- across two key evaluation dimensions: motion realism and speech-gesture alignment. Our results provide strong evidence that 1) newer models do not consistently outperform earlier approaches; 2) published claims of high motion realism or speech-gesture alignment may not hold up under rigorous evaluation; and 3) the field must adopt disentangled assessments of motion quality and multimodal alignment for accurate benchmarking in order to make progress. Finally, in order to drive standardisation and enable new evaluation research, we will release five hours of synthetic motion from the benchmarked models; over 750 rendered video stimuli from the user studies -- enabling new evaluations without model reimplementation required -- alongside our open-source rendering script, and the 16,000 pairwise human preference votes collected for our benchmark.

[211] Eyes on Target: Gaze-Aware Object Detection in Egocentric Video

Vishakha Lall,Yisi Liu

Main category: cs.CV

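TL;DR: Proposes Eyes on Target, a depth-aware, gaze-guided object detection framework built on a Vision Transformer, which injects gaze features into the attention mechanism to improve detection in the regions a person is looking at.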

Details Motivation: Human gaze provides a rich supervisory signal for understanding visual attention in complex visual environments, and it can address a weakness of traditional object detectors, which treat all regions equally. Method: Gaze-derived features are injected into the attention mechanism of a Vision Transformer, biasing spatial feature selection toward human-attended regions; a gaze-aware attention head importance metric is introduced to interpret model behaviour. Result: On a custom egocentric simulator dataset and on public benchmarks including Ego4D Ego-Motion and Ego-CH-Gaze, the method shows consistent gains in detection accuracy over gaze-agnostic baselines. Conclusion: The method uses gaze information effectively to strengthen object detection, is particularly suited to simulation scenarios in which human performance is assessed, and shows the potential of gaze-guided modeling for egocentric vision tasks. Abstract: Human gaze offers rich supervisory signals for understanding visual attention in complex visual environments. In this paper, we propose Eyes on Target, a novel depth-aware and gaze-guided object detection framework designed for egocentric videos. Our approach injects gaze-derived features into the attention mechanism of a Vision Transformer (ViT), effectively biasing spatial feature selection toward human-attended regions. Unlike traditional object detectors that treat all regions equally, our method emphasises viewer-prioritised areas to enhance object detection. We validate our method on an egocentric simulator dataset where human visual attention is critical for task assessment, illustrating its potential in evaluating human performance in simulation scenarios. We evaluate the effectiveness of our gaze-integrated model through extensive experiments and ablation studies, demonstrating consistent gains in detection accuracy over gaze-agnostic baselines on both the custom simulator dataset and public benchmarks, including Ego4D Ego-Motion and Ego-CH-Gaze datasets. To interpret model behaviour, we also introduce a gaze-aware attention head importance metric, revealing how gaze cues modulate transformer attention dynamics.

[212] Beyond Deceptive Flatness: Dual-Order Solution for Strengthening Adversarial Transferability

Zhixuan Zhang,Pingyu Wang,Xingjian Zheng,Linbo Qing,Qi Liu

Main category: cs.CV

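TL;DR: Proposes a new black-box, gradient-based transferable attack built on dual-order information, introducing Adversarial Flatness (AF) and Monte Carlo Adversarial Sampling (MCAS) to improve the cross-model transferability of adversarial examples.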

Details Motivation: Existing transferable attacks tend to fall into flat-yet-sharp local regions (deceptive flatness), which limits transferability, so a more effective optimization objective is needed to raise black-box attack success rates. Method: Working from dual-order information, the authors propose Adversarial Flatness (AF) to address deceptive flatness and instantiate it as the Adversarial Flatness Attack (AFA) through an efficient approximation of the objective; Monte Carlo Adversarial Sampling (MCAS) further improves inner-loop sampling efficiency. Result: On an ImageNet-compatible dataset the method outperforms six baselines, generates adversarial examples in flatter regions, transfers better across model architectures, and also outperforms baselines under input transformations and against the Baidu Cloud API. Conclusion: AFA and MCAS effectively improve the transferability of black-box adversarial attacks and offer a new theoretical perspective and practical recipe for building highly transferable adversarial examples. Abstract: Transferable attacks generate adversarial examples on surrogate models to fool unknown victim models, posing real-world threats and attracting growing research interest. Despite focusing on flat losses for transferable adversarial examples, recent studies still fall into suboptimal regions, especially the flat-yet-sharp areas, termed as deceptive flatness. In this paper, we introduce a novel black-box gradient-based transferable attack from a perspective of dual-order information. Specifically, we propose Adversarial Flatness (AF) to address the deceptive flatness problem, together with a theoretical assurance for adversarial transferability. Based on this, using an efficient approximation of our objective, we instantiate our attack as Adversarial Flatness Attack (AFA), addressing the altered gradient sign issue. Additionally, to further improve the attack ability, we devise Monte Carlo Adversarial Sampling (MCAS) by enhancing the inner-loop sampling efficiency. The comprehensive results on an ImageNet-compatible dataset demonstrate superiority over six baselines, generating adversarial examples in flatter regions and boosting transferability across model architectures. When tested on input transformation attacks or the Baidu Cloud API, our method outperforms baselines.

[213] CenterMamba-SAM: Center-Prioritized Scanning and Temporal Prototypes for Brain Lesion Segmentation

Yu Tian,Zhongheng Yang,Chenshi Liu,Yiyun Su,Ziwei Hong,Zexi Gong,Jingyuan Xu

Main category: cs.CV

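TL;DR: Proposes CenterMamba-SAM, an end-to-end framework for brain lesion segmentation that freezes a pretrained backbone and trains only lightweight adapters for efficient fine-tuning. Combined with a novel scanning strategy and memory-augmented mechanisms, it reaches state-of-the-art performance on public benchmarks.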

Details Motivation: Brain lesion segmentation is challenging because lesions are small and low-contrast, sampling is anisotropic, and slices are discontinuous; existing methods struggle to capture weak boundaries and to stay consistent across slices. Method: The CenterMamba-SAM framework uses a CenterMamba encoder with a 3x3 corner-axis-center short-sequence scanning strategy for center-prioritized, axis-reinforced, and diagonally compensated information aggregation; a memory-driven structural prompt generator that maintains a prototype bank over neighboring slices to generate reliable prompts automatically; and a multi-scale decoder with memory attention modules for deep supervision and progressive refinement. Result: Extensive experiments on several public brain lesion segmentation datasets show better segmentation accuracy and cross-slice consistency than existing methods, especially for tiny, low-contrast lesions. Conclusion: With its novel scanning strategy and memory-augmented mechanisms, CenterMamba-SAM improves brain lesion segmentation, particularly for small and indistinct lesions, and shows good application prospects. Abstract: Brain lesion segmentation remains challenging due to small, low-contrast lesions, anisotropic sampling, and cross-slice discontinuities. We propose CenterMamba-SAM, an end-to-end framework that freezes a pretrained backbone and trains only lightweight adapters for efficient fine-tuning. At its core is the CenterMamba encoder, which employs a novel 3x3 corner-axis-center short-sequence scanning strategy to enable center-prioritized, axis-reinforced, and diagonally compensated information aggregation. This design enhances sensitivity to weak boundaries and tiny foci while maintaining sparse yet effective feature representation. A memory-driven structural prompt generator maintains a prototype bank across neighboring slices, enabling automatic synthesis of reliable prompts without user interaction, thereby improving inter-slice coherence. The memory-augmented multi-scale decoder integrates memory attention modules at multiple levels, combining deep supervision with progressive refinement to restore fine details while preserving global consistency. Extensive experiments on public benchmarks demonstrate that CenterMamba-SAM achieves state-of-the-art performance in brain lesion segmentation.

[214] Source-Only Cross-Weather LiDAR via Geometry-Aware Point Drop

YoungJae Cheong,Jhonghyun An

Main category: cs.CV

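TL;DR: Proposes a lightweight geometry-aware adapter that uses horizontal circular padding and local-window K-nearest-neighbor statistics during training to make LiDAR semantic segmentation more robust in adverse weather, significantly improving cross-weather segmentation performance.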

Details Motivation: LiDAR semantic segmentation degrades severely in adverse weather, and existing methods overlook the geometry of structurally fragile regions such as boundaries, corners, and sparse areas. Method: A geometry-aware adapter aligns azimuth and applies horizontal circular padding to keep neighbors continuous across the 0~360 degree boundary, gathers local geometric statistics with a local-window KNN, and compresses them into compact geometry-aware cues that drive region-aware regularization during training. Result: In a source-only cross-weather setting (trained on SemanticKITTI, tested on SemanticSTF), the adapter improves mIoU by 7.9 percentage points over the data-augmentation baseline and by 0.6 points over the class-centric regularization baseline. Conclusion: Geometry-driven regularization is a key direction for all-weather LiDAR semantic segmentation; the adapter is plug and play, used only during training, and adds negligible inference cost. Abstract: LiDAR semantic segmentation degrades in adverse weather because refraction, scattering, and point dropouts corrupt geometry. Prior work in weather simulation, mixing-based augmentation, domain randomization, and uncertainty or boundary regularization improves robustness but still overlooks structural vulnerabilities near boundaries, corners, and sparse regions. We present a Light Geometry-aware adapter. The module aligns azimuth and applies horizontal circular padding to preserve neighbor continuity across the 0~360 degree wrap-around boundary. A local-window K-Nearest Neighbors (KNN) search gathers nearby points and computes simple local statistics, which are compressed into compact geometry-aware cues. During training, these cues drive region-aware regularization that stabilizes predictions in structurally fragile areas. The adapter is plug and play, complements augmentation, and can be enabled only during training with negligible inference cost. We adopt a source-only cross-weather setup where models train on SemanticKITTI and are evaluated on SemanticSTF without target labels or fine-tuning. The adapter improves mIoU by 7.9 percentage points over the data-centric augmentation baseline and by 0.6 points over the class-centric regularization baseline. These results indicate that geometry-driven regularization is a key direction for all-weather LiDAR segmentation.
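Horizontal circular padding is easy to show on a range image whose columns are azimuth bins: the first and last columns are physical neighbors, so padding wraps around instead of filling with zeros. The sketch below assumes a range-image layout with one row per beam; the abstract does not describe the adapter's data layout, so the representation and the `circular_pad_columns` name are assumptions.

```python
def circular_pad_columns(range_image, pad):
    """Pad each row with `pad` columns wrapped from the opposite side.

    range_image: rows x azimuth_bins nested list, with azimuth covering the
    full 360 degrees so that column 0 is adjacent to the last column.
    """
    if pad <= 0:
        return [list(row) for row in range_image]
    if pad > len(range_image[0]):
        raise ValueError("pad must not exceed the number of azimuth bins")
    return [list(row[-pad:]) + list(row) + list(row[:pad])
            for row in range_image]


# One beam with four azimuth bins; after padding, a sliding window centered on
# bin 0 also sees bin 3, its true neighbor across the wrap-around boundary.
scan = [[0, 1, 2, 3]]
padded = circular_pad_columns(scan, pad=1)
```

With this padding a fixed-size local window (for example the one used to collect KNN statistics) behaves the same at the seam as anywhere else in the scan.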

[215] MotionStream: Real-Time Video Generation with Interactive Motion Controls

Joonghyuk Shin,Zhengqi Li,Richard Zhang,Jun-Yan Zhu,Jaesik Park,Eli Schechtman,Xun Huang

Main category: cs.CV

TL;DR: Proposes MotionStream, which generates video with sub-second latency on a single GPU, streams at up to 29 FPS, and supports real-time interaction.

Details Motivation: Existing motion-conditioned video generation methods suffer from high latency and non-causal processing, which prevents real-time interaction. Method: Motion control is added to a text-to-video model, and the resulting bidirectional teacher is distilled into a causal student through Self Forcing with Distribution Matching Distillation; sliding-window causal attention combined with attention sinks gives constant-speed generation with a fixed context window. Result: The models achieve state-of-the-art motion following and video quality while running two orders of magnitude faster than existing methods, and they support real-time streaming of arbitrarily long videos. Conclusion: MotionStream delivers low-latency, high-quality, scalable real-time video generation and a truly interactive experience, suited to applications such as trajectory painting, camera control, and motion transfer. Abstract: Current motion-conditioned video generation methods suffer from prohibitive latency (minutes per video) and non-causal processing that prevents real-time interaction. We present MotionStream, enabling sub-second latency with up to 29 FPS streaming generation on a single GPU. Our approach begins by augmenting a text-to-video model with motion control, which generates high-quality videos that adhere to the global text prompt and local motion guidance, but does not perform inference on the fly. As such, we distill this bidirectional teacher into a causal student through Self Forcing with Distribution Matching Distillation, enabling real-time streaming inference. Several key challenges arise when generating videos of long, potentially infinite time-horizons: (1) bridging the domain gap from training on finite length and extrapolating to infinite horizons, (2) sustaining high quality by preventing error accumulation, and (3) maintaining fast inference, without incurring growth in computational cost due to increasing context windows. A key to our approach is introducing carefully designed sliding-window causal attention, combined with attention sinks. By incorporating self-rollout with attention sinks and KV cache rolling during training, we properly simulate inference-time extrapolations with a fixed context window, enabling constant-speed generation of arbitrarily long videos. Our models achieve state-of-the-art results in motion following and video quality while being two orders of magnitude faster, uniquely enabling infinite-length streaming.
With MotionStream, users can paint trajectories, control cameras, or transfer motion, and see results unfold in real-time, delivering a truly interactive experience.
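The combination of sliding-window causal attention and attention sinks can be written down as a boolean mask: a query may look at itself, at a fixed number of recent positions, and at a few always-visible initial positions. The sketch below works at the level of individual positions with hypothetical parameter names; MotionStream's actual window size, sink count, and chunking are not given in the abstract.

```python
def attention_mask(num_tokens, window, num_sinks):
    """Build mask[q][k], True when query position q may attend to key position k.

    A key is visible if it is not in the future (causal) and is either within
    the last `window` positions or one of the first `num_sinks` positions.
    """
    mask = []
    for q in range(num_tokens):
        row = []
        for k in range(num_tokens):
            causal = k <= q
            in_window = q - k < window
            is_sink = k < num_sinks
            row.append(causal and (in_window or is_sink))
        mask.append(row)
    return mask


mask = attention_mask(num_tokens=6, window=2, num_sinks=1)
# Number of visible keys per query never exceeds window + num_sinks.
visible_per_query = [sum(row) for row in mask]
```

Because each query sees at most `window + num_sinks` keys regardless of how long the sequence grows, the per-step cost stays constant, which is the property the abstract relies on for arbitrarily long streaming.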

[216] PRevivor: Reviving Ancient Chinese Paintings using Prior-Guided Color Transformers

Tan Tang,Yanhong Wu,Junming Gao,Yingcai Wu

Main category: cs.CV

TL;DR: Proposes PRevivor, a prior-guided color restoration transformer for reviving the faded colors of ancient Chinese paintings.

Details Motivation: Ancient Chinese paintings suffer color degradation driven by complex chemical mechanisms, which makes restoration difficult, and the lack of high-quality datasets hinders the development of end-to-end digital restoration tools. Method: Color restoration is decomposed into two sub-tasks, luminance enhancement and hue correction. Luminance enhancement uses two variational U-Nets and a multi-scale mapping module; hue correction uses a dual-branch color query module guided by localized hue priors, balancing local and global reasoning. Result: Experiments against several state-of-the-art colorization methods show that PRevivor performs better in both quantitative and qualitative evaluation. Conclusion: PRevivor effectively restores the colors of ancient paintings and offers a practical approach to the digital restoration of cultural heritage. Abstract: Ancient Chinese paintings are a valuable cultural heritage that is damaged by irreversible color degradation. Reviving color-degraded paintings is extraordinarily difficult due to the complex chemical mechanisms involved. Progress is further slowed by the lack of comprehensive, high-quality datasets, which hampers the creation of end-to-end digital restoration tools. To revive colors, we propose PRevivor, a prior-guided color transformer that learns from recent paintings (e.g., Ming and Qing Dynasty) to restore ancient ones (e.g., Tang and Song Dynasty). To develop PRevivor, we decompose color restoration into two sequential sub-tasks: luminance enhancement and hue correction. For luminance enhancement, we employ two variational U-Nets and a multi-scale mapping module to translate faded luminance into restored counterparts. For hue correction, we design a dual-branch color query module guided by localized hue priors extracted from faded paintings. Specifically, one branch focuses attention on regions guided by masked priors, enforcing localized hue correction, whereas the other branch remains unconstrained to maintain a global reasoning capability. To evaluate PRevivor, we conduct extensive experiments against state-of-the-art colorization methods. The results demonstrate superior performance both quantitatively and qualitatively.

[217] Adaptation of Foundation Models for Medical Image Analysis: Strategies, Challenges, and Future Directions

Karma Phuntsho,Abdullah,Kyungmi Lee,Ickjai Lee,Euijoon Ahn

Main category: cs.CV

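TL;DR: Reviews strategies for adapting foundation models (FMs) to medical image analysis, covering supervised fine-tuning, domain-specific pretraining, parameter-efficient fine-tuning, and self-supervised learning, and discusses emerging directions such as continual learning, federated learning, and data synthesis, with the aim of advancing trustworthy, clinically usable FMs.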

Details Motivation: Foundation models hold great promise for medical imaging, but their use in real clinical settings is limited by domain shift, scarce annotated data, high computational cost, and privacy concerns, so adaptation strategies need systematic assessment to support deployment. Method: The review covers a range of FM adaptation methods, including supervised fine-tuning, domain-specific pretraining, parameter-efficient fine-tuning, self-supervised learning, hybrid methods, and multimodal frameworks, analyzes their performance gains, clinical applicability, and limitations, and identifies emerging research directions. Result: It summarizes the strengths, weaknesses, and trade-offs of each adaptation strategy, identifies challenges that earlier reviews often overlooked, and outlines future paths spanning continual learning, privacy preservation, data efficiency, and systematic benchmarking. Conclusion: Building adaptive, trustworthy, and clinically integrated foundation models requires combining existing methods with emerging techniques and establishing standardized evaluation that can cope with the complexity of real-world medical imaging. Abstract: Foundation models (FMs) have emerged as a transformative paradigm in medical image analysis, offering the potential to provide generalizable, task-agnostic solutions across a wide range of clinical tasks and imaging modalities. Their capacity to learn transferable representations from large-scale data has the potential to address the limitations of conventional task-specific models. However, adaptation of FMs to real-world clinical practice remains constrained by key challenges, including domain shifts, limited availability of high-quality annotated data, substantial computational demands, and strict privacy requirements. This review presents a comprehensive assessment of strategies for adapting FMs to the specific demands of medical imaging. We examine approaches such as supervised fine-tuning, domain-specific pretraining, parameter-efficient fine-tuning, self-supervised learning, hybrid methods, and multimodal or cross-modal frameworks. For each, we evaluate reported performance gains, clinical applicability, and limitations, while identifying trade-offs and unresolved challenges that prior reviews have often overlooked. Beyond these established techniques, we also highlight emerging directions aimed at addressing current gaps. These include continual learning to enable dynamic deployment, federated and privacy-preserving approaches to safeguard sensitive data, hybrid self-supervised learning to enhance data efficiency, data-centric pipelines that combine synthetic generation with human-in-the-loop validation, and systematic benchmarking to assess robust generalization under real-world clinical variability.
By outlining these strategies and associated research gaps, this review provides a roadmap for developing adaptive, trustworthy, and clinically integrated FMs capable of meeting the demands of real-world medical imaging.

[218] Detecting Generated Images by Fitting Natural Image Distributions

Yonggang Zhang,Jun Nie,Xinmei Tian,Mingming Gong,Kun Zhang,Bo Han

Main category: cs.CV

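TL;DR: Proposes a detection method based on geometric differences between the data manifolds of natural and generated images, using a self-supervised model and normalizing flows to amplify those differences so that highly realistic generated images can be detected effectively.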

Details Motivation: As generated images become more realistic, the risk of misuse grows; existing detectors depend on large numbers of generated samples and struggle against advanced generative models, so more robust detection is needed. Method: A pair of functions is designed to give consistent outputs on natural images and divergent outputs on generated ones, exploiting the orthogonality of their gradients. An image is judged to be generated if a transformation along its data manifold causes a significant change in the loss of a self-supervised model; normalizing flows push generated images away from the natural image manifold to amplify the detectable difference. Result: Experiments show strong detection performance across a variety of generative models, and the method remains effective against advanced ones. Conclusion: The framework does not need large numbers of generated samples for training; by exploiting geometric properties and manifold transformations it improves detection of generated images and shows good generalization and application potential. Abstract: The increasing realism of generated images has raised significant concerns about their potential misuse, necessitating robust detection methods. Current approaches mainly rely on training binary classifiers, which depend heavily on the quantity and quality of available generated images. In this work, we propose a novel framework that exploits geometric differences between the data manifolds of natural and generated images. To exploit this difference, we employ a pair of functions engineered to yield consistent outputs for natural images but divergent outputs for generated ones, leveraging the property that their gradients reside in mutually orthogonal subspaces. This design enables a simple yet effective detection method: an image is identified as generated if a transformation along its data manifold induces a significant change in the loss value of a self-supervised model pre-trained on natural images. Furthermore, to address diminishing manifold disparities in advanced generative models, we leverage normalizing flows to amplify detectable differences by extruding generated images away from the natural image manifold. Extensive experiments demonstrate the efficacy of this method. Code is available at https://github.com/tmlr-group/ConV.

[219] UniREditBench: A Unified Reasoning-based Image Editing Benchmark

Feng Han,Yibin Wang,Chenglin Li,Zheming Liang,Dianyi Wang,Yang Jiao,Zhipeng Wei,Chao Gong,Cheng Jin,Jingjing Chen,Jiaqi Wang

Main category: cs.CV

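TL;DR: Proposes UniREditBench, a unified benchmark for evaluating reasoning-based image editing. It covers multi-object interactions and complex reasoning tasks in both real-world and game-world scenarios and introduces multimodal dual-reference evaluation to make assessment more reliable.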

Details Motivation: Existing image editing benchmarks focus mainly on single-object attribute transformation, overlook multi-object interactions and game-world scenarios with explicit rules, and rely only on textual references for evaluation, which invites misjudgment in complex reasoning scenarios. A more comprehensive and reliable benchmark is therefore needed to assess generative models systematically across diverse reasoning tasks. Method: The authors build UniREditBench with 2,700 samples covering 8 primary dimensions and 18 sub-dimensions; propose multimodal dual-reference evaluation (text plus ground-truth image); design an automated multi-scenario data synthesis pipeline that produces UniREdit-Data-100K, a large-scale synthetic dataset with high-quality chain-of-thought annotations; and fine-tune Bagel on this dataset to obtain UniREdit-Bagel. Result: UniREdit-Bagel shows substantial improvements in both in-domain and out-of-distribution settings; broad benchmarking of open-source and closed-source models reveals the strengths and limitations of current models across reasoning scenarios. Conclusion: UniREditBench provides a more comprehensive and reliable evaluation standard for reasoning-based image editing and advances multimodal generative models in complex, realistic scenarios. Abstract: Recent advances in multi-modal generative models have driven substantial improvements in image editing. However, current generative models still struggle with handling diverse and complex image editing tasks that require implicit reasoning, underscoring the need for a comprehensive benchmark to systematically assess their performance across various reasoning scenarios. Existing benchmarks primarily focus on single-object attribute transformation in realistic scenarios, which, while effective, encounter two key challenges: (1) they largely overlook multi-object interactions as well as game-world scenarios that involve human-defined rules, which are common in real-life applications; (2) they only rely on textual references to evaluate the generated images, potentially leading to systematic misjudgments, especially in complex reasoning scenarios. To this end, this work proposes UniREditBench, a unified benchmark for reasoning-based image editing evaluation. It comprises 2,700 meticulously curated samples, covering both real- and game-world scenarios across 8 primary dimensions and 18 sub-dimensions. To improve evaluation reliability, we introduce multimodal dual-reference evaluation, providing both textual and ground-truth image references for each sample assessment. Furthermore, we design an automated multi-scenario data synthesis pipeline and construct UniREdit-Data-100K, a large-scale synthetic dataset with high-quality chain-of-thought (CoT) reasoning annotations.
We fine-tune Bagel on this dataset and develop UniREdit-Bagel, demonstrating substantial improvements in both in-domain and out-of-distribution settings. Through thorough benchmarking of both open-source and closed-source image editing models, we reveal their strengths and weaknesses across various aspects.

[220] REASON: Probability map-guided dual-branch fusion framework for gastric content assessment

Nu-Fang Xiao,De-Xing Huang,Le-Tian Wang,Mei-Jiang Gui,Qi Fu,Xiao-Liang Xie,Shi-Qi Liu,Shuangyi Wang,Zeng-Guang Hou,Ying-Wei Wang,Xiao-Hu Zhou

Main category: cs.CV

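TL;DR: Proposes REASON, a two-stage probability map-guided dual-branch fusion framework for ultrasound-based gastric content assessment that significantly improves the accuracy and efficiency of automated preoperative aspiration risk assessment.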

Details Motivation: Traditional methods rely on manual tracing of the gastric antrum and on empirical formulas, which are inefficient and inaccurate and cannot meet the clinical need for fast, accurate gastric content assessment. Method: In stage 1, a segmentation model generates probability maps that suppress artifacts and highlight gastric anatomy; in stage 2, a dual-branch classifier fuses information from two standard views, right lateral decubitus (RLD) and supine (SUP), to improve feature discrimination. Result: Experiments on a self-collected dataset show that the framework significantly outperforms current state-of-the-art methods, with higher accuracy and robustness in gastric content classification. Conclusion: REASON offers a more reliable, efficient, and accurate solution for automated preoperative aspiration risk assessment and shows good promise for clinical use. Abstract: Accurate assessment of gastric content from ultrasound is critical for stratifying aspiration risk at induction of general anesthesia. However, traditional methods rely on manual tracing of gastric antra and empirical formulas, which face significant limitations in both efficiency and accuracy. To address these challenges, a novel two-stage probability map-guided dual-branch fusion framework (REASON) for gastric content assessment is proposed. In stage 1, a segmentation model generates probability maps that suppress artifacts and highlight gastric anatomy. In stage 2, a dual-branch classifier fuses information from two standard views, right lateral decubitus (RLD) and supine (SUP), to improve the discrimination of learned features. Experimental results on a self-collected dataset demonstrate that the proposed framework outperforms current state-of-the-art approaches by a significant margin. This framework shows great promise for automated preoperative aspiration risk assessment, offering a more robust, efficient, and accurate solution for clinical practice.

[221] Positive Semi-definite Latent Factor Grouping-Boosted Cluster-reasoning Instance Disentangled Learning for WSI Representation

Chentao Li,Behzad Bozorgtabar,Yifang Ping,Pan Huang,Jing Qin

Main category: cs.CV

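TL;DR: Proposes an instance disentangled learning framework based on latent factor grouping and cluster reasoning to improve the representation and interpretability of whole-slide pathology images.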

Details Motivation: Existing MIL methods suffer from instance entanglement at the spatial, semantic, and decision levels, which limits their performance and interpretability in whole-slide image analysis. Method: Positive semi-definite latent factor grouping mitigates spatial entanglement; instance counterfactual inference and optimization via cluster-reasoning disentangling mitigates semantic entanglement; a generalized linear weighted decision with instance effect re-weighting addresses decision entanglement. Result: The model outperforms all existing state-of-the-art models on multicentre datasets and attains interpretability consistent with pathologists' judgments. Conclusion: The framework effectively disentangles the multiple entanglements in MIL, improving the performance and transparency of WSI analysis with good clinical interpretability. Abstract: Multiple instance learning (MIL) has been widely used for representing whole-slide pathology images. However, spatial, semantic, and decision entanglements among instances limit its representation and interpretability. To address these challenges, we propose a latent factor grouping-boosted cluster-reasoning instance disentangled learning framework for whole-slide image (WSI) interpretable representation in three phases. First, we introduce a novel positive semi-definite latent factor grouping that maps instances into a latent subspace, effectively mitigating spatial entanglement in MIL. To alleviate semantic entanglement, we employ instance probability counterfactual inference and optimization via cluster-reasoning instance disentangling. Finally, we employ a generalized linear weighted decision via instance effect re-weighting to address decision entanglement. Extensive experiments on multicentre datasets demonstrate that our model outperforms all state-of-the-art models. Moreover, it attains pathologist-aligned interpretability through disentangled representations and a transparent decision-making process.

[222] Perturb a Model, Not an Image: Towards Robust Privacy Protection via Anti-Personalized Diffusion Models

Tae-Young Lee,Juwon Seo,Jong Hwan Ko,Gyeong-Moon Park

Main category: cs.CV

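TL;DR: This paper proposes APDM, a new framework that prevents diffusion models from being personalized to specific subjects without authorization by shifting the protection target from the images to the model itself, and introduces the Direct Protective Optimization (DPO) loss and the Learning to Protect (L2P) dual-path optimization strategy.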

Details Motivation: Existing adversarial perturbation methods become ineffective when a few clean images are available or under simple image transformations, and they rely on unrealistic assumptions, so a more robust way to prevent misuse of diffusion models is needed. Method: The Anti-Personalized Diffusion Models (APDM) framework comprises the DPO loss and the L2P dual-path optimization strategy; it applies protective optimization directly to the model and simulates future personalization trajectories to strengthen protection. Result: Experiments show that the framework outperforms existing methods at preventing unauthorized personalization, achieving state-of-the-art performance. Conclusion: APDM effectively blocks personalization of specific subjects while preserving generation quality, offering a new solution for the safe use of diffusion models. Abstract: Recent advances in diffusion models have enabled high-quality synthesis of specific subjects, such as identities or objects. This capability, while unlocking new possibilities in content creation, also introduces significant privacy risks, as personalization techniques can be misused by malicious users to generate unauthorized content. Although several studies have attempted to counter this by generating adversarially perturbed samples designed to disrupt personalization, they rely on unrealistic assumptions and become ineffective in the presence of even a few clean images or under simple image transformations. To address these challenges, we shift the protection target from the images to the diffusion model itself to hinder the personalization of specific subjects, through our novel framework called Anti-Personalized Diffusion Models (APDM). We first provide a theoretical analysis demonstrating that a naive approach of existing loss functions to diffusion models is inherently incapable of ensuring convergence for robust anti-personalization. Motivated by this finding, we introduce Direct Protective Optimization (DPO), a novel loss function that effectively disrupts subject personalization in the target model without compromising generative quality. Moreover, we propose a new dual-path optimization strategy, coined Learning to Protect (L2P). By alternating between personalization and protection paths, L2P simulates future personalization trajectories and adaptively reinforces protection at each step. Experimental results demonstrate that our framework outperforms existing methods, achieving state-of-the-art performance in preventing unauthorized personalization. The code is available at https://github.com/KU-VGI/APDM.

[223] MVSMamba: Multi-View Stereo with State Space Model

Jianfei Jiang,Qiankun Liu,Hongyuan Liu,Haochen Yu,Liyong Wang,Jiansheng Chen,Huimin Ma

Main category: cs.CV

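TL;DR: Proposes MVSMamba, the first Mamba-based multi-view stereo network, which achieves efficient global feature aggregation through a Dynamic Mamba module and delivers both strong performance and efficiency on the DTU and Tanks-and-Temples datasets.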

Details Motivation: Transformer-based MVS methods struggle to balance performance and efficiency because of their quadratic complexity, whereas the Mamba architecture offers linear complexity and strong global modeling, which motivates exploring it for MVS. Method: MVSMamba introduces a Dynamic Mamba module (DM-module) built on a reference-centered dynamic scanning strategy, enabling efficient feature interaction between the reference and source views, omnidirectional multi-view feature representations, and multi-scale global feature aggregation. Result: On the DTU dataset and the Tanks-and-Temples benchmark, MVSMamba outperforms current state-of-the-art methods with both higher accuracy and better computational efficiency. Conclusion: MVSMamba successfully brings the Mamba architecture to multi-view stereo, offering a new direction for efficient, high-accuracy 3D reconstruction. Abstract: Robust feature representations are essential for learning-based Multi-View Stereo (MVS), which relies on accurate feature matching. Recent MVS methods leverage Transformers to capture long-range dependencies based on local features extracted by conventional feature pyramid networks. However, the quadratic complexity of Transformer-based MVS methods makes it challenging to balance performance and efficiency. Motivated by the global modeling capability and linear complexity of the Mamba architecture, we propose MVSMamba, the first Mamba-based MVS network. MVSMamba enables efficient global feature aggregation with minimal computational overhead. To fully exploit Mamba's potential in MVS, we propose a Dynamic Mamba module (DM-module) based on a novel reference-centered dynamic scanning strategy, which enables: (1) Efficient intra- and inter-view feature interaction from the reference to source views, (2) Omnidirectional multi-view feature representations, and (3) Multi-scale global feature aggregation. Extensive experimental results demonstrate MVSMamba outperforms state-of-the-art MVS methods on the DTU dataset and the Tanks-and-Temples benchmark with both superior performance and efficiency. The source code is available at https://github.com/JianfeiJ/MVSMamba.

[224] A Generative Adversarial Approach to Adversarial Attacks Guided by Contrastive Language-Image Pre-trained Model

Sampriti Soor,Alik Pramanick,Jothiprakash K,Arijit Sur

Main category: cs.CV

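TL;DR: Proposes a CLIP-based generative adversarial attack that uses text-image alignment to generate visually imperceptible, semantically guided adversarial perturbations that effectively deceive multilabel classifiers.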

Details Motivation: Existing adversarial attacks fall short in preserving visual fidelity and semantic consistency; incorporating natural language semantics can improve both the effectiveness and the stealth of an attack. Method: Building on CLIP's text-image alignment, the method combines the concentrated perturbation strategy of SSAE with dissimilar text embeddings as in GAMA, and generates adversarial examples through a guided loss. Result: Experiments on a range of black-box models show competitive attack success rates, with adversarial examples that retain higher structural similarity and visual fidelity. Conclusion: The method generates semantically relevant, visually imperceptible perturbations and is suited to attacking multilabel classifiers in complex multi-object scenes. Abstract: The rapid growth of deep learning has brought about powerful models that can handle various tasks, like identifying images and understanding language. However, adversarial attacks, which introduce unnoticeable alterations to the input, can deceive models, leading to inaccurate predictions. In this paper, a generative adversarial attack method is proposed that uses the CLIP model to create highly effective and visually imperceptible adversarial perturbations. The CLIP model's ability to align text and image representations helps incorporate natural language semantics with a guided loss to generate effective adversarial examples that look identical to the original inputs. This integration allows extensive scene manipulation, creating perturbations in multi-object environments specifically designed to deceive multilabel classifiers. Our approach integrates the concentrated perturbation strategy from Saliency-based Auto-Encoder (SSAE) with dissimilar text embeddings, as in Generative Adversarial Multi-Object Scene Attacks (GAMA), resulting in perturbations that both deceive classification models and maintain high structural similarity to the original images. The model was tested on various tasks across diverse black-box victim models. The experimental results show that our method performs competitively, achieving comparable or superior results to existing techniques, while preserving greater visual fidelity.

[225] RDTE-UNet: A Boundary and Detail Aware UNet for Precise Medical Image Segmentation

Jierui Qu,Jianchun Zhao

Main category: cs.CV

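TL;DR: Proposes RDTE-UNet, a medical image segmentation network that combines local modeling with global context and improves boundary accuracy and detail preservation through a hybrid backbone and three modules.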

Details Motivation: Large anatomical variability and ambiguous boundaries in medical images make fine structures hard to segment, which undermines the reliability of diagnosis and treatment planning. Method: RDTE-UNet uses a hybrid ResBlock and Transformer backbone with three modules: ASBE (adaptive boundary enhancement), HVDA (fine-grained feature modeling), and EulerFF (fusion weighting based on Euler's formula). Result: On the Synapse and BUSI datasets, RDTE-UNet reaches a comparable level of segmentation accuracy and boundary quality. Conclusion: RDTE-UNet improves structural consistency and boundary accuracy in medical image segmentation and applies to structures of varying morphology, orientation, and scale. Abstract: Medical image segmentation is essential for computer-assisted diagnosis and treatment planning, yet substantial anatomical variability and boundary ambiguity hinder reliable delineation of fine structures. We propose RDTE-UNet, a segmentation network that unifies local modeling with global context to strengthen boundary delineation and detail preservation. RDTE-UNet employs a hybrid ResBlock detail-aware Transformer backbone and three modules: ASBE for adaptive boundary enhancement, HVDA for fine-grained feature modeling, and EulerFF for fusion weighting guided by Euler's formula. Together, these components improve structural consistency and boundary accuracy across morphology, orientation, and scale. On the Synapse and BUSI datasets, RDTE-UNet achieves a comparable level of segmentation accuracy and boundary quality.

[226] MIQ-SAM3D: From Single-Point Prompt to Multi-Instance Segmentation via Competitive Query Refinement

Jierui Qu,Jianchun Zhao

Main category: cs.CV

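TL;DR: Proposes MIQ-SAM3D, a 3D medical image segmentation framework that produces multi-instance segmentation from a single point prompt; with a hybrid CNN-Transformer encoder and a competitive query optimization strategy, it delivers efficient and robust performance on multi-lesion segmentation.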

Details Motivation: Most existing SAM-based methods follow a single-point-to-single-object paradigm and handle multi-lesion segmentation poorly; in addition, ViT backbones lack the ability to capture local detail. Method: A prompt-conditioned instance-query generator turns a single point prompt into multiple specialized queries; a hybrid CNN-Transformer encoder injects CNN boundary saliency through spatial gating; and a competitively optimized query decoder enables end-to-end, parallel multi-instance prediction. Result: The method reaches comparable performance on the LiTS17 and KiTS21 datasets, is strongly robust to the position of the prompt point, and supports efficient annotation of clinical multi-lesion cases. Conclusion: MIQ-SAM3D moves from single-point prompting to multi-instance segmentation, improving the efficiency and practicality of multi-lesion medical image segmentation. Abstract: Accurate segmentation of medical images is fundamental to tumor diagnosis and treatment planning. SAM-based interactive segmentation has gained attention for its strong generalization, but most methods follow a single-point-to-single-object paradigm, which limits multi-lesion segmentation. Moreover, ViT backbones capture global context but often miss high-fidelity local details. We propose MIQ-SAM3D, a multi-instance 3D segmentation framework with a competitive query optimization strategy that shifts from single-point-to-single-mask to single-point-to-multi-instance. A prompt-conditioned instance-query generator transforms a single point prompt into multiple specialized queries, enabling retrieval of all semantically similar lesions across the 3D volume from a single exemplar. A hybrid CNN-Transformer encoder injects CNN-derived boundary saliency into ViT self-attention via spatial gating. A competitively optimized query decoder then enables end-to-end, parallel, multi-instance prediction through inter-query competition. On the LiTS17 and KiTS21 datasets, MIQ-SAM3D achieves comparable performance and exhibits strong robustness to prompts, providing a practical solution for efficient annotation of clinically relevant multi-lesion cases.

[227] Expanding the Content-Style Frontier: a Balanced Subspace Blending Approach for Content-Style LoRA Fusion

Linhao Huang

Main category: cs.CV

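TL;DR: This paper proposes a new method that expands the content-style frontier through Content-Style Subspace Blending and a Content-Style Balance loss, significantly improving content similarity across style intensities in text-to-image generation.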

Details Motivation: Prior work evaluates content similarity at only a single style intensity, yet high style intensity causes content features to be lost, which limits the achievable content-style trade-off. Method: The paper introduces a Content-Style Subspace Blending mechanism and a Content-Style Balance loss to preserve more content features while strengthening stylization. Result: Experiments show that the method outperforms existing techniques in both qualitative and quantitative evaluations, achieving a better content-style trade-off with significantly lower IGD and GD scores. Conclusion: The method expands the content-style frontier and maintains higher content fidelity across style intensities, advancing personalized image generation. Abstract: Recent advancements in text-to-image diffusion models have significantly improved the personalization and stylization of generated images. However, previous studies have only assessed content similarity under a single style intensity. In our experiments, we observe that increasing style intensity leads to a significant loss of content features, resulting in a suboptimal content-style frontier. To address this, we propose a novel approach to expand the content-style frontier by leveraging Content-Style Subspace Blending and a Content-Style Balance loss. Our method improves content similarity across varying style intensities, significantly broadening the content-style frontier. Extensive experiments demonstrate that our approach outperforms existing techniques in both qualitative and quantitative evaluations, achieving a superior content-style trade-off with significantly lower Inverted Generational Distance (IGD) and Generational Distance (GD) scores compared to current methods.
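
For readers unfamiliar with the GD and IGD scores mentioned above, the following is a minimal sketch of how the two metrics are commonly computed for a set of (content, style) trade-off points. It assumes the simple mean-of-nearest-distances variant with Euclidean distance; the function names and the toy fronts are illustrative and are not taken from the paper, which may use a different normalization.

```python
import math


def _min_dist(point, others):
    # Euclidean distance from `point` to its nearest neighbour in `others`.
    return min(math.dist(point, o) for o in others)


def generational_distance(approx, reference):
    # GD: how far, on average, each obtained point is from the reference front.
    return sum(_min_dist(a, reference) for a in approx) / len(approx)


def inverted_generational_distance(approx, reference):
    # IGD: how far, on average, each reference point is from the obtained set.
    # Unlike GD, it also penalizes poor coverage of the reference front.
    return sum(_min_dist(r, approx) for r in reference) / len(reference)


# Hypothetical fronts: (content similarity, style similarity) pairs.
reference_front = [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]
approx_front = [(0.0, 1.0), (1.0, 0.0)]

gd = generational_distance(approx_front, reference_front)
igd = inverted_generational_distance(approx_front, reference_front)
print(gd, igd)
```

In this toy case every obtained point lies on the reference front, so GD is zero, while IGD stays positive because the middle of the front is not covered. Lower is better for both.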

[228] CMI-MTL: Cross-Mamba interaction based multi-task learning for medical visual question answering

Qiangguo Jin,Xianyao Zheng,Hui Cui,Changming Sun,Yuqi Fang,Cong Cong,Ran Su,Leyi Wei,Ping Xuan,Junbo Wang

Main category: cs.CV

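TL;DR: Proposes CMI-MTL, a Cross-Mamba interaction based multi-task learning framework for medical visual question answering (Med-VQA) that outperforms existing methods on three benchmark datasets.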

Details Motivation: Existing self-attention methods struggle with cross-modal semantic alignment between vision and language, and classification-based methods are restricted to predefined answer sets and cannot adapt to the diversity of free-form answers. Method: The CMI-MTL framework comprises three modules, fine-grained visual-text feature alignment (FVTA), cross-modal interleaved feature representation (CIFR), and free-form answer-enhanced multi-task learning (FFAE), to improve cross-modal representation and open-ended question answering. Result: CMI-MTL surpasses existing state-of-the-art methods on three Med-VQA datasets, VQA-RAD, SLAKE, and OVQA, and interpretability experiments support its effectiveness. Conclusion: CMI-MTL captures cross-modal semantic information more effectively and improves free-form medical visual question answering, showing good application potential. Abstract: Medical visual question answering (Med-VQA) is a crucial multimodal task in clinical decision support and telemedicine. Recent self-attention based methods struggle to effectively handle cross-modal semantic alignments between vision and language. Moreover, classification-based methods rely on predefined answer sets. Treating this task as a simple classification problem may make it unable to adapt to the diversity of free-form answers and overlook the detailed semantic information of free-form answers. In order to tackle these challenges, we introduce a Cross-Mamba Interaction based Multi-Task Learning (CMI-MTL) framework that learns cross-modal feature representations from images and texts. CMI-MTL comprises three key modules: fine-grained visual-text feature alignment (FVTA), cross-modal interleaved feature representation (CIFR), and free-form answer-enhanced multi-task learning (FFAE). FVTA extracts the most relevant regions in image-text pairs through fine-grained visual-text feature alignment. CIFR captures cross-modal sequential interactions via cross-modal interleaved feature representation. FFAE leverages auxiliary knowledge from open-ended questions through free-form answer-enhanced multi-task learning, improving the model's capability for open-ended Med-VQA. Experimental results show that CMI-MTL outperforms the existing state-of-the-art methods on three Med-VQA datasets: VQA-RAD, SLAKE, and OVQA. Furthermore, we conduct further interpretability experiments to demonstrate its effectiveness. The code is publicly available at https://github.com/BioMedIA-repo/CMI-MTL.

[229] EREBUS: End-to-end Robust Event Based Underwater Simulation

Hitesh Kyatham,Arjun Suresh,Aadi Palnitkar,Yiannis Aloimonos

Main category: cs.CV

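TL;DR: This paper proposes a pipeline for generating realistic synthetic data from an event camera mounted on an autonomous underwater vehicle (AUV) in underwater environments, addressing the limitations of traditional vision techniques under poor lighting and high dynamic range.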

Details Motivation: Underwater environments suffer from poor lighting and high dynamic range, so traditional vision methods perform poorly and a more effective solution is needed to improve underwater robot vision. Method: The authors design and implement a pipeline that generates synthetic underwater event-camera data, using a simulation environment to reproduce real underwater conditions, including suspended particulate matter and low visibility. Result: The approach is validated on a rock detection task, shows good potential under low visibility, and can be generalized to other underwater vision tasks. Conclusion: The synthetic data pipeline is an effective and flexible tool for training and testing underwater event-camera vision models and helps advance underwater robot vision. Abstract: The underwater domain presents a vast array of challenges for roboticists and computer vision researchers alike, such as poor lighting conditions and high dynamic range scenes. In these adverse conditions, traditional vision techniques struggle to adapt and lead to suboptimal performance. Event-based cameras present an attractive solution to this problem, mitigating the issues of traditional cameras by tracking changes in the footage on a frame-by-frame basis. In this paper, we introduce a pipeline which can be used to generate realistic synthetic data of an event-based camera mounted to an AUV (Autonomous Underwater Vehicle) in an underwater environment for training vision models. We demonstrate the effectiveness of our pipeline using the task of rock detection with poor visibility and suspended particulate matter, but the approach can be generalized to other underwater tasks.

[230] SEPS: Semantic-enhanced Patch Slimming Framework for fine-grained cross-modal alignment

Xinyu Mao,Junsi Li,Haoji Zhang,Yu Liang,Ming Sun

Main category: cs.CV

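TL;DR: Proposes the Semantic-Enhanced Patch Slimming (SEPS) framework, which fuses unified semantics from dense and sparse texts to address patch redundancy and ambiguity in fine-grained cross-modal alignment, significantly improving vision-language retrieval.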

Details Motivation: Existing methods struggle with the patch redundancy and ambiguity caused by differences in information density across modalities, and find it hard to measure the semantic relevance between visual patches and textual descriptions accurately. Method: A two-stage mechanism fuses the semantics of MLLM-generated dense text with the original sparse text, and relevance-aware selection with mean value computation highlights the key patch-word correspondences, slimming patches and improving cross-modal alignment. Result: On Flickr30K and MS-COCO, SEPS surpasses existing methods by 23%-86% in rSum across a range of model architectures, with especially strong gains in text-to-image retrieval. Conclusion: SEPS mitigates redundancy and ambiguity in cross-modal alignment and improves the accuracy of fine-grained vision-language alignment, offering a new solution for multimodal understanding. Abstract: Fine-grained cross-modal alignment aims to establish precise local correspondences between vision and language, forming a cornerstone for visual question answering and related multimodal applications. Current approaches face challenges in addressing patch redundancy and ambiguity, which arise from the inherent information density disparities across modalities. Recently, Multimodal Large Language Models (MLLMs) have emerged as promising solutions to bridge this gap through their robust semantic generation capabilities. However, the dense textual outputs from MLLMs may introduce conflicts with the original sparse captions. Furthermore, accurately quantifying semantic relevance between rich visual patches and concise textual descriptions remains a core challenge. To overcome these limitations, we introduce the Semantic-Enhanced Patch Slimming (SEPS) framework, which systematically addresses patch redundancy and ambiguity. Our approach employs a two-stage mechanism to integrate unified semantics from both dense and sparse texts, enabling the identification of salient visual patches. Additionally, it leverages relevance-aware selection with mean value computation to highlight crucial patch-word correspondences, thereby improving cross-modal similarity assessment. Comprehensive experiments on Flickr30K and MS-COCO datasets validate that SEPS achieves superior performance, surpassing existing approaches by 23%-86% in rSum across diverse model architectures, with notable enhancements in text-to-image retrieval scenarios. Our implementation is available at https://github.com/Sweet4tars/seps.git.
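
The rSum figure quoted above is, in the image-text retrieval literature, usually the sum of Recall@1, Recall@5, and Recall@10 in both retrieval directions. The sketch below shows that convention under the assumption that each query comes with the 1-based rank of its ground-truth match; the function names and the toy ranks are illustrative, and the abstract does not state the exact protocol the authors use.

```python
def recall_at_k(ranks, k):
    # Percentage of queries whose ground-truth item is ranked within the top k.
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def rsum(i2t_ranks, t2i_ranks, ks=(1, 5, 10)):
    # Sum of R@K over both directions (image-to-text and text-to-image).
    return sum(recall_at_k(i2t_ranks, k) + recall_at_k(t2i_ranks, k) for k in ks)


# Hypothetical 1-based ranks of the correct match for four queries each.
image_to_text_ranks = [1, 3, 7, 12]
text_to_image_ranks = [1, 1, 6, 20]

score = rsum(image_to_text_ranks, text_to_image_ranks)
print(score)
```

With three cutoffs and two directions, rSum ranges from 0 to 600, so it summarizes six recall numbers in one score.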

[231] Semantic BIM enrichment for firefighting assets: Fire-ART dataset and panoramic image-based 3D reconstruction

Ya Wen,Yutong Qiao,Chi Chiu Lam,Ioannis Brilakis,Sanghoon Lee,Mun On Wong

Main category: cs.CV

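TL;DR: This study presents the Fire-ART dataset and a panoramic image-based reconstruction approach for efficiently enriching BIM models with firefighting asset semantics, improving the accuracy of digital management of fire safety equipment.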

Details Motivation: Conventional inventory management of firefighting assets has limited capability for automated recognition and reconstruction, is inefficient, and cannot meet the needs of emergency preparedness and on-site response. Method: The authors build the Fire-ART dataset, covering 15 fundamental asset classes with 2,626 images and 6,627 instances, and propose a panoramic image-based reconstruction approach that combines modified cube-map conversion with radius-based spherical camera projection to improve recognition and localization accuracy. Result: In two real-world case studies, the approach achieves F1-scores of 73% and 88% and localization errors of 0.620 m and 0.428 m, respectively. Conclusion: The Fire-ART dataset and the proposed reconstruction approach offer valuable resources and a reliable solution for accurate digital management of firefighting assets. Abstract: Inventory management of firefighting assets is crucial for emergency preparedness, risk assessment, and on-site fire response. However, conventional methods are inefficient due to limited capabilities in automated asset recognition and reconstruction. To address the challenge, this research introduces the Fire-ART dataset and develops a panoramic image-based reconstruction approach for semantic enrichment of firefighting assets into BIM models. The Fire-ART dataset covers 15 fundamental assets, comprising 2,626 images and 6,627 instances, making it an extensive and publicly accessible dataset for asset recognition. In addition, the reconstruction approach integrates modified cube-map conversion and radius-based spherical camera projection to enhance recognition and localization accuracy. Through validations with two real-world case studies, the proposed approach achieves F1-scores of 73% and 88% and localization errors of 0.620 and 0.428 meters, respectively. The Fire-ART dataset and the reconstruction approach offer valuable resources and robust technical solutions to enhance the accurate digital management of fire safety equipment.
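
To make the idea of a spherical camera projection concrete, here is a minimal sketch of the standard equirectangular mapping from a panorama pixel to a viewing ray, followed by scaling the ray by a radius to obtain a 3D point in the camera frame. The axis convention (y up, z forward), the function names, and the way the radius is supplied are all assumptions made for illustration; the abstract does not describe the paper's actual modified cube-map conversion or how it estimates the radius.

```python
import math


def pixel_to_ray(u, v, width, height):
    # Equirectangular mapping: column -> longitude in [-pi, pi),
    # row -> latitude in [-pi/2, pi/2] (top of the image is +pi/2).
    lon = (u / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - v / height) * math.pi
    # Unit direction with y up and z pointing at the panorama centre (assumed).
    return (
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
        math.cos(lat) * math.cos(lon),
    )


def pixel_to_point(u, v, width, height, radius):
    # Place the detected asset at distance `radius` along the viewing ray.
    dx, dy, dz = pixel_to_ray(u, v, width, height)
    return (radius * dx, radius * dy, radius * dz)


# Hypothetical detection at the centre of a 2048x1024 panorama, 2 m away.
center_point = pixel_to_point(1024, 512, 2048, 1024, radius=2.0)
# A detection a quarter-turn to the right, on the horizon.
right_point = pixel_to_point(1536, 512, 2048, 1024, radius=2.0)
print(center_point, right_point)
```

A point obtained this way is in the camera frame; placing it in a BIM model would additionally require the camera pose, which is outside the scope of this sketch.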

[232] Extremal Contours: Gradient-driven contours for compact visual attribution

Reza Karimzadeh,Albert Alonso,Frans Zdyb,Julius B. Kirkegaard,Bulat Ibragimov

Main category: cs.CV

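TL;DR: Proposes a training-free explanation method for vision models that replaces conventional dense perturbation masks with smooth, tunable contours: a star-convex region is parameterized by a Fourier series and optimized with classifier gradients under a preserve/delete objective, yielding compact, connected, and stable explanation masks.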

Details Motivation: Existing dense-perturbation explanation methods often produce fragmented, overfitted masks that need elaborate post-processing, are vulnerable to adversarial artifacts, and lack stability and interpretability. Method: The mask is modeled as a star-convex region parameterized by a truncated Fourier series and optimized under a preserve/delete objective with gradient guidance, producing smooth contours with few degrees of freedom; the method is extended to multiple contours to localize several objects. Result: On ImageNet, the method matches the fidelity of dense masks while improving run-to-run consistency and lowering complexity; it markedly improves relevance mass (by over 15% on DINO models) and supports area-controlled importance contour maps. Conclusion: By restricting the solution space to smooth, low-dimensional contours, the method achieves more faithful, compact, robust, and interpretable attribution for vision models and outperforms gradient- and perturbation-based baselines. Abstract: Faithful yet compact explanations for vision models remain a challenge, as commonly used dense perturbation masks are often fragmented and overfitted, needing careful post-processing. Here, we present a training-free explanation method that replaces dense masks with smooth tunable contours. A star-convex region is parameterized by a truncated Fourier series and optimized under an extremal preserve/delete objective using the classifier gradients. The approach guarantees a single, simply connected mask, cuts the number of free parameters by orders of magnitude, and yields stable boundary updates without cleanup. Restricting solutions to low-dimensional, smooth contours makes the method robust to adversarial masking artifacts. On ImageNet classifiers, it matches the extremal fidelity of dense masks while producing compact, interpretable regions with improved run-to-run consistency. Explicit area control also enables importance contour maps, yielding transparent fidelity-area profiles. Finally, we extend the approach to multiple contours and show how it can localize multiple objects within the same framework. Across benchmarks, the method achieves higher relevance mass and lower complexity than gradient- and perturbation-based baselines, with especially strong gains on self-supervised DINO models where it improves relevance mass by over 15% and maintains positive faithfulness correlations.
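
The contour parameterization described above can be illustrated with a small sketch. It assumes the radius of the region around a centre point is a truncated Fourier series in the polar angle, r(θ) = a0 + Σ_k (a_k cos kθ + b_k sin kθ), and that a pixel belongs to the mask when it lies within that radius. The hard inside/outside test, the function names, and the toy coefficients are illustrative assumptions; the paper presumably uses a differentiable mask so that classifier gradients can update the coefficients, which this sketch does not attempt.

```python
import math


def contour_radius(theta, a0, a, b):
    # Truncated Fourier series for the boundary radius at polar angle theta.
    # a[k] and b[k] are the cosine and sine coefficients of harmonic k + 1.
    return a0 + sum(
        ak * math.cos((k + 1) * theta) + bk * math.sin((k + 1) * theta)
        for k, (ak, bk) in enumerate(zip(a, b))
    )


def contour_mask(height, width, center, a0, a, b):
    # Rasterize the star-convex region into a binary mask (list of rows).
    cy, cx = center
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            dy, dx = y - cy, x - cx
            inside = math.hypot(dx, dy) <= contour_radius(math.atan2(dy, dx), a0, a, b)
            row.append(1 if inside else 0)
        mask.append(row)
    return mask


# With all harmonics set to zero the contour is a circle of radius a0.
circle = contour_mask(9, 9, center=(4, 4), a0=3.0, a=[0.0], b=[0.0])
# A non-zero second harmonic stretches the region along the x axis.
stretched = contour_mask(9, 9, center=(4, 4), a0=3.0, a=[0.0, 1.0], b=[0.0, 0.0])
for row in stretched:
    print("".join("#" if v else "." for v in row))
```

A contour with K harmonics has only 2K + 1 coefficients plus a centre, compared with one free value per pixel for a dense mask, and as long as r(θ) stays positive the region is a single connected piece, which matches the properties the abstract highlights.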

[233] Towards One-step Causal Video Generation via Adversarial Self-Distillation

Yongqi Yang,Huayang Huang,Xu Peng,Xiaobin Hu,Donghao Luo,Jiangning Zhang,Chengjie Wang,Yu Wu

Main category: cs.CV

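TL;DR: Proposes a distillation-based framework for efficient causal video generation that combines Adversarial Self-Distillation (ASD) with a First-Frame Enhancement (FFE) strategy, achieving high-quality video generation with very few denoising steps and supporting multiple inference-step settings without repeated distillation.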

Details Motivation: Existing hybrid video generation models denoise sequentially and iteratively, which causes error accumulation and long inference times and makes it hard to maintain quality with very few steps. Method: Building on the Distribution Matching Distillation (DMD) framework, the authors propose Adversarial Self-Distillation (ASD), which aligns the student model's n-step outputs with its (n+1)-step outputs at the distribution level, and a First-Frame Enhancement (FFE) strategy, which allocates more denoising steps to the initial frames to suppress error propagation. Result: On VBench, the method outperforms current state-of-the-art approaches in both one-step and two-step generation, with higher generation quality and more stable training. Conclusion: A single distilled model flexibly supports multiple inference-step settings, substantially improving the efficiency and quality of video generation at very low step counts without repeated distillation. Abstract: Recent hybrid video generation models combine autoregressive temporal dynamics with diffusion-based spatial denoising, but their sequential, iterative nature leads to error accumulation and long inference times. In this work, we propose a distillation-based framework for efficient causal video generation that enables high-quality synthesis with extremely limited denoising steps. Our approach builds upon the Distribution Matching Distillation (DMD) framework and proposes a novel Adversarial Self-Distillation (ASD) strategy, which aligns the outputs of the student model's n-step denoising process with its (n+1)-step version at the distribution level. This design provides smoother supervision by bridging small intra-student gaps and more informative guidance by combining teacher knowledge with locally consistent student behavior, substantially improving training stability and generation quality in extremely few-step scenarios (e.g., 1-2 steps). In addition, we present a First-Frame Enhancement (FFE) strategy, which allocates more denoising steps to the initial frames to mitigate error propagation while applying larger skipping steps to later frames. Extensive experiments on VBench demonstrate that our method surpasses state-of-the-art approaches in both one-step and two-step video generation. Notably, our framework produces a single distilled model that flexibly supports multiple inference-step settings, eliminating the need for repeated re-distillation and enabling efficient, high-quality video synthesis.

[234] UniSOT: A Unified Framework for Multi-Modality Single Object Tracking

Yinchao Ma,Yuyang Tang,Wenfei Yang,Tianzhu Zhang,Xu Zhou,Feng Wu

Main category: cs.CV

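TL;DR: This paper proposes UniSOT, a unified single object tracker that handles any combination of three reference modalities and four video modalities with a single model, and outperforms modality-specific models on multiple benchmarks.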

Details Motivation: Existing trackers are typically designed for specific video or reference modalities, which fragments model design and limits practical use; a unified framework is needed to handle the varied requirements of different modality combinations. Method: UniSOT uses a unified network architecture and a single set of parameters to support any combination of three reference modalities (bounding box, natural language, or both) and four video modalities (RGB, RGB+Depth, RGB+Thermal, RGB+Event). Result: Experiments on 18 visual tracking, vision-language tracking, and RGB+X tracking benchmarks show that UniSOT outperforms modality-specific methods across all modality combinations; for example, it improves AUC on TNL2K by over 3.0% relative to previous methods and improves the main metric by over 2.0% relative to Un-Track on RGB+X modalities. Conclusion: UniSOT is the first unified tracker to support multiple combinations of reference and video modalities, offering good generality and strong performance and advancing unified multi-modality tracking. Abstract: Single object tracking aims to localize a target object with specific reference modalities (bounding box, natural language or both) in a sequence of specific video modalities (RGB, RGB+Depth, RGB+Thermal or RGB+Event). Different reference modalities enable various human-machine interactions, and different video modalities are demanded in complex scenarios to enhance tracking robustness. Existing trackers are designed for single or several video modalities with single or several reference modalities, which leads to separate model designs and limits practical applications. Practically, a unified tracker is needed to handle various requirements. To the best of our knowledge, there is still no tracker that can perform tracking with the above reference modalities across these video modalities simultaneously. Thus, in this paper, we present a unified tracker, UniSOT, for different combinations of three reference modalities and four video modalities with uniform parameters. Extensive experimental results on 18 visual tracking, vision-language tracking and RGB+X tracking benchmarks demonstrate that UniSOT shows superior performance against modality-specific counterparts. Notably, UniSOT outperforms previous counterparts by over 3.0% AUC on TNL2K across all three reference modalities and outperforms Un-Track by over 2.0% main metric across all three RGB+X video modalities.

[235] Terrain-Enhanced Resolution-aware Refinement Attention for Off-Road Segmentation

Seongkyu Choi,Jhonghyun An

Main category: cs.CV

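TL;DR: Proposes a resolution-aware token decoder that tackles blurred boundaries, sparse supervision, and label noise in off-road semantic segmentation, improving boundary accuracy and local consistency while keeping computation efficient.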

Details Motivation: Methods that fuse only at low resolution blur boundaries and propagate errors, while maintaining high-resolution pathways or repeating high-resolution fusion is costly and sensitive to noise, which makes them ill-suited to real scenes where labels are incomplete and noisy. Method: A resolution-aware token decoder performs most computation at a low-resolution bottleneck, injects fine-scale features through gated cross-attention, and refines only a sparse set of uncertainty-selected pixels; global self-attention is combined with lightweight dilated depthwise convolution to restore local coherence, and a boundary-band consistency regularizer encourages coherent edge predictions. Result: The method achieves competitive performance on multiple benchmarks, performs better in boundary regions in particular, and shows stronger training stability and robustness to label noise. Conclusion: The method balances global semantics, local consistency, and boundary fidelity, combining efficiency with accuracy, and suits semantic segmentation in complex real-world environments. Abstract: Off-road semantic segmentation suffers from thick, inconsistent boundaries, sparse supervision for rare classes, and pervasive label noise. Designs that fuse only at low resolution blur edges and propagate local errors, whereas maintaining high-resolution pathways or repeating high-resolution fusions is costly and fragile to noise. We introduce a resolution-aware token decoder that balances global semantics, local consistency, and boundary fidelity under imperfect supervision. Most computation occurs at a low-resolution bottleneck; a gated cross-attention injects fine-scale detail, and only a sparse, uncertainty-selected set of pixels is refined. The components are co-designed and tightly integrated: global self-attention with lightweight dilated depthwise refinement restores local coherence; a gated cross-attention integrates fine-scale features from a standard high-resolution encoder stream without amplifying noise; and a class-aware point refinement corrects residual ambiguities with negligible overhead. During training, we add a boundary-band consistency regularizer that encourages coherent predictions in a thin neighborhood around annotated edges, with no inference-time cost. Overall, the results indicate competitive performance and improved stability across transitions.
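
The idea of refining only a sparse, uncertainty-selected set of pixels can be sketched as follows. The example assumes that uncertainty is measured by the Shannon entropy of each pixel's class probabilities and that the k most uncertain pixels are passed on for refinement; the abstract does not say which uncertainty measure or selection rule the authors use, so the function names, the entropy criterion, and the toy probabilities are illustrative only.

```python
import math


def entropy(probs):
    # Shannon entropy of a class-probability vector (0 * log 0 treated as 0).
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_uncertain_pixels(prob_map, k):
    # prob_map maps (row, col) -> list of class probabilities for that pixel.
    # Returns the k pixel positions with the highest predictive entropy.
    ranked = sorted(prob_map.items(), key=lambda item: entropy(item[1]), reverse=True)
    return [position for position, _ in ranked[:k]]


# Hypothetical coarse predictions for a 2x2 image with three classes.
coarse_probs = {
    (0, 0): [0.98, 0.01, 0.01],  # confident
    (0, 1): [0.34, 0.33, 0.33],  # nearly uniform, most uncertain
    (1, 0): [0.60, 0.30, 0.10],  # moderately uncertain
    (1, 1): [1.00, 0.00, 0.00],  # fully confident
}

to_refine = select_uncertain_pixels(coarse_probs, k=2)
print(to_refine)
```

Because only k pixels are revisited, the cost of the refinement step scales with k rather than with the image size, which is consistent with the "negligible overhead" the abstract claims for point refinement.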

[236] Contrast-Guided Cross-Modal Distillation for Thermal Object Detection

SiWoo Kim,JhongHyun An

Main category: cs.CV

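TL;DR: Proposes a training-only multi-objective learning method that improves thermal-infrared detection by sharpening decision boundaries between classes and by exploiting semantic priors from an RGB-pretrained teacher model.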

Details Motivation: Thermal-infrared images used for night-time perception have low contrast and weak high-frequency information, which leads to duplicate boxes, missed small objects, and class confusion; existing methods rely on translation to RGB or on multi-modal fusion, which are fragile and add test-time cost. Method: Inference stays single-modality, and two objectives are added during training: (1) pulling same-class features together and pushing different-class features apart to sharpen decision boundaries; (2) guiding the student with multi-level features from an RGB teacher model to inject cross-modal semantic priors. Result: Experiments show that the method outperforms prior approaches and reaches state-of-the-art performance on thermal-infrared detection. Conclusion: The training strategy improves the robustness and accuracy of thermal-infrared detection without extra inputs or sensors at test time, which gives it practical value. Abstract: Robust perception at night remains challenging for thermal-infrared detection: low contrast and weak high-frequency cues lead to duplicate, overlapping boxes, missed small objects, and class confusion. Prior remedies either translate TIR to RGB and hope pixel fidelity transfers to detection -- making performance fragile to color or structure artifacts -- or fuse RGB and TIR at test time, which requires extra sensors, precise calibration, and higher runtime cost. Both lines can help in favorable conditions, but do not directly shape the thermal representation used by the detector. We keep mono-modality inference and tackle the root causes during training. Specifically, we introduce training-only objectives that sharpen instance-level decision boundaries by pulling together features of the same class and pushing apart those of different classes -- suppressing duplicate and confusing detections -- and that inject cross-modal semantic priors by aligning the student's multi-level pyramid features with an RGB-trained teacher, thereby strengthening texture-poor thermal features without visible input at test time. In experiments, our method outperformed prior approaches and achieved state-of-the-art performance.
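
The first training objective above, pulling same-class features together and pushing different-class features apart, is the behaviour of a supervised contrastive loss. The sketch below implements a generic version of that loss on plain Python lists, assuming cosine similarity and a temperature of 0.1; it is a stand-in chosen for illustration, since the abstract does not give the authors' exact formulation, and all names and toy features are hypothetical.

```python
import math


def cosine(u, v):
    # Cosine similarity between two feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def supervised_contrastive_loss(features, labels, temperature=0.1):
    # For each anchor, maximize the softmax probability of its same-class
    # features against all other features; anchors with no positive are skipped.
    total, counted = 0.0, 0
    n = len(features)
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        logits = {
            j: cosine(features[i], features[j]) / temperature
            for j in range(n)
            if j != i
        }
        log_denominator = math.log(sum(math.exp(v) for v in logits.values()))
        total += -sum(logits[j] - log_denominator for j in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0


# Hypothetical instance features forming two tight clusters.
instance_features = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]

# Labels that agree with the clusters versus labels that cut across them.
loss_aligned = supervised_contrastive_loss(instance_features, [0, 0, 1, 1])
loss_mixed = supervised_contrastive_loss(instance_features, [0, 1, 0, 1])
print(loss_aligned, loss_mixed)
```

The loss is small when features of the same class already cluster together and large when they do not, so minimizing it during training moves the detector's instance features toward class-separated clusters. The second objective, aligning student features with an RGB-trained teacher, would be a separate feature-matching term and is not shown here.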

[237] Privacy Preserving Ordinal-Meta Learning with VLMs for Fine-Grained Fruit Quality Prediction

Riddhi Jain,Manasi Patwardhan,Aayush Mishra,Parijat Deshpande,Beena Rai

Main category: cs.CV

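TL;DR: Proposes a Model-Agnostic Ordinal Meta-Learning (MAOML) algorithm that improves open-source vision-language models on fruit freshness classification in zero-shot and few-shot settings, addressing data scarcity and label ordinality and reaching an industry-standard accuracy of 92.71%.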

Details Motivation: Existing open-source vision-language models perform poorly on fruit freshness prediction, closed-source models raise data privacy concerns, and costly expert annotation makes data scarce, so a method is needed that uses a small amount of labeled data efficiently while preserving privacy. Method: Proposes the Model-Agnostic Ordinal Meta-Learning (MAOML) algorithm, which uses meta-learning to handle data sparsity and exploits the ordinal relationship between labels during optimization; it is suited to training smaller vision-language models. Result: Achieves state-of-the-art fruit freshness classification in zero-shot and few-shot settings, with an average accuracy of 92.71%, meeting the industry standard. Conclusion: MAOML improves open-source vision-language models on fruit freshness prediction, combines privacy preservation with high performance, and has broad potential in real food retail settings. Abstract: To effectively manage the wastage of perishable fruits, it is crucial to accurately predict their freshness or shelf life using non-invasive methods that rely on visual data. In this regard, deep learning techniques can offer a viable solution. However, obtaining fine-grained fruit freshness labels from experts is costly, leading to a scarcity of data. Closed proprietary Vision Language Models (VLMs), such as Gemini, have demonstrated strong performance in the fruit freshness detection task in both zero-shot and few-shot settings. Nonetheless, food retail organizations are unable to utilize these proprietary models due to concerns related to data privacy, while existing open-source VLMs yield sub-optimal performance for the task. Fine-tuning these open-source models with limited data fails to achieve the performance levels of proprietary models. In this work, we introduce a Model-Agnostic Ordinal Meta-Learning (MAOML) algorithm, designed to train smaller VLMs. This approach utilizes meta-learning to address data sparsity and leverages label ordinality, thereby achieving state-of-the-art performance in the fruit freshness classification task under both zero-shot and few-shot settings. Our method achieves an industry-standard accuracy of 92.71%, averaged across all fruits. Keywords: Fruit Quality Prediction, Vision Language Models, Meta Learning, Ordinal Regression

[238] Reg-DPO: SFT-Regularized Direct Preference Optimization with GT-Pair for Improving Video Generation

Jie Du,Xinyu Gong,Qingshan Tan,Wen Li,Yangming Cheng,Weitao Wang,Chenlu Zhan,Suhui Wu,Hao Zhang,Jun Zhang

Main category: cs.CV

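TL;DR: Proposes a new Direct Preference Optimization (DPO)-based method for video generation: GT-Pair automatically builds high-quality preference pairs, Reg-DPO improves training stability and generation quality, and combining FSDP with memory optimization techniques substantially raises training capacity. The method outperforms existing approaches across multiple tasks and datasets.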

Details Motivation: Existing DPO methods mostly follow image-domain paradigms and are limited to small-scale models, so they struggle with the challenges of video generation: costly data construction, unstable training, and heavy memory consumption. Method: Proposes GT-Pair, which automatically builds preference pairs by using real videos as positives and model-generated videos as negatives; designs Reg-DPO, which adds the SFT loss to the DPO objective as a regularization term to improve training stability and generation fidelity; and combines the FSDP framework with multiple memory optimization techniques to improve training efficiency. Result: Experiments on multiple datasets for I2V and T2V tasks show that the method consistently outperforms existing approaches in video generation quality, with training capacity nearly three times that of FSDP alone. Conclusion: The proposed method addresses the data, stability, and efficiency problems of preference learning for video generation and improves video generation with large-scale models. Abstract: Recent studies have identified Direct Preference Optimization (DPO) as an efficient and reward-free approach to improving video generation quality. However, existing methods largely follow image-domain paradigms and are mainly developed on small-scale models (approximately 2B parameters), limiting their ability to address the unique challenges of video tasks, such as costly data construction, unstable training, and heavy memory consumption. To overcome these limitations, we introduce GT-Pair, which automatically builds high-quality preference pairs by using real videos as positives and model-generated videos as negatives, eliminating the need for any external annotation. We further present Reg-DPO, which incorporates the SFT loss as a regularization term into the DPO objective to enhance training stability and generation fidelity. Additionally, by combining the FSDP framework with multiple memory optimization techniques, our approach achieves nearly three times higher training capacity than using FSDP alone. Extensive experiments on both I2V and T2V tasks across multiple datasets demonstrate that our method consistently outperforms existing approaches, delivering superior video generation quality.
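
The idea of adding an SFT term to the DPO objective can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the standard DPO form on scalar log-likelihoods, whereas a video diffusion model would typically use denoising errors in their place. The names `reg_dpo_loss`, `beta`, and `sft_weight` are hypothetical.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reg_dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose,
                 beta=0.1, sft_weight=1.0):
    """DPO loss plus an SFT regularizer (assumed form, for illustration).

    logp_win / logp_lose: policy log-likelihood of the preferred (real)
    and rejected (generated) sample; ref_*: the same under the frozen
    reference model.
    """
    # Standard DPO margin: how much more strongly the policy prefers the
    # winner over the loser than the reference model does.
    margin = (logp_win - ref_logp_win) - (logp_lose - ref_logp_lose)
    dpo = -log_sigmoid(beta * margin)
    # Assumed SFT term: negative log-likelihood of the preferred sample.
    sft = -logp_win
    return dpo + sft_weight * sft


# Raising the likelihood of the real video lowers both terms.
print(reg_dpo_loss(-1.0, -1.0, -1.0, -1.0))
print(reg_dpo_loss(-0.5, -1.0, -1.0, -1.0))
```

With GT-Pair, the preferred sample is always a real video, so an SFT term of this kind anchors the policy to real data while the DPO term widens the gap to generated samples.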

[239] When to Trust the Answer: Question-Aligned Semantic Nearest Neighbor Entropy for Safer Surgical VQA

Dennis Pierantozzi,Luca Carlini,Mauro Orazio Drago,Chiara Lena,Cesare Hassan,Elena De Momi,Danail Stoyanov,Sophia Bano,Mobarak I. Hoque

Main category: cs.CV

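TL;DR: This paper proposes QA-SNNE, a black-box uncertainty estimation method that combines question semantics with semantic entropy in a medical text embedding space to improve the safety and reliability of Visual Question Answering (VQA) systems in surgical settings. Experiments show that it improves AUROC across multiple models, by 15-38% for zero-shot models, and strengthens hallucination detection.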

Details Motivation: Existing surgical VQA research focuses on accuracy and linguistic quality and overlooks safety behaviors such as ambiguity awareness and referral to experts. To improve system safety, the authors introduce uncertainty estimation to enable Automatic Failure Detection (AFD) and thereby support safer decisions. Method: Proposes Question Aligned Semantic Nearest Neighbor Entropy (QA-SNNE), which measures semantic entropy by comparing the semantic similarity of generated answers with their nearest neighbors in a medical text embedding space, and incorporates question semantics into the assessment of prediction confidence. The method applies to black-box models and is interpretable. Result: Five models (PEFT models and LVLMs) are evaluated on the EndoVis18-VQA and PitVQA datasets. PEFT models are sensitive to mild paraphrasing, while LVLMs are more robust. QA-SNNE improves AUROC in most in-template settings, by 15-38% for zero-shot models, keeps its gains under out-of-template stress tests, and strengthens hallucination detection. Conclusion: By linking semantic uncertainty to question context, QA-SNNE offers a practical and interpretable approach to automatic failure detection in surgical VQA. Combining LVLM backbones with question-aligned uncertainty estimation helps improve system safety and clinician trust. Abstract: Safety and reliability are essential for deploying Visual Question Answering (VQA) in surgery, where incorrect or ambiguous responses can harm the patient. Most surgical VQA research focuses on accuracy or linguistic quality while overlooking safety behaviors such as ambiguity awareness, referral to human experts, or triggering a second opinion. Inspired by Automatic Failure Detection (AFD), we study uncertainty estimation as a key enabler of safer decision making. We introduce Question Aligned Semantic Nearest Neighbor Entropy (QA-SNNE), a black box uncertainty estimator that incorporates question semantics into prediction confidence. It measures semantic entropy by comparing generated answers with nearest neighbors in a medical text embedding space, conditioned on the question. We evaluate five models, including domain specific Parameter-Efficient Fine-Tuned (PEFT) models and zero-shot Large Vision-Language Models (LVLMs), on EndoVis18-VQA and PitVQA. PEFT models degrade under mild paraphrasing, while LVLMs are more resilient. Across three LVLMs and two PEFT baselines, QA-SNNE improves AUROC in most in-template settings and enhances hallucination detection. The Area Under the ROC Curve (AUROC) increases by 15-38% for zero-shot models, with gains maintained under out-of-template stress. QA-SNNE offers a practical and interpretable step toward AFD in surgical VQA by linking semantic uncertainty to question context. Combining LVLM backbones with question aligned uncertainty estimation can improve safety and clinician trust. The code and model are available at https://github.com/DennisPierantozzi/QASNNE
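
The abstract reports results as AUROC, which for failure detection measures how well an uncertainty score separates wrong answers from correct ones. The sketch below shows that metric only, not QA-SNNE itself; the function name `auroc` and the toy scores are illustrative assumptions.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise ranking.

    scores: uncertainty per answer (higher = less trusted).
    labels: 1 if the answer was wrong (a failure), 0 if it was correct.
    Returns the probability that a random failure receives a higher
    uncertainty than a random correct answer; ties count as 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one failure and one correct answer")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: four answers, two of which were wrong.
uncertainty = [0.9, 0.8, 0.3, 0.2]
is_wrong = [1, 0, 1, 0]
print(auroc(uncertainty, is_wrong))
```

A value of 0.5 means the estimator is no better than chance at flagging failures, and 1.0 means every wrong answer is ranked as more uncertain than every correct one.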

[240] Efficiently Training A Flat Neural Network Before It has been Quantizated

Peng Xia,Junbiao Pang,Tianyang Cai

Main category: cs.CV

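TL;DR: Proposes a post-training quantization method for vision transformers that models activation and weight quantization errors as independent Gaussian noise and uses noise-injection optimization to obtain a flat full-precision network, which improves low-bit quantization performance.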

Details Motivation: Existing PTQ methods overlook the relationship between the full-precision model and the quantized model, which leads to large quantization error, and there is no model-agnostic training method tailored to low-bit precision. Method: Models the Activation Quantization Error (AQE) and the Weight Quantization Error (WQE) as independent Gaussian noise, and uses noise-injection optimization to find a flat minimum, pre-conditioning the model for low-bit quantization. Result: Experiments show that the method substantially reduces quantization error under low-bit PTQ and improves the performance of the quantized model. Conclusion: A flat full-precision network is crucial for low-bit quantization, and the proposed framework improves post-training quantization of ViT models. Abstract: Post-training quantization (PTQ) for vision transformers (ViTs) has garnered significant attention due to its efficiency in compressing models. However, existing methods typically overlook the relationship between a well-trained NN and the quantized model, leading to considerable quantization error for PTQ. Moreover, it is unclear how to efficiently train a model-agnostic neural network which is tailored for a predefined precision low-bit model. In this paper, we first discover that a flat full precision neural network is crucial for low-bit quantization. To achieve this, we propose a framework that proactively pre-conditions the model by measuring and disentangling the error sources. Specifically, both the Activation Quantization Error (AQE) and the Weight Quantization Error (WQE) are statistically modeled as independent Gaussian noises. We study several noise injection optimization methods to obtain a flat minimum. Experimental results attest to the effectiveness of our approach. These results open novel pathways for obtaining low-bit PTQ models.
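
One way to picture the link between quantization error, Gaussian noise, and flatness is the sketch below. It is a toy illustration under stated assumptions, not the authors' procedure: the noise standard deviation is set to `step / sqrt(12)` (the standard deviation of uniform rounding error), and `quantize`, `inject_noise`, and `expected_loss_increase` are hypothetical helper names.

```python
import math
import random


def quantize(w, step):
    """Uniform round-to-nearest quantization with step size `step`."""
    return round(w / step) * step


def inject_noise(weights, step, rng):
    """Perturb weights with Gaussian noise that stands in for quantization
    error. Assumption: sigma matches the std of uniform rounding error."""
    sigma = step / math.sqrt(12.0)
    return [w + rng.gauss(0.0, sigma) for w in weights]


def expected_loss_increase(loss_fn, weights, step, rng, trials=200):
    """Average loss increase under injected noise; small values indicate a
    flat region around `weights`."""
    base = loss_fn(weights)
    total = 0.0
    for _ in range(trials):
        total += loss_fn(inject_noise(weights, step, rng)) - base
    return total / trials


# Two toy losses with the same minimum but different curvature.
def flat_loss(ws):
    return sum(w * w for w in ws)


def sharp_loss(ws):
    return 10.0 * sum(w * w for w in ws)


w0 = [0.0, 0.0]
print(expected_loss_increase(flat_loss, w0, 0.1, random.Random(0)))
print(expected_loss_increase(sharp_loss, w0, 0.1, random.Random(0)))
```

The sharper loss suffers a larger increase under the same perturbation, which is the intuition behind seeking a flat full-precision network before quantizing it.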

[241] HMVLM: Human Motion-Vision-Lanuage Model via MoE LoRA

Lei Hu,Yongjing Ye,Shihong Xia

Main category: cs.CV

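TL;DR: Proposes HMVLM, a unified framework based on the MoE LoRA strategy for fusing 3D human motion with text. A zero expert mitigates catastrophic forgetting, and body-part-specific tokenization raises the spatial resolution of the pose representation; the model performs well on a range of downstream tasks.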

Details Motivation: To address catastrophic forgetting when fusing 3D human motion with text, and to overcome the technical barrier of building autoregressive-compatible pose representations. Method: Adopts a Mixture of Expert Low-Rank Adaptation (MoE LoRA) strategy with a gating network that dynamically allocates LoRA expert weights, and designs a zero expert to protect the pre-trained language parameters; body joints are partitioned into groups to achieve body-part-specific tokenization. Result: Experiments show that the method alleviates knowledge forgetting during instruction tuning and achieves strong performance on a range of human motion downstream tasks. Conclusion: The HMVLM framework integrates multimodal information effectively, balances task-specific fine-tuning against retention of general language ability, and provides an effective solution for cross-modal modeling of human motion and language. Abstract: The expansion of instruction-tuning data has enabled foundation language models to exhibit improved instruction adherence and superior performance across diverse downstream tasks. Semantically-rich 3D human motion is being progressively integrated with these foundation models to enhance multimodal understanding and cross-modal generation capabilities. However, the modality gap between human motion and text raises unresolved concerns about catastrophic forgetting during this integration. In addition, developing autoregressive-compatible pose representations that preserve generalizability across heterogeneous downstream tasks remains a critical technical barrier. To address these issues, we propose the Human Motion-Vision-Language Model (HMVLM), a unified framework based on the Mixture of Expert Low-Rank Adaptation (MoE LoRA) strategy. The framework leverages the gating network to dynamically allocate LoRA expert weights based on the input prompt, enabling synchronized fine-tuning of multiple tasks. To mitigate catastrophic forgetting during instruction-tuning, we introduce a novel zero expert that preserves the pre-trained parameters for general linguistic tasks. For pose representation, we implement body-part-specific tokenization by partitioning the human body into different joint groups, enhancing the spatial resolution of the representation. Experiments show that our method effectively alleviates knowledge forgetting during instruction-tuning and achieves remarkable performance across diverse human motion downstream tasks.
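
The role of a zero expert in a gated mixture of LoRA experts can be sketched as below. This is a simplified illustration, not the HMVLM code: the gate logits are passed in directly rather than computed from the prompt, the zero expert is represented by `None`, and all names (`moe_lora_forward`, `base_w`, `experts`) are hypothetical.

```python
import math


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def softmax(logits):
    mx = max(logits)
    exps = [math.exp(l - mx) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_lora_forward(x, base_w, experts, gate_logits):
    """Frozen base layer plus a gated sum of low-rank updates.

    experts: list of (A, B) pairs, where A has shape r x d_in and B has
    shape d_out x r, or None for the zero expert (no update).
    """
    gates = softmax(gate_logits)
    out = matvec(base_w, x)  # frozen pre-trained path
    for gate, expert in zip(gates, experts):
        if expert is None:
            continue  # zero expert: leaves the pre-trained output untouched
        a, b = expert
        delta = matvec(b, matvec(a, x))  # low-rank update B(Ax)
        out = [o + gate * d for o, d in zip(out, delta)]
    return out


base_w = [[1.0, 0.0], [0.0, 1.0]]
motion_expert = ([[1.0, 1.0]], [[1.0], [1.0]])  # rank-1 LoRA pair
experts = [motion_expert, None]
x = [1.0, 2.0]

print(moe_lora_forward(x, base_w, experts, [0.0, 0.0]))      # mixed routing
print(moe_lora_forward(x, base_w, experts, [-1000.0, 0.0]))  # zero expert only
```

When the gate routes an input to the zero expert, the layer reduces to the frozen base weights, which is how general language behavior can be kept intact for prompts that do not involve motion.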

[242] SecDiff: Diffusion-Aided Secure Deep Joint Source-Channel Coding Against Adversarial Attacks

Changyuan Zhao,Jiacheng Wang,Ruichen Zhang,Dusit Niyato,Hongyang Du,Zehui Xiong,Dong In Kim,Ping Zhang

Main category: cs.CV

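TL;DR: This paper proposes SecDiff, a diffusion-based decoding framework that improves the security and robustness of deep joint source-channel coding (JSCC) in adversarial wireless environments. It achieves efficient semantic reconstruction through pseudoinverse-guided sampling and adaptive guidance weighting, and counters jamming and pilot spoofing attacks with subcarrier masking and an EM-driven algorithm.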

Details Motivation: Existing JSCC frameworks are vulnerable to physical-layer adversarial attacks such as pilot spoofing and subcarrier jamming, which degrade semantic fidelity, so their security and robustness need to be improved. Method: Proposes the SecDiff framework: 1) pseudoinverse-guided sampling with adaptive weighting to reduce inference latency; 2) a power-based subcarrier masking strategy that recasts recovery as a diffusion-guided masked inpainting problem; 3) channel estimation formulated as a blind inverse problem, with an EM-driven reconstruction algorithm guided jointly by the reconstruction loss and the channel operator; 4) alternating pilot recovery and channel estimation during the diffusion process for joint refinement. Result: Experiments over OFDM channels show that SecDiff outperforms existing secure and generative JSCC baselines under adversarial conditions, with a better trade-off between reconstruction quality and computational cost. Conclusion: Through diffusion-aided decoding, SecDiff improves the attack resilience and semantic reconstruction efficiency of JSCC systems, and is a step toward practical, low-latency, attack-resilient semantic communication. Abstract: Deep joint source-channel coding (JSCC) has emerged as a promising paradigm for semantic communication, delivering significant performance gains over conventional separate coding schemes. However, existing JSCC frameworks remain vulnerable to physical-layer adversarial threats, such as pilot spoofing and subcarrier jamming, compromising semantic fidelity. In this paper, we propose SecDiff, a plug-and-play, diffusion-aided decoding framework that significantly enhances the security and robustness of deep JSCC under adversarial wireless environments. Different from prior diffusion-guided JSCC methods that suffer from high inference latency, SecDiff employs pseudoinverse-guided sampling and adaptive guidance weighting, enabling flexible step-size control and efficient semantic reconstruction. To counter jamming attacks, we introduce a power-based subcarrier masking strategy and recast recovery as a masked inpainting problem, solved via diffusion guidance. For pilot spoofing, we formulate channel estimation as a blind inverse problem and develop an expectation-maximization (EM)-driven reconstruction algorithm, guided jointly by reconstruction loss and a channel operator. Notably, our method alternates between pilot recovery and channel estimation, enabling joint refinement of both variables throughout the diffusion process. Extensive experiments over orthogonal frequency-division multiplexing (OFDM) channels under adversarial conditions show that SecDiff outperforms existing secure and generative JSCC baselines by achieving a favorable trade-off between reconstruction quality and computational cost. This balance makes SecDiff a promising step toward practical, low-latency, and attack-resilient semantic communications.

[243] EPAN: Robust Pedestrian Re-Identification via Enhanced Alignment Network for IoT Surveillance

Zhiyang Jia,Hongyan Cui,Ge Gao,Bo Li,Minjie Zhang,Zishuo Gao,Huiwen Huang,Caisheng Zhuo

Main category: cs.CV

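TL;DR: This paper proposes the Enhanced Pedestrian Alignment Network (EPAN) for person re-identification in IoT environments. Its dual-branch architecture improves feature extraction under varying viewpoints and environmental changes, reaching 90.09% Rank-1 accuracy and 78.82% mAP on the Inspection-Personnel dataset.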

Details Motivation: To address the drop in person re-identification performance caused by viewpoint and environmental changes in IoT surveillance, and to improve cross-camera recognition robustness. Method: Uses a dual-branch network that extracts alignment features at multiple scales and viewpoints, improving the model's adaptability to complex scenes. Result: Reaches 90.09% Rank-1 accuracy and 78.82% mAP on the Inspection-Personnel dataset, clearly outperforming existing methods. Conclusion: EPAN shows strong person re-identification performance in complex IoT surveillance scenarios and has potential for real deployment. Abstract: Person re-identification (ReID) plays a pivotal role in computer vision, particularly in surveillance and security applications within IoT-enabled smart environments. This study introduces the Enhanced Pedestrian Alignment Network (EPAN), tailored for robust ReID across diverse IoT surveillance conditions. EPAN employs a dual-branch architecture to mitigate the impact of perspective and environmental changes, extracting alignment information under varying scales and viewpoints. Here, we demonstrate EPAN's strong feature extraction capabilities, achieving outstanding performance on the Inspection-Personnel dataset with a Rank-1 accuracy of 90.09% and a mean Average Precision (mAP) of 78.82%. This highlights EPAN's potential for real-world IoT applications, enabling effective and reliable person ReID across diverse cameras in surveillance and security systems. The code and data are available at: https://github.com/ggboy2580/EPAN

[244] SE(3)-PoseFlow: Estimating 6D Pose Distributions for Uncertainty-Aware Robotic Manipulation

Yufeng Jin,Niklas Funk,Vignesh Prasad,Zechu Li,Mathias Franzius,Jan Peters,Georgia Chalvatzaki

Main category: cs.CV

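TL;DR: Proposes a new probabilistic framework based on flow matching on the SE(3) manifold for estimating 6D object pose distributions. It handles the uncertainty caused by symmetry and occlusion and reaches SOTA performance on multiple datasets.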

Details Motivation: Existing deterministic deep networks struggle to capture the multi-modality of pose distributions, perform poorly under partial observation, occlusion, and object symmetry, and lack the ability to model uncertainty. Method: Uses flow matching on the SE(3) manifold to model the probability distribution of 6D poses, representing the distribution with samples so that pose uncertainty is expressed explicitly. Result: Achieves state-of-the-art results on the Real275, YCB-V, and LM-O datasets, and shows the method's potential in robotic manipulation tasks such as active perception and uncertainty-aware grasp planning. Conclusion: The proposed method models the multi-modal distribution of 6D poses, improves robustness under occlusion and symmetry, and supports uncertainty reasoning in downstream robotic tasks. Abstract: Object pose estimation is a fundamental problem in robotics and computer vision, yet it remains challenging due to partial observability, occlusions, and object symmetries, which inevitably lead to pose ambiguity and multiple hypotheses consistent with the same observation. While deterministic deep networks achieve impressive performance under well-constrained conditions, they are often overconfident and fail to capture the multi-modality of the underlying pose distribution. To address these challenges, we propose a novel probabilistic framework that leverages flow matching on the SE(3) manifold for estimating 6D object pose distributions. Unlike existing methods that regress a single deterministic output, our approach models the full pose distribution with a sample-based estimate and enables reasoning about uncertainty in ambiguous cases such as symmetric objects or severe occlusions. We achieve state-of-the-art results on Real275, YCB-V, and LM-O, and demonstrate how our sample-based pose estimates can be leveraged in downstream robotic manipulation tasks such as active perception for disambiguating uncertain viewpoints or guiding grasp synthesis in an uncertainty-aware manner.

[245] Discriminately Treating Motion Components Evolves Joint Depth and Ego-Motion Learning

Mengtan Zhang,Zizhan Guo,Hongbo Zhao,Yi Feng,Zuyi Xiong,Yue Wang,Shaoyi Du,Hanli Wang,Rui Fan

Main category: cs.CV

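TL;DR: This paper proposes DiMoDE, a joint depth and ego-motion learning framework that treats motion components discriminatively and exploits their geometric regularities, substantially improving unsupervised depth and ego-motion estimation, especially under challenging conditions.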

Details Motivation: Existing methods usually treat ego-motion as an auxiliary task and fail to make full use of geometric constraints, which limits the reliability and robustness of depth and ego-motion estimation. Method: Treats motion components discriminatively: by aligning the camera optical axes and imaging planes, transforming the optical flow, and quantifying deviations, geometric constraints are imposed on each motion component separately. Joint learning is further reformulated into coaxial and coplanar forms, in which closed-form geometric relationships allow depth and translation components to be derived from each other. Result: DiMoDE reaches state-of-the-art performance on multiple public datasets and a newly collected real-world dataset, especially under challenging conditions. Conclusion: By introducing geometric constraints specific to each motion component and reformulating the learning process, DiMoDE improves the accuracy and robustness of unsupervised depth and ego-motion estimation. Abstract: Unsupervised learning of depth and ego-motion, two fundamental 3D perception tasks, has made significant strides in recent years. However, most methods treat ego-motion as an auxiliary task, either mixing all motion types or excluding depth-independent rotational motions in supervision. Such designs limit the incorporation of strong geometric constraints, reducing reliability and robustness under diverse conditions. This study introduces a discriminative treatment of motion components, leveraging the geometric regularities of their respective rigid flows to benefit both depth and ego-motion estimation. Given consecutive video frames, network outputs first align the optical axes and imaging planes of the source and target cameras. Optical flows between frames are transformed through these alignments, and deviations are quantified to impose geometric constraints individually on each ego-motion component, enabling more targeted refinement. These alignments further reformulate the joint learning process into coaxial and coplanar forms, where depth and each translation component can be mutually derived through closed-form geometric relationships, introducing complementary constraints that improve depth robustness. DiMoDE, a general depth and ego-motion joint learning framework incorporating these designs, achieves state-of-the-art performance on multiple public datasets and a newly collected diverse real-world dataset, particularly under challenging conditions. Our source code will be publicly available at mias.group/DiMoDE upon publication.

[246] Luminance-Aware Statistical Quantization: Unsupervised Hierarchical Learning for Illumination Enhancement

Derong Kong,Zhixiong Yang,Shengxi Li,Shuaifeng Zhi,Li Liu,Zhen Liu,Jingyuan Xia

Main category: cs.CV

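TL;DR: This paper proposes Luminance-Aware Statistical Quantification (LASQ), a new framework for low-light image enhancement that models luminance transitions as a power-law distribution in intensity space. Stratified sampling and a diffusion process enable unsupervised enhancement without normal-light references, with strong performance and cross-scenario generalization both with and without reference images.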

Details Motivation: Existing low-light enhancement methods mostly rely on paired data for deterministic pixel mapping and ignore the continuous physical process of luminance change in real environments, so generalization degrades when normal-light references are unavailable. Method: Proposes the LASQ framework, which reformulates luminance enhancement as a statistical sampling process over hierarchical luminance distributions; power functions approximate the power-law distribution in intensity space, and a diffusion forward process adaptively discovers optimal transition paths between luminance layers, achieving unsupervised distribution emulation. Result: Substantially improves enhancement without normal-light references and is well suited to practical use; it performs strongly on domain-specific datasets with reference images and generalizes better across non-reference datasets. Conclusion: By statistically modeling the luminance transition process, LASQ overcomes the limits of traditional deterministic mapping and offers a more adaptable and general paradigm for low-light image enhancement. Abstract: Low-light image enhancement (LLIE) faces persistent challenges in balancing reconstruction fidelity with cross-scenario generalization. While existing methods predominantly focus on deterministic pixel-level mappings between paired low/normal-light images, they often neglect the continuous physical process of luminance transitions in real-world environments, leading to performance drop when normal-light references are unavailable. Inspired by empirical analysis of natural luminance dynamics revealing power-law distributed intensity transitions, this paper introduces Luminance-Aware Statistical Quantification (LASQ), a novel framework that reformulates LLIE as a statistical sampling process over hierarchical luminance distributions. Our LASQ re-conceptualizes luminance transition as a power-law distribution in intensity coordinate space that can be approximated by stratified power functions, thereby replacing deterministic mappings with probabilistic sampling over continuous luminance layers. A diffusion forward process is designed to autonomously discover optimal transition paths between luminance layers, achieving unsupervised distribution emulation without normal-light references. In this way, it considerably improves the performance in practical situations, enabling more adaptable and versatile light restoration. This framework is also readily applicable to cases with normal-light references, where it achieves superior performance on domain-specific datasets alongside better generalization ability across non-reference datasets.

[247] Example-Based Feature Painting on Textures

Andrei-Timotei Ardelean,Tim Weyrich

Main category: cs.CV

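TL;DR: Proposes a learning-based system that uses unsupervised anomaly detection to automatically identify and edit textures with local features (stains, tears, holes, and so on), enabling interactive texture generation and editing.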

Details Motivation: Generating more realistic textures requires including the surface blemishes and local variations that are ubiquitous in nature, but existing methods usually depend on manual annotation and lack automation and control. Method: Uses a diffusion-based learning approach trained on unlabeled examples; unsupervised anomaly detection identifies appearance-altering features, which are clustered into semantically coherent groups and used for conditional image generation; the full pipeline goes from a small set of images to generation of textures of arbitrary size. Result: Achieves fully automatic modeling of texture blemishes from a small image collection, supports interactive creation and painting of texture features, and introduces algorithms for diffusion-based editing and infinite stationary texture generation that apply in other contexts. Conclusion: The method needs no manual annotation, generates realistic textures with a variety of local blemishes, and is general and broadly applicable. Abstract: In this work, we propose a system that covers the complete workflow for achieving controlled authoring and editing of textures that present distinctive local characteristics. These include various effects that change the surface appearance of materials, such as stains, tears, holes, abrasions, discoloration, and more. Such alterations are ubiquitous in nature, and including them in the synthesis process is crucial for generating realistic textures. We introduce a novel approach for creating textures with such blemishes, adopting a learning-based approach that leverages unlabeled examples. Our approach does not require manual annotations by the user; instead, it detects the appearance-altering features through unsupervised anomaly detection. The various textural features are then automatically clustered into semantically coherent groups, which are used to guide the conditional generation of images. Our pipeline as a whole goes from a small image collection to a versatile generative model that enables the user to interactively create and paint features on textures of arbitrary size. Notably, the algorithms we introduce for diffusion-based editing and infinite stationary texture generation are generic and should prove useful in other contexts as well. Project page: https://reality.tf.fau.de/pub/ardelean2025examplebased.html

[248] NSYNC: Negative Synthetic Image Generation for Contrastive Training to Improve Stylized Text-To-Image Translation

Serkan Ozturk,Samet Hicsonmez,Pinar Duygulu

Main category: cs.CV

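TL;DR: This paper proposes NSYNC, a new contrastive learning framework that improves the stylization ability of large text-to-image diffusion models. It introduces negative synthetic data to refine gradient updates so that the model captures distinctive artistic styles better.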

Details Motivation: Existing text-to-image methods struggle to capture specific artistic styles precisely, and even fine-tuning helps only to a limited extent; meanwhile, although synthetic data is widely used in training, its potential for stylization tasks has not been fully explored. Method: Proposes the NSYNC framework, which uses a diffusion model to generate negative synthetic images and pairs them with real positive samples for contrastive training; gradients are computed separately for positive and negative samples, and the model is updated after removing the projection of the positive gradient onto the negative gradient, keeping the orthogonal, style-sensitive component. Result: Experiments on datasets covering the styles of multiple painters and illustrators show that the method outperforms baselines both on quantitative metrics (such as FID and CLIP score) and in visual quality, reproducing the target style more accurately. Conclusion: Through its contrastive gradient refinement mechanism, NSYNC improves the stylization ability of text-to-image models and shows that contrastive learning with negative synthetic data is feasible and efficient. Abstract: Current text conditioned image generation methods output realistic looking images, but they fail to capture specific styles. Simply finetuning them on the target style datasets still struggles to grasp the style features. In this work, we present a novel contrastive learning framework to improve the stylization capability of large text-to-image diffusion models. Motivated by the astonishing advance in image generation models that makes synthetic data an intrinsic part of model training in various computer vision tasks, we exploit synthetic image generation in our approach. Usually, the generated synthetic data is dependent on the task, and most of the time it is used to enlarge the available real training dataset. With NSYNC, alternatively, we focus on generating negative synthetic sets to be used in a novel contrastive training scheme along with real positive images. In our proposed training setup, we forward negative data along with positive data and obtain negative and positive gradients, respectively. We then refine the positive gradient by subtracting its projection onto the negative gradient to get the orthogonal component, based on which the parameters are updated. This orthogonal component eliminates the trivial attributes that are present in both positive and negative data and directs the model towards capturing a more unique style. Experiments on various styles of painters and illustrators show that our approach improves the performance over the baseline methods both quantitatively and qualitatively. Our code is available at https://github.com/giddyyupp/NSYNC.
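
The gradient refinement step described in the abstract, subtracting from the positive gradient its projection onto the negative gradient, can be sketched on flat vectors as follows. This is a minimal illustration with plain lists; the function name `orthogonal_component` and the `eps` guard are assumptions, and a real training loop would apply the same operation to model parameter gradients.

```python
def orthogonal_component(g_pos, g_neg, eps=1e-12):
    """Return g_pos minus its projection onto g_neg.

    g_pos: gradient from real (positive) style images.
    g_neg: gradient from negative synthetic images.
    The result is orthogonal to g_neg, so the update ignores directions
    shared by positive and negative data.
    """
    dot = sum(p * n for p, n in zip(g_pos, g_neg))
    norm_sq = sum(n * n for n in g_neg)
    if norm_sq < eps:
        # No usable negative direction: keep the positive gradient as is.
        return list(g_pos)
    scale = dot / norm_sq
    return [p - scale * n for p, n in zip(g_pos, g_neg)]


g_pos = [3.0, 4.0]
g_neg = [1.0, 0.0]
g_update = orthogonal_component(g_pos, g_neg)
print(g_update)
print(sum(u * n for u, n in zip(g_update, g_neg)))
```

In the toy example the shared first coordinate is removed and only the second coordinate, which the negative gradient does not touch, drives the update.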

[249] Driving scenario generation and evaluation using a structured layer representation and foundational models

Arthur Hubert,Gamal Elghazaly,Raphaël Frank

Main category: cs.CV

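TL;DR: Proposes a structured five-layer model to improve the evaluation and generation of rare driving scenarios, uses large foundation models with a data augmentation strategy to generate new scenarios, and introduces diversity and originality scores to assess the relevance of synthetic datasets.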

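Details Motivation: Rare and challenging driving scenarios are critical for autonomous vehicle development, but because they are hard to encounter in reality they must be simulated with generative models. Existing representations do not adequately support generating and evaluating such scenarios. Method: Designs a structured five-layer model that introduces subclasses and characteristics for every agent in a scenario, and combines it with large foundation models and a data augmentation strategy to generate new driving scenarios; a dedicated embedding is used to compare scenarios, and two metrics, a diversity score and an originality score, are proposed for evaluating synthetic datasets. Result: The metrics are validated under different generation setups, together with a qualitative evaluation of synthetic videos generated from structured descriptions; experiments indicate that the method improves both generation quality and evaluation of rare driving scenarios. Conclusion: The proposed five-layer structured model helps represent, generate, and evaluate rare driving scenarios, and combining large models with tailored evaluation metrics offers a new direction for data augmentation in autonomous driving. Abstract: Rare and challenging driving scenarios are critical for autonomous vehicle development. Since they are difficult to encounter, simulating or generating them using generative models is a popular approach. Following previous efforts to structure driving scenario representations in a layer model, we propose a structured five-layer model to improve the evaluation and generation of rare scenarios. We use this model alongside large foundational models to generate new driving scenarios using a data augmentation strategy. Unlike previous representations, our structure introduces subclasses and characteristics for every agent of the scenario, allowing us to compare them using an embedding specific to our layer-model. We study and adapt two metrics to evaluate the relevance of a synthetic dataset in the context of a structured representation: the diversity score estimates how different the scenarios of a dataset are from one another, while the originality score calculates how similar a synthetic dataset is to a real reference set. This paper showcases both metrics in different generation setups, as well as a qualitative evaluation of synthetic videos generated from structured scenario descriptions. The code and extended results can be found at https://github.com/Valgiz/5LMSG.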

[250] PCD-ReID: Occluded Person Re-Identification for Base Station Inspection

Ge Gao,Zishuo Gao,Hongyan Cui,Zhiyang Jia,Zhuang Luo,ChaoPeng Liu

Main category: cs.CV

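TL;DR: Proposes PCD-ReID, a Transformer-based algorithm for occluded person re-identification in base station environments. It extracts shared component features and is trained on a self-collected real-world patrol surveillance dataset, substantially improving recognition under occlusion.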

Details Motivation: Traditional ResNet-based person re-identification methods handle occlusion poorly and struggle when key body features are hidden, so a more robust method is needed to improve accuracy in complex scenes. Method: Designs a Transformer-based PCD (Pedestrian Component Discrepancy) network that extracts shared component features such as helmets and uniforms; also collects a real-world patrol surveillance dataset with 10,000 individuals and over 50,000 images for training, to reduce overfitting on public datasets. Result: In experiments the method reaches 79.0% mAP and 82.7% Rank-1 accuracy, a 15.9% Rank-1 improvement over ResNet50-based methods, and shows strong occlusion-aware re-identification in tower inspection scenarios. Conclusion: PCD-ReID handles person re-identification under occlusion effectively and has potential for deployment in real security and surveillance systems. Abstract: Occluded pedestrian re-identification (ReID) in base station environments is a critical task in computer vision, particularly for surveillance and security applications. This task faces numerous challenges, as occlusions often obscure key body features, increasing the complexity of identification. Traditional ResNet-based ReID algorithms often fail to address occlusions effectively, necessitating new ReID methods. We propose the PCD-ReID (Pedestrian Component Discrepancy) algorithm to address these issues. The contributions of this work are as follows: To tackle the occlusion problem, we design a Transformer-based PCD network capable of extracting shared component features, such as helmets and uniforms. To mitigate overfitting on public datasets, we collected new real-world patrol surveillance images for model training, covering six months, 10,000 individuals, and over 50,000 images. Comparative experiments with existing ReID algorithms demonstrate that our model achieves a mean Average Precision (mAP) of 79.0% and a Rank-1 accuracy of 82.7%, marking a 15.9% Rank-1 improvement over ResNet50-based methods. Experimental evaluations indicate that PCD-ReID effectively achieves occlusion-aware ReID performance for personnel in tower inspection scenarios, highlighting its potential for practical deployment in surveillance and security applications.
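
Rank-1 accuracy and mAP, the two metrics quoted above, can be computed from ranked retrieval results as in the sketch below. This is a simplified illustration: it assumes each query's gallery has already been sorted by similarity and reduced to match flags, and it omits the same-camera filtering that standard ReID protocols usually apply. The function name `rank1_and_map` is hypothetical.

```python
def rank1_and_map(ranked_matches):
    """Compute Rank-1 accuracy and mean Average Precision.

    ranked_matches: one list per query; each list is the gallery sorted
    from most to least similar, holding True where the gallery image
    shares the query's identity.
    """
    rank1_hits = 0
    ap_sum = 0.0
    for matches in ranked_matches:
        # Rank-1: is the top-ranked gallery image a correct match?
        if matches and matches[0]:
            rank1_hits += 1
        # Average precision: mean of precision at each correct match.
        hits = 0
        precision_sum = 0.0
        for position, is_match in enumerate(matches, start=1):
            if is_match:
                hits += 1
                precision_sum += hits / position
        ap_sum += precision_sum / hits if hits else 0.0
    n = len(ranked_matches)
    return rank1_hits / n, ap_sum / n


queries = [
    [True, False, True],  # top result correct, second match at rank 3
    [False, True],        # top result wrong, match at rank 2
]
rank1, mean_ap = rank1_and_map(queries)
print(rank1, mean_ap)
```

Rank-1 only checks the top result, while mAP rewards placing every correct match early, which is why the two numbers can differ noticeably for the same model.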

[251] NOA: a versatile, extensible tool for AI-based organoid analysis

Mikhail Konov,Lion J. Gleiter,Khoa Co,Monica Yabal,Tingying Peng

Main category: cs.CV

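TL;DR: This paper introduces the Napari Organoid Analyzer (NOA), a general-purpose graphical user interface that simplifies AI-based organoid image analysis and lowers the barrier for biologists without programming experience.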

Details Motivation: Existing AI tools for organoid image analysis are hard to access and narrow in function, which leaves biologists dependent on tedious manual workflows. Method: Develops NOA, an open-source napari plugin that integrates modules for detection, segmentation, tracking, feature extraction, custom annotation, and ML-based prediction, and interfaces with multiple state-of-the-art algorithms. Result: Three case studies demonstrate NOA's versatility: quantifying morphological changes during organoid differentiation, assessing phototoxicity effects, and predicting organoid viability and differentiation state. Conclusion: NOA provides an accessible, flexible, and extensible framework for comprehensive AI-driven organoid image analysis, helping to automate organoid research and make it more widely available. Abstract: AI tools can greatly enhance the analysis of organoid microscopy images, from detection and segmentation to feature extraction and classification. However, their limited accessibility to biologists without programming experience remains a major barrier, resulting in labor-intensive and largely manual workflows. Although a few AI models for organoid analysis have been developed, most existing tools remain narrowly focused on specific tasks. In this work, we introduce the Napari Organoid Analyzer (NOA), a general purpose graphical user interface to simplify AI-based organoid analysis. NOA integrates modules for detection, segmentation, tracking, feature extraction, custom feature annotation and ML-based feature prediction. It interfaces multiple state-of-the-art algorithms and is implemented as an open-source napari plugin for maximal flexibility and extensibility. We demonstrate the versatility of NOA through three case studies, involving the quantification of morphological changes during organoid differentiation, assessment of phototoxicity effects, and prediction of organoid viability and differentiation state. Together, these examples illustrate how NOA enables comprehensive, AI-driven organoid image analysis within an accessible and extensible framework.

[252] PixelVLA: Advancing Pixel-level Understanding in Vision-Language-Action Model

Wenqi Liang,Gan Sun,Yao He,Jiahua Dong,Suyan Dai,Ivan Laptev,Salman Khan,Yang Cong

Main category: cs.CV

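TL;DR: This paper proposes PixelVLA, the first vision-language-action model to support both pixel-level reasoning and multimodal (text and visual) prompting. It introduces a multiscale pixel-aware encoder and a visual prompting encoder and is trained on Pixel-160K, a dataset produced by automated annotation; it raises manipulation success rates on multiple benchmarks while sharply reducing pretraining cost.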

Details Motivation: Existing vision-language-action models fall short in pixel-level scene understanding and rely heavily on textual prompts, which limits their flexibility in real-world settings. A new model is therefore needed that supports both fine-grained pixel-level understanding and multimodal input. Method: Proposes PixelVLA, which combines a multiscale pixel-aware encoder with a visual prompting encoder in a visuomotor instruction tuning framework, and designs a two-stage automated annotation pipeline that produces Pixel-160K, a large-scale dataset with pixel-level annotations, for training. Result: Experiments on three standard VLA benchmarks and two model variants show that PixelVLA improves manipulation success rates by 10.1%-17.8% over OpenVLA while requiring only 1.5% of its pretraining cost. Conclusion: PixelVLA improves the accuracy, efficiency, and flexibility of robot control in complex environments, can be integrated into existing VLA models, and advances the learning of more general visuomotor policies. Abstract: Vision-Language-Action models (VLAs) are emerging as powerful tools for learning generalizable visuomotor control policies. However, current VLAs are mostly trained on large-scale image-text-action data and remain limited in two key ways: (i) they struggle with pixel-level scene understanding, and (ii) they rely heavily on textual prompts, which reduces their flexibility in real-world settings. To address these challenges, we introduce PixelVLA, the first VLA model designed to support both pixel-level reasoning and multimodal prompting with text and visual inputs. Our approach is built on a new visuomotor instruction tuning framework that integrates a multiscale pixel-aware encoder with a visual prompting encoder. To train PixelVLA effectively, we further propose a two-stage automated annotation pipeline that generates Pixel-160K, a large-scale dataset with pixel-level annotations derived from existing robot data. Experiments on three standard VLA benchmarks and two VLA model variants show that PixelVLA improves manipulation success rates by 10.1%-17.8% over OpenVLA, while requiring only 1.5% of its pretraining cost. These results demonstrate that PixelVLA can be integrated into existing VLAs to enable more accurate, efficient, and versatile robot control in complex environments. The dataset and code will be released as open source.

[253] Generative Adversarial Synthesis and Deep Feature Discrimination of Brain Tumor MRI Images

Md Sumon Ali,Muzammil Behzad

Main category: cs.CV

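TL;DR: Proposes a deep-learning-based DC-GAN method for generating synthetic MRI data and uses a CNN classifier to validate the quality and utility of the synthetic images; results show performance comparable to real data on brain tumor classification.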

Details Motivation: Because real MRI data is limited, generating realistic medical images is challenging, and effective methods for augmenting datasets are needed. Method: Uses a Deep Convolutional Generative Adversarial Network (DC-GAN) to generate synthetic MRI data and a Convolutional Neural Network (CNN) to classify brain tumors on real and synthetic MRI images, thereby evaluating the quality and usability of the synthetic data. Result: The CNN achieves comparable classification performance on synthetic and real data, indicating that the GAN-generated MRI images are of high quality and useful for downstream tasks. Conclusion: DC-GAN can effectively generate high-quality synthetic MRI data, easing the scarcity of medical imaging data and supporting subsequent diagnostic tasks. Abstract: Compared to traditional methods, Deep Learning (DL) has become a key technology for computer vision tasks. Synthetic data generation is an interesting use case for DL, especially in the field of medical imaging such as Magnetic Resonance Imaging (MRI). This task is needed because the original MRI data is limited. The generation of realistic medical images is difficult and challenging. Generative Adversarial Networks (GANs) are useful for creating synthetic medical images. In this paper, we propose a DL-based methodology for creating synthetic MRI data using the Deep Convolutional Generative Adversarial Network (DC-GAN) to address the problem of limited data. We also employ a Convolutional Neural Network (CNN) classifier to classify brain tumors using synthetic data and real MRI data. The CNN is used to evaluate the quality and utility of the synthetic images. The classification result demonstrates comparable performance on real and synthetic images, which validates the effectiveness of GAN-generated images for downstream tasks.

[254] Wave-Particle (Continuous-Discrete) Dualistic Visual Tokenization for Unified Understanding and Generation

Yizhu Chen,Chen Ju,Zhicheng Wang,Shuai Xiao,Xu Chen,Jinsong Lan,Xiaoyong Zhu,Ying Chen

Main category: cs.CV

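TL;DR: Proposes the Continuous-Discrete Dualistic Visual Tokenizer (CDD-VT), which adaptively allocates the number of image primitives, emulating discrete tokenization on simple samples and approximating continuous tokenization on complex ones, to unify multimodal understanding and generation.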

Details Motivation: Resolve the tension between the engineering complexity of continuous tokenizers (CT) and the information loss of discrete tokenizers (DT), and unify understanding and generation in multimodal large models. Method: Designs Diverse Quantitative Primitives to strengthen primitive orthogonality and a Dynamic Primitive Allocator that assigns the number of primitives dynamically according to sample complexity, unifying continuous and discrete tokenization. Result: Outperforms specialized CT and DT methods on reconstruction, retrieval, and classification tasks. Conclusion: CDD-VT balances the strengths of continuous and discrete tokenization and offers a new approach to building concise, scalable multimodal large models. Abstract: The unification of understanding and generation within a single multi-modal large model (MLLM) remains a significant challenge, largely due to the dichotomy between continuous and discrete visual tokenizations. Continuous tokenizers (CT) achieve strong performance by bridging multiple independently-trained understanding modules and generation modules, but suffer from complex multi-stage pipelines and substantial engineering overhead. Conversely, discrete tokenizers (DT) offer a conceptually elegant idea by quantizing each image into a primitive, but inevitably lead to information loss and performance degradation. To resolve this tension, we question the binary choice between CT and DT, inspired by the wave-particle duality of light, and propose the Continuous-Discrete Dualistic Visual Tokenizer (CDD-VT). We treat visual data as a flexible composition of image primitives derived from quantized codebooks, with the crucial insight that the primitive number assigned to each visual sample is adaptively determined according to its complexity: simple instances use a few primitives, emulating discrete tokenization, while complex instances use many, approximating continuous tokenization. Two core components are designed: Diverse Quantitative Primitives, which encourage primitive orthogonality to better populate information space, and Dynamic Primitive Allocator, which assesses sample complexity to determine the optimal set of primitives. Extensive experiments on reconstruction, retrieval and classification show that CDD-VT achieves superior performance over specialized CT and DT, effectively achieving strong results within a concise and scalable MLLM.

[255] Lite ENSAM: a lightweight cancer segmentation model for 3D Computed Tomography

Agnar Martin Bjørnstad,Elias Stenhede,Arian Ranjbar

Main category: cs.CV

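TL;DR: This paper proposes Lite ENSAM, a lightweight model for efficient volumetric tumor segmentation from CT scans, trained using only RECIST annotations; it performs well in the MICCAI FLARE 2025 challenge while keeping computational resource consumption low.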

Details Motivation: Manual volumetric annotation is time-consuming and labor-intensive, so although volumetric measurements assess treatment effect more accurately, their clinical use is limited; a method is needed that performs automatic, efficient volumetric segmentation from routine single-diameter RECIST annotations. Method: Proposes Lite ENSAM, a lightweight version of the ENSAM architecture trained on CT images with RECIST annotations for fast, low-resource volumetric tumor segmentation. Result: Achieves a Dice Similarity Coefficient of 60.7% and a Normalized Surface Dice of 63.6% on the hidden test set of MICCAI FLARE 2025, with an average inference time of 14.4 s on CPU and an average total RAM time of 50.6 GBs on the public validation set. Conclusion: Lite ENSAM achieves efficient volumetric tumor segmentation using only RECIST annotations and with low computational requirements, showing potential to advance the use of volumetric assessment in clinical practice. Abstract: Accurate tumor size measurement is a cornerstone of evaluating cancer treatment response. The most widely adopted standard for this purpose is the Response Evaluation Criteria in Solid Tumors (RECIST) v1.1, which relies on measuring the longest tumor diameter in a single plane. However, volumetric measurements have been shown to provide a more reliable assessment of treatment effect. Their clinical adoption has been limited, though, due to the labor-intensive nature of manual volumetric annotation. In this paper, we present Lite ENSAM, a lightweight adaptation of the ENSAM architecture designed for efficient volumetric tumor segmentation from CT scans annotated with RECIST annotations. Lite ENSAM was submitted to the MICCAI FLARE 2025 Task 1: Pan-cancer Segmentation in CT Scans, Subtask 2, where it achieved a Dice Similarity Coefficient (DSC) of 60.7% and a Normalized Surface Dice (NSD) of 63.6% on the hidden test set, and an average total RAM time of 50.6 GBs and an average inference time of 14.4 s on CPU on the public validation dataset.
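
For readers unfamiliar with the Dice Similarity Coefficient reported above, a minimal sketch of the standard definition DSC = 2|A∩B| / (|A| + |B|) on binary masks follows. The function name and the flattened-list mask representation are illustrative assumptions; this is not the Lite ENSAM or FLARE evaluation code.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice Similarity Coefficient between two flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    # Voxels that are foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground voxels across both masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total
```

A prediction that covers the single target voxel plus one extra voxel scores 2·1 / (2 + 1) ≈ 0.667, which shows how over-segmentation is penalized.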

[256] DINO-MX: A Modular & Flexible Framework for Self-Supervised Learning

Mahmut Selman Gokmen,Cody Bumgardner

Main category: cs.CV

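TL;DR: DINO-MX is a modular, extensible training framework for vision foundation models that supports multiple Transformer architectures and self-supervised learning strategies, balancing efficiency and generality.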

Details Motivation: Existing training pipelines for vision foundation models are often inflexible, domain-specific, or computationally expensive, which limits their use across domains and in resource-constrained settings. Method: Integrates the core ideas of the DINO family into a unified configuration-driven framework that supports strategies such as LoRA, layer freezing, and knowledge distillation, and is compatible with DDP and FSDP distributed training. Result: Achieves competitive performance on multiple datasets while significantly reducing computational cost, improves attention-based localization, and provides interpretability tools. Conclusion: DINO-MX provides a reproducible, scalable foundation for developing, adapting, and benchmarking self-supervised vision models, suitable for a wide range of research and practical applications. Abstract: Vision Foundation Models (VFMs) have advanced representation learning through self-supervised methods. However, existing training pipelines are often inflexible, domain-specific, or computationally expensive, which limits their usability across different domains and resource settings. DINO-MX is a modular and extensible training framework that combines the core principles of DINO, DINOv2 and DINOv3 within a unified configuration-driven system. It supports a variety of transformer-based architectures and is fully compatible with the Hugging Face ecosystem. The framework includes multiple training strategies such as low-rank adaptation (LoRA), layer freezing, and knowledge distillation, along with support for distributed training through both Distributed Data Parallel (DDP) and Fully Sharded Data Parallel (FSDP). DINO-MX is designed to work with both natural and specialized data types, including single- and multi-channel images. Experimental results on diverse datasets show that DINO-MX achieves competitive performance while significantly reducing computational costs. Additionally, it offers interpretability tools and a label-guided data augmentation method that improves attention-based localization without the need for extra detection or segmentation heads. DINO-MX provides a reproducible and scalable foundation for developing, adapting, and benchmarking self-supervised vision models across a range of research and real-world applications.

[257] Benchmark-Ready 3D Anatomical Shape Classification

Tomáš Krsička,Tibor Kubík

Main category: cs.CV

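TL;DR: This paper proposes Precomputed Structural Pooling (PSPooling) for graph autoencoders in 3D anatomical shape analysis, improving reconstruction accuracy and classification performance, and builds the MedShapeNet19 benchmark dataset to advance medical 3D shape classification research.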

Details Motivation: The complexity of 3D mesh data and the lack of standardized benchmarks limit existing anatomical shape classification methods, so more robust learning methods and a reproducible evaluation setup are needed. Method: Proposes PSPooling, a non-learnable mesh pooling operator that precomputes node correspondence sets based on geometric proximity, enabling efficient, structure-preserving graph coarsening; integrates it into a self-supervised graph autoencoder and evaluates it on the newly built MedShapeNet19 dataset. Result: Experiments show that PSPooling significantly improves reconstruction fidelity and classification accuracy in low-label regimes and is suitable for high-resolution medical meshes; MedShapeNet19 provides a standardized benchmark for anatomical shape classification. Conclusion: PSPooling is an efficient, reversible, structure-preserving pooling method that, together with the MedShapeNet19 dataset, establishes a strong baseline for medical 3D shape learning and may promote standardization and further progress in the field. Abstract: Progress in anatomical 3D shape classification is limited by the complexity of mesh data and the lack of standardized benchmarks, highlighting the need for robust learning methods and reproducible evaluation. We introduce two key steps toward clinically and benchmark-ready anatomical shape classification via self-supervised graph autoencoding. We propose Precomputed Structural Pooling (PSPooling), a non-learnable mesh pooling operator designed for efficient and structure-preserving graph coarsening in 3D anatomical shape analysis. PSPooling precomputes node correspondence sets based on geometric proximity, enabling parallelizable and reversible pooling and unpooling operations with guaranteed support structure. This design avoids the sparsity and reconstruction issues of selection-based methods and the sequential overhead of edge contraction approaches, making it particularly suitable for high-resolution medical meshes. To demonstrate its effectiveness, we integrate PSPooling into a self-supervised graph autoencoder that learns anatomy-aware representations from unlabeled surface meshes. We evaluate the downstream benefits on MedShapeNet19, a new curated benchmark dataset we derive from MedShapeNet, consisting of 19 anatomical classes with standardized training, validation, and test splits. Experiments show that PSPooling significantly improves reconstruction fidelity and classification accuracy in low-label regimes, establishing a strong baseline for medical 3D shape learning.
We hope that MedShapeNet19 will serve as a widely adopted benchmark for anatomical shape classification and further research in medical 3D shape analysis. Access the complete codebase, model weights, and dataset information here: https://github.com/TomasKrsicka/MedShapeNet19-PSPooling.

[258] Vote-in-Context: Turning VLMs into Zero-Shot Rank Fusers

Mohamed Eltahir,Ali Habibullah,Lama Ayash,Tanveer Hussain,Naeemullah Khan

Main category: cs.CV

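TL;DR: This paper proposes Vote-in-Context (ViC), a general training-free framework that uses a vision-language model (VLM) for list-wise reranking and fusion in video retrieval; by serializing content evidence and retriever metadata in the prompt, it achieves strong zero-shot performance.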

Details Motivation: Conventional fusion methods rely only on rank or score signals and ignore the candidates' representations, which limits them on complex multi-modal data such as video; a better fusion method that combines content with retriever consensus is needed. Method: Proposes the Vote-in-Context (ViC) framework, which uses the S-Grid to represent each video as an image grid and serializes it, together with information such as subtitles, into the VLM prompt so that the VLM can make adaptive list-wise reranking and fusion decisions. Result: On video retrieval benchmarks including ActivityNet, VATEX, and MSR-VTT, ViC significantly outperforms strong baselines such as CombSUM in the zero-shot setting, with Recall@1 gains of up to +40 and state-of-the-art performance of 99.6% (v2t) on VATEX. Conclusion: ViC is a simple, reproducible, and effective zero-shot reranking and fusion method that makes good use of modern VLMs for complex video-text retrieval tasks. Abstract: In the retrieval domain, candidates' fusion from heterogeneous retrievers is a long-standing challenge, particularly for complex, multi-modal data such as videos. While typical fusion techniques are training-free, they rely solely on rank or score signals, disregarding candidates' representations. This work introduces Vote-in-Context (ViC), a generalized, training-free framework that re-thinks list-wise reranking and fusion as a zero-shot reasoning task for a Vision-Language Model (VLM). The core insight is to serialize both content evidence and retriever metadata directly within the VLM's prompt, allowing the model to adaptively weigh retriever consensus against visual-linguistic content. We demonstrate the generality of this framework by applying it to the challenging domain of cross-modal video retrieval. To this end, we introduce the S-Grid, a compact serialization map that represents each video as an image grid, optionally paired with subtitles to enable list-wise reasoning over video candidates. ViC is evaluated both as a single-list reranker, where it dramatically improves the precision of individual retrievers, and as an ensemble fuser, where it consistently outperforms strong baselines like CombSUM. Across video retrieval benchmarks including ActivityNet and VATEX, the framework establishes new state-of-the-art zero-shot retrieval performance, demonstrating its effectiveness in handling complex visual and temporal signals alongside text.
In zero-shot settings, ViC achieves Recall@1 scores of 87.1% (t2v) / 89.0% (v2t) on MSR-VTT and 99.6% (v2t) on VATEX, representing massive gains of up to +40 Recall@1 over previous state-of-the-art baselines. We present ViC as a simple, reproducible, and highly effective recipe for turning modern VLMs into powerful zero-shot rerankers and fusers. Code and resources are publicly available at: https://github.com/mohammad2012191/ViC
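
To make the score-only baseline mentioned above concrete, here is a minimal sketch of CombSUM-style fusion: each retriever's scores are min-max normalized and then summed per candidate. The function names, the choice of min-max normalization, and the toy data are assumptions for illustration; this is not the ViC code, and it shows exactly the kind of fusion that ignores candidate content.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one retriever's scores to [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:  # all candidates tied: treat them as equally relevant
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def comb_sum(runs: List[Dict[str, float]]) -> List[Tuple[str, float]]:
    """Fuse several retriever runs by summing normalized scores per candidate."""
    fused: Dict[str, float] = defaultdict(float)
    for run in runs:
        for doc, score in min_max_normalize(run).items():
            fused[doc] += score
    # Highest fused score first; ties broken by candidate id for determinism.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy example with two hypothetical retrievers on different score scales.
ranking = comb_sum([
    {"a": 0.9, "b": 0.5, "c": 0.1},
    {"b": 10.0, "c": 8.0, "d": 2.0},
])
```

Candidate `b` ranks first here because both retrievers score it reasonably well, even though neither the video frames nor the subtitles were consulted.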

[259] Enhancing Diffusion-based Restoration Models via Difficulty-Adaptive Reinforcement Learning with IQA Reward

Xiaogang Xu,Ruihang Chu,Jian Wang,Kun Zhou,Wenjie Shu,Harry Yang,Ser-Nam Lim,Hao Chen,Liang Lin

Main category: cs.CV

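TL;DR: This paper proposes a new method for effectively integrating reinforcement learning (RL) into diffusion-based image restoration models, using an image quality assessment (IQA) model as the reward function and dynamically combining RL with supervised fine-tuning (SFT) according to sample difficulty, which improves performance.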

Details Motivation: Applying existing RL methods directly to diffusion-based image restoration models works poorly, because restoration emphasizes fidelity and therefore has a different objective from generation; a dedicated strategy for integrating RL is needed. Method: Uses an IQA model based on a multimodal large language model (MLLM) as the reward function, prioritizes RL for hard samples that lie far from the ground-truth distribution, and adaptively combines it with SFT during training, with an automatic weighting mechanism adjusting the balance between the two. Result: Experiments on multiple image restoration benchmarks show that the proposed method significantly improves the restoration performance of diffusion models, generalizes well, and is plug-and-play. Conclusion: The study validates the effectiveness of IQA-based RL for diffusion-based image restoration and proposes a dynamic, adaptive RL+SFT framework that offers a workable path for follow-up work. Abstract: Reinforcement Learning (RL) has recently been incorporated into diffusion models, e.g., for tasks such as text-to-image generation. However, directly applying existing RL methods to diffusion-based image restoration models is suboptimal, as the objective of restoration fundamentally differs from that of pure generation: it places greater emphasis on fidelity. In this paper, we investigate how to effectively integrate RL into diffusion-based restoration models. First, through extensive experiments with various reward functions, we find that an effective reward can be derived from an Image Quality Assessment (IQA) model, instead of intuitive ground-truth-based supervision, which has already been optimized during the Supervised Fine-Tuning (SFT) stage prior to RL. Moreover, our strategy focuses on using RL for challenging samples that are significantly distant from the ground truth, and our RL approach is implemented using MLLM-based IQA models to align distributions with high-quality images initially. As the samples approach the ground truth's distribution, RL is adaptively combined with SFT for more fine-grained alignment. This dynamic process is facilitated through an automatic weighting strategy that adjusts based on the relative difficulty of the training samples. Our strategy is plug-and-play and can be seamlessly applied to diffusion-based restoration models, boosting their performance across various restoration tasks. Extensive experiments across multiple benchmarks demonstrate the effectiveness of our proposed RL framework.

[260] UniLumos: Fast and Unified Image and Video Relighting with Physics-Plausible Feedback

Ropeway Liu,Hangjie Yuan,Bo Dong,Jiazheng Xing,Jinwang Wang,Rui Zhao,Yan Xing,Weihua Chen,Fan Wang

Main category: cs.CV

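TL;DR: This paper proposes UniLumos, a unified image and video relighting framework that brings RGB-space geometry feedback (such as depth and normal maps) into training to make lighting effects more physically plausible, combines it with path consistency learning for efficient few-step training, and designs a six-dimensional lighting annotation protocol and the LumosBench benchmark for fine-grained controllability evaluation; experiments show state-of-the-art quality and speed.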

Details Motivation: Existing diffusion-based relighting methods are typically optimized in semantic latent space and lack guarantees of physical consistency in visual space, leading to problems such as overexposed highlights and misaligned shadows; a method that incorporates geometric structure feedback is needed to improve realism and controllability. Method: Proposes the UniLumos framework, which uses depth and normal maps extracted from the model's outputs as geometry feedback to supervise a flow matching backbone; adopts path consistency learning so that supervision remains effective under few-step training; designs a six-dimensional lighting annotation protocol and builds the LumosBench benchmark, which uses large vision-language models for disentangled, attribute-level automatic evaluation. Result: UniLumos achieves state-of-the-art quality on image and video relighting with significantly improved physical consistency, along with a 20x speedup over conventional multi-step denoising; LumosBench enables interpretable, fine-grained evaluation of lighting control. Conclusion: By introducing visual-space geometry feedback and path consistency learning, UniLumos significantly improves the physical plausibility and control precision of relighting results while keeping generation efficient, providing a practical and reliable solution for image and video editing. Abstract: Relighting is a crucial task with both practical demand and artistic value, and recent diffusion models have shown strong potential by enabling rich and controllable lighting effects. However, as they are typically optimized in semantic latent space, where proximity does not guarantee physical correctness in visual space, they often produce unrealistic results, such as overexposed highlights, misaligned shadows, and incorrect occlusions. We address this with UniLumos, a unified relighting framework for both images and videos that brings RGB-space geometry feedback into a flow matching backbone. By supervising the model with depth and normal maps extracted from its outputs, we explicitly align lighting effects with the scene structure, enhancing physical plausibility. Nevertheless, this feedback requires high-quality outputs for supervision in visual space, making standard multi-step denoising computationally expensive. To mitigate this, we employ path consistency learning, allowing supervision to remain effective even under few-step training regimes. To enable fine-grained relighting control and supervision, we design a structured six-dimensional annotation protocol capturing core illumination attributes. Building upon this, we propose LumosBench, a disentangled attribute-level benchmark that evaluates lighting controllability via large vision-language models, enabling automatic and interpretable assessment of relighting precision across individual dimensions.
Extensive experiments demonstrate that UniLumos achieves state-of-the-art relighting quality with significantly improved physical consistency, while delivering a 20x speedup for both image and video relighting. Code is available at https://github.com/alibaba-damo-academy/Lumos-Custom.

[261] Progressive Translation of H&E to IHC with Enhanced Structural Fidelity

Yuhang Kang,Ziyu Su,Tianyang Wang,Zaibo Li,Wei Chen,Muhammad Khalid Khan Niazi

Main category: cs.CV

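TL;DR: Proposes a progressive structure-color-cell boundary generation network for synthesizing high-quality IHC-equivalent images from H&E images, significantly improving visual quality and structural detail.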

Details Motivation: Existing stain translation methods yield poor image quality because of interdependence among loss terms, and struggle to preserve structural authenticity and color fidelity at the same time. Method: Designs a progressive network architecture that optimizes structure, color, and cell boundary generation in decoupled stages; introduces new loss functions based on DAB chromogen concentration and image gradient, and builds the model on the ASP framework. Result: Experiments on the HER2 and ER datasets show that the method significantly improves the visual quality of generated IHC images, enhancing color fidelity and cell boundary clarity and preserving finer structural details. Conclusion: The proposed progressive generation mechanism resolves the multi-objective conflict in conventional stain translation methods and offers a new approach to low-cost, high-accuracy virtual IHC staining. Abstract: Compared to hematoxylin-eosin (H&E) staining, immunohistochemistry (IHC) not only maintains the structural features of tissue samples, but also provides high-resolution protein localization, which is essential for aiding in pathology diagnosis. Despite its diagnostic value, IHC remains a costly and labor-intensive technique. Its limited scalability and constraints in multiplexing further hinder widespread adoption, especially in resource-limited settings. Consequently, researchers are increasingly exploring computational stain translation techniques to synthesize IHC-equivalent images from H&E-stained slides, aiming to extract protein-level information more efficiently and cost-effectively. However, most existing stain translation techniques rely on a linearly weighted summation of multiple loss terms within a single objective function, a strategy that often overlooks the interdependence among these components, resulting in suboptimal image quality and an inability to simultaneously preserve structural authenticity and color fidelity. To address this limitation, we propose a novel network architecture that follows a progressive structure, incorporating color and cell border generation logic, which enables each visual aspect to be optimized in a stage-wise and decoupled manner. To validate the effectiveness of our proposed network architecture, we build upon the Adaptive Supervised PatchNCE (ASP) framework as our baseline. We introduce additional loss functions based on 3,3'-diaminobenzidine (DAB) chromogen concentration and image gradient, enhancing color fidelity and cell boundary clarity in the generated IHC images.
By reconstructing the generation pipeline using our structure-color-cell boundary progressive mechanism, experiments on HER2 and ER datasets demonstrated that the model significantly improved visual quality and achieved finer structural details.

[262] Learnable Fractional Reaction-Diffusion Dynamics for Under-Display ToF Imaging and Beyond

Xin Qiao,Matteo Poggi,Xing Wei,Pengchao Deng,Yanhui Zhou,Stefano Mattoccia

Main category: cs.CV

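TL;DR: Proposes LFRD2, a hybrid framework that improves depth sensing quality for under-display ToF imaging by combining neural networks with physical modeling; its effectiveness is validated on multiple datasets.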

Details Motivation: Under-display ToF imaging is severely affected by signal attenuation, multi-path interference, and temporal noise introduced by the TOLED layers, which degrades depth quality; an effective restoration method is needed. Method: Proposes Learnable Fractional Reaction-Diffusion Dynamics (LFRD2), which combines neural networks with a physical model, uses a time-fractional reaction-diffusion module for iterative depth refinement with dynamically generated differential orders, and introduces a continuous convolution operator based on coefficient prediction and repeated differentiation to improve restoration quality. Result: Experiments on four benchmark datasets show that the method significantly improves depth restoration quality and outperforms existing methods. Conclusion: By combining learnable fractional-order dynamics with an efficient continuous convolution, LFRD2 achieves high-accuracy depth restoration for under-display ToF imaging and shows good application prospects. Abstract: Under-display ToF imaging aims to achieve accurate depth sensing through a ToF camera placed beneath a screen panel. However, transparent OLED (TOLED) layers introduce severe degradations, such as signal attenuation, multi-path interference (MPI), and temporal noise, that significantly compromise depth quality. To alleviate this drawback, we propose Learnable Fractional Reaction-Diffusion Dynamics (LFRD2), a hybrid framework that combines the expressive power of neural networks with the interpretability of physical modeling. Specifically, we implement a time-fractional reaction-diffusion module that enables iterative depth refinement with dynamically generated differential orders, capturing long-term dependencies. In addition, we introduce an efficient continuous convolution operator via coefficient prediction and repeated differentiation to further improve restoration quality. Experiments on four benchmark datasets demonstrate the effectiveness of our approach. The code is publicly available at https://github.com/wudiqx106/LFRD2.

[263] Probabilistic Robustness for Free? Revisiting Training via a Benchmark

Yi Zhang,Zheng Wang,Chen Zhen,Wenjie Ruan,Qing Guo,Siddartha Khastgir,Carsten Maple,Xingyu Zhao

Main category: cs.CV

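TL;DR: This paper proposes PRBench, the first benchmark dedicated to evaluating how much different robustness training methods improve probabilistic robustness (PR). Comparing mainstream adversarial training (AT) and PR-targeted methods on a comprehensive set of metrics, it finds that AT methods are more versatile at improving both adversarial robustness and PR, while PR-targeted methods have lower generalization error and higher clean accuracy.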

Details Motivation: Although probabilistic robustness (PR) is regarded as a practical complement to adversarial robustness (AR), training methods that specifically target PR remain underexplored, and they suffer from inconsistent evaluation protocols, limited comparison with strong baselines, and the lack of a unified framework for assessing generalization. Method: Builds the PRBench benchmark and systematically evaluates a range of AT and PR-targeted training methods with comprehensive metrics covering clean accuracy, PR and AR performance, training efficiency, and generalization error, accompanied by theoretical analysis. Result: Across different hyperparameter settings, AT methods are more versatile than PR-targeted methods at improving AR and PR, while the latter consistently show lower generalization error and higher clean accuracy. Conclusion: PRBench provides a standardized platform for evaluating PR, reveals the strengths and limitations of existing methods, and supports the development of future robustness training methods. Abstract: Deep learning models are notoriously vulnerable to imperceptible perturbations. Most existing research centers on adversarial robustness (AR), which evaluates models under worst-case scenarios by examining the existence of deterministic adversarial examples (AEs). In contrast, probabilistic robustness (PR) adopts a statistical perspective, measuring the probability that predictions remain correct under stochastic perturbations. While PR is widely regarded as a practical complement to AR, dedicated training methods for improving PR are still relatively underexplored, albeit with emerging progress. Among the few PR-targeted training methods, we identify three limitations: (i) non-comparable evaluation protocols; (ii) limited comparisons to strong AT baselines despite anecdotal PR gains from AT; and (iii) no unified framework to compare the generalization of these methods. Thus, we introduce PRBench, the first benchmark dedicated to evaluating improvements in PR achieved by different robustness training methods. PRBench empirically compares most common AT and PR-targeted training methods using a comprehensive set of metrics, including clean accuracy, PR and AR performance, training efficiency, and generalization error (GE). We also provide theoretical analysis on the GE of PR performance across different training methods. Main findings revealed by PRBench include: AT methods are more versatile than PR-targeted training methods in terms of improving both AR and PR performance across diverse hyperparameter settings, while PR-targeted training methods consistently yield lower GE and higher clean accuracy.
A leaderboard comprising 222 trained models across 7 datasets and 10 model architectures is publicly available at https://tmpspace.github.io/PRBenchLeaderboard/.
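
The abstract defines PR as the probability that a prediction stays correct under stochastic perturbations. A minimal Monte Carlo sketch of that quantity follows; the uniform noise in an L-infinity ball of radius `eps`, the function name `estimate_pr`, and the toy classifier are all assumptions made for illustration, not the PRBench protocol.

```python
import random
from typing import Callable, Sequence


def estimate_pr(
    predict: Callable[[Sequence[float]], int],
    x: Sequence[float],
    label: int,
    eps: float,
    n_samples: int = 1000,
    seed: int = 0,
) -> float:
    """Monte Carlo estimate of P[predict(x + delta) == label].

    Assumption: delta is drawn uniformly from [-eps, eps] per coordinate.
    """
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_samples):
        perturbed = [xi + rng.uniform(-eps, eps) for xi in x]
        if predict(perturbed) == label:
            correct += 1
    return correct / n_samples


def toy_classifier(v: Sequence[float]) -> int:
    """Hypothetical classifier: class 1 if the feature sum is positive."""
    return 1 if sum(v) > 0 else 0


# Far from the decision boundary, the estimate is 1.0; at the boundary it
# falls to roughly one half, unlike a worst-case (adversarial) check.
pr_far = estimate_pr(toy_classifier, [1.0, 1.0], label=1, eps=0.1)
pr_edge = estimate_pr(toy_classifier, [0.0], label=1, eps=1.0, n_samples=2000)
```

The contrast with AR is that a single adversarial example inside the ball makes a point non-robust in the worst-case sense, whereas PR only records how often random perturbations flip the prediction.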

[264] Toward Strategy Identification and Subtask Decomposition In Task Exploration

Tom Odem

Main category: cs.CV

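TL;DR: Proposes a task explorer pipeline that uses clustering, factor analysis, and string edit distance to automatically identify the key global and local strategies and subtasks used to complete tasks, which helps in understanding and modeling user knowledge, skill, and behavior.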

Details Motivation: To improve a machine's understanding of user knowledge, skill, and behavior in pursuit of implicit coordination between humans and machines. Method: Develops a task explorer pipeline that combines clustering techniques, factor analysis, and string edit distance to automatically identify global and local strategies in tasks as well as meaningful subtasks of various lengths, and builds hierarchical subtask structures. Result: Automatically identifies the key strategies used to complete tasks, encodes user runs with hierarchical subtask structures, and provides a visualization application for reviewing results. Conclusion: The pipeline applies broadly to action-based time-series data and supports deeper understanding and modeling of user behavior in human-machine interaction. Abstract: This research builds on work in anticipatory human-machine interaction, a subfield of human-machine interaction where machines can facilitate advantageous interactions by anticipating a user's future state. The aim of this research is to further a machine's understanding of user knowledge, skill, and behavior in pursuit of implicit coordination. A task explorer pipeline was developed that uses clustering techniques, paired with factor analysis and string edit distance, to automatically identify key global and local strategies that are used to complete tasks. Global strategies identify generalized sets of actions used to complete tasks, while local strategies identify sequences that use those sets of actions in a similar composition. Additionally, meaningful subtasks of various lengths are identified within the tasks. The task explorer pipeline was able to automatically identify key strategies used to complete tasks and encode user runs with hierarchical subtask structures. In addition, a Task Explorer application was developed to easily review pipeline results. The task explorer pipeline can be easily adapted to any action-based time-series data, and the identified strategies and subtasks help to inform humans and machines on user knowledge, skill, and behavior.
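
The abstract names string edit distance as one ingredient for comparing action sequences. A minimal sketch of the classic Levenshtein distance over sequences of action tokens follows; the action names are hypothetical, and how the pipeline actually feeds these distances into clustering is not specified in the abstract.

```python
from typing import Hashable, Sequence


def edit_distance(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(
                prev[j] + 1,         # delete x from a
                curr[j - 1] + 1,     # insert y into a
                prev[j - 1] + cost,  # substitute x with y (or match)
            ))
        prev = curr
    return prev[-1]


# Two hypothetical user runs that differ by one skipped action.
run_1 = ["open", "cut", "close"]
run_2 = ["open", "close"]
distance = edit_distance(run_1, run_2)
```

Runs with a small distance use nearly the same actions in nearly the same order, which is the intuition behind grouping them into one local strategy.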

[265] CGF-DETR: Cross-Gated Fusion DETR for Enhanced Pneumonia Detection in Chest X-rays

Yefeng Wu,Yucheng Song,Ling Wu,Shan Wan,Yecheng Zhao

Main category: cs.CV

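TL;DR: This paper proposes CGF-DETR, a real-time detection Transformer for pneumonia detection that introduces XFABlock, the SPGA module, and the GCFC3 structure; it achieves 82.2% mAP@0.5 on the RSNA dataset, outperforming the RT-DETR-l baseline while maintaining an inference speed of 48.1 FPS.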

Details Motivation: Although Transformer-based detectors perform well in object detection, their application to pneumonia detection in chest X-rays remains limited, and more efficient and accurate automated systems are needed. Method: Proposes the CGF-DETR model with three key modules: XFABlock (convolutional attention combined with CSP architecture), SPGA (dynamic gating with single-head self-attention), and GCFC3 (multi-path convolution fusion with structural re-parameterization), to improve the efficiency of feature extraction and aggregation. Result: On the RSNA dataset, CGF-DETR reaches 82.2% mAP@0.5, 3.7% higher than RT-DETR-l, while maintaining 48.1 FPS; the complete model reaches 50.4% mAP@[0.5:0.95]. Conclusion: CGF-DETR improves the accuracy and efficiency of pneumonia detection; ablation studies confirm that each module contributes meaningfully to performance, and the model is suitable for real-time lesion detection in medical images. Abstract: Pneumonia remains a leading cause of morbidity and mortality worldwide, necessitating accurate and efficient automated detection systems. While recent transformer-based detectors like RT-DETR have shown promise in object detection tasks, their application to medical imaging, particularly pneumonia detection in chest X-rays, remains underexplored. This paper presents CGF-DETR, an enhanced real-time detection transformer specifically designed for pneumonia detection. We introduce XFABlock in the backbone to improve multi-scale feature extraction through convolutional attention mechanisms integrated with CSP architecture. To achieve efficient feature aggregation, we propose the SPGA module, which replaces standard multi-head attention with dynamic gating mechanisms and single-head self-attention. Additionally, GCFC3 is designed for the neck to enhance feature representation through multi-path convolution fusion while maintaining real-time performance via structural re-parameterization. Extensive experiments on the RSNA Pneumonia Detection dataset demonstrate that CGF-DETR achieves 82.2% mAP@0.5, outperforming the baseline RT-DETR-l by 3.7% while maintaining comparable inference speed at 48.1 FPS. Our ablation studies confirm that each proposed module contributes meaningfully to the overall performance improvement, with the complete model achieving 50.4% mAP@[0.5:0.95].
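
The mAP@0.5 figure above counts a detection as correct when its box overlaps a ground-truth box with intersection-over-union (IoU) of at least 0.5. A minimal sketch of that overlap test follows; the `(x1, y1, x2, y2)` box convention and the helper names are assumptions for illustration, and the full mAP computation (confidence ranking, precision-recall integration) is omitted.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def box_area(box: Box) -> float:
    """Area of an axis-aligned box; degenerate boxes have zero area."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def matches_at(pred: Box, truth: Box, threshold: float = 0.5) -> bool:
    """True if the predicted box would count as a hit at the given IoU threshold."""
    return box_iou(pred, truth) >= threshold
```

The stricter mAP@[0.5:0.95] metric averages results over a range of IoU thresholds from 0.5 to 0.95, which is why it is much lower than mAP@0.5 for the same model.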

[266] 3EED: Ground Everything Everywhere in 3D

Rong Li,Yuhao Dong,Tianshuai Hu,Ao Liang,Youquan Liu,Dongyue Lu,Liang Pan,Lingdong Kong,Junwei Liang,Ziwei Liu

Main category: cs.CV

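TL;DR: Proposes 3EED, a large-scale, multi-platform, multi-modal 3D visual grounding benchmark with over 128,000 objects and 22,000 validated referring expressions, supporting cross-platform learning and evaluation.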

Details Motivation: Existing 3D visual grounding benchmarks are limited to indoor scenes, single platforms, and small scale, which falls short of what embodied agents need in open-world environments. Method: Builds a multi-platform (vehicle, drone, quadruped robot) 3D grounding benchmark with RGB and LiDAR data; uses an annotation pipeline that combines vision-language model prompting with human verification; proposes platform-aware normalization and cross-modal alignment to support cross-platform learning. Result: Provides 10x more data than existing datasets; establishes benchmark protocols for cross-platform and in-domain evaluation; experiments reveal significant performance gaps, indicating that generalizable 3D grounding remains challenging. Conclusion: 3EED offers a more realistic and more challenging evaluation platform for language-driven 3D embodied perception and promotes research on cross-platform, generalizable 3D visual grounding. Abstract: Visual grounding in 3D is the key for embodied agents to localize language-referred objects in open-world environments. However, existing benchmarks are limited to indoor focus, single-platform constraints, and small scale. We introduce 3EED, a multi-platform, multi-modal 3D grounding benchmark featuring RGB and LiDAR data from vehicle, drone, and quadruped platforms. We provide over 128,000 objects and 22,000 validated referring expressions across diverse outdoor scenes, 10x larger than existing datasets. We develop a scalable annotation pipeline combining vision-language model prompting with human verification to ensure high-quality spatial grounding. To support cross-platform learning, we propose platform-aware normalization and cross-modal alignment techniques, and establish benchmark protocols for in-domain and cross-platform evaluations. Our findings reveal significant performance gaps, highlighting the challenges and opportunities of generalizable 3D grounding. The 3EED dataset and benchmark toolkit are released to advance future research in language-driven 3D embodied perception.

[267] HGFreNet: Hop-hybrid GraphFomer for 3D Human Pose Estimation with Trajectory Consistency in Frequency Domain

Kai Zhai,Ziyan Huang,Qiang Nie,Xiang Li,Bo Ouyang

Main category: cs.CV

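TL;DR: Proposes HGFreNet, a GraphFormer architecture that combines hop-hybrid graph attention with frequency-domain 3D trajectory consistency to improve the accuracy and temporal consistency of 2D-to-3D human pose estimation in monocular video.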

Details Motivation: When dealing with 2D pose estimation errors and depth ambiguity, existing methods struggle to keep 3D trajectories temporally coherent and overlook the global spatial-temporal correlations of skeletal joint motion. Method: Designs HGFreNet, which comprises a hop-hybrid graph attention (HGA) module and a Transformer encoder; HGA aggregates k-hop neighborhood information to enlarge the receptive field, a trajectory consistency constraint is imposed in the frequency domain, and a preliminary network provides cross-frame 3D information for depth inference. Result: Experiments on the Human3.6M and MPI-INF-3DHP datasets show that HGFreNet outperforms current SOTA methods in positional accuracy and temporal consistency. Conclusion: By modeling global spatial-temporal correlations and optimizing trajectories in the frequency domain, HGFreNet improves 2D-to-3D pose lifting performance. Abstract: 2D-to-3D human pose lifting is a fundamental challenge for 3D human pose estimation in monocular video, where graph convolutional networks (GCNs) and attention mechanisms have proven to be inherently suitable for encoding the spatial-temporal correlations of skeletal joints. However, depth ambiguity and errors in 2D pose estimation lead to incoherence in the 3D trajectory. Previous studies have attempted to restrict jitters in the time domain, for instance, by constraining the differences between adjacent frames while neglecting the global spatial-temporal correlations of skeletal joint motion. To tackle this problem, we design HGFreNet, a novel GraphFormer architecture with hop-hybrid feature aggregation and 3D trajectory consistency in the frequency domain. Specifically, we propose a hop-hybrid graph attention (HGA) module and a Transformer encoder to model global joint spatial-temporal correlations. The HGA module groups all $k$-hop neighbors of a skeletal joint into a hybrid group to enlarge the receptive field and applies the attention mechanism to discover the latent correlations of these groups globally. We then exploit global temporal correlations by constraining trajectory consistency in the frequency domain. To provide 3D information for depth inference across frames and maintain coherence over time, a preliminary network is applied to estimate the 3D pose. Extensive experiments were conducted on two standard benchmark datasets: Human3.6M and MPI-INF-3DHP. The results demonstrate that the proposed HGFreNet outperforms state-of-the-art (SOTA) methods in terms of positional accuracy and temporal consistency.
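
One way to picture a frequency-domain trajectory constraint is to compare the predicted and reference trajectories of a single joint coordinate after a discrete cosine transform along time. The sketch below does that with an unnormalized DCT-II and a mean absolute difference over the lowest `keep` coefficients. The choice of DCT, the L1 penalty, and the `keep` cutoff are all assumptions for illustration; the abstract does not specify the loss HGFreNet actually uses.

```python
import math
from typing import Optional, Sequence


def dct2(signal: Sequence[float]) -> list:
    """Unnormalized DCT-II of a 1D signal (O(n^2), fine for short clips)."""
    n = len(signal)
    return [
        sum(x * math.cos(math.pi * (t + 0.5) * k / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def frequency_consistency_loss(
    pred: Sequence[float],
    target: Sequence[float],
    keep: Optional[int] = None,
) -> float:
    """Mean absolute difference between the first `keep` DCT coefficients.

    `pred` and `target` are one coordinate of one joint over time.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("trajectories must be non-empty and equally long")
    n = len(pred)
    keep = n if keep is None else max(1, min(keep, n))
    p, q = dct2(pred), dct2(target)
    return sum(abs(pk - qk) for pk, qk in zip(p[:keep], q[:keep])) / keep
```

Because each coefficient depends on every frame, a penalty like this constrains the whole trajectory at once, in contrast to time-domain penalties on adjacent-frame differences.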

[268] Wonder3D++: Cross-domain Diffusion for High-fidelity 3D Generation from a Single Image

Yuxiao Yang,Xiao-Xiao Long,Zhiyang Dou,Cheng Lin,Yuan Liu,Qingsong Yan,Yuexin Ma,Haoqian Wang,Zhiqiang Wu,Wei Yin

Main category: cs.CV

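TL;DR: This paper proposes Wonder3D++, a new method for efficiently generating high-fidelity textured meshes from single-view images; it combines a cross-domain diffusion model with multi-view attention and outperforms prior methods in quality, consistency, and efficiency.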

Details Motivation: Existing SDS-based methods suffer from time-consuming optimization and inconsistent geometry, while fast network-inference methods produce low-quality results lacking detail; a better balance among quality, consistency, and efficiency is needed. Method: Proposes a cross-domain diffusion model that generates multi-view normal maps and the corresponding color images, uses a multi-view cross-domain attention mechanism to keep generation consistent, and designs a cascaded 3D mesh extraction algorithm that produces high-quality surfaces in about 3 minutes in a coarse-to-fine manner. Result: Experiments show that the method outperforms prior methods in reconstruction quality, generalization, and efficiency, producing a high-quality 3D mesh in about 3 minutes. Conclusion: Wonder3D++ unifies quality, consistency, and efficiency in single-view 3D reconstruction and provides an effective solution for efficient high-fidelity 3D generation. Abstract: In this work, we introduce Wonder3D++, a novel method for efficiently generating high-fidelity textured meshes from single-view images. Recent methods based on Score Distillation Sampling (SDS) have shown the potential to recover 3D geometry from 2D diffusion priors, but they typically suffer from time-consuming per-shape optimization and inconsistent geometry. In contrast, certain works directly produce 3D information via fast network inferences, but their results are often of low quality and lack geometric details. To holistically improve the quality, consistency, and efficiency of single-view reconstruction tasks, we propose a cross-domain diffusion model that generates multi-view normal maps and the corresponding color images. To ensure the consistency of generation, we employ a multi-view cross-domain attention mechanism that facilitates information exchange across views and modalities. Lastly, we introduce a cascaded 3D mesh extraction algorithm that derives high-quality surfaces from the multi-view 2D representations in only about 3 minutes in a coarse-to-fine manner. Our extensive evaluations demonstrate that our method achieves high-quality reconstruction results, robust generalization, and good efficiency compared to prior works. Code available at https://github.com/xxlong0/Wonder3D/tree/Wonder3D_Plus.

[269] UniLION: Towards Unified Autonomous Driving Model with Linear Group RNNs

Zhe Liu,Jinghua Hou,Xiaoqing Ye,Jingdong Wang,Hengshuang Zhao,Xiang Bai

Main category: cs.CV

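TL;DR: This paper proposes UniLION, a unified autonomous driving model built on the linear group RNN operator. It efficiently handles large-scale LiDAR point clouds, multi-view images, and temporal data, supports multiple modality and task configurations without explicit fusion modules, and achieves leading performance on 3D perception, prediction, and planning tasks.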

Details Motivation: To address the high computational cost of Transformers on long-sequence data and to model multi-modal, multi-task autonomous driving systems in a unified way, avoiding complex hand-designed fusion modules. Method: The model applies the linear group RNN operator to grouped features and builds a unified architecture, UniLION, that supports LiDAR-only, multi-modal, and temporal fusion configurations without explicit cross-modal or temporal fusion modules. Result: UniLION performs strongly on core tasks including 3D object detection, tracking, occupancy prediction, BEV map segmentation, motion prediction, and end-to-end planning, matching or exceeding existing state-of-the-art methods. Conclusion: UniLION offers a simple yet powerful unified paradigm that simplifies the design of multi-modal, multi-task autonomous driving systems and suggests a new direction for 3D foundation models in autonomous driving. Abstract: Although transformers have demonstrated remarkable capabilities across various domains, their quadratic attention mechanisms introduce significant computational overhead when processing long-sequence data. In this paper, we present a unified autonomous driving model, UniLION, which efficiently handles large-scale LiDAR point clouds, high-resolution multi-view images, and even temporal sequences based on the linear group RNN operator (i.e., performs linear RNN for grouped features). Remarkably, UniLION serves as a single versatile architecture that can seamlessly support multiple specialized variants (i.e., LiDAR-only, temporal LiDAR, multi-modal, and multi-modal temporal fusion configurations) without requiring explicit temporal or multi-modal fusion modules. Moreover, UniLION consistently delivers competitive and even state-of-the-art performance across a wide range of core tasks, including 3D perception (e.g., 3D object detection, 3D object tracking, 3D occupancy prediction, BEV map segmentation), prediction (e.g., motion prediction), and planning (e.g., end-to-end planning). This unified paradigm naturally simplifies the design of multi-modal and multi-task autonomous driving systems while maintaining superior performance. Ultimately, we hope UniLION offers a fresh perspective on the development of 3D foundation models in autonomous driving. Code is available at https://github.com/happinesslz/UniLION
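The phrase "performs linear RNN for grouped features" can be read, very roughly, as running a linear recurrence independently inside each group of tokens, so that the cost grows linearly with the number of tokens rather than quadratically as in attention. The sketch below is one such reading and not UniLION's actual operator: the fixed scalar `decay`, the contiguous grouping by `group_size`, and both function names are assumptions made for illustration, whereas real linear RNN operators generally use learned parameters.

```python
def linear_rnn(sequence, decay=0.9):
    """Linear recurrence h_t = decay * h_{t-1} + (1 - decay) * x_t, per channel."""
    state = [0.0] * len(sequence[0])
    outputs = []
    for x in sequence:
        state = [decay * s + (1.0 - decay) * xi for s, xi in zip(state, x)]
        outputs.append(state)
    return outputs


def linear_group_rnn(tokens, group_size, decay=0.9):
    """Split tokens into contiguous groups and run the recurrence inside each group."""
    outputs = []
    for start in range(0, len(tokens), group_size):
        # The hidden state is reset at every group boundary.
        outputs.extend(linear_rnn(tokens[start:start + group_size], decay))
    return outputs


# Hypothetical input: four one-channel tokens, processed in groups of two.
tokens = [[1.0], [1.0], [1.0], [1.0]]
print(linear_group_rnn(tokens, group_size=2, decay=0.5))
# [[0.5], [0.75], [0.5], [0.75]]
```

Each token is visited once, so the work is proportional to the sequence length, which is the property the abstract contrasts with quadratic attention.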

[270] How Far Are Surgeons from Surgical World Models? A Pilot Study on Zero-shot Surgical Video Generation with Expert Assessment

Zhen Chen,Qing Xu,Jinlin Wu,Biao Yang,Yuhao Zhai,Geng Guo,Jing Zhang,Yinlu Ding,Nassir Navab,Jiebo Luo

Main category: cs.CV

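TL;DR: This paper presents SurgVeo, the first expert-curated benchmark for evaluating surgical video generation models, and the Surgical Plausibility Pyramid (SPP), a framework for systematically assessing the plausibility of generated videos at multiple levels, from appearance to surgical strategy. In a zero-shot test of the Veo-3 model reviewed by four board-certified surgeons, the study finds that although the model is highly realistic visually, it has significant deficiencies at higher levels such as instrument operation, environment feedback, and surgical intent, revealing a "plausibility gap" between visual realism and causal understanding.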

Details Motivation: Although existing video generation foundation models can simulate the physical world, their application to high-stakes domains that require deep, specialized causal knowledge, such as surgery, remains unexplored. General physical rules are not enough to support surgical simulation, so dedicated evaluation methods are needed to measure how plausible models really are in specialized medical settings. Method: The authors propose the SurgVeo benchmark dataset and the four-tiered Surgical Plausibility Pyramid (SPP) evaluation framework, and evaluate the zero-shot generation ability of the advanced Veo-3 model on laparoscopic and neurosurgical videos, with four board-certified surgeons scoring the outputs at each level of the SPP. Result: Veo-3 performs very well at the visual perceptual level but poorly at the higher plausibility levels of instrument operation, environment feedback, and surgical intent, showing that current models imitate surface patterns without a causal understanding of the surgical process. Conclusion: Visually realistic output does not imply causal reasoning ability in the surgical domain. SurgVeo and the SPP provide a first quantitative evaluation basis and a clear direction for developing AI models that are genuinely suited to complex medical scenarios. Abstract: Foundation models in video generation are demonstrating remarkable capabilities as potential world models for simulating the physical world. However, their application in high-stakes domains like surgery, which demand deep, specialized causal knowledge rather than general physical rules, remains a critical unexplored gap. To systematically address this challenge, we present SurgVeo, the first expert-curated benchmark for video generation model evaluation in surgery, and the Surgical Plausibility Pyramid (SPP), a novel, four-tiered framework tailored to assess model outputs from basic appearance to complex surgical strategy. On the basis of the SurgVeo benchmark, we task the advanced Veo-3 model with a zero-shot prediction task on surgical clips from laparoscopic and neurosurgical procedures. A panel of four board-certified surgeons evaluates the generated videos according to the SPP. Our results reveal a distinct "plausibility gap": while Veo-3 achieves exceptional Visual Perceptual Plausibility, it fails critically at higher levels of the SPP, including Instrument Operation Plausibility, Environment Feedback Plausibility, and Surgical Intent Plausibility. This work provides the first quantitative evidence of the chasm between visually convincing mimicry and causal understanding in surgical AI. Our findings from SurgVeo and the SPP establish a crucial foundation and roadmap for developing future models capable of navigating the complexities of specialized, real-world healthcare domains.

[271] PROPEX-RAG: Enhanced GraphRAG using Prompt-Driven Prompt Execution

Tejas Sarnaik,Manan Shah,Ravi Hegde

Main category: cs.CV

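TL;DR: This paper proposes a prompt-driven GraphRAG framework that emphasizes the role of prompt design in entity extraction, fact selection, and passage reranking for multi-hop question answering. It builds a symbolic knowledge graph and combines LLMs with Personalized PageRank for efficient retrieval, achieving state-of-the-art performance on HotpotQA and 2WikiMultiHopQA.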

Details Motivation: Although retrieval-augmented generation (RAG) is widely used to give large language models access to external knowledge, the effect of prompt design on graph-based retrieval and complex reasoning has not been studied in depth. Method: The proposed prompt-driven GraphRAG framework uses prompts to guide the extraction of entities and relations from text and builds a knowledge graph of structured triples, uses LLMs for semantic filtering and answer generation, performs entity-based Personalized PageRank graph traversal for efficient retrieval, and adds prompt-driven reranking during retrieval. Result: The system reaches SOTA performance on HotpotQA and 2WikiMultiHopQA, with F1 scores of 80.7% and 78.9% and Recall@5 scores of 97.1% and 98.1%, respectively, confirming that prompt design is key to improving retrieval accuracy and answer quality. Conclusion: Prompt design plays a crucial role in graph-augmented multi-hop question answering. The work lays a foundation for more efficient and interpretable multi-hop QA systems and advances prompt-aware graph reasoning. Abstract: Retrieval-Augmented Generation (RAG) has become a robust framework for enhancing Large Language Models (LLMs) with external knowledge. Recent advances in RAG have investigated graph-based retrieval for intricate reasoning; however, the influence of prompt design on enhancing the retrieval and reasoning process is still considerably under-examined. In this paper, we present a prompt-driven GraphRAG framework that underscores the significance of prompt formulation in facilitating entity extraction, fact selection, and passage reranking for multi-hop question answering. Our approach creates a symbolic knowledge graph from text data by encoding entities and factual relationships as structured fact triples. We use LLMs selectively during online retrieval to perform semantic filtering and answer generation. We also use entity-guided graph traversal through Personalized PageRank (PPR) to support efficient, scalable retrieval based on the knowledge graph we built. Our system achieves state-of-the-art performance on HotpotQA and 2WikiMultiHopQA, with F1 scores of 80.7% and 78.9%, and Recall@5 scores of 97.1% and 98.1%, respectively. These results show that prompt design is an important part of improving retrieval accuracy and response quality. This research lays the groundwork for more efficient and comprehensible multi-hop question-answering systems, highlighting the importance of prompt-aware graph reasoning.
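Personalized PageRank, which the abstract uses for entity-guided graph traversal, can be sketched as a power iteration in which the restart mass always returns to the seed entities. The following is a generic textbook version under stated assumptions, not the paper's retrieval pipeline: the toy graph, the node names, the damping value of 0.85, and the decision to send the mass of dangling nodes back to the seeds are all illustrative choices.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Power-iteration Personalized PageRank on a dict-of-lists directed graph.

    `graph` maps every node to its list of out-neighbors; `seeds` is a
    non-empty set of nodes that receive all of the restart probability.
    """
    nodes = list(graph)
    seed_weight = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(seed_weight)
    for _ in range(iterations):
        # Restart step: a (1 - damping) share of the mass jumps back to the seeds.
        nxt = {n: (1.0 - damping) * seed_weight[n] for n in nodes}
        for n in nodes:
            out = graph[n]
            if out:
                share = damping * rank[n] / len(out)
                for m in out:
                    nxt[m] += share
            else:
                # Dangling node: return its mass to the seeds (an assumption).
                for m in nodes:
                    nxt[m] += damping * rank[n] * seed_weight[m]
        rank = nxt
    return rank


# Hypothetical entity graph; "A" stands for an entity found in the question.
toy_graph = {"A": ["B", "C"], "B": ["A"], "C": ["A"], "D": ["A"]}
scores = personalized_pagerank(toy_graph, seeds={"A"})
for node, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(node, round(score, 3))
```

In a retrieval setting, passages linked to the highest-scoring entities would be candidates for reranking. Node "D" ends with a score of zero here because nothing links to it and it is not a seed.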

[272] SciTextures: Collecting and Connecting Visual Patterns, Models, and Code Across Science and Art

Sagi Eppel,Alona Strugatski

Main category: cs.CV

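TL;DR: This paper presents the SciTextures dataset, which contains more than 1,200 models and 100,000 texture images from science, technology, and art. It aims to explore the connection between visual patterns and the mechanisms that generate them, and to evaluate how well AI models can understand, simulate, and recreate real-world patterns.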

Details Motivation: Linking visual patterns to the mechanisms that generate them is key to deep visual understanding, and existing datasets lack systematic cross-disciplinary connections. Method: An autonomous AI pipeline collects scientific models and implements them in standardized form, generates the corresponding visual texture images to build a large-scale dataset, and benchmark tasks are designed to evaluate how well AI models identify, infer, and recreate the mechanisms behind visual patterns. Result: Experiments show that current leading vision-language models can understand the physical systems behind visual patterns, and can infer the mechanism from a real image, generate code, and simulate a similar image. Conclusion: SciTextures is a useful tool for studying the relationship between visual patterns and generative mechanisms and helps move AI toward deeper visual understanding. Abstract: The ability to connect visual patterns with the processes that form them represents one of the deepest forms of visual understanding. Textures of clouds and waves, the growth of cities and forests, or the formation of materials and landscapes are all examples of patterns emerging from underlying mechanisms. We present the SciTextures dataset, a large-scale collection of textures and visual patterns from all domains of science, tech, and art, along with the models and code that generate these images. Covering over 1,200 different models and 100,000 images of patterns and textures from physics, chemistry, biology, sociology, technology, mathematics, and art, this dataset offers a way to explore the connection between the visual patterns that shape our world and the mechanisms that produce them. Created by an agentic AI pipeline that autonomously collects and implements models in standardized form, we use SciTextures to evaluate the ability of leading AI models to link visual patterns to the models and code that generate them, and to identify different patterns that emerged from the same process. We also test AI's ability to infer and recreate the mechanisms behind visual patterns by providing a natural image of a real-world pattern and asking the AI to identify, model, and code the mechanism that formed the pattern, then run this code to generate a simulated image that is compared to the real image. These benchmarks show that vision-language models (VLMs) can understand and simulate the physical system behind a visual pattern. The dataset and code are available at: https://zenodo.org/records/17485502

[273] TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images Reasoning

Ming Li,Jike Zhong,Shitian Zhao,Haoquan Zhang,Shaoheng Lin,Yuxiang Lai,Wei Chen,Konstantinos Psounis,Kaipeng Zhang

Main category: cs.CV

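TL;DR: This paper presents TIR-Bench, a comprehensive benchmark of 13 diverse tasks for evaluating the agentic visual reasoning ability of multimodal large models, that is, their use of tools to process and manipulate images within chain-of-thought.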

Details Motivation: Existing benchmarks such as Visual Search test only basic image operations and cannot adequately assess model performance on complex, dynamic, tool-dependent visual reasoning, so a more comprehensive evaluation framework is needed. Method: The authors design TIR-Bench, which covers 13 image processing tasks that require novel tool use, and evaluate 22 multimodal large language models, including open-source models, proprietary models, and models with explicit tool-use augmentation. They also run a pilot study comparing direct and agentic fine-tuning. Result: Experiments show that TIR-Bench is challenging for existing models across the board, that strong performance depends on genuine thinking-with-images capability, and that agentic fine-tuning shows promise. Conclusion: TIR-Bench advances the evaluation of advanced visual reasoning, underlines the importance of tool use in visual chain-of-thought, and gives direction for future model design. Abstract: The frontier of visual reasoning is shifting toward models like OpenAI o3, which can intelligently create and operate tools to transform images for problem-solving, also known as thinking-with-images in chain-of-thought. Yet existing benchmarks fail to fully capture this advanced capability. Even Visual Search, the most common benchmark for current thinking-with-images methods, tests only basic operations such as localization and cropping, offering little insight into more complex, dynamic, and tool-dependent reasoning. We introduce TIR-Bench, a comprehensive benchmark for evaluating agentic thinking-with-images across 13 diverse tasks, each requiring novel tool use for image processing and manipulation in chain-of-thought. We evaluate 22 multimodal large language models (MLLMs), from leading open-sourced and proprietary models to those with explicit tool-use augmentation. Results show that TIR-Bench is universally challenging, and strong performance requires genuine thinking-with-images capabilities. Finally, we present a pilot study comparing direct versus agentic fine-tuning.