cs.CL

[1] A Lightweight LLM Framework for Disaster Humanitarian Information Classification

Han Jinzhen, Kim Jisung, Yang Jong Soo, Yun Hong Sik

Main category: cs.CL

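TL;DR: This paper proposes a lightweight, low-cost framework for classifying disaster tweets. Parameter-efficient fine-tuning (LoRA/QLoRA) of Llama 3.1 8B achieves high accuracy with low resource consumption, while RAG degrades performance because of label noise.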

Details Motivation: Deploying large language models for real-time classification of humanitarian information is difficult in resource-constrained emergency settings, which calls for a lightweight, low-cost, and reliable solution. Method: The authors build a unified dual-task benchmark (humanitarian information categorization plus event type identification) from the HumAID dataset; systematically evaluate prompting strategies, LoRA fine-tuning, and RAG; and use LoRA and QLoRA for parameter-efficient training and deployment. Result: LoRA reaches 79.62% classification accuracy (+37.79% over zero-shot) while training only about 2% of the parameters; QLoRA retains 99.4% of LoRA performance at 50% of the memory cost; RAG degrades performance because of label noise in the retrieved examples. Conclusion: LoRA/QLoRA offers a practical, reproducible route to reliable crisis intelligence systems under limited compute, whereas RAG is not well suited to this kind of fine-grained classification task. Abstract: Timely classification of humanitarian information from social media is critical for effective disaster response. However, deploying large language models (LLMs) for this task faces challenges in resource-constrained emergency settings. This paper develops a lightweight, cost-effective framework for disaster tweet classification using parameter-efficient fine-tuning. We construct a unified experimental corpus by integrating and normalizing the HumAID dataset (76,484 tweets across 19 disaster events) into a dual-task benchmark: humanitarian information categorization and event type identification. Through systematic evaluation of prompting strategies, LoRA fine-tuning, and retrieval-augmented generation (RAG) on Llama 3.1 8B, we demonstrate that: (1) LoRA achieves 79.62% humanitarian classification accuracy (+37.79% over zero-shot) while training only ~2% of parameters; (2) QLoRA enables efficient deployment with 99.4% of LoRA performance at 50% memory cost; (3) contrary to common assumptions, RAG strategies degrade fine-tuned model performance due to label noise from retrieved examples. These findings establish a practical, reproducible pipeline for building reliable crisis intelligence systems with limited computational resources.
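To give a feel for why LoRA trains only a small share of the weights, the sketch below counts the parameters one LoRA adapter adds to a single weight matrix. The 4096 x 4096 shape and rank 16 are illustrative assumptions, not settings reported in the abstract; the overall trainable fraction of a model depends on which matrices are adapted and on the chosen rank.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Hypothetical layer shape and rank, chosen only for illustration.
d_in, d_out, rank = 4096, 4096, 16

full_params = d_in * d_out                       # frozen base weights
adapter_params = lora_param_count(d_in, d_out, rank)  # trainable low-rank factors
fraction = adapter_params / full_params

print(f"frozen weights:         {full_params:,}")
print(f"trainable LoRA weights: {adapter_params:,} ({fraction:.2%} of this matrix)")
```

Because the adapter size grows linearly with the rank while the base matrix grows with the product of its dimensions, the trainable share stays small for large layers.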

[2] From Biased Chatbots to Biased Agents: Examining Role Assignment Effects on LLM Agent Robustness

Linbo Cao, Lihao Sun, Yang Yue

Main category: cs.CL

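TL;DR: This paper presents the first systematic study of how demographic-based persona assignments affect the task performance of large language model (LLM) agents. It finds performance degradation of up to 26.2%, exposing an overlooked vulnerability in current LLM agents: implicit bias and behavioral instability.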

Details Motivation: LLMs increasingly act as autonomous agents whose actions have real-world impact, yet the effect of persona-induced bias on agent behavior and task performance has not been studied systematically, even though it poses more direct operational risks. Method: The authors evaluate widely deployed LLMs on agentic benchmarks spanning strategic reasoning, planning, and technical operations, systematically analyze how different demographic persona settings affect task performance, and test whether the effect holds across task types and model architectures. Result: Demographic-based persona assignments cause substantial performance variation, with degradation of up to 26.2%; the effect appears across task types and model architectures, indicating that simple prompt injections can undermine the reliability of an agent's decisions. Conclusion: Persona assignments introduce implicit biases and increase behavioral volatility, a significant risk for the safe and reliable deployment of LLM agents; persona robustness should be considered in agent design and evaluation. Abstract: Large Language Models (LLMs) are increasingly deployed as autonomous agents capable of actions with real-world impacts beyond text generation. While persona-induced biases in text generation are well documented, their effects on agent task performance remain largely unexplored, even though such effects pose more direct operational risks. In this work, we present the first systematic case study showing that demographic-based persona assignments can alter LLM agents' behavior and degrade performance across diverse domains. Evaluating widely deployed models on agentic benchmarks spanning strategic reasoning, planning, and technical operations, we uncover substantial performance variations, up to 26.2% degradation, driven by task-irrelevant persona cues. These shifts appear across task types and model architectures, indicating that persona conditioning and simple prompt injections can distort an agent's decision-making reliability. Our findings reveal an overlooked vulnerability in current LLM agentic systems: persona assignments can introduce implicit biases and increase behavioral volatility, raising concerns for the safe and robust deployment of LLM agents.

[3] Retrieval-Augmented Self-Taught Reasoning Model with Adaptive Chain-of-Thought for ASR Named Entity Correction

Junjie An, Jingguang Tian, Tianyi Wang, Yu Gao, Xiaofeng Mou, Yi Xu

Main category: cs.CL

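TL;DR: This paper proposes an LLM-based retrieval-augmented generation framework for correcting named entity errors in ASR output. It retrieves candidates with a rephrasing language model and a phonetic-level edit distance, then applies a self-taught reasoning model with adaptive chain-of-thought (A-STAR); on the AISHELL-1 and Homophone datasets it substantially reduces the named entity character error rate.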

Details Motivation: End-to-end ASR systems often misrecognize domain-specific phrases such as named entities, causing severe failures in downstream tasks; existing LLM-based named entity correction methods do not yet fully exploit the reasoning capabilities of LLMs. Method: A retrieval-augmented generation framework with two components: (1) a rephrasing language model (RLM) for named entity recognition, followed by candidate retrieval using a phonetic-level edit distance; (2) a self-taught reasoning model with adaptive chain-of-thought (A-STAR) that dynamically adjusts its reasoning depth. Result: On the AISHELL-1 and Homophone datasets, the named entity character error rate drops by 17.96% and 34.42% respectively, relative to the baseline. Conclusion: The framework improves named entity correction in ASR and supports the case for combining phonetics-aware retrieval with adaptive reasoning. Abstract: End-to-end automatic speech recognition (ASR) systems frequently misrecognize domain-specific phrases like named entities, which can cause catastrophic failures in downstream tasks. A new family of named entity correction methods based on large language models (LLMs) has recently emerged. However, these approaches have yet to fully exploit the sophisticated reasoning capabilities inherent to LLMs. To bridge this gap, we propose a novel retrieval-augmented generation framework for correcting named entity errors in ASR. Our approach consists of two key components: (1) a rephrasing language model (RLM) for named entity recognition, followed by candidate retrieval using a phonetic-level edit distance; and (2) a novel self-taught reasoning model with adaptive chain-of-thought (A-STAR) that dynamically adjusts the depth of its reasoning based on task difficulty. Experiments on the AISHELL-1 and Homophone datasets demonstrate the effectiveness of our method, which achieves relative reductions in the named entity character error rate of 17.96% and 34.42%, respectively, compared to a strong baseline.
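Candidate retrieval by phonetic-level edit distance can be sketched as a Levenshtein distance computed over phoneme tokens instead of characters. This is a minimal illustration, not the authors' implementation: the romanized syllables, the entity names, and the `retrieve` helper are all hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences (here: phoneme tokens)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,         # delete x
                            curr[j - 1] + 1,     # insert y
                            prev[j - 1] + cost)) # substitute or match
        prev = curr
    return prev[-1]


def retrieve(query_phonemes, lexicon, top_k=1):
    """Return the top_k entity names whose phoneme sequences are closest to the query."""
    ranked = sorted(lexicon, key=lambda name: edit_distance(query_phonemes, lexicon[name]))
    return ranked[:top_k]


# Hypothetical entity lexicon: name -> phoneme tokens (romanized syllables for readability).
lexicon = {
    "Zhang Shan": ["zhang", "shan"],
    "Li Si": ["li", "si"],
}

# Hypothetical phoneme sequence of a misrecognized entity in an ASR transcript.
query = ["zhang", "san"]
best = retrieve(query, lexicon)
print(best)
```

Working on phoneme tokens means that homophones and near-homophones end up close to each other even when their written forms share no characters.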

[4] Grandes Modelos de Linguagem Multimodais (MLLMs): Da Teoria à Prática

Neemias da Silva, Júlio C. W. Scholz, John Harrison, Marina Borges, Paulo Ávila, Frances A Santos, Myriam Delgado, Rodrigo Minetto, Thiago H Silva

Main category: cs.CL

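TL;DR: This chapter introduces the fundamentals of Multimodal Large Language Models (MLLMs), representative models, and practical techniques such as preprocessing and prompt engineering, and discusses current challenges and future trends.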

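Details Motivation: As AI advances, multimodal large language models that combine language understanding and generation with perception of images, audio, and other modalities have become a key direction, creating a need for a systematic introduction and practical guidance. Method: A survey-style treatment of the fundamentals and typical architectures of MLLMs, together with practical methods for preprocessing, prompt engineering, and building multimodal pipelines with LangChain and LangGraph. Result: The chapter provides a systematic knowledge framework and a reproducible practical path for MLLMs, with supporting open-source code for further study. Conclusion: MLLMs are an important frontier of AI whose development requires both theoretical depth and engineering practicality; future work is expected to move toward stronger generalization, better cross-modal alignment, and efficient inference. Abstract: Multimodal Large Language Models (MLLMs) combine the natural language understanding and generation capabilities of LLMs with perception skills in modalities such as image and audio, representing a key advancement in contemporary AI. This chapter presents the main fundamentals of MLLMs and emblematic models. Practical techniques for preprocessing, prompt engineering, and building multimodal pipelines with LangChain and LangGraph are also explored. For further practical study, supplementary material is publicly available online: https://github.com/neemiasbsilva/MLLMs-Teoria-e-Pratica. Finally, the chapter discusses the challenges and highlights promising trends.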

[5] propella-1: Multi-Property Document Annotation for LLM Data Curation at Scale

Maximilian Idahl, Benedikt Droste, Björn Plüster, Jan Philipp Harries

Main category: cs.CL

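TL;DR: This paper introduces propella-1, a family of small multilingual LLMs that annotate documents along 18 fine-grained properties, replacing single-score quality assessment. The models support 57 languages and emit structured JSON; the authors also release propella-annotations, a dataset of more than three billion document annotations, which reveals substantial differences in quality, reasoning depth, and content composition across major pretraining corpora.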

Details Motivation: Data curation for LLM pretraining currently relies on a single scalar quality score, which conflates multiple quality dimensions, prevents flexible filtering, and offers no interpretability. Method: The authors design and train propella-1, a family of small multilingual LLMs (0.6B/1.7B/4B) that annotate documents across 18 properties in six categories; the models cover 57 languages and produce JSON output conforming to a predefined schema; the large-scale propella-annotations dataset is generated over major corpora such as FineWeb-2 and FinePDFs. Result: Evaluated against a frontier commercial LLM as the reference annotator, the 4B model achieves higher agreement than much larger general-purpose models; the released propella-annotations dataset contains more than three billion document annotations; multi-dimensional analysis reveals substantial differences in quality, reasoning depth, and content composition across pretraining datasets. Conclusion: Multi-dimensional, fine-grained annotation captures what single-score approaches cannot; propella-1 offers an interpretable, composable, commercially usable approach to data quality assessment, with model weights and annotations released openly. Abstract: Since FineWeb-Edu, data curation for LLM pretraining has predominantly relied on single scalar quality scores produced by small classifiers. A single score conflates multiple quality dimensions, prevents flexible filtering, and offers no interpretability. We introduce propella-1, a family of small multilingual LLMs (0.6B, 1.7B, 4B parameters) that annotate text documents across 18 properties organized into six categories: core content, classification, quality and value, audience and purpose, safety and compliance, and geographic relevance. The models support 57 languages and produce structured JSON annotations conforming to a predefined schema. Evaluated against a frontier commercial LLM as a reference annotator, the 4B model achieves higher agreement than much larger general-purpose models. We release propella-annotations, a dataset of over three billion document annotations covering major pretraining corpora including data from FineWeb-2, FinePDFs, HPLT 3.0, and Nemotron-CC. Using these annotations, we present a multi-dimensional compositional analysis of widely used pretraining datasets, revealing substantial differences in quality, reasoning depth, and content composition that single-score approaches cannot capture. All model weights and annotations are released under permissive, commercial-use licenses.
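The difference between single-score filtering and filtering on structured annotations can be sketched as follows. The JSON field names (`content_quality`, `reasoning_depth`, `safety`) and the thresholds are hypothetical placeholders, not the actual propella-1 schema, which the abstract does not spell out.

```python
import json

# Hypothetical per-document annotations; field names are assumptions for illustration.
raw_annotations = [
    '{"doc_id": "a", "content_quality": 4, "reasoning_depth": 3, "safety": "safe"}',
    '{"doc_id": "b", "content_quality": 5, "reasoning_depth": 1, "safety": "safe"}',
    '{"doc_id": "c", "content_quality": 4, "reasoning_depth": 3, "safety": "unsafe"}',
]
annotations = [json.loads(line) for line in raw_annotations]


def single_score_filter(doc):
    """Keep a document based on one scalar only."""
    return doc["content_quality"] >= 4


def multi_property_filter(doc):
    """Keep a document only if every property of interest passes its own threshold."""
    return (
        doc["content_quality"] >= 4
        and doc["reasoning_depth"] >= 2
        and doc["safety"] == "safe"
    )


kept_single = [d["doc_id"] for d in annotations if single_score_filter(d)]
kept_multi = [d["doc_id"] for d in annotations if multi_property_filter(d)]
print(kept_single, kept_multi)
```

With separate properties, a curator can tighten or relax one criterion (for example reasoning depth) without touching the others, which a single conflated score does not allow.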

[6] RankLLM: Weighted Ranking of LLMs by Quantifying Question Difficulty

Ziqian Zhang, Xingjian Hu, Yue Huang, Kai Zhang, Ruoxi Chen, Yixin Liu, Qingsong Wen, Kaidi Xu, Xiangliang Zhang, Neil Zhenqiang Gong, Lichao Sun

Main category: cs.CL

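TL;DR: This paper proposes RankLLM, a framework that quantifies question difficulty and model competency through bidirectional score propagation, enabling fine-grained, difficulty-aware evaluation of large language models.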

Details Motivation: Existing benchmarks do not differentiate question difficulty, which limits their ability to distinguish model capabilities. Method: RankLLM introduces question difficulty as the primary evaluation criterion and builds a bidirectional score propagation mechanism between model competency and question difficulty. Result: Evaluating 30 models on 35,550 questions across multiple domains, RankLLM reaches 90% agreement with human judgments, outperforms strong baselines such as IRT, and shows strong stability, fast convergence, and high computational efficiency. Conclusion: RankLLM offers a practical and reliable approach to large-scale, difficulty-aware LLM evaluation. Abstract: Benchmarks establish a standardized evaluation framework to systematically assess the performance of large language models (LLMs), facilitating objective comparisons and driving advancements in the field. However, existing benchmarks fail to differentiate question difficulty, limiting their ability to effectively distinguish models' capabilities. To address this limitation, we propose RankLLM, a novel framework designed to quantify both question difficulty and model competency. RankLLM introduces difficulty as the primary criterion for differentiation, enabling a more fine-grained evaluation of LLM capabilities. RankLLM's core mechanism facilitates bidirectional score propagation between models and questions. The core intuition of RankLLM is that a model earns a competency score when it correctly answers a question, while a question's difficulty score increases when it challenges a model. Using this framework, we evaluate 30 models on 35,550 questions across multiple domains. RankLLM achieves 90% agreement with human judgments and consistently outperforms strong baselines such as IRT. It also exhibits strong stability, fast convergence, and high computational efficiency, making it a practical solution for large-scale, difficulty-aware LLM evaluation.
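The intuition of bidirectional score propagation can be illustrated with a simple iterative scheme in the spirit of HITS-style mutual reinforcement. This is an assumed sketch, not RankLLM's published update rule: the weighting by the other side's score, the smoothing constant `eps`, and the normalization are all choices made here for illustration.

```python
def normalize(values):
    """Scale non-negative scores so that they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def propagate(correct, iterations=20, eps=0.1):
    """correct[m][q] is True when model m answers question q correctly.

    A model gains competency from each question it answers, weighted by that
    question's difficulty; a question gains difficulty from each model it
    stumps, weighted by that model's competency. eps keeps all scores positive.
    """
    n_models, n_questions = len(correct), len(correct[0])
    difficulty = [1.0 / n_questions] * n_questions
    competency = [1.0 / n_models] * n_models
    for _ in range(iterations):
        competency = normalize([
            eps + sum(d for hit, d in zip(row, difficulty) if hit)
            for row in correct
        ])
        difficulty = normalize([
            eps + sum(c for row, c in zip(correct, competency) if not row[q])
            for q in range(n_questions)
        ])
    return competency, difficulty


# Toy correctness matrix: rows are models A, B, C; columns are questions q1, q2, q3.
correct = [
    [True, True, True],    # A answers everything
    [True, True, False],   # B misses q3
    [True, False, False],  # C misses q2 and q3
]
competency, difficulty = propagate(correct)
print(competency, difficulty)
```

In this toy case the scheme ranks A above B above C and q3 above q2 above q1, which matches the intuition stated in the abstract; real data would of course be far less clean.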

[7] RBCorr: Response Bias Correction in Language Models

Om Bhatt, Anna A. Ivanova

Main category: cs.CL

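TL;DR: This paper proposes RBCorr, a simple and effective response bias correction strategy that removes option preference bias in language models on fixed-response questions and improves the performance of smaller models on closed-response benchmarks.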

Details Motivation: Language models exhibit response biases (such as option preference) on fixed-response questions, which distorts the assessment of their abilities and calls for low-cost, effective correction methods. Method: The authors propose RBCorr, a LogProbs-based response bias correction strategy, test it on 12 open-weight language models with yes-no, entailment, and multiple-choice questions, and analyze how bias behavior generalizes across models, datasets, and prompt formats. Result: Response bias is prevalent in LMs before correction; RBCorr effectively eliminates the bias and improves model performance; the effectiveness of LogProbs-based correction depends strongly on the model, dataset, and prompt format. Conclusion: RBCorr is an easy-to-use method that improves the performance of smaller language models on closed-response benchmarks and makes capability assessments more faithful. Abstract: Language models (LMs) are known to be prone to response biases, which present as option preference biases in fixed-response questions. It is therefore imperative to develop low-cost and effective response bias correction methods to improve LM performance and enable more accurate evaluations of model abilities. Here, we propose a simple response bias correction strategy (RBCorr) and test it on 12 open-weight language models using yes-no, entailment, and multiple choice questions. We show that response bias is prevalent in LMs pre-correction and that RBCorr effectively eliminates bias and boosts model performance. We also explore the generalizability of bias behavior across models, datasets, and prompt formats, showing that LogProbs-based correction is highly dependent on all three of these aspects. Overall, RBCorr is an easy-to-use method that can boost the performance of smaller LMs and ensure that LM performance on closed-response benchmarks aligns more closely with their true capabilities.
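One common way to correct an option preference with log-probabilities is to estimate each option's average log-probability over a set of prompts and subtract it before taking the argmax. The sketch below shows that generic idea; it is an assumption about what a LogProbs-based correction could look like, not the RBCorr procedure itself, and the numbers are made up.

```python
# Hypothetical option log-probabilities for four yes/no questions from a model
# that leans toward "yes" regardless of the question.
logprobs = [
    {"yes": -0.2, "no": -1.8},
    {"yes": -0.5, "no": -1.0},
    {"yes": -0.4, "no": -1.1},
    {"yes": -0.1, "no": -2.5},
]


def estimate_bias(rows):
    """Average log-probability of each option across all prompts."""
    options = rows[0].keys()
    return {opt: sum(r[opt] for r in rows) / len(rows) for opt in options}


def predict(row, bias=None):
    """Pick the option with the highest (optionally bias-corrected) log-probability."""
    bias = bias or {opt: 0.0 for opt in row}
    return max(row, key=lambda opt: row[opt] - bias[opt])


bias = estimate_bias(logprobs)
raw_predictions = [predict(r) for r in logprobs]
corrected_predictions = [predict(r, bias) for r in logprobs]
print(raw_predictions)
print(corrected_predictions)
```

Because the bias estimate is computed from a particular model, dataset, and prompt format, it would need to be re-estimated when any of the three changes, which is consistent with the dependence the abstract reports.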

[8] Discovering Semantic Latent Structures in Psychological Scales: A Response-Free Pathway to Efficient Simplification

Bo Wang, Yuxuan Zhang, Yueqin Hu, Hanchao Hou, Kaiping Peng, Shiguang Ni

Main category: cs.CL

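TL;DR: This paper proposes a response-free framework for simplifying psychological scales based on semantic topic modeling. It discovers latent semantic factors with sentence embeddings and density-based clustering, without predefining the number of factors, and simplifies scales through class-based term weighting and representative item selection; validation on several classic scales confirms structural preservation and psychometric adequacy.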

Details Motivation: Traditional scale refinement relies on large samples of respondent data (for example factor analysis and item response theory) and is limited by data availability and cross-cultural comparability; advances in NLP suggest that the semantic structure of questionnaire items may itself encode how constructs are organized, which motivates a complementary response-free method. Method: Items are encoded with contextual sentence embeddings and grouped by density-based clustering to discover semantic factors automatically; class-based term weighting yields interpretable topic representations; representative items are selected by membership criteria within an integrated reduction pipeline; and a visualization-supported tool enables one-click semantic analysis and structured simplification. Result: On the DASS, IPIP, and EPOCH scales, the method shortens scales by 60.5% on average while maintaining good internal consistency, factor congruence, and preservation of inter-factor correlations; the recovered semantic clusters align closely with established constructs. Conclusion: The latent semantic structure of items can serve as a response-free approximation of measurement structure; the framework formalizes semantic analysis as an inspectable front-end step for scale construction and simplification, offering a new approach to scale optimization in cross-cultural and small-sample settings. Abstract: Psychological scale refinement traditionally relies on response-based methods such as factor analysis, item response theory, and network psychometrics to optimize item composition. Although rigorous, these approaches require large samples and may be constrained by data availability and cross-cultural comparability. Recent advances in natural language processing suggest that the semantic structure of questionnaire items may encode latent construct organization, offering a complementary response-free perspective. We introduce a topic-modeling framework that operationalizes semantic latent structure for scale simplification. Items are encoded using contextual sentence embeddings and grouped via density-based clustering to discover latent semantic factors without predefining their number. Class-based term weighting derives interpretable topic representations that approximate constructs and enable merging of semantically adjacent clusters. Representative items are selected using membership criteria within an integrated reduction pipeline. We benchmarked the framework across DASS, IPIP, and EPOCH, evaluating structural recovery, internal consistency, factor congruence, correlation preservation, and reduction efficiency. The proposed method recovered coherent factor-like groupings aligned with established constructs. Selected items reduced scale length by 60.5% on average while maintaining psychometric adequacy. Simplified scales showed high concordance with original factor structures and preserved inter-factor correlations, indicating that semantic latent organization provides a response-free approximation of measurement structure. Our framework formalizes semantic structure as an inspectable front-end for scale construction and reduction. To facilitate adoption, we provide a visualization-supported tool enabling one-click semantic analysis and structured simplification.

[9] Unleashing Low-Bit Inference on Ascend NPUs: A Comprehensive Evaluation of HiFloat Formats

Pengxiang Zhao, Hui-Ling Zhen, Xing Li, Han Bao, Weizhe Lin, Zhiyuan Yang, Ziwei Yu, Xin Wang, Mingxuan Yuan, Xianzhi Yu, Zhenhua Dong

Main category: cs.CL

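TL;DR: This paper evaluates HiFloat (HiF8 and HiF4), a family of floating-point formats tailored for Ascend NPUs, on weight-activation and KV-cache quantization tasks, and finds that it balances accuracy and efficiency at low bit widths. In particular, HiF4's hierarchical scaling avoids the accuracy collapse seen with 4-bit integer quantization, and HiFloat is compatible with existing post-training quantization (PTQ) frameworks.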

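Details Motivation: As large language models scale, low-bit floating-point formats such as MXFP and NVFP4 offer a new balance between precision and efficiency; a quantization format better suited to Ascend NPUs is needed. Method: The authors evaluate the HiFloat family (HiF8/HiF4) through systematic experiments on weight-activation and KV-cache tasks, compare formats (INT8, HiF4, and others) on narrow-range and high-variance data, and verify compatibility with mainstream post-training quantization (PTQ) frameworks. Result: (1) INT8 suits narrow-range data, while floating-point formats suit high-variance data; (2) in 4-bit regimes, HiF4's hierarchical scaling substantially mitigates accuracy collapse; (3) HiFloat integrates with existing PTQ pipelines. Conclusion: HiFloat is an efficient, highly compatible low-bit floating-point quantization scheme for Ascend NPUs, suited to LLM inference deployment. Abstract: As LLMs scale, low-bit floating-point formats like MXFP and NVFP4 offer new opportunities for precision and efficiency. In this work, we evaluate HiFloat (HiF8 and HiF4), a family of formats tailored for Ascend NPUs. Through rigorous comparison across weight-activation and KV-cache tasks, we provide three key insights: (1) INT8 suits narrow-range data, while floating-point formats excel with high-variance data; (2) in 4-bit regimes, HiF4's hierarchical scaling prevents the accuracy collapse seen in integer formats; and (3) HiFloat is fully compatible with state-of-the-art post-training quantization frameworks. Overall, HiFloat provides a solution for high-efficiency LLM inference on NPUs.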
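The abstract does not describe HiF4's bit layout, so the sketch below does not implement it. It only illustrates, with plain symmetric 4-bit integers, why finer-grained scales help on high-variance data: a single tensor-wide scale is dominated by the largest values and flushes small values to zero, while per-block scales preserve them. The data and block size are made up, and block-wise scaling is just one level of what a hierarchical scheme might combine.

```python
def quantize_int4(values, scale):
    """Symmetric 4-bit quantization: integers clamped to [-7, 7], then dequantized."""
    return [max(-7, min(7, round(v / scale))) * scale for v in values]


def scale_for(values):
    """Scale that maps the largest magnitude to 7 (falls back to 1.0 for all-zero input)."""
    return (max(abs(v) for v in values) / 7) or 1.0


def per_tensor(values):
    """One scale shared by the whole tensor."""
    return quantize_int4(values, scale_for(values))


def per_block(values, block_size):
    """One scale per contiguous block of values."""
    restored = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        restored.extend(quantize_int4(block, scale_for(block)))
    return restored


def mean_abs_error(original, restored):
    return sum(abs(o - r) for o, r in zip(original, restored)) / len(original)


# Hypothetical high-variance data: a block of tiny values next to a block of large ones.
values = [0.01, 0.02, -0.03, 0.015, 8.0, -6.0, 7.5, 5.0]

err_tensor = mean_abs_error(values, per_tensor(values))
err_block = mean_abs_error(values, per_block(values, block_size=4))
print(f"per-tensor error: {err_tensor:.4f}, per-block error: {err_block:.4f}")
```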

[10] CLASE: A Hybrid Method for Chinese Legalese Stylistic Evaluation

Yiran Rex Ma, Yuxiao Ye, Huiyuan Xie

Main category: cs.CL

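TL;DR: This paper proposes CLASE, a hybrid method for evaluating the stylistic quality of Chinese legal text. It combines linguistic feature-based scores with experience-guided LLM-as-a-judge scores, substantially improves agreement with human judgments, and provides interpretable score breakdowns and suggestions for improvement.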

Details Motivation: Existing approaches to evaluating legal style fall short: having experts write a metric by hand is impractical because the stylistic norms of legal writing are implicit and hard to formalize, reference-based automatic metrics conflate semantic accuracy with stylistic fidelity, and pure LLM-as-a-judge methods lack transparency and consistency. Method: CLASE is a hybrid, reference-free, interpretable evaluation method that combines linguistic feature-based scores (for example syntax, terminology, and formatting) with experience-guided LLM-as-a-judge scores; both the feature coefficients and the LLM scoring experiences are learned from contrastive pairs of authentic legal documents and their LLM-restored counterparts. Result: Experiments on 200 Chinese legal documents show that CLASE agrees with human expert judgments substantially better than traditional metrics (such as BLEU and BERTScore) and pure LLM-as-a-judge methods, and that it outputs interpretable score breakdowns and concrete suggestions for improvement. Conclusion: CLASE provides a scalable, practical, and transparent solution for professional stylistic evaluation in legal text generation. Abstract: Legal text generated by large language models (LLMs) can usually achieve reasonable factual accuracy, but it frequently fails to adhere to the specialised stylistic norms and linguistic conventions of legal writing. In order to improve stylistic quality, a crucial first step is to establish a reliable evaluation method. However, having legal experts manually develop such a metric is impractical, as the implicit stylistic requirements in legal writing practice are difficult to formalise into explicit rubrics. Meanwhile, existing automatic evaluation methods also fall short: reference-based metrics conflate semantic accuracy with stylistic fidelity, and LLM-as-a-judge evaluations suffer from opacity and inconsistency. To address these challenges, we introduce CLASE (Chinese LegAlese Stylistic Evaluation), a hybrid evaluation method that focuses on the stylistic performance of legal text. The method incorporates a hybrid scoring mechanism that combines 1) linguistic feature-based scores and 2) experience-guided LLM-as-a-judge scores. Both the feature coefficients and the LLM scoring experiences are learned from contrastive pairs of authentic legal documents and their LLM-restored counterparts. This hybrid design captures both surface-level features and implicit stylistic norms in a transparent, reference-free manner. Experiments on 200 Chinese legal documents show that CLASE achieves substantially higher alignment with human judgments than traditional metrics and pure LLM-as-a-judge methods. Beyond improved alignment, CLASE provides interpretable score breakdowns and suggestions for improvements, offering a scalable and practical solution for professional stylistic evaluation in legal text generation (Code and data for CLASE are available at: https://github.com/rexera/CLASE).

[11] Beyond Normalization: Rethinking the Partition Function as a Difficulty Scheduler for RLVR

Dohyung Kim, Minbeom Kim, Jeonghye Kim, Sangmook Lee, Sojeong Rhee, Kyomin Jung

Main category: cs.CL

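TL;DR: This paper proposes PACED-RL, a framework that reinterprets the partition function in GFlowNets as a per-prompt expected-reward (that is, online accuracy) signal to improve the sample efficiency of distribution-matching training for LLMs.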

Details Motivation: Reward-maximizing RL methods improve LLM reasoning but often reduce output diversity; GFlowNet-based approaches address this by matching a target distribution, yet they treat the partition function purely as a normalizer and ignore the per-prompt accuracy information it carries. Method: Partition Function-Guided RL (PACED-RL): (1) establish a theoretical relationship between the partition function and per-prompt accuracy estimates; (2) use these estimates to prioritize informative question prompts during training; (3) add a replay mechanism prioritized by accuracy estimate error. All components reuse information already produced during GFlowNet training. Result: PACED-RL shows strong improvements over GRPO and prior GFlowNet approaches across diverse benchmarks. Conclusion: The partition function can serve as an online accuracy signal, and PACED-RL is a promising direction for more sample-efficient distribution-matching training of LLMs. Abstract: Reward-maximizing RL methods enhance the reasoning performance of LLMs, but often reduce the diversity among outputs. Recent works address this issue by adopting GFlowNets, training LLMs to match a target distribution while jointly learning its partition function. In contrast to prior works that treat this partition function solely as a normalizer, we reinterpret it as a per-prompt expected-reward (i.e., online accuracy) signal, leveraging this unused information to improve sample efficiency. Specifically, we first establish a theoretical relationship between the partition function and per-prompt accuracy estimates. Building on this key insight, we propose Partition Function-Guided RL (PACED-RL), a post-training framework that leverages accuracy estimates to prioritize informative question prompts during training, and further improves sample efficiency through an accuracy estimate error-prioritized replay. Crucially, both components reuse information already produced during GFlowNet training, effectively amortizing the compute overhead into the existing optimization process. Extensive experiments across diverse benchmarks demonstrate strong performance improvements over GRPO and prior GFlowNet approaches, highlighting PACED-RL as a promising direction for more sample-efficient distribution-matching training for LLMs.
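The abstract does not give the exact relationship between the partition function and accuracy, so the sketch below works it out under stated assumptions rather than reproducing the paper's result. Assume the target distribution is proportional to pi_ref(y|x) * exp(r(x, y) / beta) with a binary reward r in {0, 1}. Then Z(x) = (1 - p) + p * exp(1 / beta), where p is the probability that a reference-policy sample is correct, so p can be read back from Z. The `priority` rule (favoring prompts with p near 0.5) is a further hypothetical choice, as are the prompt names.

```python
import math


def partition_from_accuracy(p, beta):
    """Z(x) for a binary reward under the assumed target pi_ref * exp(r / beta)."""
    return 1.0 + p * (math.exp(1.0 / beta) - 1.0)


def accuracy_from_partition(z, beta):
    """Invert the relation above to recover the per-prompt accuracy estimate."""
    return (z - 1.0) / (math.exp(1.0 / beta) - 1.0)


def priority(p):
    """Hypothetical informativeness score: variance of a Bernoulli(p) reward."""
    return p * (1.0 - p)


beta = 0.5
# Hypothetical learned partition values for three prompts.
learned_z = {
    "easy": partition_from_accuracy(0.95, beta),
    "medium": partition_from_accuracy(0.50, beta),
    "hard": partition_from_accuracy(0.05, beta),
}

estimates = {name: accuracy_from_partition(z, beta) for name, z in learned_z.items()}
ordered = sorted(estimates, key=lambda name: priority(estimates[name]), reverse=True)
print(estimates)
print(ordered)
```

Under these assumptions, a prompt the model almost always or almost never solves carries little learning signal, so the medium-difficulty prompt is scheduled first.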

[12] Learning Ordinal Probabilistic Reward from Preferences

Longze Chen, Lu Wang, Renke Shan, Ze Gong, Run Luo, Jiaming Li, Jing Luo, Qiyao Wang, Min Yang

Main category: cs.CL

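TL;DR: This paper proposes the Probabilistic Reward Model (PRM), which models reward as a random variable rather than a deterministic scalar, together with its discrete realization OPRM and a data-efficient training strategy, RgFT, improving the accuracy and data efficiency of reward modeling.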

Details Motivation: Existing generative (GRM) and discriminative (DRM) reward models suffer, respectively, from costly point-wise supervision and from uncalibrated scores that lack a probabilistic interpretation. Method: The authors propose the Probabilistic Reward Model (PRM), discretize it into the Ordinal Probabilistic Reward Model (OPRM), and design Region Flooding Tuning (RgFT), which uses quality-level annotations to guide the model to concentrate probability mass within the corresponding rating sub-regions. Result: On several reward model benchmarks, accuracy improves by 2.9% to 7.4% over prior methods, and analysis of the score distribution indicates that the model captures both relative rankings and absolute quality. Conclusion: PRM, with its concrete realization OPRM and the RgFT strategy, offers a more principled, efficient, and interpretable approach to reward modeling. Abstract: Reward models are crucial for aligning large language models (LLMs) with human values and intentions. Existing approaches follow either Generative (GRMs) or Discriminative (DRMs) paradigms, yet both suffer from limitations: GRMs typically demand costly point-wise supervision, while DRMs produce uncalibrated relative scores that lack probabilistic interpretation. To address these challenges, we introduce a novel reward modeling paradigm: Probabilistic Reward Model (PRM). Instead of modeling reward as a deterministic scalar, our approach treats it as a random variable, learning a full probability distribution for the quality of each response. To make this paradigm practical, we present its closed-form, discrete realization: the Ordinal Probabilistic Reward Model (OPRM), which discretizes the quality score into a finite set of ordinal ratings. Building on OPRM, we propose a data-efficient training strategy called Region Flooding Tuning (RgFT). It enables rewards to better reflect absolute text quality by incorporating quality-level annotations, which guide the model to concentrate the probability mass within corresponding rating sub-regions. Experiments on various reward model benchmarks show that our method improves accuracy by 2.9% to 7.4% compared to prior reward models, demonstrating strong performance and data efficiency. Analysis of the score distribution provides evidence that our method captures not only relative rankings but also absolute quality.
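Treating reward as a distribution over a finite set of ordinal ratings makes quantities such as the expected rating and the probability that one response beats another available in closed form. The sketch below illustrates this with a three-level rating scale; the distributions are made up, and the independence of the two responses' ratings is an assumption made here, not a detail taken from the paper.

```python
RATINGS = [1, 2, 3]  # hypothetical ordinal scale: 1 = poor, 3 = good


def expected_rating(probs):
    """Mean of the rating distribution (a scalar summary of absolute quality)."""
    return sum(r * p for r, p in zip(RATINGS, probs))


def prob_prefers(p_a, p_b):
    """P(rating of A > rating of B), assuming the two ratings are independent."""
    return sum(
        pa * pb
        for i, pa in enumerate(p_a)
        for j, pb in enumerate(p_b)
        if i > j
    )


# Hypothetical rating distributions predicted for two responses to the same prompt.
response_a = [0.1, 0.2, 0.7]
response_b = [0.6, 0.3, 0.1]

print(expected_rating(response_a), expected_rating(response_b))
print(prob_prefers(response_a, response_b), prob_prefers(response_b, response_a))
```

The same pair of distributions therefore yields both a relative signal (who wins) and an absolute one (how good each response is), which a single uncalibrated score difference does not provide.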

[13] $\mathcal{X}$-KD: General Experiential Knowledge Distillation for Large Language Models

Yuang Cai, Yuyu Yuan

Main category: cs.CL

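TL;DR: This paper proposes Experiential Knowledge Distillation (X-KD), a knowledge distillation framework that, drawing on inverse reinforcement learning, models the teacher's original reward function and performs policy distillation in the teacher's learning environment, improving student performance, the performance-diversity trade-off, and data efficiency.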

Details Motivation: Existing knowledge distillation methods for large language models focus on imitating teacher behavior and overlook the original learning environment that shaped the teacher's knowledge; inspired by experiential learning theory and inverse reinforcement learning, the student should learn in the teacher's original environment. Method: X-KD adopts the Approximated Variational Reward Imitation Learning (AVRIL) framework to jointly model the teacher's original reward function and perform policy distillation, encouraging consistency between the student policy and that reward function; the derivation shows that X-KD fits the supervised learning framework and applies to both sequence-level and divergence-based distillation. Result: On abstractive summarization, machine translation, and arithmetic reasoning, X-KD outperforms the generalized KD and MiniLLM baselines, with a better performance-diversity trade-off and higher data efficiency. Conclusion: X-KD is a simple, general, and flexible distillation approach that improves results by recovering the teacher's learning environment. Abstract: Knowledge Distillation (KD) for Large Language Models (LLMs) has become increasingly important as models grow in size and complexity. While existing distillation approaches focus on imitating teacher behavior, they often overlook the original learning environment that shaped the teacher's knowledge. Inspired by the experiential learning theory and inverse reinforcement learning, we propose Experiential Knowledge Distillation ($\mathcal{X}$-KD), a novel and general framework that enables student models to learn in the teacher's original learning environment. $\mathcal{X}$-KD adopts the Approximated Variational Reward Imitation Learning (AVRIL) framework to jointly model the teacher's original reward function and perform policy distillation, encouraging consistency between the student policy and the original reward function. Our derivation demonstrates that $\mathcal{X}$-KD follows the supervised learning framework and applies to both sequence-level and divergence-based distillation methods, underlining the simplicity and flexibility of our approach. Empirical results show that $\mathcal{X}$-KD outperforms the generalized KD and MiniLLM baselines on abstractive summarization, machine translation, and arithmetic reasoning tasks. Additionally, $\mathcal{X}$-KD achieves better performance-diversity trade-off and data efficiency than baseline KD approaches.

[14] MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs

Baorong Shi, Bo Cui, Boyuan Jiang, Deli Yu, Fang Qian, Haihua Yang, Huichao Wang, Jiale Chen, Jianfei Pan, Jieqiong Cao, Jinghao Lin, Kai Wu, Lin Yang, Shengsheng Yao, Tao Chen, Xiaojun Xiao, Xiaozhong Ji, Xu Wang, Yijun He, Zhixiong Yang

Main category: cs.CL

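TL;DR: MedXIAOHE is a medical vision-language foundation model aimed at real-world clinical applications. Through entity-aware continual pretraining, reinforcement learning with tool-augmented agentic training, and reliable generation grounded in evidence and user-preference rubrics, it achieves state-of-the-art performance on multiple medical benchmarks.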

Details Motivation: Advance general-purpose medical understanding and reasoning, address the long-tail distribution of medical data (for example rare diseases), and make the model more reliable and interpretable in real clinical settings. Method: An entity-aware continual pretraining framework; reinforcement learning combined with tool-augmented agentic training to support multi-step diagnostic reasoning; and user-preference rubrics, evidence-grounded reasoning, and low-hallucination long-form report generation. Result: State-of-the-art performance on multiple medical benchmarks, surpassing leading closed-source multimodal systems on several capabilities, including medical reasoning, diagnostic support, and report generation. Conclusion: The report documents practical design choices, scaling insights, and an evaluation framework for building reliable, interpretable, clinically useful medical MLLMs. Abstract: We present MedXIAOHE, a medical vision-language foundation model designed to advance general-purpose medical understanding and reasoning in real-world clinical applications. MedXIAOHE achieves state-of-the-art performance across diverse medical benchmarks and surpasses leading closed-source multimodal systems on multiple capabilities. To achieve this, we propose an entity-aware continual pretraining framework that organizes heterogeneous medical corpora to broaden knowledge coverage and reduce long-tail gaps (e.g., rare diseases). For medical expert-level reasoning and interaction, MedXIAOHE incorporates diverse medical reasoning patterns via reinforcement learning and tool-augmented agentic training, enabling multi-step diagnostic reasoning with verifiable decision traces. To improve reliability in real-world use, MedXIAOHE integrates user-preference rubrics, evidence-grounded reasoning, and low-hallucination long-form report generation, with improved adherence to medical instructions. We release this report to document our practical design choices, scaling insights, and evaluation framework, hoping to inspire further research.

[15] ReFilter: Improving Robustness of Retrieval-Augmented Generation via Gated Filter

Yixin Chen, Ying Xiong, Shangyu Wu, Xiangrui Ke, Nan Guan, Chun Jason Xue

Main category: cs.CL

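TL;DR: This paper proposes ReFilter, a latent-based fusion framework for retrieval-augmented generation (RAG) that performs token-level filtering and fusion to improve LLM performance on knowledge-intensive question answering, remaining efficient and robust at large retrieval sizes k.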

Details Motivation: Existing internal fusion methods for RAG (query-based, parametric, and latent-based) do not scale gracefully as the number of retrieved candidates k grows: a larger k improves evidence coverage but also brings in irrelevant or redundant content and raises inference cost. Method: ReFilter consists of three parts: a context encoder that encodes context features, a gated filter that weights each token, and a token fusion module that integrates the weighted token features into the LLM's hidden states, achieving token-level filtering and fusion. Result: On four general-domain QA benchmarks, ReFilter achieves the best average performance under both in-domain adaptation and out-of-domain transfer; in zero-shot transfer to five biomedical QA benchmarks it reaches 70.01% average accuracy with Qwen2.5-14B-Instruct. Conclusion: ReFilter is a scalable, efficient, and domain-general latent-based RAG fusion method that mitigates the noise and compute burden of large-k retrieval. Abstract: Retrieval-augmented generation (RAG) has become a dominant paradigm for grounding large language models (LLMs) with external evidence in knowledge-intensive question answering. A core design choice is how to fuse retrieved samples into the LLMs, where existing internal fusion approaches broadly fall into query-based fusion, parametric fusion, and latent-based fusion. Despite their effectiveness at modest retrieval scales, these methods often fail to scale gracefully as the number of retrieved candidates k increases: Larger k improves evidence coverage, yet realistic top-k retrieval inevitably contains irrelevant or redundant content and increases the inference cost. To address these limitations, we propose ReFilter, a novel latent-based fusion framework that performs token-level filtering and fusion. ReFilter consists of three key components: a context encoder for encoding context features, a gated filter for weighting each token, and a token fusion module for integrating the weighted token feature into the LLM's hidden states. Our experiments across four general-domain QA benchmarks show that ReFilter consistently achieves the best average performance under both in-domain adaptation and out-of-domain transfer. ReFilter further generalizes to five biomedical QA benchmarks in zero-shot transfer without domain fine-tuning, reaching 70.01% average accuracy with Qwen2.5-14B-Instruct.
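A gated filter followed by token fusion can be pictured as a per-token sigmoid gate that scales each retrieved token's features before they are added to a hidden state. The sketch below is a toy illustration in plain Python under that assumption; the linear gate, the additive fusion, and all numbers are hypothetical and are not taken from ReFilter's architecture.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(hidden, token_features, gate_weights, gate_bias=0.0):
    """Weight each retrieved token by a learned gate, then add it to the hidden state.

    hidden:          list of floats (one hidden-state vector)
    token_features:  list of vectors, one per retrieved-context token
    gate_weights:    vector used to score each token (stands in for a learned layer)
    """
    gates = [
        sigmoid(sum(w * f for w, f in zip(gate_weights, feats)) + gate_bias)
        for feats in token_features
    ]
    fused = list(hidden)
    for gate, feats in zip(gates, token_features):
        for dim, value in enumerate(feats):
            fused[dim] += gate * value  # tokens with gates near 0 contribute almost nothing
    return gates, fused


# Hypothetical 2-dimensional features for three retrieved tokens.
token_features = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
hidden = [0.1, 0.2]
gate_weights = [2.0, 0.5]

gates, fused = gated_fusion(hidden, token_features, gate_weights)
print([round(g, 3) for g in gates], [round(v, 3) for v in fused])
```

Because the gate acts on individual tokens, irrelevant tokens inside an otherwise useful passage can be suppressed without discarding the whole passage.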

[16] Lamer-SSL: Layer-aware Mixture of LoRA Experts for Continual Multilingual Expansion of Self-supervised Models without Forgetting

Jing Xu, Minglin Wu, Xueyuan Chen, Xixin Wu, Helen Meng

Main category: cs.CL

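TL;DR: This paper proposes Lamer-SSL, a framework that combines a Layer-Aware MixturE of LoRA Experts (Lamer) module with a replay strategy to improve the generalization and forgetting resistance of self-supervised speech models in continual multilingual learning. With only 2.14% of parameters trainable, it balances performance on new and previously learned languages in ASR and LID tasks.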

Details Motivation: Self-supervised speech models suffer from degraded performance and catastrophic forgetting in cross-lingual generalization and continual learning. Method: Lamer-SSL has two parts: (1) the Lamer module, a layer-aware mixture of LoRA experts that flexibly balances shared and language-specific representations and assigns more experts to deeper layers; (2) a lightweight replay strategy that retains prior knowledge with minimal data. Result: Experiments on ASR and LID show that Lamer-SSL extends models to new languages effectively while maintaining performance on previously learned languages, with only 2.14% of parameters trainable. Conclusion: Lamer-SSL is an efficient, scalable, parameter-efficient continual learning framework that mitigates forgetting in multilingual settings and improves generalization. Abstract: Despite their impressive performance, self-supervised speech models often struggle to generalize to new languages and tend to forget previously acquired knowledge during continual training. To address this, we propose Lamer-SSL, a parameter-efficient framework that integrates a Layer-Aware MixturE of LoRA Experts (Lamer) module with a replay strategy. The Lamer module enables flexible balancing between shared and language-specific representations, while layer-aware expert allocation assigns more experts to deeper layers where semantic information is richer. Meanwhile, the replay strategy retains prior knowledge using minimal data, mitigating forgetting during continual training. Experiments on automatic speech recognition (ASR) and language identification (LID) demonstrate that Lamer-SSL extends self-supervised models to new languages effectively while maintaining strong performance on previously learned languages with only 2.14% parameters being trainable.

[17] Towards a Diagnostic and Predictive Evaluation Methodology for Sequence Labeling Tasks

Elena Alvarez-Mellado, Julio Gonzalo

Main category: cs.CL

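TL;DR: This paper proposes an evaluation methodology for sequence labeling tasks grounded in error analysis. It handcrafts small, linguistically motivated test sets that cover the span attributes a system may meet in real use, yielding an assessment of model performance that is diagnostic, actionable, and predictive.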

Details Motivation: Standard NLP evaluation only compares average performance, offers little guidance on how to improve a system, and generalizes poorly to out-of-distribution data; an evaluation method is needed that exposes weaknesses, supports decisions, and predicts performance across distributions. Method: Build small, handcrafted, linguistically motivated test sets that cover the range of span attributes (such as shape, length, casing, and sentence position) instead of relying on large amounts of real-world data, and combine them with error analysis for quantitative and qualitative evaluation. Result: On a benchmark for anglicism identification in Spanish, the results are diagnostic (they reveal systematic weaknesses), actionable (they inform model selection), and predictive (a median correlation of 0.85 when predicting performance on external datasets). Conclusion: Handcrafted test sets based on linguistic attributes are a more effective and robust way to evaluate sequence labeling, compensating for the limited interpretability, actionability, and predictive power of standard evaluation. Abstract: Standard evaluation in NLP typically indicates that system A is better on average than system B, but it provides little info on how to improve performance and, what is worse, it should not come as a surprise if B ends up being better than A on outside data. We propose an evaluation methodology for sequence labeling tasks grounded on error analysis that provides both quantitative and qualitative information on where systems must be improved and predicts how models will perform on a different distribution. The key is to create test sets that, contrary to common practice, do not rely on gathering large amounts of real-world in-distribution scraped data, but consist in handcrafting a small set of linguistically motivated examples that exhaustively cover the range of span attributes (such as shape, length, casing, sentence position, etc.) a system may encounter in the wild. We demonstrate this methodology on a benchmark for anglicism identification in Spanish. Our methodology provides results that are diagnostic (because they help identify systematic weaknesses in performance), actionable (because they can inform which model is better suited for a given scenario) and predictive: our method predicts model performance on external datasets with a median correlation of 0.85.

[18] Aspect-Based Sentiment Analysis for Future Tourism Experiences: A BERT-MoE Framework for Persian User Reviews

Hamidreza Kazemi Taskooh,Taha Zare Harofte

Main category: cs.CL

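TL;DR: This paper proposes a hybrid BERT-based model that combines Top-K routing with auxiliary losses for aspect-based sentiment analysis (ABSA) of Persian user reviews in the tourism domain. On a self-built annotated dataset it reaches a weighted F1 of 90.6%, outperforming the baselines, and reduces GPU power consumption by 39%.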

Details Motivation: Address data scarcity and model efficiency for ABSA in the tourism domain for Persian, a low-resource language, and fill a research gap for this language and domain. Method: Builds a hybrid BERT-based model with a Top-K routing mechanism and auxiliary losses to mitigate routing collapse; designs a three-stage pipeline of overall sentiment classification, multi-label extraction of six tourism-related aspects, and integrated ABSA with dynamic routing; uses 58,473 manually annotated Persian reviews from the Iranian accommodation platform Jabama. Result: Weighted F1 of 90.6% on ABSA, above baseline BERT (89.25%) and a standard hybrid approach (85.7%); 39% lower GPU power consumption; cleanliness and amenities are the most frequently mentioned key aspects. Conclusion: The proposed model offers advantages in both performance and energy efficiency and is the first ABSA study on Persian tourism reviews; the released dataset will advance multilingual tourism NLP research and supports the UN Sustainable Development Goals (SDGs 9 and 12). Abstract: This study advances aspect-based sentiment analysis (ABSA) for Persian-language user reviews in the tourism domain, addressing challenges of low-resource languages. We propose a hybrid BERT-based model with Top-K routing and auxiliary losses to mitigate routing collapse and improve efficiency. The pipeline includes: (1) overall sentiment classification using BERT on 9,558 labeled reviews, (2) multi-label aspect extraction for six tourism-related aspects (host, price, location, amenities, cleanliness, connectivity), and (3) integrated ABSA with dynamic routing. The dataset consists of 58,473 preprocessed reviews from the Iranian accommodation platform Jabama, manually annotated for aspects and sentiments. The proposed model achieves a weighted F1-score of 90.6% for ABSA, outperforming baseline BERT (89.25%) and a standard hybrid approach (85.7%). Key efficiency gains include a 39% reduction in GPU power consumption compared to dense BERT, supporting sustainable AI deployment in alignment with UN SDGs 9 and 12. Analysis reveals high mention rates for cleanliness and amenities as critical aspects. This is the first ABSA study focused on Persian tourism reviews, and we release the annotated dataset to facilitate future multilingual NLP research in tourism.
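
The abstract above mentions Top-K routing with auxiliary losses to mitigate routing collapse but does not give the formulas. The following is a minimal sketch of the general technique only, not the authors' implementation: the function names are hypothetical, and the auxiliary term is assumed to be a Switch-Transformer-style load-balancing loss, which the paper may or may not use.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of router logits.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(logits, k):
    # Keep the k most probable experts and renormalize their weights to sum to 1.
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

def load_balance_loss(batch_logits, k):
    # Assumed auxiliary loss: n_experts * sum_i(routed_fraction_i * mean_router_prob_i).
    # It equals 1.0 when routing is perfectly balanced and grows as routing collapses.
    n_experts = len(batch_logits[0])
    counts = [0] * n_experts
    mean_prob = [0.0] * n_experts
    for logits in batch_logits:
        for i, p in enumerate(softmax(logits)):
            mean_prob[i] += p / len(batch_logits)
        for i, _ in top_k_route(logits, k):
            counts[i] += 1
    total = sum(counts)
    fractions = [c / total for c in counts]
    return n_experts * sum(f * p for f, p in zip(fractions, mean_prob))

# Toy batches: every token prefers expert 0 (collapsed) vs. one token per expert (balanced).
collapsed = [[5.0, 0.0, 0.0, 0.0]] * 4
balanced = [[5.0 if j == i else 0.0 for j in range(4)] for i in range(4)]
```

In this toy setup `load_balance_loss(collapsed, 1)` is larger than `load_balance_loss(balanced, 1)`, which is the signal an auxiliary loss of this kind would use to push the router away from collapse.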

[19] RAT-Bench: A Comprehensive Benchmark for Text Anonymization

Nataša Krčo,Zexi Yao,Matthieu Meeus,Yves-Alexandre de Montjoye

Main category: cs.CL

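TL;DR: This paper introduces RAT-Bench, a benchmark that evaluates how well text anonymization tools actually prevent re-identification. It finds that existing tools, both NER-based and LLM-based, still have significant shortcomings, especially with non-standard direct identifiers and with indirect identifiers; LLM-based anonymizers are computationally more expensive but offer a better privacy-utility trade-off and work across languages.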

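Details Motivation: Existing PII scrubbing tools (such as Presidio and PII purifier) are evaluated only on identifying and removing specific identifiers, so their real re-identification risk is unknown; a systematic benchmark oriented toward re-identification risk is needed. Method: Builds RAT-Bench from synthetic text generated using U.S. demographic statistics, containing direct and indirect identifiers across domains, languages, and difficulty levels; an LLM-based attacker infers the attributes remaining in the anonymized text, which is used to quantify re-identification risk in the U.S. population while accounting for the disparate impact of identifiers; a range of NER- and LLM-based anonymization tools is evaluated. Result: No tool reaches an ideal level of protection against re-identification, especially for direct identifiers written in non-standard ways and for indirect identifiers that enable strong inference; LLM-based anonymizers (including new iterative ones) offer a better privacy-utility balance and perform well across languages, but at higher computational cost. Conclusion: Current text anonymization tools have fundamental shortcomings in defending against re-identification; evaluation should adopt re-identification risk as its core metric; future tools should model indirect identifiers better and improve robustness and multilingual support, and the benchmark will be released to encourage expansion to other geographies. Abstract: Data containing personal information is increasingly used to train, fine-tune, or query Large Language Models (LLMs). Text is typically scrubbed of identifying information prior to use, often with tools such as Microsoft's Presidio or Anthropic's PII purifier. These tools have traditionally been evaluated on their ability to remove specific identifiers (e.g., names), yet their effectiveness at preventing re-identification remains unclear. We introduce RAT-Bench, a comprehensive benchmark for text anonymization tools based on re-identification risk. Using U.S. demographic statistics, we generate synthetic text containing various direct and indirect identifiers across domains, languages, and difficulty levels. We evaluate a range of NER- and LLM-based text anonymization tools and, based on the attributes an LLM-based attacker is able to correctly infer from the anonymized text, we report the risk of re-identification in the U.S. population, while properly accounting for the disparate impact of identifiers. We find that, while capabilities vary widely, even the best tools are far from perfect, in particular when direct identifiers are not written in standard ways and when indirect identifiers enable re-identification. Overall we find LLM-based anonymizers, including new iterative anonymizers, to provide a better privacy-utility trade-off albeit at a higher computational cost. Importantly, we also find them to work well across languages.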
We conclude with recommendations for future anonymization tools and will release the benchmark and encourage community efforts to expand it, in particular to other geographies.

[20] Left-right asymmetry in predicting brain activity from LLMs' representations emerges with their formal linguistic competence

Laurent Bonnasse-Gahot,Christophe Pallier

Main category: cs.CL

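TL;DR: This paper studies why, as a large language model (LLM) is trained, its internal activations become better at predicting fMRI signals in the left hemisphere of the human brain than in the right. It finds that this left-right asymmetry emerges together with the LLM's formal linguistic competence (such as grammaticality judgment and text generation) and is unrelated to arithmetic, bracket-language, world-knowledge, or reasoning tasks; the finding is confirmed on Pythia models and on French data.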

Details Motivation: Understand which linguistic competence acquired during LLM training underlies the difference in how well the model predicts activity in the left and right hemispheres of the human brain (stronger for the left). Method: Uses multiple training checkpoints of the OLMo-2 7B model together with fMRI data from English participants to analyze how the left-right difference in brain prediction evolves during training, and compares it with performance on several benchmarks (grammaticality judgment, sentence generation, arithmetic, Dyck languages, world knowledge, and reasoning); generalization is further tested on Pythia models and French data. Result: The asymmetry in brain predictivity (left > right) closely co-emerges with formal linguistic abilities (such as minimal-pair grammaticality discrimination and the quality of generated text), but does not correlate with performance on arithmetic, Dyck language, world knowledge, or reasoning tasks; the same pattern holds for Pythia models and for French. Conclusion: The formal linguistic competence developed during training (mastery of linguistic structural patterns) is the key factor driving the left-right asymmetry in brain predictivity, supporting the view that the neural lateralization of language processing may stem from learning formal grammatical structure. Abstract: When humans and large language models (LLMs) process the same text, activations in the LLMs correlate with brain activity measured, e.g., with functional magnetic resonance imaging (fMRI). Moreover, it has been shown that, as the training of an LLM progresses, the performance in predicting brain activity from its internal activations improves more in the left hemisphere than in the right one. The aim of the present work is to understand which kind of competence acquired by the LLMs underlies the emergence of this left-right asymmetry. Using the OLMo-2 7B language model at various training checkpoints and fMRI data from English participants, we compare the evolution of the left-right asymmetry in brain scores alongside performance on several benchmarks. We observe that the asymmetry co-emerges with the formal linguistic abilities of the LLM. These abilities are demonstrated in two ways: by the model's capacity to assign a higher probability to an acceptable sentence than to a grammatically unacceptable one within a minimal contrasting pair, or its ability to produce well-formed text. By contrast, the left-right asymmetry does not correlate with the performance on arithmetic or Dyck language tasks, nor with text-based tasks involving world knowledge and reasoning. We generalize these results to another family of LLMs (Pythia) and another language, namely French. Our observations indicate that the left-right asymmetry in brain predictivity matches the progress in formal linguistic competence (knowledge of linguistic patterns).
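
The minimal-pair test described in the abstract (assigning a higher probability to the acceptable sentence than to the unacceptable one) can be illustrated with a small sketch. It assumes per-token log-probabilities have already been obtained from some language model; the function names and the toy numbers are hypothetical and are not taken from the paper.

```python
def sentence_logprob(token_logprobs):
    # Log-probability of a sentence as the sum of its per-token log-probabilities.
    return sum(token_logprobs)

def minimal_pair_accuracy(pairs):
    # Fraction of (acceptable, unacceptable) pairs where the acceptable sentence scores higher.
    correct = sum(
        1 for good, bad in pairs
        if sentence_logprob(good) > sentence_logprob(bad)
    )
    return correct / len(pairs)

# Toy per-token log-probabilities, invented for illustration only.
pairs = [
    ([-1.0, -0.5], [-1.0, -2.5]),  # acceptable sentence is more probable
    ([-0.2, -3.0], [-0.2, -1.0]),  # unacceptable sentence is more probable
]
accuracy = minimal_pair_accuracy(pairs)
```

Tracking this accuracy across training checkpoints would give a curve of the kind that the paper compares against the left-right asymmetry in brain scores.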

[21] AIWizards at MULTIPRIDE: A Hierarchical Approach to Slur Reclamation Detection

Luca Tedeschini,Matteo Fasulo

Main category: cs.CL

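TL;DR: This paper proposes a hierarchical modeling approach that uses weakly supervised LLM annotation to estimate the probability that a user belongs to the LGBTQ+ community and feeds this signal into a BERT model to improve detection of reclaimed slurs in Italian and Spanish. Performance is comparable to a strong baseline, and the framework is modular and extensible.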

Details Motivation: The same slur can be abusive or an in-group affirmation depending on social identity and context; existing hate speech detection systems struggle to tell the two apart, so the effect of user identity and context on meaning needs to be modeled. Method: A two-stage hierarchical approach: in the first stage, a weakly supervised LLM assigns users fuzzy labels for the probability of LGBTQ+ membership, and a BERT-like model is trained on them to learn latent representations of this identity; in the second stage, this latent space is fused with a newly initialized slur reclamation detection model, combining sociolinguistic signals with representations pretrained for hate speech detection. Result: On Italian and Spanish data, the approach performs statistically on par with a strong BERT baseline while providing a modular, extensible framework for modeling sociolinguistic context. Conclusion: Hierarchical modeling of user identity and discourse context helps with reclaimed slur detection; more fine-grained sociolinguistic modeling may further improve this task. Abstract: Detecting reclaimed slurs represents a fundamental challenge for hate speech detection systems, as the same lexical items can function either as abusive expressions or as in-group affirmations depending on social identity and context. In this work, we address Subtask B of the MultiPRIDE shared task at EVALITA 2026 by proposing a hierarchical approach to modeling the slur reclamation process. Our core assumption is that members of the LGBTQ+ community are more likely, on average, to employ certain slurs in a reclamatory manner. Based on this hypothesis, we decompose the task into two stages. First, using a weakly supervised LLM-based annotation, we assign fuzzy labels to users indicating the likelihood of belonging to the LGBTQ+ community, inferred from the tweet and the user bio. These soft labels are then used to train a BERT-like model to predict community membership, encouraging the model to learn latent representations associated with LGBTQ+ identity. In the second stage, we integrate this latent space with a newly initialized model for the downstream slur reclamation detection task. The intuition is that the first model encodes user-oriented sociolinguistic signals, which are then fused with representations learned by a model pretrained for hate speech detection. Experimental results on Italian and Spanish show that our approach achieves performance statistically comparable to a strong BERT-based baseline, while providing a modular and extensible framework for incorporating sociolinguistic context into hate speech modeling.
We argue that more fine-grained hierarchical modeling of user identity and discourse context may further improve the detection of reclaimed language. We release our code at https://github.com/LucaTedeschini/multipride.

[22] MentalBench: A Benchmark for Evaluating Psychiatric Diagnostic Capability of Large Language Models

Hoyun Song,Migyeong Kang,Jisu Shin,Jihyun Kim,Chanbi Park,Hangyeol Yoo,Jihyun An,Alice Oh,Jinyoung Han,KyungTae Lim

Main category: cs.CL

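TL;DR: This paper introduces MentalBench, a benchmark for evaluating the psychiatric diagnostic decision-making of large language models. Its core is MentalKG, a psychiatric knowledge graph built on DSM-5, from which a large number of synthetic clinical cases are generated for systematic evaluation.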

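Details Motivation: Existing mental health benchmarks rely mostly on social media data and cannot assess diagnostic judgments that conform to DSM criteria, so a more specialized, structured, and interpretable evaluation tool is needed. Method: Builds MentalKG, a knowledge graph designed and validated by psychiatrists that covers DSM-5 diagnostic criteria and differential diagnostic rules for 23 psychiatric disorders; generates from it 24,750 synthetic clinical cases that vary systematically in information completeness and diagnostic complexity; evaluates mainstream LLMs on diagnostic decisions and confidence calibration. Result: State-of-the-art LLMs perform well when queried directly about DSM-5 knowledge, but their confidence calibration is markedly poor when distinguishing psychiatric disorders with highly overlapping clinical symptoms. Conclusion: MentalBench exposes key evaluation gaps not covered by existing benchmarks and highlights the need to attend to uncertainty modeling and differential diagnosis in realistic clinical reasoning scenarios. Abstract: We introduce MentalBench, a benchmark for evaluating psychiatric diagnostic decision-making in large language models (LLMs). Existing mental health benchmarks largely rely on social media data, limiting their ability to assess DSM-grounded diagnostic judgments. At the core of MentalBench is MentalKG, a psychiatrist-built and validated knowledge graph encoding DSM-5 diagnostic criteria and differential diagnostic rules for 23 psychiatric disorders. Using MentalKG as a gold-standard logical backbone, we generate 24,750 synthetic clinical cases that systematically vary in information completeness and diagnostic complexity, enabling low-noise and interpretable evaluation. Our experiments show that while state-of-the-art LLMs perform well on structured queries probing DSM-5 knowledge, they struggle to calibrate confidence in diagnostic decision-making when distinguishing between clinically overlapping disorders. These findings reveal evaluation gaps not captured by existing benchmarks.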

[23] BaziQA-Benchmark: Evaluating Symbolic and Temporally Compositional Reasoning in Large Language Models

Jiangxi Chen,Qian Liu

Main category: cs.CL

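TL;DR: This paper presents BaziQA-Benchmark, a standardized benchmark for evaluating symbolic and temporally compositional reasoning in large language models. It is built from professional fortune-telling competition problems and introduces a structured reasoning protocol for analyzing model behavior.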

Details Motivation: Existing evaluations are mostly anecdotal or prompt-driven, and there is no objective, controlled, standardized benchmark for measuring a model's real ability in symbolic and temporally compositional reasoning. Method: Builds BaziQA-Benchmark from professional multiple-choice problems of the Global Fortune-teller Competition; evaluates mainstream LLMs in a multi-turn setting; proposes a lightweight Structured Reasoning Protocol that constrains the order of inference. Result: Models perform clearly above chance but far from saturation, are sensitive to temporal composition and reasoning order, and fail systematically on precise temporal localization and multi-condition symbolic judgments. Conclusion: Current LLMs still have fundamental limitations in symbolic, temporally compositional reasoning, which calls for finer-grained evaluation frameworks and modeling mechanisms. Abstract: We present BaziQA-Benchmark, a standardized benchmark for evaluating symbolic and temporally compositional reasoning in large language models. The benchmark is derived from 200 professionally curated, multiple-choice problems from the Global Fortune-teller Competition (2021--2025), where each instance requires structured inference over a fixed symbolic chart and interacting temporal conditions. Unlike anecdotal or prompt-driven evaluations, BaziQA-Benchmark enables objective scoring and controlled comparison across years, domains, and model families. We evaluate contemporary language models under a multi-turn setting and analyze performance variation across temporal difficulty, reasoning domains, and inference protocols. To further probe reasoning behavior, we introduce a lightweight Structured Reasoning Protocol that constrains inference order without adding domain knowledge. Results show that models consistently outperform chance but remain far from saturation, exhibiting pronounced sensitivity to temporal composition and reasoning order, as well as systematic failures on precise temporal localization and multi-condition symbolic judgments.

[24] ViMedCSS: A Vietnamese Medical Code-Switching Speech Dataset & Benchmark

Tung X. Nguyen,Nhu Vo,Giang-Son Nguyen,Duy Mai Hoang,Chien Dinh Huynh,Inigo Jauregi Unanue,Massimo Piccardi,Wray Buntine,Dung D. Le

Main category: cs.CL

TL;DR: This paper introduces ViMedCSS, the first benchmark dataset for Vietnamese medical code-switching speech, and evaluates ASR models with fine-tuning strategies to improve recognition of English medical terms in Vietnamese.

Details Motivation: Code-switching of English medical terms in Vietnamese speech poses a challenge for ASR systems, especially in low-resource settings, and no existing benchmark addresses this issue. Method: The authors construct a 34-hour Vietnamese medical code-switching speech dataset (ViMedCSS) with 16,576 utterances containing English medical terms from a curated bilingual lexicon; they then evaluate state-of-the-art ASR models and compare fine-tuning strategies, including Vietnamese-optimized and multilingual pretraining approaches. Result: Vietnamese-optimized models excel on general segments, multilingual pretraining better captures English insertions, and their combination achieves the best balance between overall and code-switched accuracy. Conclusion: ViMedCSS establishes the first benchmark for Vietnamese medical code-switching ASR, and combining Vietnamese-specific optimization with multilingual pretraining is the most effective strategy for domain adaptation in low-resource, multilingual ASR. Abstract: Code-switching (CS), in which Vietnamese speech includes English words such as drug names or procedures, is a common phenomenon in Vietnamese medical communication. This creates challenges for Automatic Speech Recognition (ASR) systems, especially in low-resource languages like Vietnamese. Most current ASR systems struggle to correctly recognize English medical terms within Vietnamese sentences, and no benchmark addresses this challenge. In this paper, we construct a 34-hour Vietnamese Medical Code-Switching Speech dataset (ViMedCSS) containing 16,576 utterances. Each utterance includes at least one English medical term drawn from a curated bilingual lexicon covering five medical topics. Using this dataset, we evaluate several state-of-the-art ASR models and examine different fine-tuning strategies for improving medical term recognition, to identify the best approach on this dataset.
Experimental results show that Vietnamese-optimized models perform better on general segments, while multilingual pretraining helps capture English insertions. The combination of both approaches yields the best balance between overall and code-switched accuracy. This work provides the first benchmark for Vietnamese medical code-switching and offers insights into effective domain adaptation for low-resource, multilingual ASR systems.

[25] When Words Don't Mean What They Say: Figurative Understanding in Bengali Idioms

Adib Sakhawat,Shamim Ara Parveen,Md Ruhul Amin,Shamim Al Mahmud,Md Saiful Islam,Tahera Khatun

Main category: cs.CL

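TL;DR: This paper builds a large-scale, culturally grounded dataset of 10,361 Bengali idioms, each annotated by experts along 19 dimensions, and uses it to evaluate 30 mainstream multilingual and instruction-tuned LLMs. No model exceeds 50% accuracy, far below the human score of 83.4%, which reveals serious shortcomings of current LLMs in figurative understanding and cross-cultural reasoning for low-resource languages.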

Details Motivation: Address the weakness of large language models (LLMs) in understanding figurative language, such as idioms, in low-resource languages, Bengali in particular. Method: Builds a large-scale, culturally grounded dataset of Bengali idioms (10,361 entries) annotated under a 19-field schema established through expert consensus; systematically evaluates 30 state-of-the-art multilingual and instruction-tuned LLMs on inferring idiom meaning. Result: All 30 models score below 50% accuracy on the task, well under human performance (83.4%), exposing fundamental limitations of existing models in cross-linguistic and cultural reasoning. Conclusion: Current LLMs have major deficiencies in figurative understanding for low-resource languages; by releasing a high-quality dataset and benchmark, this work provides key infrastructure for improving cultural grounding and figurative understanding in LLMs for Bengali and other low-resource languages. Abstract: Figurative language understanding remains a significant challenge for Large Language Models (LLMs), especially for low-resource languages. To address this, we introduce a new idiom dataset, a large-scale, culturally-grounded corpus of 10,361 Bengali idioms. Each idiom is annotated under a comprehensive 19-field schema, established and refined through a deliberative expert consensus process, that captures its semantic, syntactic, cultural, and religious dimensions, providing a rich, structured resource for computational linguistics. To establish a robust benchmark for Bangla figurative language understanding, we evaluate 30 state-of-the-art multilingual and instruction-tuned LLMs on the task of inferring figurative meaning. Our results reveal a critical performance gap, with no model surpassing 50% accuracy, a stark contrast to significantly higher human performance (83.4%). This underscores the limitations of existing models in cross-linguistic and cultural reasoning. By releasing the new idiom dataset and benchmark, we provide foundational infrastructure for advancing figurative language understanding and cultural grounding in LLMs for Bengali and other low-resource languages.

[26] Curriculum Learning and Pseudo-Labeling Improve the Generalization of Multi-Label Arabic Dialect Identification Models

Ali Mekky,Mohamed El Zeftawy,Lara Hassan,Amr Keleg,Preslav Nakov

Main category: cs.CL

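TL;DR: This paper proposes moving Arabic Dialect Identification (ADI) from single-label to multi-label classification (MLADI) and builds the first large-scale multi-label dataset, with annotations generated by GPT-4o and binary acceptability classifiers and aggregated under the guidance of ALDi. A BERT model, LAHJATBERT, is then trained with curriculum learning strategies aligned with dialectal complexity and label cardinality; it reaches a macro F1 of 0.69 on the MLADI leaderboard, well above the previous best system (0.55).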

Details Motivation: ADI research has long used a single-label setup, although in practice an Arabic sentence is often acceptable in several dialects; the lack of large-scale multi-label data severely limits progress on MLADI. Method: (1) Automatically generate multi-label annotations with GPT-4o and binary dialect acceptability classifiers; (2) aggregate the annotations using the Arabic Level of Dialectness (ALDi); (3) build the BERT-based multi-label classifier LAHJATBERT and train it with curriculum learning strategies aligned with dialectal complexity and label cardinality. Result: LAHJATBERT achieves a macro F1 of 0.69 on the MLADI leaderboard, 0.14 above the strongest previous system; code and data are open-sourced. Conclusion: Single-label ADI data is hard to repurpose for MLADI mainly because of poorly chosen negative samples; this work shows that combining LLM-generated labels, aggregation guided by domain knowledge, and curriculum learning can effectively advance MLADI. Abstract: Although Arabic Dialect Identification (ADI) has long been modeled as a single-label classification task, recent work has argued that it should be framed as a multi-label classification task. However, ADI remains constrained by the availability of single-label datasets, with no large-scale multi-label resources available for training. By analyzing models trained on single-label ADI data, we show that the main difficulty in repurposing such datasets for Multi-Label Arabic Dialect Identification (MLADI) lies in the selection of negative samples, as many sentences treated as negative could be acceptable in multiple dialects. To address these issues, we construct a multi-label dataset by generating automatic multi-label annotations using GPT-4o and binary dialect acceptability classifiers, with aggregation guided by the Arabic Level of Dialectness (ALDi). Afterward, we train a BERT-based multi-label classifier using curriculum learning strategies aligned with dialectal complexity and label cardinality. On the MLADI leaderboard, our best-performing LAHJATBERT model achieves a macro F1 of 0.69, compared to 0.55 for the strongest previously reported system. Code and data are available at https://mohamedalaa9.github.io/lahjatbert/.
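
The macro F1 reported above averages per-dialect F1 scores, so each dialect counts equally regardless of how often it occurs. Below is a minimal sketch of that computation for multi-label predictions. The exact scoring script of the MLADI leaderboard is not described in the abstract, so the handling of edge cases here (F1 of 0 for a label with no gold or predicted instances) is an assumption, and the dialect codes in the toy data are purely illustrative.

```python
def macro_f1(gold, pred, labels):
    # gold and pred are lists of label sets, one set per sentence.
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        denom = 2 * tp + fp + fn
        # Assumed convention: a label with no gold or predicted instances scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy example: a sentence may be acceptable in more than one dialect.
gold = [{"EGY", "SDN"}, {"MAR"}, {"EGY"}]
pred = [{"EGY"}, {"MAR"}, {"EGY", "SDN"}]
score = macro_f1(gold, pred, ["EGY", "MAR", "SDN"])
```

Here the two well-predicted labels each get an F1 of 1.0 and the third gets 0.0, so the macro average is 2/3 even though most individual label decisions are correct.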

[27] ProbeLLM: Automating Principled Diagnosis of LLM Failures

Yue Huang,Zhengzhe Jiang,Yuchen Ma,Yu Jiang,Xiangqi Wang,Yujun Zhou,Yuexing Hao,Kehan Guo,Pin-Yu Chen,Stefan Feuerriegel,Xiangliang Zhang

Main category: cs.CL

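TL;DR: This paper proposes ProbeLLM, a benchmark-agnostic automated probing framework that uses hierarchical Monte Carlo Tree Search to systematically discover structured failure modes of large language models (LLMs) rather than isolated error cases. It combines tool-augmented generation and verification, failure-aware embeddings, and boundary-aware induction, making failure analysis broader, cleaner, and more fine-grained.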

Details Motivation: Existing automated probing methods often find only scattered failure cases, lack systematic control over exploration, and reveal little about the underlying structure of model weaknesses; static evaluations also cannot keep pace with rapidly evolving LLMs. Method: Formulates probing as a hierarchical Monte Carlo Tree Search (MCTS) that allocates the probing budget between global exploration of new failure regions and local refinement of recurring error patterns; uses verifiable test cases with tool-augmented generation and verification; clusters individual failures into interpretable failure modes via failure-aware embeddings and boundary-aware induction. Result: Across multiple benchmarks and LLMs, ProbeLLM reveals broader, cleaner, and more fine-grained failure landscapes than static benchmarks and prior automated methods, supporting a shift from case-centric evaluation to principled weakness discovery. Conclusion: ProbeLLM offers a more systematic, reliable, and interpretable automated probing paradigm for understanding LLM failures, moving evaluation from static and superficial toward dynamic, structured, and principled. Abstract: Understanding how and why large language models (LLMs) fail is becoming a central challenge as models rapidly evolve and static evaluations fall behind. While automated probing has been enabled by dynamic test generation, existing approaches often discover isolated failure cases, lack principled control over exploration, and provide limited insight into the underlying structure of model weaknesses. We propose ProbeLLM, a benchmark-agnostic automated probing framework that elevates weakness discovery from individual failures to structured failure modes. ProbeLLM formulates probing as a hierarchical Monte Carlo Tree Search, explicitly allocating limited probing budgets between global exploration of new failure regions and local refinement of recurring error patterns. By restricting probing to verifiable test cases and leveraging tool-augmented generation and verification, ProbeLLM grounds failure discovery in reliable evidence. Discovered failures are further consolidated into interpretable failure modes via failure-aware embeddings and boundary-aware induction. Across diverse benchmarks and LLMs, ProbeLLM reveals substantially broader, cleaner, and more fine-grained failure landscapes than static benchmarks and prior automated methods, supporting a shift from case-centric evaluation toward principled weakness discovery.

[28] SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents

Yujiong Shen,Yajie Yang,Zhiheng Xi,Binze Hu,Huayu Sha,Jiazheng Zhang,Qiyuan Peng,Junlin Shang,Jixuan Huang,Yutao Fan,Jingqi Tong,Shihan Dou,Ming Zhang,Lei Bai,Zhenfei Yin,Tao Gui,Xingjun Ma,Qi Zhang,Xuanjing Huang,Yu-Gang Jiang

Main category: cs.CL

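TL;DR: This paper introduces the SciAgentGym environment and the SciAgentBench evaluation suite, exposes the performance bottleneck of large models in long-horizon scientific tool use, and proposes SciForge, a method for synthesizing training data that improves the cross-domain scientific tool-use ability of the small model SciAgent-8B.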

Details Motivation: Existing benchmarks overlook an agent's ability to coordinate multiple tools in complex scientific tasks and offer no support for rigorous scientific workflows. Method: Builds SciAgentGym, an interactive environment with 1,780 domain-specific tools; designs the tiered evaluation suite SciAgentBench; proposes SciForge, a logic-aware data synthesis method that models the tool action space as a dependency graph to generate high-quality training trajectories. Result: For advanced models such as GPT-5, success rates drop sharply from 60.6% to 30.9% as interaction horizons extend; after fine-tuning on SciForge trajectories, SciAgent-8B outperforms the much larger Qwen3-VL-235B-Instruct and shows cross-domain transfer. Conclusion: Scientific agents need dedicated data and environments; SciForge and the SciAgent series offer a feasible path and empirical grounding for building the next generation of autonomous scientific agents. Abstract: Scientific reasoning inherently demands integrating sophisticated toolkits to navigate domain-specific knowledge. Yet, current benchmarks largely overlook agents' ability to orchestrate tools for such rigorous workflows. To bridge this gap, we introduce SciAgentGym, a scalable interactive environment featuring 1,780 domain-specific tools across four natural science disciplines, supported by a robust execution infrastructure. Complementing this, we present SciAgentBench, a tiered evaluation suite designed to stress-test agentic capabilities from elementary actions to long-horizon workflows. Our evaluation identifies a critical bottleneck: state-of-the-art models struggle with complex scientific tool-use. Even for a leading model like GPT-5, success rates drop sharply from 60.6% to 30.9% as interaction horizons extend, primarily due to failures in multi-step workflow execution. To address this, we propose SciForge, a data synthesis method that models the tool action space as a dependency graph to generate logic-aware training trajectories. By fine-tuning on these trajectories, our SciAgent-8B outperforms the significantly larger Qwen3-VL-235B-Instruct while exhibiting positive cross-domain transfer of scientific tool-use capabilities. These results underscore the promising potential of next-generation autonomous scientific agents.

[29] Evaluating the Homogeneity of Keyphrase Prediction Models

Maël Houbre,Florian Boudin,Beatrice Daille

Main category: cs.CL

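TL;DR: This paper proposes a new method for evaluating the homogeneity of keyphrase prediction models. It finds that keyphrase extraction methods are competitive with generative models on homogeneity, and that the ability to generate absent keyphrases can even hurt homogeneity.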

Details Motivation: Current benchmarks do not evaluate the homogeneity of keyphrase prediction models; intuitively, generative models should index documents on the same subject more consistently because they can predict keyphrases that do not appear in the text (absent keyphrases), but this assumption has not been verified. Method: Proposes a new method for evaluating homogeneity that compares how consistently extraction and generation models predict keyphrases for documents on the same subject, followed by an empirical analysis. Result: Experiments show that keyphrase extraction methods are no less homogeneous than generative models, and that the ability to generate absent keyphrases can actually reduce homogeneity. Conclusion: Generating absent keyphrases does not necessarily improve homogeneity; evaluation should include a homogeneity dimension, and keyphrase extraction methods remain competitive. Abstract: Keyphrases, which are useful in several NLP and IR applications, are either extracted from text or predicted by generative models. Contrary to keyphrase extraction approaches, keyphrase generation models can predict keyphrases that do not appear in a document's text, called `absent keyphrases`. This ability means that keyphrase generation models can associate a document with a notion that is not explicitly mentioned in its text. Intuitively, this suggests that for two documents treating the same subjects, a keyphrase generation model is more likely to be homogeneous in their indexing, i.e. predict the same keyphrase for both documents, regardless of whether those keyphrases appear in their respective texts; something a keyphrase extraction model would fail to do. Yet, homogeneity of keyphrase prediction models is not covered by current benchmarks. In this work, we introduce a method to evaluate the homogeneity of keyphrase prediction models and study if absent keyphrase generation capabilities actually help the model to be more homogeneous. To our surprise, we show that keyphrase extraction methods are competitive with generative models, and that the ability to generate absent keyphrases can actually have a negative impact on homogeneity. Our data, code and prompts are available on huggingface and github.

[30] Know More, Know Clearer: A Meta-Cognitive Framework for Knowledge Augmentation in Large Language Models

Hao Chen,Ye He,Yuchun Fan,Yukun Yan,Zhenghao Liu,Qingfu Zhu,Maosong Sun,Wanxiang Che

Main category: cs.CL

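TL;DR: This paper proposes a meta-cognitive framework for reliable external knowledge augmentation through differentiated intervention and alignment. It uses the model's internal cognitive signals to partition the knowledge space and introduces a cognitive consistency mechanism that calibrates subjective confidence against objective accuracy.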

Details Motivation: Existing knowledge augmentation methods assume that model performance equals internal knowledge and overlook the knowledge-confidence gaps that lead to overconfident errors or uncertain truths. Method: Proposes a meta-cognitive framework that uses internal cognitive signals to partition the knowledge space into mastered, confused, and missing regions for targeted knowledge expansion, and introduces a cognitive consistency mechanism that synchronizes subjective certainty with objective accuracy. Result: Extensive experiments show that the framework consistently outperforms strong baselines, improving knowledge capabilities while helping the model better distinguish knowns from unknowns. Conclusion: The framework improves LLM performance on knowledge-intensive tasks and strengthens meta-cognitive ability, yielding more reliable knowledge augmentation. Abstract: Knowledge augmentation has significantly enhanced the performance of Large Language Models (LLMs) in knowledge-intensive tasks. However, existing methods typically operate on the simplistic premise that model performance equates with internal knowledge, overlooking the knowledge-confidence gaps that lead to overconfident errors or uncertain truths. To bridge this gap, we propose a novel meta-cognitive framework for reliable knowledge augmentation via differentiated intervention and alignment. Our approach leverages internal cognitive signals to partition the knowledge space into mastered, confused, and missing regions, guiding targeted knowledge expansion. Furthermore, we introduce a cognitive consistency mechanism to synchronize subjective certainty with objective accuracy, ensuring calibrated knowledge boundaries. Extensive experiments demonstrate that our framework consistently outperforms strong baselines, validating its rationality in not only enhancing knowledge capabilities but also fostering cognitive behaviors that better distinguish knowns from unknowns.

[31] Can we trust AI to detect healthy multilingual English speakers among the cognitively impaired cohort in the UK? An investigation using real-world conversational speech

Madhurananda Pahar,Caitlin Illingworth,Dorota Braun,Bahman Mirheidari,Lise Sproson,Daniel Blackburn,Heidi Christensen

Main category: cs.CL

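TL;DR: This study finds that current AI models for detecting cognitive decline are markedly biased against multilingual speakers from ethnic minority backgrounds in the UK (such as speakers of Somali, Chinese, and South Asian languages). The models tend to misclassify them as cognitively impaired, particularly in memory, fluency, and reading tasks, and the bias is more pronounced for speakers with a South Yorkshire accent; existing tools are therefore not yet reliable, and more generalisable, bias-mitigated models are needed.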

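Details Motivation: Ethnic minorities make up a large share of the UK population and dementia prevalence among them is rising quickly, yet the fairness and trustworthiness of AI diagnostic tools for multilingual and minority speakers is unclear, so their potential bias needs to be assessed. Method: Recruits monolingual participants nationally in the UK and multilingual English speakers from four community centres in Sheffield and Bradford (with non-native accents and Somali, Chinese, or South Asian language backgrounds), further divided by West and South Yorkshire accent; evaluates ASR systems and classification and regression models based on acoustic and linguistic features, including models trained on DementiaBank. Result: ASR systems show no significant bias across groups, but classification and regression models are clearly biased against multilingual speakers, particularly in memory, fluency, and reading tasks; the bias grows when training on DementiaBank; multilingual speakers are more likely to be misclassified as having cognitive decline, and speakers with a South Yorkshire accent are more likely to be rated as more severely affected. Conclusion: This is the first study to reveal systematic bias of current AI models against multilingual ethnic minority speakers in the UK; existing tools are not yet reliable for clinical diagnosis, and more generalisable and fair models are needed. Abstract: Conversational speech often reveals early signs of cognitive decline, such as dementia and MCI. In the UK, one in four people belongs to an ethnic minority, and dementia prevalence is expected to rise most rapidly among Black and Asian communities. This study examines the trustworthiness of AI models, specifically the presence of bias, in detecting healthy multilingual English speakers among the cognitively impaired cohort, to make these tools clinically beneficial. For experiments, monolingual participants were recruited nationally (UK), and multilingual speakers were enrolled from four community centres in Sheffield and Bradford. In addition to having a non-native English accent, the multilingual speakers spoke Somali, Chinese, or South Asian languages, and they were further divided by Yorkshire accent (West and South) to challenge the efficiency of the AI tools thoroughly. Although ASR systems showed no significant bias across groups, classification and regression models using acoustic and linguistic features exhibited bias against multilingual speakers, particularly in memory, fluency, and reading tasks. This bias was more pronounced when models were trained on the publicly available DementiaBank dataset. Moreover, multilinguals were more likely to be misclassified as having cognitive decline.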
This study is the first of its kind to discover that, despite their strong overall performance, current AI models show bias against multilingual individuals from ethnic minority backgrounds in the UK, and they are also more likely to misclassify speakers with a certain accent (South Yorkshire) as living with a more severe cognitive decline. In this pilot study, we conclude that the existing AI tools are therefore not yet reliable for diagnostic use in these populations, and we aim to address this in future work by developing more generalisable, bias-mitigated models.

[32] TraceBack: Multi-Agent Decomposition for Fine-Grained Table Attribution

Tejas Anvekar,Junha Park,Rajat Jha,Devanshu Gupta,Poojah Ganesan,Puneeth Mathur,Vivek Gupta

Main category: cs.CL

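TL;DR: This paper proposes TraceBack, a framework for fine-grained, cell-level attribution in table question answering, and releases the CITEBench benchmark and the FairScore evaluation metric, improving interpretability and the scalability of evaluation.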

Details Motivation: Existing table QA systems lack fine-grained attribution, so even correct answers are hard to verify, which limits trust in high-stakes settings. Method: Proposes TraceBack, a modular multi-agent framework with three stages (table pruning, question decomposition, and answer-to-cell alignment); builds the CITEBench benchmark; designs FairScore, an evaluation metric that needs no reference labels. Result: TraceBack substantially outperforms strong baselines across datasets and granularities; FairScore closely tracks human judgments and preserves the relative ranking of methods. Conclusion: TraceBack delivers scalable, interpretable cell-level attribution, and FairScore supports efficient, reliable reference-free evaluation, moving table QA toward greater trustworthiness and transparency. Abstract: Question answering (QA) over structured tables requires not only accurate answers but also transparency about which cells support them. Existing table QA systems rarely provide fine-grained attribution, so even correct answers often lack verifiable grounding, limiting trust in high-stakes settings. We address this with TraceBack, a modular multi-agent framework for scalable, cell-level attribution in single-table QA. TraceBack prunes tables to relevant rows and columns, decomposes questions into semantically coherent sub-questions, and aligns each answer span with its supporting cells, capturing both explicit and implicit evidence used in intermediate reasoning steps. To enable systematic evaluation, we release CITEBench, a benchmark with phrase-to-cell annotations drawn from ToTTo, FetaQA, and AITQA. We further propose FairScore, a reference-less metric that compares atomic facts derived from predicted cells and answers to estimate attribution precision and recall without human cell labels. Experiments show that TraceBack substantially outperforms strong baselines across datasets and granularities, while FairScore closely tracks human judgments and preserves relative method rankings, supporting interpretable and scalable evaluation of table-based QA.
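
The abstract describes FairScore as comparing atomic facts derived from the predicted cells with atomic facts derived from the answer to estimate attribution precision and recall. The sketch below illustrates only that set-comparison idea. It assumes the facts have already been extracted and normalized into comparable strings, and it matches them by exact equality; how the actual metric extracts and matches facts is not specified in the abstract, and the function name is hypothetical.

```python
def attribution_scores(cell_facts, answer_facts):
    # Precision: share of facts from the predicted cells that the answer actually uses.
    # Recall: share of facts in the answer that are backed by the predicted cells.
    cell_facts, answer_facts = set(cell_facts), set(answer_facts)
    shared = cell_facts & answer_facts
    precision = len(shared) / len(cell_facts) if cell_facts else 0.0
    recall = len(shared) / len(answer_facts) if answer_facts else 0.0
    return precision, recall

# Toy facts, invented for illustration; exact string matching is a simplification.
cell_facts = {"team=Lions", "wins=12", "coach=Smith"}
answer_facts = {"team=Lions", "wins=12"}
precision, recall = attribution_scores(cell_facts, answer_facts)
```

In this toy case one cited cell contributes a fact the answer never uses, which lowers precision, while every fact in the answer is supported, so recall stays at 1.0.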

[33] Exploring a New Competency Modeling Process with Large Language Models

Silin Du,Manqing Xin,Raymond Jia Wang

Main category: cs.CL

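TL;DR: This paper proposes a new competency modeling process built on large language models (LLMs). By decomposing expert practice into structured components, combining behavioral and psychological signals, weighting information sources adaptively, and validating offline, it makes competency modeling data-driven, interpretable, and evaluable.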

Details Motivation: Traditional expert-driven competency modeling depends on manual analysis of large volumes of interview text, which is costly, prone to randomness and ambiguity, and hard to reproduce; a more reliable, scalable, and verifiable automated method is needed. Method: Decomposes expert practice into structured computational components: LLMs extract behavioral and psychological descriptions from raw text and map them to predefined competency libraries through embedding similarity; a learnable parameter adaptively integrates the information sources; an offline evaluation procedure avoids collecting new data. Result: Validated in a real-world deployment at a software outsourcing company, showing strong predictive validity, cross-library consistency, and structural robustness. Conclusion: The framework turns competency modeling from a qualitative, expert-dependent practice into a transparent, data-driven, and evaluable analytical process. Abstract: Competency modeling is widely used in human resource management to select, develop, and evaluate talent. However, traditional expert-driven approaches rely heavily on manual analysis of large volumes of interview transcripts, making them costly and prone to randomness, ambiguity, and limited reproducibility. This study proposes a new competency modeling process built on large language models (LLMs). Instead of merely automating isolated steps, we reconstruct the workflow by decomposing expert practices into structured computational components. Specifically, we leverage LLMs to extract behavioral and psychological descriptions from raw textual data and map them to predefined competency libraries through embedding-based similarity. We further introduce a learnable parameter that adaptively integrates different information sources, enabling the model to determine the relative importance of behavioral and psychological signals. To address the long-standing challenge of validation, we develop an offline evaluation procedure that allows systematic model selection without requiring additional large-scale data collection. Empirical results from a real-world implementation in a software outsourcing company demonstrate strong predictive validity, cross-library consistency, and structural robustness. Overall, our framework transforms competency modeling from a largely qualitative and expert-dependent practice into a transparent, data-driven, and evaluable analytical process.
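
The mapping step described above (embedding-based similarity to a competency library, with a parameter balancing behavioral and psychological signals) might look roughly like the sketch below. The abstract does not give the scoring formula, so the convex combination, the function names, and the two-dimensional toy embeddings are all assumptions; in the paper the weight is learned, whereas here `alpha` is simply passed in.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_competencies(behavior_vec, psych_vec, library, alpha):
    # Assumed scoring: alpha * behavioral similarity + (1 - alpha) * psychological similarity.
    scores = {
        name: alpha * cosine(behavior_vec, vec) + (1 - alpha) * cosine(psych_vec, vec)
        for name, vec in library.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy library and embeddings, invented for illustration.
library = {"teamwork": [1.0, 0.0], "analysis": [0.0, 1.0]}
behavior_vec = [1.0, 0.0]  # behavioral description points toward "teamwork"
psych_vec = [0.0, 1.0]     # psychological description points toward "analysis"
ranking = rank_competencies(behavior_vec, psych_vec, library, alpha=0.8)
```

With `alpha=0.8` the behavioral signal dominates and "teamwork" ranks first; lowering `alpha` to 0.2 flips the ranking, which is the trade-off a learned weight would settle from data.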

[34] Towards interpretable models for language proficiency assessment: Predicting the CEFR level of Estonian learner texts

Kais Allkivi

Main category: cs.CL

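TL;DR: This study uses NLP to analyze multi-level linguistic features of Estonian learner writing and builds interpretable, generalizable machine learning classifiers that automatically assess writing proficiency at levels A2-C1 (accuracy around 0.9); it also finds that exam writings have become more complex over a 7-10-year period.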

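Details Motivation: Existing work rarely combines NLP-based analysis of authentic second-language learner language with the development of automated assessment tools; interpretable and generalizable machine learning models for language testing need to be explored. Method: Using writing samples from Estonian proficiency examinations (A2-C1), the study extracts task-independent linguistic features (lexical, morphological, surface, and error features), trains and compares several classification models, and additionally validates them on an earlier exam sample. Result: Models using the pre-selected linguistic features reach a test accuracy of about 0.9 with less variation across text types; the evaluation on earlier exams shows that writing complexity has increased, while some feature sets still reach 0.8 accuracy; the results have been integrated into the writing evaluation module of an Estonian open-source language learning platform. Conclusion: Careful selection of task-independent linguistic features helps build second-language writing proficiency classifiers that are accurate, low-variance, interpretable, and robust over time, and it supports teaching feedback and research on language development. Abstract: Using NLP to analyze authentic learner language helps to build automated assessment and feedback tools. It also offers new and extensive insights into the development of second language production. However, there is a lack of research explicitly combining these aspects. This study aimed to classify Estonian proficiency examination writings (levels A2-C1), assuming that careful feature selection can lead to more explainable and generalizable machine learning models for language testing. Various linguistic properties of the training data were analyzed to identify relevant proficiency predictors associated with increasing complexity and correctness, rather than the writing task. Such lexical, morphological, surface, and error features were used to train classification models, which were compared to models that also allowed for other features. The pre-selected features yielded a similar test accuracy but reduced variation in the classification of different text types. The best classifiers achieved an accuracy of around 0.9. Additional evaluation on an earlier exam sample revealed that the writings have become more complex over a 7-10-year period, while accuracy still reached 0.8 with some feature sets. The results have been implemented in the writing evaluation module of an Estonian open-source language learning environment.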

[35] SCOPE: Selective Conformal Optimized Pairwise LLM Judging

Sher Badshah,Ali Emami,Hassan Sajjad

Main category: cs.CL

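TL;DR: This paper proposes SCOPE, a framework that uses the Bidirectional Preference Entropy (BPE) uncertainty measure to let an LLM judge abstain from pairwise judgments under a controlled error rate, and validates its statistical reliability and high coverage on several benchmarks.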

Details Motivation: LLM judges suffer from miscalibration and systematic biases, so they cannot reliably replace human preference labels; a selective judging method with finite-sample statistical guarantees is needed. Method: Under an exchangeability assumption, SCOPE calibrates an acceptance threshold on an uncertainty signal so that the error rate among non-abstained judgments is at most a preset $\alpha$. The signal is BPE (Bidirectional Preference Entropy), which queries the judge under both response positions, aggregates the preference probabilities so that the result is invariant to response order, and converts the aggregate into an entropy score. Result: On MT-Bench, RewardBench, and Chatbot Arena, BPE outperforms standard confidence proxies; at $\alpha = 0.10$, SCOPE's empirical risk stays between 0.097 and 0.099, with coverage reaching 0.89 (Qwen-14B on RewardBench) and 0.98 (Qwen-32B on RewardBench); compared with baselines, it accepts up to 2.4 times more judgments on MT-Bench with Qwen-7B. Conclusion: SCOPE with BPE provides a selective evaluation scheme for LLM judging with finite-sample error control, raising coverage while meeting a strict risk constraint and supporting trustworthy automated evaluation. Abstract: Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation. Despite their practicality, LLM judges remain prone to miscalibration and systematic biases. This paper proposes SCOPE (Selective Conformal Optimized Pairwise Evaluation), a framework for selective pairwise judging with finite-sample statistical guarantees. Under exchangeability, SCOPE calibrates an acceptance threshold such that the error rate among non-abstained judgments is at most a user-specified level $\alpha$. To provide SCOPE with a bias-neutral uncertainty signal, we introduce Bidirectional Preference Entropy (BPE), which queries the judge under both response positions, aggregates the implied preference probabilities to enforce invariance to response order, and converts the aggregated probability into an entropy-based uncertainty score. Across MT-Bench, RewardBench, and Chatbot Arena, BPE improves uncertainty quality over standard confidence proxies, providing a stronger selection signal that enables SCOPE to consistently meet the target risk level while retaining good coverage across judge scales. In particular, at $\alpha = 0.10$, SCOPE consistently satisfies the risk bound across all benchmarks and judge scales (empirical risk $\approx 0.097$ to $0.099$), while retaining substantial coverage, reaching $0.89$ on RewardBench with Qwen-14B and $0.98$ on RewardBench with Qwen-32B. Compared to naïve baselines, SCOPE accepts up to $2.4\times$ more judgments on MT-Bench with Qwen-7B under the same target risk constraint, demonstrating that BPE enables reliable and high-coverage LLM-based evaluation.
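
The BPE recipe in the abstract (query both orders, aggregate, take an entropy) can be sketched in a few lines. The aggregation by simple mean and the plug-in threshold search below are assumptions made for illustration; in particular, `calibrate_threshold` is a naive empirical rule and does not carry the finite-sample guarantee that the paper's conformal calibration provides.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) preference."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def bpe(p_a_first, p_a_second):
    """Order-invariant uncertainty score.

    p_a_first:  judge's P(A is better) when A is shown first.
    p_a_second: judge's P(A is better) when A is shown second.
    Assumption: the two probabilities are aggregated by a simple mean.
    """
    p = 0.5 * (p_a_first + p_a_second)
    return p, binary_entropy(p)

def calibrate_threshold(scores, correct, alpha):
    """Largest score t whose accepted set {score <= t} has error <= alpha.

    Naive plug-in rule on a labeled calibration set (illustrative only).
    """
    pairs = sorted(zip(scores, correct))
    best, errors = None, 0
    for i, (s, ok) in enumerate(pairs, start=1):
        errors += 0 if ok else 1
        is_last_of_tie = i == len(pairs) or pairs[i][0] != s
        if is_last_of_tie and errors / i <= alpha:
            best = s
    return best

consistent = bpe(0.9, 0.7)   # judge prefers A in both orders
flipped = bpe(0.9, 0.1)      # judge follows position, not content
threshold = calibrate_threshold(
    [0.1, 0.2, 0.5, 0.9], [True, True, False, True], alpha=0.2
)
```

A judge that simply favors whichever response is shown first ends up with an aggregated probability near 0.5 and therefore maximal entropy, which is why such cases are the first to be rejected when the score is thresholded.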

[36] From sunblock to softblock: Analyzing the correlates of neology in published writing and on social media

Maria Ryskina,Matthew R. Gormley,Kyle Mahowald,David R. Mortensen,Taylor Berg-Kirkpatrick,Vivek Kulkarni

Main category: cs.CL

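TL;DR: This paper extends prior distributional semantic work on English neology from static word embeddings to contextual embeddings and, on a Twitter corpus, confirms that the two correlates of neology found in historical published texts still hold; the topic popularity growth factor contributes less on Twitter, which the authors attribute to the two domains favoring different neologism formation mechanisms.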

Details Motivation: To examine whether the drivers of neology are the same across contexts of language use (such as newspapers versus social media), and in particular whether earlier findings based on published texts carry over to Twitter, a dynamic and conversational setting. Method: The method of Ryskina et al. (2020) is extended to use contextual embeddings alongside static ones and applied to a Twitter corpus, reproducing and comparing the analysis of the two factors correlated with neology, one of which is topic popularity growth. Result: The two factors found in published texts still hold in the Twitter data, but the contribution of topic popularity growth to neology is weaker on Twitter. Conclusion: Neology depends on context: different domains (published writing versus social media) may favor different word formation mechanisms, which suggests that the pressures on language evolution are domain-specific. Abstract: Living languages are shaped by a host of conflicting internal and external evolutionary pressures. While some of these pressures are universal across languages and cultures, others differ depending on the social and conversational context: language use in newspapers is subject to very different constraints than language use on social media. Prior distributional semantic work on English word emergence (neology) identified two factors correlated with creation of new words by analyzing a corpus consisting primarily of historical published texts (Ryskina et al., 2020, arXiv:2001.07740). Extending this methodology to contextual embeddings in addition to static ones and applying it to a new corpus of Twitter posts, we show that the same findings hold for both domains, though the topic popularity growth factor may contribute less to neology on Twitter than in published writing. We hypothesize that this difference can be explained by the two domains favouring different neologism formation mechanisms.

Mariia Fedorova,Nikolay Arefyev,Maja Buljan,Jindřich Helcl,Stephan Oepen,Egil Rønningstad,Yves Scherrer

Main category: cs.CL

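TL;DR: This paper presents OpenLID-v3, which extends the training data, merges clusters of easily confused language variants, and adds a noise label to address the weaknesses of existing language identification tools in separating closely related languages and filtering noise, with particular benefit for low-resource languages.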

Details Motivation: Existing LID tools struggle to identify closely related languages and tend to mistake noise for natural language, which contaminates low-resource language subsets. Method: The OpenLID classifier is extended with more training data, merged clusters of problematic language variants, and a new noise label; new evaluation sets are built for three groups of closely related languages; the system is compared with GlotLID, and the effect of ensembling is analyzed. Result: OpenLID-v3 is evaluated against GlotLID on multiple benchmarks; ensembling improves precision but substantially reduces coverage for low-resource languages; the new evaluation sets fill gaps in existing data. Conclusion: OpenLID-v3 improves identification of closely related languages and robustness to noise, is released openly on Hugging Face, and offers a more reliable LID tool for building high-quality multilingual datasets. Abstract: Language identification (LID) is an essential step in building high-quality multilingual datasets from web data. Existing LID tools (such as OpenLID or GlotLID) often struggle to identify closely related languages and to distinguish valid natural language from noise, which contaminates language-specific subsets, especially for low-resource languages. In this work we extend the OpenLID classifier by adding more training data, merging problematic language variant clusters, and introducing a special label for marking noise. We call this extended system OpenLID-v3 and evaluate it against GlotLID on multiple benchmarks. During development, we focus on three groups of closely related languages (Bosnian, Croatian, and Serbian; Romance varieties of Northern Italy and Southern France; and Scandinavian languages) and contribute new evaluation datasets where existing ones are inadequate. We find that ensemble approaches improve precision but also substantially reduce coverage for low-resource languages. OpenLID-v3 is available at https://huggingface.co/HPLT/OpenLID-v3.
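
The precision/coverage trade-off reported for ensembles can be made concrete with a toy agreement rule. The sketch below assumes the simplest possible ensemble (keep a label only when two classifiers agree, otherwise abstain); the abstract does not specify the ensembling scheme, and the predictions are invented for illustration.

```python
def ensemble_label(pred_a, pred_b):
    """Return the shared label when both classifiers agree, else None (abstain)."""
    return pred_a if pred_a == pred_b else None

def precision_and_coverage(preds, gold):
    """Precision over kept items, and the fraction of items that were kept."""
    kept = [(p, g) for p, g in zip(preds, gold) if p is not None]
    coverage = len(kept) / len(gold)
    precision = sum(p == g for p, g in kept) / len(kept) if kept else 0.0
    return precision, coverage

# Hypothetical predictions on four sentences from closely related languages.
gold = ["bos", "hrv", "srp", "hrv"]
classifier_a = ["bos", "hrv", "hrv", "srp"]
classifier_b = ["bos", "hrv", "srp", "bos"]

ensemble = [ensemble_label(a, b) for a, b in zip(classifier_a, classifier_b)]
single_stats = precision_and_coverage(classifier_a, gold)
ensemble_stats = precision_and_coverage(ensemble, gold)
```

In this toy case the ensemble keeps only the two sentences both classifiers agree on, so precision rises while half of the data is dropped, which is the pattern that hurts low-resource languages most.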

[38] Semantic Chunking and the Entropy of Natural Language

Weishun Zhong,Doron Sivan,Tankut Can,Mikhail Katkov,Misha Tsodyks

Main category: cs.CL

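TL;DR: This paper proposes a statistical model that self-similarly segments text into semantically coherent chunks, giving a first-principles account of the roughly 80% redundancy and 1 bit per character entropy rate of English, and predicts that the entropy rate increases systematically with the semantic complexity of the corpus.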

Details Motivation: To explain the origin of the high redundancy (about 80%) and low entropy rate (about 1 bit per character) of natural language, and to provide a first-principles model. Method: A statistical model is built that describes self-similar semantic chunking of text down to the single-word level, so that the semantic structure can be decomposed hierarchically and treated analytically. Result: The model quantitatively reproduces the structure of real texts at different levels of the semantic hierarchy; its predicted entropy rate agrees with the estimate for printed English; it shows that the entropy rate rises systematically with the semantic complexity of the corpus. Conclusion: The entropy rate of natural language is not fixed but is determined by semantic complexity, which is captured by the model's only free parameter. Abstract: The entropy rate of printed English is famously estimated to be about one bit per character, a benchmark that modern large language models (LLMs) have only recently approached. This entropy rate implies that English contains nearly 80 percent redundancy relative to the five bits per character expected for random text. We introduce a statistical model that attempts to capture the intricate multi-scale structure of natural language, providing a first-principles account of this redundancy level. Our model describes a procedure of self-similarly segmenting text into semantically coherent chunks down to the single-word level. The semantic structure of the text can then be hierarchically decomposed, allowing for analytical treatment. Numerical experiments with modern LLMs and open datasets suggest that our model quantitatively captures the structure of real texts at different levels of the semantic hierarchy. The entropy rate predicted by our model agrees with the estimated entropy rate of printed English. Moreover, our theory further reveals that the entropy rate of natural language is not fixed but should increase systematically with the semantic complexity of corpora, which is captured by the only free parameter in our model.
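
The "nearly 80 percent" figure follows from a one-line calculation: redundancy is one minus the ratio of the actual entropy rate to the entropy of uniformly random characters. The sketch below works it through; the alphabet sizes (27 for letters plus space, 32 for an exact five bits) are assumptions chosen to match the "five bits per character" in the abstract.

```python
import math

def redundancy(entropy_rate_bits, alphabet_size):
    """Fraction of the uniform-random capacity that the text does not use."""
    return 1.0 - entropy_rate_bits / math.log2(alphabet_size)

# One bit per character against exactly five bits (2**5 = 32 symbols).
r_five_bits = redundancy(1.0, 32)

# One bit per character against 26 letters plus space (about 4.75 bits).
r_letters_space = redundancy(1.0, 27)
```

Both choices land at or just under 0.8, so the headline figure is not sensitive to the exact alphabet assumed for random text.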

cs.CV [Back]

[39] Thermal Imaging for Contactless Cardiorespiratory and Sudomotor Response Monitoring

Constantino Álvarez Casado,Mohammad Rahman,Sasan Sharifipour,Nhi Nguyen,Manuel Lage Cañellas,Xiaoting Wu,Miguel Bordallo López

Main category: cs.CV

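TL;DR: This paper presents a contactless method for extracting multiple biosignals (EDA, HR, BR) from facial thermal infrared video using a signal-processing pipeline, evaluates it on the SIM1 dataset, and reports performance baselines and design guidance.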

Details Motivation: Visible-light methods cannot access electrodermal activity (EDA), a marker of sympathetic activation, whereas thermal infrared imaging captures skin temperature changes driven by autonomic regulation and may allow contactless estimation of EDA, heart rate (HR), and breathing rate (BR). Method: A signal-processing pipeline tracks anatomical regions, aggregates them spatially, and separates slow sudomotor trends from faster cardiorespiratory components; HR is estimated with an orthogonal matrix image transformation (OMIT) decomposition across multiple facial regions of interest (ROIs); BR is estimated by averaging nasal and cheek signals and then detecting the spectral peak; 288 EDA configurations are evaluated. Result: The best fixed EDA configuration (nose region with an exponential moving average) reaches a mean absolute correlation of 0.40 ± 0.23 against palm EDA (up to 0.89 in individual sessions); BR estimation has a mean absolute error of 3.1 ± 1.1 bpm; HR estimation has an MAE of 13.8 ± 7.5 bpm, limited by the low 7.5 Hz frame rate; the study also observes signal polarity alternating across sessions, short thermodynamic latency, and condition-dependent and demographic effects. Conclusion: The study provides performance baselines and design guidance for contactless thermal biosignal estimation and shows that it is feasible, while exposing technical limits such as frame rate and the challenge of individual differences. Abstract: Thermal infrared imaging captures skin temperature changes driven by autonomic regulation and can potentially provide contactless estimation of electrodermal activity (EDA), heart rate (HR), and breathing rate (BR). While visible-light methods address HR and BR, they cannot access EDA, a standard marker of sympathetic activation. This paper characterizes the extraction of these three biosignals from facial thermal video using a signal-processing pipeline that tracks anatomical regions, applies spatial aggregation, and separates slow sudomotor trends from faster cardiorespiratory components. For HR, we apply an orthogonal matrix image transformation (OMIT) decomposition across multiple facial regions of interest (ROIs), and for BR we average nasal and cheek signals before spectral peak detection. We evaluate 288 EDA configurations and the HR/BR pipeline on 31 sessions from the public SIMULATOR STUDY 1 (SIM1) driver monitoring dataset. The best fixed EDA configuration (nose region, exponential moving average) reaches a mean absolute correlation of $0.40 \pm 0.23$ against palm EDA, with individual sessions reaching 0.89. BR estimation achieves a mean absolute error of $3.1 \pm 1.1$ bpm, while HR estimation yields $13.8 \pm 7.5$ bpm MAE, limited by the low camera frame rate (7.5 Hz). We report signal polarity alternation across sessions, short thermodynamic latency for well-tracked signals, and condition-dependent and demographic effects on extraction quality. These results provide baseline performance bounds and design guidance for thermal contactless biosignal estimation.
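
The breathing-rate step (separate the slow trend, then find the spectral peak) can be illustrated on a synthetic signal. This is a sketch under stated assumptions rather than the paper's pipeline: the signal is simulated, the smoothing factor and the 0.1 to 0.5 Hz breathing band are illustrative choices, and only the 7.5 Hz sampling rate comes from the abstract.

```python
import math

def ema(signal, alpha):
    """Exponential moving average; tracks the slow trend of the signal."""
    out, prev = [], signal[0]
    for x in signal:
        prev = alpha * x + (1 - alpha) * prev
        out.append(prev)
    return out

def dominant_rate_bpm(signal, fs, lo_hz, hi_hz):
    """Frequency (in cycles per minute) of the strongest DFT bin in a band."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]
    best_f, best_power = None, -1.0
    for k in range(1, n // 2):
        f = k * fs / n
        if not (lo_hz <= f <= hi_hz):
            continue
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(centered))
        im = sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(centered))
        power = re * re + im * im
        if power > best_power:
            best_f, best_power = f, power
    return best_f * 60.0

fs = 7.5                      # camera frame rate reported in the abstract
n = 450                       # 60 seconds of samples
# Synthetic nasal temperature: slow drift plus 0.25 Hz breathing (15 per minute).
signal = [0.002 * (i / fs) + 0.05 * math.sin(2 * math.pi * 0.25 * i / fs)
          for i in range(n)]

trend = ema(signal, alpha=0.05)                 # slow component
fast = [x - t for x, t in zip(signal, trend)]   # faster respiratory component
rate = dominant_rate_bpm(fast, fs, lo_hz=0.1, hi_hz=0.5)
```

With a 60-second window the frequency resolution is one cycle per minute, so the window length puts a floor on how finely the rate can be resolved.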

[40] LLaMo: Scaling Pretrained Language Models for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens

Zekun Li,Sizhe An,Chengcheng Tang,Chuan Guo,Ivan Shugurov,Linguang Zhang,Amy Zhao,Srinath Sridhar,Lingling Tao,Abhay Mittal

Main category: cs.CV

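TL;DR: This paper proposes LLaMo, a framework that extends a pretrained large language model with a modality-specific Mixture-of-Transformers architecture to achieve high-quality, real-time text-to-motion generation and motion-to-text understanding while avoiding catastrophic forgetting and the jitter caused by discretization.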

Details Motivation: Existing unified motion-language models are limited by the small scale of paired motion-text data, which tends to degrade the language ability of the LLM; they also usually represent motion through discrete quantization, which introduces jitter artifacts. Method: LLaMo extends a pretrained LLM with a modality-specific Mixture-of-Transformers (MoT); human motion is encoded into a causal continuous latent space; a lightweight flow-matching head on the decoder-only backbone preserves the next-token prediction paradigm and supports streaming motion generation at more than 30 FPS. Result: The model reaches high fidelity on text-to-motion generation and motion-to-text captioning, with especially strong zero-shot motion generation, and supports real-time generation (>30 FPS). Conclusion: LLaMo preserves language ability while adapting scalably to the motion modality, a significant step toward a general motion-language large model. Abstract: Recent progress in large models has led to significant advances in unified multimodal generation and understanding. However, the development of models that unify motion-language generation and understanding remains largely underexplored. Existing approaches often fine-tune large language models (LLMs) on paired motion-text data, which can result in catastrophic forgetting of linguistic capabilities due to the limited scale of available text-motion pairs. Furthermore, prior methods typically convert motion into discrete representations via quantization to integrate with language models, introducing substantial jitter artifacts from discrete tokenization. To address these challenges, we propose LLaMo, a unified framework that extends pretrained LLMs through a modality-specific Mixture-of-Transformers (MoT) architecture. This design inherently preserves the language understanding of the base model while enabling scalable multimodal adaptation. We encode human motion into a causal continuous latent space and maintain the next-token prediction paradigm in the decoder-only backbone through a lightweight flow-matching head, allowing for streaming motion generation in real time (>30 FPS). Leveraging the comprehensive language understanding of pretrained LLMs and large-scale motion-text pretraining, our experiments demonstrate that LLaMo achieves high-fidelity text-to-motion generation and motion-to-text captioning in general settings, especially zero-shot motion generation, marking a significant step towards a general unified motion-language large model.

[41] Synthetic Image Detection with CLIP: Understanding and Assessing Predictive Cues

Marco Willi,Melanie Mathys,Michael Graber

Main category: cs.CV

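TL;DR: This paper introduces the SynthCLIC dataset to assess how well CLIP generalizes in synthetic image detection (SID) and finds that CLIP-based detectors rely mainly on high-level photographic attributes rather than overt artifacts, while generalizing unevenly across generative models.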

Details Motivation: Existing SID methods generalize poorly to new generative models, and although CLIP works well for SID, the cues it relies on are unclear: overt visual artifacts or subtle semantic biases? Its robustness and practical limits need to be clarified. Method: The authors build SynthCLIC, a paired dataset of real photographs and high-quality synthetic counterparts from recent diffusion models, and analyze the cues CLIP features use for SID with an interpretable linear head with de-correlated activations and a text-grounded concept model. Result: CLIP-based linear detectors reach 0.96 mAP on a GAN-based benchmark but only 0.92 on SynthCLIC, and generalization across generator families drops as low as 0.37 mAP; detection relies mainly on high-level photographic attributes (such as minimalist style, lens flare, and depth layering) rather than low-level artifacts. Conclusion: CLIP is a strong basis for SID, but its generalization is limited by the diversity of generative models; continual model updates and broader training coverage are needed to build more universal, robust detectors. Abstract: Recent generative models produce near-photorealistic images, challenging the trustworthiness of photographs. Synthetic image detection (SID) has thus become an important area of research. Prior work has highlighted how synthetic images differ from real photographs; unfortunately, SID methods often struggle to generalize to novel generative models and often perform poorly in practical settings. CLIP, a foundational vision-language model which yields semantically rich image-text embeddings, shows strong accuracy and generalization for SID. Yet, the underlying relevant cues embedded in CLIP-features remain unknown. It is unclear whether CLIP-based detectors simply detect strong visual artifacts or exploit subtle semantic biases, both of which would render them useless in practical settings or on generative models of high quality. We introduce SynthCLIC, a paired dataset of real photographs and high-quality synthetic counterparts from recent diffusion models, designed to reduce semantic bias in SID. Using an interpretable linear head with de-correlated activations and a text-grounded concept-model, we analyze what CLIP-based detectors learn. CLIP-based linear detectors reach 0.96 mAP on a GAN-based benchmark but only 0.92 on our high-quality diffusion dataset SynthCLIC, and generalization across generator families drops to as low as 0.37 mAP. We find that the detectors primarily rely on high-level photographic attributes (e.g., minimalist style, lens flare, or depth layering), rather than overt generator-specific artifacts. CLIP-based detectors perform well overall but generalize unevenly across diverse generative architectures. This highlights the need for continual model updates and broader training exposure, while reinforcing CLIP-based approaches as a strong foundation for more universal, robust SID.

[42] Reproducing DragDiffusion: Interactive Point-Based Editing with Diffusion Models

Ali Subhan,Ashir Raza

Main category: cs.CV

TL;DR: This paper is a reproducibility study of DragDiffusion that confirms its central claims and analyzes how key hyperparameters affect performance.

Details Motivation: To verify the reproducibility of DragDiffusion and the robustness of its central claims under different settings. Method: Using the authors' released implementation and the DragBench benchmark, the study reproduces the main ablations on diffusion timestep selection, LoRA fine-tuning, mask regularization strength, and UNet feature supervision, and tests a multi-timestep latent optimization variant. Result: The main trends closely match the original paper; performance is sensitive to the optimized timestep and to the feature level used for motion supervision, while the other components are more robust; multi-timestep optimization does not improve spatial accuracy but substantially increases computational cost. Conclusion: The central claims of DragDiffusion hold, but reliable reproduction depends on sensible settings for a few key hyperparameters. Abstract: DragDiffusion is a diffusion-based method for interactive point-based image editing that enables users to manipulate images by directly dragging selected points. The method claims that accurate spatial control can be achieved by optimizing a single diffusion latent at an intermediate timestep, together with identity-preserving fine-tuning and spatial regularization. This work presents a reproducibility study of DragDiffusion using the authors' released implementation and the DragBench benchmark. We reproduce the main ablation studies on diffusion timestep selection, LoRA-based fine-tuning, mask regularization strength, and UNet feature supervision, and observe close agreement with the qualitative and quantitative trends reported in the original work. At the same time, our experiments show that performance is sensitive to a small number of hyperparameter assumptions, particularly the optimized timestep and the feature level used for motion supervision, while other components admit broader operating ranges. We further evaluate a multi-timestep latent optimization variant and find that it does not improve spatial accuracy while substantially increasing computational cost. Overall, our findings support the central claims of DragDiffusion while clarifying the conditions under which they are reliably reproducible. Code is available at https://github.com/AliSubhan5341/DragDiffusion-TMLR-Reproducibility-Challenge.

[43] What does RL improve for Visual Reasoning? A Frankenstein-Style Analysis

Xirui Li,Ming Li,Tianyi Zhou

Main category: cs.CV

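TL;DR: This paper proposes a "Frankenstein-style" analysis framework that uses causal probing, parameter comparison, and model merging to show what reinforcement learning (RL) actually contributes to visual reasoning in vision-language models: not a uniform improvement in visual perception, but a systematic refinement of mid-to-late transformer computation that improves vision-to-reasoning alignment and reasoning performance.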

Details Motivation: Existing work cannot easily tell which capabilities RL improves compared with supervised fine-tuning used as cold-start initialization (IN); end-to-end benchmark gains mix several factors and cannot be attributed to specific skills. Method: A Frankenstein-style analysis framework comprising (i) functional localization via causal probing, (ii) update characterization via parameter comparison, and (iii) a transferability test via model merging. Result: RL mainly induces a consistent inference-time shift in mid-to-late layers; these mid-to-late refinements are transferable through model merging and are necessary for the RL gains (freezing them removes the gains). Conclusion: RL's reliable contribution to visual reasoning is not a uniform enhancement of visual perception but a systematic refinement of mid-to-late transformer computation that improves vision-to-reasoning alignment and reasoning performance, which highlights the limits of benchmark-only evaluation for understanding multimodal reasoning improvements. Abstract: Reinforcement learning (RL) with verifiable rewards has become a standard post-training stage for boosting visual reasoning in vision-language models, yet it remains unclear what capabilities RL actually improves compared with supervised fine-tuning as cold-start initialization (IN). End-to-end benchmark gains conflate multiple factors, making it difficult to attribute improvements to specific skills. To bridge the gap, we propose a Frankenstein-style analysis framework including: (i) functional localization via causal probing; (ii) update characterization via parameter comparison; and (iii) transferability test via model merging. Instead, RL induces a consistent inference-time shift primarily in mid-to-late layers, and these mid-to-late refinements are both transferable (via merging) and necessary (via freezing) for RL gains. Overall, our results suggest that RL's reliable contribution in visual reasoning is not a uniform enhancement of visual perception, but a systematic refinement of mid-to-late transformer computation that improves vision-to-reasoning alignment and reasoning performance, highlighting the limitations of benchmark-only evaluation for understanding multimodal reasoning improvements.

[44] ZeroDiff++: Substantial Unseen Visual-semantic Correlation in Zero-shot Learning

Zihan Ye,Shreyank N Gowda,Kaile Du,Weijian Luo,Ling Shao

Main category: cs.CV

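TL;DR: This paper proposes ZeroDiff++, a diffusion-based generative zero-shot learning framework that uses diffusion augmentation, supervised contrastive representations, multi-view discriminators, test-time adaptation, and partially synthesized feature generation to mitigate spurious visual-semantic correlations and stay robust when data is scarce.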

Details Motivation: Existing generative zero-shot learning methods suffer from spurious visual-semantic correlations, which worsen when seen-class samples are scarce; in addition, generators driven by fixed noise produce features disconnected from real test samples, which aggravates the problem. Method: ZeroDiff++ uses diffusion augmentation, supervised contrastive representations, and multi-view discriminators with Wasserstein mutual learning during training; at inference it adds DiffTTA (diffusion-based test-time adaptation) and DiffGen (diffusion-based test-time generation) for pseudo-label reconstruction and partially synthesized feature generation. Result: On three ZSL benchmarks the method clearly outperforms existing approaches and remains robust when training data is scarce. Conclusion: Through diffusion modeling and test-time optimization, ZeroDiff++ improves the quality of visual-semantic correlation modeling and offers a new approach to generative zero-shot learning. Abstract: Zero-shot Learning (ZSL) enables classifiers to recognize classes unseen during training, commonly via generative two-stage methods: (1) learn visual-semantic correlations from seen classes; (2) synthesize unseen class features from semantics to train classifiers. In this paper, we identify spurious visual-semantic correlations in existing generative ZSL, worsened by scarce seen-class samples, and introduce two metrics to quantify spuriousness for seen and unseen classes. Furthermore, we point out a more critical bottleneck: existing unadaptive fully noised generators produce features disconnected from real test samples, which also leads to the spurious correlation. To enhance the visual-semantic correlations on both seen and unseen classes, we propose ZeroDiff++, a diffusion-based generative framework. In training, ZeroDiff++ uses (i) diffusion augmentation to produce diverse noised samples, (ii) supervised contrastive (SC) representations for instance-level semantics, and (iii) multi-view discriminators with Wasserstein mutual learning to assess generated features. At generation time, we introduce (iv) Diffusion-based Test time Adaptation (DiffTTA) to adapt the generator using pseudo label reconstruction, and (v) Diffusion-based Test time Generation (DiffGen) to trace the diffusion denoising path and produce partially synthesized features that connect real and generated data, further mitigating data scarcity. Extensive experiments on three ZSL benchmarks demonstrate that ZeroDiff++ not only achieves significant improvements over existing ZSL methods but also maintains robust performance even with scarce training data. Code will be made available.

[45] MonoLoss: A Training Objective for Interpretable Monosemantic Representations

Ali Nasiri-Sarvi,Anh Tien Nguyen,Hassan Rivaz,Dimitris Samaras,Mahdi S. Hosseini

Main category: cs.CV

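TL;DR: This paper proposes MonoLoss, a new loss for sparse autoencoders (SAEs) that improves the monosemanticity of representations by using an efficiently computed MonoScore as the training signal; across several vision models it raises monosemanticity and class purity and yields gains in fine-tuning.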

Details Motivation: Standard SAE training objectives only weakly constrain monosemantic decomposition, and mainstream monosemanticity metrics such as MonoScore are expensive to compute (O(n²)), which makes them impractical as feedback during training. Method: (1) derive a single-pass, linear-time (O(n)) algorithm for MonoScore; (2) build the Monosemanticity Loss (MonoLoss) on this metric as a plug-in training objective; (3) validate it across SAE architectures (BatchTopK, TopK, JumpReLU) and vision features (CLIP, SigLIP2, ViT); (4) use it as an auxiliary regularizer when fine-tuning ResNet-50 and CLIP-ViT-B/32. Result: On OpenImagesV7 the method achieves up to 1200x faster evaluation and 159x faster training; MonoLoss raises MonoScore and class purity (at most from 0.152 to 0.723); fine-tuning on ImageNet-1K gains up to 0.6% accuracy and produces more monosemantic activation patterns. Conclusion: MonoLoss is an efficient, general, plug-in method for improving monosemanticity that moves neural representations toward interpretable monosemantic features and is useful in both representation learning and model fine-tuning. Abstract: Sparse autoencoders (SAEs) decompose polysemantic neural representations, where neurons respond to multiple unrelated concepts, into monosemantic features that capture single, interpretable concepts. However, standard training objectives only weakly encourage this decomposition, and existing monosemanticity metrics require pairwise comparisons across all dataset samples, making them inefficient during training and evaluation. We study a recent MonoScore metric and derive a single-pass algorithm that computes exactly the same quantity, but with a cost that grows linearly, rather than quadratically, with the number of dataset images. On OpenImagesV7, we achieve up to a 1200x wall-clock speedup in evaluation and 159x during training, while adding only ~4% per-epoch overhead. This allows us to treat MonoScore as a training signal: we introduce the Monosemanticity Loss (MonoLoss), a plug-in objective that directly rewards semantically consistent activations for learning interpretable monosemantic representations. Across SAEs trained on CLIP, SigLIP2, and pretrained ViT features, using BatchTopK, TopK, and JumpReLU SAEs, MonoLoss increases MonoScore for most latents. MonoLoss also consistently improves class purity (the fraction of a latent's activating images belonging to its dominant class) across all encoder and SAE combinations, with the largest gain raising baseline purity from 0.152 to 0.723. Used as an auxiliary regularizer during ResNet-50 and CLIP-ViT-B/32 finetuning, MonoLoss yields up to 0.6% accuracy gains on ImageNet-1K and monosemantic activating patterns on standard benchmark datasets. The code is publicly available at https://github.com/AtlasAnalyticsLab/MonoLoss.
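
The abstract does not spell out the MonoScore formula, so the following is only an illustration of how a pairwise metric can be collapsed into a single pass. It assumes the score is an activation-weighted average of pairwise dot products between image embeddings; under that assumption, the double sum over pairs equals the squared norm of the weighted embedding sum minus the self-pair terms. Function names and data are hypothetical.

```python
def pairwise_score_quadratic(acts, embs):
    """Activation-weighted mean similarity over all ordered pairs i != j (O(n^2))."""
    num = den = 0.0
    n = len(acts)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            w = acts[i] * acts[j]
            num += w * sum(x * y for x, y in zip(embs[i], embs[j]))
            den += w
    return num / den if den else 0.0

def pairwise_score_linear(acts, embs):
    """Same quantity in one pass (O(n)), using ||sum a_i e_i||^2 minus self terms."""
    dim = len(embs[0])
    weighted_sum = [0.0] * dim
    sum_a = sum_a2 = self_term = 0.0
    for a, e in zip(acts, embs):
        for d in range(dim):
            weighted_sum[d] += a * e[d]
        sum_a += a
        sum_a2 += a * a
        self_term += a * a * sum(x * x for x in e)
    num = sum(x * x for x in weighted_sum) - self_term
    den = sum_a * sum_a - sum_a2
    return num / den if den else 0.0

# Hypothetical latent activations and unit-norm image embeddings.
acts = [1.0, 0.5, 0.0, 2.0]
embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.6, 0.8]]

slow = pairwise_score_quadratic(acts, embs)
fast = pairwise_score_linear(acts, embs)
```

Because the single-pass form only needs running sums, it can be accumulated batch by batch, which is what makes a metric of this shape usable as a training signal.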

[46] Prototype-driven fusion of pathology and spatial transcriptomics for interpretable survival prediction

Lihe Liu,Xiaoxi Pan,Yinyin Yuan,Lulu Shang

Main category: cs.CV

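TL;DR: This paper proposes PathoSpatial, a framework that integrates paired whole slide images (WSIs) and spatial transcriptomics (ST) data through a multi-level experts architecture and task-guided prototype learning to produce interpretable, discriminative survival prognosis models.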

Details Motivation: As paired WSI and ST cohorts grow to population scale, principled cross-modal fusion strategies that exploit their complementary spatial signals are needed to improve prognostic modeling. Method: PathoSpatial uses a multi-level experts architecture with task-guided prototype learning to adaptively coordinate unsupervised within-modality discovery and supervised cross-modal aggregation, and it supports end-to-end joint learning on co-registered WSI and ST data. Result: On a paired triple-negative breast cancer cohort, PathoSpatial performs strongly on five survival endpoints, matching or exceeding leading unimodal and multimodal methods, and supports post-hoc prototype interpretation and molecular risk decomposition. Conclusion: PathoSpatial offers a workable approach to scalable, interpretable fusion of spatial omics and pathology and shows the potential of cross-modal modeling for precise prognosis. Abstract: Whole slide images (WSIs) enable weakly supervised prognostic modeling via multiple instance learning (MIL). Spatial transcriptomics (ST) preserves in situ gene expression, providing a spatial molecular context that complements morphology. As paired WSI-ST cohorts scale to population level, leveraging their complementary spatial signals for prognosis becomes crucial; however, principled cross-modal fusion strategies remain limited for this paradigm. To this end, we introduce PathoSpatial, an interpretable end-to-end framework integrating co-registered WSIs and ST to learn spatially informed prognostic representations. PathoSpatial uses task-guided prototype learning within a multi-level experts architecture, adaptively orchestrating unsupervised within-modality discovery with supervised cross-modal aggregation. By design, PathoSpatial substantially strengthens interpretability while maintaining discriminative ability. We evaluate PathoSpatial on a triple-negative breast cancer cohort with paired ST and WSIs. PathoSpatial delivers strong and consistent performance across five survival endpoints, achieving superior or comparable performance to leading unimodal and multimodal methods. PathoSpatial inherently enables post-hoc prototype interpretation and molecular risk decomposition, providing quantitative, biologically grounded explanations that highlight candidate prognostic factors. We present PathoSpatial as a proof-of-concept for scalable and interpretable multimodal learning for spatial omics-pathology fusion.

[47] Semantic-aware Adversarial Fine-tuning for CLIP

Jiacheng Zhang,Jinhao Li,Hanxun Huang,Sarah M. Erfani,Benjamin I. P. Rubinstein,Feng Liu

Main category: cs.CV

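TL;DR: This paper proposes Semantic-aware Adversarial Fine-Tuning (SAFT), which generates stronger adversarial examples with a semantic-ensemble attack and uses them to improve CLIP's adversarial robustness in zero-shot classification.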

Details Motivation: Adversarial examples generated from the cosine similarity between an image and a single template do not adequately measure the semantic similarity of image-text pairs, so a CLIP image encoder fine-tuned on them is not robust enough. Method: A semantic-ensemble attack uses a foundation model to generate and refine multiple textual descriptions and minimizes the average similarity between the original image and this set of descriptions to produce semantic-aware adversarial examples; the CLIP image encoder is then adversarially fine-tuned on these examples (SAFT). Result: Experiments on 16 datasets show that SAFT substantially improves zero-shot adversarial robustness and outperforms current methods. Conclusion: Semantically rich textual descriptions expose CLIP's vulnerabilities better than hand-crafted templates, and SAFT, which builds on them, effectively strengthens its robustness. Abstract: Recent studies have shown that the CLIP model's adversarial robustness in zero-shot classification tasks can be enhanced by adversarially fine-tuning its image encoder with adversarial examples (AEs), which are generated by minimizing the cosine similarity between images and a hand-crafted template (e.g., "A photo of a {label}"). However, it has been shown that the cosine similarity between a single image and a single hand-crafted template is insufficient to measure the similarity for image-text pairs. Building on this, in this paper, we find that the AEs generated using cosine similarity may fail to fool CLIP when the similarity metric is replaced with semantically enriched alternatives, making the image encoder fine-tuned with these AEs less robust. To overcome this issue, we first propose a semantic-ensemble attack to generate semantic-aware AEs by minimizing the average similarity between the original image and an ensemble of refined textual descriptions. These descriptions are initially generated by a foundation model to capture core semantic features beyond hand-crafted templates and are then refined to reduce hallucinations. To this end, we propose Semantic-aware Adversarial Fine-Tuning (SAFT), which fine-tunes CLIP's image encoder with semantic-aware AEs. Extensive experiments show that SAFT outperforms current methods, achieving substantial improvements in zero-shot adversarial robustness across 16 datasets. Our code is available at: https://github.com/tmlr-group/SAFT.

[48] A Lightweight and Explainable DenseNet-121 Framework for Grape Leaf Disease Classification

Md. Ehsanul Haque,Md. Saymon Hosen Polash,Rakib Hasan Ovi,Aminul Kader Bulbul,Md Kamrul Siam,Tamim Hasan Saykat

Main category: cs.CV

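TL;DR: This paper proposes a grape leaf disease classifier based on an optimized DenseNet121, combined with domain-specific preprocessing and Grad-CAM explanations; it outperforms several CNN baselines (accuracy 99.27%, F1 score 99.28%) with low computational cost and good generalization, making it suitable for disease identification in sustainable vineyard management.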

Details Motivation: Grape diseases such as Bacterial Rot, Downy Mildew, and Powdery Mildew seriously affect grape yield and quality, so early and precise identification is needed; existing automated methods based on YOLO are computationally costly and lack interpretability, which hinders practical use. Method: An optimized DenseNet121 model is combined with domain-specific image preprocessing that emphasizes veins, edges, and lesions; Grad-CAM provides visual explanations; transfer learning and model optimization improve robustness on small, unbalanced data and inference efficiency. Result: The model outperforms the baselines on several metrics: accuracy 99.27%, F1 score 99.28%, specificity 99.71%, Kappa 98.86%, with an inference time of 9 seconds; mean cross-validation accuracy is 99.12%; Grad-CAM confirms that the model attends to physiologically relevant regions. Conclusion: The framework combines high accuracy, good generalization, low computational cost, and interpretability, offering a reliable and scalable diagnostic solution for sustainable vineyard management and real-world deployment. Abstract: Grapes are among the most economically and culturally significant fruits on a global scale, and table grapes and wine are produced in significant quantities in Europe and Asia. The production and quality of grapes are significantly impacted by grape diseases such as Bacterial Rot, Downy Mildew, and Powdery Mildew. Consequently, the sustainable management of a vineyard necessitates the early and precise identification of these diseases. Current automated methods, particularly those that are based on the YOLO framework, are often computationally costly and lack interpretability, which makes them unsuitable for real-world scenarios. This study proposes grape leaf disease classification using Optimized DenseNet 121. Domain-specific preprocessing and extensive connectivity reveal disease-relevant characteristics, including veins, edges, and lesions. An extensive comparison with baseline CNN models, including ResNet18, VGG16, AlexNet, and SqueezeNet, demonstrates that the proposed model exhibits superior performance. It achieves an accuracy of 99.27%, an F1 score of 99.28%, a specificity of 99.71%, and a Kappa of 98.86%, with an inference time of 9 seconds. The cross-validation findings show a mean accuracy of 99.12%, indicating strength and generalizability across all classes. We also employ Grad-CAM to highlight disease-related regions, both to verify that the model attends to physiologically relevant regions and to increase transparency and confidence. Model optimization reduces processing requirements for real-time deployment, while transfer learning ensures consistency on smaller and unbalanced samples. An effective architecture, domain-specific preprocessing, and interpretable outputs make the proposed framework scalable, precise, and computationally inexpensive for detecting grape leaf diseases.

[49] Human-Like Coarse Object Representations in Vision Models

Andrey Gizdov,Andrea Procopio,Yichen Li,Daniel Harari,Tomer Ullman

Main category: cs.CV

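TL;DR: This paper examines whether segmentation models can acquire the coarse volumetric body representations that humans use in intuitive physics. Model-human behavioral alignment follows an inverse U-shaped curve, suggesting that resource constraints rather than bespoke priors give rise to human-like coarse representations.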

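Details Motivation: Humans use coarse, volumetric object representations (which smooth concavities) in intuitive physical reasoning, whereas existing segmentation models optimize pixel-accurate masks that may not match them; the paper asks whether and when segmentation models spontaneously acquire human-like coarse object representations. Method: Using a time-to-collision (TTC) behavioral paradigm, the authors build a comparison pipeline and alignment metric, and systematically vary model training time, size, and degree of pruning to change effective capacity. Result: Under every manipulation, model-human alignment follows an inverse U-shaped curve: small, briefly trained, or pruned models under-segment into blobs; large, fully trained models over-segment and produce boundary wiggles; only intermediate-granularity models best match human behavior. Conclusion: Human-like coarse object representations arise from computational resource constraints rather than bespoke cognitive biases; simple measures such as early checkpoints, modest architectures, or light pruning can elicit representations suited to physical reasoning. Abstract: Humans appear to represent objects for intuitive physics with coarse, volumetric "bodies" that smooth concavities - trading fine visual details for efficient physical predictions - yet their internal structure is largely unknown. Segmentation models, in contrast, optimize pixel-accurate masks that may misalign with such bodies. We ask whether and when these models nonetheless acquire human-like bodies. Using a time-to-collision (TTC) behavioral paradigm, we introduce a comparison pipeline and alignment metric, then vary model training time, size, and effective capacity via pruning. Across all manipulations, alignment with human behavior follows an inverse U-shaped curve: small/briefly trained/pruned models under-segment into blobs; large/fully trained models over-segment with boundary wiggles; and an intermediate "ideal body granularity" best matches humans. This suggests human-like coarse bodies emerge from resource constraints rather than bespoke biases, and points to simple knobs - early checkpoints, modest architectures, light pruning - for eliciting physics-efficient representations. We situate these results within resource-rational accounts balancing recognition detail against physical affordances.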

[50] Insertion Network for Image Sequence Correspondence

Dingjie Su,Weixiang Hong,Benoit M. Dawant,Bennett A. Landman

Main category: cs.CV

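TL;DR: This paper proposes a method for 2D image sequence correspondence based on sequence context modeling, used to localize key anatomical slices in CT scans. It models the insertion process with a slice-level attention mechanism and reduces localization error from 8.4 mm to 5.4 mm in the supervised setting.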

Details Motivation: 2D key anatomical slices must be accurately localized within 3D medical volumes to support diagnosis, automatic registration, and segmentation; existing methods such as body part regression ignore sequence context. Method: An insertion network encodes a contextual representation of each slice and uses slice-to-slice attention to model inserting a slice from one sequence into the appropriate position in another. Result: Experiments on body CT data show that the method reduces slice localization error in the supervised setting from 8.4 mm to 5.4 mm, outperforming the current state-of-the-art body part regression. Conclusion: Modeling slice correspondence with context from the whole sequence is more effective than predicting each slice independently, giving a more accurate approach to content navigation in medical images. Abstract: We propose a novel method for establishing correspondence between two sequences of 2D images. One particular application of this technique is slice-level content navigation, where the goal is to localize specific 2D slices within a 3D volume or determine the anatomical coverage of a 3D scan based on its 2D slices. This serves as an important preprocessing step for various diagnostic tasks, as well as for automatic registration and segmentation pipelines. Our approach builds sequence correspondence by training a network to learn how to insert a slice from one sequence into the appropriate position in another. This is achieved by encoding contextual representations of each slice and modeling the insertion process using a slice-to-slice attention mechanism. We apply this method to localize manually labeled key slices in body CT scans and compare its performance to the current state-of-the-art alternative known as body part regression, which predicts anatomical position scores for individual slices. Unlike body part regression, which treats each slice independently, our method leverages contextual information from the entire sequence. Experimental results show that the insertion network reduces slice localization errors in supervised settings from 8.4 mm to 5.4 mm, demonstrating a substantial improvement in accuracy.
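
To make the idea of "inserting" a slice via slice-to-slice attention concrete, here is a minimal sketch under strong assumptions: slice embeddings are given as plain vectors (in the paper they would come from a learned contextual encoder), scoring is scaled dot-product attention, and the predicted position is the arg-max of the attention weights. None of these function names come from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]

def insertion_distribution(query, sequence):
    """Attention weights of one query slice over every slice of another sequence."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in sequence]
    return softmax(scores)

def predicted_position(query, sequence):
    """Index in `sequence` where the query slice is most likely to belong."""
    probs = insertion_distribution(query, sequence)
    return max(range(len(probs)), key=probs.__getitem__)
```

Because every slice of the target sequence competes in the same softmax, the prediction depends on the whole sequence, in contrast to scoring each slice independently.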

[51] Layer-Specific Fine-Tuning for Improved Negation Handling in Medical Vision-Language Models

Ali Abbasi,Mehdi Taghipour,Rahmatollah Beheshti

Main category: cs.CV

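TL;DR: This paper proposes Negation-Aware Selective Training (NAST), which uses causal tracing effects (CTEs) to guide vision-language models in recognizing negated statements in medical imaging reports, improving discrimination between affirmative and negated clinical statements.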

Details Motivation: Vision-language models (VLMs) often fail to distinguish affirmative from negated statements in clinical reports, yet negation is a key linguistic phenomenon in medical text that affects diagnostic accuracy and patient safety. Method: The authors build a radiology-specific negation diagnostic benchmark and a contextual clinical negation dataset, and propose NAST, an interpretability-guided fine-tuning method based on causal tracing effects (CTEs) that scales gradient updates according to each layer's causal contribution to negation processing. Result: NAST improves the model's discrimination of affirmative and negated clinical statements without degrading general vision-language alignment. Conclusion: Turning causal interpretability signals into an optimization rule enables targeted model adaptation for high-stakes medical settings and offers a new route to more reliable medical VLMs. Abstract: Negation is a fundamental linguistic operation in clinical reporting, yet vision-language models (VLMs) frequently fail to distinguish affirmative from negated medical statements. To systematically characterize this limitation, we introduce a radiology-specific diagnostic benchmark that evaluates polarity sensitivity under controlled clinical conditions, revealing that common medical VLMs consistently confuse negated and non-negated findings. To enable learning beyond simple condition absence, we further construct a contextual clinical negation dataset that encodes structured claims and supports attribute-level negations involving location and severity. Building on these resources, we propose Negation-Aware Selective Training (NAST), an interpretability-guided adaptation method that uses causal tracing effects (CTEs) to modulate layer-wise gradient updates during fine-tuning. Rather than applying uniform learning rates, NAST scales each layer's update according to its causal contribution to negation processing, transforming mechanistic interpretability signals into a principled optimization rule. Experiments demonstrate improved discrimination of affirmative and negated clinical statements without degrading general vision-language alignment, highlighting the value of causal interpretability for targeted model adaptation in safety-critical medical settings. Code and resources are available at https://github.com/healthylaife/NAST.
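
The abstract says only that each layer's update is scaled by its causal contribution to negation processing. The sketch below illustrates one way such a rule could look; the normalization by the maximum score, the `floor` parameter, and both function names are assumptions made for illustration, not details of NAST itself.

```python
def layer_lr_multipliers(cte_scores, floor=0.1):
    """Map per-layer causal tracing effects to multipliers in [floor, 1].

    Assumed rule: the layer with the largest effect gets multiplier 1,
    layers with zero or negative effect get `floor`.
    """
    top = max(cte_scores)
    if top <= 0:
        return [floor for _ in cte_scores]
    return [floor + (1.0 - floor) * max(score, 0.0) / top for score in cte_scores]

def scaled_sgd_step(params, grads, base_lr, multipliers):
    """One plain SGD step where each layer uses base_lr * its multiplier."""
    return [
        [w - base_lr * m * g for w, g in zip(layer_w, layer_g)]
        for layer_w, layer_g, m in zip(params, grads, multipliers)
    ]
```

In a deep learning framework, the same effect is usually obtained by placing each layer in its own optimizer parameter group with a different learning rate.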

[52] Matching of SAR and optical images based on transformation to shared modality

Alexey Borisov,Evgeny Myasnikov,Vladislav Myasnikov

Main category: cs.CV

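TL;DR: This paper proposes transforming optical and SAR images into a new shared modality and then applying an existing state-of-the-art matching model such as RoMa for accurate cross-modal co-registration, without retraining for the new modality; it outperforms existing methods on the MultiSenGE dataset.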

Details Motivation: Optical and SAR images differ greatly because of their imaging physics, which makes precise co-registration difficult. Method: The authors design a shared image modality satisfying three conditions (a fixed number of channels, transformed images that are as similar as possible, and non-degeneracy so that key features are preserved), then match in that modality with a RoMa model that is either fine-tuned or used directly as pre-trained. Result: On the MultiSenGE dataset the method outperforms alternatives based on image translation and on conventional feature matching, with higher matching quality and broader applicability. Conclusion: The shared-modality transformation framework bridges the gap between optical and SAR images and allows plug-and-play reuse of general-purpose image matching models, improving the performance and practicality of cross-modal remote sensing image registration. Abstract: Significant differences in optical images and Synthetic Aperture Radar (SAR) images are caused by fundamental differences in the physical principles underlying their acquisition by Earth remote sensing platforms. These differences make precise image matching (co-registration) of these two types of images difficult. In this paper, we propose a new approach to image matching of optical and SAR images, which is based on transforming the images to a new modality. The new image modality is common to both optical and SAR images and satisfies the following conditions. First, the transformed images must have an equal pre-defined number of channels. Second, the transformed and co-registered images must be as similar as possible. Third, the transformed images must be non-degenerate, meaning they must preserve the significant features of the original images. To further match images transformed to this shared modality, we train the RoMa image matching model, which is one of the leading solutions for matching of regular digital photographs. We evaluated the proposed approach on the publicly available MultiSenGE dataset containing both optical and SAR images. We demonstrated its superiority over alternative approaches based on image translation between original modalities and various feature matching algorithms. The proposed solution not only provides better quality of matching, but is also more versatile. It enables the use of ready-made RoMa and DeDoDe models, pre-trained for regular images, without retraining for a new modality, while maintaining high-quality matching of optical and SAR images.

[53] LiDAR-Anchored Collaborative Distillation for Robust 2D Representations

Wonjun Jo,Hyunwoo Ha,Kim Ji-Yeon,Hawook Jeong,Tae-Hyun Oh

Main category: cs.CV

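TL;DR: This paper proposes Collaborative Distillation, a self-supervised learning method that uses 3D LiDAR data as a self-supervision signal to make 2D image encoders more robust to adverse weather and noise while retaining their original capabilities and improving 3D awareness.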

Details Motivation: Pre-trained 2D image encoders perform poorly under adverse weather and noisy conditions and fall short of the needs of robust visual perception. Method: The proposed Collaborative Distillation uses 3D LiDAR data as a self-supervision signal to guide the training of the 2D image encoder. Result: The method outperforms existing methods on various downstream tasks and under different environmental conditions, showing strong generalization and improved 3D awareness. Conclusion: The method makes 2D image encoders more practical and adaptable in complex real-world scenes, particularly for vision systems with high robustness requirements such as autonomous driving. Abstract: As deep learning continues to advance, self-supervised learning has made considerable strides. It allows 2D image encoders to extract useful features for various downstream tasks, including those related to vision-based systems. Nevertheless, pre-trained 2D image encoders fall short in conducting these tasks under noisy and adverse weather conditions beyond clear daytime scenes, which require robust visual perception. To address these issues, we propose a novel self-supervised approach, Collaborative Distillation, which leverages 3D LiDAR as self-supervision to improve robustness to noisy and adverse weather conditions in 2D image encoders while retaining their original capabilities. Our method outperforms competing methods in various downstream tasks across diverse conditions and exhibits strong generalization ability. In addition, our method also improves 3D awareness stemming from LiDAR's characteristics. This advancement highlights our method's practicality and adaptability in real-world scenarios.

[54] Geometric Stratification for Singular Configurations of the P3P Problem via Local Dual Space

Xueying Sun,Zijia Li,Nan Li

Main category: cs.CV

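TL;DR: This paper studies singular configurations of the P3P problem. Using local dual space, it proposes an algebraic-computational framework that gives a geometric stratification with respect to the multiplicity μ of the camera center O: for μ≥2, O lies on the "danger cylinder"; for μ≥3, O lies on one of three generatrices of the danger cylinder (associated with the first Morley triangle or the circumcircle); for μ≥4, O lies on the circumcircle, which corresponds to infinitely many solutions. It also analyzes the geometric stratification of the complementary configuration O′.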

Details Motivation: Singular configurations of the P3P problem affect the uniqueness and stability of solutions, so their geometric structure needs systematic characterization. Method: An algebraic-computational framework built on local dual space is used to carry out the geometric stratification. Result: The paper gives a complete geometric stratification of the camera center O by multiplicity μ and reveals the corresponding stratification for the complementary configuration O′. Conclusion: The framework systematically exposes the geometry underlying P3P singular configurations and provides a theoretical basis for understanding the multiplicity and stability of solutions. Abstract: This paper investigates singular configurations of the P3P problem. Using local dual space, a systematic algebraic-computational framework is proposed to give a complete geometric stratification for the P3P singular configurations with respect to the multiplicity $\mu$ of the camera center $O$: for $\mu \ge 2$, $O$ lies on the "danger cylinder"; for $\mu \ge 3$, $O$ lies on one of three generatrices of the danger cylinder associated with the first Morley triangle or the circumcircle; and for $\mu \ge 4$, $O$ lies on the circumcircle, which indeed corresponds to infinitely many P3P solutions. Furthermore, a geometric stratification for the complementary configuration $O^\prime$ associated with a singular configuration $O$ is studied as well: for $\mu \ge 2$, $O^\prime$ lies on a deltoidal surface associated with the danger cylinder, and for $\mu \ge 3$, $O^\prime$ lies on one of three cuspidal curves of the deltoidal surface.

[55] Self-Supervised JEPA-based World Models for LiDAR Occupancy Completion and Forecasting

Haoran Zhu,Anna Choromanska

Main category: cs.CV

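TL;DR: This paper proposes AD-LiST-JEPA, a self-supervised world model for autonomous driving that is built on the JEPA framework and learns from LiDAR data to predict the spatiotemporal evolution of the environment; the quality of its representations is evaluated on a downstream occupancy completion and forecasting (OCF) task.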

Details Motivation: Autonomous driving needs world models that capture the spatiotemporal evolution of the environment to support long-term planning, learned in a scalable self-supervised way; JEPA offers a way to learn world models from large volumes of unlabeled data without human annotation. Method: AD-LiST-JEPA is a self-supervised world model based on the joint-embedding predictive architecture (JEPA) that takes LiDAR data as input and predicts future spatiotemporal states; the quality of the learned representations is evaluated on a downstream LiDAR occupancy completion and forecasting (OCF) task. Result: Proof-of-concept experiments show that a JEPA-pretrained encoder outperforms a model without pretraining on the OCF task. Conclusion: AD-LiST-JEPA supports the effectiveness of the JEPA framework for world modeling in autonomous driving and offers a feasible route to self-supervised learning of high-quality spatiotemporal representations. Abstract: Autonomous driving, as an agent operating in the physical world, requires the fundamental capability to build world models that capture how the environment evolves spatiotemporally in order to support long-term planning. At the same time, scalability demands learning such models in a self-supervised manner; joint-embedding predictive architecture (JEPA) enables learning world models via leveraging large volumes of unlabeled data without relying on expensive human annotations. In this paper, we propose AD-LiST-JEPA, a self-supervised world model for autonomous driving that predicts future spatiotemporal evolution from LiDAR data using a JEPA framework. We evaluate the quality of the learned representations through a downstream LiDAR-based occupancy completion and forecasting (OCF) task, which jointly assesses perception and prediction. Proof-of-concept experiments show better OCF performance with an encoder pretrained through JEPA-based world model learning.

[56] PLLM: Pseudo-Labeling Large Language Models for CAD Program Synthesis

Yuanbo Li,Dule Shu,Yanying Chen,Matt Klenk,Daniel Ritchie

Main category: cs.CV

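TL;DR: PLLM is a self-training framework for CAD program synthesis that needs no paired data. Starting from a pre-trained CAD-capable LLM and unlabeled 3D shapes, it iteratively samples, selects, and augments programs to build synthetic data for fine-tuning the model.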

Details Motivation: Existing CAD program synthesis methods depend on paired shape-program supervision, which is scarce and hard to obtain, limiting practical use. Method: PLLM is a self-training framework: starting from a pre-trained CAD-capable LLM and unlabeled 3D shapes, it iteratively (1) samples candidate programs, (2) selects high-fidelity executions, and (3) augments programs to build synthetic program-shape pairs for subsequent fine-tuning. Result: In experiments adapting CAD-Recode to the unlabeled ABC dataset, PLLM improves the geometric fidelity and diversity of the generated programs. Conclusion: PLLM reduces the dependence of CAD program synthesis on paired supervision and offers a feasible route to unsupervised or weakly supervised CAD modeling. Abstract: Recovering Computer-Aided Design (CAD) programs from 3D geometries is a widely studied problem. Recent advances in large language models (LLMs) have enabled progress in CAD program synthesis, but existing methods rely on supervised training with paired shape-program data, which is often unavailable. We introduce PLLM, a self-training framework for CAD program synthesis from unlabeled 3D shapes. Given a pre-trained CAD-capable LLM and a shape dataset, PLLM iteratively samples candidate programs, selects high-fidelity executions, and augments programs to construct synthetic program-shape pairs for fine-tuning. Experiments on adapting CAD-Recode from DeepCAD to the unlabeled ABC dataset show consistent improvements in geometric fidelity and program diversity.

[57] The Constant Eye: Benchmarking and Bridging Appearance Robustness in Autonomous Driving

Jiabao Wang,Hongyu Zhou,Yuanbo Yang,Jiahao Shao,Yiyi Liao

Main category: cs.CV

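TL;DR: This paper presents the navdream benchmark, which uses generative pixel-aligned style transfer to build a visual stress test that separates the effect of appearance changes from that of structural changes. It also designs a universal perception interface based on a frozen visual foundation model (DINOv3) that extracts appearance-invariant features, giving robust zero-shot generalization across driving planning paradigms.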

Details Motivation: Current autonomous driving algorithms are fragile under out-of-distribution (OOD) conditions, and research does not separate appearance shifts (such as weather and lighting) from structural scene changes, so it is unclear whether planning failures stem from complex road geometry or from appearance alone. Method: The authors build navdream, a high-fidelity robustness benchmark that uses generative pixel-aligned style transfer to create test sets with perturbed appearance but almost unchanged geometry; they also propose a universal perception interface based on a frozen DINOv3 that extracts appearance-invariant features as a stable input to the planner. Result: Existing planning algorithms degrade significantly under OOD appearance conditions; the proposed interface achieves strong zero-shot generalization across regression-based, diffusion-based, and scoring-based planning paradigms, and is plug-and-play with no fine-tuning. Conclusion: An appearance-invariant perception interface decouples appearance from structural factors and improves the robustness and generalization of autonomous driving planning in varied real-world conditions. Abstract: Despite rapid progress, autonomous driving algorithms remain notoriously fragile under Out-of-Distribution (OOD) conditions. We identify a critical decoupling failure in current research: the lack of distinction between appearance-based shifts, such as weather and lighting, and structural scene changes. This leaves a fundamental question unanswered: Is the planner failing because of complex road geometry, or simply because it is raining? To resolve this, we establish navdream, a high-fidelity robustness benchmark leveraging generative pixel-aligned style transfer. By creating a visual stress test with negligible geometric deviation, we isolate the impact of appearance on driving performance. Our evaluation reveals that existing planning algorithms often show significant degradation under OOD appearance conditions, even when the underlying scene structure remains consistent. To bridge this gap, we propose a universal perception interface leveraging a frozen visual foundation model (DINOv3). By extracting appearance-invariant features as a stable interface for the planner, we achieve exceptional zero-shot generalization across diverse planning paradigms, including regression-based, diffusion-based, and scoring-based models. Our plug-and-play solution maintains consistent performance across extreme appearance shifts without requiring further fine-tuning. The benchmark and code will be made available.

[58] Unbiased Gradient Estimation for Event Binning via Functional Backpropagation

Jinze Chen,Wei Zhai,Han Han,Tiankai Ma,Yang Cao,Bin Li,Zheng-Jun Zha

Main category: cs.CV

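TL;DR: This paper proposes a framework for unbiased gradient estimation of arbitrary binning functions on event camera data. By synthesizing weak derivatives during backpropagation, it improves learning efficiency and performance on event-based vision tasks.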

Details Motivation: In existing event data processing, the binning operation truncates gradients or biases their estimates, limiting the learning efficiency of event-based algorithms. Method: Integration by parts is used to lift the target functions to functionals, yielding a weak-derivative form for the binning function; the cotangent function is approximated by reconstructing it from the sampled cotangent vector, giving unbiased gradient estimates for both smooth and non-smooth targets. Result: The method yields 3.2% lower RMS error and 1.57× faster convergence on egomotion estimation, 9.4% lower EPE on self-supervised optical flow, and 5.1% lower RMS error on SLAM. Conclusion: The method addresses the gradient bias caused by binning event data and improves both the performance and training efficiency of several downstream event-based vision tasks. Abstract: Event-based vision encodes dynamic scenes as asynchronous spatio-temporal spikes called events. To leverage conventional image processing pipelines, events are typically binned into frames. However, binning functions are discontinuous, which truncates gradients at the frame level and forces most event-based algorithms to rely solely on frame-based features. Attempts to directly learn from raw events avoid this restriction but instead suffer from biased gradient estimation due to the discontinuities of the binning operation, ultimately limiting their learning efficiency. To address this challenge, we propose a novel framework for unbiased gradient estimation of arbitrary binning functions by synthesizing weak derivatives during backpropagation while keeping the forward output unchanged. The key idea is to exploit integration by parts: lifting the target functions to functionals yields an integral form of the derivative of the binning function during backpropagation, where the cotangent function naturally arises. By reconstructing this cotangent function from the sampled cotangent vector, we compute weak derivatives that provably match long-range finite differences of both smooth and non-smooth targets. Experimentally, our method improves simple optimization-based egomotion estimation with 3.2% lower RMS error and 1.57× faster convergence. On complex downstream tasks, we achieve 9.4% lower EPE in self-supervised optical flow, and 5.1% lower RMS error in SLAM, demonstrating broad benefits for event-based visual perception. Source code can be found at https://github.com/chjz1024/EventFBP.

[59] QuEPT: Quantized Elastic Precision Transformers with One-Shot Calibration for Multi-Bit Switching

Ke Xu,Yixin Wang,Zhongcheng Li,Hao Cui,Jinshui Hu,Xingyi Zhang

Main category: cs.CV

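TL;DR: This paper proposes QuEPT, an efficient post-training quantization scheme that reconstructs block-wise multi-bit errors with one-shot calibration, supports dynamic switching between bit-widths and between uniform and mixed-precision quantization, and introduces MB-ToMe and MB-CLoRA to improve accuracy and robustness.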

Details Motivation: The high storage and optimization costs of the Transformer architecture have limited research on elastic quantization, particularly for large language models; existing methods struggle to combine flexible multi-bit deployment with low overhead. Method: The QuEPT framework comprises block-wise one-shot calibration, cascaded low-rank adapters (MB-CLoRA) for dynamic adaptation to bit-widths, and multi-bit token merging (MB-ToMe) to improve feature consistency across bit-widths. Result: QuEPT matches or exceeds current state-of-the-art post-training quantization methods on multiple benchmarks and supports real-time bit-width switching without repeated optimization. Conclusion: QuEPT achieves high-performing, flexible elastic multi-bit quantization at very low overhead, offering a new approach to efficient deployment of large models. Abstract: Elastic precision quantization enables multi-bit deployment via a single optimization pass, fitting diverse quantization scenarios. Yet, due to the high storage and optimization costs associated with the Transformer architecture, research on elastic quantization remains limited, particularly for large language models. This paper proposes QuEPT, an efficient post-training scheme that reconstructs block-wise multi-bit errors with one-shot calibration on a small data slice. It can dynamically adapt to various predefined bit-widths by cascading different low-rank adapters, and supports real-time switching between uniform quantization and mixed precision quantization without repeated optimization. To enhance accuracy and robustness, we introduce Multi-Bit Token Merging (MB-ToMe) to dynamically fuse token features across different bit-widths, improving robustness during bit-width switching. Additionally, we propose Multi-Bit Cascaded Low-Rank adapters (MB-CLoRA) to strengthen correlations between bit-width groups, further improving the overall performance of QuEPT. Extensive experiments demonstrate that QuEPT achieves performance comparable to or better than existing state-of-the-art post-training quantization methods. Our code is available at https://github.com/xuke225/QuEPT

[60] Vision Token Reduction via Attention-Driven Self-Compression for Efficient Multimodal Large Language Models

Omer Faruk Deniz,Ruiyu Mao,Ruochen Li,Yapeng Tian,Latifur Khan

Main category: cs.CV

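TL;DR: This paper proposes Attention-Driven Self-Compression (ADSC), a vision token compression method that adds no extra computation and is compatible with FlashAttention. It progressively and uniformly downsamples vision tokens inside the LLM, relying only on the LLM's own attention mechanism, and substantially reduces FLOPs and KV cache with almost no loss in performance.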

Details Motivation: Existing vision token pruning methods for MLLMs are limited either by the diversity of encoder-projector designs or by incompatibility with FlashAttention; a more general, efficient compression strategy that leaves the original architecture intact is needed. Method: Based on the observation that deeper layers carry vision-to-text information, ADSC applies uniform vision token downsampling at selected LLM layers, forming bottlenecks that force information to be reorganized; it introduces no scoring, auxiliary modules, or changes to the attention mechanism. Result: On LLaVA-1.5, ADSC reduces FLOPs by 53.7% and peak KV cache by 56.7% while preserving 98.2% of the original performance; it outperforms prior pruning methods on multiple benchmarks and is markedly more robust at high compression ratios. Conclusion: ADSC is a simple, general, and efficient vision token compression approach compatible with modern acceleration techniques, and it highlights the effectiveness of the LLM's own attention mechanism as a guide for compression. Abstract: Multimodal Large Language Models (MLLMs) incur significant computational cost from processing numerous vision tokens through all LLM layers. Prior pruning methods operate either before the LLM, limiting generality due to diverse encoder-projector designs, or within the LLM using heuristics that are incompatible with FlashAttention. We take a different approach: rather than identifying unimportant tokens, we treat the LLM itself as the optimal guide for compression. Observing that deeper layers naturally transmit vision-to-text information, we introduce Attention-Driven Self-Compression (ADSC), a simple, broadly applicable method that progressively reduces vision tokens using only the LLM's attention mechanism. Our method applies uniform token downsampling at selected layers, forming bottlenecks that encourage the model to reorganize and compress information into the remaining tokens. It requires no score computation, auxiliary modules, or attention modification, and remains fully compatible with FlashAttention. Applied to LLaVA-1.5, ADSC reduces FLOPs by 53.7% and peak KV-cache memory by 56.7%, while preserving 98.2% of the original model performance. Across multiple benchmarks, it outperforms prior pruning approaches in both efficiency and accuracy. Crucially, under high compression ratios, our method remains robust while heuristic-based techniques degrade sharply.
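
Uniform token downsampling at selected layers can be illustrated without any model at all. The sketch below keeps every `stride`-th vision token and tracks how many vision tokens each layer processes; the stride of 2, the choice of layers, and the function names are illustrative assumptions, not the configuration reported in the paper.

```python
def downsample_uniform(tokens, stride=2):
    """Keep every `stride`-th token; no importance scores are computed."""
    return tokens[::stride]

def vision_token_schedule(num_tokens, num_layers, compress_at, stride=2):
    """Number of vision tokens processed by each layer.

    Downsampling is applied to the input of every layer index in `compress_at`.
    """
    counts = []
    remaining = num_tokens
    for layer in range(num_layers):
        if layer in compress_at:
            remaining = len(range(0, remaining, stride))
        counts.append(remaining)
    return counts
```

Because the selection is a fixed slicing pattern rather than a ranking of attention scores, it never needs access to the attention matrix, which is what keeps this style of reduction compatible with fused attention kernels such as FlashAttention.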

[61] ImageRAGTurbo: Towards One-step Text-to-Image Generation with Retrieval-Augmented Diffusion Models

Peijie Qiu,Hariharan Ramshankar,Arnau Ramisa,René Vidal,Amit Kumar K C,Vamsi Salaka,Rahul Bhagat

Main category: cs.CV

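TL;DR: This paper proposes ImageRAGTurbo, which fine-tunes few-step diffusion models with retrieval augmentation to improve image quality and prompt alignment while keeping latency low.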

Details Motivation: Existing few-step diffusion models reduce the number of sampling steps but sacrifice image quality and prompt alignment, and they are expensive to train. Method: Retrieved text-image pairs are used to condition generation: a trainable adapter in the UNet's H-space blends the retrieved content with the target prompt through a cross-attention mechanism. Result: On fast text-to-image generation, the method produces high-fidelity images without added latency compared with existing methods. Conclusion: Retrieval-augmented H-space editing combined with adapter fine-tuning balances generation speed, quality, and prompt consistency. Abstract: Diffusion models have emerged as the leading approach for text-to-image generation. However, their iterative sampling process, which gradually morphs random noise into coherent images, introduces significant latency that limits their applicability. While recent few-step diffusion models reduce the number of sampling steps to as few as one to four steps, they often compromise image quality and prompt alignment, especially in one-step generation. Additionally, these models require computationally expensive training procedures. To address these limitations, we propose ImageRAGTurbo, a novel approach to efficiently finetune few-step diffusion models via retrieval augmentation. Given a text prompt, we retrieve relevant text-image pairs from a database and use them to condition the generation process. We argue that such retrieved examples provide rich contextual information to the UNet denoiser that helps reduce the number of denoising steps without compromising image quality. Indeed, our initial investigations show that using the retrieved content to edit the denoiser's latent space ($\mathcal{H}$-space) without additional finetuning already improves prompt fidelity. To further improve the quality of the generated images, we augment the UNet denoiser with a trainable adapter in the $\mathcal{H}$-space, which efficiently blends the retrieved content with the target prompt using a cross-attention mechanism. Experimental results on fast text-to-image generation demonstrate that our approach produces high-fidelity images without compromising latency compared to existing methods.

[62] Multi-Task Learning with Additive U-Net for Image Denoising and Classification

Vikram Lakkavalli,Neelam Sinha

Main category: cs.CV

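TL;DR: This paper proposes AddUNet, which replaces the concatenative skip connections in U-Net with gated additive fusion, improving training stability and performance for image denoising and denoising-centric multi-task learning without increasing model complexity.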

Details Motivation: Concatenative skip connections in U-Net leave information flow uncontrolled and make joint multi-task optimization unstable; the paper explores how structural regularization can benefit multi-task learning. Method: The Additive U-Net (AddUNet) replaces concatenative skip connections with gated additive fusion, constraining skip-path capacity and keeping feature dimensionality fixed; it is evaluated in single-task denoising and joint denoising-classification settings. Result: AddUNet achieves competitive reconstruction performance in both single-task denoising and multi-task learning, with improved training stability; skip weights show a task-aware hierarchical distribution (shallow skips favor reconstruction, deeper features favor discrimination), and reconstruction remains robust even when classification capacity is limited. Conclusion: Simple structural constraints on skip connections act as an effective architectural regularizer that promotes stable, scalable multi-task learning without increasing model complexity. Abstract: We investigate additive skip fusion in U-Net architectures for image denoising and denoising-centric multi-task learning (MTL). By replacing concatenative skips with gated additive fusion, the proposed Additive U-Net (AddUNet) constrains shortcut capacity while preserving fixed feature dimensionality across depth. This structural regularization induces controlled encoder-decoder information flow and stabilizes joint optimization. Across single-task denoising and joint denoising-classification settings, AddUNet achieves competitive reconstruction performance with improved training stability. In MTL, learned skip weights exhibit systematic task-aware redistribution: shallow skips favor reconstruction, while deeper features support discrimination. Notably, reconstruction remains robust even under limited classification capacity, indicating implicit task decoupling through additive fusion. These findings show that simple constraints on skip connections act as an effective architectural regularizer for stable and scalable multi-task learning without increasing model complexity.
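
The difference between concatenative and gated additive skips is easiest to see on a single feature vector. The sketch below assumes the simplest possible gate, one learnable scalar per skip passed through a sigmoid; the abstract does not specify the gate's form, so this parameterization and the function names are assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_additive_fusion(decoder_feat, skip_feat, gate_logit):
    """Fuse a skip into decoder features as d + sigmoid(gate) * s.

    Output has the same length as the decoder features, and the gate
    bounds how much of the encoder skip can pass through.
    """
    gate = sigmoid(gate_logit)
    return [d + gate * s for d, s in zip(decoder_feat, skip_feat)]

def concat_fusion(decoder_feat, skip_feat):
    """Standard U-Net skip: channel count doubles after fusion."""
    return decoder_feat + skip_feat
```

With additive fusion the following layer sees a fixed number of channels at every depth, and the learned gate values can be read off directly as a per-skip weight, which is the kind of quantity the abstract describes being redistributed across tasks.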

[63] CBEN -- A Multimodal Machine Learning Dataset for Cloud Robust Remote Sensing Image Understanding

Marco Stricker,Masakazu Iwamura,Koichi Kise

Main category: cs.CV

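TL;DR: This paper presents the CloudyBigEarthNet (CBEN) dataset for evaluating and improving the robustness of remote sensing models under cloud cover. Experiments show that existing optical+radar fusion methods lose substantial performance on cloudy images, and that training on cloudy data greatly improves their cloud robustness.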

Details Motivation: Existing remote sensing machine learning methods usually exclude cloudy images, so they fail in time-sensitive, cloud-covered scenarios such as natural disasters; cloud-robust methods are needed. Method: The authors build CloudyBigEarthNet (CBEN), a dataset of paired optical and radar images with cloud occlusion, evaluate state-of-the-art optical+radar fusion methods on it, and improve their robustness by training with cloudy optical data. Result: State-of-the-art methods lose 23-33 percentage points of average precision (AP) on cloudy images; after training with cloudy data they gain 17.2-28.7 percentage points on cloudy test cases. Conclusion: Relying on radar alone does not fully avoid the effect of clouds, so cloudy optical data must be included explicitly during training; CBEN provides benchmark data and empirical support for cloud-robust remote sensing models. Abstract: Clouds are a common phenomenon that distorts optical satellite imagery, which poses a challenge for remote sensing. However, in the literature cloudless analysis is often performed where cloudy images are excluded from machine learning datasets and methods. Such an approach cannot be applied to time sensitive applications, e.g., during natural disasters. A possible solution is to apply cloud removal as a preprocessing step to ensure that cloudfree solutions are not failing under such conditions. But cloud removal methods are still actively researched and suffer from drawbacks, such as generated visual artifacts. Therefore, it is desirable to develop cloud robust methods that are less affected by cloudy weather. Cloud robust methods can be achieved by combining optical data with radar, a modality unaffected by clouds. While many datasets for machine learning combine optical and radar data, most researchers exclude cloudy images. We identify this exclusion from machine learning training and evaluation as a limitation that reduces applicability to cloudy scenarios. To investigate this, we assembled a dataset, named CloudyBigEarthNet (CBEN), of paired optical and radar images with cloud occlusion for training and evaluation. Using average precision (AP) as the evaluation metric, we show that state-of-the-art methods trained on combined clear-sky optical and radar imagery suffer performance drops of 23-33 percentage points when evaluated on cloudy images. We then adapt these methods to cloudy optical data during training, achieving a relative improvement of 17.2-28.7 percentage points on cloudy test cases compared with the original approaches. 
Code and dataset are publicly available at: https://github.com/mstricker13/CBEN

[64] IndicFairFace: Balanced Indian Face Dataset for Auditing and Mitigating Geographical Bias in Vision-Language Models

Aarish Shah Mohsin,Mohammed Tayyab Ilyas Khan,Mohammad Nadeem,Shahab Saquib Sohail,Erik Cambria,Jiechao Gao

Main category: cs.CV

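TL;DR: This paper presents the IndicFairFace dataset to address geographical and representational bias against Indian populations in vision-language models (VLMs). It balances face images across India's 28 states and 8 Union Territories and, combined with a post-hoc debiasing method (Iterative Nullspace Projection), reduces geographical bias in CLIP-based models with almost no loss in retrieval performance.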

Details Motivation: Existing fairness datasets treat Indian as a single category and ignore the country's large regional and cultural diversity, which leads to representational and geographical bias; a high-quality fairness dataset reflecting India's internal geographical diversity is needed. Method: The authors build IndicFairFace, a dataset of 14,400 images covering all 28 states and 8 Union Territories of India, balanced across region and gender; use it to quantify geographical bias in prominent CLIP-based VLMs; apply post-hoc Iterative Nullspace Projection for debiasing; and measure the effect of debiasing on retrieval accuracy across datasets. Result: IndicFairFace is built as the first fair face benchmark targeting India's geographical diversity; prominent CLIP models show clear regional bias; after debiasing, geographical bias is reduced and average retrieval accuracy on benchmark datasets drops by less than 1.5%. Conclusion: IndicFairFace is the first benchmark for studying geographical bias in VLMs in the Indian context; the debiasing approach balances fairness with utility and supports fine-grained, multi-region fairness research. Abstract: Vision-Language Models (VLMs) are known to inherit and amplify societal biases from their web-scale training data, with Indians being particularly misrepresented. Existing fairness-aware datasets have significantly improved demographic balance across global race and gender groups, yet they continue to treat Indian as a single monolithic category. The oversimplification ignores the vast intra-national diversity across 28 states and 8 Union Territories of India and leads to representational and geographical bias. To address the limitation, we present IndicFairFace, a novel and balanced face dataset comprising 14,400 images representing the geographical diversity of India. Images were sourced ethically from Wikimedia Commons and open-license web repositories and uniformly balanced across states and gender. Using IndicFairFace, we quantify intra-national geographical bias in prominent CLIP-based VLMs and reduce it using the post-hoc Iterative Nullspace Projection debiasing approach. We also show that the adopted debiasing approach does not adversely impact the existing embedding space, as the average drop in retrieval accuracy on benchmark datasets is less than 1.5 percent. Our work establishes IndicFairFace as the first benchmark to study geographical bias in VLMs for the Indian context.
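
The core operation in nullspace-projection debiasing is removing from each embedding its component along a direction that predicts the protected attribute. The sketch below shows only that projection step. In full Iterative Nullspace Projection the directions come from linear classifiers that are refit on the already-projected embeddings in each round; here they are passed in as hypothetical inputs, and the function names are not from the paper.

```python
def project_out(vectors, direction):
    """Remove from every vector its component along `direction`."""
    norm_sq = sum(d * d for d in direction)
    projected = []
    for vec in vectors:
        coeff = sum(a * b for a, b in zip(vec, direction)) / norm_sq
        projected.append([a - coeff * b for a, b in zip(vec, direction)])
    return projected

def apply_projections(vectors, directions):
    """Apply the projection once per direction, in order.

    Assumes each direction was estimated on embeddings that had the
    earlier directions removed, so the directions are close to orthogonal.
    """
    for direction in directions:
        vectors = project_out(vectors, direction)
    return vectors
```

After projection, a linear classifier along the removed direction can no longer separate the embeddings, while components orthogonal to it are left unchanged, which is why retrieval accuracy can stay close to its original value.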

[65] Motion Prior Distillation in Time Reversal Sampling for Generative Inbetweening

Wooseok Jeon,Seunghyun Shin,Dongmin Shin,Hae-Gon Jeon

Main category: cs.CV

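TL;DR: This paper proposes Motion Prior Distillation (MPD), which distills the motion residual of the forward path into the backward path at inference time to reduce motion inconsistency between the two generation paths, improving the temporal consistency of frames generated between keyframes with image-to-video models.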

Details Motivation: Existing inference-time sampling strategies, which fuse forward and backward paths in parallel or alternate between them, rely on motion priors from different conditioning frames, so the generated paths are mismatched in motion, causing temporal discontinuities and visual artifacts. Method: Motion Prior Distillation (MPD) distills the motion residual of the forward path into the backward path at inference time, and avoids denoising the end-conditioned path to reduce path ambiguity. Result: The method improves quantitative results on standard benchmarks, and extensive user studies confirm that it produces more temporally coherent inbetweening results in practical scenarios. Conclusion: MPD is a simple yet effective inference-time distillation technique that reduces bidirectional motion mismatch and improves the temporal consistency and visual quality of I2V models on generative inbetweening. Abstract: Recent progress in image-to-video (I2V) diffusion models has significantly advanced the field of generative inbetweening, which aims to generate semantically plausible frames between two keyframes. In particular, inference-time sampling strategies, which leverage the generative priors of large-scale pre-trained I2V models without additional training, have become increasingly popular. However, existing inference-time sampling, either fusing forward and backward paths in parallel or alternating them sequentially, often suffers from temporal discontinuities and undesirable visual artifacts due to the misalignment between the two generated paths. This is because each path follows the motion prior induced by its own conditioning frame. In this work, we propose Motion Prior Distillation (MPD), a simple yet effective inference-time distillation technique that suppresses bidirectional mismatch by distilling the motion residual of the forward path into the backward path. Our method deliberately avoids denoising the end-conditioned path, which causes path ambiguity, and yields more temporally coherent inbetweening results with the forward motion prior. We not only perform quantitative evaluations on standard benchmarks, but also conduct extensive user studies to demonstrate the effectiveness of our approach in practical scenarios.

[66] Channel-Aware Probing for Multi-Channel Imaging

Umar Marikkar,Syed Sameed Husain,Muhammad Awais,Sara Atito

Main category: cs.CV

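TL;DR: This paper proposes Channel-Aware Probing (CAP), which uses Independent Feature Encoding (IFE) and Decoupled Pooling (DCP) to exploit the channel diversity of Multi-Channel Imaging (MCI) data, substantially improving probing performance with a frozen pre-trained encoder and narrowing the gap to full fine-tuning.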

Details Motivation: Existing MCI vision encoders are hard to reuse across channel configurations; probing with frozen encoders is underexplored, and strategies transferred directly from other domains perform poorly, even worse than training from scratch. Method: Channel-Aware Probing (CAP), consisting of Independent Feature Encoding (IFE), which encodes each channel separately, and Decoupled Pooling (DCP), which pools within channels before aggregating across channels. Result: On three MCI benchmarks, CAP consistently outperforms the default probing protocol, matches fine-tuning from scratch, and substantially narrows the gap to full fine-tuning. Conclusion: CAP effectively exploits the intrinsic channel diversity of MCI data and provides an efficient, robust probing approach for downstream tasks with frozen pre-trained MCI encoders. Abstract: Training and evaluating vision encoders on Multi-Channel Imaging (MCI) data remains challenging as channel configurations vary across datasets, preventing fixed-channel training and limiting reuse of pre-trained encoders on new channel settings. Prior work trains MCI encoders but typically evaluates them via full fine-tuning, leaving probing with frozen pre-trained encoders comparatively underexplored. Existing studies that perform probing largely focus on improving representations, rather than how to best leverage fixed representations for downstream tasks. Although the latter problem has been studied in other domains, directly transferring those strategies to MCI yields weak results, even worse than training from scratch. We therefore propose Channel-Aware Probing (CAP), which exploits the intrinsic inter-channel diversity in MCI datasets by controlling feature flow at both the encoder and probe levels. CAP uses Independent Feature Encoding (IFE) to encode each channel separately, and Decoupled Pooling (DCP) to pool within channels before aggregating across channels. Across three MCI benchmarks, CAP consistently improves probing performance over the default probing protocol, matches fine-tuning from scratch, and largely reduces the gap to full fine-tuning from the same MCI pre-trained checkpoints. Code can be found at https://github.com/umarikkar/CAP.
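
The two-stage idea behind Decoupled Pooling (pool within each channel, then aggregate across channels) can be sketched as follows. The abstract does not specify the pooling operators, so the choice of a mean within channels and a max across channels is an assumption made purely for illustration, as are the function name and the toy tensor shape.

```python
import numpy as np

def decoupled_pooling(tokens: np.ndarray) -> np.ndarray:
    """tokens: (channels, patches, dim) features from per-channel encoding.

    Stage 1 pools over patches inside each channel; stage 2 aggregates
    the per-channel vectors. Both operators are assumed, not from the paper.
    """
    per_channel = tokens.mean(axis=1)   # (channels, dim): within-channel pooling
    return per_channel.max(axis=0)      # (dim,): across-channel aggregation

# Toy input: 2 channels, 3 patches per channel, 4-dimensional features.
tokens = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)
pooled = decoupled_pooling(tokens)
```

Because the second stage reduces over the channel axis, the output dimension does not depend on how many channels a dataset provides, which is the property that matters when channel configurations vary.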

[67] ART3mis: Ray-Based Textual Annotation on 3D Cultural Objects

Vasileios Arampatzakis,Vasileios Sevetlidis,Fotis Arnaoutoglou,Athanasios Kalogeras,Christos Koulamas,Aris Lalos,Chairi Kiourt,George Ioannakis,Anestis Koutsoudis,George Pavlidis

Main category: cs.CV

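TL;DR: This paper introduces ART3mis, a general-purpose, user-friendly textual annotation tool for 3D objects in the cultural heritage domain, which supports interactive annotation directly on the surface and is suited to practitioners without a 3D technical background.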

Details Motivation: Archaeologists and cultural heritage experts need applications that go beyond simple 3D visualisation and offer advanced functions such as attaching metadata to regions, whereas existing solutions are mostly confined to specific domains and lack generality and ease of use. Method: ART3mis takes a user-driven, direct-on-surface annotation approach, handles highly detailed artefact models in real time, and stores textual annotations for multiple regions in JSON format. Result: Real-time interactive annotation of complex 3D cultural objects, allowing non-technical users to segment and annotate 3D models easily. Conclusion: ART3mis is a general-purpose, easy-to-use 3D annotation tool aimed at real cultural heritage workflows, bridging the gap between professional needs and technical barriers. Abstract: Beyond simplistic 3D visualisations, archaeologists, as well as cultural heritage experts and practitioners, need applications with advanced functionalities, such as the annotation and attachment of metadata onto particular regions of the 3D digital objects. Various approaches have been presented to tackle this challenge, most of which achieve excellent results in the domain of their application. However, they are often confined to that specific domain and particular problem. In this paper, we present ART3mis, a general-purpose, user-friendly, interactive textual annotation tool for 3D objects. Primarily attuned to aid cultural heritage conservators, restorers and curators with no technical skills in 3D imaging and graphics, the tool allows for the easy handling, segmenting and annotating of 3D digital replicas of artefacts. ART3mis applies a user-driven, direct-on-surface approach. It can handle detailed 3D cultural objects in real-time and store textual annotations for multiple complex regions in JSON data format.

[68] VimRAG: Navigating Massive Visual Context in Retrieval-Augmented Generation via Multimodal Memory Graph

Qiuchen Wang,Shihang Wang,Yu Zeng,Qiang Zhang,Fanrui Zhang,Zhuoning Guo,Bosi Zhang,Wenxuan Huang,Lin Chen,Zehui Chen,Pengjun Xie,Ruixue Ding

Main category: cs.CV

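TL;DR: This paper proposes VimRAG, a framework that models the multimodal reasoning process as a dynamic directed acyclic graph and introduces Graph-Modulated Visual Memory Encoding and Graph-Guided Policy Optimization, substantially improving multimodal retrieval-augmented reasoning.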

Details Motivation: Traditional RAG methods rely on linear interaction histories and perform poorly on long-context tasks, especially iterative reasoning over visual data that is information-sparse yet token-heavy; a more effective multimodal reasoning architecture is needed. Method: VimRAG: (1) model the reasoning process as a dynamic directed acyclic graph that structures agent states and multimodal evidence; (2) a Graph-Modulated Visual Memory Encoding mechanism that scores node importance by topological position and dynamically allocates high-resolution tokens; (3) Graph-Guided Policy Optimization, which prunes nodes tied to redundant actions to enable fine-grained credit assignment. Result: Consistently state-of-the-art performance on multiple multimodal RAG benchmarks. Conclusion: Through graph-structured memory and dynamic token allocation, VimRAG improves multimodal retrieval-augmented reasoning and offers a new approach to complex vision-language reasoning. Abstract: Effectively retrieving, reasoning, and understanding multimodal information remains a critical challenge for agentic systems. Traditional Retrieval-augmented Generation (RAG) methods rely on linear interaction histories, which struggle to handle long-context tasks, especially those involving information-sparse yet token-heavy visual data in iterative reasoning scenarios. To bridge this gap, we introduce VimRAG, a framework tailored for multimodal Retrieval-augmented Reasoning across text, images, and videos. Inspired by our systematic study, we model the reasoning process as a dynamic directed acyclic graph that structures the agent states and retrieved multimodal evidence. Building upon this structured memory, we introduce a Graph-Modulated Visual Memory Encoding mechanism, with which the significance of memory nodes is evaluated via their topological position, allowing the model to dynamically allocate high-resolution tokens to pivotal evidence while compressing or discarding trivial clues. To implement this paradigm, we propose a Graph-Guided Policy Optimization strategy. This strategy disentangles step-wise validity from trajectory-level rewards by pruning memory nodes associated with redundant actions, thereby facilitating fine-grained credit assignment. Extensive experiments demonstrate that VimRAG consistently achieves state-of-the-art performance on diverse multimodal RAG benchmarks. The code is available at https://github.com/Alibaba-NLP/VRAG.

[69] SPRig: Self-Supervised Pose-Invariant Rigging from Mesh Sequences

Ruipeng Wang,Langkun Zhong,Miaowei Wang

Main category: cs.CV

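TL;DR: This paper proposes SPRig, a framework that learns pose-invariant rigging through cross-frame consistency losses, addressing the topological inconsistencies that conventional rigging methods produce on sequential data (such as animal motion capture or AIGC-generated meshes) lacking a canonical T-pose.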

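Details Motivation: State-of-the-art rigging methods rely on a canonical T-pose as the rest pose and break down on sequential data without it (e.g., animal motion capture, video- or AIGC-derived meshes); applied frame by frame, they are pose-sensitive and topologically inconsistent across frames. Method: SPRig, a general fine-tuning framework that adds cross-frame consistency losses on top of existing rigging models to learn pose-invariant rigs, validated with a new permutation-invariant stability protocol. Result: Experiments show SOTA temporal stability; SPRig produces coherent rigs from challenging sequences and greatly reduces the artifacts seen in baseline methods. Conclusion: SPRig is a general fine-tuning framework that improves the cross-frame consistency and robustness of rigging for mesh sequences, applicable to dynamic geometry without a canonical rest pose. Abstract: State-of-the-art rigging methods assume a canonical rest pose, an assumption that fails for sequential data (e.g., animal motion capture or AIGC/video-derived mesh sequences) that lack the T-pose. Applied frame-by-frame, these methods are not pose-invariant and produce topological inconsistencies across frames. Thus, we propose SPRig, a general fine-tuning framework that enforces cross-frame consistency losses to learn pose-invariant rigs on top of existing models. We validate our approach on rigging using a new permutation-invariant stability protocol. Experiments demonstrate SOTA temporal stability: our method produces coherent rigs from challenging sequences and dramatically reduces the artifacts that plague baseline methods. The code will be released publicly upon acceptance.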

[70] Synthetic Craquelure Generation for Unsupervised Painting Restoration

Jana Cuch-Guillén,Antonio Agudo,Raül Pérez-Gonzalo

Main category: cs.CV

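TL;DR: This paper proposes a framework for detecting and restoring craquelure (fine cracks) in paintings without pixel-level annotations, combining synthetic data generation, morphological detection, a LoRA-adapted SegFormer, and anisotropic diffusion inpainting to achieve high-fidelity, zero-shot restoration of old paintings.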

Details Motivation: Cultural heritage preservation calls for non-invasive digital restoration, but craquelure is entangled with complex brushstrokes and pixel-level annotations are scarce, so existing methods struggle to identify and restore it accurately. Method: Build a domain-specific synthetic crack generator driven by Bézier trajectories; combine a classical morphological detector with a LoRA-adapted SegFormer; feed the detection map as an input spatial prior and use a masked hybrid loss with logit adjustment to focus optimization on crack regions; finally, use the refined masks to guide anisotropic diffusion for content reconstruction. Result: In zero-shot settings the pipeline significantly outperforms existing photographic restoration models, accurately detecting and restoring craquelure while preserving the original brushwork. Conclusion: The fully annotation-free framework addresses data scarcity and structural confusion in craquelure restoration and offers a generalizable, high-fidelity approach to digital conservation of cultural heritage. Abstract: Cultural heritage preservation increasingly demands non-invasive digital methods for painting restoration, yet identifying and restoring fine craquelure patterns from complex brushstrokes remains challenging due to scarce pixel-level annotations. We propose a fully annotation-free framework driven by a domain-specific synthetic craquelure generator, which simulates realistic branching and tapered fissure geometry using Bézier trajectories. Our approach couples a classical morphological detector with a learning-based refinement module: a SegFormer backbone adapted via Low-Rank Adaptation (LoRA). Uniquely, we employ a detector-guided strategy, injecting the morphological map as an input spatial prior, while a masked hybrid loss and logit adjustment constrain the training to focus specifically on refining candidate crack regions. The refined masks subsequently guide an Anisotropic Diffusion inpainting stage to reconstruct missing content. Experimental results demonstrate that our pipeline significantly outperforms state-of-the-art photographic restoration models in zero-shot settings, while faithfully preserving the original paint brushwork.
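
To make the "Bézier trajectories with tapered geometry" idea concrete, the sketch below samples one cubic Bézier curve and assigns it a width that shrinks along its length. It is only a toy illustration: the control points, the linear taper, and the function names are assumptions, and the paper's generator (including branching) is not reproduced here.

```python
import numpy as np

def cubic_bezier(p0, p1, p2, p3, n=50):
    """Sample n points on a cubic Bézier curve with control points p0..p3."""
    t = np.linspace(0.0, 1.0, n)[:, None]
    p0, p1, p2, p3 = (np.asarray(p, dtype=float) for p in (p0, p1, p2, p3))
    return ((1 - t) ** 3 * p0
            + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2
            + t ** 3 * p3)

def tapered_width(n=50, w_start=3.0, w_end=0.5):
    """Linearly shrink the crack width along the curve (assumed taper profile)."""
    return np.linspace(w_start, w_end, n)

# Hypothetical control points for a single fissure, in pixel coordinates.
points = cubic_bezier((0, 0), (10, 25), (30, -5), (40, 20))
widths = tapered_width()
```

Rasterizing each sampled point as a disk of the corresponding width would yield a crack mask that can be overlaid on a clean painting to form a synthetic training pair.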

[71] ReBA-Pred-Net: Weakly-Supervised Regional Brain Age Prediction on MRI

Shuai Shao,Yan Wang,Shu Jiang,Shiyuan Zhao,Xinzhe Luo,Di Yang,Jiangtao Wang,Yutong Bai,Jianguo Zhang

Main category: cs.CV

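TL;DR: This paper proposes ReBA-Pred-Net, a Teacher-Student framework for fine-grained regional brain age (ReBA) estimation, and introduces two indirect metrics, HCS and NDC, to assess its statistical and factual consistency.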

Details Motivation: Whole brain age (WBA) estimation is too coarse to support disease characterization or research on development and aging patterns, because the relevant changes are mostly regional rather than brain-wide; a robust, generalizable regional brain age (ReBA) model is needed. Method: The Teacher-Student framework ReBA-Pred-Net: the Teacher generates soft ReBA to guide the Student's predictions, with a clinical-prior consistency constraint (regions with the same function should change similarly). Evaluation uses two new metrics: Healthy Control Similarity (HCS) and Neuro Disease Correlation (NDC). Result: Experiments on multiple backbones show that the method outperforms baselines in both statistical consistency (HCS) and factual consistency (NDC). Conclusion: ReBA-Pred-Net yields more reliable and clinically interpretable regional brain age estimates, providing a new tool for brain health assessment, disease mechanism research, and personalized intervention. Abstract: Brain age has become a prominent biomarker of brain health. Yet most prior work targets whole brain age (WBA), a coarse paradigm that struggles to support tasks such as disease characterization and research on development and aging patterns, because relevant changes are typically region-selective rather than brain-wide. Therefore, robust regional brain age (ReBA) estimation is critical, yet a widely generalizable model has yet to be established. In this paper, we propose the Regional Brain Age Prediction Network (ReBA-Pred-Net), a Teacher-Student framework designed for fine-grained brain age estimation. The Teacher produces soft ReBA to guide the Student to yield reliable ReBA estimates with a clinical-prior consistency constraint (regions within the same function should change similarly). For rigorous evaluation, we introduce two indirect metrics: Healthy Control Similarity (HCS), which assesses statistical consistency by testing whether regional brain-age-gap (ReBA minus chronological age) distributions align between training and unseen HC; and Neuro Disease Correlation (NDC), which assesses factual consistency by checking whether clinically confirmed patients show elevated brain-age-gap in disease-associated regions. Experiments across multiple backbones demonstrate the statistical and factual validity of our method.

[72] Towards reconstructing experimental sparse-view X-ray CT data with diffusion models

Nelas J. Thomsen,Xinyuan Wang,Felix Lucka,Ezgi Demircan-Tureyen

Main category: cs.CV

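TL;DR: This paper studies diffusion-based image generators for sparse-view X-ray CT reconstruction and finds that both domain shift between training and real data and forward model mismatch strongly affect reconstruction quality; severe domain shift causes model collapse and hallucinations, yet diverse priors outperform a single well-matched prior; forward model mismatch causes artifacts that annealed likelihood schedules can mitigate.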

Details Motivation: Diffusion models perform well on synthetic data, but their applicability to real experimental data (such as physical CT scans) is unclear, so the effects of domain shift and forward model mismatch need to be examined. Method: Acquire real CT data from a physical Shepp-Logan phantom and train diffusion priors on synthetic images with varying degrees of domain shift; apply Decomposed Diffusion Sampling to sparse-view CT data sets of increasing difficulty, with annealed likelihood schedules to handle forward model mismatch. Result: The effect of domain shift is non-monotonic: severe mismatch leads to collapse and hallucinations, while moderately diverse priors outperform precisely matched ones; forward model mismatch pulls samples away from the prior manifold and produces artifacts, which annealed likelihood schedules mitigate while also improving computational efficiency. Conclusion: Performance gains on synthetic data do not transfer directly to experimental data, and future work must validate against real-world benchmarks. Abstract: Diffusion-based image generators are promising priors for ill-posed inverse problems like sparse-view X-ray Computed Tomography (CT). As most studies consider synthetic data, it is not clear whether training data mismatch ("domain shift") or forward model mismatch complicates their successful application to experimental data. We measured CT data from a physical phantom resembling the synthetic Shepp-Logan phantom and trained diffusion priors on synthetic image data sets with different degrees of domain shift towards it. Then, we employed the priors in a Decomposed Diffusion Sampling scheme on sparse-view CT data sets with increasing difficulty leading to the experimental data. Our results reveal that domain shift plays a nuanced role: while severe mismatch causes model collapse and hallucinations, diverse priors outperform well-matched but narrow priors. Forward model mismatch pulls the image samples away from the prior manifold, which causes artifacts but can be mitigated with annealed likelihood schedules that also increase computational efficiency. Overall, we demonstrate that performance gains do not immediately translate from synthetic to experimental data, and future development must validate against real-world benchmarks.

[73] Towards complete digital twins in cultural heritage with ART3mis 3D artifacts annotator

Dimitrios Karamatskos,Vasileios Arampatzakis,Vasileios Sevetlidis,Stavros Nousias,Athanasios Kalogeras,Christos Koulamas,Aris Lalos,George Pavlidis

Main category: cs.CV

TL;DR: This paper introduces ART3mis, a web-based tool for annotating 3D digital artifacts with textual metadata, designed for non-technical cultural heritage professionals and compliant with W3C Web Annotation standards.

Details Motivation: Archaeologists and cultural heritage professionals need advanced tools for annotating and attaching metadata to specific regions of 3D digital artifacts, beyond basic visualization—existing tools lack generalization and interoperability. Method: Development of ART3mis, a general-purpose, interactive, web-based annotation tool for 3D objects, aligned with the W3C Web Annotation Data Model. Result: A user-friendly, feature-rich tool enabling non-technical users (e.g., conservators, restorers, curators) to easily handle, segment, and annotate 3D digital replicas while supporting communication, distribution, and reuse of annotated information. Conclusion: ART3mis fills a gap by providing a generalized, interoperable, and accessible solution for textual annotation of 3D cultural heritage artifacts, enhancing collaboration and knowledge sharing in the field. Abstract: Archaeologists, as well as specialists and practitioners in cultural heritage, require applications with additional functions, such as the annotation and attachment of metadata to specific regions of the 3D digital artifacts, to go beyond the simplistic three-dimensional (3D) visualization. Different strategies addressed this issue, most of which are excellent in their particular area of application, but their capacity is limited to their design's purpose; they lack generalization and interoperability. This paper introduces ART3mis, a general-purpose, user-friendly, feature-rich, interactive web-based textual annotation tool for 3D objects. Moreover, it enables the communication, distribution, and reuse of information as it complies with the W3C Web Annotation Data Model. It is primarily designed to help cultural heritage conservators, restorers, and curators who lack technical expertise in 3D imaging and graphics, handle, segment, and annotate 3D digital replicas of artifacts with ease.

[74] PixelRush: Ultra-Fast, Training-Free High-Resolution Image Generation via One-step Diffusion

Hong-Phuc Lai,Phong Nguyen,Anh Tran

Main category: cs.CV

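TL;DR: PixelRush is the first tuning-free framework for efficient high-resolution text-to-image generation; with improved patch-based denoising and a seamless blending strategy, it generates 4K images in about 20 seconds, 10 to 35 times faster than existing methods, while maintaining high quality.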

Details Motivation: Pre-trained diffusion models are limited by their native training resolution, and existing training-free high-resolution methods are computationally expensive (more than five minutes per 4K image), so an efficient, high-quality solution is needed. Method: Building on the patch-based inference paradigm, use low-step patch denoising that requires no repeated inversion and regeneration; design a seamless blending strategy to reduce patch-stitching artifacts; introduce noise injection to suppress over-smoothing. Result: About 20 seconds per 4K image, a 10× to 35× speedup over SOTA methods, with better visual fidelity. Conclusion: PixelRush shows that fast, high-quality, high-resolution text-to-image generation is possible without model fine-tuning or heavy computation, offering a practical approach for real applications. Abstract: Pre-trained diffusion models excel at generating high-quality images but remain inherently limited by their native training resolution. Recent training-free approaches have attempted to overcome this constraint by introducing interventions during the denoising process; however, these methods incur substantial computational overhead, often requiring more than five minutes to produce a single 4K image. In this paper, we present PixelRush, the first tuning-free framework for practical high-resolution text-to-image generation. Our method builds upon the established patch-based inference paradigm but eliminates the need for multiple inversion and regeneration cycles. Instead, PixelRush enables efficient patch-based denoising within a low-step regime. To address artifacts introduced by patch blending in few-step generation, we propose a seamless blending strategy. Furthermore, we mitigate over-smoothing effects through a noise injection mechanism. PixelRush delivers exceptional efficiency, generating 4K images in approximately 20 seconds, representing a 10× to 35× speedup over state-of-the-art methods, while maintaining superior visual fidelity. Extensive experiments validate both the performance gains and the quality of outputs achieved by our approach.

[75] Bootstrapping MLLM for Weakly-Supervised Class-Agnostic Object Counting

Xiaowen Zhang,Zijie Yue,Yong Luo,Cairong Zhao,Qijun Chen,Miaojing Shi

Main category: cs.CV

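TL;DR: This paper proposes WS-COC, the first weakly-supervised class-agnostic object counting framework based on a multimodal large language model (MLLM); using only image-level count labels and three strategies (dialogue-based range determination, count ranking optimization, and global-local fusion), it matches or even surpasses fully-supervised methods on multiple datasets.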

Details Motivation: Most weakly-supervised counting methods are limited to a single category (e.g., people), and fully-supervised methods depend on costly point-level annotations; an efficient weakly-supervised framework that handles multiple categories with only image-level count labels is needed. Method: Three strategies: (1) divide-and-discern dialogue tuning, which guides the MLLM to narrow the count range step by step through multi-round dialogue; (2) compare-and-rank count optimization, which trains the MLLM to rank multiple images by their counts; (3) global-and-local counting enhancement, which fuses local and global predictions to improve performance in dense scenes. Result: On FSC-147, CARPK, PUCPR+, and ShanghaiTech, WS-COC matches or surpasses many advanced fully-supervised methods while greatly reducing annotation cost. Conclusion: WS-COC demonstrates the effectiveness of MLLMs for weakly-supervised, class-agnostic object counting and offers a new approach to general counting with low annotation cost. Abstract: Object counting is a fundamental task in computer vision, with broad applicability in many real-world scenarios. Fully-supervised counting methods require costly point-level annotations per object. A few weakly-supervised methods leverage only image-level object counts as supervision and achieve fairly promising results. They are, however, often limited to counting a single category, e.g. person. In this paper, we propose WS-COC, the first MLLM-driven weakly-supervised framework for class-agnostic object counting. Instead of directly fine-tuning MLLMs to predict object counts, which can be challenging due to the modality gap, we incorporate three simple yet effective strategies to bootstrap the counting paradigm in both training and testing: First, a divide-and-discern dialogue tuning strategy is proposed to guide the MLLM to determine whether the object count falls within a specific range and progressively break down the range through multi-round dialogue. Second, a compare-and-rank count optimization strategy is introduced to train the MLLM to optimize the relative ranking of multiple images according to their object counts. Third, a global-and-local counting enhancement strategy aggregates and fuses local and global count predictions to improve counting performance in dense scenes. Extensive experiments on FSC-147, CARPK, PUCPR+, and ShanghaiTech show that WS-COC matches or even surpasses many state-of-the-art fully-supervised methods while significantly reducing annotation costs. Code is available at https://github.com/viscom-tongji/WS-COC.
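
The divide-and-discern strategy (asking whether the count lies in a range, then shrinking the range over several dialogue rounds) resembles a bisection search. The sketch below illustrates that reading only; it is not the authors' procedure. The `in_range` callable is a hypothetical stand-in for the MLLM's yes/no answer, and here it is replaced by an oracle that knows the true count.

```python
from typing import Callable

def narrow_count(in_range: Callable[[int, int], bool], lo: int, hi: int) -> int:
    """Bisect [lo, hi] with yes/no range questions until one count remains.

    in_range(a, b) answers whether the object count lies in [a, b] inclusive.
    """
    while lo < hi:
        mid = (lo + hi) // 2
        if in_range(lo, mid):   # count is in the lower half
            hi = mid
        else:                   # count is in the upper half
            lo = mid + 1
    return lo

true_count = 37

def oracle(a: int, b: int) -> bool:
    # Hypothetical stand-in for the model's answer in one dialogue round.
    return a <= true_count <= b

estimate = narrow_count(oracle, 0, 100)
```

With a perfect oracle, the search needs about log2 of the range size in rounds; a real model's answers are noisy, so the sketch says nothing about accuracy.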

[76] GSM-GS: Geometry-Constrained Single and Multi-view Gaussian Splatting for Surface Reconstruction

Xiao Ren,Yu Liu,Ning An,Jian Cheng,Xin Qiao,He Kong

Main category: cs.CV

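TL;DR: This paper proposes GSM-GS, a framework that combines single-view adaptive sub-region weighting constraints with multi-view spatial structure refinement to improve the accuracy and consistency of 3D Gaussian Splatting when reconstructing complex surface detail.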

Details Motivation: 3D Gaussian Splatting trains quickly and renders with high quality, but its unstructured Gaussian point clouds lose high-frequency detail in complex microstructures, limiting reconstruction accuracy. Method: GSM-GS: in the single-view stage, image gradients partition the scene into texture-rich and texture-less regions, with adaptive filtering guided by depth discrepancy features and a dual-branch constraint; in the multi-view stage, geometry-guided cross-view point cloud association and dynamic weight sampling build 3D structural normal constraints across frames. Result: Experiments on public datasets show competitive performance in both rendering quality and geometric reconstruction accuracy. Conclusion: GSM-GS mitigates the inherent weaknesses of 3D Gaussian Splatting in detail preservation and multi-view consistency, offering a new direction for high-quality alternatives to neural radiance fields. Abstract: Recently, 3D Gaussian Splatting has emerged as a prominent research direction owing to its ultrarapid training speed and high-fidelity rendering capabilities. However, the unstructured and irregular nature of Gaussian point clouds poses challenges to reconstruction accuracy. This limitation frequently causes high-frequency detail loss in complex surface microstructures when relying solely on routine strategies. To address this limitation, we propose GSM-GS: a synergistic optimization framework integrating single-view adaptive sub-region weighting constraints and multi-view spatial structure refinement. For single-view optimization, we leverage image gradient features to partition scenes into texture-rich and texture-less sub-regions. The reconstruction quality is enhanced through adaptive filtering mechanisms guided by depth discrepancy features. This preserves high-weight regions while implementing a dual-branch constraint strategy tailored to regional texture variations, thereby improving geometric detail characterization. For multi-view optimization, we introduce a geometry-guided cross-view point cloud association method combined with a dynamic weight sampling strategy. This constructs 3D structural normal constraints across adjacent point cloud frames, effectively reinforcing multi-view consistency and reconstruction fidelity. Extensive experiments on public datasets demonstrate that our method achieves both competitive rendering quality and geometric reconstruction. See our interactive project page.

[77] Thinking Like a Radiologist: A Dataset for Anatomy-Guided Interleaved Vision Language Reasoning in Chest X-ray Interpretation

Yichen Zhao,Zelin Peng,Piao Yang,Xiaokang Yang,Wei Shen

Main category: cs.CV

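TL;DR: This paper presents the MMRad-IVL-22K dataset, which supports natively interleaved vision-language reasoning and markedly improves the clinical accuracy and quality of chest X-ray report generation.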

Details Motivation: Radiological diagnosis repeatedly interleaves visual inspection and language reasoning, whereas most existing LVLMs inspect the image only once and then rely on text-only chain-of-thought (CoT), which is prone to hallucination; adding pseudo-visual information such as coordinates still fails to preserve key visual details like texture and density. Method: Build MMRad-IVL-22K (21,994 diagnostic traces), the first large-scale dataset for natively interleaved vision-language reasoning in chest X-ray interpretation, which mimics the radiologist's repeated "look, think, look again" workflow; each reasoning step is paired with a visual rationale (such as a highlighted region) and a textual description. Result: On closed-source LVLMs, report generation guided by multimodal CoT improves the RadGraph metric by 6% over text-only CoT; across seven open-source LVLMs, models fine-tuned on MMRad-IVL-22K outperform both general-purpose and medical-specific models in reasoning consistency and report quality. Conclusion: High-fidelity interleaved vision-language evidence is an indispensable component of reliable medical AI, and MMRad-IVL-22K provides a key data foundation for radiologist-style reasoning. Abstract: Radiological diagnosis is a perceptual process in which careful visual inspection and language reasoning are repeatedly interleaved. Most medical large vision language models (LVLMs) perform visual inspection only once and then rely on text-only chain-of-thought (CoT) reasoning, which operates purely in the linguistic space and is prone to hallucination. Recent methods attempt to mitigate this issue by introducing visually related coordinates, such as bounding boxes. However, these remain a pseudo-visual solution: coordinates are still text and fail to preserve rich visual details like texture and density. Motivated by the interleaved nature of radiological diagnosis, we introduce MMRad-IVL-22K, the first large-scale dataset designed for natively interleaved visual language reasoning in chest X-ray interpretation. MMRad-IVL-22K reflects the repeated cycle of reasoning and visual inspection in the workflow of radiologists, in which visual rationales complement textual descriptions and ground each step of the reasoning process. MMRad-IVL-22K comprises 21,994 diagnostic traces, enabling systematic scanning across 35 anatomical regions. Experimental results on advanced closed-source LVLMs demonstrate that report generation guided by multimodal CoT significantly outperforms that guided by text-only CoT in clinical accuracy and report quality (e.g., 6% increase in the RadGraph metric), confirming that high-fidelity interleaved vision language evidence is a non-substitutable component of reliable medical AI.
Furthermore, benchmarking across seven state-of-the-art open-source LVLMs demonstrates that models fine-tuned on MMRad-IVL-22K achieve superior reasoning consistency and report quality compared with both general-purpose and medical-specific LVLMs. The project page is available at https://github.com/qiuzyc/thinking_like_a_radiologist.

[78] RoadscapesQA: A Multitask, Multimodal Dataset for Visual Question Answering on Indian Roads

Vijayasri Iyer,Maahin Rathinagiriswaran,Jyothikamalesh S

Main category: cs.CV

TL;DR: Roadscapes is a new multitask multimodal dataset with up to 9,000 manually verified images from diverse Indian road environments, supporting object grounding, reasoning, and scene understanding via rule-based QA generation.

Details Motivation: To advance visual scene understanding for autonomous driving in unstructured, diverse environments, particularly underrepresented Indian road conditions. Method: Collected and manually annotated up to 9,000 images from varied Indian road scenes; applied rule-based heuristics to infer scene attributes and generate QA pairs; provided baselines using vision-language models on image QA tasks. Result: A publicly available multitask multimodal dataset (Roadscapes) with rich annotations, scene attribute inference, QA pairs, and initial vision-language model baselines. Conclusion: Roadscapes fills a gap in diverse, real-world road scene datasets and provides a scalable foundation for multimodal scene understanding research in challenging, underrepresented settings. Abstract: Understanding road scenes is essential for autonomous driving, as it enables systems to interpret visual surroundings to aid in effective decision-making. We present Roadscapes, a multitask multimodal dataset consisting of up to 9,000 images captured in diverse Indian driving environments, accompanied by manually verified bounding boxes. To facilitate scalable scene understanding, we employ rule-based heuristics to infer various scene attributes, which are subsequently used to generate question-answer (QA) pairs for tasks such as object grounding, reasoning, and scene understanding. The dataset includes a variety of scenes from urban and rural India, encompassing highways, service roads, village paths, and congested city streets, captured in both daytime and nighttime settings. Roadscapes has been curated to advance research on visual scene understanding in unstructured environments. In this paper, we describe the data collection and annotation process, present key dataset statistics, and provide initial baselines for image QA tasks using vision-language models.

[79] RADAR: Revealing Asymmetric Development of Abilities in MLLM Pre-training

Yunshuang Nie,Bingqian Lin,Minzhe Niu,Kun Xiang,Jianhua Han,Guowei Huang,Xingyue Quan,Hang Xu,Bokui Chen,Xiaodan Liang

Main category: cs.CV

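TL;DR: This paper proposes RADAR, a framework for efficiently evaluating, without fine-tuning, the asymmetric development of perception and reasoning abilities in multimodal large language models (MLLMs) during pre-training; it comprises the Soft Discrimination Score metric and a multi-modal mixture benchmark with more than 15K samples.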

Details Motivation: Existing evaluation relies on supervised fine-tuning and is costly, and there is neither a pre-training metric that quantifies perception and reasoning separately nor a large-scale benchmark aligned with pre-training objectives. Method: RADAR: (1) the Soft Discrimination Score, which robustly tracks ability development without fine-tuning; (2) a Multi-Modal Mixture Benchmark of more than 15K samples for comprehensive zero-shot evaluation of perception and reasoning. Result: RADAR reveals a marked asymmetry in how MLLM perception and reasoning abilities develop across data volume, model size, and pre-training strategy. Conclusion: Pre-training ability bottlenecks should be diagnosed from a decomposed perspective; RADAR informs targeted optimization and supports efficient MLLM development. Abstract: Pre-trained Multi-modal Large Language Models (MLLMs) provide a knowledge-rich foundation for post-training by leveraging their inherent perception and reasoning capabilities to solve complex tasks. However, the lack of an efficient evaluation framework impedes the diagnosis of their performance bottlenecks. Current evaluation primarily relies on testing after supervised fine-tuning, which introduces laborious additional training and autoregressive decoding costs. Meanwhile, common pre-training metrics cannot quantify a model's perception and reasoning abilities in a disentangled manner. Furthermore, existing evaluation benchmarks are typically limited in scale or misaligned with pre-training objectives. Thus, we propose RADAR, an efficient ability-centric evaluation framework for Revealing Asymmetric Development of Abilities in MLLM pRe-training. RADAR involves two key components: (1) Soft Discrimination Score, a novel metric for robustly tracking ability development without fine-tuning, based on quantifying nuanced gradations of the model preference for the correct answer over distractors; and (2) Multi-Modal Mixture Benchmark, a new 15K+ sample benchmark for comprehensively evaluating pre-trained MLLMs' perception and reasoning abilities in a 0-shot manner, where we unify authoritative benchmark datasets and carefully collect new datasets, extending the evaluation scope and addressing the critical gaps in current benchmarks. With RADAR, we comprehensively reveal the asymmetric development of perceptual and reasoning capabilities in pretrained MLLMs across diverse factors, including data volume, model size, and pretraining strategy.
Our RADAR underscores the need for a decomposed perspective on pre-training ability bottlenecks, informing targeted interventions to advance MLLMs efficiently. Our code is publicly available at https://github.com/Nieysh/RADAR.

[80] Robustness of Object Detection of Autonomous Vehicles in Adverse Weather Conditions

Fox Pettersen,Hong Zhu

Main category: cs.CV

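TL;DR: This paper proposes a method that uses data augmentation to generate synthetic adverse weather and lighting data for evaluating the robustness of object detection models in autonomous driving, quantifying robustness with the average first failure coefficient (AFFC); Faster R-CNN is found to be the most robust, followed by the YOLO variants, and training on synthetic adverse-condition data improves robustness but shows diminishing returns and forgetting.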

Details Motivation: As self-driving technology matures, safe operating thresholds under different environmental conditions must be established to protect public safety; a systematic, quantitative way to evaluate the robustness of object detection models under adverse weather and lighting is lacking. Method: Seven data augmentation operators (simulating fog, rain, snow, darkness, brightness, flare, and shadow) generate synthetic adverse-condition data at multiple intensity levels; intensity is increased step by step until the model fails, and the first-failure intensity of each image is used to compute the average first failure coefficient (AFFC) as the robustness measure. Result: Faster R-CNN reaches an overall AFFC of 71.9%, well above YOLOv5s and YOLOv11s (about 43%); the method is shown to be feasible, effective, and efficient; training on synthetic adverse-condition data improves robustness, but overtraining leads to diminishing returns and forgetting. Conclusion: The AFFC framework gives a quantifiable, comparable robustness measure for object detection models in autonomous driving under realistic complex conditions; model selection and targeted training must balance robustness gains against the risk of overfitting. Abstract: As self-driving technology advances toward widespread adoption, determining safe operational thresholds across varying environmental conditions becomes critical for public safety. This paper proposes a method for evaluating the robustness of object detection ML models in autonomous vehicles under adverse weather conditions. It employs data augmentation operators to generate synthetic data that simulates different severity degrees of the adverse operation conditions at progressive intensity levels to find the lowest intensity of the adverse conditions at which the object detection model fails. The robustness of the object detection model is measured by the average first failure coefficients (AFFC) over the input images in the benchmark. The paper reports an experiment with four object detection models: YOLOv5s, YOLOv11s, Faster R-CNN, and Detectron2, utilising seven data augmentation operators that simulate the weather conditions fog, rain, and snow, and the lighting conditions of dark, bright, flaring, and shadow. The experiment data show that the method is feasible, effective, and efficient to evaluate and compare the robustness of object detection models in various adverse operation conditions. In particular, the Faster R-CNN model achieved the highest robustness with an overall average AFFC of 71.9% over all seven adverse conditions, while YOLO variants showed AFFC values of 43%. The method is also applied to assess the impact of model training that targets adverse operation conditions using synthetic data on model robustness.
It is observed that such training can improve robustness in adverse conditions but may suffer from diminishing returns and forgetting phenomena (i.e., decline in robustness) if overtrained.
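
The AFFC measure described above can be illustrated with a small sketch. The abstract does not give the exact formula, so the following rests on stated assumptions: intensities are normalized to (0, 1], the first failure coefficient of an image is the lowest tested intensity at which detection fails, and an image that never fails is assigned 1.0. The per-image checks are hypothetical threshold functions, not real detectors.

```python
from typing import Callable, Sequence

def first_failure_coefficient(passes: Callable[[float], bool],
                              levels: Sequence[float]) -> float:
    """Return the lowest intensity in `levels` (ascending) at which the model fails.

    If the model never fails, return 1.0 (assumed convention).
    """
    for level in levels:
        if not passes(level):
            return level
    return 1.0

def affc(per_image_checks, levels) -> float:
    """Average the first-failure coefficients over all benchmark images."""
    coeffs = [first_failure_coefficient(check, levels) for check in per_image_checks]
    return sum(coeffs) / len(coeffs)

levels = [k / 10 for k in range(1, 11)]  # intensities 0.1, 0.2, ..., 1.0

# Hypothetical per-image behavior: detection succeeds below a tolerance threshold.
checks = [lambda x, t=t: x < t for t in (0.35, 0.75, 2.0)]
score = affc(checks, levels)
```

Under these assumptions a higher score means the model tolerates stronger perturbations before its first failure, which matches the direction of the reported comparison between models.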

[81] Adaptive Scaling with Geometric and Visual Continuity of completed 3D objects

Jelle Vermandere,Maarten Bassier,Maarten Vergauwen

Main category: cs.CV

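TL;DR: This paper proposes a part-aware scaling framework that turns static Signed Distance Fields (SDFs) into editable, structurally coherent objects, supporting distortion-free scaling and deformation and preserving periodic geometry through a repetition-based strategy.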

Details Motivation: The static SDFs produced by object completion networks cannot be flexibly rescaled or deformed, which limits their use in applications that require object manipulation, such as indoor redesign, simulation, and digital content creation. Method: Starting from SDFs and Texture Fields produced by state-of-the-art completion models, perform automatic part segmentation, define user-controlled scaling zones, and smoothly interpolate SDFs, color, and part indices; add a repetition-based strategy to handle large deformations while preserving repeating geometric patterns. Result: Experiments on Matterport3D and ShapeNet show that the method overcomes the rigidity of completed SDFs and is visually better than global scaling and naive selective scaling, especially for complex shapes and repetitive structures. Conclusion: The part-aware scaling framework markedly improves the editability and structural coherence of completed SDFs, providing a practical, robust deformation tool for 3D content creation. Abstract: Object completion networks typically produce static Signed Distance Fields (SDFs) that faithfully reconstruct geometry but cannot be rescaled or deformed without introducing structural distortions. This limitation restricts their use in applications requiring flexible object manipulation, such as indoor redesign, simulation, and digital content creation. We introduce a part-aware scaling framework that transforms these static completed SDFs into editable, structurally coherent objects. Starting from SDFs and Texture Fields generated by state-of-the-art completion models, our method performs automatic part segmentation, defines user-controlled scaling zones, and applies smooth interpolation of SDFs, color, and part indices to enable proportional and artifact-free deformation. We further incorporate a repetition-based strategy to handle large-scale deformations while preserving repeating geometric patterns. Experiments on Matterport3D and ShapeNet objects show that our method overcomes the inherent rigidity of completed SDFs and is visually more appealing than global and naive selective scaling, particularly for complex shapes and repetitive structures.

[82] Reliable Thinking with Images

Haobin Li,Yutong Yang,Yijie Lin,Dai Xiang,Mouxing Yang,Xi Peng

Main category: cs.CV

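TL;DR: This paper proposes Reliable Thinking with Images (RTWI), a method that addresses the Noisy Thinking (NT) problem caused by erroneous visual cues and textual reasoning during Thinking with Images (TWI) in multimodal large language models (MLLMs). RTWI estimates the reliability of visual cues and textual chain-of-thought in a unified text-centric manner and uses robust filtering and voting modules to prevent error propagation, significantly improving performance on multiple benchmarks.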

Details Motivation: Existing Thinking with Images (TWI) methods assume that the interleaved image-text reasoning chain is entirely correct, but this assumption is often violated in complex real-world multimodal understanding, causing error accumulation and degraded performance; this paper reveals and studies this practical yet overlooked problem, Noisy Thinking (NT). Method: The paper proposes Reliable Thinking with Images (RTWI), which estimates the reliability of visual cues and textual CoT in a unified text-centric manner and introduces a robust filtering module and a voting module to keep NT from contaminating the final answer. Result: Extensive experiments on seven benchmarks verify the effectiveness of RTWI against Noisy Thinking, significantly improving the reasoning robustness and accuracy of MLLMs. Conclusion: RTWI offers a new approach to reliable reasoning for MLLMs in real-world scenarios, emphasizing the need to model and mitigate noise in interleaved image-text reasoning rather than rely on idealized assumptions. Abstract: As a multimodal extension of Chain-of-Thought (CoT), Thinking with Images (TWI) has recently emerged as a promising avenue to enhance the reasoning capability of Multi-modal Large Language Models (MLLMs), which generates interleaved CoT by incorporating visual cues into the textual reasoning process. However, the success of existing TWI methods heavily relies on the assumption that interleaved image-text CoTs are faultless, which is easily violated in real-world scenarios due to the complexity of multimodal understanding. In this paper, we reveal and study a highly-practical yet under-explored problem in TWI, termed Noisy Thinking (NT). Specifically, NT refers to the imperfect visual cues mining and answer reasoning process. As the saying goes, "One mistake leads to another", erroneous interleaved CoT would cause error accumulation, thus significantly degrading the performance of MLLMs. To solve the NT problem, we propose a novel method dubbed Reliable Thinking with Images (RTWI). In brief, RTWI estimates the reliability of visual cues and textual CoT in a unified text-centric manner and accordingly employs robust filtering and voting modules to prevent NT from contaminating the final answer. Extensive experiments on seven benchmarks verify the effectiveness of RTWI against NT.
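
The filter-then-vote step can be pictured with a small sketch. It is only an illustration under stated assumptions, not RTWI itself: each sampled reasoning chain is reduced to a hypothetical `(answer, reliability)` pair with reliability in [0, 1], chains below a hypothetical `min_reliability` threshold are dropped, and the remaining answers are combined by a reliability-weighted vote. How RTWI actually scores reliability is not described in the abstract.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def filter_and_vote(chains: List[Tuple[str, float]], min_reliability: float = 0.5) -> str:
    """Drop low-reliability chains, then return the answer with the largest total weight."""
    if not chains:
        raise ValueError("at least one reasoning chain is required")
    kept = [(answer, rel) for answer, rel in chains if rel >= min_reliability]
    if not kept:
        kept = chains  # nothing passed the filter: fall back to all chains
    weight: Dict[str, float] = defaultdict(float)
    for answer, rel in kept:
        weight[answer] += rel  # reliability-weighted vote
    return max(weight, key=weight.get)


# Three noisy chains agree on "dog", two reliable chains say "cat".
sampled = [("cat", 0.9), ("dog", 0.4), ("dog", 0.45), ("dog", 0.3), ("cat", 0.6)]
print(filter_and_vote(sampled))
```

A plain majority vote over `sampled` would pick "dog"; filtering first keeps the three unreliable chains from outvoting the two reliable ones.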

[83] EPRBench: A High-Quality Benchmark Dataset for Event Stream Based Visual Place Recognition

Xiao Wang,Xingxing Xiong,Jinfeng Gao,Xufeng Lou,Bo Jiang,Si-bao Chen,Yaowei Wang,Yonghong Tian

Main category: cs.CV

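TL;DR: This paper introduces EPRBench, the first high-quality benchmark dataset for event stream-based visual place recognition (VPR), and builds a new multimodal, interpretable VPR paradigm that combines large language models (LLMs) with event streams.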

Details Motivation: VPR with conventional visible-light cameras is unstable under low illumination, overexposure, and high-speed motion, where event cameras have an advantage; however, dedicated event-stream VPR datasets and benchmarks are lacking, as is research support for fusing semantic and language modalities. Method: 1) Build the EPRBench dataset (10K event sequences and 65K event frames covering diverse viewpoints, weather, and lighting conditions); 2) add scene text descriptions generated by LLMs and refined by human annotators; 3) systematically evaluate 15 SOTA VPR algorithms; 4) propose an LLM-guided multimodal fusion VPR framework comprising text-driven spatially attentive token selection, cross-modal feature fusion, and multi-scale representation learning. Result: The proposed method achieves highly accurate place recognition on EPRBench while producing interpretable reasoning processes, significantly improving model transparency and explainability; the dataset and code will be open-sourced. Conclusion: EPRBench fills the gap left by the missing benchmark for event-stream VPR, and the proposed LLM-event multimodal fusion paradigm offers a new direction for combining neuromorphic perception with large models, advancing interpretable, semantically enhanced VPR. Abstract: Event stream-based Visual Place Recognition (VPR) is an emerging research direction that offers a compelling solution to the instability of conventional visible-light cameras under challenging conditions such as low illumination, overexposure, and high-speed motion. Recognizing the current scarcity of dedicated datasets in this domain, we introduce EPRBench, a high-quality benchmark specifically designed for event stream-based VPR. EPRBench comprises 10K event sequences and 65K event frames, collected using both handheld and vehicle-mounted setups to comprehensively capture real-world challenges across diverse viewpoints, weather conditions, and lighting scenarios. To support semantic-aware and language-integrated VPR research, we provide LLM-generated scene descriptions, subsequently refined through human annotation, establishing a solid foundation for integrating LLMs into event-based perception pipelines. To facilitate systematic evaluation, we implement and benchmark 15 state-of-the-art VPR algorithms on EPRBench, offering a strong baseline for future algorithmic comparisons. Furthermore, we propose a novel multi-modal fusion paradigm for VPR: leveraging LLMs to generate textual scene descriptions from raw event streams, which then guide spatially attentive token selection, cross-modal feature fusion, and multi-scale representation learning.
This framework not only achieves highly accurate place recognition but also produces interpretable reasoning processes alongside its predictions, significantly enhancing model transparency and explainability. The dataset and source code will be released on https://github.com/Event-AHU/Neuromorphic_ReID

[84] Human-Aligned MLLM Judges for Fine-Grained Image Editing Evaluation: A Benchmark, Framework, and Analysis

Runzhou Liu,Hailey Weingord,Sejal Mittal,Prakhar Dungarwal,Anusha Nandula,Bo Ni,Samyadeep Basu,Hongjie Chen,Nesreen K. Ahmed,Li Li,Jiayi Zhang,Koustava Goswami,Subhojyoti Mukherjee,Branislav Kveton,Puneet Mathur,Franck Dernoncourt,Yue Zhao,Yu Wang,Ryan A. Rossi,Zhengzhong Tu,Hongru Du

Main category: cs.CV

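TL;DR: This paper proposes a fine-grained image editing evaluation framework based on multimodal large language models (MLLMs), decomposing evaluation into 12 interpretable dimensions, and builds a new benchmark combining human annotations with MLLM judgments, showing that the judges agree closely with human judgment and outperform traditional metrics.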

Details Motivation: Traditional image editing metrics are coarse-grained and poorly interpretable, failing to reflect human perception and editing intent, with particular weaknesses in controllability, edit localization, and instruction faithfulness. Method: The paper proposes a fine-grained MLLM-as-a-Judge evaluation framework that decouples evaluation into 12 interpretable factors across three categories (image preservation, edit quality, and instruction fidelity); builds a comprehensive benchmark containing human annotations, MLLM evaluations, model outputs, and traditional metrics; and verifies agreement through large-scale human studies. Result: The MLLM judges align closely with human judgment at a fine granularity; traditional metrics (e.g., LPIPS, CLIP-Score) fail to distinguish over-edited or semantically imprecise results, whereas the MLLM judges are more intuitive and informative and apply to both offline and online evaluation. Conclusion: The work establishes the first fine-grained, human-validated image editing evaluation benchmark and a principled factorization, and shows that MLLMs as fine-grained judges are a practical foundation for studying, comparing, and improving image editing methods. Abstract: Evaluating image editing models remains challenging due to the coarse granularity and limited interpretability of traditional metrics, which often fail to capture aspects important to human perception and intent. Such metrics frequently reward visually plausible outputs while overlooking controllability, edit localization, and faithfulness to user instructions. In this work, we introduce a fine-grained Multimodal Large Language Model (MLLM)-as-a-Judge framework for image editing that decomposes common evaluation notions into twelve fine-grained interpretable factors spanning image preservation, edit quality, and instruction fidelity. Building on this formulation, we present a new human-validated benchmark that integrates human judgments, MLLM-based evaluations, model outputs, and traditional metrics across diverse image editing tasks. Through extensive human studies, we show that the proposed MLLM judges align closely with human evaluations at a fine granularity, supporting their use as reliable and scalable evaluators. We further demonstrate that traditional image editing metrics are often poor proxies for these factors, failing to distinguish over-edited or semantically imprecise outputs, whereas our judges provide more intuitive and informative assessments in both offline and online settings. Together, this work introduces a benchmark, a principled factorization, and empirical evidence positioning fine-grained MLLM judges as a practical foundation for studying, comparing, and improving image editing approaches.

[85] Beyond Benchmarks of IUGC: Rethinking Requirements of Deep Learning Methods for Intrapartum Ultrasound Biometry from Fetal Ultrasound Videos

Jieyun Bai,Zihao Zhou,Yitong Tang,Jie Gan,Zhuonan Liang,Jianan Fan,Lisa B. Mcguire,Jillian L. Clarke,Weidong Cai,Jacaueline Spurway,Yubo Tang,Shiye Wang,Wenda Shen,Wangwang Yu,Yihao Li,Philippe Zhang,Weili Jiang,Yongjie Li,Salem Muhsin Ali Binqahal Al Nasim,Arsen Abzhanov,Numan Saeed,Mohammad Yaqub,Zunhui Xian,Hongxing Lin,Libin Lan,Jayroop Ramesh,Valentin Bacher,Mark Eid,Hoda Kalabizadeh,Christian Rupprecht,Ana I. L. Namburete,Pak-Hei Yeung,Madeleine K. Wyburd,Nicola K. Dinsdale,Assanali Serikbey,Jiankai Li,Sung-Liang Chen,Zicheng Hu,Nana Liu,Yian Deng,Wei Hu,Cong Tan,Wenfeng Zhang,Mai Tuyet Nhi,Gregor Koehler,Rapheal Stock,Klaus Maier-Hein,Marawan Elbatel,Xiaomeng Li,Saad Slimani,Victor M. Campello,Benard Ohene-Botwe,Isaac Khobo,Yuxin Huang,Zhenyan Han,Hongying Hou,Di Qiu,Zheng Zheng,Gongning Luo,Dong Ni,Yaosheng Lu,Karim Lekadir,Shuo Li

Main category: cs.CV

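TL;DR: This paper presents the Intrapartum Ultrasound Grand Challenge (IUGC), which aims to advance automatic intrapartum ultrasound biometry for resource-limited settings; it proposes a multi-task framework, releases the largest intrapartum ultrasound video dataset to date, analyzes the methods of eight participating teams, and notes that current techniques remain at an early stage.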

Details Motivation: To address the difficulty of routine intrapartum ultrasound monitoring in low- and middle-income countries caused by the shortage of trained sonographers, and to reduce perinatal mortality. Method: Build a clinically oriented multi-task automatic measurement framework (standard plane classification, fetal head-pubic symphysis segmentation, and biometry), release a multi-center intrapartum ultrasound dataset of 774 videos (68,106 frames), and systematically analyze the eight teams' submissions from five perspectives: preprocessing, data augmentation, learning strategy, model architecture, and post-processing. Result: The submissions show promising performance, but clear bottlenecks remain in accuracy, robustness, and clinical applicability; the benchmark results indicate that the field is still at an early stage and some distance from large-scale clinical deployment. Conclusion: IUGC advances research on automated intrapartum ultrasound, and the public dataset and benchmark solutions support reproducible research; future work needs to address algorithm reliability, cross-center generalization, and clinical integration. Abstract: A substantial proportion (45%) of maternal deaths, neonatal deaths, and stillbirths occur during the intrapartum phase, with a particularly high burden in low- and middle-income countries. Intrapartum biometry plays a critical role in monitoring labor progression; however, the routine use of ultrasound in resource-limited settings is hindered by a shortage of trained sonographers. To address this challenge, the Intrapartum Ultrasound Grand Challenge (IUGC), co-hosted with MICCAI 2024, was launched. The IUGC introduces a clinically oriented multi-task automatic measurement framework that integrates standard plane classification, fetal head-pubic symphysis segmentation, and biometry, enabling algorithms to exploit complementary task information for more accurate estimation. Furthermore, the challenge releases the largest multi-center intrapartum ultrasound video dataset to date, comprising 774 videos (68,106 frames) collected from three hospitals, providing a robust foundation for model training and evaluation. In this study, we present a comprehensive overview of the challenge design, review the submissions from eight participating teams, and analyze their methods from five perspectives: preprocessing, data augmentation, learning strategy, model architecture, and post-processing. In addition, we perform a systematic analysis of the benchmark results to identify key bottlenecks, explore potential solutions, and highlight open challenges for future research.
Although encouraging performance has been achieved, our findings indicate that the field remains at an early stage, and further in-depth investigation is required before large-scale clinical deployment. All benchmark solutions and the complete dataset have been publicly released to facilitate reproducible research and promote continued advances in automatic intrapartum ultrasound biometry.

[86] CoPE-VideoLM: Codec Primitives For Efficient Video Language Models

Sayan Deb Sarkar,Rémi Pautrat,Ondrej Miksik,Marc Pollefeys,Iro Armeni,Mahdi Rad,Mihai Dusmanu

Main category: cs.CV

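TL;DR: This paper proposes a lightweight video language model approach based on video codec primitives (motion vectors and residuals) to address the information loss and heavy computation of keyframe sampling; using lightweight Transformer encoders and an alignment pre-training strategy, it sharply reduces latency and token usage while matching or exceeding performance on 14 video understanding benchmarks.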

Details Motivation: Existing VideoLMs rely on sparse keyframe sampling, which can miss both macro-level events and micro-level details, and encoding the full image for every frame is computationally expensive. Method: Use video codec primitives (motion vectors and residuals) in place of full-frame image encoding, design lightweight Transformer encoders to aggregate these primitives, and align their representations with image encoder embeddings through a pre-training strategy that speeds up convergence of end-to-end fine-tuning. Result: Compared with standard VideoLMs, time-to-first-token drops by up to 86% and token usage by up to 93%; performance is on par or better on 14 benchmarks covering question answering, temporal reasoning, long-form video understanding, and spatial scene understanding. Conclusion: Lightweight modeling based on codec primitives is an effective paradigm for efficient, high-fidelity video language understanding that balances efficiency and performance. Abstract: Video Language Models (VideoLMs) empower AI systems to understand temporal dynamics in videos. To fit to the maximum context window constraint, current methods use keyframe sampling which can miss both macro-level events and micro-level details due to the sparse temporal coverage. Furthermore, processing full images and their tokens for each frame incurs substantial computational overhead. To address these limitations, we propose to leverage video codec primitives (specifically motion vectors and residuals) which natively encode video redundancy and sparsity without requiring expensive full-image encoding for most frames. To this end, we introduce lightweight transformer-based encoders that aggregate codec primitives and align their representations with image encoder embeddings through a pre-training strategy that accelerates convergence during end-to-end fine-tuning. Our approach reduces the time-to-first-token by up to 86% and token usage by up to 93% compared to standard VideoLMs. Moreover, by varying the keyframe and codec primitive densities we are able to maintain or exceed performance on 14 diverse video understanding benchmarks spanning general question answering, temporal reasoning, long-form understanding, and spatial scene understanding.

[87] Deep-Learning Atlas Registration for Melanoma Brain Metastases: Preserving Pathology While Enabling Cohort-Level Analyses

Nanna E. Wielenberg,Ilinca Popp,Oliver Blanck,Lucas Zander,Jan C. Peeken,Stephanie E. Combs,Anca-Ligia Grosu,Dimos Baltas,Tobias Fechter

Main category: cs.CV

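TL;DR: This paper proposes a differentiable deep-learning deformable registration framework that needs no lesion masks to register MRI scans of melanoma brain metastasis (MBM) patients to a common atlas; it handles missing anatomical correspondences caused by lesions, is validated on multi-center data for high accuracy and preservation of lesion volume, and reveals a marked spatial preference of MBM for the gray-white matter junction and the cortex.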

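Details Motivation: Melanoma brain metastases (MBM) are spatially heterogeneous, and MRI protocols and anatomy vary widely across centers; conventional registration methods rely on lesion masks or preprocessing and struggle to support robust, reproducible cohort-level analyses. Method: The paper proposes an end-to-end differentiable deep-learning deformable registration framework; it uses a forward-model similarity metric based on distance-transformed anatomical labels to cope with missing anatomical correspondences caused by lesions, adds a volume-preserving regularization term to keep deformations plausible, and is validated on 209 MBM patients from three centers. Result: Registration accuracy is high (DSC 0.89-0.92, HD 6.79-7.60 mm, ASSD 0.63-0.77 mm) and metastatic volumes are preserved; spatial analysis finds MBM significantly over-represented in the cerebral cortex and putamen, under-represented in white matter, and concentrated near the gray-white matter junction; after volume correction, no arterial territory shows significant enrichment of metastases. Conclusion: The framework achieves robust atlas registration of pathological brain MRI without lesion masks and supports reproducible multi-center studies; the results confirm and refine the spatial seeding patterns of MBM, highlighting the key role of the gray-white matter junction; the open-source implementation eases extension to other brain tumors and neurological diseases. Abstract: Melanoma brain metastases (MBM) are common and spatially heterogeneous lesions, complicating cohort-level analyses due to anatomical variability and differing MRI protocols. We propose a fully differentiable, deep-learning-based deformable registration framework that aligns individual pathological brains to a common atlas while preserving metastatic tissue without requiring lesion masks or preprocessing. Missing anatomical correspondences caused by metastases are handled through a forward-model similarity metric based on distance-transformed anatomical labels, combined with a volume-preserving regularization term to ensure deformation plausibility. Registration performance was evaluated using Dice coefficient (DSC), Hausdorff distance (HD), average symmetric surface distance (ASSD), and Jacobian-based measures. The method was applied to 209 MBM patients from three centres, enabling standardized mapping of metastases to anatomical, arterial, and perfusion atlases. The framework achieved high registration accuracy across datasets (DSC 0.89-0.92, HD 6.79-7.60 mm, ASSD 0.63-0.77 mm) while preserving metastatic volumes. Spatial analysis demonstrated significant over-representation of MBM in the cerebral cortex and putamen, under-representation in white matter, and consistent localization near the gray-white matter junction. No arterial territory showed increased metastasis frequency after volume correction.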
This approach enables robust atlas registration of pathological brain MRI without lesion masks and supports reproducible multi-centre analyses. Applied to MBM, it confirms and refines known spatial predilections, particularly preferential seeding near the gray-white matter junction and cortical regions. The publicly available implementation facilitates reproducible research and extension to other brain tumours and neurological pathologies.
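
For readers unfamiliar with the headline registration metric, the Dice coefficient (DSC) can be computed as below. This is a generic sketch of the standard definition, not code from the paper; masks are assumed to be flattened binary label maps of equal length, and two empty masks are treated as a perfect match by convention.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], ref: Sequence[int]) -> float:
    """DSC = 2|A ∩ B| / (|A| + |B|) for two binary masks of equal length."""
    if len(pred) != len(ref):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, r in zip(pred, ref) if p and r)
    total = sum(1 for p in pred if p) + sum(1 for r in ref if r)
    return 1.0 if total == 0 else 2.0 * intersection / total


# Hypothetical warped subject label vs. atlas label for one structure.
warped_label = [1, 1, 1, 0, 0, 0]
atlas_label = [0, 1, 1, 1, 0, 0]
print(round(dice_coefficient(warped_label, atlas_label), 3))
```

Here the two masks share 2 voxels out of 3 each, giving 2·2/(3+3) ≈ 0.667; the reported 0.89-0.92 range corresponds to much tighter overlap.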

[88] Unleashing MLLMs on the Edge: A Unified Framework for Cross-Modal ReID via Adaptive SVD Distillation

Hongbo Jiang,Jie Li,Xinqi Cai,Tianyu Xie,Yunhang Shen,Pingyang Dai,Liujuan Cao

Main category: cs.CV

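TL;DR: This paper proposes MLLMEmbed-ReID, a framework that applies a multimodal large language model (MLLM) to cross-modal re-identification (CM-ReID) in a unified way: instruction tuning and a hierarchical LoRA strategy build a cloud model with strong representations, and a new knowledge distillation method based on principal component mapping and feature relation preservation enables high-performance, lightweight edge deployment.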

Details Motivation: Existing CM-ReID systems depend on fragmented, specialized cloud models, and although MLLMs have unification potential, they lack end-to-end adaptation and effective knowledge distillation for edge deployment. Method: 1) Adapt a foundational MLLM into a cloud CM-ReID model, using instruction prompts to guide it toward a unified embedding space across RGB, infrared, sketch, and text, with efficient fine-tuning via a hierarchical LoRA-SFT strategy; 2) propose a distillation method based on the low-rank property of the teacher's features, comprising a Principal Component Mapping loss (to retain essential information) and a Feature Relation loss (to preserve structure). Result: The edge model reaches SOTA on multiple visual CM-ReID benchmarks; the cloud model performs best on all CM-ReID benchmarks; the overall framework effectively deploys MLLM-level intelligence on resource-constrained devices. Conclusion: MLLMEmbed-ReID provides a complete cloud-edge solution that, for the first time, combines the unified modeling capability of MLLMs with lightweight edge deployment, moving CM-ReID toward practical use. Abstract: Practical cloud-edge deployment of Cross-Modal Re-identification (CM-ReID) faces challenges due to maintaining a fragmented ecosystem of specialized cloud models for diverse modalities. While Multi-Modal Large Language Models (MLLMs) offer strong unification potential, existing approaches fail to adapt them into a single end-to-end backbone and lack effective knowledge distillation strategies for edge deployment. To address these limitations, we propose MLLMEmbed-ReID, a unified framework based on a powerful cloud-edge architecture. First, we adapt a foundational MLLM into a state-of-the-art cloud model. We leverage instruction-based prompting to guide the MLLM in generating a unified embedding space across RGB, infrared, sketch, and text modalities. This model is then trained efficiently with a hierarchical Low-Rank Adaptation finetuning (LoRA-SFT) strategy, optimized under a holistic cross-modal alignment objective. Second, to deploy its knowledge onto an edge-native student, we introduce a novel distillation strategy motivated by the low-rank property in the teacher's feature space. To prioritize essential information, this method employs a Principal Component Mapping loss, while relational structures are preserved via a Feature Relation loss. Our lightweight edge-based model achieves state-of-the-art performance on multiple visual CM-ReID benchmarks, while its cloud-based counterpart excels across all CM-ReID benchmarks. The MLLMEmbed-ReID framework thus presents a complete and effective solution for deploying unified MLLM-level intelligence on resource-constrained devices.
The code and models will be open-sourced soon.

[89] Training-Free Acceleration for Document Parsing Vision-Language Model with Hierarchical Speculative Decoding

Wenhui Liao,Hongliang Li,Pengyu Xie,Xinyu Cai,Yufan Shen,Yi Xin,Qi Qin,Shenglong Ye,Tianbin Li,Ming Hu,Junjun He,Yihao Liu,Wenhai Wang,Min Dou,Bin Fu,Botian Shi,Yu Qiao,Lianwen Jin

Main category: cs.CV

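TL;DR: This paper proposes a training-free, efficient acceleration method for document parsing: a lightweight draft model and a VLM verifier decode cooperatively in parallel, combined with page region partitioning and reading-order assembly, which substantially speeds up long-document parsing.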

Details Motivation: End-to-end document parsing methods based on vision-language models (VLMs) offer strong semantic modeling and good generalization, but autoregressive generation of long token sequences causes high inference latency, especially for long documents and complex layouts. Method: Inspired by speculative decoding, a lightweight document parsing pipeline serves as the draft model and predicts tokens in batches, which the more accurate VLM verifies in parallel; each page is also partitioned by layout into independent regions that run draft-verify decoding in parallel, and the results are assembled in natural reading order. Result: On the OmniDocBench benchmark, the method gives a 2.42x lossless speedup for the dots.ocr model, and up to a 4.89x speedup on long-document parsing tasks. Conclusion: This training-free, structure-aware parallel speculative decoding strategy eases the latency bottleneck of VLM document parsing while balancing efficiency and accuracy, and is practical and extensible. Abstract: Document parsing is a fundamental task in multimodal understanding, supporting a wide range of downstream applications such as information extraction and intelligent document analysis. Benefiting from strong semantic modeling and robust generalization, VLM-based end-to-end approaches have emerged as the mainstream paradigm in recent years. However, these models often suffer from substantial inference latency, as they must auto-regressively generate long token sequences when processing long-form documents. In this work, motivated by the extremely long outputs and complex layout structures commonly found in document parsing, we propose a training-free and highly efficient acceleration method. Inspired by speculative decoding, we employ a lightweight document parsing pipeline as a draft model to predict batches of future tokens, while the more accurate VLM verifies these draft predictions in parallel. Moreover, we further exploit the layout-structured nature of documents by partitioning each page into independent regions, enabling parallel decoding of each region using the same draft-verify strategy. The final predictions are then assembled according to the natural reading order. Experimental results demonstrate the effectiveness of our approach: on the general-purpose OmniDocBench, our method provides a 2.42x lossless acceleration for the dots.ocr model, and achieves up to 4.89x acceleration on long-document parsing tasks. We will release our code to facilitate reproducibility and future research.
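
The draft-verify loop can be illustrated with a toy greedy version. This is a sketch under assumptions, not the paper's method: `target_next` and `draft_block` are hypothetical callables standing in for the VLM and the lightweight pipeline, the verifier is called once per position here (a real VLM would score all draft positions in one parallel forward pass), and the region-level parallelism is omitted.

```python
from typing import Callable, List, Tuple

Token = str


def draft_verify_decode(
    target_next: Callable[[List[Token]], Token],
    draft_block: Callable[[List[Token], int], List[Token]],
    max_new_tokens: int,
    k: int = 4,
    eos: Token = "<eos>",
) -> Tuple[List[Token], int]:
    """Greedy speculative decoding; returns the tokens and the number of verify rounds."""
    out: List[Token] = []
    rounds = 0
    while len(out) < max_new_tokens and (not out or out[-1] != eos):
        accepted: List[Token] = []
        for tok in draft_block(out, k):
            expected = target_next(out + accepted)
            accepted.append(expected)  # always keep the verifier's token: lossless
            if tok != expected or expected == eos:
                break  # stop at the first draft mistake or at end of sequence
        if not accepted:  # empty draft: fall back to one ordinary decoding step
            accepted.append(target_next(out))
        out.extend(accepted)
        rounds += 1
    return out[:max_new_tokens], rounds


# Toy models (assumptions): the verifier always continues REFERENCE,
# the draft agrees except for one wrong token.
REFERENCE = ["A", "B", "C", "D", "E", "<eos>"]
DRAFT = ["A", "B", "X", "D", "E", "<eos>"]


def toy_target_next(prefix: List[Token]) -> Token:
    return REFERENCE[len(prefix)]


def toy_draft_block(prefix: List[Token], k: int) -> List[Token]:
    return DRAFT[len(prefix):len(prefix) + k]


tokens, rounds = draft_verify_decode(toy_target_next, toy_draft_block, max_new_tokens=10)
print(tokens, rounds)
```

The output always equals what the verifier alone would have produced, which is what "lossless" means here; the gain is that six tokens take two verification rounds instead of six sequential steps.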

[90] Detecting Object Tracking Failure via Sequential Hypothesis Testing

Alejandro Monroy Muñoz,Rajeev Verma,Alexander Timans

Main category: cs.CV

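TL;DR: This paper proposes a real-time reliability assessment method for object tracking based on sequential hypothesis testing; formalized as an e-process, it detects tracking failures quickly while strictly controlling the false alert rate, requires no extra training, and is model-agnostic.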

Details Motivation: Existing real-time object tracking systems lack formal safety assurance mechanisms and usually rely only on heuristic confidence measures to judge tracking reliability, which falls short of the trustworthiness required in practice. Method: Object tracking is modeled as a sequential hypothesis testing problem, with the statistical test built as an e-process; a supervised variant (using ground truth) and an unsupervised variant (using only internal tracking information) are designed. Result: The method is validated for two mainstream tracking models on four video benchmarks; it detects tracking failures quickly while strictly controlling the false alert rate, with low computational cost and no extra training. Conclusion: The proposed sequential testing method provides real-time tracking systems with a safety assurance mechanism that is statistically guaranteed, lightweight, efficient, and model-agnostic. Abstract: Real-time online object tracking in videos constitutes a core task in computer vision, with wide-ranging applications including video surveillance, motion capture, and robotics. Deployed tracking systems usually lack formal safety assurances to convey when tracking is reliable and when it may fail, at best relying on heuristic measures of model confidence to raise alerts. To obtain such assurances we propose interpreting object tracking as a sequential hypothesis test, wherein evidence for or against tracking failures is gradually accumulated over time. Leveraging recent advancements in the field, our sequential test (formalized as an e-process) quickly identifies when tracking failures set in whilst provably containing false alerts at a desired rate, and thus limiting potentially costly re-calibration or intervention steps. The approach is computationally light-weight, requires no extra training or fine-tuning, and is in principle model-agnostic. We propose both supervised and unsupervised variants by leveraging either ground-truth or solely internal tracking information, and demonstrate its effectiveness for two established tracking models across four video benchmarks. As such, sequential testing can offer a statistically grounded and efficient mechanism to incorporate safety assurances into real-time tracking systems.
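
To make the e-process idea concrete, here is a generic testing-by-betting sketch; it is not the paper's construction. It assumes a hypothetical per-frame error score in [0, 1] whose conditional mean is at most `m0` while tracking is healthy, multiplies a wealth process by `1 + lam * (score - m0)` each frame, and raises an alert once wealth reaches `1 / alpha`. Under that null assumption the wealth is a nonnegative supermartingale, so by Ville's inequality the probability of ever raising a false alert is at most `alpha`.

```python
from typing import Iterable, Optional


def first_alarm(
    scores: Iterable[float],
    m0: float = 0.2,
    lam: float = 2.0,
    alpha: float = 0.05,
) -> Optional[int]:
    """Return the first frame index at which the wealth crosses 1/alpha, else None."""
    if not (0.0 < m0 < 1.0 and 0.0 <= lam <= 1.0 / m0):
        raise ValueError("need 0 < m0 < 1 and 0 <= lam <= 1/m0 for nonnegative bets")
    wealth = 1.0
    for t, score in enumerate(scores):
        wealth *= 1.0 + lam * (score - m0)  # e-value for this frame
        if wealth >= 1.0 / alpha:
            return t
    return None


healthy = [0.1] * 50                  # low error scores: wealth shrinks, no alert
failing = [0.1] * 5 + [0.9] * 10      # tracking breaks at frame 5
print(first_alarm(healthy), first_alarm(failing))
```

With these toy numbers the alert fires at frame 9, five frames after the failure begins; a larger `lam` reacts faster to failures but loses wealth faster on healthy frames, and the names `m0`, `lam`, and `alpha` are choices of this sketch.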

[91] MASAR: Motion-Appearance Synergy Refinement for Joint Detection and Trajectory Forecasting

Mohammed Amine Bencheikh Lehocine,Julian Schmidt,Frank Moosmann,Dikshant Gupta,Fabian Flohr

Main category: cs.CV

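TL;DR: This paper proposes MASAR, a framework that jointly encodes appearance and motion features through an object-centric spatio-temporal mechanism to optimize 3D detection and trajectory forecasting end to end, significantly improving forecasting accuracy on the nuScenes dataset.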

Details Motivation: In classical autonomous driving systems, perception and prediction modules are connected through hand-crafted bounding-box interfaces, which limits information flow and propagates errors; existing end-to-end methods fail to fully exploit the synergy between appearance and motion cues. Method: The paper proposes MASAR, a fully differentiable framework for joint 3D detection and trajectory forecasting that is compatible with any transformer-based 3D detector; it uses an object-centric spatio-temporal mechanism to jointly encode appearance and motion features and models long-term temporal dependencies by predicting and refining past trajectories. Result: On the nuScenes dataset, minADE and minFDE improve by over 20% while detection performance remains robust. Conclusion: MASAR strengthens the joint modeling of appearance and motion cues and validates the idea of "looking backward to look forward" for joint perception and prediction. Abstract: Classical autonomous driving systems connect perception and prediction modules via hand-crafted bounding-box interfaces, limiting information flow and propagating errors to downstream tasks. Recent research aims to develop end-to-end models that jointly address perception and prediction; however, they often fail to fully exploit the synergy between appearance and motion cues, relying mainly on short-term visual features. We follow the idea of "looking backward to look forward", and propose MASAR, a novel fully differentiable framework for joint 3D detection and trajectory forecasting compatible with any transformer-based 3D detector. MASAR employs an object-centric spatio-temporal mechanism that jointly encodes appearance and motion features. By predicting past trajectories and refining them using guidance from appearance cues, MASAR captures long-term temporal dependencies that enhance future trajectory forecasting. Experiments conducted on the nuScenes dataset demonstrate MASAR's effectiveness, showing improvements of over 20% in minADE and minFDE while maintaining robust detection performance. Code and models are available at https://github.com/aminmed/MASAR.
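
The two forecasting metrics quoted above have standard definitions, sketched here for a single agent. This is a generic illustration rather than the nuScenes evaluation code: trajectories are assumed to be lists of 2D waypoints of equal length, and `modes` holds the K candidate futures a model predicts.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Trajectory = Sequence[Point]


def displacement_errors(pred: Trajectory, gt: Trajectory) -> List[float]:
    """Per-timestep Euclidean distance between a predicted and a true trajectory."""
    if len(pred) != len(gt):
        raise ValueError("trajectories must cover the same timesteps")
    return [math.dist(p, g) for p, g in zip(pred, gt)]


def min_ade_fde(modes: Sequence[Trajectory], gt: Trajectory) -> Tuple[float, float]:
    """minADE: best average error over modes; minFDE: best final-step error over modes."""
    errors = [displacement_errors(mode, gt) for mode in modes]
    min_ade = min(sum(e) / len(e) for e in errors)
    min_fde = min(e[-1] for e in errors)
    return min_ade, min_fde


ground_truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
predicted_modes = [
    [(0.0, 2.0), (1.0, 2.0), (2.0, 0.0)],  # poor path, exact endpoint
    [(0.0, 0.0), (1.0, 0.0), (2.0, 1.0)],  # good path, endpoint off by 1
]
print(min_ade_fde(predicted_modes, ground_truth))
```

Note that the two minima may come from different modes, as in this toy case: the second mode gives the best ADE (1/3) while the first gives the best FDE (0).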

[92] Towards Universal Video MLLMs with Attribute-Structured and Quality-Verified Instructions

Yunheng Li,Hengrui Zhang,Meng-Hao Guo,Wenzhao Gao,Shaoyong Jia,Shaohui Jiao,Qibin Hou,Ming-Ming Cheng

Main category: cs.CV

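TL;DR: This paper introduces the ASID-1M dataset, the ASID-Verify annotation pipeline, and the ASID-Captioner model to improve the accuracy and reliability of fine-grained audiovisual descriptions in video understanding.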

Details Motivation: Existing video-instruction data lacks fine-grained organization and reliable annotation, making it hard to support universal video understanding tasks. Method: Build ASID-1M, a dataset of one million structured audiovisual instruction annotations; design ASID-Verify, a scalable data curation pipeline that enforces semantic and temporal consistency; and train the ASID-Captioner model via supervised fine-tuning. Result: On seven audiovisual understanding benchmarks, ASID-Captioner markedly improves fine-grained caption quality, reduces hallucinations, and strengthens instruction following, achieving SOTA among open-source models and competing with Gemini-3-Pro. Conclusion: Jointly optimizing structured, high-quality audiovisual instruction data and a dedicated model is a key path to stronger universal video understanding. Abstract: Universal video understanding requires modeling fine-grained visual and audio information over time in diverse real-world scenarios. However, the performance of existing models is primarily constrained by video-instruction data that represents complex audiovisual content as single, incomplete descriptions, lacking fine-grained organization and reliable annotation. To address this, we introduce: (i) ASID-1M, an open-source collection of one million structured, fine-grained audiovisual instruction annotations with single- and multi-attribute supervision; (ii) ASID-Verify, a scalable data curation pipeline for annotation, with automatic verification and refinement that enforces semantic and temporal consistency between descriptions and the corresponding audiovisual content; and (iii) ASID-Captioner, a video understanding model trained via Supervised Fine-Tuning (SFT) on the ASID-1M. Experiments across seven benchmarks covering audiovisual captioning, attribute-wise captioning, caption-based QA, and caption-based temporal grounding show that ASID-Captioner improves fine-grained caption quality while reducing hallucinations and improving instruction following. It achieves state-of-the-art performance among open-source models and is competitive with Gemini-3-Pro.

[93] Multimodal Classification via Total Correlation Maximization

Feng Yu,Xiangyu Wu,Yang Yang,Jianfeng Lu

Main category: cs.CV

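TL;DR: This paper proposes TCMax, which mitigates modality competition and improves multimodal classification performance by maximizing the total correlation between multimodal features and labels.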

Details Motivation: Joint learning tends to overfit some modalities and neglect others, often performing worse than unimodal learning; an in-depth information-theoretic analysis of the relationship between joint and unimodal learning is lacking. Method: Building on Mutual Information Neural Estimation (MINE), the paper proposes Total Correlation Neural Estimation (TCNE) to derive a lower bound on total correlation and designs TCMax, a hyperparameter-free loss function that maximizes total correlation through variational bound optimization while aligning features to capture inter-modal interactions. Result: TCMax significantly outperforms state-of-the-art joint and unimodal learning methods across multiple experiments. Conclusion: Maximizing the total correlation between multimodal features and labels effectively mitigates modality competition and strengthens cooperation between modalities, offering an effective information-theoretic route to more robust and accurate multimodal learning. Abstract: Multimodal learning integrates data from diverse sensors to effectively harness information from different modalities. However, recent studies reveal that joint learning often overfits certain modalities while neglecting others, leading to performance inferior to that of unimodal learning. Although previous efforts have sought to balance modal contributions or combine joint and unimodal learning, thereby mitigating the degradation of weaker modalities with promising outcomes, few have examined the relationship between joint and unimodal learning from an information-theoretic perspective. In this paper, we theoretically analyze modality competition and propose a method for multimodal classification by maximizing the total correlation between multimodal features and labels. By maximizing this objective, our approach alleviates modality competition while capturing inter-modal interactions via feature alignment. Building on Mutual Information Neural Estimation (MINE), we introduce Total Correlation Neural Estimation (TCNE) to derive a lower bound for total correlation. Subsequently, we present TCMax, a hyperparameter-free loss function that maximizes total correlation through variational bound optimization. Extensive experiments demonstrate that TCMax outperforms state-of-the-art joint and unimodal learning approaches. Our code is available at https://github.com/hubaak/TCMax.
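
For orientation, the quantities named above have standard textbook forms, written out below. These are the general definitions, not the paper's TCNE derivation: the symbols $Z_1, Z_2$ (two modality features), $Y$ (label), and $T$ (a learned critic function) are notation assumed for this note, and the exact bound optimized by TCMax may differ.

```latex
% Total correlation of n random variables (standard definition)
\mathrm{TC}(X_1,\dots,X_n)
  = \sum_{i=1}^{n} H(X_i) - H(X_1,\dots,X_n)
  = D_{\mathrm{KL}}\!\left( p(x_1,\dots,x_n) \,\Big\|\, \prod_{i=1}^{n} p(x_i) \right)

% Two modality features and a label: chain-rule decomposition
\mathrm{TC}(Z_1, Z_2, Y) = I(Z_1; Z_2) + I(Z_1, Z_2; Y)

% Donsker-Varadhan bound (the basis of MINE), valid for any critic T,
% with P the joint distribution and Q the product of marginals
D_{\mathrm{KL}}(P \,\|\, Q) \;\ge\; \mathbb{E}_{P}[T] - \log \mathbb{E}_{Q}\!\left[ e^{T} \right]
```

The decomposition shows why a single objective can cover both goals mentioned in the abstract: the first term rewards agreement between modality features (alignment), and the second rewards features that jointly predict the label.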

[94] DynaGuide: A Generalizable Dynamic Guidance Framework for Unsupervised Semantic Segmentation

Boujemaa Guermazi,Riadh Ksantini,Naimul Khan

Main category: cs.CV

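TL;DR: DynaGuide is an unsupervised image segmentation framework that combines zero-shot global pseudo-labels with local boundary refinement by a lightweight CNN and uses a dynamic multi-component loss, achieving high-precision segmentation and SOTA performance without ground-truth labels in the target domain.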

Details Motivation: Existing unsupervised image segmentation methods struggle to reconcile global semantic structure with fine-grained boundary accuracy, and depend on annotated data, limiting their use where labels are scarce. Method: The DynaGuide framework 1) generates global pseudo-labels with zero-shot models (such as DiffSeg or SegFormer); 2) refines local boundaries with a lightweight CNN trained from scratch; 3) uses a dynamic multi-component loss covering feature similarity, Huber-smoothed spatial continuity (including diagonal relationships), and semantic alignment; 4) needs no ground-truth labels in the target domain throughout and supports plug-and-play use of diverse guidance sources. Result: SOTA on BSD500, PASCAL VOC2012, and COCO, with mIoU gains of 17.5%, 3.1%, and 11.66% respectively; modular design, strong generalization, and low computational cost. Conclusion: DynaGuide provides a scalable, practical, and efficient solution for unsupervised image segmentation in real-world settings. Abstract: Unsupervised image segmentation is a critical task in computer vision. It enables dense scene understanding without human annotations, which is especially valuable in domains where labelled data is scarce. However, existing methods often struggle to reconcile global semantic structure with fine-grained boundary accuracy. This paper introduces DynaGuide, an adaptive segmentation framework that addresses these challenges through a novel dual-guidance strategy and dynamic loss optimization. Building on our previous work, DynaSeg, DynaGuide combines global pseudo-labels from zero-shot models such as DiffSeg or SegFormer with local boundary refinement using a lightweight CNN trained from scratch. This synergy allows the model to correct coarse or noisy global predictions and produce high-precision segmentations. At the heart of DynaGuide is a multi-component loss that dynamically balances feature similarity, Huber-smoothed spatial continuity, including diagonal relationships, and semantic alignment with the global pseudo-labels. Unlike prior approaches, DynaGuide trains entirely without ground-truth labels in the target domain and supports plug-and-play integration of diverse guidance sources. Extensive experiments on BSD500, PASCAL VOC2012, and COCO demonstrate that DynaGuide achieves state-of-the-art performance, improving mIoU by 17.5% on BSD500, 3.1% on PASCAL VOC2012, and 11.66% on COCO. With its modular design, strong generalization, and minimal computational footprint, DynaGuide offers a scalable and practical solution for unsupervised segmentation in real-world settings.
Code available at: https://github.com/RyersonMultimediaLab/DynaGuide

[95] Learning Image-based Tree Crown Segmentation from Enhanced Lidar-based Pseudo-labels

Julius Pesonen,Stefan Rua,Josef Taher,Niko Koivumäki,Xiaowei Yu,Eija Honkavaara

Main category: cs.CV

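TL;DR: This paper proposes generating pseudo-labels from airborne laser scanning (ALS) data and enhancing them with the SAM 2 model, then training deep learning models to segment and separate individual tree crowns in RGB and multispectral images, yielding a high-performing, domain-specific segmentation model without manual annotation.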

Details Motivation: Automatically separating individual tree crowns in aerial imagery is difficult (complex texture, overlapping crowns), and manual annotation is costly, so a low-cost, high-accuracy annotation approach is needed. Method: Initial pseudo-labels are generated from ALS data and enhanced with the zero-shot instance segmentation model SAM 2; deep learning models are then trained on the enhanced pseudo-labels to segment and separate individual tree crowns in RGB and multispectral imagery. Result: The proposed method clearly outperforms existing general-domain models on the same task and avoids manual annotation cost entirely. Conclusion: ALS-derived pseudo-labels enhanced with SAM 2 offer an efficient, low-cost way to build domain-specific tree crown segmentation models and a scalable technical path for remote-sensing ecological monitoring. Abstract: Mapping individual tree crowns is essential for tasks such as maintaining urban tree inventories and monitoring forest health, which help us understand and care for our environment. However, automatically separating the crowns from each other in aerial imagery is challenging due to factors such as the texture and partial tree crown overlaps. In this study, we present a method to train deep learning models that segment and separate individual trees from RGB and multispectral images, using pseudo-labels derived from aerial laser scanning (ALS) data. Our study shows that the ALS-derived pseudo-labels can be enhanced using a zero-shot instance segmentation model, Segment Anything Model 2 (SAM 2). Our method offers a way to obtain domain-specific training annotations for optical image-based models without any manual annotation cost, leading to segmentation models which outperform any available models which have been targeted for general domain deployment on the same task.

[96] FedHENet: A Frugal Federated Learning Framework for Heterogeneous Environments

Alejandro Dopico-Castro,Oscar Fontenla-Romero,Bertha Guijarro-Berdiñas,Amparo Alonso-Betanzos,Iván Pérez Digón

Main category: cs.CV

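TL;DR: FedHENet is a federated learning framework based on homomorphic encryption that uses a fixed pre-trained feature extractor, learns only a single output layer, and aggregates client knowledge in a single round, giving efficient, stable, and privacy-preserving image classification.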

Details Motivation: Existing federated learning methods rely on expensive iterative deep network optimization, and shared gradients still risk privacy leakage; hyperparameter tuning also carries a significant carbon footprint. Method: The paper extends the FedHEONN framework to image classification, using a fixed pre-trained feature extractor, solving analytically for the weights of a single output layer locally on each client, and aggregating knowledge securely in a single communication round with homomorphic encryption. Result: While keeping accuracy comparable to iterative federated learning baselines, the method shows better stability and up to 70% better energy efficiency, with no hyperparameter tuning. Conclusion: FedHENet offers an efficient, low-energy, stable, and hyperparameter-free approach to privacy-preserving federated image classification. Abstract: Federated Learning (FL) enables collaborative training without centralizing data, essential for privacy compliance in real-world scenarios involving sensitive visual information. Most FL approaches rely on expensive, iterative deep network optimization, which still risks privacy via shared gradients. In this work, we propose FedHENet, extending the FedHEONN framework to image classification. By using a fixed, pre-trained feature extractor and learning only a single output layer, we avoid costly local fine-tuning. This layer is learned by analytically aggregating client knowledge in a single round of communication using homomorphic encryption (HE). Experiments show that FedHENet achieves competitive accuracy compared to iterative FL baselines while demonstrating superior stability performance and up to 70% better energy efficiency. Crucially, our method is hyperparameter-free, removing the carbon footprint associated with hyperparameter tuning in standard FL. Code available in https://github.com/AlejandroDopico2/FedHENet/
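
Why a single output layer can be aggregated in one round is easiest to see with ordinary least squares. The sketch below is a simplified stand-in, not FedHENet's or FedHEONN's actual solver: each client sends only the sums `X^T X` and `X^T y` computed on its fixed features, the server adds them and solves once, and encryption is omitted entirely. The optional `ridge` term is a knob of this sketch only. Because the server merely adds client statistics, this is the kind of step an additively homomorphic scheme could carry out on ciphertexts.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]
Stats = Tuple[Matrix, List[float]]


def client_stats(features: Sequence[Sequence[float]], targets: Sequence[float]) -> Stats:
    """Local sufficient statistics: A = X^T X and b = X^T y (raw data never leaves)."""
    d = len(features[0])
    a = [[sum(row[i] * row[j] for row in features) for j in range(d)] for i in range(d)]
    b = [sum(row[i] * t for row, t in zip(features, targets)) for i in range(d)]
    return a, b


def solve(a: Matrix, b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(map(float, row)) + [float(b[i])] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def aggregate(all_stats: Sequence[Stats], ridge: float = 0.0) -> List[float]:
    """Server side: add the client statistics, then solve once for the layer weights."""
    d = len(all_stats[0][1])
    a = [[sum(s[0][i][j] for s in all_stats) + (ridge if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(s[1][i] for s in all_stats) for i in range(d)]
    return solve(a, b)


# Two toy clients whose data follow y = 2*x1 + 1*x2 exactly.
client_a = client_stats([[1.0, 0.0], [0.0, 1.0]], [2.0, 1.0])
client_b = client_stats([[1.0, 1.0]], [3.0])
weights = aggregate([client_a, client_b])
print([round(w, 6) for w in weights])
```

The aggregated solution is identical to what a centralized least-squares fit on the pooled data would give, which is why no further communication rounds are needed in this simplified setting.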

[97] Implicit-Scale 3D Reconstruction for Multi-Food Volume Estimation from Monocular Images

Yuhao Chen,Gautham Vinod,Siddeshwar Raghavan,Talha Ibn Mahmud,Bruce Coburn,Jinge Ma,Fengqing Zhu,Jiangpeng He

Main category: cs.CV

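TL;DR: This paper presents the Implicit-Scale 3D Reconstruction from Monocular Multi-Food Images dataset, which reframes food portion estimation as implicit-scale monocular multi-food 3D reconstruction and emphasizes geometric reasoning without an explicit scale reference. Results from the MetaFood 2025 challenge show that geometry-based reconstruction is more accurate and robust than vision-language models.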

Details Motivation: Existing dietary assessment methods rely on single-image analysis or appearance-based inference (such as VLMs), lack explicit geometric reasoning, are sensitive to scale ambiguity, and adapt poorly to real dining scenes. Method: A first benchmark dataset for implicit-scale monocular multi-food 3D reconstruction is built. Physical references and metric annotations are removed and only contextual objects such as plates and utensils are provided, so algorithms must infer scale from implicit cues and prior knowledge; geometry-based reconstruction solutions are encouraged. Result: Geometry-based reconstruction methods outperform strong vision-language baselines in volume estimation (MAPE = 0.21) and geometric accuracy (L1 Chamfer Distance = 5.7), and are more accurate and robust. Conclusion: Modeling food portion estimation as implicit-scale 3D reconstruction better matches real-world needs, and geometry-driven methods are an effective way to improve dietary assessment. Abstract: We present Implicit-Scale 3D Reconstruction from Monocular Multi-Food Images, a benchmark dataset designed to advance geometry-based food portion estimation in realistic dining scenarios. Existing dietary assessment methods largely rely on single-image analysis or appearance-based inference, including recent vision-language models, which lack explicit geometric reasoning and are sensitive to scale ambiguity. This benchmark reframes food portion estimation as an implicit-scale 3D reconstruction problem under monocular observations. To reflect real-world conditions, explicit physical references and metric annotations are removed; instead, contextual objects such as plates and utensils are provided, requiring algorithms to infer scale from implicit cues and prior knowledge. The dataset emphasizes multi-food scenes with diverse object geometries, frequent occlusions, and complex spatial arrangements. The benchmark was adopted as a challenge at the MetaFood 2025 Workshop, where multiple teams proposed reconstruction-based solutions. Experimental results show that while strong vision-language baselines achieve competitive performance, geometry-based reconstruction methods provide both improved accuracy and greater robustness, with the top-performing approach achieving 0.21 MAPE in volume estimation and 5.7 L1 Chamfer Distance in geometric accuracy.
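
For readers unfamiliar with the two reported metrics, the following sketch shows one common way to compute them. The benchmark's exact conventions are not stated in the abstract, so two assumptions are made here: MAPE is reported as a fraction rather than a percentage, and the L1 Chamfer Distance uses unsquared Euclidean nearest-neighbor distances summed over both directions. The volumes and point clouds are made-up illustration values, not benchmark data.

```python
import numpy as np

def mape(predicted, reference):
    """Mean absolute percentage error, returned as a fraction (0.21 means 21%)."""
    predicted = np.asarray(predicted, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return float(np.mean(np.abs(predicted - reference) / np.abs(reference)))

def chamfer_l1(points_a, points_b):
    """Symmetric Chamfer distance with unsquared distances (assumed convention)."""
    # pairwise[i, j] is the Euclidean distance between points_a[i] and points_b[j].
    pairwise = np.linalg.norm(points_a[:, None, :] - points_b[None, :, :], axis=-1)
    a_to_b = pairwise.min(axis=1).mean()  # each point in A to its nearest in B
    b_to_a = pairwise.min(axis=0).mean()  # each point in B to its nearest in A
    return float(a_to_b + b_to_a)

# Hypothetical food volumes (arbitrary units).
volume_error = mape([110.0, 90.0], [100.0, 100.0])

# Two tiny hypothetical point clouds.
cloud_a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
cloud_b = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.5]])
shape_error = chamfer_l1(cloud_a, cloud_b)
```

Other papers halve the two-direction sum or use squared distances, so absolute Chamfer values are only comparable under the same convention.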

[98] Curriculum-DPO++: Direct Preference Optimization via Data and Model Curricula for Text-to-Image Generation

Florinel-Alin Croitoru,Vlad Hondru,Radu Tudor Ionescu,Nicu Sebe,Mubarak Shah

Main category: cs.CV

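TL;DR: This paper proposes Curriculum-DPO++, which adds a two-level curriculum (data and model) on top of Curriculum-DPO. Model capacity grows gradually by progressively unfreezing UNet layers and raising the LoRA rank, and the preference-pair ranking strategy is improved, yielding significant gains in text alignment, aesthetic quality, and human preference scores on nine benchmarks.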

Details Motivation: Existing DPO and RLHF methods ignore differences in learning difficulty between image preference pairs, which makes optimization suboptimal; text-to-image generation in particular needs a finer-grained, difficulty-aware training strategy. Method: Curriculum-DPO++ combines two curricula: 1) data level, which keeps and improves the difficulty ranking of image pairs from Curriculum-DPO; 2) model level, which expands model capacity dynamically by a) unfreezing trainable UNet layers stage by stage and b) increasing the rank of the LoRA low-rank matrices as training progresses. Together they implement easy-to-hard progressive learning. Result: On nine text-to-image generation benchmarks, Curriculum-DPO++ surpasses Curriculum-DPO and other state-of-the-art preference optimization methods in text alignment, aesthetic score, and human preference. Conclusion: Model capacity should grow during training to match the evolving difficulty of the data; Curriculum-DPO++ shows that jointly designing data-level and model-level curricula is effective for preference optimization. Abstract: Direct Preference Optimization (DPO) has been proposed as an effective and efficient alternative to reinforcement learning from human feedback (RLHF). However, neither RLHF nor DPO takes into account the fact that learning certain preferences is more difficult than learning other preferences, rendering the optimization process suboptimal. To address this gap in text-to-image generation, we recently proposed Curriculum-DPO, a method that organizes image pairs by difficulty. In this paper, we introduce Curriculum-DPO++, an enhanced method that combines the original data-level curriculum with a novel model-level curriculum. More precisely, we propose to dynamically increase the learning capacity of the denoising network as training advances. We implement this capacity increase via two mechanisms. First, we initialize the model with only a subset of the trainable layers used in the original Curriculum-DPO. As training progresses, we sequentially unfreeze layers until the configuration matches the full baseline architecture. Second, as the fine-tuning is based on Low-Rank Adaptation (LoRA), we implement a progressive schedule for the dimension of the low-rank matrices. Instead of maintaining a fixed capacity, we initialize the low-rank matrices with a dimension significantly smaller than that of the baseline. As training proceeds, we incrementally increase their rank, allowing the capacity to grow until it converges to the same rank value as in Curriculum-DPO. Furthermore, we propose an alternative ranking strategy to the one employed by Curriculum-DPO. Finally, we compare Curriculum-DPO++ against Curriculum-DPO and other state-of-the-art preference optimization approaches on nine benchmarks, outperforming the competing methods in terms of text alignment, aesthetics and human preference. Our code is available at https://github.com/CroitoruAlin/Curriculum-DPO.
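
The progressive LoRA rank schedule described above can be sketched as follows. The abstract does not say how the rank is scheduled or how the matrices are enlarged, so this sketch assumes a stage-wise linear schedule and a growth step that appends small random rows to the down-projection and zero columns to the up-projection, which leaves the current weight update unchanged. The names `rank_for_step` and `grow_lora` are hypothetical and do not come from the released code.

```python
import numpy as np

def rank_for_step(step, total_steps, start_rank, final_rank, num_stages):
    """Stage-wise linear rank schedule (an assumed schedule, for illustration)."""
    if num_stages <= 1:
        return final_rank
    stage = min(num_stages - 1, step * num_stages // total_steps)
    return start_rank + (final_rank - start_rank) * stage // (num_stages - 1)

def grow_lora(down, up, new_rank, rng, init_scale=0.01):
    """Enlarge LoRA factors so that up @ down is unchanged right after growth.

    down has shape (rank, in_dim) and up has shape (out_dim, rank).
    """
    rank, in_dim = down.shape
    out_dim = up.shape[0]
    extra = new_rank - rank
    if extra <= 0:
        return down, up
    # New down-projection rows start small and random; new up-projection
    # columns start at zero, so the added capacity contributes nothing yet.
    new_down = np.vstack([down, init_scale * rng.normal(size=(extra, in_dim))])
    new_up = np.hstack([up, np.zeros((out_dim, extra))])
    return new_down, new_up

rng = np.random.default_rng(0)
down = rng.normal(size=(2, 16))
up = rng.normal(size=(8, 2))

# Rank used at a few training steps, growing from 2 to 8 over four stages.
ranks = [rank_for_step(step, 100, 2, 8, 4) for step in (0, 25, 50, 75, 99)]

# Grow from rank 2 to rank 4 without changing the current weight update.
grown_down, grown_up = grow_lora(down, up, 4, rng)
```

Zero-initializing the new up-projection columns mirrors the standard LoRA initialization, so each growth step behaves like starting a fresh low-rank adapter on top of what has already been learned.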

[99] A Calibrated Memorization Index (MI) for Detecting Training Data Leakage in Generative MRI Models

Yash Deo,Yan Jia,Toni Lassila,Victoria J Hodge,Alejandro F Frang,Chenghao Qian,Siyuan Kang,Ibrahim Habli

Main category: cs.CV

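TL;DR: This paper proposes a calibrated per-sample metric for detecting memorization and duplication of training data by image generative models in medical MRI generation. Features are extracted with an MRI foundation model, multi-layer whitened nearest-neighbor similarities are aggregated and mapped to the bounded ONI and MI scores, and sample-level duplicate detection is near-perfect across several MRI datasets.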

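Details Motivation: Image generative models can reproduce training data, which raises privacy risks in medical image generation in particular, so a reliable method for detecting memorization and duplication is needed. Method: Image features are extracted with an MRI foundation model, multi-layer whitened nearest-neighbor similarities are aggregated, and the result is mapped to two bounded scores: the Overfit/Novelty Index (ONI) and the Memorization Index (MI). Result: On three MRI datasets with controlled duplication percentages and typical augmentations, the metric robustly detects duplication and gives more consistent values across datasets; sample-level duplicate detection is near-perfect. Conclusion: The calibrated ONI and MI scores quantify how strongly generated images memorize training data in an effective, robust, and consistent way, providing a new tool for privacy assessment in medical image generation. Abstract: Image generative models are known to duplicate images from the training data as part of their outputs, which can lead to privacy concerns when used for medical image generation. We propose a calibrated per-sample metric for detecting memorization and duplication of training data. Our metric uses image features extracted using an MRI foundation model, aggregates multi-layer whitened nearest-neighbor similarities, and maps them to bounded Overfit/Novelty Index (ONI) and Memorization Index (MI) scores. Across three MRI datasets with controlled duplication percentages and typical image augmentations, our metric robustly detects duplication and provides more consistent metric values across datasets. At the sample level, our metric achieves near-perfect detection of duplicates.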
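
The whitened nearest-neighbor similarity at the core of the metric can be illustrated for a single feature layer. This sketch makes several assumptions not confirmed by the abstract: ZCA whitening fitted on training features, cosine similarity as the similarity measure, and random vectors standing in for foundation-model features. The multi-layer aggregation and the mapping to ONI and MI are not reproduced because the abstract does not specify them.

```python
import numpy as np

def fit_whitening(train_features, eps=1e-6):
    """Fit a ZCA whitening transform on training features (assumed choice)."""
    mean = train_features.mean(axis=0)
    cov = np.cov(train_features - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov + eps * np.eye(cov.shape[0]))
    transform = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T
    return mean, transform

def nearest_neighbor_similarity(train_features, generated_features):
    """Cosine similarity of each generated sample to its closest training sample."""
    mean, transform = fit_whitening(train_features)
    train_w = (train_features - mean) @ transform
    gen_w = (generated_features - mean) @ transform
    train_w = train_w / np.linalg.norm(train_w, axis=1, keepdims=True)
    gen_w = gen_w / np.linalg.norm(gen_w, axis=1, keepdims=True)
    return (gen_w @ train_w.T).max(axis=1)

# Random vectors stand in for features of training and generated images.
rng = np.random.default_rng(0)
train = rng.normal(size=(200, 16))
# First "generated" sample is an exact copy of a training sample, second is new.
generated = np.vstack([train[3], rng.normal(size=16)])
similarity = nearest_neighbor_similarity(train, generated)
```

The copied sample reaches a similarity of essentially 1, while the novel sample scores lower; a calibrated index would then map such raw similarities to a bounded score.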

[100] SIEFormer: Spectral-Interpretable and -Enhanced Transformer for Generalized Category Discovery

Chunming Li,Shidong Wang,Tong Xin,Haofeng Zhang

Main category: cs.CV

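TL;DR: This paper proposes the Spectral-Interpretable and -Enhanced Transformer (SIEFormer), which reinterprets the attention mechanism in ViT through spectral analysis and enhances feature adaptability, particularly for Generalized Category Discovery (GCD). It has an implicit and an explicit spectral branch: the implicit branch models local structural relations among tokens with graph Laplacians and adds a BaF layer for adaptive band filtering; the explicit branch applies a Fourier transform to the value features, learns modulation parameters in the frequency domain, and reconstructs the features with an inverse transform. Experiments show state-of-the-art performance on multiple image recognition tasks.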

Details Motivation: Vision Transformers have limited feature expressiveness and interpretability on complex tasks such as Generalized Category Discovery (GCD); spectral analysis is introduced to give the attention mechanism a clearer physical meaning and better adaptability. Method: SIEFormer has two spectral branches: 1) the implicit branch uses several graph Laplacians to model local token structure and includes a Band-adaptive Filter (BaF) for adaptive band filtering; 2) the explicit branch applies a Fourier transform to the value features, modulates them with learnable parameters in the frequency domain, and applies the inverse transform, forming the Maneuverable Filtering Layer (MFL). The two branches are optimized jointly. Result: State-of-the-art performance on multiple image recognition benchmarks, GCD in particular; ablations and visualizations confirm the contribution of each module and the soundness of the spectral enhancement. Conclusion: Spectral analysis improves the interpretability and generalization of ViT, and the combined implicit and explicit branches offer a new way to enhance Transformer features, especially for category discovery under unsupervised or weak supervision. Abstract: This paper presents a novel approach, Spectral-Interpretable and -Enhanced Transformer (SIEFormer), which leverages spectral analysis to reinterpret the attention mechanism within Vision Transformer (ViT) and enhance feature adaptability, with particular emphasis on challenging Generalized Category Discovery (GCD) tasks. The proposed SIEFormer is composed of two main branches, corresponding respectively to an implicit and an explicit spectral perspective of the ViT, enabling joint optimization. The implicit branch uses different types of graph Laplacians to model the local structure correlations of tokens, along with a novel Band-adaptive Filter (BaF) layer that can flexibly perform both band-pass and band-reject filtering. The explicit branch, on the other hand, introduces a Maneuverable Filtering Layer (MFL) that learns global dependencies among tokens by applying the Fourier transform to the input "value" features, modulating the transformed signal with a set of learnable parameters in the frequency domain, and then performing an inverse Fourier transform to obtain the enhanced features. Extensive experiments reveal state-of-the-art performance on multiple image recognition datasets, reaffirming the superiority of our approach through ablation studies and visualizations.
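
The explicit branch's transform, modulate, inverse-transform pattern can be sketched in a few lines. This is only an illustration of the general idea, not the MFL itself: it assumes the Fourier transform is taken along the token axis, uses a real-input FFT, and treats the filter as a plain array rather than trained parameters. The function name `frequency_modulate` is hypothetical.

```python
import numpy as np

def frequency_modulate(values, frequency_filter):
    """Filter value features in the frequency domain along the token axis.

    values: (num_tokens, dim) real-valued features.
    frequency_filter: (num_tokens // 2 + 1, dim) multiplicative weights that
    would be learnable parameters in a trained model.
    """
    spectrum = np.fft.rfft(values, axis=0)
    modulated = spectrum * frequency_filter
    return np.fft.irfft(modulated, n=values.shape[0], axis=0)

rng = np.random.default_rng(0)
num_tokens, dim = 16, 4
values = rng.normal(size=(num_tokens, dim))

# An all-ones filter leaves the features unchanged.
identity_filter = np.ones((num_tokens // 2 + 1, dim))
unchanged = frequency_modulate(values, identity_filter)

# Keeping only the zero-frequency component replaces every token with the
# per-channel mean, i.e. a purely global summary of all tokens.
dc_only_filter = np.zeros((num_tokens // 2 + 1, dim))
dc_only_filter[0] = 1.0
global_mean = frequency_modulate(values, dc_only_filter)
```

Because each frequency coefficient depends on every token, even a simple per-frequency weight mixes information globally, which is consistent with the abstract's claim that the layer learns global dependencies among tokens.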

[101] Universal Transformation of One-Class Classifiers for Unsupervised Anomaly Detection

Declan McIntosh,Alexandra Branzan Albu

Main category: cs.CV

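TL;DR: This paper proposes a dataset folding method that turns anomaly detectors based on one-class classifiers into fully unsupervised methods. The underlying model is left unmodified and only the training subsets are selected algorithmically; the approach reaches state-of-the-art performance on several datasets.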

Details Motivation: Most anomaly detection methods assume a one-class classification setting, but real training data often contains label noise, which harms robustness; unsupervised anomaly detection needs to improve without relying on clean labels. Method: Dataset folding uses multiple independently trained one-class classifiers and the weak assumption that anomalies are rare and heterogeneous to iteratively find and remove likely anomalies from the training set, producing a cleaner training subset and an unsupervised detector. Result: State-of-the-art unsupervised anomaly detection on MVTec AD, ViSA, and MVTec Loco AD; the method also produces the first unsupervised logical anomaly detectors and directly inherits later improvements to one-class classifiers. Conclusion: The method is a general bridge between one-class learning and unsupervised anomaly detection, improving practicality and generalization without adding model complexity or training overhead. Abstract: Detecting anomalies in images and video is an essential task for multiple real-world problems, including industrial inspection, computer-assisted diagnosis, and environmental monitoring. Anomaly detection is typically formulated as a one-class classification problem, where the training data consists solely of nominal values, leaving methods built on this assumption susceptible to training label noise. We present a dataset folding method that transforms an arbitrary one-class classifier-based anomaly detector into a fully unsupervised method. This is achieved by making a set of key weak assumptions: that anomalies are uncommon in the training dataset and generally heterogeneous. These assumptions enable us to utilize multiple independently trained instances of a one-class classifier to filter the training dataset for anomalies. This transformation requires no modifications to the underlying anomaly detector; the only changes are algorithmically selected data subsets used for training. We demonstrate that our method can transform a wide variety of one-class classifier anomaly detectors for both images and videos into unsupervised ones. Our method creates the first unsupervised logical anomaly detectors by transforming existing methods. We also demonstrate that our method achieves state-of-the-art performance for unsupervised anomaly detection on the MVTec AD, ViSA, and MVTec Loco AD datasets. As improvements to one-class classifiers are made, our method directly transfers those improvements to the unsupervised domain, linking the domains.
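
The idea of filtering a contaminated training set with several independently trained one-class models can be illustrated with a toy example. The paper's actual folding procedure is not given in the abstract; the sketch below assumes a simple K-fold scheme in which each fold is scored by a detector trained on the remaining folds, a distance-to-mean "detector", and a median-plus-MAD cutoff. All names and constants are hypothetical.

```python
import numpy as np

def fit_detector(train):
    """Toy one-class model: remember the mean of the training data."""
    return train.mean(axis=0)

def score(detector, samples):
    """Anomaly score: distance from the remembered mean."""
    return np.linalg.norm(samples - detector, axis=1)

def fold_and_filter(data, num_folds=4, cutoff=5.0, seed=0):
    """Return a boolean mask of samples kept after cross-fold filtering."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(data)), num_folds)
    keep = np.ones(len(data), dtype=bool)
    for held_out in folds:
        train_mask = np.ones(len(data), dtype=bool)
        train_mask[held_out] = False
        detector = fit_detector(data[train_mask])
        # Robust threshold from the scores of the detector's own training data.
        reference = score(detector, data[train_mask])
        median = np.median(reference)
        mad = np.median(np.abs(reference - median))
        keep[held_out] = score(detector, data[held_out]) <= median + cutoff * mad
    return keep

# 190 nominal samples plus 10 rare anomalies scattered in different directions.
rng = np.random.default_rng(1)
nominal = rng.normal(size=(190, 5))
directions = rng.normal(size=(10, 5))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
anomalies = 12.0 * directions
data = np.vstack([nominal, anomalies])

keep = fold_and_filter(data)
cleaned = data[keep]
```

The detector itself is never modified; only the subset of data it would be retrained on changes, which matches the abstract's description of the transformation.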

[102] Realistic Face Reconstruction from Facial Embeddings via Diffusion Models

Dong Han,Yong Li,Joachim Denzler

Main category: cs.CV

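TL;DR: This paper proposes FEM, a general framework that uses a Kolmogorov-Arnold Network (KAN) and a pre-trained identity-preserving diffusion model to reconstruct realistic high-resolution face images from facial embeddings, in order to assess the privacy leakage risk of current FR and privacy-preserving FR (PPFR) systems.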

Details Motivation: PPFR systems emphasize privacy protection, but their privacy risk under high-fidelity face reconstruction attacks has not been verified in depth; in particular, the ability to recover high-resolution faces from embeddings is understudied. Method: The Face Embedding Mapping (FEM) framework combines a Kolmogorov-Arnold Network (KAN) with a pre-trained Identity-Preserving diffusion model to invert (partial or protected) embeddings into high-quality face images, and is evaluated as an attack on several FR and PPFR systems. Result: 1) reconstructed faces pass verification on multiple real-world FR systems; 2) FEM remains robust when reconstructing from partial and protected embeddings; 3) the method can serve as a tool for evaluating the privacy security of FR and PPFR systems. Conclusion: FEM shows that current FR and PPFR systems still leak significant private information at the embedding level, which calls for rethinking the security of embedding representations and offers a new approach to privacy evaluation. Abstract: With the advancement of face recognition (FR) systems, privacy-preserving face recognition (PPFR) systems have gained popularity for their accurate recognition, enhanced facial privacy protection, and robustness to various attacks. However, few studies have further verified privacy risks by reconstructing realistic high-resolution face images from the embeddings of these systems, especially for PPFR. In this work, we propose the face embedding mapping (FEM), a general framework that explores the Kolmogorov-Arnold Network (KAN) for conducting the embedding-to-face attack by leveraging a pre-trained Identity-Preserving diffusion model against state-of-the-art (SOTA) FR and PPFR systems. Based on extensive experiments, we verify that reconstructed faces can be used to access other real-world FR systems. In addition, the proposed method is robust in reconstructing faces from partial and protected face embeddings. Moreover, FEM can be utilized as a tool for evaluating the safety of FR and PPFR systems in terms of privacy leakage. All images used in this work are from public datasets.

[103] LongStream: Long-Sequence Streaming Autoregressive Visual Geometry

Chong Cheng,Xianda Chen,Tao Xie,Wei Yin,Weiqiang Ren,Qian Zhang,Xiaoyuang Guo,Hao Wang

Main category: cs.CV

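TL;DR: LongStream is a gauge-decoupled streaming visual geometry model. Through keyframe-relative pose prediction, orthogonal scale learning, and cache-consistent training, it addresses attention decay, scale drift, and cache contamination in long-sequence streaming 3D reconstruction, and achieves stable metric-scale reconstruction over kilometer-scale sequences.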

Details Motivation: Existing autoregressive models for long-sequence streaming 3D reconstruction anchor poses to the first frame, which leads to attention decay, scale drift, and extrapolation errors. Method: 1) drop the first-frame anchor and predict keyframe-relative poses; 2) introduce orthogonal scale learning to decouple geometry from scale estimation; 3) use cache-consistent training with periodic cache refresh to resolve Transformer cache issues. Result: LongStream delivers stable, metric-scale 3D reconstruction over kilometer-scale sequences at 18 FPS and reaches state-of-the-art performance. Conclusion: With these three techniques, LongStream addresses the core challenges of long-sequence streaming 3D reconstruction and markedly improves reconstruction stability and scale consistency. Abstract: Long-sequence streaming 3D reconstruction remains a significant open challenge. Existing autoregressive models often fail when processing long sequences. They typically anchor poses to the first frame, which leads to attention decay, scale drift, and extrapolation errors. We introduce LongStream, a novel gauge-decoupled streaming visual geometry model for metric-scale scene reconstruction across thousands of frames. Our approach is threefold. First, we discard the first-frame anchor and predict keyframe-relative poses. This reformulates long-range extrapolation into a constant-difficulty local task. Second, we introduce orthogonal scale learning. This method fully disentangles geometry from scale estimation to suppress drift. Finally, we solve Transformer cache issues such as attention-sink reliance and long-term KV-cache contamination. We propose cache-consistent training combined with periodic cache refresh. This approach suppresses attention degradation over ultra-long sequences and reduces the gap between training and inference. Experiments show LongStream achieves state-of-the-art performance. It delivers stable, metric-scale reconstruction over kilometer-scale sequences at 18 FPS. Project Page: https://3dagentworld.github.io/longstream/
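
The difference between first-frame-anchored and keyframe-relative poses can be made concrete with rigid transforms. This sketch only illustrates the reparameterization, not LongStream's network: it assumes 4x4 camera-to-world matrices and uses made-up pose values far from the origin to mimic a late point in a long sequence.

```python
import numpy as np

def make_pose(yaw, translation):
    """Build a 4x4 camera-to-world pose from a yaw angle and a translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    pose = np.eye(4)
    pose[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    pose[:3, 3] = translation
    return pose

def relative_to_keyframe(keyframe_pose, frame_pose):
    """Express a frame's pose in the coordinate system of its keyframe."""
    return np.linalg.inv(keyframe_pose) @ frame_pose

def to_world(keyframe_pose, relative_pose):
    """Recover the world pose by chaining the keyframe pose and the relative pose."""
    return keyframe_pose @ relative_pose

# Hypothetical poses far from the origin of the first frame.
keyframe = make_pose(0.30, [100.0, 50.0, 0.0])
frame = make_pose(0.35, [101.0, 50.5, 0.0])

relative = relative_to_keyframe(keyframe, frame)
recovered = to_world(keyframe, relative)

# Magnitude of the prediction target under each parameterization.
anchored_distance = np.linalg.norm(frame[:3, 3])
relative_distance = np.linalg.norm(relative[:3, 3])
```

The relative target stays small and local regardless of how far the camera has travelled, which is the sense in which the abstract calls it a constant-difficulty local task.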

[104] Monocular Markerless Motion Capture Enables Quantitative Assessment of Upper Extremity Reachable Workspace

Seth Donahue,J. D. Peiffer,R. Tyler Richardson,Yishan Zhong,Shaun Q. Y. Tan,Benoit Marteau,Stephanie R. Russo,May D. Wang,R. James Cotton,Ross Chafetz

Main category: cs.CV

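TL;DR: This study validates a clinically feasible method for quantifying the Upper Extremity Reachable Workspace (UERW) using a single camera and AI-driven markerless motion capture (MMC). A frontal monocular camera configuration agrees closely with the gold-standard marker-based motion capture system and shows potential for clinical use.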

Details Motivation: To advance the use of AI-driven monocular markerless motion capture for upper extremity function assessment in clinical settings, lowering the technical barrier and cost of conventional multi-camera marker-based systems. Method: Nine healthy adults performed the standardized UERW task (reaching targets on a virtual sphere displayed in VR) while 1) video from eight FLIR cameras and 2) marker-based motion capture data (the reference standard) were recorded simultaneously. A frontal and an offset monocular view were selected for AI-driven MMC analysis and compared against the reference in terms of the percentage of reachable workspace reached in each octant. Result: The frontal monocular configuration showed a minimal mean bias (0.61 ± 0.12%), whereas the offset view underestimated significantly (-5.66 ± 0.45%); agreement for the frontal configuration was highest in the anterior workspace. This is the first validation of a monocular MMC system for the UERW task. Conclusion: A frontal monocular AI-MMC setup can assess UERW effectively and reliably, especially in the anterior workspace, and is promising for convenient, low-cost quantitative assessment of upper extremity mobility in the clinic. Abstract: We aim to validate a clinically accessible approach for quantifying the Upper Extremity Reachable Workspace (UERW) using a single (monocular) camera and Artificial Intelligence (AI)-driven Markerless Motion Capture (MMC) for biomechanical analysis. Objective assessment and validation of these techniques for specific clinically oriented tasks are crucial for their adoption in clinical motion analysis. AI-driven monocular MMC reduces the barriers to adoption in the clinic and has the potential to reduce the overhead for analysis of this common clinical assessment. Nine adult participants with no impairments performed the standardized UERW task, which entails reaching targets distributed across a virtual sphere centered on the torso, with targets displayed in a VR headset. Movements were simultaneously captured using a marker-based motion capture system and a set of eight FLIR cameras. We performed monocular video analysis on two of these video camera views to compare frontal and offset camera configurations. The frontal camera orientation demonstrated strong agreement with the marker-based reference, exhibiting a minimal mean bias of 0.61 ± 0.12% of reachspace reached per octant (mean ± standard deviation). In contrast, the offset camera view underestimated the percent workspace reached (-5.66 ± 0.45% of reachspace reached). The findings support the feasibility of a frontal monocular camera configuration for UERW assessment, particularly for anterior workspace evaluation, where agreement with marker-based motion capture was highest. The overall performance demonstrates clinical potential for practical, single-camera assessments. This study provides the first validation of a monocular MMC system for the assessment of the UERW task. By reducing technical complexity, this approach enables broader implementation of quantitative upper extremity mobility assessment.

[105] FlexAM: Flexible Appearance-Motion Decomposition for Versatile Video Generation Control

Mingzhi Sheng,Zekai Gu,Peng Li,Cheng Lin,Hao-Xiang Guo,Ying-Cong Chen,Yuan Liu

Main category: cs.CV

TL;DR: 本文提出FlexAM框架,通过新型3D控制信号(点云表示)实现外观与运动的解耦,提升视频生成中的控制能力与泛化性。

Details Motivation: 现有视频生成方法依赖模糊或任务特定的控制信号,缺乏通用性和鲁棒性;作者认为外观与运动的解耦是更根本、可扩展的解决路径。 Method: 提出FlexAM统一框架,基于新型3D点云控制信号,引入多频位置编码、深度感知位置编码和可调控制机制,以实现外观与运动的有效解耦。 Result: 在I2V/V2V编辑、相机控制、空间物体编辑等广泛任务上均取得优于现有方法的性能。 Conclusion: 外观-运动解耦是提升视频生成可控性与泛化性的有效范式,FlexAM验证了该思路的可行性与优越性。 Abstract: Effective and generalizable control in video generation remains a significant challenge. While many methods rely on ambiguous or task-specific signals, we argue that a fundamental disentanglement of "appearance" and "motion" provides a more robust and scalable pathway. We propose FlexAM, a unified framework built upon a novel 3D control signal. This signal represents video dynamics as a point cloud, introducing three key enhancements: multi-frequency positional encoding to distinguish fine-grained motion, depth-aware positional encoding, and a flexible control signal for balancing precision and generative quality. This representation allows FlexAM to effectively disentangle appearance and motion, enabling a wide range of tasks including I2V/V2V editing, camera control, and spatial object editing. Extensive experiments demonstrate that FlexAM achieves superior performance across all evaluated tasks.

[106] Conversational Image Segmentation: Grounding Abstract Concepts with Scalable Supervision

Aadarsh Sahoo,Georgia Gkioxari

Main category: cs.CV

TL;DR: 本文提出了对话式图像分割(CIS)任务及ConverSeg基准,旨在支持功能性和物理性推理(如安全存放刀具),并设计了ConverSeg-Net模型与AI数据引擎以生成高质量提示-掩码对,显著提升现有语言引导分割模型在该新任务上的性能。

Details Motivation: 现有指代图像定位工作局限于类别和空间查询,忽视了功能性与物理性推理需求,无法满足真实场景中基于意图的图像分割任务。 Method: 提出ConverSeg基准,涵盖实体、空间关系、意图、功能、安全性与物理推理;构建ConverSeg-Net模型,融合强分割先验与语言理解能力;开发AI驱动的数据引擎,无需人工监督即可生成提示-掩码对。 Result: 实验证明当前语言引导分割模型在CIS任务上表现不足,而ConverSeg-Net在ConverSeg基准上取得显著提升,同时在现有语言引导分割基准上保持强性能。 Conclusion: 对话式图像分割是语言引导视觉理解的重要扩展方向,ConverSeg基准与ConverSeg-Net为推动该领域发展提供了坚实基础与有效工具。 Abstract: Conversational image segmentation grounds abstract, intent-driven concepts into pixel-accurate masks. Prior work on referring image grounding focuses on categorical and spatial queries (e.g., "left-most apple") and overlooks functional and physical reasoning (e.g., "where can I safely store the knife?"). We address this gap and introduce Conversational Image Segmentation (CIS) and ConverSeg, a benchmark spanning entities, spatial relations, intent, affordances, functions, safety, and physical reasoning. We also present ConverSeg-Net, which fuses strong segmentation priors with language understanding, and an AI-powered data engine that generates prompt-mask pairs without human supervision. We show that current language-guided segmentation models are inadequate for CIS, while ConverSeg-Net trained on our data engine achieves significant gains on ConverSeg and maintains strong performance on existing language-guided segmentation benchmarks. Project webpage: https://glab-caltech.github.io/converseg/