cs.CL
[1] Multi-Personality Generation of LLMs at Decoding-time
Rongxin Chen,Yunfan Li,Yige Yuan,Bingbing Xu,Huawei Shen
Main category: cs.CL
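TL;DR: This paper proposes Multi-Personality Generation (MPG), a decoding-time framework that achieves flexible multi-personality control by exploiting the implicit density ratios in single-dimensional models, without relying on multi-dimensional models or extra training. It also designs Speculative Chunk-level based Rejection sampling (SCR) to reduce computational overhead while preserving generation quality. Experiments on MBTI personality and role-playing tasks show improvements of up to 16%-18%.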
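The summary above describes sampling from a target distribution built by aggregating density ratios. The toy sketch below illustrates that idea with plain rejection sampling over a small discrete vocabulary. It is not the paper's SCR algorithm: the function names, the product-of-ratios aggregation, and the use of whole items instead of chunks are all assumptions made for illustration.

```python
import random


def aggregate_ratios(base, experts):
    """Multiply the per-attribute ratios expert[x] / base[x] for every item x.

    Assumption: ratios are combined by a plain product; the paper may weight
    or aggregate them differently. Every base[x] must be positive.
    """
    ratios = {}
    for x, p_base in base.items():
        r = 1.0
        for expert in experts:
            r *= expert[x] / p_base
        ratios[x] = r
    return ratios


def rejection_sample(base, ratios, rng, max_tries=10_000):
    """Draw one item from the distribution proportional to base[x] * ratios[x]."""
    bound = max(ratios.values())  # upper bound M with ratios[x] <= M for all x
    items = list(base)
    weights = [base[x] for x in items]
    for _ in range(max_tries):
        x = rng.choices(items, weights=weights, k=1)[0]  # propose from base
        if rng.random() * bound < ratios[x]:  # accept with probability ratios[x] / M
            return x
    raise RuntimeError("no proposal was accepted")


if __name__ == "__main__":
    base = {"a": 0.5, "b": 0.5}
    experts = [{"a": 0.8, "b": 0.2}, {"a": 0.6, "b": 0.4}]
    rng = random.Random(0)
    ratios = aggregate_ratios(base, experts)
    print(rejection_sample(base, ratios, rng))
```

Proposals come only from the base distribution, so the single-attribute models are needed only to score candidates, which is what makes a decoding-time combination possible without retraining.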
Details
Motivation: Existing retraining-based methods are costly and scale poorly, while decoding-time methods often depend on external models or heuristic rules, which limits their flexibility and robustness. A more efficient and flexible multi-personality generation method that needs no extra training is therefore required.
Method: The Multi-Personality Generation (MPG) framework treats the implicit density ratios in single-dimensional models as a "free lunch" and reformulates multi-personality generation as sampling from a target strategy that aggregates these ratios. The Speculative Chunk-level based Rejection sampling (SCR) algorithm generates responses in chunks and validates them in parallel, reducing computational overhead.
Result: On MBTI personality and role-playing tasks, MPG improves over baseline methods by up to 16%-18%, requires no extra training or multi-dimensional annotated data, and is more computationally efficient.
Conclusion: MPG offers an efficient, flexible, and scalable approach to multi-personality generation. By combining single-dimensional models at decoding time it achieves high-quality multi-attribute control and has practical application potential.
Abstract: Multi-personality generation for LLMs, enabling simultaneous embodiment of multiple personalization attributes, is a fundamental challenge. Existing retraining-based approaches are costly and poorly scalable, while decoding-time methods often rely on external models or heuristics, limiting flexibility and robustness. In this paper, we propose a novel Multi-Personality Generation (MPG) framework under the decoding-time combination paradigm. It flexibly controls multi-personality without relying on scarce multi-dimensional models or extra training, leveraging implicit density ratios in single-dimensional models as a "free lunch" to reformulate the task as sampling from a target strategy aggregating these ratios. To implement MPG efficiently, we design Speculative Chunk-level based Rejection sampling (SCR), which generates responses in chunks and validates them in parallel via estimated thresholds within a sliding window. This significantly reduces computational overhead while maintaining high-quality generation. Experiments on MBTI personality and Role-Playing demonstrate the effectiveness of MPG, showing improvements up to 16%-18%. Code and data are available at https://github.com/Libra117/MPG.
[2] Rethinking LLM Human Simulation: When a Graph is What You Need
Joseph Suh,Suhong Moon,Serina Chang
Main category: cs.CL
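TL;DR: Proposes GEMS, a lightweight graph neural network model for simulating human behavior in discrete-choice settings. It matches or exceeds the accuracy of large language models while being far more efficient and interpretable.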
Details
Motivation: To examine whether large language models are strictly necessary for human behavior simulation, or whether smaller, domain-grounded models such as graph neural networks suffice.
Method: Casts discrete choice simulation as a link prediction problem on graphs and proposes Graph-basEd Models for human Simulation (GEMS), which leverages relational knowledge and incorporates language representations only when needed.
Result: Across three key settings on three simulation datasets, GEMS matches or exceeds LLM accuracy while being three orders of magnitude smaller, with greater efficiency, interpretability, and transparency.
Conclusion: Graph neural networks can serve as an efficient, lightweight, and interpretable alternative to large language models for human behavior simulation.
Abstract: Large language models (LLMs) are increasingly used to simulate humans, with applications ranging from survey prediction to decision-making. However, are LLMs strictly necessary, or can smaller, domain-grounded models suffice? We identify a large class of simulation problems in which individuals make choices among discrete options, where a graph neural network (GNN) can match or surpass strong LLM baselines despite being three orders of magnitude smaller. We introduce Graph-basEd Models for human Simulation (GEMS), which casts discrete choice simulation tasks as a link prediction problem on graphs, leveraging relational knowledge while incorporating language representations only when needed. Evaluations across three key settings on three simulation datasets show that GEMS achieves comparable or better accuracy than LLMs, with far greater efficiency, interpretability, and transparency, highlighting the promise of graph-based modeling as a lightweight alternative to LLMs for human simulation. Our code is available at https://github.com/schang-lab/gems.
[3] IG-Pruning: Input-Guided Block Pruning for Large Language Models
Kangyu Qiao,Shaolei Zhang,Yang Feng
Main category: cs.CL
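TL;DR: Proposes IG-Pruning, an input-aware block-wise pruning method that dynamically selects Transformer layer masks using semantic clustering and L0 optimization, and consistently outperforms static depth pruning methods.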
Details
Motivation: Existing depth pruning methods rely on fixed block masks, which leads to suboptimal performance across tasks and inputs and falls short of the efficient inference that practical deployment requires.
Method: IG-Pruning has two stages: it first discovers a diverse set of mask candidates through semantic clustering and L0 optimization, then performs efficient dynamic pruning at inference time without extensive training.
Result: Experiments show that IG-Pruning consistently outperforms state-of-the-art static depth pruning methods across multiple tasks, making it especially suitable for resource-constrained deployment.
Conclusion: Through input-aware dynamic mask selection, IG-Pruning enables more flexible and efficient pruning for LLM inference and improves the adaptability and performance of pruned models.
Abstract: With the growing computational demands of large language models (LLMs), efficient inference has become increasingly critical for practical deployment. Depth pruning has emerged as a promising approach for reducing the computational costs of large language models by removing transformer layers. However, existing methods typically rely on fixed block masks, which can lead to suboptimal performance across different tasks and inputs. In this paper, we propose IG-Pruning, a novel input-aware block-wise pruning method that dynamically selects layer masks at inference time. Our approach consists of two stages: (1) Discovering diverse mask candidates through semantic clustering and L0 optimization, and (2) Implementing efficient dynamic pruning without the need for extensive training. Experimental results demonstrate that our method consistently outperforms state-of-the-art static depth pruning methods, making it particularly suitable for resource-constrained deployment scenarios.
[4] Demo: Statistically Significant Results On Biases and Errors of LLMs Do Not Guarantee Generalizable Results
Jonathan Liu,Haoling Qiu,Jonathan Lasko,Damianos Karakos,Mahsa Yarmohammadi,Mark Dredze
Main category: cs.CL
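TL;DR: This paper presents an infrastructure that automatically generates and evaluates LLM responses in the medical domain, designed to probe hallucinations, omissions, and biases when demographic factors are involved. The study finds low agreement among LLM evaluators, recommends using multiple LLMs as evaluators to improve generalizability, and advocates reporting inter-LLM agreement metrics for transparency.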
Details
Motivation: Chatbots in medical settings must give consistent advice even when non-medical factors such as demographic information are present, yet hallucinations, omissions, and biases are prevalent in current LLMs. A systematic method is needed to identify the conditions under which they fail.
Method: Builds a framework with automatic query generation and multiple LLM-as-a-judge evaluation setups: 1) realistic questions are generated by sampling patient demographics, histories, disorders, and writing styles; 2) hallucinations and omissions are detected using LLM-as-a-judge and agentic workflows, and treatment categories are detected with LLM-as-a-judge detectors.
Result: LLM annotators show low agreement (average Cohen's Kappa κ=0.118), and only specific (answering, evaluation) LLM pairs yield statistically significant differences across genders, races, and writing styles.
Conclusion: Studies that lack ground-truth labels should evaluate with multiple LLMs to avoid non-generalizable conclusions, and should publish inter-LLM agreement metrics for transparency.
Abstract: Recent research has shown that hallucinations, omissions, and biases are prevalent in everyday use-cases of LLMs. However, chatbots used in medical contexts must provide consistent advice in situations where non-medical factors are involved, such as when demographic information is present. In order to understand the conditions under which medical chatbots fail to perform as expected, we develop an infrastructure that 1) automatically generates queries to probe LLMs and 2) evaluates answers to these queries using multiple LLM-as-a-judge setups and prompts. For 1), our prompt creation pipeline samples the space of patient demographics, histories, disorders, and writing styles to create realistic questions that we subsequently use to prompt LLMs. In 2), our evaluation pipeline provides hallucination and omission detection using LLM-as-a-judge as well as agentic workflows, in addition to LLM-as-a-judge treatment category detectors. As a baseline study, we perform two case studies on inter-LLM agreement and the impact of varying the answering and evaluation LLMs. We find that LLM annotators exhibit low agreement scores (average Cohen's Kappa $\kappa=0.118$), and only specific (answering, evaluation) LLM pairs yield statistically significant differences across writing styles, genders, and races. We recommend that studies using LLM evaluation use multiple LLMs as evaluators in order to avoid arriving at statistically significant but non-generalizable results, particularly in the absence of ground-truth data. We also suggest publishing inter-LLM agreement metrics for transparency.
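The agreement figure quoted above is Cohen's Kappa, defined as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. A minimal sketch of how such a score could be computed for two judges, and averaged over all judge pairs, follows. The function names and the unweighted pairwise mean are assumptions; the abstract does not say how the average was formed.

```python
from collections import Counter
from itertools import combinations


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both raters pick the same label independently.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # convention for the degenerate case of a single shared label
    return (observed - expected) / (1.0 - expected)


def mean_pairwise_kappa(annotations):
    """Average Kappa over all judge pairs (assumed: plain unweighted mean)."""
    pairs = list(combinations(annotations, 2))
    if not pairs:
        raise ValueError("need at least two judges")
    total = sum(cohens_kappa(annotations[a], annotations[b]) for a, b in pairs)
    return total / len(pairs)


if __name__ == "__main__":
    judges = {
        "judge_1": [1, 1, 0, 0],
        "judge_2": [1, 0, 0, 0],
        "judge_3": [1, 1, 0, 0],
    }
    print(mean_pairwise_kappa(judges))
```

A value near 0, such as the reported 0.118, means the judges agree little more often than chance would predict, even if their raw agreement rate looks high.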
Our code and dataset are available here: https://github.com/BBN-E/medic-neurips-2025-demo.
[5] LTD-Bench: Evaluating Large Language Models by Letting Them Draw
Liuhao Lin,Ke Li,Zihan Xu,Yuchen Shi,Yulei Qin,Yan Zhang,Xing Sun,Rongrong Ji
Main category: cs.CL
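TL;DR: LTD-Bench is an evaluation benchmark that makes the spatial reasoning ability of large language models directly visible through visual outputs (dot matrices or executable code), revealing fundamental deficiencies in how current models map between language and spatial concepts in both directions.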
Details
Motivation: Existing LLM evaluation relies on opaque numerical metrics that fail to reflect a model's real spatial reasoning ability, which matters most for applications requiring physical-world understanding. A more intuitive, observable form of evaluation is needed.
Method: LTD-Bench comprises generation tasks (testing spatial imagination) and recognition tasks (assessing spatial perception). Models must produce drawings through dot matrices or code across three progressively harder difficulty levels, so that both the language-to-space and space-to-language mappings are evaluated.
Result: Even advanced models that perform well on traditional benchmarks show severe spatial reasoning deficiencies on LTD-Bench, particularly in converting between language and spatial concepts in both directions. The visual outputs also support diagnostic analysis of model behavior.
Conclusion: LTD-Bench exposes fundamental limitations of current LLMs as genuine world models and underlines the value of intuitive, visual evaluation, offering a new route for model improvement and capability analysis.
Abstract: Current evaluation paradigms for large language models (LLMs) represent a critical blind spot in AI research, relying on opaque numerical metrics that conceal fundamental limitations in spatial reasoning while providing no intuitive understanding of model capabilities. This deficiency creates a dangerous disconnect between reported performance and practical abilities, particularly for applications requiring physical world understanding. We introduce LTD-Bench, a breakthrough benchmark that transforms LLM evaluation from abstract scores to directly observable visual outputs by requiring models to generate drawings through dot matrices or executable code. This approach makes spatial reasoning limitations immediately apparent even to non-experts, bridging the fundamental gap between statistical performance and intuitive assessment. LTD-Bench implements a comprehensive methodology with complementary generation tasks (testing spatial imagination) and recognition tasks (assessing spatial perception) across three progressively challenging difficulty levels, methodically evaluating both directions of the critical language-spatial mapping. Our extensive experiments with state-of-the-art models expose an alarming capability gap: even LLMs achieving impressive results on traditional benchmarks demonstrate profound deficiencies in establishing bidirectional mappings between language and spatial concepts, a fundamental limitation that undermines their potential as genuine world models.
Furthermore, LTD-Bench's visual outputs enable powerful diagnostic analysis, offering a potential approach to investigate model similarity.
[6] Let Multimodal Embedders Learn When to Augment Query via Adaptive Query Augmentation
Wongyu Kim,Hochang Lee,Sanghak Lee,Yoonsung Kim,Jaehyun Park
Main category: cs.CL
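TL;DR: Proposes M-Solomon, a universal multimodal embedder that adaptively decides whether to augment a query. By separating queries that need augmentation from those that do not at training time and introducing a synthesis process, it improves effectiveness and reduces latency.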
Details
Motivation: Existing LLM-based query augmentation methods augment every query, which causes high embedding latency and can hurt performance on some queries. They have also not been explored in multimodal environments.
Method: Training queries are split into those that need augmentation and those that do not. A Multimodal LLM (MLLM) generates suitable augmentations, and an adaptive mechanism is learned: for queries that need augmentation the model generates an augmentation with the prefix /augment, otherwise it generates /embed, so augmentation happens only on demand.
Result: Experiments show that M-Solomon clearly outperforms both the no-augmentation baseline and the always-augment baseline, while substantially reducing embedding latency.
Conclusion: Through adaptive query augmentation, M-Solomon handles queries more efficiently and flexibly in multimodal environments, balancing performance and efficiency.
Abstract: Query augmentation makes queries more meaningful by appending further information to the queries to find relevant documents. Current studies have proposed Large Language Model (LLM)-based embedders, which learn representation for embedding and generation for query augmentation in a multi-task manner by leveraging the generative capabilities of LLM. During inference, these jointly trained embedders conduct query augmentation followed by embedding, showing effective results. However, augmenting every query leads to substantial embedding latency, and query augmentation can be detrimental to performance for some queries. Also, previous methods have not been explored in multimodal environments. To tackle these problems, we propose M-Solomon, a universal multimodal embedder that can adaptively determine when to augment queries. Our approach first divides the queries of the training datasets into two groups at the dataset level. One includes queries that require augmentation and the other includes queries that do not. Then, we introduce a synthesis process that generates appropriate augmentations for queries that require them by leveraging a powerful Multimodal LLM (MLLM). Next, we present adaptive query augmentation. Through this step, M-Solomon can conduct query augmentation only when necessary by learning to generate synthetic augmentations with the prefix /augment for queries that demand them and to generate the simple string /embed for others.
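The /augment and /embed convention described above amounts to a routing decision made from the model's own generated text. The sketch below shows one way downstream code might act on that output, for text queries only. The function name, the whitespace handling, and the fallback to the raw query on unexpected output are assumptions, not details taken from the paper.

```python
AUGMENT_PREFIX = "/augment"
EMBED_TOKEN = "/embed"


def build_embedding_input(query, generated):
    """Return the text that should be embedded, given the model's generated string.

    Hypothetical helper: if the generation starts with "/augment", the rest is
    appended to the query; "/embed" or anything unexpected leaves the query as-is.
    """
    text = generated.strip()
    if text.startswith(AUGMENT_PREFIX):
        augmentation = text[len(AUGMENT_PREFIX):].strip()
        return f"{query} {augmentation}" if augmentation else query
    return query


if __name__ == "__main__":
    print(build_embedding_input("red shoes", "/augment lightweight running sneakers"))
    print(build_embedding_input("red shoes", EMBED_TOKEN))
```

Because the /embed branch requires generating only a short string, queries that do not benefit from augmentation skip most of the generation cost, which is consistent with the latency reduction the abstract reports.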
Experimental results showed that M-Solomon not only surpassed the baseline without augmentation by a large margin but also outperformed the baseline that always used augmentation, providing much faster embedding latency.
[7] LiveSecBench: A Dynamic and Culturally-Relevant AI Safety Benchmark for LLMs in Chinese Context
Yudong Li,Zhongliang Yang,Kejiang Chen,Wenxuan Wang,Tianxin Zhang,Sifang Wan,Kecheng Wang,Haitian Li,Xu Wang,Lefan Cheng,Youdan Yang,Baocheng Chen,Ziyu Liu,Yufei Sun,Liyan Wu,Wenya Wen,Xingchi Gu,Peiru Yang
Main category: cs.CL
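TL;DR: This paper proposes LiveSecBench, a dynamic safety benchmark for Chinese-language LLM application scenarios. It covers six dimensions (legality, ethics, factuality, privacy, adversarial robustness, and reasoning safety) and will be updated continuously to address emerging threats. The current version has evaluated 18 models, with results published on the official website.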
Details
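Motivation: Existing safety benchmarks are mostly built for English contexts. There is no dynamic safety evaluation system tailored to the Chinese language and to Chinese social and legal frameworks, which makes it hard to keep up with evolving safety threats.
Method: Builds a six-dimension safety evaluation framework grounded in Chinese legal and social contexts (legality, ethics, factuality, privacy, adversarial robustness, and reasoning safety), with a regular update mechanism that incorporates new threats such as text-to-image generation safety and agentic safety.
Result: LiveSecBench (v251030) has completed safety evaluations of 18 LLMs and published a publicly accessible leaderboard, giving an initial picture of AI safety in the Chinese-language context.
Conclusion: LiveSecBench provides a dynamic, systematic, and extensible safety evaluation platform for Chinese-language LLMs, supporting safer AI development and standardization in Chinese scenarios.
Abstract: In this work, we propose LiveSecBench, a dynamic and continuously updated safety benchmark specifically for Chinese-language LLM application scenarios. LiveSecBench evaluates models across six critical dimensions (Legality, Ethics, Factuality, Privacy, Adversarial Robustness, and Reasoning Safety) rooted in the Chinese legal and social frameworks. This benchmark maintains relevance through a dynamic update schedule that incorporates new threat vectors, such as the planned inclusion of Text-to-Image Generation Safety and Agentic Safety in the next update. For now, LiveSecBench (v251030) has evaluated 18 LLMs, providing a landscape of AI safety in the context of Chinese language. The leaderboard is publicly accessible at https://livesecbench.intokentech.cn/.
[8] AyurParam: A State-of-the-Art Bilingual Language Model for Ayurveda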
Mohd Nauman,Sravan Gvm,Vijay Devane,Shyam Pawar,Viraj Thakur,Kundeshwar Pundalik,Piyush Sawarkar,Rohit Saluja,Maunendra Desarkar,Ganesh Ramakrishnan
Main category: cs.CL
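TL;DR: AyurParam-2.9B is a bilingual, domain-specialized language model fine-tuned for Ayurveda. It surpasses open-source models of similar size on specialized tasks, showing the importance of domain adaptation and high-quality supervision for specialized AI.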
Details
Motivation: Mainstream large language models perform poorly in domains that demand deep cultural, linguistic, and subject-matter expertise, such as the traditional medical system of Ayurveda, where they fail to interpret and apply knowledge accurately.
Method: Fine-tunes Param-1-2.9B on a high-quality annotated dataset that spans classical texts and clinical guidance and includes Q&A in both English and Hindi, with an emphasis on factual precision and instructional clarity.
Result: On the BhashaBench-Ayur benchmark, AyurParam-2.9B surpasses all open-source instruction-tuned models in its size class (1.5-3B parameters) and is competitive with or better than much larger models.
Conclusion: For specialized medical knowledge, authentic domain adaptation and high-quality supervision are essential to building reliable, culturally congruent AI systems.
Abstract: Current large language models excel at broad, general-purpose tasks, but consistently underperform when exposed to highly specialized domains that require deep cultural, linguistic, and subject-matter expertise. In particular, traditional medical systems such as Ayurveda embody centuries of nuanced textual and clinical knowledge that mainstream LLMs fail to accurately interpret or apply. We introduce AyurParam-2.9B, a domain-specialized, bilingual language model fine-tuned from Param-1-2.9B using an extensive, expertly curated Ayurveda dataset spanning classical texts and clinical guidance. AyurParam's dataset incorporates context-aware, reasoning, and objective-style Q&A in both English and Hindi, with rigorous annotation protocols for factual precision and instructional clarity. Benchmarked on BhashaBench-Ayur, AyurParam not only surpasses all open-source instruction-tuned models in its size class (1.5-3B parameters), but also demonstrates competitive or superior performance compared to much larger models. The results from AyurParam highlight the necessity for authentic domain adaptation and high-quality supervision in delivering reliable, culturally congruent AI for specialized medical knowledge.
[9] AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models
Aashray Reddy,Andrew Zagula,Nicholas Saban
Main category: cs.CL
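TL;DR: This paper presents AutoAdv, a training-free framework for automated multi-turn jailbreaking. Using pattern management, temperature adjustment, and a two-phase rewriting strategy, it reaches an attack success rate of up to 95% on Llama-3.1-8B within six turns, well above single-turn attacks.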
Details
Motivation: Safety evaluations of large language models focus mostly on single-turn interactions, whereas real attacks tend to be adaptive and multi-turn. Evaluation methods that better reflect this setting are needed.
Method: The AutoAdv framework combines three adaptive mechanisms: a pattern manager that learns from successful attacks to improve later prompts, a temperature manager that dynamically adjusts sampling parameters based on failure modes, and a two-phase rewriting strategy that disguises harmful requests and then iteratively refines them.
Result: AutoAdv reaches a 95% attack success rate on Llama-3.1-8B within six turns, 24 percent higher than single-turn baselines. Evaluations on GPT-4o-mini, Qwen3-235B, Mistral-7B, and other models confirm the advantage of multi-turn attacks and reveal persistent vulnerabilities in current safety mechanisms.
Conclusion: Alignment strategies optimized for single-turn interactions are not robust in multi-turn conversations, and defenses that can handle adaptive multi-turn attacks are urgently needed.
Abstract: Large Language Models (LLMs) remain vulnerable to jailbreaking attacks where adversarial prompts elicit harmful outputs, yet most evaluations focus on single-turn interactions while real-world attacks unfold through adaptive multi-turn conversations. We present AutoAdv, a training-free framework for automated multi-turn jailbreaking that achieves up to 95% attack success rate on Llama-3.1-8B within six turns, a 24 percent improvement over single-turn baselines. AutoAdv uniquely combines three adaptive mechanisms: a pattern manager that learns from successful attacks to enhance future prompts, a temperature manager that dynamically adjusts sampling parameters based on failure modes, and a two-phase rewriting strategy that disguises harmful requests then iteratively refines them. Extensive evaluation across commercial and open-source models (GPT-4o-mini, Qwen3-235B, Mistral-7B) reveals persistent vulnerabilities in current safety mechanisms, with multi-turn attacks consistently outperforming single-turn approaches. These findings demonstrate that alignment strategies optimized for single-turn interactions fail to maintain robustness across extended conversations, highlighting an urgent need for multi-turn-aware defenses.
[10] Merging Continual Pretraining Models for Domain-Specialized LLMs: A Case Study in Finance
Kentaro Ueda,François Portet,Hirohiko Suwa,Keiichi Yasumoto
Main category: cs.CL
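TL;DR: This paper studies how to build multi-skill large language models for finance by merging Continual Pre-training (CPT) expert models. It proposes a three-stage evaluation framework and compares three merging methods, finding that Task Arithmetic and TIES perform well, and offers principled guidance for building specialized LLMs from existing model assets.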
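Of the merging methods named above, Task Arithmetic is the simplest to state: each expert contributes a "task vector" (its weights minus the base weights), and the merged model is the base plus a scaled sum of those vectors. The sketch below applies that rule to toy parameter dictionaries. The data layout and the single shared `scale` value are assumptions made for illustration and do not reflect the paper's actual configuration.

```python
def task_arithmetic(base, experts, scale=1.0):
    """Merge experts into a base model: merged = base + scale * sum_i(expert_i - base).

    base    -- dict mapping parameter name -> list of floats
    experts -- list of dicts with the same names and shapes as `base`
    scale   -- merging coefficient (a hyperparameter; one shared value assumed here)
    """
    merged = {}
    for name, base_values in base.items():
        merged[name] = [
            b + scale * sum(expert[name][i] - b for expert in experts)
            for i, b in enumerate(base_values)
        ]
    return merged


if __name__ == "__main__":
    base = {"w": [1.0, 2.0]}
    finance_expert = {"w": [2.0, 2.0]}
    math_expert = {"w": [1.0, 4.0]}
    print(task_arithmetic(base, [finance_expert, math_expert], scale=0.5))
```

With a single expert and a scale below 1, the same function interpolates between the expert and its base model. The result depends directly on `scale`, which is one way to see why the method can be sensitive to that hyperparameter.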
Details
Motivation: Large language models excel at general tasks but are limited in specialized domains such as finance, which require domain knowledge, mathematical reasoning, and multilingual processing. Joint multi-skill training is costly and unstable, so more effective model merging methods are worth exploring.
Method: Proposes a three-stage evaluation framework (knowledge recovery, complementarity, and emergence) and evaluates three merging methods for CPT models, Task Arithmetic, TIES, and DARE-TIES, on a comprehensive financial benchmark covering 18 tasks, using expert models for finance, math, and Japanese.
Result: Merging an expert with its base model recovers general knowledge lost during CPT. Merging multiple experts improves performance and can yield emergent cross-domain skills. Task Arithmetic performs strongly but is sensitive to hyperparameters, whereas TIES is more robust. Model similarity correlates with merging success, but emergent skills depend on more complex factors.
Conclusion: This is the first systematic study of CPT model merging. It establishes an evaluation framework, validates the merging strategies, and offers a practical path and guidance for building multi-skill financial LLMs from existing model assets.
Abstract: While LLMs excel at general tasks, they struggle in specialized domains like finance, requiring diverse skills in domain knowledge, mathematical reasoning, and multilingual processing. Merging domain-specific Continual Pre-training (CPT) "experts" offers a practical alternative to costly and unstable multi-skill training. However, unlike established Supervised Fine-Tuning (SFT) model-based merging, CPT model merging remains largely unexplored. We address this gap by creating financial LLMs from experts in finance, math, and Japanese. We propose a three-stage evaluation focusing on knowledge recovery, complementarity, and emergence, and assess three merging methods (Task Arithmetic, TIES, and DARE-TIES) on a comprehensive financial benchmark curated from 18 tasks across 8 established datasets. Results show that merging an expert with its base model recovers general knowledge lost during CPT, while merging experts improves performance and can yield emergent cross-domain skills. Among the methods, Task Arithmetic performs strongly but is hyperparameter-sensitive, whereas TIES is more robust. Our findings also suggest that while model similarity correlates with merging success, emergent skills depend on more complex factors. This work presents the first foundational analysis of CPT model merging, establishing a principled framework and providing clear guidance for building multi-skill LLMs from existing assets.
[11] Prompting for Policy: Forecasting Macroeconomic Scenarios with Synthetic LLM Personas
Giulia Iadisernia,Carolina Camassa
Main category: cs.CL
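TL;DR: This study evaluates whether persona-based prompting improves large language model (LLM) performance on macroeconomic forecasting. GPT-4o achieves accuracy close to that of human experts, but persona prompts bring no significant advantage.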
Details
Motivation: To determine whether persona prompts improve LLM accuracy in macroeconomic forecasting, and to compare model performance with that of human experts.
Method: Using 2,368 economics-related personas from the PersonaHub corpus, GPT-4o is prompted to replicate the ECB Survey of Professional Forecasters across 50 quarterly rounds (2013-2025). The forecasts are compared with the human expert panel and with 100 baseline forecasts without persona descriptions, across four target variables and four forecast horizons.
Result: GPT-4o and human forecasters reach similar accuracy, with differences that are statistically significant but practically small. GPT-4o remains competitive on out-of-sample 2024-2025 data. The ablation experiment shows no significant gain in forecasting performance from persona descriptions.
Conclusion: Given relevant context data, GPT-4o can match human experts in macroeconomic forecasting accuracy. Persona descriptions can be omitted to cut computational cost, and diverse prompts produce highly homogeneous forecasts.
Abstract: We evaluate whether persona-based prompting improves Large Language Model (LLM) performance on macroeconomic forecasting tasks. Using 2,368 economics-related personas from the PersonaHub corpus, we prompt GPT-4o to replicate the ECB Survey of Professional Forecasters across 50 quarterly rounds (2013-2025). We compare the persona-prompted forecasts against the human experts panel, across four target variables (HICP, core HICP, GDP growth, unemployment) and four forecast horizons. We also compare the results against 100 baseline forecasts without persona descriptions to isolate its effect. We report two main findings. Firstly, GPT-4o and human forecasters achieve remarkably similar accuracy levels, with differences that are statistically significant yet practically modest. Our out-of-sample evaluation on 2024-2025 data demonstrates that GPT-4o can maintain competitive forecasting performance on unseen events, though with notable differences compared to the in-sample period. Secondly, our ablation experiment reveals no measurable forecasting advantage from persona descriptions, suggesting these prompt components can be omitted to reduce computational costs without sacrificing accuracy. Our results provide evidence that GPT-4o can achieve competitive forecasting accuracy even on out-of-sample macroeconomic events, if provided with relevant context data, while revealing that diverse prompts produce remarkably homogeneous forecasts compared to human panels.
[12] Smart-Hiring: An Explainable end-to-end Pipeline for CV Information Extraction and Job Matching
Kenza Khelkhal,Dihia Lanasri
Main category: cs.CL
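TL;DR: This paper presents Smart-Hiring, an end-to-end natural language processing (NLP) pipeline that automatically extracts structured information from resumes and matches candidates to job descriptions through semantic matching. It is modular, explainable, and accurate.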
Details
Motivation: Manually screening large numbers of resumes is time-consuming, error-prone, and subject to human bias, so an automated, efficient, and fair solution is needed.
Method: Combines document parsing, named-entity recognition, and contextual text embeddings to encode resumes and job descriptions in a shared vector space and compute similarity scores between candidates and jobs. The system is modular and lets users inspect the extracted entities and the rationale behind each match.
Result: Experiments on a real-world dataset spanning multiple professional domains demonstrate the robustness and feasibility of the approach. The system achieves competitive matching accuracy while remaining highly interpretable.
Conclusion: Smart-Hiring provides a scalable, practical NLP framework for recruitment analytics and shows potential for bias mitigation, fairness-aware modeling, and large-scale deployment of data-driven hiring solutions.
Abstract: Hiring processes often involve the manual screening of hundreds of resumes for each job, a task that is time-consuming, labor-intensive, error-prone, and subject to human bias. This paper presents Smart-Hiring, an end-to-end Natural Language Processing (NLP) pipeline designed to automatically extract structured information from unstructured resumes and to semantically match candidates with job descriptions. The proposed system combines document parsing, named-entity recognition, and contextual text embedding techniques to capture skills, experience, and qualifications. Using advanced NLP techniques, Smart-Hiring encodes both resumes and job descriptions in a shared vector space to compute similarity scores between candidates and job postings. The pipeline is modular and explainable, allowing users to inspect extracted entities and matching rationales. Experiments were conducted on a real-world dataset of resumes and job descriptions spanning multiple professional domains, demonstrating the robustness and feasibility of the proposed approach. The system achieves competitive matching accuracy while preserving a high degree of interpretability and transparency in its decision process. This work introduces a scalable and practical NLP framework for recruitment analytics and outlines promising directions for bias mitigation, fairness-aware modeling, and large-scale deployment of data-driven hiring solutions.
[13] The Analysis of Lexical Errors in Machine Translation from English into Romanian
Angela Stamatie
Main category: cs.CL
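TL;DR: This study analyzes the lexical errors Google Translate makes when translating English Covid-19 information into Romanian, with the aim of improving machine translation quality through better lexical selection.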
Details
Motivation: To improve the accuracy and fluency of Google Translate on critical texts such as official medical information, and to reduce translation errors.
Method: A systematic error analysis of 230 texts translated from English into Romanian, focusing on lexical-level errors.
Result: Several types of lexical error are identified, revealing the limitations of current machine translation on specialized and official texts.
Conclusion: Targeted improvements to lexical selection can raise machine translation quality in specialized domains.
Abstract: This research performs an error analysis of Machine Translation from English into Romanian, focusing on lexical errors in texts containing official information provided by the World Health Organization (WHO), the Gavi Organization, and patient information leaflets (information about the active ingredients of vaccines or medication, indications, dosage instructions, storage instructions, side effects and warnings, etc.). All of these texts are related to Covid-19 and have been translated by Google Translate, a multilingual Machine Translation system created by Google. In recent decades, Google has actively worked to develop a more accurate and fluent automatic translation system. This research, focused specifically on improving Google Translate, aims to enhance the overall quality of Machine Translation through better lexical selection and fewer errors. The investigation involves a comprehensive analysis of 230 texts that have been translated from English into Romanian.
[14] Next Token Knowledge Tracing: Exploiting Pretrained LLM Representations to Decode Student Behaviour
Max Norris,Kobi Gal,Sahan Bulathwela
Main category: cs.CL
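TL;DR: Proposes Next Token Knowledge Tracing (NTKT), which recasts knowledge tracing as a next-token prediction task for large language models. By making full use of question text, it significantly improves predictive performance and generalizes better in cold-start settings.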
Details
Motivation: Existing knowledge tracing models usually ignore question text and rely only on response correctness and metadata, which limits how accurately they model student knowledge and how well they predict.
Method: Reframes knowledge tracing as a next-token prediction task and uses pretrained large language models to model both the student's interaction history and the question text.
Result: NTKT significantly outperforms state-of-the-art neural knowledge tracing models across experiments and generalizes better to cold-start questions and users.
Conclusion: Question text is essential for knowledge tracing, and the pretrained representations of large language models allow student learning to be modelled more effectively.
Abstract: Modelling student knowledge is a key challenge when leveraging AI in education, with major implications for personalised learning. The Knowledge Tracing (KT) task aims to predict how students will respond to educational questions in learning environments, based on their prior interactions. Existing KT models typically use response correctness along with metadata like skill tags and timestamps, often overlooking the question text, which is an important source of pedagogical insight. This omission is a lost opportunity and limits predictive performance. We propose Next Token Knowledge Tracing (NTKT), a novel approach that reframes KT as a next-token prediction task using pretrained Large Language Models (LLMs). NTKT represents both student histories and question content as sequences of text, allowing LLMs to learn patterns in both behaviour and language. Our series of experiments significantly improves performance over state-of-the-art neural KT models and generalises much better to cold-start questions and users. These findings highlight the importance of question content in KT and demonstrate the benefits of leveraging pretrained representations of LLMs to model student learning more effectively.
[15] CGES: Confidence-Guided Early Stopping for Efficient and Accurate Self-Consistency
Ehsan Aghazadeh,Ahmad Ghasemi,Hedyeh Beyhaghi,Hossein Pishro-Nik
Main category: cs.CL
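TL;DR: This paper proposes Confidence-Guided Early Stopping (CGES), an adaptive inference method built on a Bayesian framework. It uses scalar confidence signals to decide dynamically when to stop sampling, substantially reducing the number of LLM calls while matching the accuracy of majority voting.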
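The stopping rule summarized above, which halts once one candidate's posterior mass passes a threshold, can be sketched as follows. This is an illustrative toy and not the paper's update rule. It assumes a fixed, known candidate set, a uniform prior, and a likelihood in which a sample with confidence c supports its own answer with probability c and spreads 1 - c evenly over the others. All names and default values are hypothetical.

```python
import math
from itertools import islice


def normalise(log_scores):
    """Turn unnormalised log-scores into a probability distribution."""
    top = max(log_scores.values())
    weights = {h: math.exp(s - top) for h, s in log_scores.items()}
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}


def confidence_guided_stop(samples, candidates, threshold=0.95, max_calls=16):
    """Consume (answer, confidence) pairs until one candidate is confident enough.

    Returns (best_answer, posterior_mass, calls_used). `samples` may be a lazy
    generator, so model calls after the stopping point are never made.
    """
    if len(candidates) < 2:
        raise ValueError("need at least two candidate answers")
    log_scores = {h: 0.0 for h in candidates}  # uniform prior
    others = len(candidates) - 1
    calls = 0
    for answer, conf in islice(samples, max_calls):
        calls += 1
        conf = min(max(conf, 1e-6), 1.0 - 1e-6)  # keep the logarithms finite
        for h in candidates:
            likelihood = conf if h == answer else (1.0 - conf) / others
            log_scores[h] += math.log(likelihood)
        post = normalise(log_scores)
        best = max(post, key=post.get)
        if post[best] >= threshold:
            return best, post[best], calls
    post = normalise(log_scores)
    best = max(post, key=post.get)
    return best, post[best], calls


if __name__ == "__main__":
    stream = iter([("A", 0.9), ("A", 0.9), ("B", 0.6), ("A", 0.9)])
    print(confidence_guided_stop(stream, ["A", "B", "C"]))
```

Under these assumptions two agreeing high-confidence samples are enough to stop, whereas fixed-budget majority voting would always spend the full `max_calls`.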
Details
Motivation: At test time, large language models are often queried multiple times and aggregated by majority vote, but this approach uses a fixed number of calls and works poorly when the correct answer is rare. A more efficient, adaptive inference strategy is needed.
Method: CGES uses scalar confidence signals derived from token probabilities or reward models to form a posterior over candidate answers, and adaptively stops sampling once the posterior mass of a candidate exceeds a threshold, yielding a Bayesian inference framework.
Result: Across five reasoning benchmarks, CGES reduces the average number of model calls by about 69% (for example, from 16.0 to 4.9) while staying within 0.06 percentage points of the accuracy of self-consistency.
Conclusion: CGES is an efficient and theoretically grounded inference strategy that sharply lowers LLM inference cost with almost no loss in accuracy, and suits settings that require repeated sampling.
Abstract: Large language models (LLMs) are often queried multiple times at test time, with predictions aggregated by majority vote. While effective, this self-consistency strategy (arXiv:2203.11171) requires a fixed number of calls and can fail when the correct answer is rare. We introduce Confidence-Guided Early Stopping (CGES), a Bayesian framework that forms posteriors over candidate answers using scalar confidence signals derived from token probabilities or reward models. CGES adaptively halts sampling once the posterior mass of a candidate exceeds a threshold. We provide theoretical guarantees for both perfectly calibrated confidences and realistic noisy confidence signals. Across five reasoning benchmarks, CGES reduces the average number of model calls by about 69 percent (for example, from 16.0 to 4.9) while matching the accuracy of self-consistency within 0.06 percentage points.
[16] The Realignment Problem: When Right becomes Wrong in LLMs
Aakash Sen Sharma,Debdeep Sanyal,Vivek Srivastava,Shirish Karande,Murari Mandal
Main category: cs.CL
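TL;DR: This paper proposes TRACE, a framework that addresses LLM alignment under changing policies in a programmatic way, achieving efficient, low-cost re-alignment without harming model performance.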
Details
Motivation: Current LLM alignment methods are static, brittle, and hard to maintain, and cannot keep up with evolving social norms and policies, producing an "Alignment-Reality Gap". Large-scale re-annotation is prohibitively expensive, and conventional unlearning methods are too blunt and degrade model performance.
Method: The TRACE framework treats re-alignment as a programmatic policy application problem. It triages existing preference data against the new policy, identifies key conflicts with an alignment impact score, and applies a hybrid optimization strategy that selectively inverts, discards, or preserves preferences while protecting model performance.
Result: TRACE is validated on several models (Qwen2.5-7B, Gemma-2-9B, Llama-3.1-8B) and datasets (synthetic benchmarks and PKU-SafeRLHF), enforcing new principles under complex policy shifts without degrading general capabilities.
Conclusion: TRACE offers a scalable, dynamic, and cost-effective paradigm for keeping LLMs aligned over time, supporting sustainable and responsible AI deployment.
Abstract: The alignment of Large Language Models (LLMs) with human values is central to their safe deployment, yet current practice produces static, brittle, and costly-to-maintain models that fail to keep pace with evolving norms and policies. This misalignment, which we term the Alignment-Reality Gap, poses a growing challenge for reliable long-term use. Existing remedies are inadequate: large-scale re-annotation is economically prohibitive, and standard unlearning methods act as blunt instruments that erode utility rather than enable precise policy updates. We introduce TRACE (Triage and Re-align by Alignment Conflict Evaluation), a framework for principled unlearning that reconceives re-alignment as a programmatic policy application problem. TRACE programmatically triages existing preference data against a new policy, identifies high-impact conflicts via an alignment impact score, and applies a hybrid optimization that cleanly inverts, discards, or preserves preferences while safeguarding model performance. Empirical results show that TRACE achieves robust re-alignment across diverse model families (Qwen2.5-7B, Gemma-2-9B, Llama-3.1-8B). On both synthetic benchmarks and the PKU-SafeRLHF dataset under complex policy shift, TRACE enforces new principles without degrading general capabilities. Our work establishes a scalable, dynamic, and cost-effective paradigm for maintaining LLM alignment, providing a foundation for sustainable and responsible AI deployment.
[17] Understanding New-Knowledge-Induced Factual Hallucinations in LLMs: Analysis, Solution, and Interpretation
Renfei Dang,Peng Hu,Changjiang Gao,Shujian Huang
Main category: cs.CL
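TL;DR: Using a purpose-built dataset, Biography-Reasoning, this study finds that fine-tuning a large language model on data in which a specific knowledge type is entirely new significantly increases factual hallucinations, and that high unfamiliarity drives hallucination more strongly than the overall proportion of new knowledge. It proposes KnownPatch, which mitigates the problem by introducing a small number of known-knowledge samples late in training. Attention analysis shows that learning new knowledge reduces the model's attention to key entities, which in turn promotes the spread of hallucinations.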
Details
Motivation: Prior work has not examined in depth how factual hallucinations induced by new knowledge during fine-tuning manifest, or what mechanisms underlie them. This paper aims to fill that gap.
Method: Builds a controlled dataset, Biography-Reasoning, and performs a fine-grained analysis across multiple knowledge types and two task types (knowledge question answering and knowledge reasoning). Proposes KnownPatch, which injects a small number of known-knowledge samples in the later stages of training, and uses attention analysis to investigate the cause of hallucinations.
Result: When a specific knowledge type is entirely new, hallucination tendencies rise significantly, and the effect can carry over to other knowledge types in QA tasks. Attention analysis shows that learning new knowledge weakens the model's attention to key entities in the question, leading to over-reliance on surrounding context and helping hallucinations spread. KnownPatch restores the attention pattern and improves performance.
Conclusion: High unfamiliarity of a specific knowledge type is a key driver of factual hallucinations in LLM fine-tuning. KnownPatch mitigates the problem effectively, and the analysis highlights the central role of attention shifts in generating and spreading hallucinations.
Abstract: Previous studies show that introducing new knowledge during fine-tuning of large language models (LLMs) can lead to the generation of erroneous output when tested on known information, thereby triggering factual hallucinations. However, existing studies have not deeply investigated the specific manifestations and underlying mechanisms of these hallucinations. Our work addresses this gap by designing a controlled dataset, Biography-Reasoning, and conducting a fine-grained analysis across multiple knowledge types and two task types, including knowledge question answering (QA) and knowledge reasoning tasks. We find that when fine-tuned on a dataset in which a specific knowledge type consists entirely of new knowledge, LLMs exhibit significantly increased hallucination tendencies. This suggests that the high unfamiliarity of a particular knowledge type, rather than the overall proportion of new knowledge, is a stronger driver of hallucinations, and these tendencies can even affect other knowledge types in QA tasks. To mitigate such factual hallucinations, we propose KnownPatch, which patches a small number of known knowledge samples in the later stages of training, effectively alleviating new-knowledge-induced hallucinations. Through attention analysis, we find that learning new knowledge reduces the model's attention to key entities in the question, thus causing excessive focus on the surrounding context, which may increase the risk of hallucination.
Moreover, the attention pattern can propagate to similar contexts, facilitating the spread of hallucinations to textually similar questions. Our method effectively mitigates the disruption of new knowledge learning to the model's attention on key entities, accompanied by improved performance.
[18] Optimal Singular Damage: Efficient LLM Inference in Low Storage Regimes
Mohammadsajad Alipour,Mohammad Mohammadi Amiri
Main category: cs.CL
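TL;DR: This paper proposes "optimal singular damage", a method for efficiently storing the parameter updates of fine-tuned large language models. By combining low-rank approximation with sparsification while retaining the most important singular components, it achieves better storage efficiency and model accuracy than low-rank or sparse methods alone under the same memory budget.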
Details
Motivation: Storing the parameter updates of fine-tuned LLMs is costly, and existing approaches (pure low-rank approximation or pure sparsification) may discard critical information, so a more efficient storage scheme is needed.
Method: Exploiting the observation that fine-tuning updates are both low-rank and sparse, the paper proposes optimal singular damage, which selectively sparsifies low-rank approximated updates and retains the most impactful components based on the interleaved importance of singular vectors.
Result: Experiments show that, under the same memory budget, the method significantly improves storage efficiency and model accuracy compared with low-rank approximation or sparsification used alone.
Conclusion: A selective compression strategy that combines low-rank approximation with sparsification balances storage cost against model performance and offers a better way to store fine-tuning updates for large models.
Abstract: Large language models (LLMs) are increasingly prevalent across diverse applications. However, their enormous size limits storage and processing capabilities to a few well-resourced stakeholders. As a result, most applications rely on pre-trained LLMs, fine-tuned for specific tasks. However, even storing the fine-tuned versions of these models remains a significant challenge due to the wide range of tasks they address. Recent studies show that fine-tuning these models primarily affects a small fraction of parameters, highlighting the need for more efficient storage of fine-tuned models. This paper focuses on efficient storage of parameter updates in pre-trained models after fine-tuning. To address this challenge, we leverage the observation that fine-tuning updates are both low-rank and sparse, which can be utilized for storage efficiency. However, using only low-rank approximation or sparsification may discard critical singular components that enhance model expressivity. We first observe that given the same memory budget, sparsified low-rank approximations with larger ranks outperform standard low-rank approximations with smaller ranks. Building on this, we propose our method, optimal singular damage, that selectively sparsifies low-rank approximated updates by leveraging the interleaved importance of singular vectors, ensuring that the most impactful components are retained.
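The memory-budget comparison above (a small-rank dense factorization versus a larger-rank sparsified one) can be made concrete with a little arithmetic. The sketch below is only a budget-accounting illustration: it assumes a rank-r factorization of an m x n update stores r(m + n) values, ignores the cost of storing sparse indices, and uses plain magnitude-based top-k pruning, which is not the paper's importance-aware selection rule. All function names are hypothetical.

```python
def low_rank_param_count(m, n, rank):
    """Values stored by a rank-`rank` factorization U (m x r), V (n x r)."""
    return rank * (m + n)


def keep_fraction_for_budget(m, n, small_rank, large_rank):
    """Fraction of factor entries a larger-rank factorization may keep
    while matching the storage of the smaller-rank one.
    Assumption: index overhead of the sparse format is ignored."""
    budget = low_rank_param_count(m, n, small_rank)
    return budget / low_rank_param_count(m, n, large_rank)


def sparsify_top_k(values, keep_fraction):
    """Zero out all but the largest-magnitude entries (simple stand-in
    for the paper's selective sparsification)."""
    k = int(len(values) * keep_fraction)
    if k == 0:
        return [0.0] * len(values)
    threshold = sorted((abs(v) for v in values), reverse=True)[k - 1]
    kept = 0
    result = []
    for v in values:
        if abs(v) >= threshold and kept < k:
            result.append(v)
            kept += 1
        else:
            result.append(0.0)
    return result


fraction = keep_fraction_for_budget(1024, 1024, small_rank=8, large_rank=32)
pruned = sparsify_top_k([0.1, -3.0, 0.5, 2.0], keep_fraction=0.5)
```

Under these assumptions, a rank-32 factorization must drop three quarters of its entries to match the storage of a dense rank-8 one.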
We demonstrate through extensive experiments that our proposed methods lead to significant storage efficiency and superior accuracy within the same memory budget compared to employing the low-rank approximation or sparsification individually.
[19] PragExTra: A Multilingual Corpus of Pragmatic Explicitation in Translation
Doreen Osmelak,Koel Dutta Chowdhury,Uliana Sentsova,Cristina España-Bonet,Josef van Genabith
Main category: cs.CL
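TL;DR: This paper introduces PragExTra, the first multilingual corpus and detection framework for pragmatic explicitation. Explicitation cases are identified and refined by combining active learning with human annotation; experiments show that this substantially improves classifier performance across languages, advancing culturally aware machine translation.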
Details
Motivation: Pragmatic explicitation is widely discussed in translation theory but has rarely been modeled computationally, so a measurable, cross-lingual framework is needed to detect it automatically.
Method: Builds a corpus covering eight language pairs from TED-Multi and Europarl, identifies candidate explicitation instances through null alignments, refines them using active learning with human annotation, and trains classifiers for detection.
Result: Entity and system-level explicitations are the most frequent; active learning improves classifier accuracy by 7-8 percentage points, reaching up to 0.88 accuracy and 0.82 F1.
Conclusion: PragExTra is the first to model pragmatic explicitation as a measurable, cross-linguistic phenomenon, laying groundwork for culturally aware machine translation.
Abstract: Translators often enrich texts with background details that make implicit cultural meanings explicit for new audiences. This phenomenon, known as pragmatic explicitation, has been widely discussed in translation theory but rarely modeled computationally. We introduce PragExTra, the first multilingual corpus and detection framework for pragmatic explicitation. The corpus covers eight language pairs from TED-Multi and Europarl and includes additions such as entity descriptions, measurement conversions, and translator remarks. We identify candidate explicitation cases through null alignments and refine them using active learning with human annotation. Our results show that entity and system-level explicitations are most frequent, and that active learning improves classifier accuracy by 7-8 percentage points, achieving up to 0.88 accuracy and 0.82 F1 across languages. PragExTra establishes pragmatic explicitation as a measurable, cross-linguistic phenomenon and takes a step towards building culturally aware machine translation. Keywords: translation, multilingualism, explicitation
[20] AI Diffusion in Low Resource Language Countries
Amit Misra,Syed Waqas Zamir,Wassim Hamidouche,Inbal Becker-Reshef,Juan Lavista Ferres
Main category: cs.CL
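TL;DR: Scarce language resources leave Low-Resource Language Countries with a share of AI users roughly 20% below their baseline; linguistic accessibility is a significant, independent barrier to equitable AI diffusion.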
Details
Motivation: To investigate whether the poor performance of frontier LLMs on low-resource languages reduces the utility of AI and thereby slows AI adoption in the countries that speak them.
Method: Uses a weighted regression model to isolate the effect of language from socioeconomic and demographic factors.
Result: The share of AI users in Low-Resource Language Countries is approximately 20% lower relative to their baseline.
Conclusion: Linguistic accessibility is a significant, independent barrier to equitable AI diffusion.
Abstract: Artificial intelligence (AI) is diffusing globally at unprecedented speed, but adoption remains uneven. Frontier Large Language Models (LLMs) are known to perform poorly on low-resource languages due to data scarcity. We hypothesize that this performance deficit reduces the utility of AI, thereby slowing adoption in Low-Resource Language Countries (LRLCs). To test this, we use a weighted regression model to isolate the language effect from socioeconomic and demographic factors, finding that LRLCs have a share of AI users that is approximately 20% lower relative to their baseline. These results indicate that linguistic accessibility is a significant, independent barrier to equitable AI diffusion.
[21] Controlling Performance and Budget of a Centralized Multi-agent LLM System with Reinforcement Learning
Bowen Jin,TJ Collins,Donghan Yu,Mert Cemri,Shenao Zhang,Mengyu Li,Jay Tang,Tian Qin,Zhiyang Xu,Jiarui Lu,Guoli Yin,Jiawei Han,Zirui Wang
Main category: cs.CL
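TL;DR: Proposes CoRL, a reinforcement-learning-based centralized multi-LLM framework in which a controller LLM coordinates expert models under multiple budget conditions, achieving an efficient trade-off between performance and inference cost.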
Details
Motivation: Different LLMs have complementary strengths across domains and different inference costs, and existing decentralized multi-model systems lead to uncontrolled inference costs, so a more efficient collaboration framework is needed.
Method: Formulates multi-model coordination as reinforcement learning with dual objectives, maximizing task performance while minimizing inference cost; adopts a centralized architecture in which a controller LLM dynamically selects which expert models to call according to the budget.
Result: Experiments on four benchmarks show that CoRL surpasses the strongest expert model under high budgets and maintains strong performance under low budgets, achieving efficient, cost-controllable multi-model collaboration.
Conclusion: A centralized multi-LLM framework combined with reinforcement learning effectively balances performance and cost, supports adaptive behavior under multiple budgets, and improves the scalability and cost efficiency of multi-agent systems.
Abstract: Large language models (LLMs) exhibit complementary strengths across domains and come with varying inference costs, motivating the design of multi-agent LLM systems where specialized models collaborate efficiently. Existing approaches predominantly rely on decentralized frameworks, which invoke multiple LLMs for every input and thus lead to substantial and uncontrolled inference costs. In this work, we introduce a centralized multi-LLM framework, where a controller LLM selectively coordinates a pool of expert models in a cost-efficient and cost-controllable manner. We formulate this coordination problem as reinforcement learning with dual objectives: maximizing task performance while minimizing the overall inference cost. In addition, we expect the multi-agent system to adapt its behavior to different budget conditions during inference. To this end, we propose CoRL, a reinforcement learning framework that optimizes the performance-cost trade-off in a controllable multi-budget setting. Experiments on four diverse benchmarks demonstrate that CoRL enables a single system to surpass the best expert LLM under high-budget settings, while maintaining strong performance in more economical low-budget modes, highlighting the effectiveness of centralized coordination for scalable and cost-efficient multi-agent LLM systems.
[22] Beyond Single Embeddings: Capturing Diverse Targets with Multi-Query Retrieval
Hung-Ting Chen,Xiang Liu,Shauli Ravfogel,Eunsol Choi
Main category: cs.CL
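TL;DR: This paper proposes AMER, a new retriever architecture that autoregressively generates multiple query vectors to overcome the limitations of single-vector retrieval when relevant documents follow a multimodal distribution, and shows significant gains on both synthetic and real datasets.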
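To make the multi-query retrieval idea in this summary concrete, here is a minimal sketch of retrieving with several query vectors at once. It assumes dot-product scoring and merges results by taking each document's best score over all query vectors; the entry only states that all predicted query vectors are used for retrieval, so the merge rule and the name `multi_query_retrieve` are illustrative assumptions. Generating the query vectors (the autoregressive part) is outside the sketch.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def multi_query_retrieve(query_vectors, doc_vectors, k):
    """Rank documents by their best score over all query vectors.

    Assumption: max-score merging; other merge rules (e.g. taking the
    top results per query in turn) would also fit the description.
    """
    best = {}
    for q in query_vectors:
        for doc_id, d in doc_vectors.items():
            score = dot(q, d)
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score
    ranked = sorted(best, key=lambda doc_id: (-best[doc_id], doc_id))
    return ranked[:k]


docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
single = multi_query_retrieve([[1.0, 0.0]], docs, k=2)
multi = multi_query_retrieve([[1.0, 0.0], [0.0, 1.0]], docs, k=2)
```

In this toy corpus the two targets `a` and `b` point in different directions: a single query vector can reach only one of them directly, while two query vectors recover both.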
Details
Motivation: Existing text retrievers usually generate a single query vector, which struggles to cover multiple relevant interpretations of a query (a multimodal distribution), especially when the target document embeddings differ widely.
Method: Proposes the Autoregressive Multi-Embedding Retriever (AMER), which autoregressively generates multiple query vectors and uses all predicted vectors to retrieve documents.
Result: On synthetic data, AMER captures multiple target distributions perfectly and performs 4x better than a single-embedding model; on real-world multi-answer datasets it achieves 4% and 21% relative gains over baselines, with larger gains on subsets where the target document embeddings are less similar to each other.
Conclusion: Retrieving with multiple query vectors offers clear advantages, and AMER opens a new direction for the design of future retrieval models.
Abstract: Most text retrievers generate one query vector to retrieve relevant documents. Yet, the conditional distribution of relevant documents for the query may be multimodal, e.g., representing different interpretations of the query. We first quantify the limitations of existing retrievers. All retrievers we evaluate struggle more as the distance between target document embeddings grows. To address this limitation, we develop a new retriever architecture, Autoregressive Multi-Embedding Retriever (AMER). Our model autoregressively generates multiple query vectors, and all the predicted query vectors are used to retrieve documents from the corpus. We show that on the synthetic vectorized data, the proposed method could capture multiple target distributions perfectly, showing 4x better performance than a single-embedding model. We also fine-tune our model on real-world multi-answer retrieval datasets and evaluate in-domain. AMER presents 4% and 21% relative gains over single-embedding baselines on the two datasets we evaluate on. Furthermore, we consistently observe larger gains on the subset of the dataset where the embeddings of the target documents are less similar to each other. We demonstrate the potential of using a multi-query vector retriever and open up a new direction for future work.
[23] MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning
Qianhao Yuan,Jie Lou,Zichao Li,Jiawei Chen,Yaojie Lu,Hongyu Lin,Le Sun,Debing Zhang,Xianpei Han
Main category: cs.CL
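TL;DR: This paper proposes MemSearcher, which improves the efficiency and accuracy of search agents by maintaining a compact memory and combining it with the current turn, and introduces a multi-context GRPO framework for end-to-end optimization; it significantly outperforms baseline models on multiple benchmarks.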
Details
Motivation: Conventional search agents face a trade-off between long contexts and information loss, which limits scalability; this work aims to resolve that tension between efficiency and accuracy.
Method: Proposes the MemSearcher agent workflow, which iteratively updates a compact memory and fuses the current question with that memory to generate reasoning traces, perform searches, and update the memory; also designs multi-context GRPO, a reinforcement learning framework that jointly optimizes reasoning, search strategies, and memory management.
Result: On seven public benchmarks, MemSearcher improves over strong baselines by an average of 11% (3B model) and 12% (7B model); the 3B model even surpasses 7B baselines, with stable context length and lower computational overhead.
Conclusion: By balancing information integrity and efficiency, MemSearcher achieves higher accuracy at lower computational cost and improves the scalability of multi-turn search agents.
Abstract: Typical search agents concatenate the entire interaction history into the LLM context, preserving information integrity but producing long, noisy contexts, resulting in high computation and memory costs. In contrast, using only the current turn avoids this overhead but discards essential information. This trade-off limits the scalability of search agents. To address this challenge, we propose MemSearcher, an agent workflow that iteratively maintains a compact memory and combines the current turn with it. At each turn, MemSearcher fuses the user's question with the memory to generate reasoning traces, perform search actions, and update memory to retain only information essential for solving the task. This design stabilizes context length across multi-turn interactions, improving efficiency without sacrificing accuracy. To optimize this workflow, we introduce multi-context GRPO, an end-to-end RL framework that jointly optimizes reasoning, search strategies, and memory management of MemSearcher agents. Specifically, multi-context GRPO samples groups of trajectories under different contexts and propagates trajectory-level advantages across all conversations within them. Trained on the same dataset as Search-R1, MemSearcher achieves significant improvements over strong baselines on seven public benchmarks: +11% on Qwen2.5-3B-Instruct and +12% on Qwen2.5-7B-Instruct relative average gains. Notably, the 3B-based MemSearcher even outperforms 7B-based baselines, demonstrating that striking a balance between information integrity and efficiency yields both higher accuracy and lower computational overhead.
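The step described above, in which trajectory-level advantages are propagated to all conversations within a trajectory, can be sketched as follows. The sketch assumes the usual group-relative normalization (reward minus group mean, divided by group standard deviation); the abstract does not spell out the exact formula, and the function names and the `eps` stabilizer are hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Group-relative advantage for each trajectory in one sampled group.
    Assumption: mean/std normalization within the group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def propagate_to_conversations(trajectories, rewards):
    """Attach each trajectory's advantage to every conversation
    (context) that the trajectory produced."""
    advantages = group_advantages(rewards)
    return [
        [(conversation, advantage) for conversation in trajectory]
        for trajectory, advantage in zip(trajectories, advantages)
    ]


# Two trajectories: the first spans two conversations, the second one.
trajectories = [["t1-turn1", "t1-turn2"], ["t2-turn1"]]
labelled = propagate_to_conversations(trajectories, rewards=[1.0, 0.0])
```

Every conversation in the first trajectory receives the same positive advantage and the single conversation in the second receives the matching negative one, so each context can serve as a separate training example while sharing the trajectory-level signal.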
The code and models will be publicly available at https://github.com/icip-cas/MemSearcher
[24] Oolong: Evaluating Long Context Reasoning and Aggregation Capabilities
Amanda Bertsch,Adithya Pratapa,Teruko Mitamura,Graham Neubig,Matthew R. Gormley
Main category: cs.CL
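TL;DR: Oolong is a new long-context reasoning benchmark with synthetic and real-world conversational tasks that require models to analyze chunks of text at a fine-grained level and aggregate the results to answer distributional questions. Current frontier models perform poorly at 128K context length, all scoring below 50% accuracy.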
Details
Motivation: As context lengths grow, it remains unclear whether models use the full context effectively; existing long-context evaluations mostly rely on retrieval, which lets most of the context be ignored and so fails to assess long-context reasoning comprehensively.
Method: Proposes the Oolong benchmark, split into Oolong-synth (naturalistic synthetic tasks that can be ablated) and Oolong-real (reasoning over real conversational data); the tasks require in-context classification and counting, and reasoning over temporal and user relations.
Result: Even state-of-the-art models (GPT-5, Claude-Sonnet-4, Gemini-2.5-Pro) score below 50% accuracy on both task sets at 128K context length.
Conclusion: Current models still fall well short on long-context tasks that require fine-grained analysis and aggregated reasoning; Oolong provides an evaluation tool to drive progress on these capabilities.
Abstract: As model context lengths continue to grow, concerns about whether models effectively use the full context length have persisted. While several carefully designed long-context evaluations have recently been released, these evaluations tend to rely on retrieval from one or more sections of the context, which allows nearly all of the context tokens to be disregarded as noise. This represents only one type of task that might be performed with long context. We introduce Oolong, a benchmark of long-context reasoning tasks that require analyzing individual chunks of text on an atomic level, and then aggregating these analyses to answer distributional questions. Oolong is separated into two task sets: Oolong-synth, a set of naturalistic synthetic tasks, where we can easily ablate components of the reasoning problem; and Oolong-real, a downstream setting which requires reasoning over real-world conversational data. Oolong requires models to reason over large quantities of examples, to perform both classification and counting in-context, and to reason over temporal and user relations. Even frontier models struggle on Oolong, with GPT-5, Claude-Sonnet-4, and Gemini-2.5-Pro all achieving less than 50% accuracy on both splits at 128K. We release the data and evaluation harness for Oolong to enable further development of models that can reason over large quantities of text.
cs.CV [Back]
[25] iFlyBot-VLA Technical Report
Yuan Zhang,Chenyu Xue,Wenjie Xu,Chao Ji,Jiajia wu,Jia Pan
Main category: cs.CV
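TL;DR: This paper presents iFlyBot-VLA, a large-scale Vision-Language-Action (VLA) model trained under a new framework. A dual-level action representation framework and a mixed training strategy enhance its 3D perception and reasoning, and it performs well on the LIBERO benchmark and in real-world settings.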
Details
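Motivation: To improve the action-generation ability of vision-language models in robotic manipulation, the representation spaces of language, vision, and action need to be better aligned, and the model's 3D perception and reasoning need to be strengthened.
Method: Proposes a new VLA training framework comprising (1) a latent action model pretrained on large-scale human and robotic manipulation videos; (2) a dual-level action representation framework that jointly supervises the VLM and the action expert; and (3) a mixed training strategy that combines robot trajectory data with general and spatial QA datasets. The VLM is trained to predict two forms of action: implicit high-level intentions from the latent action model (latent actions) and explicit low-level dynamics obtained through frequency-domain transformations (structured discrete action tokens).
Result: Experiments on the LIBERO Franka benchmark show that the method outperforms existing approaches, and real-world evaluations show that iFlyBot-VLA achieves competitive success rates across a variety of complex manipulation tasks.
Conclusion: Dual-level action supervision and mixed-data training effectively align vision, language, and action representations, allowing the VLM to participate directly in action generation and advancing the use of VLA models in robotic manipulation.
Abstract: We introduce iFlyBot-VLA, a large-scale Vision-Language-Action (VLA) model trained under a novel framework. The main contributions are listed as follows: (1) a latent action model thoroughly trained on large-scale human and robotic manipulation videos; (2) a dual-level action representation framework that jointly supervises both the Vision-Language Model (VLM) and the action expert during training; (3) a mixed training strategy that combines robot trajectory data with general QA and spatial QA datasets, effectively enhancing the 3D perceptual and reasoning capabilities of the VLM backbone. Specifically, the VLM is trained to predict two complementary forms of actions: latent actions, derived from our latent action model pretrained on cross-embodiment manipulation data, which capture implicit high-level intentions; and structured discrete action tokens, obtained through frequency-domain transformations of continuous control signals, which encode explicit low-level dynamics. This dual supervision aligns the representation spaces of language, vision, and action, enabling the VLM to directly contribute to action generation. Experimental results on the LIBERO Franka benchmark demonstrate the superiority of our framework, while real-world evaluations further show that iFlyBot-VLA achieves competitive success rates across diverse and challenging manipulation tasks.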
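One way to picture the discrete action tokens obtained through frequency-domain transformations of continuous control signals is a transform followed by uniform quantization. The sketch below assumes an unnormalized DCT-II and a fixed quantization step; the abstract does not name the transform, the step size, or any later compression stage, so these choices and the function names are purely illustrative.

```python
import math


def dct_ii(signal):
    """Unnormalized DCT-II of a 1-D control signal."""
    n = len(signal)
    return [
        sum(x * math.cos(math.pi / n * (i + 0.5) * k)
            for i, x in enumerate(signal))
        for k in range(n)
    ]


def tokenize_actions(signal, step=0.5):
    """Quantize frequency coefficients into integer tokens.
    Assumption: uniform quantization with a hypothetical step size."""
    return [round(coefficient / step) for coefficient in dct_ii(signal)]


# A constant joint command has energy only in the lowest frequency.
tokens = tokenize_actions([1.0, 1.0, 1.0, 1.0])
```

Smooth trajectories concentrate their energy in the first few coefficients, so most tokens are zero, which is what makes a frequency-domain representation compact.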
Furthermore, we plan to open-source a portion of our self-constructed dataset to support future research in the community.
[26] Challenging DINOv3 Foundation Model under Low Inter-Class Variability: A Case Study on Fetal Brain Ultrasound
Edoardo Conti,Riccardo Rosati,Lorenzo Federici,Adriano Mancini,Maria Chiara Fiorentin
Main category: cs.CV
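TL;DR: This study is the first to systematically evaluate foundation models in fetal ultrasound imaging under low inter-class variability. It introduces the multicenter benchmark dataset FetalUS-188K and shows that domain-adaptive pretraining is essential for distinguishing similar anatomical structures.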
Details
Motivation: Vision foundation models transfer well to medical domains, but their performance in distinguishing anatomically very similar fetal brain standard planes (such as TT, TV, and TC) is unknown, so their discriminative ability under low inter-class variability needs systematic evaluation.
Method: Aggregates all publicly available fetal ultrasound datasets into FetalUS-188K, a multicenter benchmark of about 188,000 annotated images; pretrains DINOv3 in a self-supervised manner and compares pretraining on fetal ultrasound data against initialization from natural-image weights, evaluating with linear probing and full fine-tuning.
Result: Models pretrained on fetal ultrasound data clearly outperform those initialized from natural images, with weighted F1-score improvements of up to 20%; domain-adaptive pretraining helps preserve the echogenic and structural cues needed to distinguish intermediate planes such as TV.
Conclusion: Generic foundation models generalize poorly under low inter-class variability, and domain-specific pretraining (here, on fetal ultrasound) is essential for robust and clinically reliable representations.
Abstract: Purpose: This study provides the first comprehensive evaluation of foundation models in fetal ultrasound (US) imaging under low inter-class variability conditions. While recent vision foundation models such as DINOv3 have shown remarkable transferability across medical domains, their ability to discriminate anatomically similar structures has not been systematically investigated. We address this gap by focusing on three fetal brain standard planes, transthalamic (TT), transventricular (TV), and transcerebellar (TC), which exhibit highly overlapping anatomical features and pose a critical challenge for reliable biometric assessment. Methods: To ensure a fair and reproducible evaluation, all publicly available fetal ultrasound datasets were curated and aggregated into a unified multicenter benchmark, FetalUS-188K, comprising more than 188,000 annotated images from heterogeneous acquisition settings. DINOv3 was pretrained in a self-supervised manner to learn ultrasound-aware representations. The learned features were then evaluated through standardized adaptation protocols, including linear probing with frozen backbone and full fine-tuning, under two initialization schemes: (i) pretraining on FetalUS-188K and (ii) initialization from natural-image DINOv3 weights. Results: Models pretrained on fetal ultrasound data consistently outperformed those initialized on natural images, with weighted F1-score improvements of up to 20 percent. Domain-adaptive pretraining enabled the network to preserve subtle echogenic and structural cues crucial for distinguishing intermediate planes such as TV.
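The weighted F1-score reported above averages per-class F1 values weighted by each class's share of the ground-truth labels. A minimal sketch of that standard definition follows; the plane labels in the example are toy data, not results from the study.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = count - tp
        denominator = 2 * tp + fp + fn
        f1 = 2 * tp / denominator if denominator else 0.0
        score += (count / total) * f1
    return score


# Toy example: one transthalamic (TT) image is mistaken for
# a transventricular (TV) one.
y_true = ["TT", "TT", "TV", "TC"]
y_pred = ["TT", "TV", "TV", "TC"]
score = weighted_f1(y_true, y_pred)
```

Because the weights follow class frequency, a confusion between two visually similar planes lowers the F1 of both classes involved, which is why this metric is sensitive to low inter-class variability.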
Conclusion: Results demonstrate that generic foundation models fail to generalize under low inter-class variability, whereas domain-specific pretraining is essential to achieve robust and clinically reliable representations in fetal brain ultrasound imaging.
[27] Assessing the value of Geo-Foundational Models for Flood Inundation Mapping: Benchmarking models for Sentinel-1, Sentinel-2, and Planetscope for end-users
Saurabh Kaushik,Lalit Maurya,Elizabeth Tellman,ZhiJie Zhang
Main category: cs.CV
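TL;DR: This paper evaluates three Geo-Foundational Models (GFMs), Prithvi 2.0, Clay V1.5, and DOFA, together with UViT (a Prithvi variant), on flood inundation mapping and compares them with traditional models such as U-Net. GFMs outperform traditional models on most sensors, with Clay performing best in both accuracy and computational efficiency.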
Details
Motivation: Although GFMs show promise for extracting spatiotemporal information, their advantage over traditional models is unclear, and no systematic comparison across sensors and data-availability scenarios exists; a systematic evaluation is needed to guide users in model selection.
Method: Compares three GFMs (Prithvi 2.0, Clay V1.5, DOFA) and UViT, a Prithvi variant, against TransNorm, U-Net, and Attention U-Net on PlanetScope, Sentinel-1, and Sentinel-2 data, using leave-one-region-out cross-validation across multiple regions and few-shot experiments, and assesses mIoU, retention of fine detail, and inference speed.
Result: All GFMs perform similarly, differing by only 2-5%. Clay is best on PlanetScope (0.79 mIoU) and Sentinel-2 (0.70), while Prithvi leads on Sentinel-1 (0.57). Clay is slightly better overall in five-region leave-one-region-out validation and improves on U-Net by 4% across 19 sites. In the few-shot setting, Clay reaches 0.64 mIoU with only five images, well ahead of Prithvi (0.24) and DOFA (0.35). Clay has the fewest parameters (26M) and runs about 3x faster than Prithvi and 2x faster than DOFA.
Conclusion: Compared with a traditional U-Net, GFMs offer small to moderate accuracy gains in flood mapping while reducing computational cost and labeling effort; Clay is the best overall in performance, efficiency, and few-shot learning, making it the most practical choice.
Abstract: Geo-Foundational Models (GFMs) enable fast and reliable extraction of spatiotemporal information from satellite imagery, improving flood inundation mapping by leveraging location and time embeddings. Despite their potential, it remains unclear whether GFMs outperform traditional models like U-Net. A systematic comparison across sensors and data availability scenarios, an essential step to guide end-users in model selection, is still lacking. To address this, we evaluate three GFMs (Prithvi 2.0, Clay V1.5, and DOFA) and UViT (a Prithvi variant) against TransNorm, U-Net, and Attention U-Net using PlanetScope, Sentinel-1, and Sentinel-2. We observe competitive performance among all GFMs, with only 2-5% variation between the best and worst models across sensors. Clay outperforms others on PlanetScope (0.79 mIoU) and Sentinel-2 (0.70), while Prithvi leads on Sentinel-1 (0.57). In leave-one-region-out cross-validation across five regions, Clay shows slightly better performance across all sensors (mIoU: 0.72(0.04), 0.66(0.07), 0.51(0.08)) compared to Prithvi (0.70(0.05), 0.64(0.09), 0.49(0.13)) and DOFA (0.67(0.07), 0.64(0.04), 0.49(0.09)) for PlanetScope, Sentinel-2, and Sentinel-1, respectively. Across all 19 sites, leave-one-region-out cross-validation reveals a 4% improvement by Clay compared to U-Net.
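The mIoU figures quoted above are means of per-class intersection-over-union. For readers unfamiliar with the metric, a minimal sketch for flattened label maps follows; it assumes two classes (0 = dry, 1 = flooded) and skips classes absent from both prediction and reference, which may differ from the exact evaluation protocol used in the paper.

```python
def mean_iou(pred, target, num_classes=2):
    """Mean intersection-over-union over classes for flat label lists.
    Assumption: classes missing from both inputs are skipped."""
    ious = []
    for c in range(num_classes):
        pairs = list(zip(pred, target))
        intersection = sum(1 for p, t in pairs if p == c and t == c)
        union = sum(1 for p, t in pairs if p == c or t == c)
        if union > 0:
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0


# Four pixels; the model over-predicts water on one of them.
prediction = [1, 1, 0, 0]
reference = [1, 0, 0, 0]
score = mean_iou(prediction, reference)
```

Here the flooded class scores 1/2 and the dry class 2/3, so the mean is 7/12. One mislabelled pixel costs both classes, which is why mIoU is stricter than plain pixel accuracy (3/4 in this example).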
Visual inspection highlights Clay's superior ability to retain fine details. Few-shot experiments show Clay achieves 0.64 mIoU on PlanetScope with just five training images, outperforming Prithvi (0.24) and DOFA (0.35). In terms of computational time, Clay is a better choice due to its smaller model size (26M parameters), making it ~3x faster than Prithvi (650M) and 2x faster than DOFA (410M). Contrary to previous findings, our results suggest GFMs offer small to moderate improvements in flood mapping accuracy at lower computational cost and labeling effort compared to traditional U-Net.
[28] Locally-Supervised Global Image Restoration
Benjamin Walder,Daniel Toader,Robert Nuster,Günther Paltauf,Peter Burgholzer,Gregor Langer,Lukas Krainer,Markus Haltmeier
Main category: cs.CV
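TL;DR: Proposes a learning-based framework that exploits multiple invariances of the image distribution to reconstruct images from incomplete measurements under fixed, deterministic sampling patterns, substantially reducing the need for fully sampled ground-truth data.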
Details
Motivation: Conventional supervised methods need fully sampled ground-truth data, while self-supervised methods typically rely on random sampling; this work addresses image reconstruction under fixed, deterministic, and inherently incomplete sampling patterns.
Method: Exploits multiple invariances of the underlying image distribution within a learning-based reconstruction framework, so that performance under incomplete measurements (such as fixed sampling patterns) approaches that of fully supervised methods.
Result: Validated on optical-resolution image upsampling in photoacoustic microscopy (PAM), the method gives competitive or superior results while requiring substantially less ground-truth data.
Conclusion: The method achieves high-quality image reconstruction even when sampling is severely restricted, offering a practical answer to limited data acquisition in real applications.
Abstract: We address the problem of image reconstruction from incomplete measurements, encompassing both upsampling and inpainting, within a learning-based framework. Conventional supervised approaches require fully sampled ground truth data, while self-supervised methods allow incomplete ground truth but typically rely on random sampling that, in expectation, covers the entire image. In contrast, we consider fixed, deterministic sampling patterns with inherently incomplete coverage, even in expectation. To overcome this limitation, we exploit multiple invariances of the underlying image distribution, which theoretically allows us to achieve the same reconstruction performance as fully supervised approaches. We validate our method on optical-resolution image upsampling in photoacoustic microscopy (PAM), demonstrating competitive or superior results while requiring substantially less ground truth data.
[29] Towards Selection of Large Multimodal Models as Engines for Burned-in Protected Health Information Detection in Medical Images
Tuan Truong,Guillermo Jimenez Perez,Pedro Osorio,Matthias Lenga
Main category: cs.CV
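TL;DR: This study systematically evaluates three large multimodal models (GPT-4o, Gemini 2.5 Flash, and Qwen 2.5 7B) for detecting Protected Health Information (PHI) in medical images, comparing a text-analysis-only pipeline with one that integrates OCR and semantic analysis. LMMs outperform traditional methods at OCR, but the gain in overall PHI detection accuracy is inconsistent and is strongest on complex imprint patterns.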
Details
Motivation: Accurately detecting PHI in medical images is essential for protecting patient privacy and meeting regulatory requirements. Existing methods based on OCR and named entity recognition have limitations, while emerging large multimodal models offer new opportunities for text extraction and semantic analysis, so their practical effectiveness on this task needs systematic evaluation.
Method: Benchmarks three prominent closed- and open-source LMMs (GPT-4o, Gemini 2.5 Flash, Qwen 2.5 7B) in two pipeline configurations, one performing text analysis only and one integrating OCR with semantic analysis, on a medical image dataset that varies in text readability and imprint complexity.
Result: LMMs clearly outperform traditional models such as EasyOCR at OCR (WER: 0.03-0.05, CER: 0.02-0.03), but this advantage does not consistently translate into higher overall PHI detection accuracy. Gains are largest on test cases with complex imprints; when text is clear with good contrast, the pipelines give similar results with strong LMMs.
Conclusion: Although LMMs excel at OCR, their overall benefit for PHI detection depends on the application scenario and text characteristics. The study recommends choosing an LMM according to operational constraints (such as compute resources and latency requirements) and proposes a scalable, modular deployment architecture.
Abstract: The detection of Protected Health Information (PHI) in medical imaging is critical for safeguarding patient privacy and ensuring compliance with regulatory frameworks. Traditional detection methodologies predominantly utilize Optical Character Recognition (OCR) models in conjunction with named entity recognition. However, recent advancements in Large Multimodal Models (LMMs) present new opportunities for enhanced text extraction and semantic analysis. In this study, we systematically benchmark three prominent closed and open-sourced LMMs, namely GPT-4o, Gemini 2.5 Flash, and Qwen 2.5 7B, utilizing two distinct pipeline configurations: one dedicated to text analysis alone and another integrating both OCR and semantic analysis. Our results indicate that LMMs exhibit superior OCR efficacy (WER: 0.03-0.05, CER: 0.02-0.03) compared to conventional models like EasyOCR. However, this improvement in OCR performance does not consistently correlate with enhanced overall PHI detection accuracy. The strongest performance gains are observed on test cases with complex imprint patterns. In scenarios where text regions are well readable with sufficient contrast, and strong LMMs are employed for text analysis after OCR, different pipeline configurations yield similar results.
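The WER and CER figures above are word and character error rates, both derived from edit distance between the reference text and the OCR output. A minimal sketch of the standard definitions follows; the strings are made-up examples, and normalization choices such as casing or whitespace handling, which the abstract does not describe, are left out.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences
    (insertions, deletions, substitutions all cost 1)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, 1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, 1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(previous[j] + 1,        # deletion
                               current[j - 1] + 1,     # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits per reference word."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits per reference character."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


word_rate = wer("patient name john doe", "patient name jon doe")
char_rate = cer("john", "jon")
```

A single dropped letter counts as one whole wrong word for WER but only one wrong character for CER, which is why the two rates are usually reported together.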
Furthermore, we provide empirically grounded recommendations for LMM selection tailored to specific operational constraints and propose a deployment strategy that leverages scalable and modular infrastructure.
[30] StrengthSense: A Dataset of IMU Signals Capturing Everyday Strength-Demanding Activities
Zeyu Yang,Clayton Souza Leite,Yu Xiao
Main category: cs.CV
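TL;DR: This paper introduces StrengthSense, an open IMU dataset covering 11 strength-demanding activities and 2 non-strength-demanding activities, intended to advance human activity recognition algorithms and health monitoring applications.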
Details
Motivation: The lack of comprehensive datasets capturing strength-demanding activities limits related research and technology development.
Method: Data were collected from 29 healthy subjects wearing 10 IMU sensors and annotated using video recordings; joint angles estimated from the IMUs were compared with angles extracted from video to verify data accuracy.
Result: A high-quality, multi-sensor open dataset, StrengthSense, was built and validated, supporting precise monitoring of a range of activities.
Conclusion: StrengthSense provides a reliable resource for developing strength-related human activity recognition algorithms and health monitoring tools.
Abstract: Tracking strength-demanding activities with wearable sensors like IMUs is crucial for monitoring muscular strength, endurance, and power. However, there is a lack of comprehensive datasets capturing these activities. To fill this gap, we introduce StrengthSense, an open dataset that encompasses IMU signals capturing 11 strength-demanding activities, such as sit-to-stand, climbing stairs, and mopping. For comparative purposes, the dataset also includes 2 non-strength demanding activities. The dataset was collected from 29 healthy subjects utilizing 10 IMUs placed on limbs and the torso, and was annotated using video recordings as references. This paper provides a comprehensive overview of the data collection, pre-processing, and technical validation. We conducted a comparative analysis between the joint angles estimated by IMUs and those directly extracted from video to verify the accuracy and reliability of the sensor data. Researchers and developers can utilize StrengthSense to advance the development of human activity recognition algorithms, create fitness and health monitoring applications, and more.
[31] Text-VQA Aug: Pipelined Harnessing of Large Multimodal Models for Automated Synthesis
Soham Joshi,Shwet Kamal Mishra,Viswanath Gopalakrishnan
Main category: cs.CV
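TL;DR: Proposes an end-to-end pipeline for automatically synthesizing and validating large-scale text-based visual question answering (text-VQA) datasets, using OCR, region-of-interest detection, caption generation, and question generation to produce about 72K QA pairs from about 44K images.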
Details
Motivation: Building large-scale text-VQA databases depends on manual annotation, which is slow and difficult, so a scalable automated way to synthesize QA pairs grounded in scene text is needed.
Method: Combines multiple models and algorithms, including OCR recognition, region-of-interest detection, caption generation, and question generation, into an integrated pipeline that automatically synthesizes and validates QA pairs.
Result: Produced a large-scale text-VQA dataset of about 72K QA pairs based on about 44K images, demonstrating high-quality, scalable data generation.
Conclusion: The pipeline is the first to automatically synthesize and validate a large-scale text-VQA dataset, substantially reducing reliance on manual annotation and advancing the text-VQA field.
Abstract: Creation of large-scale databases for Visual Question Answering tasks pertaining to the text data in a scene (text-VQA) involves skilful human annotation, which is tedious and challenging. With the advent of foundation models that handle vision and language modalities, and with the maturity of OCR systems, it is the need of the hour to establish an end-to-end pipeline that can synthesize Question-Answer (QA) pairs based on scene-text from a given image. We propose a pipeline for automated synthesis of text-VQA datasets that can produce faithful QA pairs, and which scales up with the availability of scene text data. Our proposed method harnesses the capabilities of multiple models and algorithms involving OCR detection and recognition (text spotting), region of interest (ROI) detection, caption generation, and question generation. These components are streamlined into a cohesive pipeline to automate the synthesis and validation of QA pairs. To the best of our knowledge, this is the first pipeline proposed to automatically synthesize and validate a large-scale text-VQA dataset comprising around 72K QA pairs based on around 44K images.
[32] Markerless Augmented Reality Registration for Surgical Guidance: A Multi-Anatomy Clinical Accuracy Study
Yue Yang,Fabian Necker,Christoph Leuze,Michelle Chen,Andrey Finegersh,Jake Lee,Vasu Divi,Bruce Daniel,Brian Hargreaves,Jie Ying Wu,Fred M Baik
Main category: cs.CV
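TL;DR: This paper proposes and clinically evaluates a depth-only, markerless augmented reality (AR) registration pipeline for head-mounted displays, achieving a median error of about 3-4 mm on small or low-curvature anatomies in live surgical settings, close to the tolerance threshold for moderate-risk clinical tasks.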
Details
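Motivation: Existing AR navigation systems usually rely on external markers or high-curvature anatomical features for registration, which limits their use on small or low-curvature regions such as the foot and ear. This work aims to develop a markerless registration method based only on a depth sensor, improving the applicability and clinical utility of AR in complex surgical scenarios.
Method: On HoloLens 2, patient surface depth is captured with the Articulated HAnd Tracking (AHAT) depth camera and aligned to a CT-derived skin mesh via (i) depth-bias correction, (ii) brief human-in-the-loop initialization, and (iii) global and local registration. The surface-tracing error metric is validated with an AR-tracked tool by comparing "skin-to-bone" relative distances against CT ground truth. Target trials were run in seven real surgeries (feet, ear, lower leg), with more than 500 data points collected per trial.
Result: Preclinical validation showed close agreement between AR-traced and CT distances (leg: median |Δd| 0.78 mm, RMSE 0.97 mm; feet: 0.80 mm, 1.20 mm). Clinically, the overall median error was 3.9 mm; by anatomy it was 3.2 mm (feet), 4.3 mm (ear), and 5.3 mm (lower leg), with 5 mm coverage of 92-95%, 84-90%, and 72-86%, respectively. Feet and lower leg differed significantly (Δ median ~1.1 mm, p < 0.001).
Conclusion: The depth-only, markerless AR registration pipeline achieved roughly 3-4 mm median error on small or low-curvature anatomies in live surgical settings. Human-guided initialization combined with multi-stage registration improves the clinical feasibility of markerless AR guidance, with potential for broad use in plastic and reconstructive surgery.
Abstract: Purpose: In this paper, we develop and clinically evaluate a depth-only, markerless augmented reality (AR) registration pipeline on a head-mounted display, and assess accuracy across small or low-curvature anatomies in real-life operative settings. Methods: On HoloLens 2, we align Articulated HAnd Tracking (AHAT) depth to Computed Tomography (CT)-derived skin meshes via (i) depth-bias correction, (ii) brief human-in-the-loop initialization, (iii) global and local registration. We validated the surface-tracing error metric by comparing "skin-to-bone" relative distances to CT ground truth on leg and foot models, using an AR-tracked tool. We then performed seven intraoperative target trials (feet x2, ear x3, leg x2) during the initial stage of fibula free-flap harvest and mandibular reconstruction surgery, and collected 500+ data points per trial. Results: Preclinical validation showed tight agreement between AR-traced and CT distances (leg: median |Δd| 0.78 mm, RMSE 0.97 mm; feet: 0.80 mm, 1.20 mm). Clinically, per-point error had a median of 3.9 mm. Median errors by anatomy were 3.2 mm (feet), 4.3 mm (ear), and 5.3 mm (lower leg), with 5 mm coverage 92-95%, 84-90%, and 72-86%, respectively. Feet vs. lower leg differed significantly (Δ median ~1.1 mm; p < 0.001).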
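The two summary statistics reported above, median per-point error and 5 mm coverage, can be computed as in the sketch below. It assumes "5 mm coverage" means the fraction of points whose error is at most 5 mm, which is the natural reading but is not defined in the abstract; the error values are invented for illustration.

```python
from statistics import median


def summarize_errors(errors_mm, threshold_mm=5.0):
    """Median error and the share of points within the threshold.
    Assumption: coverage = fraction of points with error <= threshold."""
    within = sum(1 for e in errors_mm if e <= threshold_mm)
    return {
        "median_mm": median(errors_mm),
        "coverage": within / len(errors_mm),
    }


# Hypothetical per-point registration errors in millimetres.
summary = summarize_errors([1.0, 3.0, 4.0, 6.0])
```

Reporting both numbers is useful because a low median can hide a tail of large errors, while coverage states directly how often the overlay stays inside a clinically chosen tolerance.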
Conclusion: A depth-only, markerless AR pipeline on HMDs achieved ~3-4 mm median error across feet, ear, and lower leg in live surgical settings without fiducials, approaching typical clinical error thresholds for moderate-risk tasks. Human-guided initialization plus global-to-local registration enabled accurate alignment on small or low-curvature targets, improving the clinical readiness of markerless AR guidance.
[33] From Instance Segmentation to 3D Growth Trajectory Reconstruction in Planktonic Foraminifera
Huahua Lin,Xiaohao Cai,Mark Nixon,James M. Mulqueeney,Thomas H. G. Ezard
Main category: cs.CV
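TL;DR: This study proposes an end-to-end automated pipeline that combines instance segmentation with a dedicated chamber-ordering algorithm to reconstruct the 3D growth trajectories of planktonic foraminifera from high-resolution CT scans, greatly reducing manual effort while preserving biological accuracy.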
Details
Motivation: Existing chamber-tracing methods rely on slow and subjective manual segmentation, and there is no automated way to reconstruct foraminiferal growth trajectories accurately, which limits large-scale ecological and developmental studies.
Method: Combines instance segmentation from computer vision with a dedicated chamber-ordering algorithm to process high-resolution CT images, automatically segmenting chambers and reconstructing their growth order, and evaluates how different segmentation methods affect downstream analysis.
Result: Validation on expert-annotated datasets shows that the pipeline substantially reduces manual work while keeping biologically meaningful accuracy; although small chambers are under-segmented, the ordering algorithm still reconstructs developmental trajectories robustly.
Conclusion: The study provides the first fully automated and reproducible pipeline for foraminiferal growth analysis, a reliable tool for future large-scale, data-driven ecological and evolutionary research.
Abstract: Planktonic foraminifera, marine protists characterized by their intricate chambered shells, serve as valuable indicators of past and present environmental conditions. Understanding their chamber growth trajectory provides crucial insights into organismal development and ecological adaptation under changing environments. However, automated tracing of chamber growth from imaging data remains largely unexplored, with existing approaches relying heavily on manual segmentation of each chamber, which is time-consuming and subjective. In this study, we propose an end-to-end pipeline that integrates instance segmentation, a computer vision technique not extensively explored in foraminifera, with a dedicated chamber ordering algorithm to automatically reconstruct three-dimensional growth trajectories from high-resolution computed tomography scans. We quantitatively and qualitatively evaluate multiple instance segmentation methods, each optimized for distinct spatial features of the chambers, and examine their downstream influence on growth-order reconstruction accuracy. Experimental results on expert-annotated datasets demonstrate that the proposed pipeline substantially reduces manual effort while maintaining biologically meaningful accuracy. Although segmentation models exhibit under-segmentation in smaller chambers due to reduced voxel fidelity and subtle inter-chamber connectivity, the chamber-ordering algorithm remains robust, achieving consistent reconstruction of developmental trajectories even under partial segmentation.
This work provides the first fully automated and reproducible pipeline for digital foraminiferal growth analysis, establishing a foundation for large-scale, data-driven ecological studies.[34] Fast Measuring Pavement Crack Width by Cascading Principal Component Analysis
Zhicheng Wang,Junbiao Pang
Main category: cs.CV
TL;DR: Proposes a cascaded framework combining PCA and Robust PCA for efficient extraction of pavement crack width, outperforming existing techniques in both computational efficiency and measurement accuracy.
Details
Motivation: Crack boundaries have a complex morphology that limits conventional methods, and width must be measured quickly from arbitrary pixel locations, so a more precise way to quantify crack width is needed.
Method: The method has three stages: first, an existing detection algorithm performs initial crack segmentation to produce a binary image; next, PCA determines the primary orientation axis of quasi-parallel cracks; finally, RPCA extracts the Main Propagation Axis (MPA) of irregular crack geometries.
Result: Experiments on three public datasets show that the method outperforms existing state-of-the-art methods in both computational efficiency and measurement accuracy.
Conclusion: The proposed cascaded PCA-RPCA framework handles complex crack morphology and rapid-measurement requirements, markedly improving the accuracy and efficiency of pavement crack width measurement.
Abstract: Accurate quantification of pavement crack width plays a pivotal role in assessing structural integrity and guiding maintenance interventions. However, achieving precise crack width measurements presents significant challenges due to: (1) the complex, non-uniform morphology of crack boundaries, which limits the efficacy of conventional approaches, and (2) the demand for rapid measurement capabilities from arbitrary pixel locations to facilitate comprehensive pavement condition evaluation. To overcome these limitations, this study introduces a cascaded framework integrating Principal Component Analysis (PCA) and Robust PCA (RPCA) for efficient crack width extraction from digital images. The proposed methodology comprises three sequential stages: (1) initial crack segmentation using established detection algorithms to generate a binary representation, (2) determination of the primary orientation axis for quasi-parallel cracks through PCA, and (3) extraction of the Main Propagation Axis (MPA) for irregular crack geometries using RPCA. Comprehensive evaluations were conducted across three publicly available datasets, demonstrating that the proposed approach achieves superior performance in both computational efficiency and measurement accuracy compared to existing state-of-the-art techniques.
[35] Autobiasing Event Cameras for Flickering Mitigation
Mehdi Sefidgar Dilmaghani,Waseem Shariff,Cian Ryan,Joe Lemley,Peter Corcoran
Main category: cs.CV
TL;DR: This paper proposes a CNN-based method for autonomously tuning event-camera biases that effectively suppresses flicker in the 25 Hz to 500 Hz range without additional hardware or software filtering.
Details
Motivation: Event cameras are prone to flicker under rapidly changing illumination, and conventional methods depend on extra hardware or software, which limits their use in complex environments.
Method: A convolutional neural network (CNN) detects flicker in the spatial domain and dynamically adjusts the event camera's bias parameters to suppress it.
Result: In tests across different lighting and frequency conditions, YOLO face-detection confidence improved and the percentage of frames with detected faces increased; the average gradient (an indicator of flicker) decreased by 38.2% in well-lit conditions and by 53.6% in low-light conditions.
Conclusion: Adaptive bias tuning significantly improves event-camera performance in complex lighting environments and has broad application potential.
Abstract: Understanding and mitigating flicker effects caused by rapid variations in light intensity is critical for enhancing the performance of event cameras in diverse environments. This paper introduces an innovative autonomous mechanism for tuning the biases of event cameras, effectively addressing flicker across a wide frequency range, 25 Hz to 500 Hz. Unlike traditional methods that rely on additional hardware or software for flicker filtering, our approach leverages the event camera's inherent bias settings. Utilizing a simple Convolutional Neural Network (CNN), the system identifies instances of flicker in the spatial domain and dynamically adjusts specific biases to minimize its impact. The efficacy of this autobiasing system was robustly tested using a face detector framework under both well-lit and low-light conditions, as well as across various frequencies. The results demonstrated significant improvements: enhanced YOLO confidence metrics for face detection, and an increased percentage of frames capturing detected faces. Moreover, the average gradient, which serves as an indicator of flicker presence through edge detection, decreased by 38.2 percent in well-lit conditions and by 53.6 percent in low-light conditions. These findings underscore the potential of our approach to significantly improve the functionality of event cameras in a range of adverse lighting scenarios.
[36] Pinpointing Trigger Moment for Grounded Video QA: Enhancing Spatio-temporal Grounding in Multimodal Large Language Models
Jinhwan Seo,Yoonki Cho,Junhyug Noh,Sung-eui Yoon
Main category: cs.CV
TL;DR: This paper presents a framework for the Grounded Video Question Answering (GVQA) task of the ICCV 2025 Perception Test Challenge, built as a three-stage pipeline and improved by introducing a trigger moment and the CORTEX prompt.
Details
Motivation: The GVQA task requires models to perform complex reasoning over video content, ground answers visually, and track target objects over time; existing methods still fall short in these respects.
Method: Decomposes GVQA into three stages: video reasoning and QA, spatio-temporal grounding, and tracking. The proposed CORTEX prompt generates a trigger moment, the single frame in which the target object is most visible, which serves as the anchor for grounding and tracking.
Result: Achieves a HOTA score of 0.4968 on the GVQA task, a marked improvement over last year's winning score of 0.2704.
Conclusion: By introducing the trigger moment and a three-stage pipeline, the framework achieves a substantial performance gain on GVQA, validating its effectiveness.
Abstract: In this technical report, we introduce a framework to address the Grounded Video Question Answering (GVQA) task for the ICCV 2025 Perception Test Challenge. The GVQA task demands robust multimodal models capable of complex reasoning over video content, grounding the resulting answers visually, and tracking the referenced objects temporally. To achieve this capability, our proposed approach decomposes the GVQA task into a three-stage pipeline: (1) Video Reasoning & QA, (2) Spatio-temporal Grounding and (3) Tracking. Our key contribution is the introduction of a trigger moment, derived from our proposed CORTEX prompt, which pinpoints the single most visible frame of a target object to serve as a robust anchor for grounding and tracking. With this approach, we achieve a HOTA score of 0.4968, which marks a significant improvement over the previous year's winning score of 0.2704 on the GVQA task.
[37] MM-UNet: Morph Mamba U-shaped Convolutional Networks for Retinal Vessel Segmentation
Jiawen Liu,Yuanbo Zeng,Jiaming Liang,Yizhen Yang,Yiheng Zhang,Enhui Cai,Xiaoqi Sheng,Hongmin Cai
Main category: cs.CV
TL;DR: This paper proposes MM-UNet, a new network for retinal vessel segmentation that introduces Morph Mamba Convolution layers and Reverse Selective State Guidance modules to improve perception of thin branching structures and boundary accuracy, achieving higher F1 scores than existing methods on the DRIVE and STARE datasets.
Details
Motivation: Retinal vessels are extremely thin with complex branching, and their global morphology varies widely across images, so conventional segmentation methods struggle to deliver precision and robustness together.
Method: Proposes MM-UNet, comprising Morph Mamba Convolution layers (to enhance perception of branching topology) and Reverse Selective State Guidance modules (to improve geometric boundary awareness and decoding efficiency).
Result: Experiments on the public DRIVE and STARE datasets show that MM-UNet improves F1-score over existing methods by 1.64% and 1.25%, respectively.
Conclusion: MM-UNet shows higher accuracy and robustness in retinal vessel segmentation and has potential for clinical application.
Abstract: Accurate detection of retinal vessels plays a critical role in reflecting a wide range of health status indicators in the clinical diagnosis of ocular diseases. Recently, advances in deep learning have led to a surge in retinal vessel segmentation methods, which have significantly contributed to the quantitative analysis of vascular morphology. However, retinal vasculature differs significantly from conventional segmentation targets in that it consists of extremely thin and branching structures, whose global morphology varies greatly across images. These characteristics continue to pose challenges to segmentation precision and robustness. To address these issues, we propose MM-UNet, a novel architecture tailored for efficient retinal vessel segmentation. The model incorporates Morph Mamba Convolution layers, which replace pointwise convolutions to enhance branching topological perception through morph, state-aware feature sampling. Additionally, Reverse Selective State Guidance modules integrate reverse guidance theory with state-space modeling to improve geometric boundary awareness and decoding efficiency. Extensive experiments conducted on two public retinal vessel segmentation datasets demonstrate the superior performance of the proposed method in segmentation accuracy. Compared to the existing approaches, MM-UNet achieves F1-score gains of 1.64% on DRIVE and 1.25% on STARE, demonstrating its effectiveness and advancement. The project code is public via https://github.com/liujiawen-jpg/MM-UNet.
[38] Language-Enhanced Generative Modeling for PET Synthesis from MRI and Blood Biomarkers
Zhengjie Zhang,Xiaoxie Mao,Qihao Guo,Shaoting Zhang,Qi Huang,Mu Zhou,Fang Xie,Mianxin Liu
Main category: cs.CV
TL;DR: This study proposes a language-enhanced generative model that synthesizes amyloid PET images from blood biomarkers and MRI scans; the generated images show good quality and diagnostic consistency and can be used in a fully automated Alzheimer's disease diagnostic pipeline.
Details
Motivation: Because amyloid PET is costly and not widely accessible, the study aims to predict PET spatial patterns from more accessible blood biomarkers and MRI, lowering the barrier to diagnosis.
Method: Builds a generative model based on a large language model and multimodal information fusion to synthesize PET images from the BBM and T1-weighted MRI data of 566 participants, and evaluates image quality, diagnostic consistency, and clinical applicability.
Result: Synthetic PET images closely resemble real PET (SSIM = 0.920, Pearson's r = 0.955), and diagnoses based on them agree with real-PET diagnoses at an accuracy of 0.80. The synthetic-PET-based diagnostic model reaches an AUC of 0.78, outperforming models that use only MRI or only BBMs, and combining synthetic PET with BBMs raises the AUC to 0.79. Ablation experiments confirm the value of the LLM and of prompt engineering.
Conclusion: The language-enhanced generative model synthesizes high-quality PET images, increases the value of BBMs and MRI in Alzheimer's disease diagnosis, and may help streamline the clinical diagnostic workflow.
Abstract: Background: Alzheimer's disease (AD) diagnosis heavily relies on amyloid-beta positron emission tomography (Abeta-PET), which is limited by high cost and limited accessibility. This study explores whether Abeta-PET spatial patterns can be predicted from blood-based biomarkers (BBMs) and MRI scans. Methods: We collected Abeta-PET images, T1-weighted MRI scans, and BBMs from 566 participants. A language-enhanced generative model, driven by a large language model (LLM) and multimodal information fusion, was developed to synthesize PET images. Synthesized images were evaluated for image quality, diagnostic consistency, and clinical applicability within a fully automated diagnostic pipeline. Findings: The synthetic PET images closely resemble real PET scans in both structural details (SSIM = 0.920 +/- 0.003) and regional patterns (Pearson's r = 0.955 +/- 0.007). Diagnostic outcomes using synthetic PET show high agreement with real PET-based diagnoses (accuracy = 0.80). Using synthetic PET, we developed a fully automatic AD diagnostic pipeline integrating PET synthesis and classification. The synthetic PET-based model (AUC = 0.78) outperforms T1-based (AUC = 0.68) and BBM-based (AUC = 0.73) models, while combining synthetic PET and BBMs further improved performance (AUC = 0.79). Ablation analysis supports the advantages of LLM integration and prompt engineering. Interpretation: Our language-enhanced generative model synthesizes realistic PET images, enhancing the utility of MRI and BBMs for Abeta spatial pattern assessment and improving the diagnostic workflow for Alzheimer's disease.
[39] Object-Centric 3D Gaussian Splatting for Strawberry Plant Reconstruction and Phenotyping
Jiajia Li,Keyi Zhu,Qianwen Zhang,Dong Chen,Qi Sun,Zhaojian Li
Main category: cs.CV
TL;DR: Proposes a 3D reconstruction framework for strawberry plants based on 3D Gaussian Splatting and SAM-2 segmentation, enabling background removal and automatic extraction of key traits.
Details
Motivation: Traditional phenotyping methods are time-consuming, labor-intensive, and destructive, and existing 3D reconstruction methods are strongly affected by background noise, which limits accuracy and efficiency in agricultural applications.
Method: Combines the Segment Anything Model v2 (SAM-2) with alpha-channel masking for foreground segmentation, uses 3D Gaussian Splatting for object-centric, non-destructive 3D reconstruction of strawberry plants, and automatically estimates plant height and canopy width using DBSCAN clustering and Principal Component Analysis (PCA).
Result: Compared with conventional pipelines, the method achieves higher geometric reconstruction accuracy, substantially reduces computation time, and extracts key plant traits accurately and automatically.
Conclusion: The method offers an efficient, scalable, and non-destructive 3D reconstruction solution for strawberry plant phenotyping.
Abstract: Strawberries are among the most economically significant fruits in the United States, generating over $2 billion in annual farm-gate sales and accounting for approximately 13% of the total fruit production value. Plant phenotyping plays a vital role in selecting superior cultivars by characterizing plant traits such as morphology, canopy structure, and growth dynamics. However, traditional plant phenotyping methods are time-consuming, labor-intensive, and often destructive. Recently, neural rendering techniques, notably Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), have emerged as powerful frameworks for high-fidelity 3D reconstruction. By capturing a sequence of multi-view images or videos around a target plant, these methods enable non-destructive reconstruction of complex plant architectures. Despite their promise, most current applications of 3DGS in agricultural domains reconstruct the entire scene, including background elements, which introduces noise, increases computational costs, and complicates downstream trait analysis. To address this limitation, we propose a novel object-centric 3D reconstruction framework incorporating a preprocessing pipeline that leverages the Segment Anything Model v2 (SAM-2) and alpha channel background masking to achieve clean strawberry plant reconstructions. This approach produces more accurate geometric representations while substantially reducing computational time. With a background-free reconstruction, our algorithm can automatically estimate important plant traits, such as plant height and canopy width, using DBSCAN clustering and Principal Component Analysis (PCA).
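The abstract does not say how DBSCAN and PCA are combined, so the following is only a minimal sketch under stated assumptions: DBSCAN keeps the largest dense cluster as the plant and discards stray points, plant height is taken as the vertical extent of that cluster, and canopy width is taken as its extent along the first principal axis of the horizontal (x, y) projection. The function names, the `eps` and `min_pts` values, and the synthetic point cloud are all hypothetical, and DBSCAN is written out in plain Python only to keep the example self-contained.

```python
import math


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one label per point, -1 meaning noise."""
    n = len(points)
    labels = [None] * n  # None = not visited yet

    def neighbors(i):
        # Brute-force radius query (includes the point itself).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:  # j is a core point: keep expanding
                queue.extend(j_nbrs)
    return labels


def principal_axis_2d(xy):
    """Unit vector along the largest-variance direction of 2D points (PCA)."""
    n = len(xy)
    mx = sum(p[0] for p in xy) / n
    my = sum(p[1] for p in xy) / n
    sxx = sum((p[0] - mx) ** 2 for p in xy) / n
    syy = sum((p[1] - my) ** 2 for p in xy) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in xy) / n
    # Closed-form orientation of the leading eigenvector of a 2x2 covariance.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (math.cos(theta), math.sin(theta))


def plant_traits(points, eps=0.3, min_pts=4):
    """Return (height, canopy_width) of the largest DBSCAN cluster."""
    labels = dbscan(points, eps, min_pts)
    kept = [label for label in labels if label != -1]
    if not kept:
        raise ValueError("no dense cluster found")
    main = max(set(kept), key=kept.count)
    plant = [p for p, label in zip(points, labels) if label == main]
    zs = [p[2] for p in plant]
    height = max(zs) - min(zs)
    axis = principal_axis_2d([(p[0], p[1]) for p in plant])
    proj = [p[0] * axis[0] + p[1] * axis[1] for p in plant]
    return height, max(proj) - min(proj)


# Synthetic cloud: a 2.0 x 0.5 x 1.0 block of points plus two stray points.
cloud = [(i * 0.25, j * 0.25, k * 0.25)
         for i in range(9) for j in range(3) for k in range(5)]
cloud += [(10.0, 10.0, 10.0), (-5.0, 7.0, 3.0)]
height, width = plant_traits(cloud)
print(f"height={height:.2f}, canopy width={width:.2f}")
```

For real reconstructions with many thousands of points, a library implementation with a spatial index would be the practical choice; the brute-force neighbor search here is quadratic in the number of points.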
Experimental results show that our method outperforms conventional pipelines in both accuracy and efficiency, offering a scalable and non-destructive solution for strawberry plant phenotyping.
[40] Estimation of Segmental Longitudinal Strain in Transesophageal Echocardiography by Deep Learning
Anders Austlid Taskén,Thierry Judge,Erik Andreas Rye Berg,Jinyang Yu,Bjørnar Grenne,Frank Lindseth,Svend Aakhus,Pierre-Marc Jodoin,Nicolas Duchateau,Olivier Bernard,Gabriel Kiss
Main category: cs.CV
TL;DR: This study proposes autoStrain, a deep-learning-based automated pipeline for estimating left-ventricular segmental longitudinal strain (SLS) in transesophageal echocardiography (TEE); models trained on synthetic data achieve accurate motion estimation and clinically usable SLS results.
Details
Motivation: Current SLS assessment techniques require extensive manual intervention, making them inefficient and too resource-intensive for continuous monitoring. An automated, efficient, and accurate method is needed to make clinical cardiac function assessment more practical.
Method: Compares two deep learning approaches: TeeFlow, based on the RAFT optical flow model (dense frame-to-frame prediction), and TeeTracker, based on the CoTracker point trajectory model (sparse long-sequence prediction). Both are trained and evaluated on a highly realistic synthetic TEE dataset (synTEE, 80 patients), generated with the SIMUS simulation pipeline and carrying ground-truth motion labels.
Result: TeeTracker outperforms TeeFlow in motion estimation, with a mean distance error of 0.65 mm. In clinical validation on 16 patients, autoStrain's SLS estimates agreed with clinical references, with a mean difference of 1.09% (95% limits of agreement: -8.90% to 11.09%). Adding simulated ischemia data improved the models' ability to quantify abnormal deformation.
Conclusion: Combining AI-driven motion estimation with TEE can significantly improve the precision and efficiency of cardiac function assessment, and autoStrain is a promising candidate for a clinically practical automated SLS analysis tool.
Abstract: Segmental longitudinal strain (SLS) of the left ventricle (LV) is an important prognostic indicator for evaluating regional LV dysfunction, in particular for diagnosing and managing myocardial ischemia. Current techniques for strain estimation require significant manual intervention and expertise, limiting their efficiency and making them too resource-intensive for monitoring purposes. This study introduces the first automated pipeline, autoStrain, for SLS estimation in transesophageal echocardiography (TEE) using deep learning (DL) methods for motion estimation. We present a comparative analysis of two DL approaches: TeeFlow, based on the RAFT optical flow model for dense frame-to-frame predictions, and TeeTracker, based on the CoTracker point trajectory model for sparse long-sequence predictions. As ground truth motion data from real echocardiographic sequences are hardly accessible, we took advantage of a unique simulation pipeline (SIMUS) to generate a highly realistic synthetic TEE (synTEE) dataset of 80 patients with ground truth myocardial motion to train and evaluate both models. Our evaluation shows that TeeTracker outperforms TeeFlow in accuracy, achieving a mean distance error in motion estimation of 0.65 mm on a synTEE test dataset. Clinical validation on 16 patients further demonstrated that SLS estimation with our autoStrain pipeline aligned with clinical references, achieving a mean difference (95% limits of agreement) of 1.09% (-8.90% to 11.09%).
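The "mean difference (95% limits of agreement)" figure above is the usual Bland-Altman summary. As a minimal sketch, assuming the conventional definition (mean of the paired differences plus or minus 1.96 sample standard deviations), it could be computed as below; the function name and the strain values are hypothetical and are not data from the study.

```python
import statistics


def limits_of_agreement(estimates, references, z=1.96):
    """Bland-Altman summary: (mean difference, lower limit, upper limit)."""
    diffs = [e - r for e, r in zip(estimates, references)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of differences
    return bias, bias - z * sd, bias + z * sd


# Hypothetical segmental strain values in percent (automated vs. reference).
auto_sls = [-18.0, -15.5, -20.1, -12.3, -17.8, -14.0]
ref_sls = [-19.0, -14.0, -22.0, -13.0, -16.5, -15.5]

bias, lower, upper = limits_of_agreement(auto_sls, ref_sls)
print(f"mean difference = {bias:.2f}%, 95% LoA = [{lower:.2f}%, {upper:.2f}%]")
```

If the reported limits follow this convention, a mean difference of 1.09% with limits of -8.90% to 11.09% corresponds to a standard deviation of the differences of roughly 5.1 percentage points.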
Incorporation of simulated ischemia in the synTEE data improved the accuracy of the models in quantifying abnormal deformation. Our findings indicate that integrating AI-driven motion estimation with TEE can significantly enhance the precision and efficiency of cardiac function assessment in clinical settings.
[41] Can Foundation Models Revolutionize Mobile AR Sparse Sensing?
Yiqin Zhao,Tian Guo
Main category: cs.CV
TL;DR: This study examines the use of foundation models in mobile sparse sensing and, using real-world AR data, shows substantial gains in geometry-aware image warping and 3D scene reconstruction.
Details
Motivation: Because of constraints in computation, power, and other resources, mobile sensing systems have long faced a trade-off between sensing quality and efficiency, and existing sparse sensing methods often lose accuracy because spatio-temporal information is missing.
Method: Uses real-world mobile AR data to evaluate foundation models for geometry-aware image warping and cross-frame information reuse, and studies their scalability in 3D scene reconstruction.
Result: Foundation models markedly improve image-warping accuracy, enable effective reuse of cross-frame information, and deliver leading performance in 3D scene reconstruction.
Conclusion: Foundation models could change the landscape of mobile sparse sensing, but integrating them still poses several open challenges that require further study.
Abstract: Mobile sensing systems have long faced a fundamental trade-off between sensing quality and efficiency due to constraints in computation, power, and other limitations. Sparse sensing, which aims to acquire and process only a subset of sensor data, has been a key strategy for maintaining performance under such constraints. However, existing sparse sensing methods often suffer from reduced accuracy, as missing information across space and time introduces uncertainty into many sensing systems. In this work, we investigate whether foundation models can change the landscape of mobile sparse sensing. Using real-world mobile AR data, our evaluations demonstrate that foundation models offer significant improvements in geometry-aware image warping, a central technique for enabling accurate reuse of cross-frame information. Furthermore, our study demonstrates the scalability of foundation model-based sparse sensing and shows its leading performance in 3D scene reconstruction. Collectively, our study reveals critical aspects of the promises and the open challenges of integrating foundation models into mobile sparse sensing systems.
[42] Collaborative Attention and Consistent-Guided Fusion of MRI and PET for Alzheimer's Disease Diagnosis
Delin Ma,Menghui Zhou,Jun Qi,Yun Yang,Po Yang
Main category: cs.CV
TL;DR: Proposes a multimodal neuroimaging fusion framework for MRI and PET, based on collaborative attention and consistency-guided fusion, for Alzheimer's disease diagnosis, improving classification performance.
Details
Motivation: Existing methods focus mainly on cross-modal complementarity and overlook the importance of modality-specific features, while distributional differences between modalities produce biased, noisy representations that hurt classification performance.
Method: Introduces a learnable parameter representation (LPR) block to compensate for missing modality information, uses a shared encoder and modality-independent encoders to preserve shared and specific features, and designs a consistency-guided mechanism to align latent distributions.
Result: Experiments on the ADNI dataset show that the method outperforms existing fusion strategies in AD diagnosis.
Conclusion: The framework effectively fuses MRI and PET information, preserves both shared and modality-specific features, and reduces bias through distribution alignment, improving the accuracy of early AD diagnosis.
Abstract: Alzheimer's disease (AD) is the most prevalent form of dementia, and its early diagnosis is essential for slowing disease progression. Recent studies on multimodal neuroimaging fusion using MRI and PET have achieved promising results by integrating multi-scale complementary features. However, most existing approaches primarily emphasize cross-modal complementarity while overlooking the diagnostic importance of modality-specific features. In addition, the inherent distributional differences between modalities often lead to biased and noisy representations, degrading classification performance. To address these challenges, we propose a Collaborative Attention and Consistent-Guided Fusion framework for MRI- and PET-based AD diagnosis. The proposed model introduces a learnable parameter representation (LPR) block to compensate for missing modality information, followed by a shared encoder and modality-independent encoders to preserve both shared and specific representations. Furthermore, a consistency-guided mechanism is employed to explicitly align the latent distributions across modalities. Experimental results on the ADNI dataset demonstrate that our method achieves superior diagnostic performance compared with existing fusion strategies.
[43] Monocular absolute depth estimation from endoscopy via domain-invariant feature learning and latent consistency
Hao Li,Daiwei Lu,Jesse d'Almeida,Dilara Isik,Ehsan Khodapanah Aghdam,Nick DiSanto,Ayberk Acar,Susheela Sharma,Jie Ying Wu,Robert J. Webster III,Ipek Oguz
Main category: cs.CV
TL;DR: This paper proposes a latent feature alignment method to improve absolute depth estimation in endoscopic videos of the central airway, using adversarial learning and directional feature consistency to reduce the domain gap between real and synthetic images.
Details
Motivation: Absolute depth is hard to obtain from endoscopic images of surgical scenes, which limits supervised learning on real images, so effective ways to narrow the domain gap between synthetic and real images are needed.
Method: Uses adversarial learning and directional feature consistency, independent of the image translation process, so that the depth network learns domain-invariant latent features and improves depth estimation.
Result: Evaluated on endoscopic videos of central airway phantoms, the method outperforms state-of-the-art methods on both absolute and relative depth metrics and improves results consistently across backbones and pretrained weights.
Conclusion: The proposed latent feature alignment method effectively reduces the domain gap and markedly improves monocular depth estimation accuracy, making it suitable for depth perception in autonomous medical robots.
Abstract: Monocular depth estimation (MDE) is a critical task to guide autonomous medical robots. However, obtaining absolute (metric) depth from an endoscopy camera in surgical scenes is difficult, which limits supervised learning of depth on real endoscopic images. Current image-level unsupervised domain adaptation methods translate synthetic images with known depth maps into the style of real endoscopic frames and train depth networks using these translated images with their corresponding depth maps. However, a domain gap often remains between real and translated synthetic images. In this paper, we present a latent feature alignment method to improve absolute depth estimation by reducing this domain gap in the context of endoscopic videos of the central airway. Our methods are agnostic to the image translation process and focus on the depth estimation itself. Specifically, the depth network takes translated synthetic and real endoscopic frames as input and learns latent domain-invariant features via adversarial learning and directional feature consistency. The evaluation is conducted on endoscopic videos of central airway phantoms with manually aligned absolute depth maps. Compared to state-of-the-art MDE methods, our approach achieves superior performance on both absolute and relative depth metrics, and consistently improves results across various backbones and pretrained weights. Our code is available at https://github.com/MedICL-VU/MDE.
[44] Medical Report Generation: A Hierarchical Task Structure-Based Cross-Modal Causal Intervention Framework
Yucheng Song,Yifan Ge,Junhao Li,Zhining Liao,Zhifang Liao
Main category: cs.CV
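TL;DR: This paper proposes HTSC-CIF, a new hierarchical task decomposition framework that addresses three challenges in medical report generation: insufficient understanding of domain knowledge, poor text-visual entity alignment, and spurious correlations caused by cross-modal bias.
Details
Motivation: Existing medical report generation models, when describing lesions, suffer from insufficient understanding of domain knowledge, poor cross-modal alignment, and spurious correlations arising from cross-modal bias. Previous work addresses only one of these problems at a time and lacks a systematic solution.
Method: Proposes the HTSC-CIF framework, which splits the task into three levels. The low level strengthens the visual encoder's understanding of medical entities through spatial alignment; the mid level uses Prefix Language Modeling and Masked Image Modeling for cross-modal alignment via mutual guidance; the high level adds a causal intervention module based on front-door intervention to remove the influence of confounders.
Result: Experiments show that HTSC-CIF significantly outperforms current state-of-the-art medical report generation methods and achieves the best performance on multiple metrics.
Conclusion: Through its hierarchical strategy, HTSC-CIF addresses the key challenges of medical report generation together, improving generation quality and model interpretability and offering a new direction for multimodal medical AI.
Abstract: Medical Report Generation (MRG) is a key part of modern medical diagnostics, as it automatically generates reports from radiological images to reduce radiologists' burden. However, reliable MRG models for lesion description face three main challenges: insufficient domain knowledge understanding, poor text-visual entity embedding alignment, and spurious correlations from cross-modal biases. Previous work only addresses single challenges, while this paper tackles all three via a novel hierarchical task decomposition approach, proposing the HTSC-CIF framework. HTSC-CIF classifies the three challenges into low-, mid-, and high-level tasks: 1) Low-level: align medical entity features with spatial locations to enhance domain knowledge for visual encoders; 2) Mid-level: use Prefix Language Modeling (text) and Masked Image Modeling (images) to boost cross-modal alignment via mutual guidance; 3) High-level: a cross-modal causal intervention module (via front-door intervention) to reduce confounders and improve interpretability. Extensive experiments confirm HTSC-CIF's effectiveness, significantly outperforming state-of-the-art (SOTA) MRG methods. Code will be made public upon paper acceptance.
[45] Are Euler angles a useful rotation parameterisation for pose estimation with Normalizing Flows?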
Giorgos Sfikas,Konstantina Nikolaidou,Foteini Papadopoulou,George Retsinas,Anastasios L. Kesidis
Main category: cs.CV
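TL;DR: This paper examines how effective the Euler angles parameterisation is as the basis of a Normalizing Flows model for object pose estimation; despite the shortcomings of Euler angles, they may still yield useful models in some respects compared with more complex parameterisations.
Details
Motivation: Pose estimation can be ambiguous because of sensor and projection constraints or the symmetries of the object itself, so a probabilistic pose output is needed; Euler angles are a classic rotation representation whose potential in probabilistic modelling has not been fully explored.
Method: Uses an Euler-angle parameterisation together with a Normalizing Flows model for probabilistic 3D object pose estimation, and compares it with other, more complex parameterisations.
Result: The study indicates that, although Euler angles suffer from problems such as singularities, they can still support an effective and practical probabilistic pose estimation model under certain conditions and perform well in some application scenarios.
Conclusion: Euler angles have limitations, but as the basis of a probabilistic pose estimation model they are feasible in practice and have potential advantages that merit further study.
Abstract: Object pose estimation is a task that is of central importance in 3D Computer Vision. Given a target image and a canonical pose, a single point estimate may very often be sufficient; however, a probabilistic pose output offers a number of benefits when pose is ambiguous due to sensor and projection constraints or inherent object symmetries. With this paper, we explore the usefulness of using the well-known Euler angles parameterisation as a basis for a Normalizing Flows model for pose estimation. Isomorphic to spatial rotation, 3D pose has been parameterized in a number of ways, either in or out of the context of parameter estimation. We explore the idea that Euler angles, despite their shortcomings, may lead to useful models in a number of aspects, compared to a model built on a more complex parameterisation.
[46] SAIL-RL: Guiding MLLMs in When and How to Think via Dual-Reward RL Tuning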
Fangxun Shu,Yongjie Ye,Yue Liao,Zijian Kang,Weijie Yin,Jiacong Wang,Xiao Liang,Shuicheng Yan,Chao Feng
Main category: cs.CV
TL;DR: SAIL-RL is a reinforcement learning post-training framework that improves the reasoning ability of multimodal large language models by teaching them when and how to think.
Details
Motivation: Existing methods are limited by outcome-only supervision and uniform thinking strategies: they cannot guarantee sound reasoning, and they tend to overthink simple tasks and underthink complex ones.
Method: Proposes a dual reward system: a Thinking Reward (which evaluates factual grounding, logical coherence, and answer consistency) and a Judging Reward (which adaptively decides whether deep reasoning or a direct answer is appropriate).
Result: Experiments on SAIL-VL2 show that SAIL-RL improves reasoning and multimodal understanding at both the 4B and 8B scales, performs competitively with GPT-4o, and substantially reduces hallucinations.
Conclusion: SAIL-RL provides a principled framework for building more reliable and adaptive multimodal large language models.
Abstract: We introduce SAIL-RL, a reinforcement learning (RL) post-training framework that enhances the reasoning capabilities of multimodal large language models (MLLMs) by teaching them when and how to think. Existing approaches are limited by outcome-only supervision, which rewards correct answers without ensuring sound reasoning, and by uniform thinking strategies, which often lead to overthinking on simple tasks and underthinking on complex ones. SAIL-RL addresses these challenges with a dual reward system: the Thinking Reward, which evaluates reasoning quality through factual grounding, logical coherence, and answer consistency, and the Judging Reward, which adaptively determines whether deep reasoning or direct answering is appropriate. Experiments on the state-of-the-art SAIL-VL2 show that SAIL-RL improves reasoning and multimodal understanding benchmarks at both 4B and 8B scales, achieving competitive performance against commercial closed-source models such as GPT-4o, and substantially reduces hallucinations, establishing it as a principled framework for building more reliable and adaptive MLLMs. The code will be available at https://github.com/BytedanceDouyinContent/SAIL-RL.
[47] Link prediction Graph Neural Networks for structure recognition of Handwritten Mathematical Expressions
Cuong Tuan Nguyen,Ngoc Tuan Nguyen,Triet Hoang Minh Dao,Huy Minh Nhat,Huy Truong Dinh
Main category: cs.CV
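TL;DR: Proposes a Graph Neural Network (GNN)-based method for Handwritten Mathematical Expression (HME) recognition that models an HME as a graph and jointly refines symbol relations with a BLSTM and a GNN, improving structure recognition.
Details
Motivation: Conventional HME recognition methods are limited when handling complex spatial structure and struggle to capture the spatial dependencies between symbols accurately.
Method: Models an HME as a graph in which nodes represent symbols and edges represent spatial relations. A deep BLSTM performs symbol segmentation, recognition, and spatial relation classification to build an initial graph; a 2D-CFG parser generates the possible spatial relations; and a GNN performs link prediction to remove redundant connections, yielding the final Symbol Label Graph.
Result: Experiments show that the method performs well on HME structure recognition and improves recognition accuracy.
Conclusion: The proposed GNN-based method models the complex spatial structure of HMEs effectively and markedly improves handwritten mathematical expression recognition.
Abstract: We propose a Graph Neural Network (GNN)-based approach for Handwritten Mathematical Expression (HME) recognition by modeling HMEs as graphs, where nodes represent symbols and edges capture spatial dependencies. A deep BLSTM network is used for symbol segmentation, recognition, and spatial relation classification, forming an initial primitive graph. A 2D-CFG parser then generates all possible spatial relations, while the GNN-based link prediction model refines the structure by removing unnecessary connections, ultimately forming the Symbol Label Graph. Experimental results demonstrate the effectiveness of our approach, showing promising performance in HME structure recognition.
[48] CoCoVa: Chain of Continuous Vision-Language Thought for Latent Space Reasoning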
Jizheng Ma,Xiaofei Zhou,Yanlong Song,Han Yan
Main category: cs.CV
TL;DR: This paper proposes CoCoVa (Chain of Continuous Vision-Language Thought), a framework that introduces continuous cross-modal reasoning to overcome the restriction of existing vision-language models to discrete language tokens.
Details
Motivation: Existing vision-language models reason only within the space of discrete language tokens, which cannot fully express high-dimensional, rich visual perception or emulate the non-verbal, tacit thinking of humans.
Method: Proposes the CoCoVa framework, whose core is an iterative reasoning cycle in which a new Latent Q-Former (LQ-Former) acts as a dynamic reasoning engine, repeatedly refining a chain of latent thought vectors through cross-modal fusion. A dynamic token selection mechanism focuses on salient visual regions, and the model is trained with a multi-task objective combining contrastive learning and diffusion-based reconstruction so that latent representations stay aligned with both the visual and textual modalities.
Result: Experiments show that CoCoVa outperforms strong baselines on multiple benchmarks. With a 1.5B-parameter backbone it matches or surpasses 7B-9B models, and it remains competitive when scaled to 7B LLM backbones. Qualitative analysis shows that its latent space captures interpretable, structured reasoning patterns.
Conclusion: Through continuous reasoning in latent space, CoCoVa bridges the representational gap between discrete language processing and continuous visual understanding, demonstrating the potential of continuous cross-modal thought for vision-language tasks.
Abstract: In human cognition, there exist numerous thought processes that are tacit and beyond verbal expression, enabling us to understand and interact with the world in multiple ways. However, contemporary Vision-Language Models (VLMs) remain constrained to reasoning within the discrete and rigid space of linguistic tokens, thereby bottlenecking the rich, high-dimensional nature of visual perception. To bridge this gap, we propose CoCoVa (Chain of Continuous Vision-Language Thought), a novel framework for vision-language models that leverages continuous cross-modal reasoning for diverse vision-language tasks. The core of CoCoVa is an iterative reasoning cycle, where a novel Latent Q-Former (LQ-Former) acts as a dynamic reasoning engine, iteratively refining a chain of latent thought vectors through cross-modal fusion. To focus this process, a token selection mechanism dynamically identifies salient visual regions, mimicking attentional focus. To ensure these latent thoughts remain grounded, we train the model with a multi-task objective that combines contrastive learning and diffusion-based reconstruction, enforcing alignment between latent representations and both visual and textual modalities. Evaluations show CoCoVa improves accuracy and token efficiency over strong baselines. With a 1.5B backbone, it competes with or surpasses larger 7B-9B models on almost all benchmarks. When scaled to 7B LLM backbones, it remains competitive with state-of-the-art models. Qualitative analysis validates that the learned latent space captures interpretable and structured reasoning patterns, highlighting the potential of CoCoVa to bridge the representational gap between discrete language processing and the continuous nature of visual understanding.
[49] Cycle-Sync: Robust Global Camera Pose Estimation through Enhanced Cycle-Consistent Synchronization
Shaohan Li,Yunpeng Shi,Gilad Lerman
Main category: cs.CV
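TL;DR: This paper proposes Cycle-Sync, a robust global framework for estimating camera poses (rotations and locations). Its core is a modified message-passing least squares (MPLS) solver that emphasizes cycle-consistent information, redefines cycle consistency, and adds a Welsch-type robust loss; it achieves exact recovery at the lowest sample complexity currently known and outperforms existing methods on synthetic and real data.
Details
Motivation: Existing camera pose estimation methods are limited in robustness and sample efficiency; in particular, they depend on bundle adjustment and are sensitive to outliers, so a more robust global solution that does not need bundle adjustment is required.
Method: Proposes the Cycle-Sync framework, which adapts message-passing least squares (MPLS) to camera location estimation. It redefines cycle consistency using distances estimated in previous iterations, adds a Welsch-type robust loss, integrates a plug-and-play outlier rejection module inspired by robust subspace recovery, and fully incorporates cycle consistency into rotation synchronization.
Result: Establishes the strongest known deterministic exact-recovery guarantee, showing that cycle consistency alone achieves the lowest known sample complexity. Experiments on synthetic and real datasets show that the method consistently outperforms leading pose estimators, including full SfM pipelines with bundle adjustment.
Conclusion: By fully exploiting cycle-consistency information and robust optimization, Cycle-Sync provides an efficient, robust global camera pose estimation scheme that needs no bundle adjustment and markedly improves estimation accuracy and robustness.
Abstract: We introduce Cycle-Sync, a robust and global framework for estimating camera poses (both rotations and locations). Our core innovation is a location solver that adapts message-passing least squares (MPLS) -- originally developed for group synchronization -- to camera location estimation. We modify MPLS to emphasize cycle-consistent information, redefine cycle consistencies using estimated distances from previous iterations, and incorporate a Welsch-type robust loss. We establish the strongest known deterministic exact-recovery guarantee for camera location estimation, showing that cycle consistency alone -- without access to inter-camera distances -- suffices to achieve the lowest sample complexity currently known. To further enhance robustness, we introduce a plug-and-play outlier rejection module inspired by robust subspace recovery, and we fully integrate cycle consistency into MPLS for rotation synchronization. Our global approach avoids the need for bundle adjustment. Experiments on synthetic and real datasets show that Cycle-Sync consistently outperforms leading pose estimators, including full structure-from-motion pipelines with bundle adjustment.
[50] DetectiumFire: A Comprehensive Multi-modal Dataset Bridging Vision and Language for Fire Understanding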
Zixuan Liu,Siavash H. Khajavi,Guangkai Jiang
Main category: cs.CV
TL;DR: This paper introduces DetectiumFire, a large-scale multimodal dataset of fire-related images and videos comprising 22.5k high-resolution images and 2.5k real-world videos that cover many fire types, environments, and risk levels. It is annotated with computer vision labels and detailed text descriptions to support tasks such as fire understanding, synthetic data generation, and risk reasoning.
Details
Motivation: The fire domain lacks high-quality, publicly available datasets with multimodal annotations, which limits the use of multimodal models for fire detection and reasoning.
Method: Builds a large-scale dataset named DetectiumFire containing images and videos, annotated with bounding boxes and detailed text prompts, and validates it on several tasks including object detection, diffusion-based generation, and vision-language reasoning.
Result: Experiments show that DetectiumFire surpasses existing benchmarks in scale, diversity, data quality, and scenario coverage, and its usefulness is validated across multiple tasks, substantially strengthening the data available for fire-related AI research.
Conclusion: DetectiumFire is an important resource for fire understanding and the development of intelligent safety systems; the authors have released the dataset to encourage research on fire scenarios in the AI community.
Abstract: Recent advances in multi-modal models have demonstrated strong performance in tasks such as image generation and reasoning. However, applying these models to the fire domain remains challenging due to the lack of publicly available datasets with high-quality fire domain annotations. To address this gap, we introduce DetectiumFire, a large-scale, multi-modal dataset comprising 22.5k high-resolution fire-related images and 2.5k real-world fire-related videos covering a wide range of fire types, environments, and risk levels. The data are annotated with both traditional computer vision labels (e.g., bounding boxes) and detailed textual prompts describing the scene, enabling applications such as synthetic data generation and fire risk reasoning. DetectiumFire offers clear advantages over existing benchmarks in scale, diversity, and data quality, significantly reducing redundancy and enhancing coverage of real-world scenarios. We validate the utility of DetectiumFire across multiple tasks, including object detection, diffusion-based image generation, and vision-language reasoning. Our results highlight the potential of this dataset to advance fire-related research and support the development of intelligent safety systems. We release DetectiumFire to promote broader exploration of fire understanding in the AI community. The dataset is available at https://kaggle.com/datasets/38b79c344bdfc55d1eed3d22fbaa9c31fad45e27edbbe9e3c529d6e5c4f93890
[51] GAFD-CC: Global-Aware Feature Decoupling with Confidence Calibration for OOD Detection
Kun Zou,Yongheng Xu,Jianxing Yu,Yan Pan,Jian Yin,Hanjiang Lai
Main category: cs.CV
TL;DR: Proposes GAFD-CC, a new post-hoc OOD detection method that sharpens decision boundaries through global-aware feature decoupling and confidence calibration.
Details
Motivation: Existing post-hoc OOD detection methods often overlook the inherent correlation between features and logits, which hurts detection performance.
Method: Global-aware feature decoupling guided by classification weights separates positively and negatively correlated features, which are then adaptively fused with multi-scale logit-based confidence.
Result: Experiments on large-scale benchmarks show that the method performs competitively against state-of-the-art methods and generalizes well.
Conclusion: By exploiting the correlation between features and logits, GAFD-CC effectively improves OOD detection performance.
Abstract: Out-of-distribution (OOD) detection is paramount to ensuring the reliability and robustness of learning models in real-world applications. Existing post-hoc OOD detection methods detect OOD samples by leveraging their features and logits information without retraining. However, they often overlook the inherent correlation between features and logits, which is crucial for effective OOD detection. To address this limitation, we propose Global-Aware Feature Decoupling with Confidence Calibration (GAFD-CC). GAFD-CC aims to refine decision boundaries and increase discriminative performance. First, it performs global-aware feature decoupling guided by classification weights. This involves aligning features with the direction of global classification weights to decouple them. From this, GAFD-CC extracts two types of critical information: positively correlated features that promote in-distribution (ID)/OOD boundary refinement and negatively correlated features that suppress false positives and tighten these boundaries. Second, it adaptively fuses these decoupled features with multi-scale logit-based confidence for comprehensive and robust OOD detection. Extensive experiments on large-scale benchmarks demonstrate GAFD-CC's competitive performance and strong generalization ability compared to those of state-of-the-art methods.
[52] UniChange: Unifying Change Detection with Multimodal Large Language Model
Xu Zhang,Danyang Li,Xiaohang Dong,Tianhao Wu,Hualong Yu,Jianye Wang,Qicheng Li,Xiang Li
Main category: cs.CV
TL;DR: This paper proposes UniChange, the first unified change detection model based on a multimodal large language model (MLLM). By introducing special tokens and text prompts, it unifies binary change detection (BCD) and semantic change detection (SCD) and reaches SOTA performance on multiple benchmarks.
Details
Motivation: Existing change detection models typically learn from a single type of annotated data and cannot jointly exploit BCD and SCD datasets, which leads to poor generalization and limited versatility.
Method: UniChange builds on the language priors and unification capabilities of MLLMs. It introduces three special tokens, [T1], [T2], and [CHANGE], and uses text prompts to guide the identification of change categories, removing the dependence on predefined classification heads.
Result: On four public benchmarks (WHU-CD, S2Looking, LEVIR-CD+, and SECOND), UniChange achieves IoU scores of 90.41, 53.04, 78.87, and 57.62, respectively, surpassing all previous methods.
Conclusion: UniChange unifies BCD and SCD in a single model, fuses knowledge from multiple sources, and generalizes well, offering a new paradigm for change detection.
Abstract: Change detection (CD) is a fundamental task for monitoring and analyzing land cover dynamics. While recent high-performance models and high-quality datasets have significantly advanced the field, a critical limitation persists. Current models typically acquire limited knowledge from single-type annotated data and cannot concurrently leverage diverse binary change detection (BCD) and semantic change detection (SCD) datasets. This constraint leads to poor generalization and limited versatility. The recent advancements in Multimodal Large Language Models (MLLMs) introduce new possibilities for a unified CD framework. We leverage the language priors and unification capabilities of MLLMs to develop UniChange, the first MLLM-based unified change detection model. UniChange integrates generative language abilities with specialized CD functionalities. Our model successfully unifies both BCD and SCD tasks through the introduction of three special tokens: [T1], [T2], and [CHANGE]. Furthermore, UniChange utilizes text prompts to guide the identification of change categories, eliminating the reliance on predefined classification heads. This design allows UniChange to effectively acquire knowledge from multi-source datasets, even when their class definitions conflict. Experiments on four public benchmarks (WHU-CD, S2Looking, LEVIR-CD+, and SECOND) demonstrate SOTA performance, achieving IoU scores of 90.41, 53.04, 78.87, and 57.62, respectively, surpassing all previous methods.
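The IoU scores quoted above are intersection-over-union values between predicted and reference change masks. As a minimal sketch of how such a score is computed for one binary mask pair (the toy masks and the function name `binary_iou` are illustrative assumptions; the abstract does not say how the authors aggregate IoU across images or classes):

```python
def binary_iou(pred, target):
    """Intersection over union of two binary masks (equal-shaped nested lists of 0/1)."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


# Hypothetical 2x3 change masks: 1 marks a changed pixel.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]

score = binary_iou(pred, target)
print(f"IoU = {100 * score:.2f}")
```

Here two pixels are marked as changed in both masks and four pixels are marked in at least one, so the ratio is 2/4. Benchmark tables usually report this ratio scaled by 100.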
The code is available at https://github.com/Erxucomeon/UniChange.
[53] M3PD Dataset: Dual-view Photoplethysmography (PPG) Using Front-and-rear Cameras of Smartphones in Lab and Clinical Settings
Jiankai Tang,Tao Zhang,Jia Li,Yiru Zhang,Mingyu Zhang,Kegang Wang,Yuming Hao,Bolin Wang,Haiyang Li,Xingyao Wang,Yuanchun Shi,Yuntao Wang,Sichong Qian
Main category: cs.CV
TL;DR: Proposes the M3PD dataset and the F3Mamba model, which use dual-view smartphone video for reliable contactless heart-rate monitoring and substantially improve accuracy and robustness in real-world settings.
Details
Motivation: Existing video-based physiological monitoring is limited by motion artifacts, lighting variations, and single-view constraints, and there is no large-scale public dataset covering cardiovascular patients, which makes cross-device validation difficult.
Method: The authors build M3PD, the first publicly available dual-view mobile PPG dataset, with synchronized facial and fingertip videos from 60 participants (including 47 cardiovascular patients), and propose F3Mamba, a Mamba-based temporal model that fuses the two views to estimate heart rate.
Result: F3Mamba reduces heart-rate error by 21.9% to 30.2% relative to single-view baselines and is more robust in challenging real-world scenarios.
Conclusion: Dual-view fusion combined with Mamba-based temporal modeling improves the accuracy and practicality of smartphone video PPG heart-rate monitoring, offering a new option for portable monitoring of cardiovascular disease.
Abstract: Portable physiological monitoring is essential for early detection and management of cardiovascular disease, but current methods often require specialized equipment that limits accessibility or impose impractical postures that patients cannot maintain. Video-based photoplethysmography on smartphones offers a convenient noninvasive alternative, yet it still faces reliability challenges caused by motion artifacts, lighting variations, and single-view constraints. Few studies have demonstrated reliable application to cardiovascular patients, and no widely used open datasets exist for cross-device accuracy. To address these limitations, we introduce the M3PD dataset, the first publicly available dual-view mobile photoplethysmography dataset, comprising synchronized facial and fingertip videos captured simultaneously via front and rear smartphone cameras from 60 participants (including 47 cardiovascular patients). Building on this dual-view setting, we further propose F3Mamba, which fuses the facial and fingertip views through Mamba-based temporal modeling. The model reduces heart-rate error by 21.9 to 30.2 percent over existing single-view baselines while improving robustness in challenging real-world scenarios. Data and code: https://github.com/Health-HCI-Group/F3Mamba.
[54] VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation
Kevin Qinghong Lin,Yuhao Zheng,Hangyu Ran,Dantong Zhu,Dongxing Mao,Linjie Li,Philip Torr,Alex Jinpeng Wang
Main category: cs.CV
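TL;DR: This paper proposes the VCode benchmark, which reframes multimodal understanding as SVG code generation, and introduces the CodeVQA evaluation protocol to measure symbolic fidelity. It also proposes VCoder, an agentic framework that improves visual-centric coding through "Thinking with Revision" and "Acting with Visual Tools" and clearly outperforms existing VLMs.
Details
Motivation: Prior work has concentrated on language-centric coding tasks and paid little attention to visual-centric coding. Inspired by how humans reason over sketches, the authors advocate SVG code as a compact, interpretable, and executable visual representation.
Method: VCode covers three domains (general commonsense, professional disciplines, and visual-centric perception) and requires a model to generate, from an image, SVG code that preserves symbolic meaning. The CodeVQA protocol assesses symbolic fidelity by answering questions over the rendered SVG. The VCoder framework strengthens a VLM's visual coding ability with iterative revision and external visual tools such as detectors and parsers.
Result: Frontier VLMs with strong reasoning score well overall yet remain limited in professional knowledge and 3D reasoning. VCoder gains 12.3 points overall over Claude-4-Opus. Both humans and models perform worse on rendered SVGs, but their consistency points to the promise of symbolic visual representation.
Conclusion: SVG code is a promising visual representation, and VCoder narrows the gap between language-centric and visual-centric coding through agentic revision and external visual tools, advancing multimodal models in visual understanding and symbolic expression.
Abstract: Code has emerged as a precise and executable medium for reasoning and action in the agent era. Yet, progress has largely focused on language-centric tasks such as program synthesis and debugging, leaving visual-centric coding underexplored. Inspired by how humans reason over sketches, we advocate SVG code as a compact, interpretable, and executable visual representation. We introduce VCode, a benchmark that reframes multimodal understanding as code generation: given an image, a model must produce SVG that preserves symbolic meaning for downstream reasoning. VCode covers three domains - general commonsense (MM-Vet), professional disciplines (MMMU), and visual-centric perception (CV-Bench). To assess symbolic fidelity, we propose CodeVQA, a novel evaluation protocol in which a policy model answers questions over rendered SVGs; correct answers indicate faithful symbolic preservation. Empirically, frontier VLMs struggle to generate faithful SVGs, revealing a persistent gap between language-centric and visual-centric coding. To close this gap, we introduce VCoder, an agentic framework that augments VLMs along two axes: (i) Thinking with Revision, which iteratively analyzes discrepancies and refines SVG code; and (ii) Acting with Visual Tools, where detectors and parsers supply structured cues such as objects, shapes, and text beyond the model's intrinsic capacity.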
Across benchmarks, frontier VLMs with strong reasoning capabilities score well overall yet remain limited in professional knowledge and 3D reasoning. VCoder delivers a 12.3-point overall gain over the top-performing Claude-4-Opus. Human studies show that both humans and VLMs perform worse on rendered SVGs, but their consistency reveals the promise of symbolic visual representation. The benchmark and code are available at https://github.com/CSU-JPG/VCode.
[55] RxnCaption: Reformulating Reaction Diagram Parsing as Visual Prompt Guided Captioning
Jiahe Song,Chuang Wang,Bowen Jiang,Yinfan Wang,Hao Zheng,Xingjian Wei,Chengjin Liu,Junyuan Gao,Yubin Wang,Lijun Wu,Jiang Wu,Qian Yu,Conghui He
Main category: cs.CV
TL;DR: Proposes the RxnCaption framework, which recasts chemical reaction diagram parsing as an image captioning problem. Using large vision-language models with the "BBox and Index as Visual Prompt" (BIVP) strategy, it markedly improves structural extraction quality, and the accompanying large-scale RxnCaption-11k dataset advances structured information extraction from chemical literature.
Details
Motivation: Existing chemical reaction data mostly appear as images in papers, so they are not machine-readable and cannot be used to train machine learning models. The lack of large-scale machine-readable reaction datasets limits AI applications in chemistry.
Method: The RxnCaption framework converts traditional coordinate-prediction parsing into an image captioning task. Its BIVP strategy uses MolYOLO to detect molecular bounding boxes and draw them, with indices, onto the input image, so that downstream parsing becomes a natural-language generation problem handled by a large vision-language model.
Result: BIVP markedly improves structural extraction quality while simplifying model design. The RxnCaption-11k dataset contains about 11k samples, an order of magnitude more than prior real-world literature benchmarks, and RxnCaption-VL reaches SOTA performance on multiple metrics.
Conclusion: The method, dataset, and models advance structured information extraction from chemical literature and may foster broader AI applications in chemistry.
Abstract: Large-scale chemical reaction datasets are crucial for AI research in chemistry. However, existing chemical reaction data often exist as images within papers, making them not machine-readable and unusable for training machine learning models. In response to this challenge, we propose the RxnCaption framework for the task of chemical Reaction Diagram Parsing (RxnDP). Our framework reformulates the traditional coordinate prediction driven parsing process into an image captioning problem, which Large Vision-Language Models (LVLMs) handle naturally. We introduce a strategy termed "BBox and Index as Visual Prompt" (BIVP), which uses our state-of-the-art molecular detector, MolYOLO, to pre-draw molecular bounding boxes and indices directly onto the input image. This turns the downstream parsing into a natural-language description problem. Extensive experiments show that the BIVP strategy significantly improves structural extraction quality while simplifying model design. We further construct the RxnCaption-11k dataset, an order of magnitude larger than prior real-world literature benchmarks, with a balanced test subset across four layout archetypes. Experiments demonstrate that RxnCaption-VL achieves state-of-the-art performance on multiple metrics. We believe our method, dataset, and models will advance structured information extraction from chemical literature and catalyze broader AI applications in chemistry. We will release data, models, and code on GitHub.
[56] Self-Supervised Moving Object Segmentation of Sparse and Noisy Radar Point Clouds
Leon Schwarzer,Matthias Zeller,Daniel Casado Herraez,Simon Dierl,Michael Heidingsfeld,Cyrill Stachniss
Main category: cs.CV
TL;DR: Proposes a clustering-based contrastive loss, with cluster refinement via dynamic point removal, for self-supervised moving object segmentation of radar point clouds, improving label efficiency and performance.
Details
Motivation: Radar point clouds are sparse and noisy, so annotation is costly, and supervised methods that depend on large labeled datasets are hard to apply. Camera- and LiDAR-based methods also add latency because they must accumulate temporal sequences. An efficient moving object segmentation method with low annotation requirements is therefore needed.
Method: A two-step approach: the network is first pretrained with clustering-based contrastive self-supervised learning to produce motion-aware representations, then fine-tuned with supervision on a small amount of annotated data. The authors propose a new clustering-based contrastive loss with cluster refinement based on dynamic point removal.
Result: The method enables self-supervised moving object segmentation on sparse, noisy radar point clouds, markedly improves label efficiency, and surpasses state-of-the-art methods after fine-tuning.
Conclusion: Self-supervised pretraining improves moving object segmentation on radar point clouds and reduces reliance on large annotated datasets, which suits safe and reliable autonomous driving systems.
Abstract: Moving object segmentation is a crucial task for safe and reliable autonomous mobile systems like self-driving cars, improving the reliability and robustness of subsequent tasks like SLAM or path planning. While the segmentation of camera or LiDAR data is widely researched and achieves great results, it often introduces an increased latency by requiring the accumulation of temporal sequences to gain the necessary temporal context. Radar sensors overcome this problem with their ability to provide a direct measurement of a point's Doppler velocity, which can be exploited for single-scan moving object segmentation. However, radar point clouds are often sparse and noisy, making data annotation for use in supervised learning very tedious, time-consuming, and cost-intensive. To overcome this problem, we address the task of self-supervised moving object segmentation of sparse and noisy radar point clouds. We follow a two-step approach of contrastive self-supervised representation learning with subsequent supervised fine-tuning using limited amounts of annotated data. We propose a novel clustering-based contrastive loss function with cluster refinement based on dynamic point removal to pretrain the network to produce motion-aware representations of the radar data. Our method improves label efficiency after fine-tuning, effectively boosting state-of-the-art performance by self-supervised pretraining.
[57] A Novel Grouping-Based Hybrid Color Correction Algorithm for Color Point Clouds
Kuo-Liang Chung,Ting-Chung Tang
Main category: cs.CV
TL;DR: This paper proposes a grouping-based hybrid color correction algorithm for color consistency correction of point clouds. It adaptively partitions the target point cloud into proximity groups, corrects each with KBI, JKHE, or HE, and demonstrates superior performance on 1086 test pairs.
Details
Motivation: Existing color correction methods mainly target images and do not handle point cloud data well, so a consistency correction method designed for color point clouds is needed.
Method: The algorithm first estimates the overlapping rate between the source and target point clouds, then adaptively partitions the target points into two or three groups depending on that rate. The close proximity group Gcl is corrected with K-nearest neighbors based bilateral interpolation (KBI), the moderate proximity group Gmod with joint KBI and histogram equalization (JKHE), and the distant proximity group Gdist with histogram equalization (HE).
Result: Experiments on 1086 color point cloud pairs show that the algorithm outperforms state-of-the-art methods and achieves good color consistency correction; the grouping-effect free property and an ablation study are also discussed.
Conclusion: The grouping-based hybrid algorithm improves the color consistency of color point clouds, has practical value, and its code is open source.
Abstract: Color consistency correction for color point clouds is a fundamental and important task in 3D rendering and compression applications. Most previous color correction methods aimed at correcting color for color images. The purpose of this paper is to propose a grouping-based hybrid color correction algorithm for color point clouds. Our algorithm begins by estimating the overlapping rate between the aligned source and target point clouds, and then adaptively partitions the target points into two groups, namely the close proximity group Gcl and the moderate proximity group Gmod, or three groups, namely Gcl, Gmod, and the distant proximity group Gdist, when the estimated overlapping rate is low or high, respectively. To correct color for target points in Gcl, a K-nearest neighbors based bilateral interpolation (KBI) method is proposed. To correct color for target points in Gmod, a joint KBI and histogram equalization (JKHE) method is proposed. For target points in Gdist, a histogram equalization (HE) method is proposed for color correction. Finally, we discuss the grouping-effect free property and the ablation study in our algorithm. The desired color consistency correction benefit of our algorithm has been justified through 1086 testing color point cloud pairs against the state-of-the-art methods. The C++ source code of our algorithm can be accessed from the website: https://github.com/ivpml84079/Point-cloud-color-correction.
[58] Purrturbed but Stable: Human-Cat Invariant Representations Across CNNs, ViTs and Self-Supervised ViTs
Arya Shah,Vaibhav Tripathi
Main category: cs.CV
TL;DR: This study presents a unified frozen-encoder benchmark that quantifies the cross-species alignment of feline and human visual representations across several model families. Self-supervised ViTs (DINO) show the strongest alignment, peaking at early blocks, suggesting that self-supervision combined with ViT inductive biases better bridges species differences.
Details
Motivation: To understand what the visual representations of cats and humans share and where they differ given their different ocular anatomy, in particular how cats' vertically elongated pupils affect visual processing, and whether existing models capture these cross-species characteristics.
Method: A frozen-encoder framework applies layer-wise Centered Kernel Alignment (CKA) and Representational Similarity Analysis (RSA) to CNNs, supervised ViTs, windowed transformers, and self-supervised ViTs (DINO), supplemented by distributional and stability tests.
Result: DINO ViT-B/16 performs best on all metrics (CKA-RBF ≈ 0.814, CKA-linear ≈ 0.745, RSA ≈ 0.698), with alignment peaking at early blocks. Supervised ViTs are competitive on CKA but show weaker geometric correspondence; CNNs are strong baselines but fall below plain ViTs; windowed transformers underperform plain ViTs.
Conclusion: Self-supervised learning combined with ViT inductive biases yields representational geometries that align feline and human visual systems more closely than conventional CNNs and windowed Transformers, providing a testable basis for neuroscientific hypotheses about cross-species visual computation.
Abstract: Cats and humans differ in ocular anatomy. Most notably, Felis catus (domestic cats) have vertically elongated pupils linked to ambush predation; yet, how such specializations manifest in downstream visual representations remains incompletely understood. We present a unified, frozen-encoder benchmark that quantifies feline-human cross-species representational alignment in the wild, across convolutional networks, supervised Vision Transformers, windowed transformers, and self-supervised ViTs (DINO), using layer-wise Centered Kernel Alignment (linear and RBF) and Representational Similarity Analysis, with additional distributional and stability tests reported in the paper. Across models, DINO ViT-B/16 attains the most substantial alignment (mean CKA-RBF $\approx0.814$, mean CKA-linear $\approx0.745$, mean RSA $\approx0.698$), peaking at early blocks, indicating that token-level self-supervision induces early-stage features that bridge species-specific statistics. Supervised ViTs are competitive on CKA yet show weaker geometric correspondence than DINO (e.g., ViT-B/16 RSA $\approx0.53$ at block8; ViT-L/16 $\approx0.47$ at block14), revealing depth-dependent divergences between similarity and representational geometry. CNNs remain strong baselines but below plain ViTs on alignment, and windowed transformers underperform plain ViTs, implicating architectural inductive biases in cross-species alignment.
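For readers unfamiliar with the CKA-linear figure quoted above, the following is a minimal sketch of the standard linear CKA formula between two activation matrices whose rows correspond to the same stimuli. It is not the authors' code: the function names, the pure-Python implementation, and the toy matrices are assumptions made for illustration, and how the paper pairs cat-view and human-view inputs is not specified in the abstract.

```python
import math


def center_columns(X):
    """Subtract each column's mean so every feature has zero mean."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in X]


def cross_frobenius_sq(A, B):
    """Squared Frobenius norm of A^T B, for matrices with the same number of rows."""
    total = 0.0
    for i in range(len(A[0])):
        for j in range(len(B[0])):
            dot = sum(a[i] * b[j] for a, b in zip(A, B))
            total += dot * dot
    return total


def linear_cka(X, Y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered features.

    Raises ZeroDivisionError if either matrix has only constant features.
    """
    Xc, Yc = center_columns(X), center_columns(Y)
    numerator = cross_frobenius_sq(Yc, Xc)
    denominator = math.sqrt(cross_frobenius_sq(Xc, Xc) * cross_frobenius_sq(Yc, Yc))
    return numerator / denominator


# Hypothetical activations: 4 stimuli x 2 features per "view".
acts_a = [[1.0, 2.0], [3.0, 4.0], [5.0, 7.0], [0.0, 1.0]]
acts_b = [[-row[1], row[0]] for row in acts_a]  # the same features rotated by 90 degrees

print(round(linear_cka(acts_a, acts_b), 6))
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of the features, which is why the rotated copy scores as fully aligned. Two representations with uncorrelated features score near zero.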
Results indicate that self-supervision coupled with ViT inductive biases yields representational geometries that more closely align feline and human visual systems than widely used CNNs and windowed Transformers, providing testable neuroscientific hypotheses about where and how cross-species visual computations converge. We release our code and dataset for reference and reproducibility.
[59] IllumFlow: Illumination-Adaptive Low-Light Enhancement via Conditional Rectified Flow and Retinex Decomposition
Wenyang Wei,Yang yang,Xixi Jia,Xiangchu Feng,Weiwei Wang,Renzhen Wang
Main category: cs.CV
TL;DR: Proposes IllumFlow, a framework that combines conditional Rectified Flow with Retinex theory and optimizes the illumination and reflectance components separately for low-light image enhancement and denoising.
Details
Motivation: Low-light images suffer from uneven illumination and noise, and existing methods struggle to enhance brightness while preserving detail.
Method: Following Retinex theory, the image is decomposed into illumination and reflectance components. A conditional Rectified Flow models illumination changes, and a denoising network, aided by flow-derived data augmentation, removes noise from the reflectance component.
Result: Across low-light enhancement and exposure correction tasks, the method achieves better quantitative and visual results than existing methods and supports customizable brightness adjustment.
Conclusion: IllumFlow separates and optimizes illumination and reflectance, producing high-quality low-light enhancement that balances brightness gain, noise suppression, and color fidelity.
Abstract: We present IllumFlow, a novel framework that synergizes conditional Rectified Flow (CRF) with Retinex theory for low-light image enhancement (LLIE). Our model addresses low-light enhancement through separate optimization of illumination and reflectance components, effectively handling both lighting variations and noise. Specifically, we first decompose an input image into reflectance and illumination components following Retinex theory. To model the wide dynamic range of illumination variations in low-light images, we propose a conditional rectified flow framework that represents illumination changes as a continuous flow field. While complex noise primarily resides in the reflectance component, we introduce a denoising network, enhanced by flow-derived data augmentation, to remove reflectance noise and chromatic aberration while preserving color fidelity. IllumFlow enables precise illumination adaptation across lighting conditions while naturally supporting customizable brightness enhancement. Extensive experiments on low-light enhancement and exposure correction demonstrate superior quantitative and qualitative performance over existing methods.
[60] ChartM$^3$: A Multi-Stage Code-Driven Pipeline for Constructing Multi-Dimensional and Multi-Step Visual Reasoning Data in Chart Comprehension
Duo Xu,Hao Cheng,Xin Lin,Zhen Xie,Hao Wang
Main category: cs.CV
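TL;DR: This paper proposes an automated multi-stage, code-driven pipeline for generating datasets for complex chart understanding and reasoning, and builds ChartM$^3$, a dataset with 38K charts and 142K Q&A pairs that substantially improves models' reasoning and cross-domain generalization in complex chart comprehension.
Details
Motivation: Existing research offers limited coverage of the complex chart scenarios and computation-intensive reasoning tasks common in real applications, and the lack of high-quality, diverse visual reasoning datasets limits the progress of multimodal large language models in this area.
Method: A multi-stage, code-driven generation pipeline combines retrieval-augmented generation (RAG) and chain-of-thought (CoT) strategies: RAG retrieves professional chart templates, CoT generates reasoning code that simulates real data distributions and drives chart rendering and statistical computation, and model-based evaluation then improves data quality and diversity.
Result: ChartM$^3$ contains 38K charts, 142K Q&A training pairs, and 2,871 high-quality evaluation samples. After supervised fine-tuning and reinforcement learning on this dataset, smaller models perform comparably to larger-scale models on complex chart comprehension, with markedly better reasoning and cross-domain generalization.
Conclusion: The code-driven framework can build high-quality, diverse visual reasoning datasets, and ChartM$^3$ is a valuable resource for complex chart understanding and for professional chart analysis with multimodal models.
Abstract: Complex chart understanding tasks demand advanced visual recognition and reasoning capabilities from multimodal large language models (MLLMs). However, current research provides limited coverage of complex chart scenarios and computation-intensive reasoning tasks prevalent in real-world applications. This study proposes an automated multi-stage code-driven pipeline for systematically generating visual reasoning datasets to address these limitations. The pipeline integrates retrieval-augmented generation (RAG) to retrieve professional chart templates and employs chain-of-thought (CoT) strategies to generate reasoning code that simulates real data distributions, thereby driving chart rendering and question-related statistical computations. Through model-based evaluation, the pipeline enhances chart diversity and data quality. Using this framework, we construct ChartM$^3$, a multi-dimensional and multi-step dataset containing 38K charts and 142K Q&A pairs for training, along with 2,871 high-quality evaluation samples for enabling practical performance assessment. Supervised fine-tuning (SFT) and reinforcement learning (RL) experiments demonstrate that our dataset significantly improves reasoning capabilities and cross-domain generalization performance, enabling smaller models to achieve performance comparable to larger-scale models in complex chart comprehension.
[61] Synthetic Crop-Weed Image Generation and its Impact on Model Generalization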
Garen Boyadjian,Cyrille Pierre,Johann Laconte,Riccardo Bertoglio
Main category: cs.CV
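TL;DR: This paper presents a pipeline that uses Blender to procedurally generate synthetic crop-weed images, reducing the cost of annotated data for training semantic segmentation models for agricultural weeding robots, and shows that synthetic data generalizes better than real data in cross-domain scenarios.
Details
Motivation: Deep learning models need large amounts of annotated real field data, which are costly to obtain, while the domain gap between synthetic and real images limits the use of synthetic data.
Method: A Blender-based pipeline generates annotated synthetic crop-weed images under diverse conditions (plant growth stage, weed density, lighting, and camera angle). Several state-of-the-art semantic segmentation models are benchmarked on synthetic and real data, and their cross-domain generalization is analyzed.
Result: Models trained on synthetic data show a sim-to-real gap of about 10% when transferred to real scenes, surpassing previous state-of-the-art methods, and synthetic data generalizes better than real data in cross-domain tasks.
Conclusion: Synthetic agricultural datasets show strong potential for training semantic segmentation models, and the findings support hybrid strategies that combine synthetic and real data for more efficient training.
Abstract: Precise semantic segmentation of crops and weeds is necessary for agricultural weeding robots. However, training deep learning models requires large annotated datasets, which are costly to obtain in real fields. Synthetic data can reduce this burden, but the gap between simulated and real images remains a challenge. In this paper, we present a pipeline for procedural generation of synthetic crop-weed images using Blender, producing annotated datasets under diverse conditions of plant growth, weed density, lighting, and camera angle. We benchmark several state-of-the-art segmentation models on synthetic and real datasets and analyze their cross-domain generalization. Our results show that training on synthetic images leads to a sim-to-real gap of 10%, surpassing previous state-of-the-art methods. Moreover, synthetic data demonstrates good generalization properties, outperforming real datasets in cross-domain scenarios. These findings highlight the potential of synthetic agricultural datasets and support hybrid strategies for more efficient model training.
[62] From the Laboratory to Real-World Application: Evaluating Zero-Shot Scene Interpretation on Edge Devices for Mobile Robotics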
Nicolas Schuler,Lea Dewald,Nick Baldig,Jürgen Graf
Main category: cs.CV
TL;DR: This paper studies the scene interpretation and action recognition capabilities of small vision-language models (VLMs) suitable for edge devices in mobile robotics, evaluates them across diverse real-world scenarios, and discusses computational efficiency, model biases, and practical challenges.
Details
Motivation: Large models are too computationally demanding to deploy on edge devices and mobile robots, so the potential of small VLMs for efficient inference with acceptable performance needs to be explored.
Method: A pipeline for scene interpretation and action recognition based on small VLMs is proposed and evaluated on a dataset of diverse real-world cityscape, on-campus, and indoor scenarios.
Result: Small VLMs on edge devices show some capability for scene interpretation and action recognition, but there is a trade-off between accuracy and inference speed, and the evaluation exposes model biases and limitations.
Conclusion: Small VLMs have potential for real-time scene understanding in mobile robotics but need further optimization to overcome performance bottlenecks and model biases.
Abstract: Video Understanding, Scene Interpretation and Commonsense Reasoning are highly challenging tasks enabling the interpretation of visual information, allowing agents to perceive, interact with and make rational decisions in their environment. Large Language Models (LLMs) and Visual Language Models (VLMs) have shown remarkable advancements in these areas in recent years, enabling domain-specific applications as well as zero-shot open vocabulary tasks, combining multiple domains. However, the required computational complexity poses challenges for their application on edge devices and in the context of Mobile Robotics, especially considering the trade-off between accuracy and inference time. In this paper, we investigate the capabilities of state-of-the-art VLMs for the task of Scene Interpretation and Action Recognition, with special regard to small VLMs capable of being deployed to edge devices in the context of Mobile Robotics. The proposed pipeline is evaluated on a diverse dataset consisting of various real-world cityscape, on-campus and indoor scenarios. The experimental evaluation discusses the potential of these small models on edge devices, with particular emphasis on challenges, weaknesses, inherent model biases and the application of the gained information. Supplementary material is provided via the following repository: https://datahub.rz.rptu.de/hstr-csrl-public/publications/scene-interpretation-on-edge-devices/
[63] KAO: Kernel-Adaptive Optimization in Diffusion for Satellite Image
Teerapong Panboonyuen
Main category: cs.CV
TL;DR: Proposes KAO, a framework that uses Kernel-Adaptive Optimization within diffusion models for high-resolution satellite image inpainting, improving efficiency and accuracy through Latent Space Conditioning and Explicit Propagation.
Details
Motivation: Existing methods rely either on preconditioned models that require extensive retraining or on postconditioned models with heavy computational overhead, which makes efficient and accurate inpainting of high-resolution satellite images difficult.
Method: KAO uses Latent Space Conditioning to optimize a compact latent space and incorporates Explicit Propagation into the diffusion process to achieve forward-backward fusion.
Result: Experiments indicate state-of-the-art inpainting on datasets such as DeepGlobe and the Massachusetts Roads Dataset, balancing efficiency and flexibility.
Conclusion: KAO offers a scalable, high-performance solution for high-resolution satellite image inpainting with improved stability and precision.
Abstract: Satellite image inpainting is a crucial task in remote sensing, where accurately restoring missing or occluded regions is essential for robust image analysis. In this paper, we propose KAO, a novel framework that utilizes Kernel-Adaptive Optimization within diffusion models for satellite image inpainting. KAO is specifically designed to address the challenges posed by very high-resolution (VHR) satellite datasets, such as DeepGlobe and the Massachusetts Roads Dataset. Unlike existing methods that rely on preconditioned models requiring extensive retraining or postconditioned models with significant computational overhead, KAO introduces a Latent Space Conditioning approach, optimizing a compact latent space to achieve efficient and accurate inpainting. Furthermore, we incorporate Explicit Propagation into the diffusion process, facilitating forward-backward fusion, which improves the stability and precision of the method. Experimental results demonstrate that KAO sets a new benchmark for VHR satellite image restoration, providing a scalable, high-performance solution that balances the efficiency of preconditioned models with the flexibility of postconditioned models.
[64] MVAFormer: RGB-based Multi-View Spatio-Temporal Action Recognition with Transformer
Taiga Yamane,Satoshi Suzuki,Ryo Masumura,Shotaro Tora
Main category: cs.CV
TL;DR: This paper proposes MVAFormer, a multi-view action recognition method for the spatio-temporal action recognition (STAR) setting. Using feature maps that preserve spatial information and a transformer-based cooperation module among views, it outperforms the baselines by about 4.4 points in F-measure on a newly collected dataset.
Details
Motivation: Existing multi-view action recognition methods target recognizing a single action from an entire video and cannot be applied directly to the STAR setting, where each person's action is recognized individually. A new method is needed for multi-view cooperation under STAR.
Method: MVAFormer introduces a transformer-based cooperation module among views. It operates on feature maps that preserve spatial information rather than on embedding vectors that discard it, and divides self-attention into same-view and different-view parts to model relationships between views more effectively.
Result: On a newly collected dataset, MVAFormer outperforms the comparison baselines by about 4.4 points in F-measure.
Conclusion: Through effective multi-view cooperation, MVAFormer improves multi-view action recognition in the STAR setting, confirming its advantage in complex scenes.
Abstract: Multi-view action recognition aims to recognize human actions using multiple camera views and deals with occlusion caused by obstacles or crowds. In this task, cooperation among views, which generates a joint representation by combining multiple views, is vital. Previous studies have explored promising cooperation methods for improving performance. However, since their methods focus only on the task setting of recognizing a single action from an entire video, they are not applicable to the recently popular spatio-temporal action recognition (STAR) setting, in which each person's action is recognized sequentially. To address this problem, this paper proposes a multi-view action recognition method for the STAR setting, called MVAFormer. In MVAFormer, we introduce a novel transformer-based cooperation module among views. In contrast to previous studies, which utilize embedding vectors with lost spatial information, our module utilizes the feature map for effective cooperation in the STAR setting, which preserves the spatial information. Furthermore, in our module, we divide the self-attention for the same and different views to model the relationship between multiple views effectively. The results of experiments using a newly collected dataset demonstrate that MVAFormer outperforms the comparison baselines by approximately $4.4$ points on the F-measure.
[65] OLATverse: A Large-scale Real-world Object Dataset with Precise Lighting Control
Xilong Zhou,Jianchun Chen,Pramod Rao,Timo Teufel,Linjie Lyu,Tigran Minasian,Oleksandr Sotnychenko,Xiaoxiao Long,Marc Habermann,Christian Theobalt
Main category: cs.CV
TL;DR: OLATverse is a large-scale dataset of around 9M images of 765 real-world objects, captured from multiple viewpoints under precisely controlled lighting, intended to advance inverse rendering, novel view synthesis, and relighting.
Details
Motivation: Most existing methods train on synthetic datasets, and real-world datasets are small, which limits realism and generalization. OLATverse aims to fill this gap with a large-scale, high-fidelity dataset of real objects under controlled lighting.
Method: Each object is captured with 35 DSLR cameras and 331 individually controlled light sources across viewpoints and lighting conditions, and is accompanied by calibrated camera parameters, accurate object masks, photometric surface normals, and diffuse albedo.
Result: The resulting dataset covers 765 real objects with high-quality multi-view images under precisely controlled lighting, and establishes the first comprehensive real-world object-centric benchmark for inverse rendering and normal estimation.
Conclusion: OLATverse provides key support for integrating next-generation inverse rendering and relighting methods with real-world data and may improve their performance and generalization on real scenes.
Abstract: We introduce OLATverse, a large-scale dataset comprising around 9M images of 765 real-world objects, captured from multiple viewpoints under a diverse set of precisely controlled lighting conditions. While recent advances in object-centric inverse rendering, novel view synthesis and relighting have shown promising results, most techniques still heavily rely on synthetic datasets for training and small-scale real-world datasets for benchmarking, which limits their realism and generalization. To address this gap, OLATverse offers two key advantages over existing datasets: large-scale coverage of real objects and high-fidelity appearance under precisely controlled illuminations. Specifically, OLATverse contains 765 common and uncommon real-world objects, spanning a wide range of material categories. Each object is captured using 35 DSLR cameras and 331 individually controlled light sources, enabling the simulation of diverse illumination conditions. In addition, for each object, we provide well-calibrated camera parameters, accurate object masks, photometric surface normals, and diffuse albedo as auxiliary resources. We also construct an extensive evaluation set, establishing the first comprehensive real-world object-centric benchmark for inverse rendering and normal estimation. We believe that OLATverse represents a pivotal step toward integrating the next generation of inverse rendering and relighting methods with real-world data.
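The "around 9M images" figure can be sanity-checked from the capture setup described above. The short calculation below assumes one image per camera per light source per object (a one-light-at-a-time capture with no extra frames); that assumption is ours and is not stated in the abstract.

```python
# Capture setup as stated in the abstract.
cameras = 35
light_sources = 331
objects = 765

# Assumption: every camera records one image for each individually lit light source.
images_per_object = cameras * light_sources
total_images = images_per_object * objects

print(images_per_object)                # 11585
print(total_images)                     # 8862525
print(f"{total_images / 1e6:.2f}M")     # 8.86M
```

Under that assumption the total comes to roughly 8.86M images, which is consistent with the stated "around 9M".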
The full dataset, along with all post-processing workflows, will be publicly released at https://vcai.mpi-inf.mpg.de/projects/OLATverse/.
[66] Object Detection as an Optional Basis: A Graph Matching Network for Cross-View UAV Localization
Tao Liu,Kan Ren,Qian Chen
Main category: cs.CV
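TL;DR: This paper proposes a cross-view UAV localization framework based on object detection and graph neural networks. By building a fine-grained, graph-based node-similarity metric, it achieves effective heterogeneous image matching and localization in GNSS-denied environments.
Details
Motivation: Where GNSS signals are unavailable, satellite-based positioning fails, and existing cross-view localization methods suffer from misalignment, content loss, and poor generalization when handling cross-temporal and cross-modality differences. A more robust cross-view localization method is therefore needed.
Method: UAV visual localization is cast as an object-level graph matching problem. Modern object detection extracts salient instances from UAV and satellite images to build graphs, a graph neural network models intra-image and inter-image node relationships, and a fine-grained node-similarity metric drives image retrieval and localization.
Result: Extensive experiments on public and real-world datasets show that the method handles heterogeneous appearance differences, generalizes well, and is applicable to scenarios with large modality gaps such as infrared-visible matching, with strong localization and retrieval performance.
Conclusion: By combining object detection with graph neural networks, the ODGNNLoc framework improves the accuracy and robustness of cross-view UAV localization and offers a practical option for autonomous navigation of unmanned systems in the low-altitude economy.
Abstract: With the rapid growth of the low-altitude economy, UAVs have become crucial for measurement and tracking in patrol systems. However, in GNSS-denied areas, satellite-based localization methods are prone to failure. This paper presents a cross-view UAV localization framework that performs map matching via object detection, aimed at effectively addressing cross-temporal, cross-view, heterogeneous aerial image matching. In typical pipelines, UAV visual localization is formulated as an image-retrieval problem: features are extracted to build a localization map, and the pose of a query image is estimated by matching it to a reference database with known poses. Because publicly available UAV localization datasets are limited, many approaches recast localization as a classification task and rely on scene labels in these datasets to ensure accuracy. Other methods seek to reduce cross-domain differences using polar-coordinate reprojection, perspective transformations, or generative adversarial networks; however, they can suffer from misalignment, content loss, and limited realism. In contrast, we leverage modern object detection to accurately extract salient instances from UAV and satellite images, and integrate a graph neural network to reason about inter-image and intra-image node relationships. Using a fine-grained, graph-based node-similarity metric, our method achieves strong retrieval and localization performance.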
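To make the retrieval step concrete, here is a deliberately simple stand-in for a node-similarity metric: each image is represented as a list of per-object feature vectors, and a query is matched to the reference tile whose nodes give the highest mean best-match cosine similarity. This is not the paper's learned metric and involves no graph neural network; the function names, the scoring rule, and the toy vectors are all assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 0.0 if norm_u == 0 or norm_v == 0 else dot / (norm_u * norm_v)


def graph_similarity(query_nodes, ref_nodes):
    """Mean over query nodes of the best cosine match among reference nodes."""
    if not query_nodes or not ref_nodes:
        return 0.0
    best_matches = [max(cosine(q, r) for r in ref_nodes) for q in query_nodes]
    return sum(best_matches) / len(best_matches)


def retrieve(query_nodes, reference_db):
    """Return the key of the reference tile whose nodes best match the query."""
    return max(reference_db, key=lambda key: graph_similarity(query_nodes, reference_db[key]))


# Hypothetical node embeddings for one UAV image and two satellite tiles.
uav_nodes = [[1.0, 0.0], [0.0, 1.0]]
satellite_tiles = {
    "tile_a": [[1.0, 0.1], [0.1, 1.0]],
    "tile_b": [[-1.0, 0.0], [0.0, -1.0]],
}

print(retrieve(uav_nodes, satellite_tiles))
```

In a pipeline like the one the abstract describes, the node vectors would come from a detector and a graph network rather than being written by hand, and the similarity would be learned rather than fixed.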
Extensive experiments on public and real-world datasets show that our approach handles heterogeneous appearance differences effectively and generalizes well, making it applicable to scenarios with larger modality gaps, such as infrared-visible image matching. Our dataset will be publicly available at the following URL: https://github.com/liutao23/ODGNNLoc.git.
[67] Adapting General-Purpose Foundation Models for X-ray Ptychography in Low-Data Regimes
Robinson Umeike,Neil Getty,Yin Xiangyu,Yi Jiang
Main category: cs.CV
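TL;DR: This paper presents PtychoBench, a multi-modal, multi-task benchmark for ptychographic analysis, and systematically compares two specialization strategies: supervised fine-tuning (SFT) and in-context learning (ICL). The best strategy depends on the task type: SFT and ICL are complementary on the visual task, while ICL performs better on the textual task.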
Details
Motivation: To determine how best to adapt general-purpose foundation models (such as language models and vision-language models) to specialized scientific tasks, especially when data are scarce.
Method: A multi-modal, multi-task benchmark named PtychoBench is built to evaluate SFT and ICL on a visual artifact detection task (VLMs) and a textual parameter recommendation task (LLMs), with comparisons against strong baselines such as GPT-4o and a DINOv3 classifier.
Result: On the visual task, a fine-tuned model guided by in-context examples performs best (Micro-F1 of 0.728); on the textual task, ICL on a large base model performs best (Micro-F1 of 0.847), even surpassing a strong SFT model (0.839). Context-aware prompting is also found to be more effective, and fine-tuned models exhibit contextual interference.
Conclusion: Task modality determines the optimal specialization path: the visual task benefits from combining SFT with ICL, whereas the textual task is better served by ICL. The study offers a clear framework for developing AI systems in science.
Abstract: The automation of workflows in advanced microscopy is a key goal where foundation models like Language Models (LLMs) and Vision-Language Models (VLMs) show great potential. However, adapting these general-purpose models for specialized scientific tasks is critical, and the optimal domain adaptation strategy is often unclear. To address this, we introduce PtychoBench, a new multi-modal, multi-task benchmark for ptychographic analysis. Using this benchmark, we systematically compare two specialization strategies: Supervised Fine-Tuning (SFT) and In-Context Learning (ICL). We evaluate these strategies on a visual artifact detection task with VLMs and a textual parameter recommendation task with LLMs in a data-scarce regime. Our findings reveal that the optimal specialization pathway is task-dependent. For the visual task, SFT and ICL are highly complementary, with a fine-tuned model guided by context-aware examples achieving the highest mean performance (Micro-F1 of 0.728). Conversely, for the textual task, ICL on a large base model is the superior strategy, reaching a peak Micro-F1 of 0.847 and outperforming a powerful "super-expert" SFT model (0-shot Micro-F1 of 0.839). We also confirm the superiority of context-aware prompting and identify a consistent contextual interference phenomenon in fine-tuned models.
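The scores above are reported as Micro-F1. For readers unfamiliar with the metric, the sketch below shows how micro-averaging pools true positives, false positives, and false negatives over all samples and labels before computing F1. The label names are made up for illustration and are not taken from PtychoBench.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for multi-label predictions given as sets of labels.

    Counts are pooled across all samples and labels before F1 is computed:
    F1 = 2*TP / (2*TP + FP + FN).
    """
    tp = fp = fn = 0
    for true, pred in zip(y_true, y_pred):
        tp += len(true & pred)   # labels correctly predicted
        fp += len(pred - true)   # labels predicted but absent
        fn += len(true - pred)   # labels present but missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical artifact labels for three reconstructions.
y_true = [{"blur", "ring"}, {"noise"}, set()]
y_pred = [{"blur"}, {"noise", "ring"}, set()]
score = micro_f1(y_true, y_pred)  # TP=2, FP=1, FN=1 -> 4/6
```

Because counts are pooled, frequent labels weigh more heavily in Micro-F1 than rare ones, which is worth keeping in mind when comparing scores across tasks.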
These results, benchmarked against strong baselines including GPT-4o and a DINOv3-based classifier, offer key observations for AI in science: the optimal specialization path in our benchmark is dependent on the task modality, offering a clear framework for developing more effective science-based agentic systems.
[68] ESA: Energy-Based Shot Assembly Optimization for Automatic Video Editing
Yaosen Chen,Wei Wang,Xuming Wen,Han Yang,Yanru Zhang
Main category: cs.CV
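TL;DR: Proposes an energy-based optimization method for video shot assembly that automatically combines shots according to script semantics and the style of reference videos, improving the artistic expressiveness of intelligent video editing.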
Details
Motivation: Existing intelligent video editing technologies struggle to capture a creator's unique artistic expression in shot assembly; automated editing that better matches narrative and artistic style is needed.
Method: First, a script is generated by a large language model and matched visually and semantically against a video library to obtain candidate shots; next, shots from reference videos are segmented and labeled, extracting attributes such as shot size, camera motion, and semantics; energy-based models then learn from these attributes to score candidate shot sequences; finally, multiple syntax rules are combined to optimize the shot assembly.
Result: The method generates shot sequences consistent with the editing style of the reference videos and lets non-professional users create videos with coherent visual expression and artistic style.
Conclusion: The method not only automates shot arrangement but also learns the assembly style of reference videos, effectively combining narrative logic with artistic expression and lowering the barrier to high-quality video creation.
Abstract: Shot assembly is a crucial step in film production and video editing, involving the sequencing and arrangement of shots to construct a narrative, convey information, or evoke emotions. Traditionally, this process has been manually executed by experienced editors. While current intelligent video editing technologies can handle some automated video editing tasks, they often fail to capture the creator's unique artistic expression in shot assembly. To address this challenge, we propose an energy-based optimization method for video shot assembly. Specifically, we first perform visual-semantic matching between the script generated by a large language model and a video library to obtain subsets of candidate shots aligned with the script semantics. Next, we segment and label the shots from reference videos, extracting attributes such as shot size, camera motion, and semantics. We then employ energy-based models to learn from these attributes, scoring candidate shot sequences based on their alignment with reference styles. Finally, we achieve shot assembly optimization by combining multiple syntax rules, producing videos that align with the assembly style of the reference videos. Our method not only automates the arrangement and combination of independent shots according to specific logic, narrative requirements, or artistic styles but also learns the assembly style of reference videos, creating a coherent visual sequence or holistic visual expression. With our system, even users with no prior video editing experience can create visually compelling videos.
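The scoring step can be pictured with a toy energy function. The sketch below is not the authors' model: it assumes the learned energy reduces to a lookup table of pairwise transition costs between shot sizes (the values are invented), and it picks one candidate shot per script segment by exhaustive search. Lower energy means closer to the reference style.

```python
from itertools import product

# Hypothetical transition energies between shot sizes. In the paper these
# would come from a learned energy-based model; here they are made up.
TRANSITION_ENERGY = {
    ("wide", "medium"): 0.2,
    ("medium", "close"): 0.3,
    ("wide", "close"): 1.0,
}
DEFAULT_ENERGY = 2.0  # assumed cost for transitions not seen in the reference


def sequence_energy(shots):
    """Sum of pairwise transition energies along a shot sequence."""
    return sum(
        TRANSITION_ENERGY.get((a["size"], b["size"]), DEFAULT_ENERGY)
        for a, b in zip(shots, shots[1:])
    )


def best_assembly(candidates_per_segment):
    """Choose one candidate per script segment so total energy is minimal."""
    return min(product(*candidates_per_segment), key=sequence_energy)


# Hypothetical candidates for three consecutive script segments.
segments = [
    [{"id": "a1", "size": "wide"}, {"id": "a2", "size": "close"}],
    [{"id": "b1", "size": "medium"}],
    [{"id": "c1", "size": "wide"}, {"id": "c2", "size": "close"}],
]
chosen = best_assembly(segments)
chosen_ids = [shot["id"] for shot in chosen]
```

Exhaustive search is only workable for a handful of segments; longer scripts would need dynamic programming or sampling, and the abstract does not say which optimizer the authors use.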
Project page: https://sobeymil.github.io/esa.com
[69] Keeping it Local, Tiny and Real: Automated Report Generation on Edge Computing Devices for Mechatronic-Based Cognitive Systems
Nicolas Schuler,Lea Dewald,Jürgen Graf
Main category: cs.CV
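TL;DR: This paper proposes an automated report generation pipeline for mobile robots based on multi-modal sensors and local models. It runs on edge devices, preserves privacy, and requires no external services.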
Details
Motivation: As deep learning advances, intelligent robotic systems must process large amounts of heterogeneous data to cope with complex environments; for critical tasks such as autonomous driving and service robotics in particular, effective means of evaluation are urgently needed.
Method: A fully local-model-based automated report generation pipeline is designed and implemented; it uses multi-modal sensor data to generate natural-language reports on edge computing devices.
Result: The pipeline is evaluated on a diverse dataset covering indoor, outdoor, and urban environments, with quantitative and qualitative results; example reports and supplementary materials are publicly released.
Conclusion: The method effectively supports the evaluation and adoption of mobile robotic systems while preserving privacy, with cross-domain applicability and potential for real-world deployment.
Abstract: Recent advancements in Deep Learning enable hardware-based cognitive systems, that is, mechatronic systems in general and robotics in particular with integrated Artificial Intelligence, to interact with dynamic and unstructured environments. While the results are impressive, the application of such systems to critical tasks like autonomous driving as well as service and care robotics necessitates the evaluation of large amounts of heterogeneous data. Automated report generation for Mobile Robotics can play a crucial role in facilitating the evaluation and acceptance of such systems in various domains. In this paper, we propose a pipeline for generating automated reports in natural language utilizing various multi-modal sensors that solely relies on local models capable of being deployed on edge computing devices, thus preserving the privacy of all actors involved and eliminating the need for external services. In particular, we evaluate our implementation on a diverse dataset spanning multiple domains including indoor, outdoor and urban environments, providing quantitative as well as qualitative evaluation results. Various generated example reports and other supplementary materials are available via a public repository.
[70] LiteVoxel: Low-memory Intelligent Thresholding for Efficient Voxel Rasterization
Jee Won Lee,Jongseong Brad Choi
Main category: cs.CV
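TL;DR: LiteVoxel is a self-tuning training pipeline that improves the stability and memory efficiency of sparse-voxel rasterization through a modified loss and a dynamic pruning strategy, substantially reducing peak VRAM while preserving low-frequency detail.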
Details
Motivation: Sparse-voxel rasterization is fast and differentiable for optimization-based scene reconstruction, but it underfits low-frequency content, relies on brittle pruning heuristics, and inflates VRAM.
Method: Inverse-Sobel reweighting with a mid-training gamma-ramp makes the loss attend to low-frequency regions; pruning uses depth-quantile thresholds on maximum blending weight, stabilized by EMA-hysteresis guards, and subdivision is prioritized by ray footprint under an explicit growth budget.
Result: On Mip-NeRF 360 and Tanks & Temples, low-frequency errors and boundary instability are mitigated; PSNR/SSIM, training time, and FPS are comparable to a strong baseline; peak VRAM drops by 40%-60%.
Conclusion: LiteVoxel enables steadier, lighter, and more memory-efficient sparse-voxel training, improving reconstruction without sacrificing perceptual quality.
Abstract: Sparse-voxel rasterization is a fast, differentiable alternative for optimization-based scene reconstruction, but it tends to underfit low-frequency content, depends on brittle pruning heuristics, and can overgrow in ways that inflate VRAM. We introduce LiteVoxel, a self-tuning training pipeline that makes SV rasterization both steadier and lighter. Our loss is made low-frequency aware via an inverse-Sobel reweighting with a mid-training gamma-ramp, shifting gradient budget to flat regions only after geometry stabilizes. Adaptation replaces fixed thresholds with a depth-quantile pruning logic on maximum blending weight, stabilized by EMA-hysteresis guards, and refines structure through ray-footprint-based, priority-driven subdivision under an explicit growth budget. Ablations and full-system results across Mip-NeRF 360 (6 scenes) and Tanks & Temples (3 scenes) datasets show mitigation of errors in low-frequency regions and boundary instability while keeping PSNR/SSIM, training time, and FPS comparable to a strong SVRaster pipeline. Crucially, LiteVoxel reduces peak VRAM by ~40%-60% and preserves low-frequency detail that prior setups miss, enabling more predictable, memory-efficient training without sacrificing perceptual quality.
[71] Unsupervised Learning for Industrial Defect Detection: A Case Study on Shearographic Data
Jessica Plassmann,Nicolas Schuler,Georg von Freymann,Michael Schuth
Main category: cs.CV
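TL;DR: This study explores unsupervised learning for automated anomaly detection in shearographic images, evaluates three model architectures, and shows that the student-teacher feature matching model performs best in classification robustness and defect localization precision.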
Details
Motivation: To reduce reliance on labeled data and manual interpretation and to promote industrial adoption of shearography.
Method: Three unsupervised architectures (a fully connected autoencoder, a convolutional autoencoder, and a student-teacher feature matching model) are trained on two subsets containing only defect-free data and systematically evaluated under controlled conditions on a dataset acquired from a custom specimen.
Result: The student-teacher model shows better classification performance and precise defect localization; t-SNE visualizations show more separable feature representations; a supervised YOLOv8 model serves as a reference for localization quality.
Conclusion: Unsupervised deep learning, particularly the student-teacher approach, has the potential to enable efficient, scalable industrial shearographic inspection.
Abstract: Shearography is a non-destructive testing method for detecting subsurface defects, offering high sensitivity and full-field inspection capabilities. However, its industrial adoption remains limited due to the need for expert interpretation. To reduce reliance on labeled data and manual evaluation, this study explores unsupervised learning methods for automated anomaly detection in shearographic images. Three architectures are evaluated: a fully connected autoencoder, a convolutional autoencoder, and a student-teacher feature matching model. All models are trained solely on defect-free data. A controlled dataset was developed using a custom specimen with reproducible defect patterns, enabling systematic acquisition of shearographic measurements under both ideal and realistic deformation conditions. Two training subsets were defined: one containing only undistorted, defect-free samples, and one additionally including globally deformed, yet defect-free, data. The latter simulates practical inspection conditions by incorporating deformation-induced fringe patterns that may obscure localized anomalies. The models are evaluated in terms of binary classification and, for the student-teacher model, spatial defect localization. Results show that the student-teacher approach achieves superior classification robustness and enables precise localization. Compared to the autoencoder-based models, it demonstrates improved separability of feature representations, as visualized through t-SNE embeddings. Additionally, a YOLOv8 model trained on labeled defect data serves as a reference to benchmark localization quality.
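A student-teacher feature matching model is usually scored by how far the student's features drift from the teacher's on inputs unlike the defect-free training data. The sketch below illustrates that scoring step only, assuming both networks have already produced per-location feature vectors; the feature values, the squared-distance measure, and the threshold are illustrative assumptions rather than the study's settings.

```python
def anomaly_map(teacher_feats, student_feats):
    """Per-location squared L2 distance between teacher and student features.

    Both inputs are H x W grids (nested lists) of equal-length feature vectors.
    """
    return [
        [sum((t - s) ** 2 for t, s in zip(tv, sv)) for tv, sv in zip(trow, srow)]
        for trow, srow in zip(teacher_feats, student_feats)
    ]


def image_score(amap):
    """Image-level anomaly score: the largest per-location discrepancy."""
    return max(max(row) for row in amap)


def localize(amap, threshold):
    """Grid coordinates (row, col) whose discrepancy exceeds the threshold."""
    return [
        (r, c)
        for r, row in enumerate(amap)
        for c, value in enumerate(row)
        if value > threshold
    ]


# Hypothetical 2 x 2 feature grids; the student disagrees at location (1, 0).
teacher = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [0.0, 0.0]]]
student = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]]
amap = anomaly_map(teacher, student)
defects = localize(amap, threshold=0.5)
```

Thresholding the image-level score gives the binary decision, while the map itself gives the spatial localization that the autoencoder baselines are not evaluated on.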
This study underscores the potential of unsupervised deep learning for scalable, label-efficient shearographic inspection in industrial environments.
[72] Forecasting Future Anatomies: Longitudinal Brain MRI-to-MRI Prediction
Ali Farki,Elaheh Moradi,Deepika Koundal,Jussi Tohka
Main category: cs.CV
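TL;DR: This study uses deep learning models to predict brain MRI several years into the future from a baseline MRI, achieving high-fidelity prediction of individual brain state and showing potential for personalized prognosis in neurodegenerative diseases.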
Details
Motivation: Predicting future brain state is crucial for studying neurodegenerative diseases such as Alzheimer's disease; existing methods mostly predict cognitive scores or clinical outcomes and do not directly forecast whole-brain MRI over the long term.
Method: Five deep learning architectures (UNet, U2-Net, UNETR, Time-Embedding UNet, and ODE-UNet) perform MRI image-to-image prediction on two longitudinal cohorts, ADNI and AIBL, and generalization is validated on an independent external dataset.
Result: The best models achieve high-fidelity MRI prediction, all models generalize well across cohorts, and predictions agree closely with the actual follow-up scans in both global similarity and local differences.
Conclusion: Deep learning can reliably predict individualized brain MRI changes at the voxel level, opening a new route to individualized prognosis in neurodegenerative diseases.
Abstract: Predicting future brain state from a baseline magnetic resonance image (MRI) is a central challenge in neuroimaging and has important implications for studying neurodegenerative diseases such as Alzheimer's disease (AD). Most existing approaches predict future cognitive scores or clinical outcomes, such as conversion from mild cognitive impairment to dementia. Instead, here we investigate longitudinal MRI image-to-image prediction that forecasts a participant's entire brain MRI several years into the future, intrinsically modeling complex, spatially distributed neurodegenerative patterns. We implement and evaluate five deep learning architectures (UNet, U2-Net, UNETR, Time-Embedding UNet, and ODE-UNet) on two longitudinal cohorts (ADNI and AIBL). Predicted follow-up MRIs are directly compared with the actual follow-up scans using metrics that capture global similarity and local differences. The best performing models achieve high-fidelity predictions, and all models generalize well to an independent external dataset, demonstrating robust cross-cohort performance. Our results indicate that deep learning can reliably predict participant-specific brain MRI at the voxel level, offering new opportunities for individualized prognosis.
[73] The Urban Vision Hackathon Dataset and Models: Towards Image Annotations and Accurate Vision Models for Indian Traffic
Akash Sharma,Chinmay Mhatre,Sankalp Gawali,Ruthvik Bokkasam,Brij Kishore,Vishwajeet Pattanaik,Tarun Rambha,Abdul R. Pinjari,Vijay Kovvali,Anirban Chakraborty,Punit Rathore,Raghu Krishnapuram,Yogesh Simmhan
Main category: cs.CV
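TL;DR: UVH-26 is the first large-scale annotated traffic-camera image dataset from India, comprising 26,646 high-resolution images and 1.8 million bounding boxes over 14 India-specific vehicle classes, annotated through crowdsourcing with consensus labels derived by Majority Voting and the STAPLE algorithm. Detectors trained on it improve mAP50:95 by 8.4-31.5% over COCO-pretrained models, significantly boosting object detection in complex Indian traffic scenes.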
Details
Motivation: Existing public datasets do not adequately reflect India's complex, heterogeneous urban traffic, and large-scale annotated data covering India-specific vehicles and real traffic-camera scenes is lacking, which limits the application and development of intelligent transportation systems in the region.
Method: 26,646 1080p images are collected from 2,800 Safe-City surveillance cameras in Bengaluru and annotated by 565 college students from across India in a crowdsourced hackathon, covering 14 typical Indian vehicle classes. Majority Voting and STAPLE fuse the multiple annotations into consensus ground truth. Contemporary detectors including YOLO11, RT-DETR, and DAMO-YOLO are trained and evaluated on the two versions, UVH-26-MV and UVH-26-ST.
Result: Between 283k and 316k consensus bounding boxes are produced; RT-DETR-X reaches 0.67 mAP50:95, compared with 0.40 for the COCO-pretrained model; all models improve mAP50:95 by 8.4%-31.5% over the COCO baselines. The image set with consensus annotations and 6 fine-tuned detection models are released.
Conclusion: UVH-26 fills the gap in global traffic datasets regarding India's complex urban traffic, demonstrates the key role of domain-specific data in improving detection, and provides a foundational resource for intelligent transportation research in developing countries.
Abstract: This report describes the UVH-26 dataset, the first public release by AIM@IISc of a large-scale dataset of annotated traffic-camera images from India. The dataset comprises 26,646 high-resolution (1080p) images sampled from 2,800 of Bengaluru's Safe-City CCTV cameras over a 4-week period, and subsequently annotated through a crowdsourced hackathon involving 565 college students from across India. In total, 1.8 million bounding boxes were labeled across 14 vehicle classes specific to India: Cycle, 2-Wheeler (Motorcycle), 3-Wheeler (Auto-rickshaw), LCV (Light Commercial Vehicles), Van, Tempo-traveller, Hatchback, Sedan, SUV, MUV, Mini-bus, Bus, Truck and Other. Of these, 283k-316k consensus ground truth bounding boxes and labels were derived for distinct objects in the 26k images using Majority Voting and STAPLE algorithms. Further, we train multiple contemporary detectors, including YOLO11-S/X, RT-DETR-S/X, and DAMO-YOLO-T/L using these datasets, and report accuracy based on mAP50, mAP75 and mAP50:95. Models trained on UVH-26 achieve 8.4-31.5% improvements in mAP50:95 over equivalent baseline models trained on the COCO dataset, with RT-DETR-X showing the best performance at 0.67 (mAP50:95) as compared to 0.40 for COCO-trained weights for common classes (Car, Bus, and Truck). This demonstrates the benefits of domain-specific training data for Indian traffic scenarios.
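The abstract does not spell out how Majority Voting is applied to bounding boxes, so the following is only a plausible sketch: boxes from different annotators are grouped greedily by IoU, a group is kept if more than half of the annotators contributed to it, and the consensus takes the mean box and the most common label. The IoU threshold, the grouping rule, and the sample annotations are all assumptions; STAPLE is not shown.

```python
from collections import Counter


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def majority_vote(annotations, num_annotators, iou_thr=0.5):
    """Fuse (box, label) pairs from several annotators of one image.

    Greedy grouping: each box joins the first group whose seed box overlaps
    it with IoU >= iou_thr, otherwise it starts a new group.
    """
    groups = []
    for box, label in annotations:
        for group in groups:
            if iou(group[0][0], box) >= iou_thr:
                group.append((box, label))
                break
        else:
            groups.append([(box, label)])

    consensus = []
    for group in groups:
        if len(group) * 2 > num_annotators:  # strict majority of annotators
            n = len(group)
            mean_box = tuple(sum(b[i] for b, _ in group) / n for i in range(4))
            top_label = Counter(lbl for _, lbl in group).most_common(1)[0][0]
            consensus.append((mean_box, top_label))
    return consensus


# Hypothetical annotations from three annotators of one image.
annotations = [
    ((0, 0, 10, 10), "Sedan"),
    ((1, 0, 11, 10), "Sedan"),
    ((0, 1, 10, 11), "SUV"),
    ((50, 50, 60, 60), "Bus"),  # seen by one annotator only, so it is dropped
]
result = majority_vote(annotations, num_annotators=3)
```

Here the three overlapping boxes merge into one consensus "Sedan" box, while the single unsupported "Bus" box is discarded.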
The release package provides the 26k images with consensus annotations based on Majority Voting (UVH-26-MV) and STAPLE (UVH-26-ST) and the 6 fine-tuned YOLO and DETR models on each of these datasets. By capturing the heterogeneity of Indian urban mobility directly from operational traffic-camera streams, UVH-26 addresses a critical gap in existing global benchmarks, and offers a foundation for advancing detection, classification, and deployment of intelligent transportation systems in emerging nations with complex traffic conditions.
[74] Seeing Across Time and Views: Multi-Temporal Cross-View Learning for Robust Video Person Re-Identification
Md Rashidunnabi,Kailash A. Hambarde,Vasco Lopes,Joao C. Neves,Hugo Proenca
Main category: cs.CV
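TL;DR: Proposes MTF-CVReID, a parameter-efficient video person re-identification framework that addresses viewpoint shifts, scale disparities, and temporal inconsistencies in cross-view (e.g., aerial-ground) settings, achieving SOTA performance on multiple benchmarks.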
Details
Motivation: Video-based cross-view person re-identification remains an open problem because of extreme viewpoint shifts, scale disparities, and temporal inconsistencies.
Method: Seven lightweight modules are added on top of ViT-B/16: Cross-Stream Feature Normalization (CSFN), Multi-Resolution Feature Harmonization (MRFH), Identity-Aware Memory Module (IAMM), Temporal Dynamics Modeling (TDM), Inter-View Feature Alignment (IVFA), Hierarchical Temporal Pattern Learning (HTPL), and Multi-View Identity Consistency Learning (MVICL), improving cross-view robustness and temporal consistency.
Result: With only about 2 million additional parameters and 0.7 GFLOPs, the model stays real-time at 189 FPS, reaches SOTA on AG-VPReID, and generalizes well across datasets to G2A-VReID and MARS.
Conclusion: Carefully designed adapter modules can significantly improve cross-view video person re-identification without sacrificing computational efficiency.
Abstract: Video-based person re-identification (ReID) in cross-view domains (for example, aerial-ground surveillance) remains an open problem because of extreme viewpoint shifts, scale disparities, and temporal inconsistencies. To address these challenges, we propose MTF-CVReID, a parameter-efficient framework that introduces seven complementary modules over a ViT-B/16 backbone. Specifically, we include: (1) Cross-Stream Feature Normalization (CSFN) to correct camera and view biases; (2) Multi-Resolution Feature Harmonization (MRFH) for scale stabilization across altitudes; (3) Identity-Aware Memory Module (IAMM) to reinforce persistent identity traits; (4) Temporal Dynamics Modeling (TDM) for motion-aware short-term temporal encoding; (5) Inter-View Feature Alignment (IVFA) for perspective-invariant representation alignment; (6) Hierarchical Temporal Pattern Learning (HTPL) to capture multi-scale temporal regularities; and (7) Multi-View Identity Consistency Learning (MVICL) that enforces cross-view identity coherence using a contrastive learning paradigm. Despite adding only about 2 million parameters and 0.7 GFLOPs over the baseline, MTF-CVReID maintains real-time efficiency (189 FPS) and achieves state-of-the-art performance on the AG-VPReID benchmark across all altitude levels, with strong cross-dataset generalization to G2A-VReID and MARS datasets. These results show that carefully designed adapter-based modules can substantially enhance cross-view robustness and temporal consistency without compromising computational efficiency.
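The abstract says only that MVICL uses "a contrastive learning paradigm" and does not name a loss. As an illustration of what such an objective can look like, the sketch below uses InfoNCE, a common contrastive loss, as a stand-in: an aerial-view embedding is pulled toward the ground-view embedding of the same identity and pushed away from other identities. The embeddings and the temperature are invented for the example.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: -log softmax of the positive similarity.

    The positive is assumed to be the same identity seen from another view;
    the negatives are embeddings of other identities.
    """
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, n) / temperature for n in negatives]
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[0]


# Hypothetical unit-norm embeddings.
aerial = [1.0, 0.0]
ground_same_id = [1.0, 0.0]
ground_other_id = [0.0, 1.0]

aligned = info_nce(aerial, ground_same_id, [ground_other_id])
misaligned = info_nce(aerial, ground_other_id, [ground_same_id])
```

The loss is near zero when the two views of one identity agree and grows large when the anchor is closer to a different identity, which is the behavior a cross-view consistency term needs.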
The source code is available at https://github.com/MdRashidunnabi/MTF-CVReID
[75] A Cognitive Process-Inspired Architecture for Subject-Agnostic Brain Visual Decoding
Jingyu Lu,Haonan Wang,Qixiang Zhang,Xiaomeng Li
Main category: cs.CV
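TL;DR: Proposes VCFlow, a new subject-agnostic brain decoding framework that models the ventral-dorsal architecture of the human visual system to reconstruct visual experiences quickly and scalably.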
Details
Motivation: To address the challenges of cross-subject generalization and the complexity of brain signals, and to advance clinical use of subject-agnostic brain decoding.
Method: A hierarchical decoding framework, VCFlow, models the ventral and dorsal pathways of the visual system, disentangling and leveraging features from early visual cortex, the ventral stream, and the dorsal stream, and introduces a feature-level contrastive learning strategy to extract subject-invariant semantic representations.
Result: At an average accuracy loss of only 7%, VCFlow generates each reconstructed video within 10 seconds without retraining, greatly reducing time and computation compared with conventional methods.
Conclusion: VCFlow offers an efficient, fast, and clinically scalable solution for subject-agnostic visual reconstruction.
Abstract: Subject-agnostic brain decoding, which aims to reconstruct continuous visual experiences from fMRI without subject-specific training, holds great potential for clinical applications. However, this direction remains underexplored due to challenges in cross-subject generalization and the complex nature of brain signals. In this work, we propose Visual Cortex Flow Architecture (VCFlow), a novel hierarchical decoding framework that explicitly models the ventral-dorsal architecture of the human visual system to learn multi-dimensional representations. By disentangling and leveraging features from early visual cortex, ventral, and dorsal streams, VCFlow captures diverse and complementary cognitive information essential for visual reconstruction. Furthermore, we introduce a feature-level contrastive learning strategy to enhance the extraction of subject-invariant semantic representations, thereby enhancing subject-agnostic applicability to previously unseen subjects. Unlike conventional pipelines that need more than 12 hours of per-subject data and heavy computation, VCFlow sacrifices only 7% accuracy on average yet generates each reconstructed video in 10 seconds without any retraining, offering a fast and clinically scalable solution. The source code will be released upon acceptance of the paper.
[76] TAUE: Training-free Noise Transplant and Cultivation Diffusion Model
Daichi Nagai,Ryugo Morita,Shunsuke Kitada,Hitoshi Iyatomi
Main category: cs.CV
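TL;DR: This paper proposes TAUE, a training-free noise transplantation and cultivation diffusion model for zero-shot, layer-wise image generation, addressing the inability of existing methods to generate complete and coherent scenes.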
Details
Motivation: Existing text-to-image models output only a single flattened image, which falls short for professional applications needing layer-wise control; existing layered generation methods either depend on fine-tuning with large-scale data or generate only isolated foregrounds, lacking overall scene consistency.
Method: Noise Transplantation and Cultivation (NTC) extracts intermediate latent representations from the foreground and composite generation processes and transplants them into the initial noise used to generate subsequent layers, achieving semantic and structural cross-layer consistency without fine-tuning or auxiliary datasets.
Result: Experiments show performance comparable to fine-tuned methods while maintaining high image quality, with markedly improved layer-wise consistency and support for downstream applications such as complex compositional editing.
Conclusion: TAUE removes the dependence on costly training and proprietary datasets, opening a path to more accessible and controllable generative workflows.
Abstract: Despite the remarkable success of text-to-image diffusion models, their output of a single, flattened image remains a critical bottleneck for professional applications requiring layer-wise control. Existing solutions either rely on fine-tuning with large, inaccessible datasets or are training-free yet limited to generating isolated foreground elements, failing to produce a complete and coherent scene. To address this, we introduce the Training-free Noise Transplantation and Cultivation Diffusion Model (TAUE), a novel framework for zero-shot, layer-wise image generation. Our core technique, Noise Transplantation and Cultivation (NTC), extracts intermediate latent representations from both foreground and composite generation processes, transplanting them into the initial noise for subsequent layers. This ensures semantic and structural coherence across foreground, background, and composite layers, enabling consistent, multi-layered outputs without requiring fine-tuning or auxiliary datasets. Extensive experiments show that our training-free method achieves performance comparable to fine-tuned methods, enhancing layer-wise consistency while maintaining high image quality and fidelity. TAUE not only eliminates costly training and dataset requirements but also unlocks novel downstream applications, such as complex compositional editing, paving the way for more accessible and controllable generative workflows.
[77] Zero-Shot Multi-Animal Tracking in the Wild
Jan Frederik Meier,Timo Lüddecke
Main category: cs.CV
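TL;DR: Proposes a zero-shot multi-animal tracking framework built on vision foundation models, combining a Grounding Dino detector with the SAM 2 tracker to deliver stable, strong performance on multiple datasets without retraining.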
Details
Motivation: Traditional multi-animal tracking requires extensive model fine-tuning and heuristic design for each scenario, which limits generalization; a general method that adapts to new scenarios without retraining is needed.
Method: A Grounding Dino object detector is combined with the Segment Anything Model 2 (SAM 2) tracker and carefully designed heuristics to build a zero-shot multi-animal tracking framework that needs no fine-tuning or hyperparameter adjustment.
Result: The method is validated on ChimpAct, Bird Flock Tracking, AnimalTrack, and a subset of GMOT-40, showing robust and consistent performance across species and environments.
Conclusion: The method achieves zero-shot multi-animal tracking without retraining, shows good potential in diverse scenarios, and demonstrates the feasibility of vision foundation models for animal behavior analysis.
Abstract: Multi-animal tracking is crucial for understanding animal ecology and behavior. However, it remains a challenging task due to variations in habitat, motion patterns, and species appearance. Traditional approaches typically require extensive model fine-tuning and heuristic design for each application scenario. In this work, we explore the potential of recent vision foundation models for zero-shot multi-animal tracking. By combining a Grounding Dino object detector with the Segment Anything Model 2 (SAM 2) tracker and carefully designed heuristics, we develop a tracking framework that can be applied to new datasets without any retraining or hyperparameter adaptation. Evaluations on ChimpAct, Bird Flock Tracking, AnimalTrack, and a subset of GMOT-40 demonstrate strong and consistent performance across diverse species and environments. The code is available at https://github.com/ecker-lab/SAM2-Animal-Tracking.
[78] Robust Face Liveness Detection for Biometric Authentication using Single Image
Poulami Raha,Yeongnam Chae
Main category: cs.CV
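TL;DR: This paper proposes a lightweight CNN framework for detecting print, display, video, and wrap face-spoofing attacks, and builds a new 2D spoof attack dataset of more than 500 videos; experiments show efficient liveness detection in 1-2 seconds on a CPU.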
Details
Motivation: Existing face recognition systems are vulnerable to presentation attacks (e.g., prints and video replays), creating security holes; an efficient, robust liveness detection method is needed to defend against multiple spoofing types.
Method: A lightweight CNN architecture that identifies several types of 2D spoofing attacks (print/display, video replay, wrap) is trained and validated on a newly built dataset of more than 500 videos from 60 subjects.
Result: The framework performs liveness detection in 1-2 seconds on a CPU, with good real-time behavior and accuracy, and a demonstration video shows detection of multiple attack types.
Conclusion: The lightweight CNN effectively defends against common 2D face-spoofing attacks and has practical deployment value for authentication scenarios demanding fast response and security.
Abstract: Biometric technologies are widely adopted in security, legal, and financial systems. Face recognition can authenticate a person based on unique facial features such as shape and texture. However, recent works have demonstrated the vulnerability of Face Recognition Systems (FRS) to presentation attacks. Using spoofing (a.k.a. presentation attacks), a malicious actor can get illegitimate access to secure systems. This paper proposes a novel light-weight CNN framework to identify print/display, video and wrap attacks. The proposed robust architecture provides seamless liveness detection ensuring faster biometric authentication (1-2 seconds on CPU). Further, this paper also presents a newly created 2D spoof attack dataset consisting of more than 500 videos collected from 60 subjects. To validate the effectiveness of this architecture, we provide a demonstration video depicting print/display, video and wrap attack detection approaches. The demo can be viewed in the following link: https://rak.box.com/s/m1uf31fn5amtjp4mkgf1huh4ykfeibaa
[79] Can Visual Input Be Compressed? A Visual Token Compression Benchmark for Large Multimodal Models
Tianfan Peng,Yuntao Du,Pengzhou Ji,Shijie Dong,Kailin Jiang,Mingchuan Ma,Yijun Tian,Jinhe Bi,Qian Li,Wei Du,Feng Xiao,Lizhen Cui
Main category: cs.CV
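TL;DR: This paper presents UniPruneBench, a unified, extensible benchmark for visual token pruning in large multimodal models, covering six ability dimensions, ten datasets, and ten compression algorithms, evaluated on three mainstream LMM families. Key findings: random pruning is a strong baseline, no single method is consistently best, OCR tasks are the most sensitive, and pruning ratio dominates performance degradation.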
Details
Motivation: Evaluation of visual token compression methods is fragmented and inconsistent, with no unified standard, making fair comparison of algorithms in large multimodal models difficult.
Method: A unified benchmark, UniPruneBench, is built with standardized evaluation protocols covering six ability dimensions and ten datasets; ten representative compression algorithms are evaluated on three LMM families (LLaVA-v1.5, Intern-VL3, and Qwen2.5-VL), and system-level metrics such as runtime and prefilling latency are included.
Result: (1) Random pruning is a strong baseline; (2) no method consistently outperforms the others across all scenarios; (3) sensitivity to pruning varies widely across tasks, with OCR the most vulnerable; (4) pruning ratio is the main factor behind performance degradation.
Conclusion: UniPruneBench provides a reliable, comprehensive evaluation basis for future research on efficient visual token compression in multimodal models.
Abstract: Large multimodal models (LMMs) often suffer from severe inference inefficiency due to the large number of visual tokens introduced by image encoders. While recent token compression methods, such as pruning and merging, have shown promise in reducing redundancy, their evaluation remains fragmented and inconsistent. In this work, we present UniPruneBench, a unified and extensible benchmark for visual token pruning in multimodal LLMs. UniPruneBench provides standardized protocols across six ability dimensions and ten datasets, covering ten representative compression algorithms and three families of LMMs (LLaVA-v1.5, Intern-VL3, and Qwen2.5-VL). Beyond task accuracy, it incorporates system-level metrics such as runtime and prefilling latency to provide a holistic view. Our experiments uncover several key findings: (1) random pruning is a surprisingly strong baseline, (2) no single method consistently outperforms others across scenarios, (3) pruning sensitivity varies significantly across tasks, with OCR being most vulnerable, and (4) pruning ratio is the dominant factor governing performance degradation. We believe UniPruneBench will serve as a reliable foundation for future research on efficient multimodal modeling.
[80] Differentiable Hierarchical Visual Tokenization
Marius Aasan,Martine Hjelkrem-Tan,Nico Catalano,Changkyu Choi,Adín Ramírez Rivera
Main category: cs.CV
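TL;DR: Proposes an end-to-end differentiable tokenizer that adapts to image content with pixel-level granularity, is compatible with existing architectures, performs competitively on image classification and dense prediction, and supports out-of-the-box raster-to-vector conversion.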
Details
Motivation: Vision Transformers use fixed patch tokens that ignore the spatial and semantic structure of images, limiting the model's ability to capture fine detail.
Method: A hierarchical model selection approach based on information criteria yields an end-to-end differentiable tokenizer with pixel-level adaptive partitioning that remains backward-compatible with existing architectures, easing the retrofitting of pretrained models.
Result: The method achieves competitive performance on image-level classification and dense prediction and directly supports raster-to-vector conversion.
Conclusion: The learnable tokenizer improves Vision Transformers' grasp of image structure, combines flexibility with compatibility, and broadens their potential across vision tasks.
Abstract: Vision Transformers rely on fixed patch tokens that ignore the spatial and semantic structure of images. In this work, we introduce an end-to-end differentiable tokenizer that adapts to image content with pixel-level granularity while remaining backward-compatible with existing architectures for retrofitting pretrained models. Our method uses hierarchical model selection with information criteria to provide competitive performance in both image-level classification and dense-prediction tasks, and even supports out-of-the-box raster-to-vector conversion.
[81] Modality-Transition Representation Learning for Visible-Infrared Person Re-Identification
Chao Yuan,Zanwu Liu,Guiwei Zhang,Haoxuan Xu,Yujian Zhao,Guanglin Niu,Bo Li
Main category: cs.CV
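TL;DR: Proposes MTRL, a new visible-infrared person re-identification framework that uses an intermediate generated image as a transition from the visible to the infrared modality, aligning cross-modal features effectively without extra parameters and significantly outperforming existing methods on three mainstream datasets.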
Details
Motivation: To address the large gap between the visible and infrared modalities and the failure of existing methods to make full use of intermediate feature representations.
Method: The Modality-Transition Representation Learning (MTRL) framework uses an intermediate generated image as a modality transition and is trained with a modality-transition contrastive loss and a modality-query regularization loss.
Result: Experiments on three typical VI-ReID datasets show that the method significantly and consistently outperforms existing state-of-the-art methods without adding model parameters, keeping the same inference speed as the backbone.
Conclusion: The MTRL framework effectively improves VI-ReID performance, is both efficient and interpretable, and offers a new direction for cross-modal person re-identification.
Abstract: Visible-infrared person re-identification (VI-ReID) can associate pedestrian images across the visible and infrared modalities in practical scenarios with changing background illumination. However, a substantial gap inherently exists between these two modalities. Moreover, existing methods primarily rely on intermediate representations to align cross-modal features of the same person. These intermediate representations are usually created either by generating intermediate images (a kind of data enhancement) or by fusing intermediate features (which adds parameters and lacks interpretability), and neither approach makes good use of the intermediate features. Thus, we propose a novel VI-ReID framework via Modality-Transition Representation Learning (MTRL), with a middle generated image serving as a transmitter from the visible to the infrared modality; this image is fully aligned with the original visible images and similar to the infrared modality. We then train with a modality-transition contrastive loss and a modality-query regularization loss, which align the cross-modal features more effectively. Notably, our proposed framework does not need any additional parameters, so it achieves the same inference speed as the backbone while improving its performance on the VI-ReID task. Extensive experimental results illustrate that our model significantly and consistently outperforms existing SOTAs on three typical VI-ReID datasets.
[82] VidEmo: Affective-Tree Reasoning for Emotion-Centric Video Foundation Models
Zhicheng Zhang,Weicheng Wang,Yongjie Zhu,Wenyu Qin,Pengfei Wan,Di Zhang,Jufeng Yang
Main category: cs.CV
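TL;DR: This paper proposes an affective cues-guided reasoning framework and VidEmo, a family of video emotion foundation models, improves video emotion understanding through two-stage tuning, and builds Emo-CFG, a fine-grained dataset of 2.1 million samples; experiments show leading performance on 15 tasks.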
Details
Motivation: Emotions are dynamic and cue-dependent, and existing methods struggle to give a reasonable account of complex, evolving emotional states; more effective models and data are needed to improve video emotion understanding and reasoning.
Method: A stage-wise, affective cues-guided reasoning framework is proposed, together with VidEmo, a video emotion foundation model designed for emotion reasoning; curriculum emotion learning injects emotion knowledge, affective-tree reinforcement learning optimizes reasoning, and a large-scale instruction dataset, Emo-CFG, supports training and evaluation.
Result: The model performs strongly on 15 face perception tasks, achieving competitive results and substantially advancing video emotion understanding.
Conclusion: VidEmo and Emo-CFG offer an effective solution for video emotion understanding; with a systematic training strategy and high-quality data, they mark a new milestone in explainable emotion reasoning.
Abstract: Understanding and predicting emotion from videos has gathered significant attention in recent studies, driven by advancements in video large language models (VideoLLMs). While advanced methods have made progress in video emotion analysis, the intrinsic nature of emotions poses significant challenges. Emotions are characterized by dynamic and cues-dependent properties, making it difficult to understand complex and evolving emotional states with reasonable rationale. To tackle these challenges, we propose a novel affective cues-guided reasoning framework that unifies fundamental attribute perception, expression analysis, and high-level emotional understanding in a stage-wise manner. At the core of our approach is a family of video emotion foundation models (VidEmo), specifically designed for emotion reasoning and instruction-following. These models undergo a two-stage tuning process: first, curriculum emotion learning for injecting emotion knowledge, followed by affective-tree reinforcement learning for emotion reasoning. Moreover, we establish a foundational data infrastructure and introduce an emotion-centric fine-grained dataset (Emo-CFG) consisting of 2.1M diverse instruction-based samples. Emo-CFG includes explainable emotional question-answering, fine-grained captions, and associated rationales, providing essential resources for advancing emotion understanding tasks. Experimental results demonstrate that our approach achieves competitive performance, setting a new milestone across 15 face perception tasks.
[83] LLEXICORP: End-user Explainability of Convolutional Neural Networks
Vojtěch Kůr,Adam Bajger,Adam Kukučka,Marek Hradil,Vít Musil,Tomáš Brázdil
Main category: cs.CV
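TL;DR: This paper presents LLEXICORP, a modular pipeline that couples concept relevance propagation (CRP) with a multimodal large language model. It automatically names concept prototypes and generates natural-language explanations, lowering the barrier to interpreting deep neural networks.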
Details
Motivation: Existing CRP workflows rely on manual inspection of activation images and manual writing of explanations, which limits scalability and accessibility. An automated, scalable way to generate faithful and understandable model explanations is therefore needed.
Method: Proposes the LLEXICORP framework, which couples CRP with a multimodal large language model. Prompts teach the model the semantics of CRP, and the naming task is separated from the explanation task to preserve faithfulness. Concept names and natural-language explanations are generated automatically.
Result: A qualitative evaluation on ImageNet images with a VGG16 model indicates that the method produces meaningful concept names and layered explanations suited to both experts and non-technical users.
Conclusion: Combining concept-based attribution methods with large language models can substantially improve the interpretability of deep neural networks and supports the development of more transparent AI systems.
Abstract: Convolutional neural networks (CNNs) underpin many modern computer vision systems. With applications ranging from common to critical areas, a need to explain and understand the model and its decisions (XAI) has emerged. Prior works suggest that in the top layers of CNNs, the individual channels can be attributed to classifying human-understandable concepts. Concept relevance propagation (CRP) methods can backtrack predictions to these channels and find images that most activate these channels. However, current CRP workflows are largely manual: experts must inspect activation images to name the discovered concepts and must synthesize verbose explanations from relevance maps, limiting the accessibility of the explanations and their scalability. To address these issues, we introduce Large Language model EXplaIns COncept Relevance Propagation (LLEXICORP), a modular pipeline that couples CRP with a multimodal large language model. Our approach automatically assigns descriptive names to concept prototypes and generates natural-language explanations that translate quantitative relevance distributions into intuitive narratives. To ensure faithfulness, we craft prompts that teach the language model the semantics of CRP through examples and enforce a separation between naming and explanation tasks. The resulting text can be tailored to different audiences, offering low-level technical descriptions for experts and high-level summaries for non-technical stakeholders. We qualitatively evaluate our method on various images from ImageNet on a VGG16 model. Our findings suggest that integrating concept-based attribution methods with large language models can significantly lower the barrier to interpreting deep neural networks, paving the way for more transparent AI systems.
[84] Dynamic Reflections: Probing Video Representations with Text Alignment
Tyler Zhu,Tengda Han,Leonidas Guibas,Viorica Pătrăucean,Maks Ovsjanikov
Main category: cs.CV
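TL;DR: This paper presents the first comprehensive study of video-text representation alignment. It shows that cross-modal alignment depends on the richness of the visual and text data, and it proposes parametric test-time scaling laws with predictive power.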
Details
Motivation: Although image-text alignment has advanced, the temporal nature of video data remains underexplored in this setting. A systematic study of video-text representation alignment is therefore needed to understand the capabilities of modern encoders.
Method: Analyzes how well different video and language encoders align under various combinations of visual and text inputs, proposes parametric test-time scaling laws, and evaluates how alignment correlates with downstream task performance and temporal reasoning.
Result: Cross-modal alignment depends strongly on the richness of the input data; semantic alignment correlates with downstream task performance; temporal reasoning can be assessed through video-text alignment.
Conclusion: Video-text alignment is an effective zero-shot way to assess representations of spatio-temporal data and offers a new analytical perspective on multimodal models.
Abstract: The alignment of representations from different modalities has recently been shown to provide insights on the structural similarities and downstream capabilities of different encoders across diverse data types. While significant progress has been made in aligning images with text, the temporal nature of video data remains largely unexplored in this context. In this work, we conduct the first comprehensive study of video-text representation alignment, probing the capabilities of modern video and language encoders. Our findings reveal several key insights. First, we demonstrate that cross-modal alignment highly depends on the richness of both visual (static images vs. multi-frame videos) and text (single caption vs. a collection) data provided at test time, especially when using state-of-the-art video encoders. We propose parametric test-time scaling laws that capture this behavior and show remarkable predictive power against empirical observations. Second, we investigate the correlation between semantic alignment and performance on both semantic and non-semantic downstream tasks, providing initial evidence that strong alignment against text encoders may be linked to general-purpose video representation and understanding. Finally, we correlate temporal reasoning with cross-modal alignment, providing a challenging test-bed for vision and language models. Overall, our work introduces video-text alignment as an informative zero-shot way to probe the representation power of different encoders for spatio-temporal data. The project page can be found at https://video-prh.github.io/
[85] PercHead: Perceptual Head Model for Single-Image 3D Head Reconstruction & Editing
Antonio Oroz,Matthias Nießner,Tobias Kirschstein
Main category: cs.CV
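TL;DR: This paper proposes PercHead, a unified method for single-image 3D head reconstruction and semantic 3D editing. It combines a dual-branch encoder and a ViT decoder with Gaussian Splatting rendering and introduces a perceptual supervision strategy based on DINOv2 and SAM2.1. It reaches state-of-the-art results in novel-view synthesis and robustness to extreme viewing angles, and it supports intuitive editing with disentangled geometry and appearance through segmentation maps, text, or reference images.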
Details
Motivation: Single-image 3D head reconstruction faces severe occlusions, weak perceptual supervision, and ambiguity in 3D editing. Existing methods fall short in view consistency and editing flexibility, so a unified framework that offers both high-fidelity reconstruction and intuitive semantic editing is needed.
Method: Proposes PercHead, which uses a dual-branch encoder and a ViT-based decoder that lifts 2D features into 3D space through iterative cross-attention, with rendering by Gaussian Splatting. A new perceptual supervision strategy is based on DINOv2 and SAM2.1. Semantic editing is obtained by swapping the encoder and finetuning the network, which disentangles control of geometry (segmentation map) from control of appearance (text or image).
Result: Achieves state-of-the-art performance in novel-view synthesis and is more robust to extreme viewing angles. It supports high-quality semantic 3D editing, where users control geometry by drawing segmentation maps and control appearance style with natural-language or image prompts.
Conclusion: PercHead provides a unified framework for high-quality single-image 3D head reconstruction and semantic editing. Its perceptual supervision strategy and disentangled design improve reconstruction accuracy and editing freedom, making it a useful tool for interactive 3D content creation.
Abstract: We present PercHead, a method for single-image 3D head reconstruction and semantic 3D editing - two tasks that are inherently challenging due to severe view occlusions, weak perceptual supervision, and the ambiguity of editing in 3D space. We develop a unified base model for reconstructing view-consistent 3D heads from a single input image. The model employs a dual-branch encoder followed by a ViT-based decoder that lifts 2D features into 3D space through iterative cross-attention. Rendering is performed using Gaussian Splatting. At the heart of our approach is a novel perceptual supervision strategy based on DINOv2 and SAM2.1, which provides rich, generalized signals for both geometric and appearance fidelity. Our model achieves state-of-the-art performance in novel-view synthesis and exhibits exceptional robustness to extreme viewing angles compared to established baselines. Furthermore, this base model can be seamlessly extended for semantic 3D editing by swapping the encoder and finetuning the network. In this variant, we disentangle geometry and style through two distinct input modalities: a segmentation map to control geometry and either a text prompt or a reference image to specify appearance. We highlight the intuitive and powerful 3D editing capabilities of our model through a lightweight, interactive GUI, where users can effortlessly sculpt geometry by drawing segmentation maps and stylize appearance via natural language or image prompts.
Project Page: https://antoniooroz.github.io/PercHead Video: https://www.youtube.com/watch?v=4hFybgTk4kE
[86] When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought
Yiyang Zhou,Haoqin Tu,Zijun Wang,Zeyu Wang,Niklas Muennighoff,Fan Nie,Yejin Choi,James Zou,Chaorui Deng,Shen Yan,Haoqi Fan,Cihang Xie,Huaxiu Yao,Qinghao Ye
Main category: cs.CV
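TL;DR: MIRA is a new benchmark for evaluating how well models generate intermediate visual images to support reasoning, highlighting the importance of visual thinking for solving complex problems.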
Details
Motivation: Traditional CoT methods rely on text-only reasoning and struggle with tasks that involve complex structures and spatial relationships, so a new evaluation that lets models reason with intermediate images is needed.
Method: Builds a dataset of 546 multimodal problems annotated with intermediate visual images and answers, and proposes a unified three-level evaluation protocol: image and question only, text-only CoT input, and Visual-CoT input (with visual clues and textual prompts).
Result: Existing multimodal large language models perform poorly with text-only prompts, but performance improves by an average relative gain of 33.7% when intermediate visual cues are provided. Expanding the search space or refining textual prompts yields only limited improvement.
Conclusion: Intermediate visual information is critical for complex reasoning tasks, and Visual-CoT clearly outperforms text-only reasoning.
Abstract: We propose MIRA, a new benchmark designed to evaluate models in scenarios where generating intermediate visual images is essential for successful reasoning. Unlike traditional CoT methods that rely solely on text, tasks in MIRA require models to generate and utilize intermediate images - such as sketches, structural diagrams, or path drawings - to guide their reasoning process. This setup closely mirrors how humans solve complex problems through "drawing to think". To this end, MIRA focuses on tasks that are intrinsically challenging and involve complex structures, spatial relationships, or reasoning steps that are difficult to express through language alone. To ensure that our evaluation data is of high quality, we include 546 multimodal problems, annotated with intermediate visual images and final answers. We also propose a unified evaluation protocol for MIRA that spans three levels of evaluation input: direct input with image and question only, text-only CoT input with image and thinking prompts, and Visual-CoT input with both annotated image clues and textual thinking prompts. To probe the upper bound of model capacity on our benchmark, we also report pass@k and majority voting accuracies under different k settings. Experimental results show that existing multimodal large language models, including the strongest private models as well as strong open-weight models, perform poorly when relying solely on textual prompts. However, when intermediate visual cues are provided, model performance improves consistently, yielding an average relative gain of 33.7% across all models and tasks. We also probe the upper bound by expanding the search space and designing textual prompts aligned with Visual-CoT, but both yield only limited improvements compared to our Visual-CoT setting. These results underscore the critical role of imagined visual information in enabling successful reasoning on MIRA.
[87] AI-Generated Image Detection: An Empirical Study and Future Research Directions
Nusrat Tasnim,Kutub Uddin,Khalid Mahmood Malik
Main category: cs.CV
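TL;DR: This paper proposes a unified benchmarking framework for systematically evaluating forensic methods for AI-generated media (especially deepfakes) under controlled and reproducible conditions. An evaluation of ten state-of-the-art methods on seven public datasets reveals substantial differences in generalization and cross-model transferability, and underlines the importance of better explainability and standardized evaluation.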
Details
Motivation: Existing forensic methods for AI-generated media are hard to compare fairly because of non-standardized benchmarks, inconsistent training protocols, and limited evaluation metrics, which restricts their deployment in security-critical applications. A unified evaluation framework is needed to address these problems.
Method: Builds a unified benchmarking framework and systematically evaluates ten state-of-the-art forensic methods (trained from scratch, with frozen features, and fine-tuned) on seven public datasets generated by generative adversarial networks (GANs) and diffusion models. The evaluation combines several performance metrics (accuracy, average precision, ROC-AUC, error rate, and class-wise sensitivity) with interpretability analyses (confidence curves and Grad-CAM heatmaps).
Result: Methods perform well in-distribution, but performance drops markedly under cross-model transfer. Performance differs substantially across training strategies and data sources, and some methods lack stability and interpretability.
Conclusion: Current multimedia forensic methods remain limited in generalization and explainability. The study offers a reference for developing more robust, generalizable, and explainable forensic techniques and encourages the community to adopt standardized evaluation procedures.
Abstract: The threats posed by AI-generated media, particularly deepfakes, are now raising significant challenges for multimedia forensics, misinformation detection, and biometric systems, resulting in erosion of public trust in the legal system, a significant increase in fraud, and social engineering attacks. Although several forensic methods have been proposed, they suffer from three critical gaps: (i) use of non-standardized benchmarks with GAN- or diffusion-generated images, (ii) inconsistent training protocols (e.g., scratch, frozen, fine-tuning), and (iii) limited evaluation metrics that fail to capture generalization and explainability. These limitations hinder fair comparison, obscure true robustness, and restrict deployment in security-critical applications. This paper introduces a unified benchmarking framework for systematic evaluation of forensic methods under controlled and reproducible conditions. We benchmark ten SoTA forensic methods (scratch, frozen, and fine-tuned) and seven publicly available datasets (GAN and diffusion) to perform extensive and systematic evaluations. We evaluate performance using multiple metrics, including accuracy, average precision, ROC-AUC, error rate, and class-wise sensitivity. We further analyze model interpretability using confidence curves and Grad-CAM heatmaps. Our evaluations demonstrate substantial variability in generalization, with certain methods exhibiting strong in-distribution performance but degraded cross-model transferability. This study aims to guide the research community toward a deeper understanding of the strengths and limitations of current forensic approaches, and to inspire the development of more robust, generalizable, and explainable solutions.
[88] PLUTO-4: Frontier Pathology Foundation Models
Harshith Padigela,Shima Nofallah,Atchuth Naveen Chilaparasetti,Ryun Han,Andrew Walker,Judy Shen,Chintan Shah,Blake Martin,Aashish Sood,Elliot Miller,Ben Glass,Andy Beck,Harsha Pokkalla,Syed Ashar Javed
Main category: cs.CV
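TL;DR: PLUTO-4 is a new generation of pathology foundation models comprising the compact PLUTO-4S and the frontier-scale PLUTO-4G. Pretrained on a large multi-institutional dataset, the models reach state-of-the-art performance on a range of pathology tasks.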
Details
Motivation: Improving the generalization and practical usefulness of pathology image analysis calls for larger and more general-purpose foundation models.
Method: Uses two Vision Transformer architectures, PLUTO-4S and PLUTO-4G, optimized for efficient deployment and for maximum representation capacity respectively. Both are pretrained with a self-supervised objective derived from DINOv2 on a large multi-institutional corpus of about 550,000 WSIs.
Result: PLUTO-4 performs strongly on multiple public and internal benchmarks. PLUTO-4S is suited to high-throughput deployment, and PLUTO-4G sets new records on several tasks, including an 11% improvement in dermatopathology diagnosis.
Conclusion: PLUTO-4 transfers well and has broad application potential as a backbone for translational research and clinical diagnosis.
Abstract: Foundation models trained on large-scale pathology image corpora have demonstrated strong transfer capabilities across diverse histopathology tasks. Building on this progress, we introduce PLUTO-4, our next generation of pathology foundation models that extend the Pathology-Universal Transformer (PLUTO) to frontier scale. We share two complementary Vision Transformer architectures in the PLUTO-4 family: a compact and efficient PLUTO-4S model optimized for multi-scale deployment using a FlexiViT setup with 2D-RoPE embeddings, and a frontier-scale PLUTO-4G model trained with a single patch size to maximize representation capacity and stability. Both models are pretrained using a self-supervised objective derived from DINOv2 on a large multi-institutional corpus containing 551,164 WSIs from 137,144 patients across over 50 institutions, spanning over 60 disease types and over 100 stains. Comprehensive evaluation across public and internal benchmarks demonstrates that PLUTO-4 achieves state-of-the-art performance on tasks requiring varying spatial and biological context, including patch-level classification, segmentation, and slide-level diagnosis. The compact PLUTO-4S provides high-throughput and robust performance for practical deployment, while PLUTO-4G establishes new performance frontiers across multiple pathology benchmarks, including an 11% improvement in dermatopathology diagnosis. These diverse improvements underscore PLUTO-4's potential to transform real-world applications as a backbone for translational research and diagnostic use cases.
[89] Densemarks: Learning Canonical Embeddings for Human Heads Images via Point Tracks
Dmitrii Pozdeev,Alexey Artemov,Ananta R. Bhattarai,Artem Sevastopolsky
Main category: cs.CV
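TL;DR: Proposes DenseMarks, a learned representation for dense correspondences in human head images. A Vision Transformer predicts a 3D embedding for each pixel, and a contrastive loss combined with multi-task learning yields representations that are robust and consistent across poses and individuals.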