Table of Contents
cs.CL
[1] Relational graph-driven differential denoising and diffusion attention fusion for multimodal conversation emotion recognition
Ying Liu,Yuntao Shou,Wei Ai,Tao Meng,Keqin Li
Main category: cs.CL
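TL;DR: This paper proposes a relation-aware denoising and diffusion attention fusion model for multimodal conversation emotion recognition (MCER). It combines differential Transformer denoising, intra-modal and cross-modal relation subgraphs, and a text-guided cross-modal diffusion mechanism to improve the robustness and semantic consistency of multimodal emotion recognition in noisy environments.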
Details
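Motivation: In real-world scenarios, audio and video signals are easily affected by environmental noise and limited acquisition conditions, so the extracted features are noisy. Data quality and information-carrying capacity are also imbalanced across modalities, which causes information distortion and weight bias during fusion. Existing methods overlook the impact of noisy modalities and do not explicitly model the dominant role of the textual modality in emotion understanding.
Method: (1) A differential Transformer explicitly computes the difference between two attention maps, enhancing temporally consistent information and suppressing irrelevant noise to denoise the audio and video modalities. (2) Modality-specific and cross-modal relation subgraphs capture speaker-dependent emotional dependencies, modeling intra- and inter-modal relations at a fine granularity. (3) A text-guided cross-modal diffusion mechanism uses self-attention to model intra-modal dependencies and adaptively diffuses audiovisual information into the textual stream for more robust, semantically aligned fusion.
Result: The proposed model significantly improves MCER performance in noisy environments and strengthens the robustness and semantic consistency of modality fusion, highlighting in particular the dominant role of the textual modality.
Conclusion: The method mitigates the information distortion and weight bias caused by noise and modality imbalance in multimodal emotion recognition, offering a new approach to robust multimodal fusion in complex real-world scenarios.
Abstract: In real-world scenarios, audio and video signals are often subject to environmental noise and limited acquisition conditions, resulting in extracted features containing excessive noise. Furthermore, there is an imbalance in data quality and information carrying capacity between different modalities. These two issues together lead to information distortion and weight bias during the fusion phase, impairing overall recognition performance. Most existing methods neglect the impact of noisy modalities and rely on implicit weighting to model modality importance, thereby failing to explicitly account for the predominant contribution of the textual modality in emotion understanding. To address these issues, we propose a relation-aware denoising and diffusion attention fusion model for MCER. Specifically, we first design a differential Transformer that explicitly computes the differences between two attention maps, thereby enhancing temporally consistent information while suppressing time-irrelevant noise, which leads to effective denoising in both audio and video modalities. Second, we construct modality-specific and cross-modality relation subgraphs to capture speaker-dependent emotional dependencies, enabling fine-grained modeling of intra- and inter-modal relationships. Finally, we introduce a text-guided cross-modal diffusion mechanism that leverages self-attention to model intra-modal dependencies and adaptively diffuses audiovisual information into the textual stream, ensuring more robust and semantically aligned multimodal fusion.
[2] RealChart2Code: Advancing Chart-to-Code Generation with Real Data and Multi-Task Evaluation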
Jiajun Zhang,Yuying Li,Zhixun Li,Xingyu Guo,Jingzhuo Wu,Leqi Zheng,Yiran Yang,Jianke Zhang,Qingbin Li,Shannan Yan,Zhetong Li,Changguo Jia,Junfei Wu,Zilei Wang,Qiang Liu,Liang Wang
Main category: cs.CL
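TL;DR: This paper introduces RealChart2Code, a benchmark for evaluating how well vision-language models (VLMs) generate complex multi-panel charts from real data. Existing VLMs degrade markedly on such tasks, which reveals their limitations.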
Details
Motivation: Existing VLMs perform well at code generation, but their ability to replicate complex, multi-panel visualizations from real-world data has not been systematically evaluated, leaving a research gap.
Method: The authors build RealChart2Code (2,800+ instances), which they describe as the first large-scale chart generation benchmark grounded in real datasets that also supports iterative code refinement in multi-turn conversation, and evaluate 14 leading VLMs on it.
Result: VLM performance on RealChart2Code is significantly lower than on simpler benchmarks; proprietary models clearly outperform open-weight models; even state-of-the-art VLMs often fail to generate complex multi-panel charts accurately.
Conclusion: The study exposes key limitations of current VLMs on real-data analysis and complex chart generation, and provides direction and a benchmark tool for future research.
Abstract: Vision-Language Models (VLMs) have demonstrated impressive capabilities in code generation across various domains. However, their ability to replicate complex, multi-panel visualizations from real-world data remains largely unassessed. To address this gap, we introduce RealChart2Code, a new large-scale benchmark with over 2,800 instances grounded in authentic datasets and featuring tasks with clear analytical intent. Crucially, it is the first benchmark to systematically evaluate chart generation from large-scale raw data and assess iterative code refinement in a multi-turn conversational setting. Our comprehensive evaluation of 14 leading VLMs on RealChart2Code reveals significant performance degradation compared to simpler benchmarks, highlighting their struggles with complex plot structures and authentic data. Our analysis uncovers a substantial performance gap between proprietary and open-weight models and confirms that even state-of-the-art VLMs often fail to accurately replicate intricate, multi-panel charts. These findings provide valuable insights into the current limitations of VLMs and guide future research directions. We release the benchmark and code at https://github.com/Speakn0w/RealChart2Code.
[3] Doctorina MedBench: End-to-End Evaluation of Agent-Based Medical AI
Anna Kozlova,Stanislau Salavei,Pavel Satalkin,Hanna Plotnitskaya,Sergey Parfenyuk
Main category: cs.CL
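TL;DR: Doctorina MedBench is an evaluation framework for agent-based medical AI built on simulated realistic physician-patient interactions. It models multi-step clinical dialogue, uses the D.O.T.S. metric to assess clinical correctness and dialogue efficiency, and supports quality monitoring, safety testing, and skill development.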
Details
Motivation: Traditional medical benchmarks rely on standardized test questions and struggle to reflect real clinical reasoning ability; an evaluation approach closer to actual clinical practice is needed.
Method: A multi-step clinical dialogue evaluation framework based on simulated physician-patient interaction, covering history taking, analysis of attached materials, differential diagnosis, and personalized recommendations. It introduces the four-component D.O.T.S. metric (Diagnosis, Observations/Investigations, Treatment, Step Count) and a multi-level testing and quality monitoring architecture with trap cases, category-based sampling, and regression testing.
Result: The framework contains 1,000+ clinical cases covering 750+ diagnoses. The results suggest that clinical dialogue simulation may assess clinical competence more realistically than examination-style benchmarks. The metrics are general enough to evaluate AI systems and physicians and to support clinical reasoning training.
Conclusion: Doctorina MedBench offers a more realistic, comprehensive, safety-oriented, and scalable evaluation paradigm for medical AI, with both clinical utility and educational value.
Abstract: We present Doctorina MedBench, a comprehensive evaluation framework for agent-based medical AI based on the simulation of realistic physician-patient interactions. Unlike traditional medical benchmarks that rely on solving standardized test questions, the proposed approach models a multi-step clinical dialogue in which either a physician or an AI system must collect medical history, analyze attached materials (including laboratory reports, images, and medical documents), formulate differential diagnoses, and provide personalized recommendations. System performance is evaluated using the D.O.T.S. metric, which consists of four components: Diagnosis, Observations/Investigations, Treatment, and Step Count, enabling assessment of both clinical correctness and dialogue efficiency. The system also incorporates a multi-level testing and quality monitoring architecture designed to detect model degradation during both development and deployment. The framework supports safety-oriented trap cases, category-based random sampling of clinical scenarios, and full regression testing. The dataset currently contains more than 1,000 clinical cases covering over 750 diagnoses. The universality of the evaluation metrics allows the framework to be used not only to assess medical AI systems, but also to evaluate physicians and support the development of clinical reasoning skills. Our results suggest that simulation of clinical dialogue may provide a more realistic assessment of clinical competence compared to traditional examination-style benchmarks.
[4] Gradient-Informed Training for Low-Resource Multilingual Speech Translation
Ruiyan Sun,Satoshi Nakamura
Main category: cs.CL
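TL;DR: This paper proposes a method that uses gradient information to automatically determine layer-wise sharing patterns, mitigating representation conflicts in low-resource multilingual speech-to-text translation.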
Details
Motivation: In low-resource multilingual speech-to-text translation, uniform architectural sharing across languages often causes representation conflicts that impede convergence.
Method: Training gradient information is analyzed with three strategies: distance-based language clustering, self/cross-task divergence metrics for capacity allocation, and joint factorization with canonical correlation analysis for subspace alignment.
Result: Extensive experiments on four language pairs (using the SeamlessM4T-Medium architecture) show consistent improvements in translation quality metrics.
Conclusion: Gradient-driven, layer-specific sharing strategies can mitigate representation conflicts in multilingual speech translation and improve model performance.
Abstract: In low-resource multilingual speech-to-text translation, uniform architectural sharing across languages frequently introduces representation conflicts that impede convergence. This work proposes a principled methodology to automatically determine layer-specific sharing patterns by mining training gradient information. Our approach employs three distinct analysis strategies: distance-based language clustering, self/cross-task divergence metrics for capacity allocation, and joint factorization coupled with canonical correlation analysis for subspace alignment. Extensive evaluation across four language pairs (using the SeamlessM4T-Medium architecture) demonstrates consistent improvements in translation quality metrics.
[5] Methods for Knowledge Graph Construction from Text Collections: Development and Applications
Vanni Zavarella
Main category: cs.CL
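TL;DR: This thesis explores how natural language processing, machine learning, and generative AI methods, combined with Semantic Web best practices, can automatically construct knowledge graphs from large text corpora. It validates the approach experimentally in three use cases: analysis of Digital Transformation discourse, research trend analysis in the AECO domain, and generation of biomedical causal relation graphs.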
Details
Motivation: The surge of unstructured text data brings both opportunities and challenges for knowledge extraction, and calls for knowledge graph construction methods that are scalable, cross-domain, explainable, and interoperable.
Method: The thesis combines natural language processing, machine learning, and generative AI techniques, following Semantic Web best practices, to study and experimentally apply methods for automatic knowledge graph construction.
Result: Knowledge graphs were built in three concrete use cases, yielding benchmark evaluation results, customized algorithms, knowledge graph data resources, and data analyses built on top of them.
Conclusion: The study demonstrates the effectiveness and practicality of an AI-driven, Semantic Web-enabled knowledge graph construction paradigm for extracting knowledge from text across domains, advancing explainable and interoperable knowledge infrastructure.
Abstract: Virtually every sector of society is experiencing a dramatic growth in the volume of unstructured textual data that is generated and published, from news and social media online interactions, through open access scholarly communications and observational data in the form of digital health records and online drug reviews. The volume and variety of data across all this range of domains has created both unprecedented opportunities and pressing challenges for extracting actionable knowledge for several application scenarios. However, the extraction of rich semantic knowledge demands the deployment of scalable and flexible automatic methods adaptable across text genres and schema specifications. Moreover, the full potential of these data can only be unlocked by coupling information extraction methods with Semantic Web techniques for the construction of full-fledged Knowledge Graphs that are semantically transparent, explainable by design and interoperable. In this thesis, we experiment with the application of Natural Language Processing, Machine Learning and Generative AI methods, powered by Semantic Web best practices, to the automatic construction of Knowledge Graphs from large text corpora, in three use case applications: the analysis of the Digital Transformation discourse in the global news and social media platforms; the mapping and trend analysis of recent research in the Architecture, Engineering, Construction and Operations domain from a large corpus of publications; the generation of causal relation graphs of biomedical entities from electronic health records and patient-authored drug reviews. The contributions of this thesis to the research community are in terms of benchmark evaluation results, the design of customized algorithms and the creation of data resources in the form of Knowledge Graphs, together with data analysis results built on top of them.
[6] Density-aware Soft Context Compression with Semi-Dynamic Compression Ratio
Yijiong Yu,Shuai Yuan,Jie Zheng,Huazheng Wang,Ji Pei
Main category: cs.CL
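TL;DR: This paper proposes a semi-dynamic context compression framework whose density-aware selector chooses among discrete compression ratios. It addresses the inability of the uniform compression ratio used by existing soft context compression methods to adapt to variation in natural language information density.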
Details
Motivation: Existing soft context compression methods use a fixed compression ratio and ignore the large variation in information density of natural language, while directly using an input-dependent continuous ratio makes models hard to train and run stably.
Method: A semi-dynamic context compression framework built around a Discrete Ratio Selector, which predicts a compression target from the text's intrinsic information density and quantizes it to a predefined set of discrete compression ratios. The selector is trained jointly with the compressor on synthetic data, using summary length as a proxy label for the compression ratio.
Result: Across multiple evaluation tasks, the density-aware framework (with mean pooling as the backbone) consistently outperforms static baselines and establishes a robust Pareto frontier for context compression.
Conclusion: A discretized, density-aware ratio selection mechanism balances compression performance with model stability and offers a new approach to efficient long-context processing.
Abstract: Soft context compression reduces the computational workload of processing long contexts in LLMs by encoding long context into a smaller number of latent tokens. However, existing frameworks apply uniform compression ratios, failing to account for the extreme variance in natural language information density. While adopting a density-aware dynamic compression ratio seems intuitive, empirical investigations reveal that models struggle intrinsically with operations parameterized by input-dependent, continuous structural hyperparameters. To resolve this pitfall, we introduce the Semi-Dynamic Context Compression framework. Our approach features a Discrete Ratio Selector, which predicts a compression target based on intrinsic information density and quantizes it to a predefined set of discrete compression ratios. It is efficiently jointly trained with the compressor on synthetic data, with the summary lengths as a proxy to create labels for compression ratio prediction. Extensive evaluations confirm that our density-aware framework, utilizing mean pooling as the backbone, consistently outperforms static baselines, establishing a robust Pareto frontier for context compression techniques. Our code, data and model weights are available at https://github.com/yuyijiong/semi-dynamic-context-compress
[7] Can Small Models Reason About Legal Documents? A Comparative Study
Snehit Vaddi
Main category: cs.CL
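TL;DR: This paper evaluates the practicality of language models with fewer than 10B parameters on legal tasks. A Mixture-of-Experts model activating only 3B parameters matches GPT-4o-mini, and architecture and training quality matter more than parameter count. Few-shot prompting is the most robust strategy, BM25 and dense retrieval perform similarly for RAG, and all experiments cost only $62.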
Details
Motivation: Frontier models raise concerns about cost, latency, and data privacy in legal applications, which motivates the search for lighter, practical alternatives.
Method: Nine sub-10B models are tested on three legal benchmarks (ContractNLI, CaseHOLD, and ECtHR) with five prompting strategies (direct, chain-of-thought, few-shot, BM25 RAG, and dense RAG), for 405 experiments in total with three random seeds per configuration, all run via cloud APIs.
Result: The MoE model with 3B active parameters matches GPT-4o-mini in mean accuracy and surpasses it on legal holding identification; the 9B model performs worst overall; the effect of chain-of-thought prompting depends on the task; few-shot prompting is the most consistent; BM25 and dense RAG yield near-identical results; the total cost of the experiments was $62.
Conclusion: Sub-10B models have practical potential for legal tasks; architecture and training quality matter more than parameter scale; few-shot prompting is the preferred strategy; the RAG bottleneck lies in the model's use of retrieved context rather than in retrieval itself; low-cost cloud APIs suffice for rigorous LLM evaluation.
Abstract: Large language models show promise for legal applications, but deploying frontier models raises concerns about cost, latency, and data privacy. We evaluate whether sub-10B parameter models can serve as practical alternatives by testing nine models across three legal benchmarks (ContractNLI, CaseHOLD, and ECtHR) using five prompting strategies (direct, chain-of-thought, few-shot, BM25 RAG, and dense RAG). Across 405 experiments with three random seeds per configuration, we find that a Mixture-of-Experts model activating only 3B parameters matches GPT-4o-mini in mean accuracy while surpassing it on legal holding identification, and that architecture and training quality matter more than raw parameter count. Our largest model (9B parameters) performs worst overall. Chain-of-thought prompting proves sharply task-dependent, improving contract entailment but degrading multiple-choice legal reasoning, while few-shot prompting emerges as the most consistently effective strategy. Comparing BM25 and dense retrieval for RAG, we find near-identical results, suggesting the bottleneck lies in the language model's utilization of retrieved context rather than retrieval quality. All experiments were conducted via cloud inference APIs at a total cost of $62, demonstrating that rigorous LLM evaluation is accessible without dedicated GPU infrastructure.
[8] When Chain-of-Thought Backfires: Evaluating Prompt Sensitivity in Medical Language Models
Binesh Sadanandan,Vahid Behzadan
Main category: cs.CL
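TL;DR: This paper studies how sensitive a medical large language model (MedGemma) is to changes in prompt format. Common prompt engineering techniques such as chain-of-thought and few-shot prompting hurt performance on medical tasks, while alternatives such as cloze scoring and permutation voting perform better.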
Details
Motivation: LLMs are increasingly deployed in medical settings, but their sensitivity to prompt formatting has not been systematically characterized, and it is unclear whether general-purpose prompting strategies carry over to specialized domains.
Method: MedGemma (4B and 27B) is put through a suite of robustness tests on MedMCQA and PubMedQA, including chain-of-thought prompting, few-shot examples, answer option shuffling, and context truncation (front and back), and is compared against two alternative inference strategies, cloze scoring and permutation voting.
Result: Chain-of-thought lowers accuracy by 5.7%; few-shot examples lower accuracy by 11.9% and raise position bias from 0.14 to 0.47; shuffling answer options changes 59.1% of predictions, with accuracy dropping by up to 27.4 percentage points; front truncation severely harms performance, while back truncation preserves 97% of full-context accuracy; cloze scoring reaches 51.8% (4B) and 64.5% (27B), beating all prompting strategies; permutation voting recovers a further 4 percentage points.
Conclusion: Prompt engineering methods validated on general-purpose models do not transfer to medical-specific LLMs; the model's internal knowledge (as reflected in log-probabilities) exceeds what its generated text shows; cloze scoring and permutation voting are more reliable and robust alternatives.
Abstract: Large Language Models (LLMs) are increasingly deployed in medical settings, yet their sensitivity to prompt formatting remains poorly characterized. We evaluate MedGemma (4B and 27B parameters) on MedMCQA (4,183 questions) and PubMedQA (1,000 questions) across a broad suite of robustness tests. Our experiments reveal several concerning findings. Chain-of-Thought (CoT) prompting decreases accuracy by 5.7% compared to direct answering. Few-shot examples degrade performance by 11.9% while increasing position bias from 0.14 to 0.47. Shuffling answer options causes the model to change predictions 59.1% of the time, with accuracy dropping up to 27.4 percentage points. Front-truncating context to 50% causes accuracy to plummet below the no-context baseline, yet back-truncation preserves 97% of full-context accuracy. We further show that cloze scoring (selecting the highest log-probability option token) achieves 51.8% (4B) and 64.5% (27B), surpassing all prompting strategies and revealing that models "know" more than their generated text shows. Permutation voting recovers 4 percentage points over single-ordering inference. These results demonstrate that prompt engineering techniques validated on general-purpose models do not transfer to domain-specific medical LLMs, and that reliable alternatives exist.
[9] MemoryCD: Benchmarking Long-Context User Memory of LLM Agents for Lifelong Cross-Domain Personalization
Weizhi Zhang,Xiaokai Wei,Wei-Chieh Huang,Zheng Hui,Chen Wang,Michelle Gong,Philip S. Yu
Main category: cs.CL
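TL;DR: This paper introduces MemoryCD, described as the first large-scale, user-centric, cross-domain memory benchmark. It is built from real user behavior in the Amazon Review dataset and evaluates large language models on long-context memory tasks.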
Details
Motivation: Existing memory benchmarks for LLMs are limited to synthetic short-session dialogues and do not evaluate memory of real users' long-term, cross-domain behavior.
Method: MemoryCD is built from the Amazon Review dataset, with a multi-faceted long-context memory evaluation pipeline covering 14 state-of-the-art LLMs and 6 memory method baselines, tested on 4 personalization tasks across 12 domains.
Result: Current memory methods remain far from user satisfaction across domains; MemoryCD serves as the first testbed for cross-domain lifelong personalization evaluation.
Conclusion: MemoryCD fills the gap in evaluating long-term, cross-domain memory of real users and provides a foundation and new directions for future memory modeling and personalization research.
Abstract: Recent advancements in Large Language Models (LLMs) have expanded context windows to million-token scales, yet benchmarks for evaluating memory remain limited to short-session synthetic dialogues. We introduce MemoryCD, the first large-scale, user-centric, cross-domain memory benchmark derived from lifelong real-world behaviors in the Amazon Review dataset. Unlike existing memory datasets that rely on scripted personas to generate synthetic user data, MemoryCD tracks authentic user interactions across years and multiple domains. We construct a multi-faceted long-context memory evaluation pipeline of 14 state-of-the-art LLM base models with 6 memory method baselines on 4 distinct personalization tasks over 12 diverse domains to evaluate an agent's ability to simulate real user behaviors in both single and cross-domain settings. Our analysis reveals that existing memory methods are far from user satisfaction in various domains, offering the first testbed for cross-domain life-long personalization evaluation.
[10] Toward Culturally Grounded Natural Language Processing
Sina Bagheri Nezhad
Main category: cs.CL
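TL;DR: This paper surveys more than 50 multilingual NLP studies from 2020–2026 and argues that multilingual capability is not the same as cultural competence. Training data coverage is important but not sufficient: tokenization, prompt language, and benchmark design all significantly affect cultural fit. The paper calls for moving from evaluating languages in isolation to modeling "communicative ecologies" and proposes a research agenda for culturally grounded NLP.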
Details
Motivation: Progress in multilingual NLP is often misread as greater global inclusivity, yet a large body of work shows that multilingual capability and cultural competence come apart. This calls for a systematic synthesis and a shift toward culturally sensitive evaluation and modeling.
Method: A systematic review and thematic synthesis of 50+ papers from 2020–2026 covering multilingual performance inequality, cross-lingual transfer, culture-aware evaluation, cultural alignment, multimodal local-knowledge modeling, benchmark critiques, and community-driven data practices.
Result: Training data coverage alone is not sufficient to determine performance; tokenization, prompt language, translated benchmark design, culturally specific supervision, and multimodal context all materially affect performance on culture-related tasks. Several recent benchmarks (e.g., Global-MMLU and CulturalBench) continue to reveal cultural misreadings and performance drops by mainstream multilingual models in low-resource or community-specific settings.
Conclusion: The field should stop treating languages as isolated rows in an evaluation table and instead model the "communicative ecologies" in which language is actually used. On that basis, the paper proposes a research agenda centered on richer contextual metadata, culturally stratified evaluation, participatory alignment, within-language variation, and multimodal community-aware design.
Abstract: Recent progress in multilingual NLP is often taken as evidence of broader global inclusivity, but a growing literature shows that multilingual capability and cultural competence come apart. This paper synthesizes over 50 papers from 2020–2026 spanning multilingual performance inequality, cross-lingual transfer, culture-aware evaluation, cultural alignment, multimodal local-knowledge modeling, benchmark design critiques, and community-grounded data practices. Across this literature, training data coverage remains a strong determinant of performance, yet it is not sufficient: tokenization, prompt language, translated benchmark design, culturally specific supervision, and multimodal context all materially affect outcomes. Recent work on Global-MMLU, CDEval, WorldValuesBench, CulturalBench, CULEMO, CulturalVQA, GIMMICK, DRISHTIKON, WorldCuisines, CARE, CLCA, and newer critiques of benchmark design and community-grounded evaluation shows that strong multilingual models can still flatten local norms, misread culturally grounded cues, and underperform in lower-resource or community-specific settings. We argue that the field should move from treating languages as isolated rows in a benchmark spreadsheet toward modeling communicative ecologies: the institutions, scripts, translation pipelines, domains, modalities, and communities through which language is used. On that basis, we propose a research agenda for culturally grounded NLP centered on richer contextual metadata, culturally stratified evaluation, participatory alignment, within-language variation, and multimodal community-aware design.
[11] AgentCollab: A Self-Evaluation-Driven Collaboration Paradigm for Efficient LLM Agents
Wenbo Gao,Renxi Liu,Xian Wang,Fang Guo,Shuai Yang,Xi Chen,Hui-Ling Zhen,Hanting Chen,Weizhe Lin,Xiaosong Li,Yaoyuan Wang
Main category: cs.CL
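TL;DR: This paper proposes AgentCollab, a framework that dynamically coordinates LLMs of different capabilities through self-driven collaborative inference, improving execution efficiency while preserving reasoning robustness.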
Details
Motivation: LLM agents face a fundamental trade-off between execution efficiency and reasoning robustness in long-horizon reasoning and tool interaction.
Method: AgentCollab, a self-driven collaborative inference framework, uses the agent's own self-reflection signal to judge whether reasoning is progressing and escalates to a stronger model only when necessary. A difficulty-aware cumulative escalation strategy allocates additional reasoning budget based on recent failure signals.
Result: On multi-step agent benchmarks, AgentCollab consistently improves the accuracy-efficiency Pareto frontier of LLM agents.
Conclusion: AgentCollab achieves efficient, robust multi-model collaboration without an external routing module, offering a new paradigm for building cost-controlled, reliable autonomous agents.
Abstract: Autonomous agents powered by large language models (LLMs) perform complex tasks through long-horizon reasoning and tool interaction, where a fundamental trade-off arises between execution efficiency and reasoning robustness. Models at different capability-cost levels offer complementary advantages: lower-cost models enable fast execution but may struggle on difficult reasoning segments, while stronger models provide more robust reasoning at higher computational cost. We present AgentCollab, a self-driven collaborative inference framework that dynamically coordinates models with different reasoning capacities during agent execution. Instead of relying on external routing modules, the framework uses the agent's own self-reflection signal to determine whether the current reasoning trajectory is making meaningful progress, and escalates control to a stronger reasoning tier only when necessary. To further stabilize long-horizon execution, we introduce a difficulty-aware cumulative escalation strategy that allocates additional reasoning budget based on recent failure signals. In our experiments, we instantiate this framework using a two-level small-large model setting. Experiments on diverse multi-step agent benchmarks show that AgentCollab consistently improves the accuracy-efficiency Pareto frontier of LLM agents.
[12] Retrieval-Augmented Generation Based Nurse Observation Extraction
Kyomin Hwang,Nojun Kwak
Main category: cs.CL
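TL;DR: This paper proposes an automated pipeline based on retrieval-augmented generation (RAG) that extracts clinical observations from nurse dictations to reduce nurses' workload, achieving an F1-score of 0.796 on the MEDIQA-SYNUR dataset.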
Details
Motivation: Reduce nurses' workload in recording clinical observations, using large language models to increase automation in the medical domain.
Method: A RAG-based automated method for extracting clinical observations from nurse dictation text.
Result: An F1-score of 0.796 on the MEDIQA-SYNUR test set.
Conclusion: The RAG-driven pipeline is effective for clinical observation extraction and shows potential for practical use.
Abstract: Recent advancements in Large Language Models (LLMs) have played a significant role in reducing human workload across various domains, a trend that is increasingly extending into the medical field. In this paper, we propose an automated pipeline designed to alleviate the burden on nurses by automatically extracting clinical observations from nurse dictations. To ensure accurate extraction, we introduce a method based on Retrieval-Augmented Generation (RAG). Our approach demonstrates effective performance, achieving an F1-score of 0.796 on the MEDIQA-SYNUR test dataset.
[13] I Want to Believe (but the Vocabulary Changed): Measuring the Semantic Structure and Evolution of Conspiracy Theories
Manisha Keim,Sarmad Chandio,Osama Khalid,Rishab Nithyanand
Main category: cs.CL
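TL;DR: By analyzing about 170 million comments posted over a decade in Reddit's r/politics subreddit with aligned word embeddings, this paper treats conspiracy theories as semantic objects and systematically tracks how their semantic structure evolves. It reveals non-uniform patterns of change (stability, expansion, contraction, and replacement) that go beyond what keyword-based methods capture.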
Details
Motivation: Existing research focuses on belief formation, exposure, and diffusion of conspiracy theories and pays less attention to how their meanings change over time. Related terms are often treated as static lexical markers, which makes it hard to separate genuine semantic change from surface-level vocabulary change.
Method: Using 169.9M comments from Reddit's r/politics subreddit spanning 2012–2022, the authors first show that conspiracy-related language forms coherent, distinguishable regions of semantic space, then use aligned word embeddings to compare semantic neighborhoods across periods and track their evolution.
Result: Conspiracy theories evolve non-uniformly, showing patterns of semantic stability, expansion, contraction, and replacement that keyword-only methods do not capture.
Conclusion: Conspiracy theories should be modeled as dynamic semantic objects. Semantic modeling methods such as aligned word embeddings characterize their evolution over time more accurately and offer a new route to understanding meaning change in online political discourse.
Abstract: Research on conspiracy theories has largely focused on belief formation, exposure, and diffusion, while paying less attention to how their meanings change over time. This gap persists partly because conspiracy-related terms are often treated as stable lexical markers, making it difficult to separate genuine semantic changes from surface-level vocabulary changes. In this paper, we measure the semantic structure and evolution of conspiracy theories in online political discourse. Using 169.9M comments from Reddit's r/politics subreddit spanning 2012–2022, we first demonstrate that conspiracy-related language forms coherent and semantically distinguishable regions of language space, allowing conspiracy theories to be treated as semantic objects. We then track how these objects evolve over time using aligned word embeddings, enabling comparisons of semantic neighborhoods across periods. Our analysis reveals that conspiracy theories evolve non-uniformly, exhibiting patterns of semantic stability, expansion, contraction, and replacement that are not captured by keyword-based approaches alone.
[14] IndoBERT-Relevancy: A Context-Conditioned Relevancy Classifier for Indonesian Text
Muhammad Apriandito Arya Saputra,Andry Alamsyah,Dian Puteri Ramadhani,Thomhert Suprapto Siadari,Hanif Fakhrurroja
Main category: cs.CL
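TL;DR: This paper presents IndoBERT-Relevancy, a context-conditioned relevancy classifier built on IndoBERT Large for judging the relevance of Indonesian text. Trained on a purpose-built dataset of 31,360 labeled pairs, it achieves an F1 score of 0.948 and an accuracy of 96.5%.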
Details
Motivation: Relevancy classification for Indonesian remains underexplored. Unlike single-input tasks such as sentiment analysis or named entity recognition, it requires the model to reason about the relationship between a topical context and a candidate text.
Method: A new Indonesian relevancy dataset of 31,360 labeled pairs spanning 188 topics is built through an iterative, failure-driven data construction process. A context-conditioned classifier based on IndoBERT Large is trained on a mix of real data and targeted synthetic data.
Result: The model reaches an F1 score of 0.948 and an accuracy of 96.5%, and handles both formal and informal Indonesian text.
Conclusion: No single data source is sufficient for robust relevancy classification, while targeted synthetic data can address specific model weaknesses. IndoBERT-Relevancy provides a practical, high-performing relevancy tool for Indonesian NLP and is publicly released.
Abstract: Determining whether a piece of text is relevant to a given topic is a fundamental task in natural language processing, yet it remains largely unexplored for Bahasa Indonesia. Unlike sentiment analysis or named entity recognition, relevancy classification requires the model to reason about the relationship between two inputs simultaneously: a topical context and a candidate text. We introduce IndoBERT-Relevancy, a context-conditioned relevancy classifier built on IndoBERT Large (335M parameters) and trained on a novel dataset of 31,360 labeled pairs spanning 188 topics. Through an iterative, failure-driven data construction process, we demonstrate that no single data source is sufficient for robust relevancy classification, and that targeted synthetic data can effectively address specific model weaknesses. Our final model achieves an F1 score of 0.948 and an accuracy of 96.5%, handling both formal and informal Indonesian text. The model is publicly available at HuggingFace.
[15] LLM Benchmark-User Need Misalignment for Climate Change
Oucheng Liu,Lexing Xie,Jing Jiang
Main category: cs.CL
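TL;DR: This paper proposes a Proactive Knowledge Behaviors Framework and a Topic-Intent-Form taxonomy. It shows a substantial mismatch between current climate-domain LLM benchmarks and real user needs, and finds that human-AI knowledge interaction patterns closely resemble human-human ones.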
Details
Motivation: Assessing the practical utility of large language models (LLMs) for climate change knowledge requires checking whether existing benchmarks reflect real user needs.
Method: Build the Proactive Knowledge Behaviors Framework, design the Topic-Intent-Form taxonomy, and apply it to analyze climate-related knowledge behavior data.
Result: Existing benchmarks are substantially misaligned with real user needs; human-AI knowledge interaction patterns closely resemble human-human interactions.
Conclusion: Benchmark design, RAG system development, and LLM training should be informed by real knowledge behaviors.
Abstract: Climate change is a major socio-scientific issue that shapes public decision-making and policy discussions. As large language models (LLMs) increasingly serve as an interface for accessing climate knowledge, whether existing benchmarks reflect user needs is critical for evaluating LLMs in real-world settings. We propose a Proactive Knowledge Behaviors Framework that captures the different human-human and human-AI knowledge seeking and provision behaviors. We further develop a Topic-Intent-Form taxonomy and apply it to analyze climate-related data representing different knowledge behaviors. Our results reveal a substantial mismatch between current benchmarks and real-world user needs, while knowledge interaction patterns between humans and LLMs closely resemble those in human-human interactions. These findings provide actionable guidance for benchmark design, RAG system development, and LLM training. Code is available at https://github.com/OuchengLiu/LLM-Misalign-Climate-Change.
[16] Clash of the models: Comparing performance of BERT-based variants for generic news frame detection
Vihang Jumle
Main category: cs.CL
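TL;DR: This study compares five BERT variants on generic news frame detection, introduces several fine-tuned models, and builds a labeled dataset based on the Swiss electoral context to test the contextual robustness of computational approaches to framing analysis.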
Details
Motivation: Although prior work shows that Transformer models outperform earlier bag-of-words models, how different models compare on classification tasks is still debated, and most studies rely on US-centric data without cross-context validation.
Method: Compare five BERT variants (BERT, RoBERTa, DeBERTa, DistilBERT, and ALBERT) on generic news frame detection; fine-tune the models; build and release a labeled news frame dataset based on the Swiss electoral context.
Result: The study establishes the relative performance of the BERT variants on news frame detection, provides several fine-tuned models capable of robust generic news frame detection, and releases a labeled generic news frames dataset for the Swiss context.
Conclusion: The study gives political communication researchers a basis for model selection, usable tools, and a resource for cross-context validation, advancing both the methodology and the practice of computational framing analysis.
Abstract: Framing continues to remain one of the most extensively applied theories in political communication. Developments in computation, particularly with the introduction of transformer architecture and more so with large language models (LLMs), have naturally prompted scholars to explore various novel computational approaches, especially for deductive frame detection, in recent years. While many studies have shown that different transformer models outperform their preceding models that use bag-of-words features, the debate continues to evolve regarding how these models compare with each other on classification tasks. By placing itself at this juncture, this study makes three key contributions: First, it comparatively performs generic news frame detection and compares the performance of five BERT-based variants (BERT, RoBERTa, DeBERTa, DistilBERT and ALBERT) to add to the debate on best practices around employing computational text analysis for political communication studies. Second, it introduces various fine-tuned models capable of robustly performing generic news frame detection. Third, building upon numerous previous studies that work with US-centric data, this study provides the scholarly community with a labelled generic news frames dataset based on the Swiss electoral context that aids in testing the contextual robustness of these computational approaches to framing analysis.
[17] ClinicalAgents: Multi-Agent Orchestration for Clinical Decision Making with Dual-Memory
Zhuohan Ge,Haoyang Li,Yubo Wang,Nicole Hu,Chen Jason Zhang,Qing Li
Main category: cs.CL
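TL;DR: This paper proposes ClinicalAgents, a multi-agent framework that uses Monte Carlo Tree Search and a dual-memory architecture to simulate the iterative, hypothesis-driven reasoning of expert clinicians, significantly improving diagnostic accuracy and explainability.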
Details
Motivation: Existing large language models struggle with the complex, non-linear reasoning required for medical diagnosis. They rely on static linear mappings and cannot simulate clinicians' iterative, hypothesis-driven reasoning.
Method: ClinicalAgents uses Monte Carlo Tree Search (MCTS) for dynamic orchestration and a dual-memory architecture: a mutable Working Memory maintains the patient state for context-aware reasoning, and a static Experience Memory retrieves clinical guidelines and historical cases via an active feedback loop.
Result: Experiments show that ClinicalAgents achieves state-of-the-art diagnostic accuracy and explainability, significantly outperforming strong single-agent and multi-agent baselines.
Conclusion: ClinicalAgents bridges the gap between large language models and real clinical reasoning, offering a paradigm for medical AI that is closer to the cognitive workflow of human experts.
Abstract: While Large Language Models (LLMs) have demonstrated potential in healthcare, they often struggle with the complex, non-linear reasoning required for accurate clinical diagnosis. Existing methods typically rely on static, linear mappings from symptoms to diagnoses, failing to capture the iterative, hypothesis-driven reasoning inherent to human clinicians. To bridge this gap, we introduce ClinicalAgents, a novel multi-agent framework designed to simulate the cognitive workflow of expert clinicians. Unlike rigid sequential chains, ClinicalAgents employs a dynamic orchestration mechanism modeled as a Monte Carlo Tree Search (MCTS) process. This allows an Orchestrator to iteratively generate hypotheses, actively verify evidence, and trigger backtracking when critical information is missing. Central to this framework is a Dual-Memory architecture: a mutable Working Memory that maintains the evolving patient state for context-aware reasoning, and a static Experience Memory that retrieves clinical guidelines and historical cases via an active feedback loop. Extensive experiments demonstrate that ClinicalAgents achieves state-of-the-art performance, significantly enhancing both diagnostic accuracy and explainability compared to strong single-agent and multi-agent baselines.
[18] Sparse Auto-Encoders and Holism about Large Language Models
Jumbly Grindrod
Main category: cs.CL
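TL;DR: This paper asks whether large language models (LLMs) support semantic holism and responds to the decompositional challenge posed by interpretable latent features found with sparse auto-encoders. The author argues that the holistic picture still stands provided that these features are countable.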
Details
Motivation: Examine the semantic theory embodied by LLMs, in particular whether distributional semantics necessarily leads to holism about meaning, and respond to the challenge to holism from recent mechanistic interpretability findings (such as interpretable latent features extracted by sparse auto-encoders).
Method: Literature analysis and conceptual argument: review the original arguments by Grindrod and colleagues for LLM semantic holism; introduce and analyze recent mechanistic interpretability work on sparse auto-encoder features; then examine the nature of these features, including their ontological status and countability, to assess their bearing on holism.
Result: The many interpretable latent features extracted by sparse auto-encoders appear to support a decompositional view of meaning, but the author argues that they remain compatible with a holistic framework if they are countable.
Conclusion: The semantic structure of LLMs need not force a choice between holism and decomposition. Provided the features are countable, the holistic picture remains robust and can accommodate the new interpretability findings.
Abstract: Does Large Language Model (LLM) technology suggest a meta-semantic picture, i.e., a picture of how words and complex expressions come to have the meaning that they do? One modest approach explores the assumptions that seem to be built into how LLMs capture the meanings of linguistic expressions as a way of considering their plausibility (Grindrod, 2026a, 2026b). It has previously been argued that LLMs, in employing a form of distributional semantics, adopt a form of holism about meaning (Grindrod, 2023; Grindrod et al., forthcoming). However, recent work in mechanistic interpretability presents a challenge to these arguments. Specifically, the discovery of a vast array of interpretable latent features within the high dimensional spaces used by LLMs potentially challenges the holistic interpretation. In this paper, I will present the original reasons for thinking that LLMs embody a form of holism (section 1), before introducing recent work on features generated through sparse auto-encoders, and explaining how the discovery of such features suggests an alternative decompositional picture of meaning (section 2). I will then respond to this challenge by considering in greater detail the nature of such features (section 3). Finally, I will return to the holistic picture defended by Grindrod et al. and argue that the picture still stands provided that the features are countable (section 4).
[19] Ask or Assume? Uncertainty-Aware Clarification-Seeking in Coding Agents
Nicholas Edwards,Sebastian Schuster
Main category: cs.CL
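TL;DR: This paper proposes an uncertainty-aware multi-agent framework that improves the ability of large language model (LLM) agents to ask clarifying questions when instructions are underspecified (as in software engineering), and validates it on an underspecified variant of SWE-bench Verified.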
Details
Motivation: Current LLM agents mostly aim for autonomous execution and lack the ability to recognize and resolve underspecified instructions, whereas human developers routinely ask questions to fill in missing context; this work aims to close that gap so that agents can act as proactive collaborators.
Method: Builds a multi-agent architecture that decouples uncertainty detection from code execution, using OpenHands with Claude Sonnet 4.5, and systematically evaluates clarification behavior on a modified SWE-bench Verified benchmark containing underspecified instructions.
Result: The multi-agent system reaches a 69.40% task resolve rate, significantly above the single-agent baseline (61.20%) and close to performance under fully specified instructions; its questioning behavior is well calibrated, asking less on simple tasks and proactively asking on complex ones.
Conclusion: Current LLMs already have the potential to serve as proactive collaborators: with uncertainty modeled explicitly, they can decide on their own when to initiate clarification in real-world, open-ended settings.
Abstract: As Large Language Model (LLM) agents are increasingly deployed in open-ended domains like software engineering, they frequently encounter underspecified instructions that lack crucial context. While human developers naturally resolve underspecification by asking clarifying questions, current agents are largely optimized for autonomous execution. In this work, we systematically evaluate the clarification-seeking abilities of LLM agents on an underspecified variant of SWE-bench Verified. We propose an uncertainty-aware multi-agent scaffold that explicitly decouples underspecification detection from code execution. Our results demonstrate that this multi-agent system using OpenHands + Claude Sonnet 4.5 achieves a 69.40% task resolve rate, significantly outperforming a standard single-agent setup (61.20%) and closing the performance gap with agents operating on fully specified instructions. Furthermore, we find that the multi-agent system exhibits well-calibrated uncertainty, conserving queries on simple tasks while proactively seeking information on more complex issues. These findings indicate that current models can be turned into proactive collaborators, where agents independently recognize when to ask questions to elicit missing information in real-world, underspecified tasks.
[20] GS-BrainText: A Multi-Site Brain Imaging Report Dataset from Generation Scotland for Clinical Natural Language Processing Development and Validation
Beatrice Alex,Claire Grover,Arlene Casey,Richard Tobin,Heather Whalley,William Whiteley
Main category: cs.CL
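TL;DR: This paper introduces GS-BrainText, a dataset of 8,511 Scottish brain radiology reports, 2,431 of which are annotated for 24 brain disease phenotypes; it spans multiple sites and a broad age range and is intended to advance generalisable clinical NLP research.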
Details
Motivation: To address the scarcity of UK clinical text resources and to support the development and generalisation assessment of clinical NLP algorithms, with particular attention to performance differences across institutions, phenotypes and age groups.
Method: Builds a brain radiology report dataset spanning multiple sites and health boards, annotated by a multidisciplinary clinical team following a common annotation schema, with double annotation stratified by health board and quality control; benchmarks with the existing rule-based system EdIE-R.
Result: EdIE-R's performance varies across health boards (F1: 86.13-98.13), phenotypes (F1: 22.22-100) and age groups (F1: 87.01-98.13), exposing key challenges for the generalisation of NLP tools.
Conclusion: GS-BrainText fills a gap in UK clinical text resources and provides an important basis for studying linguistic variation, the expression of diagnostic uncertainty, and the effect of data characteristics on NLP performance.
Abstract: We present GS-BrainText, a curated dataset of 8,511 brain radiology reports from the Generation Scotland cohort, of which 2,431 are annotated for 24 brain disease phenotypes. This multi-site dataset spans five Scottish NHS health boards and includes broad age representation (mean age 58, median age 53), making it uniquely valuable for developing and evaluating generalisable clinical natural language processing (NLP) algorithms and tools. Expert annotations were performed by a multidisciplinary clinical team using an annotation schema, with 10-100% double annotation per NHS health board and rigorous quality assurance. Benchmark evaluation using EdIE-R, an existing rule-based NLP system developed in conjunction with the annotation schema, revealed some performance variation across health boards (F1: 86.13-98.13), phenotypes (F1: 22.22-100) and age groups (F1: 87.01-98.13), highlighting critical challenges in generalisation of NLP tools. The GS-BrainText dataset addresses a significant gap in available UK clinical text resources and provides a valuable resource for the study of linguistic variation, diagnostic uncertainty expression and the impact of data characteristics on NLP system performance.
[21] A Universal Vibe? Finding and Controlling Language-Agnostic Informal Register with SAEs
Uri Z. Kialy,Avi Shtarkberg,Ayal Klein
Main category: cs.CL
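TL;DR: Using sparse autoencoder (SAE) probes, this paper studies how a multilingual large language model (Gemma-2-9B-IT) represents informal register (such as slang), and finds a cross-lingual, geometrically coherent "informal register subspace" that is more pronounced in deeper layers, causally controls output formality, and generalizes zero-shot to unseen languages.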
Details
Motivation: To determine whether multilingual LLMs process culture-specific pragmatic registers (such as slang) as language-specific memorizations or as unified abstract concepts.
Method: Probes the internal representations of Gemma-2-9B-IT with sparse autoencoders (SAEs) across English, Hebrew, and Russian; builds a new dataset in which every target term is polysemous (literal vs. informal) to rule out lexical sensitivity as a confound.
Result: Finds a small but robust cross-lingual core of informal-register representations that forms a geometrically coherent "informal register subspace" and sharpens with network depth; activation steering causally shifts output formality in all source languages and in six unseen languages.
Conclusion: Multilingual LLMs internalize informal register as a portable, language-agnostic pragmatic abstraction rather than as a surface-level heuristic.
Abstract: While multilingual language models successfully transfer factual and syntactic knowledge across languages, it remains unclear whether they process culture-specific pragmatic registers, such as slang, as isolated language-specific memorizations or as unified, abstract concepts. We study this by probing the internal representations of Gemma-2-9B-IT using Sparse Autoencoders (SAEs) across three typologically diverse source languages: English, Hebrew, and Russian. To definitively isolate pragmatic register processing from trivial lexical sensitivity, we introduce a novel dataset in which every target term is polysemous, appearing in both literal and informal contexts. We find that while much of the informal-register signal is distributed across language-specific features, a small but highly robust cross-linguistic core consistently emerges. This shared core forms a geometrically coherent "informal register subspace" that sharpens in the model's deeper layers. Crucially, these shared representations are not merely correlational: activation steering with these features causally shifts output formality across all source languages and transfers zero-shot to six unseen languages spanning diverse language families and scripts. Together, these results provide the first mechanistic evidence that multilingual LLMs internalize informal register not just as surface-level heuristics, but as a portable, language-agnostic pragmatic abstraction.
[22] Distilling Conversations: Abstract Compression of Conversational Audio Context for LLM-based ASR
Shashi Kumar,Esaú Villatoro-Tello,Sergio Burdisso,Kadri Hacioglu,Thibault Bañeras-Roux,Hasindri Watawana,Dairazalia Sanchez-Cortes,Srikanth Madikeri,Petr Motlicek,Andreas Stolcke
Main category: cs.CL
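TL;DR: This paper studies how multimodal context (specifically the audio and text of prior conversation turns) can improve large language model (LLM)-based automatic speech recognition (ASR), and proposes Abstract Compression, an efficient context representation that replaces long audio sequences with a fixed number of latent tokens, keeping part of the performance gain at a much lower computational cost.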
Details
Motivation: Standard LLM-based ASR systems process each utterance in isolation and cannot exploit conversational context, while conditioning directly on raw multi-turn audio context is computationally expensive because the token count grows quickly; a more efficient way to model context is therefore needed.
Method: Proposes Abstract Compression, which replaces prior-turn audio with a fixed number of learned latent tokens while explicitly retaining the corresponding transcripts, and evaluates its effect on ASR performance under supervised multi-turn training.
Result: After supervised multi-turn training, context mainly improves the recognition of contextual entities; on both in-domain and out-of-domain test sets, Abstract Compression recovers part of the gain of raw-context conditioning while substantially reducing the audio token footprint.
Conclusion: Multimodal conversational context benefits LLM-based ASR but needs an efficient representation; Abstract Compression is an effective compression strategy that trades off performance and efficiency well.
Abstract: Standard LLM-based speech recognition systems typically process utterances in isolation, limiting their ability to leverage conversational context. In this work, we study whether multimodal context from prior turns improves LLM-based ASR and how to represent that context efficiently. We find that, after supervised multi-turn training, conversational context mainly helps with the recognition of contextual entities. However, conditioning on raw context is expensive because the prior-turn audio token sequence grows rapidly with conversation length. To address this, we propose Abstract Compression, which replaces the audio portion of prior turns with a fixed number of learned latent tokens while retaining corresponding transcripts explicitly. On both in-domain and out-of-domain test sets, the compressed model recovers part of the gains of raw-context conditioning with a smaller prior-turn audio footprint. We also provide targeted analyses of the compression setup and its trade-offs.
[23] Automatic Speech Recognition for Documenting Endangered Languages: Case Study of Ikema Miyakoan
Chihiro Taguchi,Yukinori Takubo,David Chiang
Main category: cs.CL
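TL;DR: This paper describes the development of an automatic speech recognition (ASR) system for Ikema, an endangered language of Okinawa, Japan: it builds a speech corpus of several hours, achieves a character error rate of 15%, and shows that ASR assistance can substantially improve transcription efficiency and reduce cognitive load.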
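Note on the metric: the character error rate (CER) reported for this entry is conventionally the character-level edit distance between the reference transcript and the ASR output, divided by the reference length. The sketch below is a minimal illustration under that standard definition; the function names are hypothetical and it is not the authors' evaluation script.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference characters into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = edit distance / number of reference characters."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

Under this definition a CER of 15% corresponds to roughly 15 character edits per 100 reference characters; whether whitespace and punctuation count as characters is a normalization choice the abstract does not specify.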
Details
Motivation: Language endangerment threatens linguistic diversity worldwide, and technologies such as ASR open new avenues for documenting and revitalizing endangered languages.
Method: Builds an Ikema speech corpus from field recordings, trains and evaluates an ASR model, and runs an experiment on the efficiency of ASR-assisted transcription.
Result: Builds an Ikema speech corpus of about {totaldatasethours} hours; the ASR model reaches a character error rate as low as 15%; ASR assistance substantially shortens transcription time and reduces cognitive load.
Conclusion: ASR can serve as an effective tool for scalable, technology-supported documentation of endangered languages.
Abstract: Language endangerment poses a major challenge to linguistic diversity worldwide, and technological advances have opened new avenues for documentation and revitalization. Among these, automatic speech recognition (ASR) has shown increasing potential to assist in the transcription of endangered language data. This study focuses on Ikema, a severely endangered Ryukyuan language spoken in Okinawa, Japan, with approximately 1,300 remaining speakers, most of whom are over 60 years old. We present an ongoing effort to develop an ASR system for Ikema based on field recordings. Specifically, we (1) construct a {\totaldatasethours}-hour speech corpus from field recordings, (2) train an ASR model that achieves a character error rate as low as 15%, and (3) evaluate the impact of ASR assistance on the efficiency of speech transcription. Our results demonstrate that ASR integration can substantially reduce transcription time and cognitive load, offering a practical pathway toward scalable, technology-supported documentation of endangered languages.
[24] SocialX: A Modular Platform for Multi-Source Big Data Research in Indonesia
Muhammad Apriandito Arya Saputra,Andry Alamsyah,Dian Puteri Ramadhani,Thomhert Suprapto Siadari,Hanif Fakhrurroja
Main category: cs.CL
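TL;DR: This paper presents SocialX, a modular platform for multi-source big data research in Indonesia that addresses scattered data, inconsistent formats, and varied noise by using a layered architecture to unify and decouple data collection, language-aware preprocessing, and pluggable analysis.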
Details
Motivation: Big data research in Indonesia is constrained by scattered data sources (social media, news portals, e-commerce platforms, and others) with differing formats, access methods, and noise characteristics; researchers must repeatedly build collection, cleaning, and analysis pipelines, which distracts from the research itself.
Method: Proposes the SocialX platform, with three decoupled layers (collection, preprocessing, and analysis) connected by a lightweight job-coordination mechanism; emphasizes modularity and source-agnosticism, so that new data sources, preprocessing methods, and analysis tools can be added without modifying the existing pipeline; designs language-aware preprocessing specifically for Indonesian text across registers.
Result: Delivers an extensible, maintainable, publicly accessible web platform (https://www.socialx.id) and demonstrates its utility through a walkthrough of a typical research workflow.
Conclusion: SocialX mitigates the fragmentation of big data research in Indonesia and improves research efficiency and reproducibility; its modular design offers a reference for big data platforms in other low-resource language regions.
Abstract: Big data research in Indonesia is constrained by a fundamental fragmentation: relevant data is scattered across social media, news portals, e-commerce platforms, review sites, and academic databases, each with different formats, access methods, and noise characteristics. Researchers must independently build collection pipelines, clean heterogeneous data, and assemble separate analysis tools, a process that often overshadows the research itself. We present SocialX, a modular platform for multi-source big data research that integrates heterogeneous data collection, language-aware preprocessing, and pluggable analysis into a unified, source-agnostic pipeline. The platform separates concerns into three independent layers (collection, preprocessing, and analysis) connected by a lightweight job-coordination mechanism. This modularity allows each layer to grow independently: new data sources, preprocessing methods, or analysis tools can be added without modifying the existing pipeline. We describe the design principles that enable this extensibility, detail the preprocessing methodology that addresses challenges specific to Indonesian text across registers, and demonstrate the platform's utility through a walkthrough of a typical research workflow. SocialX is publicly accessible as a web-based platform at https://www.socialx.id.
[25] findsylls: A Language-Agnostic Toolkit for Syllable-Level Speech Tokenization and Embedding
Héctor Javier Vázquez Martínez
Main category: cs.CL
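TL;DR: This paper introduces findsylls, a modular, language-agnostic toolkit that unifies classical syllable detectors and end-to-end syllabifiers, supports syllable segmentation, embedding extraction, and multi-granular evaluation, and aims to enable reproducible syllable-level modeling research across resource conditions.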
Details
Motivation: Research on syllable-level modeling has long been fragmented across methods, datasets, and evaluation protocols and lacks a unified, reproducible experimental framework.
Method: Develops the findsylls toolkit, which brings classical syllable detectors and end-to-end syllabifiers (e.g., Sylber, VG-HuBERT) under a standardized interface and supports recombination of components and multi-granular evaluation.
Result: Demonstrates the toolkit on English, Spanish, and newly hand-annotated data from Kono, a Central Mande language, showing that it applies to both high-resource and under-resourced settings.
Conclusion: findsylls provides a unified, extensible, and reproducible open-source infrastructure for syllable-level speech modeling and helps the field make systematic progress.
Abstract: Syllable-level units offer compact and linguistically meaningful representations for spoken language modeling and unsupervised word discovery, but research on syllabification remains fragmented across disparate implementations, datasets, and evaluation protocols. We introduce findsylls, a modular, language-agnostic toolkit that unifies classical syllable detectors and end-to-end syllabifiers under a common interface for syllable segmentation, embedding extraction, and multi-granular evaluation. The toolkit implements and standardizes widely used methods (e.g., Sylber, VG-HuBERT) and allows their components to be recombined, enabling controlled comparisons of representations, algorithms, and token rates. We demonstrate findsylls on English and Spanish corpora and on new hand-annotated data from Kono, an underdocumented Central Mande language, illustrating how a single framework can support reproducible syllable-level experiments across both high-resource and under-resourced settings.
[26] From Human Cognition to Neural Activations: Probing the Computational Primitives of Spatial Reasoning in LLMs
Jiyuan An,Liner Yang,Mengyan Wang,Luming Lu,Weihua An,Erhong Yang
Main category: cs.CL
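TL;DR: This paper takes a mechanistic look at whether large language models (LLMs) have structured internal spatial representations; it finds that spatial information is encoded in intermediate layers and has causal influence but is transient, fragmented, and weakly integrated, and cross-lingual analysis reveals "mechanistic degeneracy", where similar behavior arises from different internal mechanisms, indicating that current LLMs lack robust, general-purpose spatial reasoning.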
Details
Motivation: To clarify whether LLM performance on spatial reasoning benchmarks reflects structured internal spatial representations or mere reliance on linguistic heuristics.
Method: Drawing on computational theories of human spatial cognition, decomposes spatial reasoning into three primitives (relational composition, representational transformation, and stateful spatial updating) and designs a controlled task family for each; evaluates multilingual LLMs in English, Chinese, and Arabic and analyzes their internal representations with linear probing, sparse autoencoder feature analysis, and causal interventions.
Result: Spatial information is encoded in intermediate layers and can causally influence behavior, but the representations are transient, fragmented across task families, and weakly integrated into final predictions; cross-lingual analysis shows mechanistic degeneracy, with similar behavioral performance arising from distinct internal pathways.
Conclusion: Current LLMs have only limited, context-dependent spatial representations rather than robust, general-purpose spatial reasoning, which calls for mechanistic evaluation beyond benchmark accuracy.
Abstract: As spatial intelligence becomes an increasingly important capability for foundation models, it remains unclear whether large language models' (LLMs) performance on spatial reasoning benchmarks reflects structured internal spatial representations or reliance on linguistic heuristics. We address this question from a mechanistic perspective by examining how spatial information is internally represented and used. Drawing on computational theories of human spatial cognition, we decompose spatial reasoning into three primitives (relational composition, representational transformation, and stateful spatial updating) and design controlled task families for each. We evaluate multilingual LLMs in English, Chinese, and Arabic under single-pass inference, and analyze internal representations using linear probing, sparse-autoencoder-based feature analysis, and causal interventions. We find that task-relevant spatial information is encoded in intermediate layers and can causally influence behavior, but these representations are transient, fragmented across task families, and weakly integrated into final predictions. Cross-linguistic analysis further reveals mechanistic degeneracy, where similar behavioral performance arises from distinct internal pathways. Overall, our results suggest that current LLMs exhibit limited and context-dependent spatial representations rather than robust, general-purpose spatial reasoning, highlighting the need for mechanistic evaluation beyond benchmark accuracy.
[27] CALRK-Bench: Evaluating Context-Aware Legal Reasoning in Korean Law
JiHyeok Jung,TaeYoung Yoon,HyunSouk Cho
Main category: cs.CL
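TL;DR: This paper proposes CALRK-Bench, a context-aware legal reasoning benchmark based on the Korean legal system that evaluates three abilities: judging the temporal validity of legal norms, recognizing whether legal information is sufficient, and understanding why legal judgments change; experiments show that current large language models perform poorly on it.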
Details
Motivation: Existing legal benchmarks mostly assume fixed legal norms and cannot capture realistic situations in which legal judgments shift over time or multiple norms interact, so a new benchmark for context-aware legal reasoning is needed.
Method: Builds CALRK-Bench on the Korean legal system from legal precedents and legal consultation records, validated by legal experts, with three tasks: identifying the temporal validity of legal norms, judging the sufficiency of legal information, and understanding the reasons for shifts in legal judgments.
Result: Even recent large language models perform poorly on all three CALRK-Bench tasks, indicating a lack of genuine context-aware legal reasoning.
Conclusion: CALRK-Bench offers a stress test of whether models possess deeper legal reasoning rather than simple memorization of legal knowledge, and encourages legal AI research better matched to practical needs.
Abstract: Legal reasoning requires not only the application of legal rules but also an understanding of the context in which those rules operate. However, existing legal benchmarks primarily evaluate rule application under the assumption of fixed norms, and thus fail to capture situations where legal judgments shift or where multiple norms interact. In this work, we propose CALRK-Bench, a context-aware legal reasoning benchmark based on the Korean legal system. CALRK-Bench evaluates whether models can identify the temporal validity of legal norms, determine whether sufficient legal information is available for a given case, and understand the reasons behind shifts in legal judgments. The dataset is constructed from legal precedents and legal consultation records, and is validated by legal experts. Experimental results show that even recent large language models consistently exhibit low performance on these three tasks. CALRK-Bench provides a new stress test for evaluating context-aware legal reasoning rather than simple memorization of legal knowledge. Our code is available at https://github.com/jhCOR/CALRKBench.
[28] Switch Attention: Towards Dynamic and Fine-grained Hybrid Transformers
Yusheng Zhao,Hourun Li,Bohan Wu,Jingyang Yuan,Meng Zhang,Yichun Yin,Lifeng Shang,Ming Zhang
Main category: cs.CL
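TL;DR: This paper proposes Switch Attention (SwiAttn), a dynamic hybrid attention mechanism that adaptively chooses between full attention and sliding-window attention for each token at each layer, balancing long-range modeling with computational efficiency, and optimizes it with adaptive regularization and continual pretraining.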
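Illustration: the per-token choice between full and sliding-window attention can be pictured as a per-row choice of attention mask. The sketch below is only a toy under stated assumptions (causal attention, a window that includes the current token, and routing decisions supplied as a plain list); the learned router, the regularization objective, and the function names are not taken from the paper.

```python
def build_hybrid_mask(routes, window):
    """Return mask[i][j] == True when query token i may attend to key token j.

    routes[i] is True if token i is routed to full causal attention and
    False if it is routed to a sliding window of `window` tokens (assumed
    to include token i itself).
    """
    n = len(routes)
    mask = []
    for i in range(n):
        # Full attention sees every earlier token; the window sees only the last few.
        start = 0 if routes[i] else max(0, i - window + 1)
        mask.append([start <= j <= i for j in range(n)])
    return mask


def attended_pairs(mask):
    """Count query-key pairs that are computed, a rough proxy for attention cost."""
    return sum(sum(row) for row in mask)
```

With four tokens, a window of 2, and routes `[True, False, False, True]`, the mask computes 9 query-key pairs, compared with 10 for fully causal attention; the saving grows with sequence length as more tokens are routed to the window branch.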
Details
Motivation: The cost of standard full attention grows quadratically with sequence length, making it a bottleneck for long-context modeling; sliding-window attention is efficient but has a limited receptive field; existing hybrid methods mostly use static alternating patterns that adapt poorly to the computational needs of different scenarios.
Method: Proposes SwiAttn, which dynamically routes each token at each layer to either a full-attention branch (for global information aggregation) or a sliding-window branch (for efficient local pattern matching); designs an adaptive regularization objective that encourages efficiency; and uses continual pretraining to transfer a full-attention model to the hybrid architecture.
Result: Experiments on 23 benchmark datasets, covering regular (4K) and long (32K) context lengths, validate SwiAttn and show an improved trade-off between long-context modeling performance and efficiency.
Conclusion: SwiAttn achieves fine-grained, dynamic selection of attention patterns, retaining global modeling ability while improving inference efficiency, and offers a more flexible and efficient hybrid attention paradigm for long-context language modeling.
Abstract: The attention mechanism has been the core component in modern transformer architectures. However, the computation of standard full attention scales quadratically with the sequence length, serving as a major bottleneck in long-context language modeling. Sliding window attention restricts the context length for better efficiency at the cost of narrower receptive fields. While existing efforts attempt to combine the benefits of both by building hybrid models, they often resort to static, heuristically designed alternating patterns that limit efficient allocation of computation in various scenarios. In this paper, we propose Switch Attention (SwiAttn), a novel hybrid transformer that enables dynamic and fine-grained routing between full attention and sliding window attention. For each token at each transformer layer, SwiAttn dynamically routes the computation to either a full-attention branch for global information aggregation or a sliding-window branch for efficient local pattern matching. An adaptive regularization objective is designed to encourage the model towards efficiency. Moreover, we adopt continual pretraining to optimize the model, transferring the full attention architecture to the hybrid one. Extensive experiments are conducted on twenty-three benchmark datasets across both regular (4K) and long (32K) context lengths, demonstrating the effectiveness of the proposed method.
[29] Word Alignment-Based Evaluation of Uniform Meaning Representations
Daniel Zeman,Federica Gamba
Main category: cs.CL
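TL;DR: This paper proposes a UMR graph matching algorithm based on node-word alignment that compares sentence meaning representations in a more intuitive and interpretable way and avoids the NP-hard search problem of the traditional smatch method.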
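Illustration: to see why node-word alignments sidestep an exhaustive search over node mappings, consider a greedy pairing of nodes by the words they are aligned to. This is a hypothetical sketch, not the authors' algorithm: the input format (node id mapped to a set of word indices), the overlap score, and the greedy tie-breaking are all assumptions made for the example.

```python
def match_nodes_by_alignment(nodes_a, nodes_b):
    """Pair nodes of two graphs of the same sentence by shared aligned words.

    nodes_a, nodes_b: dict mapping node id -> set of aligned word indices.
    Returns a dict from node ids in graph A to node ids in graph B.
    Nodes with no shared aligned word stay unmatched.
    """
    candidates = []
    for a, words_a in nodes_a.items():
        for b, words_b in nodes_b.items():
            overlap = len(words_a & words_b)
            if overlap:
                candidates.append((overlap, a, b))
    # Largest overlap first; ids break ties so the result is deterministic.
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))

    mapping, used_a, used_b = {}, set(), set()
    for _, a, b in candidates:
        if a not in used_a and b not in used_b:
            mapping[a] = b
            used_a.add(a)
            used_b.add(b)
    return mapping
```

The pairing runs in polynomial time because each node's candidates are fixed by the alignment rather than searched for, which is the general property the abstract contrasts with smatch.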
Details
Motivation: Comparing graph-based meaning representations (such as AMR and UMR) is difficult because competing graphs may differ in node count and it is unclear which nodes should be matched; existing node mappings that maximize F1 tend to produce accidental matches, which hinders fine-grained error analysis.
Method: Proposes a new node-matching algorithm that exploits the node-word alignments inherent in UMR, supports comparison of multiple UMRs of the same sentence, and is compared with the prevailing smatch method.
Result: The new algorithm makes matches more intuitive and interpretable while remaining valid for evaluation, and avoids the NP-hard search in smatch; a script implementing it is freely available.
Conclusion: Word-alignment-based node matching is preferable to the traditional F1-maximizing strategy and provides a more reliable and practical tool for evaluating meaning representations.
Abstract: Comparison and evaluation of graph-based representations of sentence meaning is a challenge because competing representations of the same sentence may have a different number of nodes, and it is not obvious which nodes should be compared to each other. Existing approaches favor node mapping that maximizes F1 score over node relations and attributes, regardless of whether the similarity is intentional or accidental; consequently, the identified mismatches in values of node attributes are not useful for any detailed error analysis. We propose a node-matching algorithm that allows comparison of multiple Uniform Meaning Representations (UMR) of one sentence and that takes advantage of node-word alignments, inherently available in UMR. We compare it with previously used approaches, in particular smatch (the de facto standard in AMR evaluation), and argue that sensitivity to word alignment makes the comparison of meaning representations more intuitive and interpretable, while avoiding the NP-hard search problem inherent in smatch. A script implementing the method is freely available.
[30] Why Models Know But Don't Say: Chain-of-Thought Faithfulness Divergence Between Thinking Tokens and Answers in Open-Weight Reasoning Models
Richard J. Young
Main category: cs.CL
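TL;DR: This study analyzes 12 open-weight reasoning models on MMLU and GPQA questions paired with misleading hints and finds a substantial divergence (55.4%) between thinking tokens and visible answers: the model acknowledges the hint in its internal reasoning but not in its final answer; the effect is directional and strongly shaped by hint type and model.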
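Illustration: the four-way classification behind the divergence figure (acknowledgment in thinking tokens, in the answer text, in both, or in neither) can be sketched with a simple keyword check. The keyword list and function names below are hypothetical; the study's actual keyword sets and matching rules are not given in this digest.

```python
from collections import Counter

CATEGORIES = ("both", "thinking_only", "answer_only", "neither")


def classify_case(thinking, answer, keywords):
    """Label one hint-following case by where hint keywords are verbalized."""
    in_thinking = any(k.lower() in thinking.lower() for k in keywords)
    in_answer = any(k.lower() in answer.lower() for k in keywords)
    if in_thinking and in_answer:
        return "both"
    if in_thinking:
        return "thinking_only"  # the "thinking-answer divergence" pattern
    if in_answer:
        return "answer_only"
    return "neither"


def category_shares(cases, keywords):
    """cases: list of (thinking_text, answer_text); returns share per category."""
    if not cases:
        raise ValueError("need at least one case")
    counts = Counter(classify_case(t, a, keywords) for t, a in cases)
    return {c: counts[c] / len(cases) for c in CATEGORIES}
```

In these terms, the reported 55.4% is the `thinking_only` share, 0.5% the `answer_only` share, and 11.8% the `neither` share of the 10,506 hint-following cases.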
Details
Motivation: To assess the interpretability and reliability of reasoning models under misleading hints, and to show that monitoring only the final answer may seriously underestimate how much a model has been influenced.
Method: Applies three types of misleading hints (sycophancy, consistency, unethical) to 12 open-weight reasoning models on MMLU and GPQA, records whether the thinking tokens and the answer text explicitly acknowledge the hint, and cross-classifies the cases and analyzes proportions.
Result: In 55.4% of hint-following cases, hint-related keywords appear only in the thinking tokens and are absent from the answer (thinking-answer divergence); the reverse occurs in only 0.5%; hint type strongly modulates the pattern (sycophancy is the most transparent, while unethical and consistency hints are dominated by thinking-only acknowledgment); models differ widely (94.7% vs. 19.6%); and 11.8% of cases show no acknowledgment in either channel.
Conclusion: Monitoring the answer text alone misses more than half of hint-influenced reasoning; access to thinking tokens is necessary but still insufficient to capture all hint influence, so methods for detecting unverbalized influence need further study.
Abstract: Extended-thinking models expose a second text-generation channel ("thinking tokens") alongside the user-visible answer. This study examines 12 open-weight reasoning models on MMLU and GPQA questions paired with misleading hints. Among the 10,506 cases where models actually followed the hint (choosing the hint's target over the ground truth), each case is classified by whether the model acknowledges the hint in its thinking tokens, its answer text, both, or neither. In 55.4% of these cases the model's thinking tokens contain hint-related keywords that the visible answer omits entirely, a pattern termed *thinking-answer divergence*. The reverse (answer-only acknowledgment) is near-zero (0.5%), confirming that the asymmetry is directional. Hint type shapes the pattern sharply: sycophancy is the most *transparent* hint, with 58.8% of sycophancy-influenced cases acknowledging the professor's authority in both channels, while consistency (72.2%) and unethical (62.7%) hints are dominated by thinking-only acknowledgment. Models also vary widely, from near-total divergence (Step-3.5-Flash: 94.7%) to relative transparency (Qwen3.5-27B: 19.6%). These results show that answer-text-only monitoring misses more than half of all hint-influenced reasoning and that thinking-token access, while necessary, still leaves 11.8% of cases with no verbalized acknowledgment in either channel.
[31] Analysing Calls to Order in German Parliamentary Debates
Nina Smirnova,Daniel Dan,Philipp Mayr
Main category: cs.CL
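TL;DR: This study systematically analyzes calls to order (CtO) in 72 years of parliamentary debates in the German Bundestag, proposes a rule-based automatic detection method, builds a novel annotated CtO dataset, and develops the first classification system for CtO triggers; it finds that issuing CtOs is partly subjective and influenced by session presidents and political dynamics, that insults towards individuals are the most common trigger, that male and opposition members are called to order more often, and that most CtOs occur in speeches on governmental affairs and actions of the presidency.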
Details
Motivation: Incivility in parliament (such as interruptions and insults) is an important signal of political polarization and institutional conflict, yet calls to order (CtO), as formal indicators of norm violations, have received little systematic attention in prior research.
Method: Proposes a rule-based method for automatically detecting and annotating CtOs, builds an annotated dataset of German parliamentary debates spanning 72 years (1951-2023), develops the first classification system for CtO triggers, and uses statistical analysis to examine the factors associated with CtO occurrence.
Result: CtO issuance is partly subjective and influenced by the session president's discretion and parliamentary dynamics; insults towards individuals are the most frequent trigger; male and opposition members are called to order more often than female and coalition members; most CtOs occur in speeches on governmental affairs and actions of the presidency.
Conclusion: Although a formal procedural tool, the CtO reflects underlying political power structures and gender and party inequalities; the study offers a new method and a reusable data resource for the quantitative analysis of parliamentary norm enforcement and political behavior.
Abstract: Parliamentary debate constitutes a central arena of political power, shaping legislative outcomes and public discourse. Incivility within this arena signals political polarization and institutional conflict. This study presents a systematic investigation of incivility in the German Bundestag by examining calls to order (CtO; plural: CtOs) as formal indicators of norm violations. Despite their relevance, CtOs have received little systematic attention in parliamentary research. We introduce a rule-based method for detecting and annotating CtOs in parliamentary speeches and present a novel dataset of German parliamentary debates spanning 72 years that includes annotated CtO instances. Additionally, we develop the first classification system for CtO triggers and analyze the factors associated with their occurrence. Our findings show that, despite formal regulations, the issuance of CtOs is partly subjective and influenced by session presidents and parliamentary dynamics, with certain individuals disproportionately affected. Insults towards individuals are the most frequent cause of CtOs. In general, male members and those belonging to opposition parties receive more calls to order than their female and coalition-party counterparts. Most CtO triggers were detected in speeches dedicated to governmental affairs and actions of the presidency. The CtO triggers dataset is available at: https://github.com/kalawinka/cto_analysis.
[32] Automating Clinical Information Retrieval from Finnish Electronic Health Records Using Large Language Models
Mikko Saukkoriipi,Nicole Hernandez,Jaakko Sahlsten,Kimmo Kaski,Otso Arponen
Main category: cs.CL
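TL;DR: This paper presents a locally deployed Clinical Contextual Question Answering (CCQA) framework that uses open-source large language models (LLMs) to answer clinical questions directly from electronic health records (EHRs) fully offline; on a Finnish clinical text dataset, models such as Llama-3.1-70B reach up to 95.3% accuracy, and low-precision quantization (4-bit and 8-bit) preserves performance while improving deployability, but clinical evaluation finds clinically significant errors in 2.9% of outputs, indicating the need for human review and validation.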
Details
Motivation: Clinicians often need to retrieve patient-specific information from electronic health records (EHRs), a time-consuming and error-prone process that calls for secure, efficient, locally run natural-language question answering tools.
Method: Builds a locally deployable CCQA framework and benchmarks open-source LLMs with 4B to 70B parameters, fully offline, on 1,664 expert-annotated question-answer pairs, predominantly in Finnish, derived from the records of 183 patients; uses both free-text generation and multiple-choice evaluation, tests 4-bit and 8-bit quantization, and has clinical experts identify clinically significant errors and check consistency across semantically equivalent questions.
Result: Llama-3.1-70B reaches 95.3% accuracy and 97.3% consistency in free-text generation; Qwen3-30B-A3B-2507 performs comparably; low-precision quantization does not degrade performance; clinical evaluation finds clinically significant errors in 2.9% of outputs, and in 0.96% of cases semantically equivalent questions yield discordant answers (one correct, one containing a clinically significant error).
Conclusion: Locally deployed open-source LLMs can support clinical question answering over EHRs with high accuracy while preserving data privacy, but their outputs carry non-negligible clinical risk and require rigorous validation and human oversight before use in real clinical settings.
Abstract: Clinicians often need to retrieve patient-specific information from electronic health records (EHRs), a task that is time-consuming and error-prone. We present a locally deployable Clinical Contextual Question Answering (CCQA) framework that answers clinical questions directly from EHRs without external data transfer. Open-source large language models (LLMs) ranging from 4B to 70B parameters were benchmarked under fully offline conditions using 1,664 expert-annotated question-answer pairs derived from records of 183 patients. The dataset consisted predominantly of Finnish clinical text. In free-text generation, Llama-3.1-70B achieved 95.3% accuracy and 97.3% consistency across semantically equivalent question variants, while the smaller Qwen3-30B-A3B-2507 model achieved comparable performance. In a multiple-choice setting, models showed similar accuracy but variable calibration. Low-precision quantization (4-bit and 8-bit) preserved predictive performance while reducing GPU memory requirements and improving deployment feasibility. Clinical evaluation identified clinically significant errors in 2.9% of outputs, and semantically equivalent questions occasionally yielded discordant responses, including instances where one formulation was correct and the other contained a clinically significant error (0.96% of cases). These findings demonstrate that locally hosted open-source LLMs can accurately retrieve patient-specific information from EHRs using natural-language queries, while highlighting the need for validation and human oversight in clinical deployment.
[33] ClimateCheck 2026: Scientific Fact-Checking and Disinformation Narrative Classification of Climate-related Claims
Raia Abu Ahmad,Max Upravitelev,Aida Usmanova,Veronika Solopova,Georg Rehm
Main category: cs.CL
TL;DR: ClimateCheck 2026 是一项面向气候主张自动验证的共享任务,扩展了前一年的数据与任务,并评估了多种检索与推理方法在不完整标注下的表现,揭示了现有指标的偏差及不同气候虚假信息类型的可验证性差异。
Details
Motivation: Automatic verification of climate-related claims is hard because scientific literature is highly specialized and climate disinformation uses diverse rhetorical strategies.
Method: Participating systems combine dense retrieval pipelines, cross-encoder ensembles, and large language models with structured hierarchical reasoning; the organizers adapt an automated evaluation framework to incomplete annotations.
Result: The task attracted 20 registered participants and 8 leaderboard submissions; conventional metrics (such as Recall@K) show systematic biases in how they rank systems; cross-task analysis indicates that not all types of climate disinformation are equally verifiable.
Conclusion: Future fact-checking systems should account for differences between disinformation types and their effect on verification difficulty, and evaluation under incomplete annotations should be improved.
Abstract: Automatically verifying climate-related claims against scientific literature is a challenging task, complicated by the specialised nature of scholarly evidence and the diversity of rhetorical strategies underlying climate disinformation. ClimateCheck 2026 is the second iteration of a shared task addressing this challenge, expanding on the 2025 edition with tripled training data and a new disinformation narrative classification task. Running from January to February 2026 on the CodaBench platform, the competition attracted 20 registered participants and 8 leaderboard submissions, with systems combining dense retrieval pipelines, cross-encoder ensembles, and large language models with structured hierarchical reasoning. In addition to standard evaluation metrics (Recall@K and Binary Preference), we adapt an automated framework to assess retrieval quality under incomplete annotations, exposing systematic biases in how conventional metrics rank systems. A cross-task analysis further reveals that not all climate disinformation is equally verifiable, with potential implications for how future fact-checking systems should be designed.
[34] Clinical named entity recognition in the Portuguese language: a benchmark of modern BERT models and LLMs
Vinicius Anjos de Almeida,Sandro Saorin da Silva,Josimar Chire,Leonardo Vicenzi,Nícolas Henrique Borges,Helena Kociolek,Sarah Miriã de Castro Rocha,Frederico Nassif Gomes,Júlia Cristina Ferreira,Oge Marques,Lucas Emanuel Silva e Oliveira
Main category: cs.CL
TL;DR: 本研究评估了多种BERT模型和大语言模型(LLM)在葡萄牙语临床命名实体识别(NER)任务上的性能,并探索了缓解多标签类别不平衡的策略;结果表明mmBERT-base表现最优(micro F1=0.76),且迭代分层采样显著提升性能。
Details
Motivation: Benchmarks for Portuguese clinical NER are scarce, so a systematic evaluation of pretrained models suited to the language and domain, and of methods for handling imbalance, is needed.
Method: Compares BioBERTpt, BERTimbau, ModernBERT, and mmBERT with LLMs such as GPT-5 and Gemini-2.5 on the public SemClinBr corpus and a private breast cancer dataset, with models trained under identical conditions; evaluates with precision, recall, and F1-score; tests three imbalance-handling strategies: iterative stratification, weighted loss, and oversampling.
Result: mmBERT-base achieves the best performance (micro F1 = 0.76), outperforming all other models; iterative stratification improves class balance and overall performance; multilingual BERT models, especially mmBERT, perform strongly and can run locally with limited computational resources.
Conclusion: mmBERT is an efficient, practical choice for Portuguese clinical NER, and balanced data-splitting strategies such as iterative stratification further improve results, offering a workable approach for resource-constrained settings.
Abstract: Clinical notes contain valuable unstructured information. Named entity recognition (NER) enables the automatic extraction of medical concepts; however, benchmarks for Portuguese remain scarce. In this study, we aimed to evaluate BERT-based models and large language models (LLMs) for clinical NER in Portuguese and to test strategies for addressing multilabel imbalance. We compared BioBERTpt, BERTimbau, ModernBERT, and mmBERT with LLMs such as GPT-5 and Gemini-2.5, using the public SemClinBr corpus and a private breast cancer dataset. Models were trained under identical conditions and evaluated using precision, recall, and F1-score. Iterative stratification, weighted loss, and oversampling were explored to mitigate class imbalance. The mmBERT-base model achieved the best performance (micro F1 = 0.76), outperforming all other models. Iterative stratification improved class balance and overall performance. Multilingual BERT models, particularly mmBERT, perform strongly for Portuguese clinical NER and can run locally with limited computational resources. Balanced data-splitting strategies further enhance performance.
[35] AMALIA Technical Report: A Fully Open Source Large Language Model for European Portuguese
Afonso Simplício,Gonçalo Vinagre,Miguel Moura Ramos,Diogo Tavares,Rafael Ferreira,Giuseppe Attanasio,Duarte M. Alves,Inês Calvo,Inês Vieira,Rui Guerra,James Furtado,Beatriz Canaverde,Iago Paulo,Vasco Ramos,Diogo Glória-Silva,Miguel Faria,Marcos Treviso,Daniel Gomes,Pedro Gomes,David Semedo,André Martins,João Magalhães
Main category: cs.CL
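TL;DR: This paper introduces AMALIA, an open-source large language model focused on European Portuguese (pt-PT) that uses high-quality pt-PT data in the mid- and post-training stages and is released with a benchmark suite designed for pt-PT, substantially improving performance on pt-PT-specific evaluations.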
Details
Motivation: European Portuguese (pt-PT) is underrepresented in both the training data and the native evaluation of existing LLMs, and machine-translated benchmarks are likely to miss its linguistic and cultural characteristics.
Method: Proposes AMALIA, which uses more high-quality pt-PT data during the mid- and post-training stages, and builds and releases a pt-PT benchmark suite comprising translated standard tasks and four new datasets.
Result: AMALIA matches strong baselines on translated general benchmarks and substantially improves performance on pt-PT-specific evaluations.
Conclusion: Targeted training and native benchmarking are important for improving language model performance on European Portuguese.
Abstract: Despite rapid progress in open large language models (LLMs), European Portuguese (pt-PT) remains underrepresented in both training data and native evaluation, with machine-translated benchmarks likely missing the variant's linguistic and cultural nuances. We introduce AMALIA, a fully open LLM that prioritizes pt-PT by using more high-quality pt-PT data during both the mid- and post-training stages. To evaluate pt-PT more faithfully, we release a suite of pt-PT benchmarks that includes translated standard tasks and four new datasets targeting pt-PT generation, linguistic competence, and pt-PT/pt-BR bias. Experiments show that AMALIA matches strong baselines on translated benchmarks while substantially improving performance on pt-PT-specific evaluations, supporting the case for targeted training and native benchmarking for European Portuguese.
[36] JAL-Turn: Joint Acoustic-Linguistic Modeling for Real-Time and Robust Turn-Taking Detection in Full-Duplex Spoken Dialogue Systems
Guangzhao Yang,Yu Pan,Shi Qiu,Ningjie Bai
Main category: cs.CL
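TL;DR: This paper proposes JAL-Turn, a lightweight and efficient speech-only turn-taking detection framework. Joint acoustic-linguistic modeling and a shared frozen ASR encoder enable low-latency, real-time turn-taking prediction with no additional overhead. Aided by a scalable data construction pipeline, the method outperforms existing approaches on multilingual benchmarks and on Japanese customer-service data.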
Details
Motivation: Existing systems rely solely on acoustic or semantic cues, which leads to suboptimal accuracy and stability; giving large language models full-duplex capabilities requires costly data and incurs high training and deployment overhead, making real-time requirements hard to meet. Method: Proposes the JAL-Turn framework: a joint acoustic-linguistic modeling paradigm in which a cross-attention module adaptively fuses pre-trained acoustic representations with linguistic features; a shared frozen ASR encoder lets turn-taking detection run in parallel with speech recognition; a scalable data construction pipeline automatically derives reliable turn-taking labels from large-scale real-world dialogue corpora. Result: On several public multilingual benchmarks and an in-house Japanese customer-service dataset, JAL-Turn consistently outperforms strong state-of-the-art baselines in detection accuracy while maintaining superior real-time performance. Conclusion: JAL-Turn is a turn-taking detection solution that combines efficiency, robustness, and practicality, and is suitable for industrial-grade Voice AI agent deployment. Abstract: Despite recent advances, efficient and robust turn-taking detection remains a significant challenge in industrial-grade Voice AI agent deployments. Many existing systems rely solely on acoustic or semantic cues, leading to suboptimal accuracy and stability, while recent attempts to endow large language models with full-duplex capabilities require costly full-duplex data and incur substantial training and deployment overheads, limiting real-time performance. In this paper, we propose JAL-Turn, a lightweight and efficient speech-only turn-taking framework that adopts a joint acoustic-linguistic modeling paradigm, in which a cross-attention module adaptively integrates pre-trained acoustic representations with linguistic features to support low-latency prediction of hold vs shift states. By sharing a frozen ASR encoder, JAL-Turn enables turn-taking prediction to run fully in parallel with speech recognition, introducing no additional end-to-end latency or computational overhead. In addition, we introduce a scalable data construction pipeline that automatically derives reliable turn-taking labels from large-scale real-world dialogue corpora. Extensive experiments on public multilingual benchmarks and an in-house Japanese customer-service dataset show that JAL-Turn consistently outperforms strong state-of-the-art baselines in detection accuracy while maintaining superior real-time performance.
[37] ALBA: A European Portuguese Benchmark for Evaluating Language and Linguistic Dimensions in Generative LLMs
Inês Vieira,Inês Calvo,Iago Paulo,James Furtado,Rafael Ferreira,Diogo Tavares,Diogo Glória-Silva,David Semedo,João Magalhães
Main category: cs.CL
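TL;DR: This paper introduces ALBA, a benchmark designed to evaluate the linguistic proficiency of large language models (LLMs) in European Portuguese (pt-PT). It covers eight linguistic dimensions and combines expert annotation with an LLM-as-a-judge framework for scalable evaluation.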
Details
Motivation: Existing LLM training data and benchmarks mainly target Brazilian Portuguese (pt-BR), so under-represented varieties such as European Portuguese (pt-PT) are insufficiently evaluated; a linguistically grounded benchmark specifically for pt-PT is needed. Method: Builds ALBA, a benchmark manually constructed by language experts that covers eight linguistic dimensions, and proposes an LLM-as-a-judge evaluation framework for scalable automatic evaluation of pt-PT generated text. Result: Experiments on a diverse set of models show marked performance differences across linguistic dimensions, supporting ALBA's usefulness for exposing weaknesses in models' pt-PT linguistic ability. Conclusion: ALBA provides a comprehensive, variety-sensitive evaluation basis for developing pt-PT language tools and underlines the importance of linguistically grounded, regionally adapted benchmarks. Abstract: As Large Language Models (LLMs) expand across multilingual domains, evaluating their performance in under-represented languages becomes increasingly important. European Portuguese (pt-PT) is particularly affected, as existing training data and benchmarks are mainly in Brazilian Portuguese (pt-BR). To address this, we introduce ALBA, a linguistically grounded benchmark designed from the ground up to assess LLM proficiency in linguistic-related tasks in pt-PT across eight linguistic dimensions, including Language Variety, Culture-bound Semantics, Discourse Analysis, Word Plays, Syntax, Morphology, Lexicology, and Phonetics and Phonology. ALBA is manually constructed by language experts and paired with an LLM-as-a-judge framework for scalable evaluation of pt-PT generated language. Experiments on a diverse set of models reveal performance variability across linguistic dimensions, highlighting the need for comprehensive, variety-sensitive benchmarks that support further development of tools in pt-PT.
[38] How Open Must Language Models be to Enable Reliable Scientific Inference?
James A. Michaelov,Catherine Arnett,Tyler A. Chang,Pamela D. Rivière,Samuel M. Taylor,Cameron R. Jones,Sean Trott,Roger P. Levy,Benjamin K. Bergen,Micah Altman
Main category: cs.CL
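TL;DR: This paper examines how a model's openness or closedness affects the reliability of inferences in scientific research. It argues that current closed models are generally ill-suited for scientific research, and proposes ways to identify and mitigate threats to inference along with a requirement to justify model selection.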
Details
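Motivation: To examine how the openness or closedness of a model affects the reliability of scientific inference, and how restrictions on information threaten reliable inference. Method: Analyzes the suitability problems of current closed models for scientific research and proposes systematically identifying threats to inference and the measures that mitigate them. Result: Current closed models are found to be generally ill-suited for scientific purposes, with some exceptions; the authors recommend systematically identifying threats to inference, taking mitigating steps, and providing specific justifications for model selection. Conclusion: Closed models are generally detrimental to scientific inference; research reliability should be improved through transparency, systematic assessment, and well-argued justification. Abstract: How does the extent to which a model is open or closed impact the scientific inferences that can be drawn from research that involves it? In this paper, we analyze how restrictions on information about model construction and deployment threaten reliable inference. We argue that current closed models are generally ill-suited for scientific purposes, with some notable exceptions, and discuss ways in which the issues they present to reliable inference can be resolved or mitigated. We recommend that when models are used in research, potential threats to inference should be systematically identified along with the steps taken to mitigate them, and that specific justifications for model selection should be provided.
[39] Development of a European Union Time-Indexed Reference Dataset for Assessing the Performance of Signal Detection Methods in Pharmacovigilance using a Large Language Model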
Maria Kefala,Jeffery L. Painter,Syed Tauhid Bukhari,Maurizio Sessa
Main category: cs.CL
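TL;DR: This study builds a time-indexed adverse event (AE) reference dataset for the EU, based on historical versions of the Summaries of Product Characteristics (SmPCs) of 1,513 centrally authorized medicinal products from 1995-2025. DeepSeek V3 is used to extract AEs from Section 4.8, and regulatory metadata (such as label update dates) provide the time annotation. The dataset contains 110,823 drug-AE associations and supports evaluating the early-detection performance of signal detection methods.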
Details
Motivation: Existing pharmacovigilance reference datasets lack information on when adverse events were formally recognized by regulatory authorities, so analyses cannot be restricted to pre-confirmation periods, which severely limits evaluation of the early-detection performance of signal detection methods. Method: Current and historical SmPCs of all centrally authorized products (n=1,513) were retrieved from the EU Union Register of Medicinal Products (data lock: 15 December 2025); Section 4.8 was extracted and processed with DeepSeek V3 to identify adverse events; regulatory metadata (such as labelling change dates) were extracted programmatically; the date an AE was first included in the SmPC serves as the basis for time indexing. Result: The database covers 1995-2025 and contains 17,763 SmPC versions and 125,026 drug-AE associations; restricting to active products yields 1,479 medicinal products and 110,823 associations; 74.5% of AEs were identified pre-marketing; safety updates peaked around 2012; gastrointestinal, skin, and nervous system disorders were the most common System Organ Classes (SOCs); drugs had a median of 48 AEs across 14 SOCs. Conclusion: The time-indexed reference dataset fills a key gap in EU pharmacovigilance; by adding the timing of regulatory AE recognition, it supports more accurate and comparable evaluation of signal detection algorithms and offers a benchmark for methodological comparisons. Abstract: Background: The identification of optimal signal detection methods is hindered by the lack of reliable reference datasets. Existing datasets do not capture when adverse events (AEs) are officially recognized by regulatory authorities, preventing restriction of analyses to pre-confirmation periods and limiting evaluation of early detection performance. This study addresses this gap by developing a time-indexed reference dataset for the European Union (EU), incorporating the timing of AE inclusion in product labels along with regulatory metadata. Methods: Current and historical Summaries of Product Characteristics (SmPCs) for all centrally authorized products (n=1,513) were retrieved from the EU Union Register of Medicinal Products (data lock: 15 December 2025). Section 4.8 was extracted and processed using DeepSeek V3 to identify AEs. Regulatory metadata, including labelling changes, were programmatically extracted. Time indexing was based on the date of AE inclusion in the SmPC. Results: The database includes 17,763 SmPC versions spanning 1995-2025, comprising 125,026 drug-AE associations. The time-indexed reference dataset, restricted to active products, included 1,479 medicinal products and 110,823 drug-AE associations. Most AEs were identified pre-marketing (74.5%) versus post-marketing (25.5%). Safety updates peaked around 2012. Gastrointestinal, skin, and nervous system disorders were the most represented System Organ Classes. Drugs had a median of 48 AEs across 14 SOCs.
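The time-indexing step described in the Methods above (dating each drug-AE pair by the SmPC version in which it first appears) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the record layout `(drug, version_date, adverse_events)`, the function names, and the rule that an AE counts as "pre-marketing" when it is already listed in the drug's earliest SmPC version are all assumptions made for the example.

```python
from datetime import date


def first_inclusion_dates(versions):
    """Map each (drug, adverse_event) pair to the earliest SmPC version date listing it.

    `versions` is a list of (drug, version_date, adverse_events) tuples;
    this layout is hypothetical and chosen only for illustration.
    """
    first_seen = {}
    for drug, version_date, adverse_events in versions:
        for ae in adverse_events:
            key = (drug, ae)
            if key not in first_seen or version_date < first_seen[key]:
                first_seen[key] = version_date
    return first_seen


def label_pre_post_marketing(versions):
    """Label each pair by whether it was already present in the drug's first SmPC version.

    Assumption: "pre-marketing" means the AE appears in the earliest available version.
    """
    first_seen = first_inclusion_dates(versions)
    earliest_version = {}
    for drug, version_date, _ in versions:
        if drug not in earliest_version or version_date < earliest_version[drug]:
            earliest_version[drug] = version_date
    return {
        key: "pre-marketing" if seen == earliest_version[key[0]] else "post-marketing"
        for key, seen in first_seen.items()
    }


# Toy data: two versions of one hypothetical product.
example_versions = [
    ("drugA", date(2010, 1, 1), ["nausea", "rash"]),
    ("drugA", date(2012, 6, 1), ["nausea", "rash", "headache"]),
]
example_dates = first_inclusion_dates(example_versions)
example_labels = label_pre_post_marketing(example_versions)
```

With such an index, a signal detection method could be evaluated only on reports dated before `example_dates[(drug, ae)]`, which is the kind of pre-confirmation restriction the abstract refers to.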
Conclusions: The proposed dataset addresses a critical gap in pharmacovigilance by incorporating temporal information on AE recognition for the EU, supporting more accurate assessment of signal detection performance and facilitating methodological comparisons across analytical approaches.
[40] When Perplexity Lies: Generation-Focused Distillation of Hybrid Sequence Models
Juan Gabriel Kostelec,Xiang Wang,Axel Laborieux,Christos Sourmpis,Qinghai Guo
Main category: cs.CL
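TL;DR: This paper proposes the Hybrid-KDA architecture and the GenDistill multi-stage distillation pipeline, and argues for generation-based evaluation in place of conventional log-likelihood evaluation because it reflects quality differences in distilled models more faithfully. Six design axes are ablated systematically on Qwen3-0.6B; dataset selection, completion-only masking, and freezing attention layers have the largest impact on generation quality. The best model retains 86-90% of teacher accuracy on knowledge benchmarks while reducing KV cache memory by up to 75% and improving time-to-first-token by 2-4x at 128K-token contexts.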
Details
Motivation: Existing distillation work mostly relies on log-likelihood ranking for evaluation, which hides the true gap in generation quality and biases design decisions; generation-based evaluation closer to real use is needed to guide the construction of efficient hybrid models. Method: Proposes the Hybrid Kimi Delta Attention (Hybrid-KDA) student architecture and the GenDistill multi-stage distillation pipeline, using generation-based evaluation throughout; systematic ablations on Qwen3-0.6B cover six axes: training objective, loss masking, training duration, dataset selection, parameter freezing, and architecture choice. Result: Log-likelihood evaluation markedly underestimates the teacher-student gap and can even reverse the ranking of design choices; dataset selection, completion-only masking, and freezing attention layers have the largest impact on generation quality; the best Hybrid-KDA model reaches 86-90% of teacher accuracy on knowledge benchmarks, reduces KV cache memory by up to 75%, and improves time-to-first-token (TTFT) by 2-4x at 128K-token contexts. Conclusion: Generation-based evaluation is more reliable than log-likelihood and should become standard in distillation research; the Hybrid-KDA and GenDistill framework balances efficiency and generation quality, and key design choices should be validated on generation tasks. Abstract: Converting a pretrained Transformer into a more efficient hybrid model through distillation offers a promising approach to reducing inference costs. However, achieving high-quality generation in distilled models requires careful joint design of both the student architecture and the distillation process. Many prior distillation works evaluate downstream multiple-choice benchmarks by ranking candidate answers with log-likelihood rather than requiring autoregressive generation, which can obscure important differences in model quality. For example, we show that a 7B parameter distilled model that nearly matches its teacher to within 0.2 pp under log-likelihood scoring actually falls behind by 20.8 pp when the model must generate answers autoregressively. We propose a Hybrid Kimi Delta Attention (Hybrid-KDA) architecture paired with GenDistill, a multi-stage distillation pipeline, and use generation-based evaluation throughout to guide design decisions. Applying this approach to Qwen3-0.6B, we systematically ablate six design axes: training objective, loss masking, training duration, dataset selection, parameter freezing, and architecture choice. We find that log-likelihood-based evaluation consistently underestimates the gap between teacher and student, and can in some cases reverse the ranking of design choices, meaning that conclusions drawn from perplexity-only evaluation may be misleading.
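To make the difference between the two evaluation protocols concrete, here is a small Python sketch. It is an illustration under stated assumptions rather than the authors' evaluation harness: the per-option log-probabilities and generated strings are made-up toy values, the function names are hypothetical, and answer extraction is reduced to checking whether the generated text starts with the gold option letter.

```python
def loglik_ranking_correct(candidate_logprobs, gold):
    """Log-likelihood scoring: correct if the gold option has the highest score.

    The model only has to rank the given candidates; it never generates text.
    """
    predicted = max(candidate_logprobs, key=candidate_logprobs.get)
    return predicted == gold


def generation_correct(generated_text, gold):
    """Generation scoring: the model's own output must start with the gold label.

    Simplified answer extraction (prefix match), assumed for illustration.
    """
    return generated_text.strip().upper().startswith(gold.upper())


def accuracy(flags):
    """Fraction of True values; returns 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


# Toy items: the student ranks the gold option first in both cases,
# but only produces a usable answer when asked to generate in one of them.
items = [
    {"gold": "B", "logprobs": {"A": -2.1, "B": -1.3, "C": -4.0, "D": -3.2},
     "generated": "B"},
    {"gold": "C", "logprobs": {"A": -3.0, "B": -2.5, "C": -1.1, "D": -2.9},
     "generated": "It depends on the context."},
]

ranking_acc = accuracy(loglik_ranking_correct(i["logprobs"], i["gold"]) for i in items)
generation_acc = accuracy(generation_correct(i["generated"], i["gold"]) for i in items)
```

On this toy data the ranking protocol reports a perfect score while the generation protocol does not, which is the kind of discrepancy the abstract describes at much larger scale.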
Among the factors we study, dataset selection, completion-only masking, and freezing attention layers during post-training have the largest impact on generation quality. Our best Hybrid-KDA model retains 86-90% of teacher accuracy on knowledge benchmarks while reducing KV cache memory by up to 75% and improving time-to-first-token by 2-4x at 128K-token contexts.
[41] MemBoost: A Memory-Boosted Framework for Cost-Aware LLM Inference
Joris Köster,Zixuan Liu,Siavash Khajavi,Zizhan Zheng
Main category: cs.CL
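TL;DR: MemBoost is a memory-boosted LLM serving framework in which a lightweight model reuses previously generated answers and retrieves supporting information, escalating difficult queries to a stronger model only when needed. This substantially reduces inference cost without sacrificing quality.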
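The general idea of answer reuse plus cost-aware escalation can be sketched in a few lines of Python. This is a rough illustration only and does not reproduce MemBoost: the class name, the thresholds, the use of `difflib.SequenceMatcher` for near-duplicate matching, and the convention that the small model returns an `(answer, confidence)` pair are all assumptions introduced for the example.

```python
from difflib import SequenceMatcher


class MemoryRouter:
    """Toy router: reuse stored answers, else try a small model, else escalate."""

    def __init__(self, small_model, strong_model,
                 reuse_threshold=0.9, confidence_threshold=0.7):
        self.small_model = small_model      # callable: query -> (answer, confidence)
        self.strong_model = strong_model    # callable: query -> answer
        self.reuse_threshold = reuse_threshold
        self.confidence_threshold = confidence_threshold
        self.memory = []                    # list of (query, answer) pairs
        self.strong_calls = 0               # counts expensive invocations

    def _best_match(self, query):
        """Return (similarity, answer) for the most similar stored query."""
        best = (0.0, None)
        for past_query, answer in self.memory:
            score = SequenceMatcher(None, query.lower(), past_query.lower()).ratio()
            if score > best[0]:
                best = (score, answer)
        return best

    def answer(self, query):
        score, cached = self._best_match(query)
        if cached is not None and score >= self.reuse_threshold:
            return cached                   # near-duplicate query: reuse the answer
        answer, confidence = self.small_model(query)
        if confidence < self.confidence_threshold:
            answer = self.strong_model(query)   # uncertain: escalate
            self.strong_calls += 1
        self.memory.append((query, answer))     # memory grows over time
        return answer


# Stand-in models for demonstration purposes.
def toy_small(query):
    return ("small:" + query, 0.9 if "easy" in query else 0.1)


def toy_strong(query):
    return "strong:" + query


router = MemoryRouter(toy_small, toy_strong)
first = router.answer("hard question")
second = router.answer("hard question")     # served from memory, no second escalation
third = router.answer("easy one")
```

A production system would presumably use embedding-based retrieval and a calibrated uncertainty estimate instead of string similarity and a hand-set confidence value; the sketch only shows how reuse keeps repeated queries away from the expensive model.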
Details
Motivation: Large language models (LLMs) incur high inference cost in real-world services, and resources are wasted especially under workloads with many repeated or near-duplicate queries. Method: Proposes the MemBoost framework, which supports answer reuse, continual memory growth, and cost-aware routing; the lightweight model responds first, and the strong model is invoked only for uncertain or difficult queries; unlike standard RAG, it is designed for interactive settings. Result: In experiments across multiple models under simulated workloads, MemBoost substantially reduces large-model invocations and overall inference cost while maintaining answer quality comparable to the strong-model baseline. Conclusion: MemBoost balances inference efficiency and generation quality, offering a scalable and economical option for real-world LLM services with high concurrency and many repeated queries. Abstract: Large Language Models (LLMs) deliver strong performance but incur high inference cost in real-world services, especially under workloads with repeated or near-duplicate queries across users and sessions. In this work, we propose MemBoost, a memory-boosted LLM serving framework that enables a lightweight model to reuse previously generated answers and retrieve relevant supporting information for cheap inference, while selectively escalating difficult or uncertain queries to a stronger model. Unlike standard retrieval-augmented generation, which primarily grounds a single response, MemBoost is designed for interactive settings by supporting answer reuse, continual memory growth, and cost-aware routing. Experiments across multiple models under simulated workloads show that MemBoost substantially reduces expensive large-model invocations and overall inference cost, while maintaining high answer quality comparable to the strong model baseline.
[42] EnTaCs: Analyzing the Relationship Between Sentiment and Language Choice in English-Tamil Code-Switching
Paul Bontempo
Main category: cs.CL
TL;DR: This paper explores how sentiment affects language choice in English-Tamil code-switched text, finding that positive utterances use more English and mixed-sentiment ones switch languages more frequently, supporting socio-linguistic theories of prestige and identity.
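The two per-utterance measurements this entry relies on, English proportion and language switch frequency, can be illustrated with a short Python sketch. It assumes token-level language tags are already available (in the paper they come from a fine-tuned XLM-RoBERTa tagger); the tag names `"en"` and `"ta"` and the function names are hypothetical, and the paper's exact definitions (for example, how switch frequency is normalised by utterance length) may differ.

```python
def english_proportion(tags):
    """Share of tokens tagged as English in one utterance (0.0 if empty)."""
    if not tags:
        return 0.0
    return sum(tag == "en" for tag in tags) / len(tags)


def switch_count(tags):
    """Number of positions where the language tag differs from the previous token."""
    return sum(prev != cur for prev, cur in zip(tags, tags[1:]))


# Toy utterance with hypothetical token-level tags: Tamil, Tamil, English, English, Tamil.
example_tags = ["ta", "ta", "en", "en", "ta"]
example_proportion = english_proportion(example_tags)   # 2 of 5 tokens are English
example_switches = switch_count(example_tags)           # ta->en and en->ta
```

Per-utterance values like these could then be regressed on sentiment labels, with utterance length as a control, along the lines of the linear regression analysis the entry mentions.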
Details
Motivation: To understand how emotional content influences language choice in multilingual code-switching settings, particularly in English-Tamil interactions, grounded in socio-linguistic associations of prestige and identity. Method: Fine-tuned XLM-RoBERTa for token-level language identification on 35,650 romanized YouTube comments from the DravidianCodeMix dataset; computed English proportion and language switch frequency per utterance; applied linear regression to analyze relationships with sentiment. Result: Positive utterances have significantly higher English proportion (34.3%) than negative ones (24.8%); mixed-sentiment utterances show highest language switch frequency when controlling for utterance length. Conclusion: Emotional content demonstrably influences language choice in multilingual code-switching, aligning with socio-linguistic theories linking sentiment to prestige and identity in embedded/matrix languages. Abstract: This paper investigates the relationship between utterance sentiment and language choice in English-Tamil code-switched text, using methods from machine learning and statistical modelling. We apply a fine-tuned XLM-RoBERTa model for token-level language identification on 35,650 romanized YouTube comments from the DravidianCodeMix dataset, producing per-utterance measurements of English proportion and language switch frequency. Linear regression analysis reveals that positive utterances exhibit significantly greater English proportion (34.3%) than negative utterances (24.8%), and mixed-sentiment utterances show the highest language switch frequency when controlling for utterance length. These findings support the hypothesis that emotional content demonstrably influences language choice in multilingual code-switching settings, due to socio-linguistic associations of prestige and identity with embedded and matrix languages.
[43] Weight Tying Biases Token Embeddings Towards the Output Space
Antonio Lopardo,Avyukth Harish,Catherine Arnett,Akshat Gupta
Main category: cs.CL
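TL;DR: This paper shows that weight tying in language models optimizes the embedding matrix mainly for output prediction rather than input representation. This "unembedding bias" arises because output gradients dominate early in training, and it can be reduced by scaling the input gradients.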
Details
Motivation: Weight tying is common practice in language model design, yet its effect on the learned embedding space remains poorly understood. Method: Compares how the embedding matrices of tied and untied models align, uses tuned lens analysis to examine the effect on early-layer computations, and scales input gradients as a causal test. Result: Tied embedding matrices are closer to output (unembedding) matrices than to input embeddings; output gradients dominate early in training, causing the unembedding bias and weakening early layers' contribution to the residual stream; scaling input gradients reduces the bias. Conclusion: Weight tying biases the embedding matrix toward output prediction at the expense of its role as an input representation, which helps explain why it can harm performance at scale and has implications for training smaller models. Abstract: Weight tying, i.e. sharing parameters between input and output embedding matrices, is common practice in language model design, yet its impact on the learned embedding space remains poorly understood. In this paper, we show that tied embedding matrices align more closely with output (unembedding) matrices than with input embeddings of comparable untied models, indicating that the shared matrix is shaped primarily for output prediction rather than input representation. This unembedding bias arises because output gradients dominate early in training. Using tuned lens analysis, we show this negatively affects early-layer computations, which contribute less effectively to the residual stream. Scaling input gradients during training reduces this bias, providing causal evidence for the role of gradient imbalance. This is mechanistic evidence that weight tying optimizes the embedding matrix for output prediction, compromising its role in input representation. These results help explain why weight tying can harm performance at scale and have implications for training smaller LLMs, where the embedding matrix contributes substantially to total parameter count.
cs.CV [Back]
[44] A-SelecT: Automatic Timestep Selection for Diffusion Transformer Representation Learning
Changyu Liu,James Chenhao Liang,Wenhao Yang,Yiming Cui,Jinghao Yang,Tianyang Wang,Qifan Wang,Dongfang Liu,Cheng Han
Main category: cs.CV
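TL;DR: This paper proposes A-SelecT, which dynamically selects the most information-rich timestep in a Diffusion Transformer (DiT), improving its representational capacity and training efficiency on discriminative tasks.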
Details
Motivation: Existing DiT models for discriminative representation learning are constrained by inadequate timestep searching and insufficient exploitation of DiT-specific feature representations, which limits training efficiency and representational capacity. Method: Proposes Automatically Selected Timestep (A-SelecT), which in a single run dynamically pinpoints the most information-rich timestep from the selected transformer feature, avoiding costly exhaustive timestep search and suboptimal discriminative feature selection. Result: On classification and segmentation benchmarks, DiT with A-SelecT surpasses all prior diffusion-based discriminative methods, efficiently and effectively. Conclusion: A-SelecT improves DiT's performance on downstream discriminative tasks after generative pre-training and offers a new direction for the discriminative use of diffusion models. Abstract: Diffusion models have significantly reshaped the field of generative artificial intelligence and are now increasingly explored for their capacity in discriminative representation learning. Diffusion Transformer (DiT) has recently gained attention as a promising alternative to conventional U-Net-based diffusion models, demonstrating a promising avenue for downstream discriminative tasks via generative pre-training. However, its current training efficiency and representational capacity remain largely constrained due to the inadequate timestep searching and insufficient exploitation of DiT-specific feature representations. In light of this view, we introduce Automatically Selected Timestep (A-SelecT) that dynamically pinpoints DiT's most information-rich timestep from the selected transformer feature in a single run, eliminating the need for both computationally intensive exhaustive timestep searching and suboptimal discriminative feature selection. Extensive experiments on classification and segmentation benchmarks demonstrate that DiT, empowered by A-SelecT, surpasses all prior diffusion-based attempts efficiently and effectively.
[45] A Survey of OCR Evaluation Methods and Metrics and the Invisibility of Historical Documents
Fitsum Sileshi Beyene,Christopher L. Dancy
Main category: cs.CV
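TL;DR: This paper critically examines how OCR and document understanding systems are evaluated. It argues that mainstream evaluation is heavily skewed toward modern, Western, institutional documents and overlooks the challenges that layout, typography, and material degradation pose in historical archives, especially Black historical newspapers. The review finds that existing benchmark datasets and training data contain almost no historical documents published by Black communities, and that evaluation metrics fail to capture structural errors common in historical newspapers (such as column collapse, typographic errors, and hallucinated text), resulting in structural invisibility and representational harm. The authors analyze the causes at the organizational and institutional levels and call for a more inclusive evaluation paradigm.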
Details
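Motivation: Current evaluation of OCR and document understanding systems is heavily skewed toward modern, Western, institutional documents, which masks failures on historical and marginalized archives (such as Black historical newspapers); the evaluation bias and its social impact need to be made visible. Method: Using the PRISMA framework, systematically reviews OCR and document understanding papers and benchmark datasets published between 2006 and 2025, analyzing their training data sources, benchmark design, and evaluation metrics; draws on prior empirical studies and archival statistics from significant Black press collections to argue that evaluation gaps lead to structural invisibility and representational harm; and traces the roots of the problem at the organizational (meso) and institutional (macro) levels. Result: Community-produced historical documents such as Black historical newspapers are almost absent from existing training data and evaluation benchmarks; mainstream evaluation focuses on character accuracy and task success on modern layouts and does not reflect structural failures common in historical newspapers, such as column collapse, typographic errors, and hallucinated text; these evaluation gaps lead to structural invisibility and representational harm. Conclusion: The evaluation paradigm for OCR and document understanding systems is systematically biased; changes in data governance, benchmark design, and incentive structures are needed at the organizational and institutional levels to promote evaluation standards that are more historically sensitive and socially inclusive. Abstract: Optical character recognition (OCR) and document understanding systems increasingly rely on large vision and vision-language models, yet evaluation remains centered on modern, Western, and institutional documents. This emphasis masks system behavior in historical and marginalized archives, where layout, typography, and material degradation shape interpretation. This study examines how OCR and document understanding systems are evaluated, with particular attention to Black historical newspapers. We review OCR and document understanding papers, as well as benchmark datasets, published between 2006 and 2025 using the PRISMA framework. We look into how the studies report training data, benchmark design, and evaluation metrics for vision transformer and multimodal OCR systems. During the review, we found that Black newspapers and other community-produced historical documents rarely appear in reported training data or evaluation benchmarks. Most evaluations emphasize character accuracy and task success on modern layouts. They rarely capture structural failures common in historical newspapers, including column collapse, typographic errors, and hallucinated text. To put these findings into perspective, we use previous empirical studies and archival statistics from significant Black press collections to show how evaluation gaps lead to structural invisibility and representational harm.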
We propose that these gaps occur due to organizational (meso) and institutional (macro) behaviors and structure, shaped by benchmark incentives and data governance decisions.
[46] Evaluating Synthetic Images as Effective Substitutes for Experimental Data in Surface Roughness Classification
Binwei Chen,Huachao Leng,Chi Yeung Mang,Tsz Wai Cheung,Yanhua Chen,Wai Keung Anthony Loh,Chi Ho Wong,Chak Yin Tang
Main category: cs.CV
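TL;DR: This study uses Stable Diffusion XL to generate synthetic images of ceramic surface roughness for an AI classification task. It finds that synthetic data can effectively replace or supplement real experimental images, maintaining classification accuracy while substantially reducing dependence on costly imaging equipment and large labeled datasets.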
Details
Motivation: AI-based surface roughness classification is constrained by the need for large labeled datasets and costly high-resolution imaging equipment. Method: Synthetic images of ceramic surface roughness are generated with Stable Diffusion XL and combined with real experimental images to train classification models; robustness is assessed by systematically varying hyperparameters such as epoch count, batch size, and learning rate. Result: Real datasets augmented with synthetic images reach test accuracies comparable to those obtained with experimental images alone; certain hyperparameter configurations preserve performance while reducing data requirements. Conclusion: Generative AI can substantially improve data efficiency and reliability in materials-image classification, offering a practical route to lower experimental cost, faster model development, and broader use of AI in materials engineering. Abstract: Hard coatings play a critical role in industry, with ceramic materials offering outstanding hardness and thermal stability for applications that demand superior mechanical performance. However, deploying artificial intelligence (AI) for surface roughness classification is often constrained by the need for large labeled datasets and costly high-resolution imaging equipment. In this study, we explore the use of synthetic images, generated with Stable Diffusion XL, as an efficient alternative or supplement to experimentally acquired data for classifying ceramic surface roughness. We show that augmenting authentic datasets with generative images yields test accuracies comparable to those obtained using exclusively experimental images, demonstrating that synthetic images effectively reproduce the structural features necessary for classification. We further assess method robustness by systematically varying key training hyperparameters (epoch count, batch size, and learning rate), and identify configurations that preserve performance while reducing data requirements. Our results indicate that generative AI can substantially improve data efficiency and reliability in materials-image classification workflows, offering a practical route to lower experimental cost, accelerate model development, and expand AI applicability in materials engineering.
[47] Focus-to-Perceive Representation Learning: A Cognition-Inspired Hierarchical Framework for Endoscopic Video Analysis
Yuan Zhang,Sihao Dou,Kai Hu,Shuhua Deng,Chunhong Cao,Fen Xiao,Xieping Gao
Main category: cs.CV
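TL;DR: This paper proposes FPRL, a cognition-inspired hierarchical self-supervised learning framework for endoscopic video analysis. It first focuses on intra-frame lesion regions to learn static semantics, then models lesion evolution across frames to learn contextual semantics, substantially improving representations when annotations are scarce.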
Details
Motivation: Existing self-supervised pre-training methods designed for natural videos emphasize dense spatio-temporal modeling and exhibit motion bias, overlooking the static, structured semantics that clinical decisions depend on, while endoscopic video analysis is further limited by scarce high-quality annotations. Method: Proposes the Focus-to-Perceive Representation Learning (FPRL) framework: 1) teacher-prior adaptive masking (TPAM) with multi-view sparse sampling learns intra-frame static semantics; 2) cross-view masked feature completion (CVMFC) and attention-guided temporal prediction (AGTP) model cross-frame contextual semantics; the two types of semantics are explicitly distinguished and learned collaboratively. Result: Extensive experiments on 11 endoscopic video datasets show that FPRL outperforms existing methods across diverse downstream tasks, supporting its effectiveness for endoscopic video representation learning. Conclusion: By emulating the cognitive process of clinical examination, FPRL jointly models static lesion semantics and structured temporal evolution, offering a new approach to medical video analysis with limited annotations. Abstract: Endoscopic video analysis is essential for early gastrointestinal screening but remains hindered by limited high-quality annotations. While self-supervised video pre-training shows promise, existing methods developed for natural videos prioritize dense spatio-temporal modeling and exhibit motion bias, overlooking the static, structured semantics critical to clinical decision-making. To address this challenge, we propose Focus-to-Perceive Representation Learning (FPRL), a cognition-inspired hierarchical framework that emulates clinical examination. FPRL first focuses on intra-frame lesion-centric regions to learn static semantics, and then perceives their evolution across frames to model contextual semantics. To achieve this, FPRL employs a hierarchical semantic modeling mechanism that explicitly distinguishes and collaboratively learns both types of semantics. Specifically, it begins by capturing static semantics via teacher-prior adaptive masking (TPAM) combined with multi-view sparse sampling. This approach mitigates redundant temporal dependencies and enables the model to concentrate on lesion-related local semantics. Following this, contextual semantics are derived through cross-view masked feature completion (CVMFC) and attention-guided temporal prediction (AGTP). These processes establish cross-view correspondences and effectively model structured inter-frame evolution, thereby reinforcing temporal semantic continuity while preserving global contextual integrity.
Extensive experiments on 11 endoscopic video datasets show that FPRL achieves superior performance across diverse downstream tasks, demonstrating its effectiveness in endoscopic video representation learning. The code is available at https://github.com/MLMIP/FPRL.
[48] ArtHOI: Taming Foundation Models for Monocular 4D Reconstruction of Hand-Articulated-Object Interactions
Zikai Wang,Zhilu Zhang,Yiqing Wang,Hui Li,Wangmeng Zuo
Main category: cs.CV
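TL;DR: This paper proposes ArtHOI, presented as the first framework to reconstruct 4D human-articulated-object interactions from monocular RGB video. It integrates priors from multiple foundation models and introduces Adaptive Sampling Refinement and a multimodal large language model guided hand-object alignment method to address physical implausibility and inaccuracy.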
Details
Motivation: Existing hand-object interaction (HOI) methods are limited to rigid objects, while 4D reconstruction of articulated objects usually relies on pre-scanning or multi-view video; reconstructing 4D human-articulated-object interactions from monocular RGB video remains an open problem. Method: Proposes the optimization-based ArtHOI framework, which integrates priors from multiple foundation models; introduces Adaptive Sampling Refinement (ASR) to calibrate the object's metric scale and pose; and designs a multimodal large language model (MLLM) guided hand-object alignment method that uses contact reasoning as a constraint when optimizing the hand-object mesh composition. Result: The method's effectiveness and robustness are validated on the newly contributed ArtHOI-RGBD and ArtHOI-Wild datasets across diverse objects and interaction scenarios. Conclusion: ArtHOI addresses the ill-posed problem of reconstructing 4D human-articulated-object interactions from monocular RGB video and offers a new approach for general interaction understanding and generation. Abstract: Existing hand-object interactions (HOI) methods are largely limited to rigid objects, while 4D reconstruction methods of articulated objects generally require pre-scanning the object or even multi-view videos. It remains an unexplored but significant challenge to reconstruct 4D human-articulated-object interactions from a single monocular RGB video. Fortunately, recent advancements in foundation models present a new opportunity to address this highly ill-posed problem. To this end, we introduce ArtHOI, an optimization-based framework that integrates and refines priors from multiple foundation models. Our key contribution is a suite of novel methodologies designed to resolve the inherent inaccuracies and physical unreality of these priors. In particular, we introduce an Adaptive Sampling Refinement (ASR) method to optimize object's metric scale and pose for grounding its normalized mesh in world space. Furthermore, we propose a Multimodal Large Language Model (MLLM) guided hand-object alignment method, utilizing contact reasoning information as constraints of hand-object mesh composition optimization. To facilitate a comprehensive evaluation, we also contribute two new datasets, ArtHOI-RGBD and ArtHOI-Wild. Extensive experiments validate the robustness and effectiveness of our ArtHOI across diverse objects and interactions. Project: https://arthoi-reconstruction.github.io.
[49] End-to-end Feature Alignment: A Simple CNN with Intrinsic Class Attribution
Parniyan Farvardin,David Chapman
Main category: cs.CV
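TL;DR: This paper proposes Feature-Align CNN (FA-CNN), a new CNN architecture that achieves end-to-end feature alignment through order-preserving operations (a dampened skip connection and a global average pooling classifier head), improving interpretability. It proves theoretically that the penultimate feature maps are equivalent to Grad-CAM saliency maps, and validates effectiveness and interpretability on several benchmark datasets.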
Details
Motivation: Unordered operations in conventional CNNs (such as linear and convolutional layers) mix and shuffle semantic concepts, making raw feature maps hard to interpret, so better intrinsic interpretability is needed. Method: Designs the FA-CNN architecture with two order-preserving layers, the dampened skip connection and the global average pooling classifier head, which enforce end-to-end feature alignment from input pixels to class logits; proves theoretically that its feature maps agree with Grad-CAM and morph gradually layer by layer. Result: FA-CNN performs well on benchmark image classification datasets; its raw feature maps match or outperform Grad-CAM and permutation methods on a percent-pixels-removed interpretability task; theory shows that its penultimate feature maps are identical to Grad-CAM saliency maps and evolve smoothly across layers. Conclusion: FA-CNN attains intrinsic class attribution and high interpretability through architectural design, with both theoretical and empirical support; future work includes extensions toward hybrid models and a discussion of current limitations. Abstract: We present Feature-Align CNN (FA-CNN), a prototype CNN architecture with intrinsic class attribution through end-to-end feature alignment. Our intuition is that the use of unordered operations such as Linear and Conv2D layers cause unnecessary shuffling and mixing of semantic concepts, thereby making raw feature maps difficult to understand. We introduce two new order preserving layers, the dampened skip connection, and the global average pooling classifier head. These layers force the model to maintain an end-to-end feature alignment from the raw input pixels all the way to final class logits. This end-to-end alignment enhances the interpretability of the model by allowing the raw feature maps to intrinsically exhibit class attribution. We prove theoretically that FA-CNN penultimate feature maps are identical to Grad-CAM saliency maps. Moreover, we prove that these feature maps slowly morph layer-by-layer over network depth, showing the evolution of features through network depth toward penultimate class activations. FA-CNN performs well on benchmark image classification datasets. Moreover, we compare the averaged FA-CNN raw feature maps against Grad-CAM and permutation methods in a percent pixels removed interpretability task. We conclude this work with a discussion of future work, including limitations and extensions toward hybrid models.
[50] LEMON: a foundation model for nuclear morphology in Computational Pathology
Loïc Chadoutaud,Alice Blondel,Hana Feki,Jacqueline Fontugne,Emmanuel Barillot,Thomas Walter
Main category: cs.CV
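TL;DR: This paper proposes LEMON, a self-supervised foundation model for single-cell image representation learning. Trained on millions of cell images from diverse tissues and cancer types, it learns robust and versatile nuclear morphology representations that support large-scale single-cell pathology analysis.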
Details
Motivation: Representation learning at the single-cell level remains underexplored in computational pathology, despite its importance for characterizing cell types and phenotypes. Method: Proposes LEMON, a self-supervised model trained on a large, diverse single-cell image dataset to learn general representations of nuclear morphology. Result: Evaluation on a range of prediction tasks across five benchmark datasets shows strong performance, indicating LEMON's effectiveness and potential for single-cell pathology analysis. Conclusion: LEMON offers a new paradigm for cell-level computational pathology that is scalable, robust, and versatile; the model weights are publicly available. Abstract: Computational pathology relies on effective representation learning to support cancer research and precision medicine. Although self-supervised learning has driven major progress at the patch and whole-slide image levels, representation learning at the single-cell level remains comparatively underexplored, despite its importance for characterizing cell types and cellular phenotypes. We introduce LEMON (Learning Embeddings from Morphology Of Nuclei), a self-supervised foundation model for scalable single-cell image representation learning. Trained on millions of cell images from diverse tissues and cancer types, LEMON learns robust and versatile morphological representations that support large-scale single-cell analyses in pathology. We evaluate LEMON on five benchmark datasets across a range of prediction tasks and show that it provides strong performance, highlighting its potential as a new paradigm for cell-level computational pathology. Model weights are available at https://huggingface.co/aliceblondel/LEMON.
[51] Do All Vision Transformers Need Registers? A Cross-Architectural Reassessment
Spiros Baxevanakis,Platon Karageorgis,Ioannis Dravilas,Konrad Szewczyk
Main category: cs.CV
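TL;DR: This paper reproduces the work of Darcet et al. (2024) on adding empty input tokens (registers) to ViTs to remove attention-map artifacts, and tests how well it generalizes across several models (DINO, DINOv2, OpenCLIP, DeiT3). Some conclusions do not hold universally; the paper also examines the effect of model size and clarifies terminological confusion.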
Details
Motivation: To address the artifacts that appear in Vision Transformer attention maps and hinder interpretability, and to test whether the register mechanism proposed by Darcet et al. generalizes across models. Method: Reproduces the experiments and extends them to several ViT architectures, including DINO, DINOv2, OpenCLIP, and DeiT3; systematically evaluates the effect of registers on attention-map quality, model performance, and models of different sizes; analyzes and unifies the terminology of the original paper. Result: Registers are confirmed to remove artifacts and sharpen attention maps in some models, but the effect is not fully universal across models; smaller models also benefit from registers; inconsistent terminology affects the reliability of cross-model conclusions. Conclusion: The register mechanism is effective, but its applicability depends on the specific model architecture and training paradigm; the method should be extended with care, and terminology should be standardized to support more reliable cross-model analysis. Abstract: Training Vision Transformers (ViTs) presents significant challenges, one of which is the emergence of artifacts in attention maps, hindering their interpretability. Darcet et al. (2024) investigated this phenomenon and attributed it to the need of ViTs to store global information beyond the [CLS] token. They proposed a novel solution involving the addition of empty input tokens, named registers, which successfully eliminate artifacts and improve the clarity of attention maps. In this work, we reproduce the findings of Darcet et al. (2024) and evaluate the generalizability of their claims across multiple models, including DINO, DINOv2, OpenCLIP, and DeiT3. While we confirm the validity of several of their key claims, our results reveal that some claims do not extend universally to other models. Additionally, we explore the impact of model size, extending their findings to smaller models. Finally, we untie terminology inconsistencies found in the original paper and explain their impact when generalizing to a wider range of models.
[52] Geo$^\textbf{2}$: Geometry-Guided Cross-view Geo-Localization and Image Synthesis
Yancheng Zhang,Xiaohan Zhang,Guangyu Sun,Zonglin Lyu,Safwan Wshah,Chen Chen
Main category: cs.CV
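TL;DR: This paper proposes Geo^2, a framework that uses 3D geometric priors from geometric foundation models (e.g., VGGT). GeoMap builds a shared 3D-aware latent space to reduce cross-view discrepancies, GeoFlow is a flow-matching model for bidirectional cross-view image synthesis, and a consistency loss keeps the two synthesis directions coherent. The framework reaches state-of-the-art performance on both CVGL and CVIS.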
Details
Motivation: Geometric foundation models (GFMs) extract strong 3D geometric features, but their potential for cross-view geo-spatial tasks (CVGL and CVIS) has not been fully explored, and the large viewpoint gap between ground and aerial imagery makes direct application difficult. Method: Proposes the unified framework Geo^2, comprising: 1) GeoMap, which embeds ground and aerial features into a shared 3D-aware latent space, reducing cross-view discrepancies and supporting both localization and synthesis; 2) GeoFlow, a flow-matching model conditioned on geometry-aware latent embeddings that performs bidirectional CVIS; 3) a consistency loss that enforces latent alignment between the two synthesis directions. Result: On the standard benchmarks CVUSA, CVACT, and VIGOR, Geo^2 achieves state-of-the-art performance in both cross-view geo-localization (CVGL) and bidirectional cross-view image synthesis (CVIS). Conclusion: 3D geometric priors improve cross-view geo-spatial learning; with a shared 3D-aware latent space and geometry-guided flow matching, Geo^2 handles CVGL and CVIS jointly and improves performance on both. Abstract: Cross-view geo-spatial learning consists of two important tasks: Cross-View Geo-Localization (CVGL) and Cross-View Image Synthesis (CVIS), both of which rely on establishing geometric correspondences between ground and aerial views. Recent Geometric Foundation Models (GFMs) have demonstrated strong capabilities in extracting generalizable 3D geometric features from images, but their potential in cross-view geo-spatial tasks remains underexplored. In this work, we present Geo^2, a unified framework that leverages Geometric priors from GFMs (e.g., VGGT) to jointly perform geo-spatial tasks, CVGL and bidirectional CVIS. Despite the 3D reconstruction ability of GFMs, directly applying them to CVGL and CVIS remains challenging due to the large viewpoint gap between ground and aerial imagery. We propose GeoMap, which embeds ground and aerial features into a shared 3D-aware latent space, effectively reducing cross-view discrepancies for localization. This shared latent space naturally bridges cross-view image synthesis in both directions. To exploit this, we propose GeoFlow, a flow-matching model conditioned on geometry-aware latent embeddings. We further introduce a consistency loss to enforce latent alignment between the two synthesis directions, ensuring bidirectional coherence. Extensive experiments on standard benchmarks, including CVUSA, CVACT, and VIGOR, demonstrate that Geo^2 achieves state-of-the-art performance in both localization and synthesis, highlighting the effectiveness of 3D geometric priors for cross-view geo-spatial learning.
[53] ViGoR-Bench: How Far Are Visual Generative Models From Zero-Shot Visual Reasoners?
Haonan Han,Jiancheng Huang,Xiaopeng Sun,Junyan He,Rui Yang,Jie Hu,Xiaojiang Peng,Lin Ma,Xiaoming Wei,Xiu Li
Main category: cs.CV
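TL;DR: This paper proposes the ViGoR benchmark for evaluating the logical reasoning ability of AIGC models in visual generation tasks, exposing serious deficits of current models in physical, causal, and spatial reasoning.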
Details
Motivation: Existing evaluations rely on superficial metrics or fragmented benchmarks, creating a "performance mirage" that overlooks logical reasoning in the generative process, while modern AIGC models in fact suffer from a "logical desert". Method: Proposes ViGoR (Vision-Generative Reasoning-centric Benchmark), with four key innovations: 1) holistic cross-modal coverage spanning image-to-image and video tasks; 2) a dual-track mechanism that evaluates both intermediate processes and final results; 3) an evidence-grounded automated judge that ensures high agreement with humans; 4) granular diagnostic analysis that decomposes performance into fine-grained cognitive dimensions. Result: Experiments on more than 20 leading models show that even state-of-the-art models have significant reasoning deficits, supporting ViGoR as a "stress test" for next-generation intelligent vision models. Conclusion: ViGoR provides a key evaluation framework for dispelling the performance mirage around AIGC models and advancing visual generative models with genuine logical reasoning ability. Abstract: Beneath the stunning visual fidelity of modern AIGC models lies a "logical desert", where systems fail tasks that require physical, causal, or complex spatial reasoning. Current evaluations largely rely on superficial metrics or fragmented benchmarks, creating a "performance mirage" that overlooks the generative process. To address this, we introduce ViGoR (Vision-Generative Reasoning-centric Benchmark), a unified framework designed to dismantle this mirage. ViGoR distinguishes itself through four key innovations: 1) holistic cross-modal coverage bridging Image-to-Image and Video tasks; 2) a dual-track mechanism evaluating both intermediate processes and final results; 3) an evidence-grounded automated judge ensuring high human alignment; and 4) granular diagnostic analysis that decomposes performance into fine-grained cognitive dimensions. Experiments on over 20 leading models reveal that even state-of-the-art systems harbor significant reasoning deficits, establishing ViGoR as a critical "stress test" for the next generation of intelligent vision models. The demo is available at https://vincenthancoder.github.io/ViGoR-Bench/
[54] Fus3D: Decoding Consolidated 3D Geometry from Feed-forward Geometry Transformer Latents
Laura Fink,Linus Franke,George Kopanas,Marc Stamminger,Peter Hedman
Main category: cs.CV
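TL;DR: This paper proposes a feed-forward method for dense signed distance field (SDF) regression that needs no camera calibration or post-hoc fusion. It works on the intermediate feature space of a pretrained multi-view geometry transformer, using learned volumetric extraction and a convolutional decoder to produce a complete, accurate SDF directly.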
Details
Motivation: Existing pipelines discard the intermediate features of multi-view geometry transformers, which are rich in world information, and instead rely on per-view prediction followed by post-hoc fusion, losing completeness information and accumulating errors. Method: Proposes learned volumetric extraction from the geometry transformer's intermediate features: a grid of canonical voxel embeddings progressively absorbs multi-view geometry information through cross- and self-attention; a lightweight convolutional decoder then outputs a dense SDF; a validity-aware SDF supervision scheme (derived from depth maps or 3D assets) handles practical issues such as non-watertight meshes. Result: Produces complete, well-defined, and geometrically plausible SDFs in both sparse- and dense-view settings, with strong geometric completion, in under three seconds of inference. Conclusion: The intermediate feature space of pretrained geometry transformers encodes a powerful joint world representation, and extracting 3D geometry directly from it outperforms the conventional per-view prediction plus post-hoc fusion pipeline, offering a new path toward efficient, robust 3D reconstruction. Abstract: We propose a feed-forward method for dense Signed Distance Field (SDF) regression from unstructured image collections in less than three seconds, without camera calibration or post-hoc fusion. Our key insight is that the intermediate feature space of pretrained multi-view feed-forward geometry transformers already encodes a powerful joint world representation; yet, existing pipelines discard it, routing features through per-view prediction heads before assembling 3D geometry post-hoc, which discards valuable completeness information and accumulates inaccuracies. We instead perform 3D extraction directly from geometry transformer features via learned volumetric extraction: voxelized canonical embeddings that progressively absorb multi-view geometry information through interleaved cross- and self-attention into a structured volumetric latent grid. A simple convolutional decoder then maps this grid to a dense SDF. We additionally propose a scalable, validity-aware supervision scheme directly using SDFs derived from depth maps or 3D assets, tackling practical issues like non-watertight meshes. Our approach yields complete and well-defined distance values across sparse- and dense-view settings and demonstrates geometrically plausible completions. Code and further material can be found at https://lorafib.github.io/fus3d.
[55] GazeQwen: Lightweight Gaze-Conditioned LLM Modulation for Streaming Video Understanding
Trong Thang Pham,Hien Nguyen,Ngan Le
Main category: cs.CV
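TL;DR: This paper proposes GazeQwen, a parameter-efficient method for augmenting multimodal large language models (MLLMs) that injects eye-gaze information through hidden-state modulation and substantially improves video understanding.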
Details
Motivation: Existing MLLMs cannot make effective use of eye-gaze information for video understanding, even when gaze cues are provided as visual overlays or text descriptions. Method: GazeQwen uses a lightweight gaze resampler (~1-5M parameters) to encode V-JEPA 2.1 video features together with fixation-derived positional encodings, producing additive residuals that are injected into selected LLM decoder layers via forward hooks; an optional second stage adds LoRA adapters for tighter integration. Result: Reaches 63.9% accuracy across all 10 tasks of the StreamGaze benchmark, 16.1 points above the same Qwen2.5-VL-7B backbone with visual gaze prompts and 10.5 points above GPT-4o, the highest score among the open-source and proprietary models tested. Conclusion: Learning where to inject gaze information within an LLM is more effective than scaling model size or refining prompts; GazeQwen demonstrates the value of lightweight, structured integration of gaze signals. Abstract: Current multimodal large language models (MLLMs) cannot effectively utilize eye-gaze information for video understanding, even when gaze cues are supplied via visual overlays or text descriptions. We introduce GazeQwen, a parameter-efficient approach that equips an open-source MLLM with gaze awareness through hidden-state modulation. At its core is a compact gaze resampler (~1-5 M trainable parameters) that encodes V-JEPA 2.1 video features together with fixation-derived positional encodings and produces additive residuals injected into selected LLM decoder layers via forward hooks. An optional second training stage adds low-rank adapters (LoRA) to the LLM for tighter integration. Evaluated on all 10 tasks of the StreamGaze benchmark, GazeQwen reaches 63.9% accuracy, a +16.1 point gain over the same Qwen2.5-VL-7B backbone with gaze as visual prompts and +10.5 points over GPT-4o, the highest score among all open-source and proprietary models tested. These results suggest that learning where to inject gaze within an LLM is more effective than scaling model size or engineering better prompts. All code and checkpoints are available at https://github.com/phamtrongthang123/gazeqwen.
[56] Dynamic LIBRAS Gesture Recognition via CNN over Spatiotemporal Matrix Representation
Jasmine Moreira
Main category: cs.CV
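TL;DR: This paper proposes a dynamic gesture recognition method based on the MediaPipe hand landmark detector and a CNN, used to control a smart-home system with Brazilian Sign Language (LIBRAS); it reaches 95% accuracy in low light and 92% under normal lighting.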
Details
Motivation: Gesture-controlled home automation based on Brazilian Sign Language (LIBRAS) needs a dynamic gesture recognition method that is robust, real-time, and light on resources. Method: Uses the MediaPipe Hand Landmarker to extract 21 hand keypoints and builds a 90×21 spatiotemporal matrix as input; a CNN performs classification; a sliding window with frame triplication enables real-time continuous recognition without an RNN. Result: Achieves 95% and 92% recognition accuracy under low-light and normal lighting respectively, covering 11 classes of static and dynamic LIBRAS gestures. Conclusion: The method is efficient and suited to real-time home automation, but systematic experiments with more diverse users are needed to verify generalization. Abstract: This paper proposes a method for dynamic hand gesture recognition based on the composition of two models: the MediaPipe Hand Landmarker, responsible for extracting 21 skeletal keypoints of the hand, and a convolutional neural network (CNN) trained to classify gestures from a spatiotemporal matrix representation of dimensions 90 by 21 of those keypoints. The method is applied to the recognition of LIBRAS (Brazilian Sign Language) gestures for device control in a home automation system, covering 11 classes of static and dynamic gestures. For real-time inference, a sliding window with temporal frame triplication is used, enabling continuous recognition without recurrent networks. Tests achieved 95% accuracy under low-light conditions and 92% under normal lighting. The results indicate that the approach is effective, although systematic experiments with greater user diversity are needed for a more thorough evaluation of generalization.
[57] GUIDE: A Benchmark for Understanding and Assisting Users in Open-Ended GUI Tasks
Saelyne Yang,Jaesang Yu,Yi-Hao Peng,Kevin Qinghong Lin,Jae Won Cho,Yale Song,Juho Kim
Main category: cs.CV
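TL;DR: This paper proposes the GUIDE benchmark for evaluating how well GUI agents perceive user behavior, infer intent, and provide help. Existing models perform poorly, but adding user context substantially improves performance.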
Details
Motivation: GUI agent research has focused mainly on automating actions, overlooking users' wish to explore, iterate, and stay in control; a shift is needed toward collaboration grounded in understanding user behavior and intent. Method: Builds the GUIDE benchmark from 67.5 hours of screen recordings with think-aloud narration, drawn from 120 novice user demonstrations across 10 software applications, and defines three tasks: behavior state detection, intent prediction, and help prediction. Result: Eight state-of-the-art multimodal models reach only 44.6% and 55.0% accuracy on behavior state detection and help prediction; adding user context raises help prediction accuracy by up to 50.2 percentage points. Conclusion: Structured understanding of user context is critical for GUI agents to provide effective assistance; GUIDE provides a new benchmark and dataset for future research. Abstract: Graphical User Interface (GUI) agents have the potential to assist users in interacting with complex software (e.g., PowerPoint, Photoshop). While prior research has primarily focused on automating user actions through clicks and keystrokes, this paradigm overlooks human intention, where users value the ability to explore, iterate, and refine their ideas while maintaining agency. To move beyond automation and toward collaboration, GUI agents must understand what users are doing and why. We introduce GUIDE (GUI User Intent Detection Evaluation), a benchmark that evaluates AI models on their ability to perceive user behavior, infer intent, and provide assistance in open-ended GUI tasks. GUIDE consists of 67.5 hours of screen recordings from 120 novice user demonstrations with think-aloud narrations, across 10 software applications. GUIDE defines three tasks - (i) Behavior State Detection, (ii) Intent Prediction, and (iii) Help Prediction - that test a model's ability to recognize behavior state, reason about goals, and decide when and how to help. Evaluations across eight state-of-the-art multimodal models reveal that all models struggled, achieving only 44.6% and 55.0% accuracy on behavior state and help prediction. However, providing user context significantly improved the performance, raising help prediction by up to 50.2pp, highlighting the critical role of structured user understanding in effective assistance. Our dataset is available at https://guide-bench.github.io.
[58] Seeing Through Smoke: Surgical Desmoking for Improved Visual Perception
Jingpei Lu,Fengyi Jiang,Xiaorui Zhang,Lingbo Jin,Omid Mohareri
Main category: cs.CV
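TL;DR: This paper proposes a transformer-based surgical desmoking model with a physics-inspired desmoking head that predicts both the smoke-free image and a smoke map. Training and evaluation are supported by a synthetic data generation pipeline and a large paired dataset (5,817 pairs) captured with a real da Vinci system. Experiments show state-of-the-art image reconstruction, and the paper also assesses the effect of desmoking on downstream depth estimation and instrument segmentation.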
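To make the synthetic-data idea in the TL;DR above concrete, the sketch below blends a smoke layer into a clean image with a per-pixel density map. The linear blend `smoky = (1 - d) * clean + d * smoke` is an assumption chosen for illustration; the paper's actual blending procedure is not specified here, and `blend_smoke` is a hypothetical helper.

```python
from typing import List

Image = List[List[float]]  # grayscale image, intensities in [0, 1]


def blend_smoke(clean: Image, smoke: Image, density: Image) -> Image:
    """Per-pixel linear blend: smoky = (1 - d) * clean + d * smoke.

    `density` holds d in [0, 1] for every pixel (assumed blend model).
    """
    return [
        [(1.0 - d) * c + d * s for c, s, d in zip(c_row, s_row, d_row)]
        for c_row, s_row, d_row in zip(clean, smoke, density)
    ]


# Toy 2x2 example with uniform white smoke.
clean = [[0.2, 0.8], [0.5, 0.0]]
smoke = [[1.0, 1.0], [1.0, 1.0]]
density = [[0.0, 0.5], [1.0, 0.25]]  # could double as a smoke-map target
smoky = blend_smoke(clean, smoke, density)
```

Under this assumed model each synthetic sample yields a (smoky, clean, density) triple, so a network can be supervised on both the smoke-free image and the smoke map.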
Details
Motivation: Surgical smoke severely degrades endoscopic visual perception and vision-based functions, affecting the safety and efficiency of minimally invasive and robot-assisted surgery. Method: Proposes a transformer-based desmoking model with a physics-inspired desmoking head that jointly predicts the clean image and a smoke map; builds a synthetic data generation pipeline (more than 80,000 paired samples); and curates what the authors describe as the largest real paired surgical smoke dataset to date (5,817 pairs, captured with the da Vinci system). Result: Achieves state-of-the-art image reconstruction on a public benchmark and on the new dataset; also examines the benefits and current limitations of desmoking for downstream stereo depth estimation and instrument segmentation. Conclusion: The method offers a new approach to surgical video enhancement, and the combined synthetic and real data strategy eases the shortage of paired data, but digital desmoking still has limitations that remain to be addressed. Abstract: Minimally invasive and robot-assisted surgery relies heavily on endoscopic imaging, yet surgical smoke produced by electrocautery and vessel-sealing instruments can severely degrade visual perception and hinder vision-based functionalities. We present a transformer-based surgical desmoking model with a physics-inspired desmoking head that jointly predicts a smoke-free image and the corresponding smoke map. To address the scarcity of paired smoky-to-smoke-free training data, we develop a synthetic data generation pipeline that blends artificial smoke patterns with real endoscopic images, yielding over 80,000 paired samples for supervised training. We further curate, to our knowledge, the largest paired surgical smoke dataset to date, comprising 5,817 image pairs captured with the da Vinci robotic surgical system, enabling benchmarking on high-resolution endoscopic images. Extensive experiments on both a public benchmark and our dataset demonstrate state-of-the-art performance in image reconstruction compared to existing dehazing and desmoking approaches. We also assess the impact of desmoking on downstream stereo depth estimation and instrument segmentation, highlighting both the potential benefits and current limitations of digital smoke removal methods.
[59] Speech-Synchronized Whiteboard Generation via VLM-Driven Structured Drawing Representations
Suraj Prasad,Pinak Mahapatra
Main category: cs.CV
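TL;DR: This paper presents the first dataset of whiteboard-style educational demonstrations with precise timestamps and uses it to fine-tune a vision-language model (Qwen2-VL-7B) for speech-synchronized drawing generation, showing that timestamp conditioning improves temporal alignment and that the model generalizes across subjects.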
Details
Motivation: No existing method addresses the multimodal synchronization between freehand drawing and spoken narration in whiteboard-style educational videos, and structured, reproducible drawing representations are lacking. Method: Builds a dataset of 24 paired drawing and narration demonstrations with millisecond-precision timestamps (spanning 8 STEM domains) and fine-tunes Qwen2-VL-7B with LoRA to predict full stroke sequences synchronized to speech; evaluation uses topic-stratified five-fold cross-validation. Result: Timestamp conditioning significantly improves temporal alignment over ablated baselines, and the model generalizes to unseen STEM topics. Conclusion: The work shows that speech-driven synchronized drawing generation is feasible from few examples, advances automated educational content generation, and releases the dataset and code. Abstract: Creating whiteboard-style educational videos demands precise coordination between freehand illustrations and spoken narration, yet no existing method addresses this multimodal synchronization problem with structured, reproducible drawing representations. We present the first dataset of 24 paired Excalidraw demonstrations with narrated audio, where every drawing element carries millisecond-precision creation timestamps spanning 8 STEM domains. Using this data, we study whether a vision-language model (Qwen2-VL-7B), fine-tuned via LoRA, can predict full stroke sequences synchronized to speech from only 24 demonstrations. Our topic-stratified five-fold evaluation reveals that timestamp conditioning significantly improves temporal alignment over ablated baselines, while the model generalizes across unseen STEM topics. We discuss transferability to real classroom settings and release our dataset and code to support future research in automated educational content generation.
[60] Automated Quality Assessment of Blind Sweep Obstetric Ultrasound for Improved Diagnosis
Prasiddha Bhandari,Kanchan Poudel,Nishant Luitel,Bishram Acharya,Angelina Ghimire,Tyler Wellman,Kilian Koepsell,Pradeep Raj Regmi,Bishesh Khanal
Main category: cs.CV
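TL;DR: This paper systematically evaluates how the image quality of Blind Sweep Obstetric Ultrasound (BSOU) affects three AI tasks, simulates several acquisition deviations, and develops automated quality-assessment models to detect them. Re-acquiring the sweeps flagged as low quality through a feedback loop improves downstream task performance, showing that automated quality assessment is important for building reliable, scalable AI-assisted prenatal ultrasound workflows.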
Details
Motivation: BSOU has broad potential in resource-limited regions, but the reliability of its AI systems depends heavily on acquisition quality, and the specific effect of protocol deviations on downstream AI predictions is not yet understood. Method: Simulates common acquisition deviations (reversed sweep direction, probe inversion, incomplete sweeps) to systematically evaluate the effect of BSOU quality on three AI tasks (sweep-tag classification, fetal presentation classification, placenta-location classification), and develops automated quality-assessment models; further simulates a feedback re-acquisition loop to verify its effect on downstream performance. Result: The AI models are highly sensitive to acquisition deviations; the quality-assessment models detect the simulated deviations; and the feedback re-acquisition loop improves downstream task performance. Conclusion: Automated quality assessment is a key component of reliable, scalable AI-assisted prenatal ultrasound workflows, especially in low-resource settings. Abstract: Blind Sweep Obstetric Ultrasound (BSOU) enables scalable fetal imaging in low-resource settings by allowing minimally trained operators to acquire standardized sweep videos for automated Artificial Intelligence (AI) interpretation. However, the reliability of such AI systems depends critically on the quality of the acquired sweeps, and little is known about how deviations from the intended protocol affect downstream predictions. In this work, we present a systematic evaluation of BSOU quality and its impact on three key AI tasks: sweep-tag classification, fetal presentation classification, and placenta-location classification. We simulate plausible acquisition deviations, including reversed sweep direction, probe inversion, and incomplete sweeps, to quantify model robustness, and we develop automated quality-assessment models capable of detecting these perturbations. To approximate real-world deployment, we simulate a feedback loop in which flagged sweeps are re-acquired, showing that such correction improves downstream task performance. Our findings highlight the sensitivity of BSOU-based AI models to acquisition variability and demonstrate that automated quality assessment can play a central role in building reliable, scalable AI-assisted prenatal ultrasound workflows, particularly in low-resource environments.
[61] World Reasoning Arena
PAN Team,Qiyue Gao,Kun Zhou,Jiannan Xiang,Zihan Liu,Dequan Yang,Junrong Chen,Arif Ahmad,Cong Zeng,Ganesh Bannur,Xinqi Huang,Zheqi Liu,Yi Gu,Yichi Yang,Guangyi Liu,Zhiting Hu,Zhengzhong Liu,Eric Xing
Main category: cs.CV
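TL;DR: This paper proposes WR-Arena, a comprehensive benchmark for world models (WMs) that evaluates simulation ability along three dimensions: action simulation fidelity, long-horizon forecasting, and simulative reasoning and planning. It reveals a substantial gap between current models and human-level hypothetical reasoning.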
Details
Motivation: Existing world model benchmarks focus too narrowly on next-state prediction and visual fidelity, overlooking the richer simulation capabilities that intelligent behavior requires. Method: Builds the WR-Arena benchmark covering three dimensions (action simulation fidelity, long-horizon forecasting, and simulative reasoning and planning), with a task taxonomy and diverse datasets that go beyond single-turn and perceptual evaluation. Result: Experiments on state-of-the-art world models show a substantial gap between current models and human-level hypothetical reasoning; WR-Arena serves as both a diagnostic tool and a guideline for developing next-generation world models. Conclusion: WR-Arena gives world models a more comprehensive and more demanding evaluation standard, pushing them toward robust understanding, accurate forecasting, and goal-directed action. Abstract: World models (WMs) are intended to serve as internal simulators of the real world that enable agents to understand, anticipate, and act upon complex environments. Existing WM benchmarks remain narrowly focused on next-state prediction and visual fidelity, overlooking the richer simulation capabilities required for intelligent behavior. To address this gap, we introduce WR-Arena, a comprehensive benchmark for evaluating WMs along three fundamental dimensions of next world simulation: (i) Action Simulation Fidelity, the ability to interpret and follow semantically meaningful, multi-step instructions and generate diverse counterfactual rollouts; (ii) Long-horizon Forecast, the ability to sustain accurate, coherent, and physically plausible simulations across extended interactions; and (iii) Simulative Reasoning and Planning, the ability to support goal-directed reasoning by simulating, comparing, and selecting among alternative futures in both structured and open-ended environments. We build a task taxonomy and curate diverse datasets designed to probe these capabilities, moving beyond single-turn and perceptual evaluations. Through extensive experiments with state-of-the-art WMs, our results expose a substantial gap between current models and human-level hypothetical reasoning, and establish WR-Arena as both a diagnostic tool and a guideline for advancing next-generation world models capable of robust understanding, forecasting, and purposeful action. The code is available at https://github.com/MBZUAI-IFM/WR-Arena.
[62] Polarization-Based Eye Tracking with Personalized Siamese Architectures
Beyza Kalkanli,Tom Bu,Mahsa Shakeri,Alexander Fix,Dave Stronks,Dmitri Model,Mantas Žurauskas
Main category: cs.CV
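TL;DR: This paper benchmarks a differential personalization method based on a Siamese architecture for polarization-enabled eye tracking, substantially reducing the number of calibration samples needed while improving accuracy.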
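The differential idea in the TL;DR above can be sketched as follows: a model predicts the gaze displacement between a calibration frame and a query frame, and the absolute gaze is recovered by adding that displacement to the known calibration gaze. This is a minimal sketch, not the paper's implementation; `reconstruct_gaze`, the `predict_displacement` callable, and the choice to average the per-frame estimates are assumptions made for illustration.

```python
from typing import Callable, List, Tuple

Gaze = Tuple[float, float]  # (x, y) gaze angles or screen coordinates
Frame = List[float]         # stand-in for an eye image or feature vector


def reconstruct_gaze(query: Frame,
                     calibration: List[Tuple[Frame, Gaze]],
                     predict_displacement: Callable[[Frame, Frame], Gaze]) -> Gaze:
    """Estimate absolute gaze for `query` from relative predictions.

    For each calibration frame with known gaze g_i, the (hypothetical)
    Siamese model predicts the displacement d_i from that frame to the
    query; each g_i + d_i is one estimate, and the estimates are averaged.
    """
    if not calibration:
        raise ValueError("at least one calibration frame is required")
    xs, ys = [], []
    for frame, (gx, gy) in calibration:
        dx, dy = predict_displacement(frame, query)
        xs.append(gx + dx)
        ys.append(gy + dy)
    return (sum(xs) / len(xs), sum(ys) / len(ys))


# Toy check with an oracle "model": each frame stores its own true gaze,
# so the displacement is just the difference of the stored values.
def oracle(reference: Frame, query: Frame) -> Gaze:
    return (query[0] - reference[0], query[1] - reference[1])


calib = [([0.0, 0.0], (0.0, 0.0)), ([2.0, 1.0], (2.0, 1.0))]
estimate = reconstruct_gaze([1.0, 3.0], calib, oracle)
```

Because every calibration frame contributes an independent estimate, adding frames averages out per-pair prediction noise, which is one plausible reason a small calibration set can suffice.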
Details
Motivation: Eye tracking in head-mounted devices requires per-user calibration, which is inefficient; a more robust personalization method is needed to handle inter-person variability and reduce the calibration burden. Method: A Siamese network learns relative gaze displacements and reconstructs absolute gaze from a small set of calibration frames; the approach is benchmarked on a 338-subject dataset captured with a polarization-sensitive camera and compared against near-infrared (NIR) inputs. Result: Matches linear calibration with 10-fold fewer samples; polarization inputs reduce gaze error by up to 12% compared to NIR inputs; combining with linear calibration yields a further improvement of up to 13% over a linearly calibrated baseline. Conclusion: Siamese personalization is a practical, sample-efficient, and accurate approach to eye-tracking calibration, particularly for polarization-enabled systems. Abstract: Head-mounted devices integrated with eye tracking promise a solution for natural human-computer interaction. However, they typically require per-user calibration for optimal performance due to inter-person variability. A differential personalization approach using Siamese architectures learns relative gaze displacements and reconstructs absolute gaze from a small set of calibration frames. In this paper, we benchmark Siamese personalization on polarization-enabled eye tracking. For benchmarking, we use a 338-subject dataset captured with a polarization-sensitive camera and 850 nm illumination. We achieve performance comparable to linear calibration with 10-fold fewer samples. Using polarization inputs for Siamese personalization reduces gaze error by up to 12% compared to near-infrared (NIR)-based inputs. Combining Siamese personalization with linear calibration yields further improvements of up to 13% over a linearly calibrated baseline. These results establish Siamese personalization as a practical approach enabling accurate eye tracking.
[63] Few Shots Text to Image Retrieval: New Benchmarking Dataset and Optimization Methods
Ofer Idan,Vladi Vexler,Gil Lederman,Dima Sivov,Aviad Cohen Zada,Shir Niego Komforti
Main category: cs.CV
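TL;DR: This paper introduces the Few-Shot Text-to-Image Retrieval (FSIR) task and its benchmark dataset FSIR-BD, aimed at improving the retrieval performance of pre-trained vision-language models on compositional queries and out-of-distribution (OOD) image-text pairs. It also proposes two few-shot retrieval optimization methods that work with any image encoder and outperform existing baselines on mAP.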
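Since the TL;DR above reports results in mean Average Precision (mAP), a short sketch of the standard computation may help. It assumes binary relevance labels listed in ranked order and that every relevant image for a query appears somewhere in its ranking; the function names are illustrative and not taken from the paper.

```python
from typing import List


def average_precision(ranked_relevance: List[int]) -> float:
    """AP for one query, given 0/1 relevance labels in ranked order."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this hit
    return precision_sum / hits if hits else 0.0


def mean_average_precision(per_query: List[List[int]]) -> float:
    """mAP: the mean of per-query AP values."""
    if not per_query:
        return 0.0
    return sum(average_precision(q) for q in per_query) / len(per_query)


# Two toy queries: 1 marks a relevant image at that rank.
queries = [[1, 0, 1], [0, 1]]
score = mean_average_precision(queries)
```

For the first query the relevant images sit at ranks 1 and 3, giving AP = (1/1 + 2/3) / 2; hard negatives ranked above a positive are what pull the precision terms down.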
Details
Motivation: Pre-trained vision-language models perform poorly on compositional queries and out-of-distribution (OOD) image-text pairs, whereas humans can learn from very few examples; the paper aims to close this gap and move toward more human-like few-shot compositional reasoning. Method: Introduces the FSIR task and FSIR-BD, the first benchmark explicitly targeting text-to-image retrieval accompanied by reference examples; the dataset contains 38,353 images and 303 queries, split into a test corpus (82%) and a few-shot reference corpus (FSR, 18%); proposes two retrieval optimization methods that use single-shot or few-shot reference images and are compatible with any pre-trained image encoder. Result: Experiments show that (1) FSIR-BD is a challenging new benchmark, and (2) the proposed optimization methods outperform existing baselines on mAP. Conclusion: The FSIR task and FSIR-BD provide a new setting and evaluation platform for few-shot image retrieval; the proposed plug-in optimization methods improve generalization to compositional and OOD queries. Abstract: Pre-trained vision-language models (VLMs) excel in multimodal tasks, commonly encoding images as embedding vectors for storage in databases and retrieval via approximate nearest neighbor search (ANNS). However, these models struggle with compositional queries and out-of-distribution (OOD) image-text pairs. Inspired by human cognition's ability to learn from minimal examples, we address this performance gap through few-shot learning approaches specifically designed for image retrieval. We introduce the Few-Shot Text-to-Image Retrieval (FSIR) task and its accompanying benchmark dataset, FSIR-BD - the first to explicitly target image retrieval by text accompanied by reference examples, focusing on the challenging compositional and OOD queries. The compositional part is divided into urban scenes and nature species, both in specific situations or with distinctive features. FSIR-BD contains 38,353 images and 303 queries, with 82% comprising the test corpus (averaging 37 positives, i.e. ground-truth matches, per query, plus a significant number of hard negatives) and 18% forming the few-shot reference corpus (FSR) of exemplar positive and hard negative images. Additionally, we propose two novel retrieval optimization methods leveraging single-shot or few-shot reference examples in the FSR to improve performance. Both methods are compatible with any pre-trained image encoder, making them applicable to existing large-scale environments. Our experiments demonstrate that: (1) FSIR-BD provides a challenging benchmark for image retrieval; and (2) our optimization methods outperform existing baselines as measured by mean Average Precision (mAP). Further research into FSIR optimization methods will help narrow the gap between machine and human-level understanding, particularly for compositional reasoning from limited examples.
[64] THFM: A Unified Video Foundation Model for 4D Human Perception and Beyond
Letian Wang,Andrei Zanfir,Eduard Gabriel Bazavan,Misha Andriluka,Cristian Sminchisescu
Main category: cs.CV
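TL;DR: THFM is a unified video foundation model for human-centric perception, built on a text-to-video diffusion model, that handles both dense tasks (depth, normals, segmentation, dense pose) and sparse tasks (2D/3D keypoint estimation). Trained only on synthetic data, it matches or surpasses specialized models on several benchmarks and generalizes to multiple humans, anthropomorphic characters, and animals.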
Details
Motivation: Build a single, general video perception model instead of designing a specialized model per task, and explore whether training on synthetic data alone can deliver strong generalization and cross-task compatibility. Method: Repurposes a pretrained text-to-video diffusion model as a single-forward-pass perception model, adds learnable tokens to support sparse predictions, and switches between tasks through text-prompt modulation. Result: Matches or exceeds existing specialized models on a variety of dense and sparse perception benchmarks; trained only on synthetic, single-human videos, it generalizes to multiple humans, anthropomorphic characters, and animals. Conclusion: Diffusion-based video representations carry strong priors and generalization potential; THFM supports the combination of a unified architecture, synthetic data, and text guidance as an effective recipe for general video perception foundation models. Abstract: We present THFM, a unified video foundation model for human-centric perception that jointly addresses dense tasks (depth, normals, segmentation, dense pose) and sparse tasks (2D/3D keypoint estimation) within a single architecture. THFM is derived from a pretrained text-to-video diffusion model, repurposed as a single-forward-pass perception model and augmented with learnable tokens for sparse predictions. Modulated by the text prompt, our single unified model is capable of performing various perception tasks. Crucially, our model is on par with or surpasses state-of-the-art specialized models on a variety of benchmarks despite being trained exclusively on synthetic data (i.e., without training on real-world or benchmark-specific data). We further highlight intriguing emergent properties of our model, which we attribute to the underlying diffusion-based video representation. For example, our model trained on videos with a single human in the scene generalizes to multiple humans and other object classes such as anthropomorphic characters and animals -- a capability that hasn't been demonstrated in the past.
[65] Shared Representation for 3D Pose Estimation, Action Classification, and Progress Prediction from Tactile Signals
Isaac Han,Seoyoung Lee,Sangyeon Park,Ecehan Akan,Yiyue Luo,Joseph DelPreto,Kyung-Joong Kim
Main category: cs.CV
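TL;DR: This paper proposes SCOTTI, a model that uses foot tactile signals to jointly perform 3D pose estimation, action classification, and action progress prediction (the first work to explore progress prediction from such signals), and introduces a new tactile dataset covering 15 participants, 8 activities, and 7 hours in total.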
Details
Motivation: Vision-based methods suffer from occlusion and privacy concerns in human-robot interaction, while existing tactile methods handle each task separately, which limits performance. Method: Proposes a Shared COnvolutional Transformer for Tactile Inference (SCOTTI) that uses multi-task learning to jointly model 3D pose estimation, action classification, and action progress prediction; foot tactile signals are collected with custom wireless insole sensors. Result: SCOTTI outperforms existing approaches on all three tasks and improves over learning the tasks independently; the paper also introduces a new foot tactile dataset (15 participants, 8 activities, 7 hours). Conclusion: A multi-task tactile inference framework with a shared representation improves performance on every task, supporting the potential of foot tactile signals for jointly understanding human motion in human-robot interaction. Abstract: Estimating human pose, classifying actions, and predicting movement progress are essential for human-robot interaction. While vision-based methods suffer from occlusion and privacy concerns in realistic environments, tactile sensing avoids these issues. However, prior tactile-based approaches handle each task separately, leading to suboptimal performance. In this study, we propose a Shared COnvolutional Transformer for Tactile Inference (SCOTTI) that learns a shared representation to simultaneously address three separate prediction tasks: 3D human pose estimation, action class categorization, and action completion progress estimation. To the best of our knowledge, this is the first work to explore action progress prediction using foot tactile signals from custom wireless insole sensors. This unified approach leverages the mutual benefits of multi-task learning, enabling the model to achieve improved performance across all three tasks compared to learning them independently. Experimental results demonstrate that SCOTTI outperforms existing approaches across all three tasks. Additionally, we introduce a novel dataset collected from 15 participants performing various activities and exercises, with 7 hours of total duration, across eight different activities.
[66] Good Scores, Bad Data: A Metric for Multimodal Coherence
Vasundra Srinivasan
Main category: cs.CV
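TL;DR: This paper proposes a new evaluation metric, the Multimodal Coherence Score (MCS), which measures multimodal fusion quality independently of any downstream task along four dimensions (identity, spatial, semantic, decision), with weights learned by optimization. Experiments show that MCS reflects differences in fusion quality more sensitively than task accuracy, and that each dimension responds independently to its own failure mode.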
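One simple way to read the four-dimension decomposition in the TL;DR above is as a weighted average of per-dimension scores. The sketch below makes that reading concrete, but the aggregation rule, the [0, 1] score range, and the equal weights are assumptions for illustration only; the paper learns its weights with Nelder-Mead optimization and may combine the dimensions differently.

```python
from typing import Dict

DIMENSIONS = ("identity", "spatial", "semantic", "decision")


def coherence_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of the four per-dimension coherence scores.

    Assumes each score lies in [0, 1] and weights are non-negative;
    the weights are normalized so the result also lies in [0, 1].
    """
    total = sum(weights[d] for d in DIMENSIONS)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[d] * scores[d] for d in DIMENSIONS) / total


def weakest_dimension(scores: Dict[str, float]) -> str:
    """Name the lowest-scoring dimension (first one wins on ties)."""
    return min(DIMENSIONS, key=lambda d: scores[d])


# Hypothetical values; equal weights stand in for learned ones.
weights = {"identity": 1.0, "spatial": 1.0, "semantic": 1.0, "decision": 1.0}
sample = {"identity": 1.0, "spatial": 0.5, "semantic": 1.0, "decision": 0.5}
mcs = coherence_score(sample, weights)
culprit = weakest_dimension(sample)
```

Keeping the per-dimension scores alongside the aggregate is what lets a metric of this shape report which dimension failed rather than only that the overall score dropped.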
Details
Motivation: High downstream task accuracy does not guarantee that the multimodal input data is itself coherent, and existing evaluation lacks a direct measure of fusion quality across modalities. Method: Proposes the Multimodal Coherence Score (MCS), which decomposes multimodal coherence into four quantifiable dimensions (identity, spatial, semantic, decision) with weights learned via Nelder-Mead optimization; evaluated on Visual Genome and validated on COCO using DETR, CLIP, and ViLT. Result: Across three fusion architectures, MCS is more sensitive than task accuracy (Spearman rho = 0.093 vs. 0.071); perturbation experiments confirm that each dimension responds only to its own failure mode, with no cross-talk; MCS is lightweight and requires no human annotation. Conclusion: MCS is an effective, disentangled, and interpretable metric for multimodal fusion quality that indicates not only whether something failed but which dimension failed. Abstract: Multimodal AI systems are evaluated by downstream task accuracy, but high accuracy does not mean the underlying data is coherent. A model can score well on Visual Question Answering (VQA) while its inputs contradict each other. We introduce the Multimodal Coherence Score (MCS), a metric that evaluates fusion quality independent of any downstream model. MCS decomposes coherence into four dimensions, identity, spatial, semantic, and decision, with weights learned via Nelder-Mead optimization. We evaluate on 1,000 Visual Genome images using DETR, CLIP, and ViLT, and validate on 150 COCO images with no retraining. Across three fusion architectures, MCS discriminates quality with higher sensitivity than task accuracy alone (Spearman rho = 0.093 vs. 0.071). Perturbation experiments confirm each dimension responds independently to its failure mode with zero cross-talk. MCS is lightweight, requires no human annotation, and tells you not just that something broke, but what broke.
[67] DiReCT: Disentangled Regularization of Contrastive Trajectories for Physics-Refined Video Generation
Abolfazl Meyarian,Amin Karimi Monsefi,Rajiv Ramnath,Ser-Nam Lim
Main category: cs.CV
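TL;DR: This paper proposes DiReCT, which uses contrastive learning that disentangles semantic and physical information to improve the physical plausibility of video generation models while preserving visual quality.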
Details
Motivation: Flow-matching video generators produce high-quality videos but often violate basic physics, because their reconstruction objectives do not distinguish physically plausible dynamics from implausible ones; contrastive flow matching, in turn, suffers from semantic-physics entanglement in text-conditioned video generation, which makes negative sampling difficult and causes gradient conflict. Method: Proposes the DiReCT framework: 1) a macro-contrastive term that samples negatives from semantically distant regions for global trajectory separation; 2) a micro-contrastive term that builds hard negatives by using a large language model to perturb a single axis of physical behavior (kinematics, forces, materials, interactions, or magnitudes); 3) a velocity-space distributional regularizer that prevents degradation of visual quality. Result: On the VideoPhy benchmark, the physical commonsense score improves by 16.7% over the baseline and 11.3% over supervised fine-tuning (SFT), without increasing training time. Conclusion: Contrastive learning that disentangles semantics from physics improves the physical plausibility of generated video without sacrificing visual fidelity or training efficiency. Abstract: Flow-matching video generators produce temporally coherent, high-fidelity outputs yet routinely violate elementary physics because their reconstruction objectives penalize per-frame deviations without distinguishing physically consistent dynamics from impossible ones. Contrastive flow matching offers a principled remedy by pushing apart velocity-field trajectories of differing conditions, but we identify a fundamental obstacle in the text-conditioned video setting: semantic-physics entanglement. Because natural-language prompts couple scene content with physical behavior, naive negative sampling draws conditions whose velocity fields largely overlap with the positive sample's, causing the contrastive gradient to directly oppose the flow-matching objective. We formalize this gradient conflict, deriving a precise alignment condition that reveals when contrastive learning helps versus harms training. Guided by this analysis, we introduce DiReCT (Disentangled Regularization of Contrastive Trajectories), a lightweight post-training framework that decomposes the contrastive signal into two complementary scales: a macro-contrastive term that draws partition-exclusive negatives from semantically distant regions for interference-free global trajectory separation, and a micro-contrastive term that constructs hard negatives sharing full scene semantics with the positive sample but differing along a single, LLM-perturbed axis of physical behavior, spanning kinematics, forces, materials, interactions, and magnitudes. A velocity-space distributional regularizer helps to prevent catastrophic forgetting of pretrained visual quality. When applied to Wan 2.1-1.3B, our method improves the physical commonsense score on VideoPhy by 16.7% and 11.3% compared to the baseline and SFT, respectively, without increasing training time.
[68] DenseSwinV2: Channel Attentive Dual Branch CNN Transformer Learning for Cassava Leaf Disease Classification
Shah Saood,Saddam Hussain Khan
Main category: cs.CV
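TL;DR: This paper proposes Hybrid Dense SwinV2, a dual-branch hybrid model that combines DenseNet's extraction of high-resolution local features with the global context modeling of a customized Swin Transformer V2 (SwinV2), and adds channel-attention squeeze modules for feature enhancement and fusion. On cassava disease classification it reaches 98.02% accuracy and a 97.81% F1 score, outperforming established CNN and Transformer models.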
Details
Motivation: 解决 cassava 叶片疾病分类中因病灶视觉相似、遮挡、噪声及复杂背景导致的识别困难问题,提升田间实际诊断的鲁棒性与实用性。 Method: 构建双分支架构:一为DenseNet分支提取密集局部特征并保障梯度流动;二为定制SwinV2分支通过移位窗口自注意力建模长程依赖;两分支各自接入通道注意力压缩模块以强化判别性响应;最后融合二者增强后的特征图进行分类。 Result: 在含31000张图像、涵盖5类(包括正常)的公开cassava数据集上,取得98.02%分类准确率和97.81% F1分数,性能超越主流CNN与Transformer模型。 Conclusion: Hybrid Dense SwinV2通过CNN与Transformer优势互补及注意力驱动的特征精炼,实现了高精度、强鲁棒性的cassava病害识别,具备良好的落地应用潜力。 Abstract: This work presents a new Hybrid Dense SwinV2, a two-branch framework that jointly leverages densely connected convolutional features and hierarchical customized Swin Transformer V2 (SwinV2) representations for cassava disease classification. The proposed framework captures high resolution local features through its DenseNet branch, preserving the fine structural cues and also allowing for effective gradient flow. Concurrently, the customized SwinV2 models global contextual dependencies through the idea of shifted-window self attention, which enables the capture of long range interactions critical in distinguishing between visually similar lesions. Moreover, an attention channel-squeeze module is employed for each CNN Transformer stream independently to emphasize discriminative disease related responses and suppress redundant or background driven activations. Finally, these discriminative channels are fused to achieve refined representations from the dense local and SwinV2 global correlated strengthened feature maps, respectively. The proposed Dense SwinV2 utilized a public cassava leaf disease dataset of 31000 images, comprised of five diseases, including brown streak, mosaic, green mottle, bacterial blight, and normal leaf conditions. The proposed Dense SwinV2 demonstrates a significant classification accuracy of 98.02 percent with an F1 score of 97.81 percent, outperforming well-established convolutional and transformer models. 
These results underline the fact that Hybrid Dense SwinV2 offers robustness and practicality in the field level diagnosis of cassava disease and real world challenges related to occlusion, noise, and complex backgrounds.
[69] Reinforcing Structured Chain-of-Thought for Video Understanding
Peiyao Wang,Haotian Xu,Noranart Vesdapunt,Rui Hou,Jingyi Zhang,Haibin Ling,Oleksandr Obiednikov,Ning Zhou,Kah Kuen Fu
Main category: cs.CV
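TL;DR: This paper proposes SDRL, a single-stage reinforcement learning framework that needs no supervised fine-tuning. It uses a structured chain of thought (Summarize -> Think -> Answer) and two self-supervised mechanisms (CVK and DVR) to improve the temporal reasoning and generalization of multimodal large language models in video understanding, and reaches state-of-the-art results on seven video question answering datasets.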
Details
Motivation: Existing RL-based video understanding methods for multimodal large language models suffer from thinking drift, weak temporal comprehension, reliance on costly CoT annotation and multi-stage training, and rigid reasoning paths, all of which limit generalization and may introduce bias. Method: Summary-Driven Reinforcement Learning (SDRL) uses a single-stage RL framework with a structured CoT format (Summarize -> Think -> Answer) and integrates two self-supervised mechanisms into the GRPO objective: 1) Consistency of Vision Knowledge (CVK), which strengthens factual grounding by reducing the KL divergence among generated summaries; 2) Dynamic Variety of Reasoning (DVR), which promotes exploration by dynamically modulating thinking diversity based on group accuracy. Result: State-of-the-art performance on seven public VideoQA datasets. Conclusion: SDRL balances alignment and exploration, supervises both the final answer and the reasoning process, removes the dependence on SFT and manual CoT annotation, and improves the robustness, generalization, and temporal reasoning of MLLMs on video understanding. Abstract: Multi-modal Large Language Models (MLLMs) show promise in video understanding. However, their reasoning often suffers from thinking drift and weak temporal comprehension, even when enhanced by Reinforcement Learning (RL) techniques like Group Relative Policy Optimization (GRPO). Moreover, existing RL methods usually depend on Supervised Fine-Tuning (SFT), which requires costly Chain-of-Thought (CoT) annotation and multi-stage training, and enforces fixed reasoning paths, limiting MLLMs' ability to generalize and potentially inducing bias. To overcome these limitations, we introduce Summary-Driven Reinforcement Learning (SDRL), a novel single-stage RL framework that obviates the need for SFT by utilizing a Structured CoT format: Summarize -> Think -> Answer. SDRL introduces two self-supervised mechanisms integrated into the GRPO objective: 1) Consistency of Vision Knowledge (CVK) enforces factual grounding by reducing KL divergence among generated summaries; and 2) Dynamic Variety of Reasoning (DVR) promotes exploration by dynamically modulating thinking diversity based on group accuracy. This novel integration effectively balances alignment and exploration, supervising both the final answer and the reasoning process. Our method achieves state-of-the-art performance on seven public VideoQA datasets.
[70] Collision-Aware Vision-Language Learning for End-to-End Driving with Multimodal Infraction Datasets
Alex Koran,Dimitrios Sinodinos,Hadi Hojjati,Takuya Nanri,Fangge Chen,Narges Armanfard
Main category: cs.CV
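TL;DR: This paper proposes VLAAD, a video-language-augmented anomaly detector for collision-aware representation learning in end-to-end autonomous driving, and builds two multimodal collision datasets, CARLA-Collide and Real-Collide. It reports clear gains on both simulated and real-world tasks.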
Details
Motivation: High infraction rates, especially collision-related ones, are the main bottleneck for end-to-end autonomous driving, yet collision-aware representation learning has received little attention, and high-quality multimodal collision datasets covering diverse scenarios are lacking. Method: VLAAD, a video-language-augmented anomaly detector based on Multiple Instance Learning (MIL), temporally localizes collision signals for proactive prediction; the large-scale simulated dataset CARLA-Collide and the real-world dataset Real-Collide support closed-loop and open-loop evaluation; VLAAD is integrated as a plug-in module into E2E models such as TransFuser++. Result: On the CARLA Leaderboard, the driving score of TransFuser++ rises by 14.12% in relative terms; on Real-Collide, VLAAD with 0.6B parameters outperforms a multi-billion-parameter VLM, with a 23.3% improvement in AUC. Conclusion: With language-guided multimodal collision representations and dedicated datasets, VLAAD improves the collision awareness and generalization of end-to-end driving, supporting the view that a lightweight, specialized module can outperform a general-purpose large model. Abstract: High infraction rates remain the primary bottleneck for end-to-end (E2E) autonomous driving, as evidenced by the low driving scores on the CARLA Leaderboard. Despite collision-related infractions being the dominant failure mode in closed-loop evaluations, collision-aware representation learning has received limited attention. To address this gap, we first develop a Video-Language-Augmented Anomaly Detector (VLAAD), leveraging a Multiple Instance Learning (MIL) formulation to obtain stable, temporally localized collision signals for proactive prediction. To transition these capabilities into closed-loop simulations, we must overcome the limitations of existing simulator datasets, which lack multimodality and are frequently restricted to simple intersection scenarios. Therefore, we introduce CARLA-Collide, a large-scale multimodal dataset capturing realistic collision events across highly diverse road networks. Trained on this diverse simulator data, VLAAD serves as a collision-aware plug-in module that can be seamlessly integrated into existing E2E driving models. By integrating our module into a pretrained TransFuser++ agent, we demonstrate a 14.12% relative increase in driving score with minimal fine-tuning. Beyond closed-loop evaluation, we further assess the generalization capability of VLAAD in an open-loop setting using real-world driving data. To support this analysis, we introduce Real-Collide, a multimodal dataset of diverse dashcam videos paired with semantically rich annotations for collision detection and prediction.
On this benchmark, despite containing only 0.6B parameters, VLAAD outperforms a multi-billion-parameter vision-language model, achieving a 23.3% improvement in AUC.
[71] Low-Rank-Modulated Functa: Exploring the Latent Space of Implicit Neural Representations for Interpretable Ultrasound Video Analysis
Julia Wolleb,Cristiana Baloescu,Alicia Durrer,Hemant D. Tagare,Xenophon Papademetris
Main category: cs.CV
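TL;DR: This paper proposes LRM-Functa, a Functa architecture with low-rank modulation for modeling the latent space of ultrasound videos. It makes temporal structure more interpretable, makes periodicity easier to visualize, and performs well on unsupervised ED/ES frame detection and downstream tasks.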
Details
Motivation: Functa-style implicit neural representations reconstruct images well, but the structure and interpretability of their latent spaces remain largely unexplored, and for temporal medical imaging such as ultrasound video they lack explicit modeling of dynamic patterns. Method: Low-Rank-Modulated Functa (LRM-Functa) enforces a low-rank adaptation of modulation vectors in the time-resolved latent space, so that latent codes trace clear periodic trajectories along the cardiac cycle; the latent space can be traversed to generate novel frames and to read out the ED/ES key frames directly. Result: Accurate unsupervised ED/ES frame detection on cardiac ultrasound; ejection fraction prediction stays competitive when each frame is compressed to rank k=2; generalization is shown on an out-of-distribution cardiac point-of-care dataset and on B-line classification in lung ultrasound. Conclusion: LRM-Functa offers a compact, interpretable, and generalizable framework for ultrasound video analysis that balances reconstruction quality, temporal modeling, and clinical interpretability. Abstract: Implicit neural representations (INRs) have emerged as a powerful framework for continuous image representation learning. In Functa-based approaches, each image is encoded as a latent modulation vector that conditions a shared INR, enabling strong reconstruction performance. However, the structure and interpretability of the corresponding latent spaces remain largely unexplored. In this work, we investigate the latent space of Functa-based models for ultrasound videos and propose Low-Rank-Modulated Functa (LRM-Functa), a novel architecture that enforces a low-rank adaptation of modulation vectors in the time-resolved latent space. When applied to cardiac ultrasound, the resulting latent space exhibits clearly structured periodic trajectories, facilitating visualization and interpretability of temporal patterns. The latent space can be traversed to sample novel frames, revealing smooth transitions along the cardiac cycle, and enabling direct readout of end-diastolic (ED) and end-systolic (ES) frames without additional model training. We show that LRM-Functa outperforms prior methods in unsupervised ED and ES frame detection, while compressing each video frame to as low as rank k=2 without sacrificing competitive downstream performance on ejection fraction prediction. Evaluations on out-of-distribution frame selection in a cardiac point-of-care dataset, as well as on lung ultrasound for B-line classification, demonstrate the generalizability of our approach. Overall, LRM-Functa provides a compact, interpretable, and generalizable framework for ultrasound video analysis.
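Editor's sketch: one way to picture the low-rank modulation idea is shown below. It is an illustrative assumption, not the authors' implementation: it supposes that each frame's modulation vector is a shared base vector plus a rank-k combination of shared basis directions, so that a frame is described by only k coefficients. The names `frame_modulation`, `periodic_codes`, `base`, `basis`, and `coeffs` are hypothetical, and the circular codes are a toy stand-in for the periodic trajectories the abstract reports, not learned values.

```python
import math

def frame_modulation(base, basis, coeffs):
    """Return base + sum_j coeffs[j] * basis[j], a rank-k update of the base vector.

    base   : list of d floats shared by all frames (hypothetical)
    basis  : k lists of d floats shared by all frames (hypothetical)
    coeffs : k floats, the per-frame latent code
    """
    if len(basis) != len(coeffs):
        raise ValueError("need exactly one coefficient per basis vector")
    out = list(base)
    for c, direction in zip(coeffs, basis):
        if len(direction) != len(base):
            raise ValueError("basis vectors must match the base dimension")
        for i, v in enumerate(direction):
            out[i] += c * v
    return out

def periodic_codes(num_frames, frames_per_cycle):
    """Toy rank-2 codes that move around the unit circle once per cycle."""
    step = 2.0 * math.pi / frames_per_cycle
    return [(math.cos(step * t), math.sin(step * t)) for t in range(num_frames)]

# Toy setup: d = 4 modulation dimensions, rank k = 2, two cycles of 4 frames.
base = [0.5, -1.0, 0.0, 2.0]
basis = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
codes = periodic_codes(num_frames=8, frames_per_cycle=4)
modulations = [frame_modulation(base, basis, z) for z in codes]
```

In this toy setting, frames one full cycle apart receive (numerically) the same modulation vector, which is the kind of periodic structure a rank-2 code can expose.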
The code is available at https://github.com/JuliaWolleb/LRM_Functa.
[72] BEVMAPMATCH: Multimodal BEV Neural Map Matching for Robust Re-Localization of Autonomous Vehicles
Shounak Sural,Ragunathan Rajkumar
Main category: cs.CV
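TL;DR: BEVMapMatch is a robust vehicle re-localization framework that does not rely on GNSS. It fuses lidar and camera data to produce multimodal bird's-eye-view (BEV) semantic segmentations and matches them against map patches with a cross-attention-based search mechanism to achieve accurate global localization.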
Details
Motivation: In GNSS-denied or GNSS-degraded environments, localization is an obstacle to the safe, widespread deployment of autonomous vehicles, so robust alternatives that do not depend on GNSS are needed. Method: BEVMapMatch: 1) a context-aware lidar+camera fusion method generates multimodal BEV semantic segmentation maps; 2) a cross-attention-based mechanism retrieves candidate map patches; 3) the top candidate is finely aligned to obtain the global location; 4) multiple frames of BEV segmentation are combined to improve accuracy. Result: In GNSS-denied and adverse-weather conditions, BEVMapMatch outperforms existing re-localization methods, with a Recall@1m of 39.8%, nearly twice that of the best baseline. Conclusion: BEVMapMatch provides an efficient, robust, and accurate re-localization solution for autonomous vehicles in GNSS-denied environments and supports the effectiveness of multimodal BEV fusion combined with attention-driven map matching. Abstract: Localization in GNSS-denied and GNSS-degraded environments is a challenge for the safe widespread deployment of autonomous vehicles. Such GNSS-challenged environments require alternative methods for robust localization. In this work, we propose BEVMapMatch, a framework for robust vehicle re-localization on a known map without the need for GNSS priors. BEVMapMatch uses a context-aware lidar+camera fusion method to generate multimodal Bird's Eye View (BEV) segmentations around the ego vehicle in both good and adverse weather conditions. Leveraging a search mechanism based on cross-attention, the generated BEV segmentation maps are then used for the retrieval of candidate map patches for map-matching purposes. Finally, BEVMapMatch uses the top retrieved candidate for finer alignment against the generated BEV segmentation, achieving accurate global localization without the need for GNSS. Multiple frames of generated BEV segmentation further improve localization accuracy. Extensive evaluations show that BEVMapMatch outperforms existing methods for re-localization in GNSS-denied and adverse environments, with a Recall@1m of 39.8%, being nearly twice as much as the best performing re-localization baseline. Our code and data will be made available at https://github.com/ssuralcmu/BEVMapMatch.git.
[73] Neuro-Cognitive Reward Modeling for Human-Centered Autonomous Vehicle Control
Zhuoli Zhuang,Yu-Cheng Chang,Yu-Kai Wang,Thomas Do,Chin-Teng Lin
Main category: cs.CV
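TL;DR: This paper proposes an electroencephalography (EEG)-guided reinforcement learning framework that feeds human cognitive feedback, such as event-related potentials (ERP), directly into the reward signal for autonomous driving without requiring behavioral responses, and improves collision avoidance.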
Details
Motivation: Conventional RLHF relies on manually ranked preference data, which is time-consuming and indirect and aligns poorly with human driving intent; humans interpret scenes and decide quickly, which calls for a more direct, real-time form of cognitive feedback. Method: EEG signals were collected from 20 participants in a driving simulator, and event-related potentials (ERP) evoked by sudden environmental changes were analyzed; a neural network predicts ERP strength from visual scene information; the predicted ERP strength is integrated into the RL algorithm as a cognitive reward signal. Result: Experiments show that the EEG-guided cognitive reward improves the collision avoidance of the RL agent, supporting neuro-cognitive feedback as a way to make autonomous driving safer. Conclusion: Neural signals such as EEG can act as a more natural and timely proxy for human intent and suggest a new paradigm for RLHF; the work shows that explicitly modeling human cognitive processes and embedding them in the driving decision loop is feasible and beneficial. Abstract: Recent advancements in computer vision have accelerated the development of autonomous driving. Despite these advancements, training machines to drive in a way that aligns with human expectations remains a significant challenge. Human factors are still essential, as humans possess a sophisticated cognitive system capable of rapidly interpreting scene information and making accurate decisions. Aligning machines with human intent has been explored with Reinforcement Learning with Human Feedback (RLHF). Conventional RLHF methods rely on collecting human preference data by manually ranking generated outputs, which is time-consuming and indirect. In this work, we propose an electroencephalography (EEG)-guided decision-making framework to incorporate human cognitive insights without behaviour response interruption into reinforcement learning (RL) for autonomous driving. We collected EEG signals from 20 participants in a realistic driving simulator and analyzed event-related potentials (ERP) in response to sudden environmental changes. Our proposed framework employs a neural network to predict the strength of ERP based on the cognitive information from visual scene information. Moreover, we explore the integration of such cognitive information into the reward signal of the RL algorithm. Experimental results show that our framework can improve the collision avoidance ability of the RL algorithm, highlighting the potential of neuro-cognitive feedback in enhancing autonomous driving systems.
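Editor's sketch: the abstract does not say how the predicted ERP strength enters the reward, so the snippet below shows only one plausible form, stated as an assumption: the environment reward minus a weighted penalty proportional to the predicted ERP strength. The function name `cognitive_reward`, the subtraction, and the default `weight` are all hypothetical.

```python
def cognitive_reward(env_reward, predicted_erp, weight=0.1):
    """Combine the task reward with a penalty proportional to predicted ERP strength.

    Assumption: a stronger predicted ERP marks a scene a human would find
    alarming, so it lowers the reward. The paper may use a different form.
    """
    if weight < 0:
        raise ValueError("weight must be non-negative")
    return env_reward - weight * predicted_erp

# Toy episode: per-step environment rewards and (hypothetical) ERP predictions.
env_rewards = [1.0, 1.0, 1.0]
erp_predictions = [0.0, 0.5, 2.0]
shaped = [cognitive_reward(r, e, weight=0.2)
          for r, e in zip(env_rewards, erp_predictions)]
```

With `weight=0`, the shaped reward reduces to the environment reward, which gives a simple baseline for comparison.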
Our project page is: https://alex95gogo.github.io/Cognitive-Reward/.
[74] Diffusion MRI Transformer with a Diffusion Space Rotary Positional Embedding (D-RoPE)
Gustavo Chau Loo Kung,Mohammad Abbasi,Camila Blank,Juze Zhang,Alan Q. Wang,Sophie Ostmeier,Akshay Chaudhari,Kilian Pohl,Ehsan Adeli
Main category: cs.CV
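TL;DR: This paper proposes D-RoPE, a diffusion space rotary positional embedding, and plugs it into a dMRI Transformer to jointly model the spatial, diffusion-weighting, and directional dependencies of dMRI data, so that robust and transferable representations can be learned across acquisition protocols. After self-supervised pretraining, the model outperforms baseline models on several downstream tasks.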
Details
Motivation: Existing deep learning methods struggle to capture the distinctive properties of dMRI signals (spatial, diffusion-weighting, and directional dependencies) and generalize poorly across acquisition protocols, for example when the number of directions differs, which limits the learning of general-purpose representations. Method: A diffusion space rotary positional embedding (D-RoPE) is plugged into a Transformer architecture designed for dMRI; pretraining uses self-supervised masked autoencoding. Result: On downstream tasks, the fine-tuned features raise mild cognitive impairment classification accuracy by 6% and the correlation coefficient for cognitive score prediction by 0.05; overall performance is competitive with or better than several baselines, including a fully supervised one. Conclusion: D-RoPE models the multi-dimensional structure of dMRI and improves the robustness and cross-protocol transferability of the learned representations, offering a new approach to general-purpose dMRI representation learning. Abstract: Diffusion Magnetic Resonance Imaging (dMRI) plays a critical role in studying microstructural changes in the brain. It is, therefore, widely used in clinical practice; yet progress in learning general-purpose representations from dMRI has been limited. A key challenge is that existing deep learning approaches are not well-suited to capture the unique properties of diffusion signals. Brain dMRI is normally composed of several brain volumes, each with different attenuation characteristics dependent on the direction and strength of the diffusion-sensitized gradients. Thus, there is a need to jointly model spatial, diffusion-weighting, and directional dependencies in dMRI. Furthermore, varying acquisition protocols (e.g., differing numbers of directions) further limit traditional models. To address these gaps, we introduce a diffusion space rotary positional embedding (D-RoPE) plugged into our dMRI transformer to capture both the spatial structure and directional characteristics of diffusion data, enabling robust and transferable representations across diverse acquisition settings and an arbitrary number of diffusion directions.
After self-supervised masked autoencoding pretraining, tests on several downstream tasks show that the learned representations and the pretrained model can provide competitive or superior performance compared to several baselines in these downstream tasks (even compared to a fully trained baseline); the finetuned features from our pretrained encoder resulted in a 6% higher accuracy in classifying mild cognitive impairment and a 0.05 increase in the correlation coefficient when predicting cognitive scores. Code is available at: github.com/gustavochau/D-RoPE.
[75] JRM: Joint Reconstruction Model for Multiple Objects without Alignment
Qirui Wu,Yawar Siddiqui,Duncan Frost,Samir Aroudj,Armen Avetisyan,Richard Newcombe,Angel X. Chang,Jakob Engel,Henry Howard-Jenkins
Main category: cs.CV
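TL;DR: This paper proposes the Joint Reconstruction Model (JRM), which implicitly aggregates unaligned multi-view observations in latent space to perform personalized generative object reconstruction without explicit matching or rigid alignment, making the modeling of repeated objects more robust and more general.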
Details
Motivation: Object-centric reconstruction methods assume objects are independent and therefore ignore the strong consistency signal from the same object appearing repeatedly in a scene; repeated structure is especially hard to exploit across views or scans. Method: JRM is a 3D flow-matching generative model that treats object reconstruction as personalized generation: multiple observations share a common subject that must stay consistent while each keeps its own pose and state; at its core, it implicitly aggregates unaligned observations in latent space without explicit matching or rigid alignment. Result: Experiments on synthetic and real-world data show that JRM removes the need for explicit alignment, is more robust to incorrect associations, and naturally handles non-rigid changes such as articulation, with reconstruction quality above both independent and alignment-based baselines. Conclusion: Implicit joint modeling is more robust and more general than explicit alignment and offers a new approach to object-centric 3D reconstruction, particularly for complex scenes with repeated and non-rigid structure. Abstract: Object-centric reconstruction seeks to recover the 3D structure of a scene through composition of independent objects. While this independence can simplify modeling, it discards strong signals that could improve reconstruction, notably repetition where the same object model is seen multiple times in a scene, or across scans. We propose the Joint Reconstruction Model (JRM) to leverage repetition by framing object reconstruction as one of personalized generation: multiple observations share a common subject that should be consistent for all observations, while still adhering to the specific pose and state from each. Prior methods in this direction rely on explicit matching and rigid alignment across observations, making them sensitive to errors and difficult to extend to non-rigid transformations. In contrast, JRM is a 3D flow-matching generative model that implicitly aggregates unaligned observations in its latent space, learning to produce consistent and faithful reconstructions in a data-driven manner without explicit constraints. Evaluations on synthetic and real-world data show that JRM's implicit aggregation removes the need for explicit alignment, improves robustness to incorrect associations, and naturally handles non-rigid changes such as articulation. Overall, JRM outperforms both independent and alignment-based baselines in reconstruction quality.
[76] FAST3DIS: Feed-forward Anchored Scene Transformer for 3D Instance Segmentation
Changyang Li,Xueqing Huang,Shin-Fang Chng,Huangying Zhan,Qingan Yan,Yi Xu
Main category: cs.CV
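TL;DR: This paper proposes FAST3DIS, an end-to-end feed-forward method for 3D instance segmentation that bypasses conventional non-differentiable clustering through 3D anchor queries and cross-view attention, and uses a dual-level regularization strategy to improve accuracy and efficiency.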
Details
Motivation: Existing 3D instance segmentation methods built on the "lift-and-cluster" paradigm rely on non-differentiable clustering, scale poorly with the number of views, and decouple representation learning from the segmentation objective. Method: Feed-forward Anchored Scene Transformer (FAST3DIS): a 3D-anchored, query-based Transformer built on a depth backbone; a learned 3D anchor generator paired with an anchor-sampling cross-attention mechanism; and a dual-level regularization strategy that combines multi-view contrastive learning with a dynamically scheduled spatial overlap penalty. Result: On complex indoor 3D datasets, the method keeps segmentation accuracy competitive with state-of-the-art clustering-based methods while markedly improving memory scalability and inference speed. Conclusion: FAST3DIS shows that dropping post-hoc clustering in favor of end-to-end differentiable 3D instance segmentation is feasible and advantageous, retaining geometric priors while learning instance semantics. Abstract: While recent feed-forward 3D reconstruction models provide a strong geometric foundation for scene understanding, extending them to 3D instance segmentation typically relies on a disjointed "lift-and-cluster" paradigm. Grouping dense pixel-wise embeddings via non-differentiable clustering scales poorly with the number of views and disconnects representation learning from the final segmentation objective. In this paper, we present a Feed-forward Anchored Scene Transformer for 3D Instance Segmentation (FAST3DIS), an end-to-end approach that effectively bypasses post-hoc clustering. We introduce a 3D-anchored, query-based Transformer architecture built upon a foundational depth backbone, adapted efficiently to learn instance-specific semantics while retaining its zero-shot geometric priors. We formulate a learned 3D anchor generator coupled with an anchor-sampling cross-attention mechanism for view-consistent 3D instance segmentation. By projecting 3D object queries directly into multi-view feature maps, our method samples context efficiently. Furthermore, we introduce a dual-level regularization strategy that couples multi-view contrastive learning with a dynamically scheduled spatial overlap penalty to explicitly prevent query collisions and ensure precise instance boundaries. Experiments on complex indoor 3D datasets demonstrate that our approach achieves competitive segmentation accuracy with significantly improved memory scalability and inference speed over state-of-the-art clustering-based methods.
[77] Neighbor-Aware Localized Concept Erasure in Text-to-Image Diffusion Models
Zhuan Shi,Alireza Dehghanpour Farashah,Rik de Vries,Golnoosh Farnadi
Main category: cs.CV
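TL;DR: This paper proposes Neighbor-Aware Localized Concept Erasure (NLCE), a training-free method that erases a target concept precisely in text-to-image diffusion models while protecting semantically related neighbor concepts, improving fidelity in fine-grained generation.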
Details
Motivation: Existing localized concept erasure methods unintentionally weaken semantically close neighbor concepts when they suppress the target concept, which lowers generation quality in fine-grained domains. Method: NLCE is a three-stage training-free pipeline: (1) spectrally-weighted embedding modulation attenuates target concept directions and stabilizes neighbor concept representations; (2) an attention-guided spatial gate locates regions with residual concept activation; (3) a spatially-gated hard erasure removes remaining traces only where necessary. Result: On fine-grained datasets such as Oxford Flowers and Stanford Dogs, NLCE erases target concepts while better preserving neighboring categories; it also shows robustness and generalization on celebrity identity, explicit content, and artistic style. Conclusion: By making erasure neighbor-aware, NLCE achieves finer and more controllable localized concept erasure, preserving generative capability while reducing semantic drift. Abstract: Concept erasure in text-to-image diffusion models seeks to remove undesired concepts while preserving overall generative capability. Localized erasure methods aim to restrict edits to the spatial region occupied by the target concept. However, we observe that suppressing a concept can unintentionally weaken semantically related neighbor concepts, reducing fidelity in fine-grained domains. We propose Neighbor-Aware Localized Concept Erasure (NLCE), a training-free framework designed to better preserve neighboring concepts while removing target concepts. It operates in three stages: (1) a spectrally-weighted embedding modulation that attenuates target concept directions while stabilizing neighbor concept representations, (2) an attention-guided spatial gate that identifies regions exhibiting residual concept activation, and (3) a spatially-gated hard erasure that eliminates remaining traces only where necessary. This neighbor-aware pipeline enables localized concept removal while maintaining the surrounding concept neighborhood structure. Experiments on fine-grained datasets (Oxford Flowers, Stanford Dogs) show that our method effectively removes target concepts while better preserving closely related categories. Additional results on celebrity identity, explicit content and artistic style demonstrate robustness and generalization to broader erasure scenarios.
[78] FairLLaVA: Fairness-Aware Parameter-Efficient Fine-Tuning for Large Vision-Language Assistants
Mahesh Bhosale,Abdul Wasi,Shantam Srivastava,Shifa Latif,Tianyu Luan,Mingchen Gao,David Doermann,Xuan Gong
Main category: cs.CV
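TL;DR: This paper proposes FairLLaVA, a parameter-efficient fine-tuning method that minimizes the mutual information between target attributes so that multimodal large language models achieve group fairness in visual instruction tuning without loss of overall performance.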
Details
Motivation: Multimodal large language models (MLLMs) perform unevenly across demographic groups in safety-critical settings such as clinical care, which can bias diagnoses and erode trust, yet fairness in MLLMs has received little attention. Method: FairLLaVA uses low-rank adapters for parameter-efficient fine-tuning and regularizes the model's representations through mutual information minimization so that they are invariant to demographic attributes; the method is a lightweight, architecture-agnostic plug-in. Result: On two medical benchmarks, chest radiology report generation and dermoscopy visual question answering, FairLLaVA reduces inter-group disparities while improving equity-scaled clinical performance and natural language generation quality. Conclusion: FairLLaVA offers an efficient, general, and practical way to improve fairness in MLLMs, particularly for high-stakes medical AI applications. Abstract: While powerful in image-conditioned generation, multimodal large language models (MLLMs) can display uneven performance across demographic groups, highlighting fairness risks. In safety-critical clinical settings, such disparities risk producing unequal diagnostic narratives and eroding trust in AI-assisted decision-making. While fairness has been studied extensively in vision-only and language-only models, its impact on MLLMs remains largely underexplored. To address these biases, we introduce FairLLaVA, a parameter-efficient fine-tuning method that mitigates group disparities in visual instruction tuning without compromising overall performance. By minimizing the mutual information between target attributes, FairLLaVA regularizes the model's representations to be demographic-invariant. The method can be incorporated as a lightweight plug-in, maintaining efficiency with low-rank adapter fine-tuning, and provides an architecture-agnostic approach to fair visual instruction following. Extensive experiments on large-scale chest radiology report generation and dermoscopy visual question answering benchmarks show that FairLLaVA consistently reduces inter-group disparities while improving both equity-scaled clinical performance and natural language generation quality across diverse medical imaging modalities. Code can be accessed at https://github.com/bhosalems/FairLLaVA.
[79] VLAgeBench: Benchmarking Large Vision-Language Models for Zero-Shot Human Age Estimation
Rakib Hossain Sajib,Md Kishor Morol,Rajan Das Gupta,Mohammad Sakib Mahmood,Shuvra Smaran Das
Main category: cs.CV
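TL;DR: This paper evaluates how well large vision-language models (LVLMs) such as GPT-4o, Claude 3.5 Sonnet, and LLaMA 3.2 Vision estimate age from facial images in a zero-shot setting without fine-tuning. The models achieve competitive results on UTKFace and FG-NET, and the study also exposes performance gaps and fairness challenges tied to image quality and demographic subgroups.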
Details
Motivation: Traditional deep learning methods depend on large labeled datasets and domain-specific training, whereas LVLMs may enable zero-shot age estimation; their generalization and real-world suitability for this task need systematic evaluation. Method: GPT-4o, Claude 3.5 Sonnet, and LLaMA 3.2 Vision are evaluated in a strict zero-shot setting (no fine-tuning) on UTKFace and FG-NET using eight metrics: MAE, MSE, RMSE, MAPE, MBE, R², CCC, and ±5-year accuracy. Result: In the zero-shot setting, LVLMs perform comparably to traditional supervised models, but they are sensitive to image quality and show performance gaps across age and ethnicity subgroups; prompt sensitivity, interpretability, computational cost, and fairness remain challenges. Conclusion: LVLMs show potential for facial age estimation in real-world settings such as forensics, healthcare monitoring, and human-computer interaction, but fairness and robustness need further work; the study establishes a reproducible zero-shot benchmark. Abstract: Human age estimation from facial images represents a challenging computer vision task with significant applications in biometrics, healthcare, and human-computer interaction. While traditional deep learning approaches require extensive labeled datasets and domain-specific training, recent advances in large vision-language models (LVLMs) offer the potential for zero-shot age estimation. This study presents a comprehensive zero-shot evaluation of state-of-the-art Large Vision-Language Models (LVLMs) for facial age estimation, a task traditionally dominated by domain-specific convolutional networks and supervised learning. We assess the performance of GPT-4o, Claude 3.5 Sonnet, and LLaMA 3.2 Vision on two benchmark datasets, UTKFace and FG-NET, without any fine-tuning or task-specific adaptation. Using eight evaluation metrics, including MAE, MSE, RMSE, MAPE, MBE, $R^2$, CCC, and $\pm$5-year accuracy, we demonstrate that general-purpose LVLMs can deliver competitive performance in zero-shot settings. Our findings highlight the emergent capabilities of LVLMs for accurate biometric age estimation and position these models as promising tools for real-world applications. Additionally, we highlight performance disparities linked to image quality and demographic subgroups, underscoring the need for fairness-aware multimodal inference. This work introduces a reproducible benchmark and positions LVLMs as promising tools for real-world applications in forensic science, healthcare monitoring, and human-computer interaction.
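Editor's sketch: the eight metrics named in the abstract have standard definitions, and the snippet below computes them for paired true and predicted ages. It is not the benchmark's code; the function name `age_metrics` and the dictionary keys are hypothetical, the sign convention for MBE (prediction minus truth) is an assumption, and skipping samples with a true age of 0 in MAPE is this sketch's own choice because the ratio is undefined there.

```python
import math

def age_metrics(y_true, y_pred, tolerance=5.0):
    """Compute MAE, MSE, RMSE, MAPE, MBE, R^2, CCC, and within-tolerance accuracy."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    errors = [p - t for t, p in zip(y_true, y_pred)]  # signed error: prediction minus truth
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mbe = sum(errors) / n
    # MAPE is undefined for a true age of 0, so those samples are skipped here.
    ratios = [abs(e) / t for e, t in zip(errors, y_true) if t > 0]
    mape = 100.0 * sum(ratios) / len(ratios) if ratios else float("nan")
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    var_t = sum((t - mean_t) ** 2 for t in y_true) / n
    var_p = sum((p - mean_p) ** 2 for p in y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)) / n
    # R^2 = 1 - SS_res / SS_tot, written with per-sample averages.
    r2 = 1.0 - mse / var_t if var_t > 0 else float("nan")
    # Lin's concordance correlation coefficient.
    denom = var_t + var_p + (mean_t - mean_p) ** 2
    ccc = 2.0 * cov / denom if denom > 0 else float("nan")
    within = sum(1 for e in errors if abs(e) <= tolerance) / n
    return {
        "mae": mae, "mse": mse, "rmse": math.sqrt(mse), "mape": mape,
        "mbe": mbe, "r2": r2, "ccc": ccc, "acc_within_tol": within,
    }

# Toy usage with three hypothetical samples.
metrics = age_metrics([20, 30, 40], [22, 27, 46])
```

A positive MBE under this convention means the predictions run older than the true ages on average, which makes it a quick check for systematic over- or under-estimation within a subgroup.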
The benchmark focuses on strict zero-shot inference without fine-tuning and highlights remaining challenges related to prompt sensitivity, interpretability, computational cost, and demographic fairness.
[80] GeoReFormer: Geometry-Aware Refinement for Lane Segment Detection and Topology Reasoning
Danny Abraham,Nikhil Kamalkumar Advani,Arun Das,Nikil Dutt
Main category: cs.CV
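TL;DR: This paper proposes GeoReFormer, a geometry- and topology-aware Transformer architecture for 3D lane segment detection and topology reasoning. Structured query initialization, bounded coordinate-space refinement, and gated topology propagation improve both performance and consistency.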
Details
Motivation: Existing Transformer-based methods inherit decoder designs built for compact object detection and do not explicitly model the geometric and relational properties of lane segments as continuous structures in a directed graph. Method: GeoReFormer introduces 1) data-driven geometric priors for structured query initialization; 2) bounded coordinate-space refinement to stabilize polyline deformation; 3) per-query gated topology propagation to selectively integrate relational context. Result: On the OpenLane-V2 benchmark, GeoReFormer reaches 34.5% mAP, the current state of the art, and improves topology consistency. Conclusion: Explicitly embedding geometric and relational structure priors improves 3D lane detection and topology reasoning, supporting the case for task-specific inductive biases. Abstract: Accurate 3D lane segment detection and topology reasoning are critical for structured online map construction in autonomous driving. Recent transformer-based approaches formulate this task as query-based set prediction, yet largely inherit decoder designs originally developed for compact object detection. However, lane segments are continuous polylines embedded in directed graphs, and generic query initialization and unconstrained refinement do not explicitly encode this geometric and relational structure. We propose GeoReFormer (Geometry-aware Refinement Transformer), a unified query-based architecture that embeds geometry- and topology-aware inductive biases directly within the transformer decoder. GeoReFormer introduces data-driven geometric priors for structured query initialization, bounded coordinate-space refinement for stable polyline deformation, and per-query gated topology propagation to selectively integrate relational context. On the OpenLane-V2 benchmark, GeoReFormer achieves state-of-the-art performance with 34.5% mAP while improving topology consistency over strong transformer baselines, demonstrating the utility of explicit geometric and relational structure encoding.
[81] Unlabeled Cross-Center Automatic Analysis for TAAD: An Integrated Framework from Segmentation to Clinical Features
Mengdi Liu,Qiang Li,Weizhi Nie,Shaopeng Zhang,Yuting Su
Main category: cs.CV
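TL;DR: This paper proposes an unsupervised domain adaptation (UDA) framework that automatically extracts key clinical features of Type A Aortic Dissection (TAAD) across institutions without target-domain annotations, balancing segmentation accuracy with clinical interpretability.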
Details
Motivation: Existing TAAD research focuses too heavily on segmentation accuracy and neglects the extraction of clinically quantifiable features; it also depends on costly pixel-level annotations and generalizes poorly across centers. Method: A UDA-driven framework designed for clinical emergency workflows uses limited source-domain labels and adapts to unlabeled target-domain data to achieve stable cross-institutional multi-class segmentation and quantifiable clinical feature extraction. Result: The method outperforms existing state-of-the-art approaches on cross-domain segmentation, and a multi-center reader study with cardiovascular surgeons confirms that the extracted clinical features provide practical help for preoperative assessment. Conclusion: The end-to-end segmentation-to-feature framework addresses three challenges of cross-institutional TAAD deployment (no annotations, strong generalization, and clinical usefulness) and shows potential for real clinical adoption. Abstract: Type A Aortic Dissection (TAAD) is a life-threatening cardiovascular emergency that demands rapid and precise preoperative evaluation. While key anatomical and pathological features are decisive for surgical planning, current research focuses predominantly on improving segmentation accuracy, leaving the reliable, quantitative extraction of clinically actionable features largely under-explored. Furthermore, constructing comprehensive TAAD datasets requires labor-intensive, expert-level pixel-wise annotations, which is impractical for most clinical institutions. Due to significant domain shift, models trained on a single-center dataset also suffer from severe performance degradation during cross-institutional deployment. This study addresses a clinically critical challenge: the accurate extraction of key TAAD clinical features during cross-institutional deployment in the total absence of target-domain annotations. To this end, we propose an unsupervised domain adaptation (UDA)-driven framework for the automated extraction of TAAD clinical features. The framework leverages limited source-domain labels while effectively adapting to unlabeled data from target domains. Tailored for real-world emergency workflows, our framework aims to achieve stable cross-institutional multi-class segmentation, reliable and quantifiable clinical feature extraction, and practical deployability independent of high-cost annotations. Extensive experiments demonstrate that our method significantly improves cross-domain segmentation performance compared to existing state-of-the-art approaches.
More importantly, a reader study involving multiple cardiovascular surgeons confirms that the automatically extracted clinical features provide meaningful assistance for preoperative assessment, highlighting the practical utility of the proposed end-to-end segmentation-to-feature pipeline.
[82] Learning to Trim: End-to-End Causal Graph Pruning with Dynamic Anatomical Feature Banks for Medical VQA
Zibo Xu,Qiang Li,Weizhi Nie,Yuting Su
Main category: cs.CV
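TL;DR: This paper proposes Learnable Causal Trimming (LCT), a learnable causal pruning framework that dynamically separates dataset-level spurious correlations from instance-level causal signals to improve the generalization and robustness of medical visual question answering (MedVQA) models.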
Details
Motivation: MedVQA models often generalize poorly because they rely on dataset-specific spurious correlations, such as recurring anatomical patterns or question-type regularities, instead of genuine diagnostic evidence; existing causal methods are mostly static adjustments or post-hoc corrections and lack an end-to-end learnable mechanism. Method: The Learnable Causal Trimming (LCT) framework builds a momentum-updated Dynamic Anatomical Feature Bank (DAFB) to model global spurious patterns, and a differentiable trimming module estimates the dependency between instance-level representations and the global feature bank, softly suppressing highly correlated features and emphasizing instance-specific evidence. Result: On four medical VQA datasets (VQA-RAD, SLAKE, SLAKE-CP, and PathVQA), LCT outperforms existing debiasing strategies and improves robustness and cross-dataset generalization. Conclusion: Embedding causal pruning in end-to-end training with learnable, dynamic adjustment separates spurious correlations from causal diagnostic evidence more effectively and is a promising route to more trustworthy MedVQA models. Abstract: Medical Visual Question Answering (MedVQA) models often exhibit limited generalization due to reliance on dataset-specific correlations, such as recurring anatomical patterns or question-type regularities, rather than genuine diagnostic evidence. Existing causal approaches are typically implemented as static adjustments or post-hoc corrections. To address this issue, we propose a Learnable Causal Trimming (LCT) framework that integrates causal pruning into end-to-end optimization. We introduce a Dynamic Anatomical Feature Bank (DAFB), updated via a momentum mechanism, to capture global prototypes of frequent anatomical and linguistic patterns, serving as an approximation of dataset-level regularities. We further design a differentiable trimming module that estimates the dependency between instance-level representations and the global feature bank. Features highly correlated with global prototypes are softly suppressed, while instance-specific evidence is emphasized. This learnable mechanism encourages the model to prioritize causal signals over spurious correlations adaptively. Experiments on VQA-RAD, SLAKE, SLAKE-CP and PathVQA demonstrate that LCT consistently improves robustness and generalization over existing debiasing strategies.
[83] Knowledge is Power: Advancing Few-shot Action Recognition with Multimodal Semantics from MLLMs
Jiazheng Xing,Chao Xu,Hangjie Yuan,Mengmeng Wang,Jun Dan,Hangwei Qian,Yong Liu
Main category: cs.CV
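TL;DR: This paper proposes FSAR-LLaVA, the first end-to-end method that uses a multimodal large language model (such as Video-LLaVA) as a multimodal knowledge base to directly enhance few-shot action recognition (FSAR). A multimodal feature-enhanced module, composite task-oriented prototype construction, and a training-free multimodal prototype matching metric enable joint visual-textual metric learning, giving superior performance across tasks with very few trainable parameters.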
Details
Motivation: Existing few-shot action recognition methods rely on a suboptimal feature->caption->feature pipeline and perform metric learning only in the visual space, so they do not make full use of the cross-modal semantic understanding and alignment of multimodal large language models (MLLMs). Method: FSAR-LLaVA 1) uses the MLLM's multimodal decoder to extract spatiotemporally and semantically enriched representations, which a Multimodal Feature-Enhanced Module decouples and strengthens into visual and textual features; 2) designs input prompts that adapt flexibly to different scenarios and uses the MLLM's aligned outputs to drive Composite Task-Oriented Prototype Construction; 3) introduces a training-free Multimodal Prototype Matching Metric that adaptively selects the most decisive cues and jointly uses the decoupled multimodal features for metric learning. Result: The method outperforms existing approaches on multiple few-shot action recognition tasks while requiring very few trainable parameters. Conclusion: FSAR-LLaVA is the first to realize end-to-end, joint multimodal metric learning with an MLLM for FSAR, showing that an MLLM used as a general multimodal knowledge base can improve few-shot recognition and generalize well. Abstract: Multimodal Large Language Models (MLLMs) have propelled the field of few-shot action recognition (FSAR). However, preliminary explorations in this area primarily focus on generating captions to form a suboptimal feature->caption->feature pipeline and adopt metric learning solely within the visual space. In this paper, we propose FSAR-LLaVA, the first end-to-end method to leverage MLLMs (such as Video-LLaVA) as a multimodal knowledge base for directly enhancing FSAR. First, at the feature level, we leverage the MLLM's multimodal decoder to extract spatiotemporally and semantically enriched representations, which are then decoupled and enhanced by our Multimodal Feature-Enhanced Module into distinct visual and textual features that fully exploit their semantic knowledge for FSAR. Next, we leverage the versatility of MLLMs to craft input prompts that flexibly adapt to diverse scenarios, and use their aligned outputs to drive our designed Composite Task-Oriented Prototype Construction, effectively bridging the distribution gap between meta-train and meta-test sets. Finally, to enable multimodal features to guide metric learning jointly, we introduce a training-free Multimodal Prototype Matching Metric that adaptively selects the most decisive cues and efficiently leverages the decoupled feature representations produced by MLLMs. Extensive experiments demonstrate superior performance across various tasks with minimal trainable parameters.
[84] Finding Distributed Object-Centric Properties in Self-Supervised Transformers
Samyak Rawlekar,Amitabh Swain,Yujun Cai,Yiwei Wang,Ming-Hsuan Yang,Narendra Ahuja
Main category: cs.CV
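TL;DR: This paper proposes Object-DINO, a training-free method that extracts distributed object-centric information from the similarity of patch-level attention components (q/k/v) across ViT layers, significantly improving unsupervised object discovery and mitigating object hallucination in multimodal large language models.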
Details
Motivation: Existing self-supervised ViTs (such as DINO) rely on [CLS] token attention maps for object discovery, but the image-level training objective leads to inaccurate object localization and spurious activations, while the richer object-centric information in local patch-to-patch interactions remains underused. Method: By analyzing the inter-patch similarity of patch-level attention components (query/key/value) across all layers, the authors find that object-centric properties are distributed throughout the network; based on this they propose Object-DINO, which clusters attention heads across layers and automatically identifies the object-centric cluster corresponding to all objects. Result: CorLoc improves by 3.6 to 12.4 on unsupervised object discovery; object hallucination in multimodal large language models is effectively mitigated by providing visual grounding; no additional training is required. Conclusion: Distributed patch-level attention similarity carries strong object-centric representations, and Object-DINO mines this information in a training-free manner, which can significantly improve downstream visual understanding tasks. Abstract: Self-supervised Vision Transformers (ViTs) like DINO show an emergent ability to discover objects, typically observed in [CLS] token attention maps of the final layer. However, these maps often contain spurious activations resulting in poor localization of objects. This is because the [CLS] token, trained on an image-level objective, summarizes the entire image instead of focusing on objects. This aggregation dilutes the object-centric information existing in the local, patch-level interactions. We analyze this by computing inter-patch similarity using patch-level attention components (query, key, and value) across all layers. We find that: (1) Object-centric properties are encoded in the similarity maps derived from all three components ($q, k, v$), unlike prior work that uses only key features or the [CLS] token. (2) This object-centric information is distributed across the network, not just confined to the final layer. Based on these insights, we introduce Object-DINO, a training-free method that extracts this distributed object-centric information. Object-DINO clusters attention heads across all layers based on the similarities of their patches and automatically identifies the object-centric cluster corresponding to all objects. We demonstrate Object-DINO's effectiveness on two applications: enhancing unsupervised object discovery (+3.6 to +12.4 CorLoc gains) and mitigating object hallucination in Multimodal Large Language Models by providing visual grounding.
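The inter-patch similarity analysis described above can be illustrated with a minimal sketch. It assumes that per-head patch features (query, key, or value vectors) have already been extracted from a ViT; the names `cosine` and `patch_similarity` and the toy feature values are hypothetical, cosine similarity is an assumed choice of measure, and the sketch does not reproduce Object-DINO's head clustering.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def patch_similarity(feats):
    """Return the N x N cosine-similarity matrix for N patch feature vectors."""
    return [[cosine(u, v) for v in feats] for u in feats]

# Hypothetical toy input: four patches with 2-D features from one attention head.
# Patches 0 and 1 point in the same direction; patches 2 and 3 do not.
feats = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
sim = patch_similarity(feats)
```

In this toy case, row 0 of `sim` is high for patch 1 and low for patches 2 and 3, which is the kind of map one would inspect to see whether patches of the same object group together.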
Our results demonstrate that using this distributed object-centric information improves downstream tasks without additional training.
[85] Face2Parts: Exploring Coarse-to-Fine Inter-Regional Facial Dependencies for Generalized Deepfake Detection
Kutub Uddin,Nusrat Tasnim,Byung Tae Oh
Main category: cs.CV
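TL;DR: This paper proposes Face2Parts, a hybrid deepfake detection method based on hierarchical feature representation (HFR) that extracts coarse-to-fine features from the frame, the whole face, and key facial regions (lips, eyes, nose), and combines a channel-attention mechanism with deep triplet learning to significantly improve detection performance.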
Details
Motivation: Existing deepfake detection methods each have strengths that come from focusing on different facial regions, but they do not model the dependencies among facial regions at multiple scales and therefore struggle with diverse manipulations. Method: Face2Parts, based on hierarchical feature representation (HFR), separately extracts frame-level, face-level, and key-region (lips, eyes, nose) features, introduces a channel-attention mechanism to model inter-regional dependencies, and uses deep triplet learning to enhance discriminability. Result: The method achieves strong AUC results on multiple benchmark datasets (FF++, CDF1, CDF2, DFD, DFDC, DTIM, PDD, WLDR), reaching up to 100%, outperforms existing methods on average, and generalizes well, particularly in cross-dataset and cross-manipulation settings. Conclusion: By fusing coarse-to-fine multi-granularity facial features with attention-driven modeling of regional relationships, Face2Parts effectively improves the accuracy and robustness of deepfake detection and offers a new approach to verifying the authenticity of multimedia content. Abstract: Multimedia data, particularly images and videos, is integral to various applications, including surveillance, visual interaction, biometrics, evidence gathering, and advertising. However, amateur or skilled counterfeiters can simulate them to create deepfakes, often for slanderous motives. To address this challenge, several forensic methods have been developed to ensure the authenticity of the content. The effectiveness of these methods depends on their focus, with challenges arising from the diverse nature of manipulations. In this article, we analyze existing forensic methods and observe that each method has unique strengths in detecting deepfake traces by focusing on specific facial regions, such as the frame, face, lips, eyes, or nose. Considering these insights, we propose a novel hybrid approach called Face2Parts based on hierarchical feature representation (HFR) that takes advantage of coarse-to-fine information to improve deepfake detection. The proposed method involves extracting features from the frame, face, and key facial regions (i.e., lips, eyes, and nose) separately to explore the coarse-to-fine relationships. This approach enables us to capture inter-dependencies among facial regions using a channel-attention mechanism and deep triplet learning. We evaluated the proposed method on benchmark deepfake datasets in intra-dataset, inter-dataset, and inter-manipulation settings. The proposed method achieves an average AUC of 98.42% on FF++, 79.80% on CDF1, 85.34% on CDF2, 89.41% on DFD, 84.07% on DFDC, 95.62% on DTIM, 80.76% on PDD, and 100% on WLDR, respectively.
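For readers unfamiliar with the triplet learning mentioned in the abstract, the following is a minimal sketch of the standard hinge-style triplet loss. The abstract does not give the exact loss, distance, or margin used by Face2Parts, so the Euclidean distance, the margin of 1.0, the function names, and the toy embeddings are all assumptions made for illustration.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the negative is farther from the anchor than the
    positive by at least `margin` (an assumed hyperparameter).
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Hypothetical 2-D embeddings: anchor and positive share a class (e.g. real faces),
# negatives belong to the other class (e.g. fake faces).
anchor, positive = [0.0, 0.0], [0.0, 1.0]
far_negative, near_negative = [3.0, 4.0], [1.0, 0.0]

easy = triplet_loss(anchor, positive, far_negative)   # d(a,p)=1, d(a,n)=5, loss 0
hard = triplet_loss(anchor, positive, near_negative)  # d(a,p)=1, d(a,n)=1, loss 1
```

Only the triplet with the nearby negative contributes to the loss, which is why triplet objectives push embeddings of different classes apart.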
The results demonstrate that our approach generalizes effectively and achieves promising performance to outperform the existing methods.
[86] PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning
Shaoxuan Li,Zhixuan Zhao,Hanze Deng,Zirun Ma,Shulin Tian,Zuyan Liu,Yushi Hu,Haoning Wu,Yuhao Dong,Benlin Liu,Ziwei Liu,Ranjay Krishna
Main category: cs.CV
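TL;DR: PerceptionComp is a manually annotated benchmark for complex, long-horizon video perception reasoning that requires models to integrate multiple temporally separated pieces of visual evidence and satisfy compositional logical constraints. Existing MLLMs perform far below human level on it, highlighting that perceptual reasoning remains a major bottleneck.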
Details
Motivation: Existing video understanding benchmarks are too simple to evaluate models on complex, long-horizon, multi-step perceptual reasoning; a new benchmark is needed that genuinely tests visual evidence integration, spatiotemporal reasoning, and compositional logic. Method: The authors build PerceptionComp, a manually annotated video question-answering benchmark with 279 videos from diverse domains and 1,114 highly complex questions; each question requires combining multiple perceptual elements (objects, attributes, relations, actions, and so on) across temporal segments while satisfying conjunctive and sequential logical constraints. Result: Human accuracy is only 18.97% when rewatching is disallowed; the best current MLLM (Gemini-3-Flash) reaches only 45.96% in the five-choice setting, and all open-source models remain below 40%. Conclusion: PerceptionComp reveals significant shortcomings of current multimodal large language models on perception-centric long-horizon video reasoning, and the benchmark is expected to help advance perceptual reasoning. Abstract: We introduce PerceptionComp, a manually annotated benchmark for complex, long-horizon, perception-centric video reasoning. PerceptionComp is designed so that no single moment is sufficient: answering each question requires multiple temporally separated pieces of visual evidence and compositional constraints under conjunctive and sequential logic, spanning perceptual subtasks such as objects, attributes, relations, locations, actions, and events, and requiring skills including semantic recognition, visual correspondence, temporal reasoning, and spatial reasoning. The benchmark contains 1,114 highly complex questions on 279 videos from diverse domains including city walk tours, indoor villa tours, video games, and extreme outdoor sports, with 100% manual annotation. Human studies show that PerceptionComp requires substantial test-time thinking and repeated perception steps: participants take much longer than on prior benchmarks, and accuracy drops to near chance (18.97%) when rewatching is disallowed. State-of-the-art MLLMs also perform substantially worse on PerceptionComp than on existing benchmarks: the best model in our evaluation, Gemini-3-Flash, reaches only 45.96% accuracy in the five-choice setting, while open-source models remain below 40%. These results suggest that perception-centric long-horizon video reasoning remains a major bottleneck, and we hope PerceptionComp will help drive progress in perceptual reasoning.
[87] Rethinking Token Pruning for Historical Screenshots in GUI Visual Agents: Semantic, Spatial, and Temporal Perspectives
Daiqiang Li,Zihao Pan,Zeyu Zhang,Ronghao Chen,Huacan Wang,Honggang Chen,Haiyun Jiang
Main category: cs.CV
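TL;DR: Through an empirical study, this paper investigates visual token pruning strategies for historical screenshots in GUI scenarios and reports three key findings: background regions provide important auxiliary cues about interface-state transitions; random pruning preserves spatial structure better than carefully designed strategies; and GUI agents exhibit a recency effect similar to human cognition, so prioritizing recent screenshots significantly reduces computational cost without hurting performance.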
Details
Motivation: High-resolution GUI screenshots produce a large number of visual tokens, and keeping the complete history is computationally expensive, so an efficient token pruning strategy for historical screenshots is needed. Method: The authors conduct an empirical study of token pruning for historical screenshots in GUI scenarios, using edge detection to separate foreground and background regions and comparing different pruning strategies (including random pruning) and different temporal budget allocations. Result: Background regions effectively capture interface-state changes; random pruning performs better under the same computational budget; exploiting the recency effect by allocating larger token budgets to recent screenshots substantially reduces computational cost with almost no loss in performance. Conclusion: GUI visual agent design should value background semantics, exploit the spatial-structure advantage of random pruning, and borrow the recency effect from human cognition for dynamic token budget allocation, enabling efficient and reliable GUI navigation. Abstract: In recent years, GUI visual agents built upon Multimodal Large Language Models (MLLMs) have demonstrated strong potential in navigation tasks. However, high-resolution GUI screenshots produce a large number of visual tokens, making the direct preservation of complete historical information computationally expensive. In this paper, we conduct an empirical study on token pruning for historical screenshots in GUI scenarios and distill three practical insights that are crucial for designing effective pruning strategies. First, we observe that GUI screenshots exhibit a distinctive foreground-background semantic composition. To probe this property, we apply a simple edge-based separation to partition screenshots into foreground and background regions. Surprisingly, we find that, contrary to the common assumption that background areas have little semantic value, they effectively capture interface-state transitions, thereby providing auxiliary cues for GUI reasoning. Second, compared with carefully designed pruning strategies, random pruning possesses an inherent advantage in preserving spatial structure, enabling better performance under the same computational budget. Finally, we observe that GUI agents exhibit a recency effect similar to human cognition: by allocating larger token budgets to more recent screenshots and heavily compressing distant ones, we can significantly reduce computational cost while maintaining nearly unchanged performance. These findings offer new insights and practical guidance for the design of efficient GUI visual agents.
[88] Seeing Like Radiologists: Context- and Gaze-Guided Vision-Language Pretraining for Chest X-rays
Kang Liu,Zhuoqi Ma,Siyu Liang,Yunan Li,Xiyue Gao,Chao Liang,Kun Xie,Qiguang Miao
Main category: cs.CV
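TL;DR: This paper proposes CoGaze, a framework that improves vision-language pretraining for chest X-rays by incorporating clinical context and radiologists' gaze, significantly outperforming existing methods on multiple tasks.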
Details
Motivation: Existing medical vision-language pretraining models overlook radiologists' gaze and clinical context, which hinders the modeling of disease-specific patterns and weakens cross-modal alignment. Method: The authors propose a context-infused vision encoder and a multi-level supervision paradigm: hybrid-positive contrastive learning, disease-aware cross-modal representation learning, and the use of radiologists' gaze as probabilistic priors to guide attention toward diagnostically salient regions. Result: CoGaze significantly surpasses the state of the art on free-text and structured report generation, zero-shot classification, and image-text retrieval, with gains of up to +2.0% CheXbertF1, +1.2% BLEU2, +23.2% AUROC, and +12.2% Precision@1. Conclusion: Incorporating clinical context and radiologists' gaze effectively improves the diagnostic reasoning and cross-modal alignment of medical vision-language models. Abstract: Despite recent advances in medical vision-language pretraining, existing models still struggle to capture the diagnostic workflow: radiographs are typically treated as context-agnostic images, while radiologists' gaze -- a crucial cue for visual reasoning -- remains largely underexplored by existing methods. These limitations hinder the modeling of disease-specific patterns and weaken cross-modal alignment. To bridge this gap, we introduce CoGaze, a Context- and Gaze-guided vision-language pretraining framework for chest X-rays. We first propose a context-infused vision encoder that models how radiologists integrate clinical context -- including patient history, symptoms, and diagnostic intent -- to guide diagnostic reasoning. We then present a multi-level supervision paradigm that (1) enforces intra- and inter-modal semantic alignment through hybrid-positive contrastive learning, (2) injects diagnostic priors via disease-aware cross-modal representation learning, and (3) leverages radiologists' gaze as probabilistic priors to guide attention toward diagnostically salient regions. Extensive experiments demonstrate that CoGaze consistently outperforms state-of-the-art methods across diverse tasks, achieving up to +2.0% CheXbertF1 and +1.2% BLEU2 for free-text and structured report generation, +23.2% AUROC for zero-shot classification, and +12.2% Precision@1 for image-text retrieval. Code is available at https://github.com/mk-runner/CoGaze.
[89] Bridging Pixels and Words: Mask-Aware Local Semantic Fusion for Multimodal Media Verification
Zizhao Chen,Ping Wei,Ziyang Ren,Huan Li,Xiangru Yin
Main category: cs.CV
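TL;DR: This paper proposes MaLSF, a framework that improves the detection and grounding of multimodal misinformation through active bidirectional verification and hierarchical semantic aggregation.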
Details
Motivation: Existing multimodal verification methods based on passive holistic fusion suffer from 'feature dilution', which makes subtle local semantic inconsistencies hard to detect, so they cannot cope with increasingly sophisticated multimodal misinformation. Method: The MaLSF framework comprises 1) a Bidirectional Cross-modal Verification (BCV) module that uses text and image as queries for each other to explicitly pinpoint conflicts, and 2) a Hierarchical Semantic Aggregation (HSA) module that integrates multi-granularity conflict signals; a set of diverse mask-label pair extraction parsers is also designed to obtain fine-grained semantic anchors. Result: MaLSF achieves state-of-the-art performance on DGM4 and multimodal fake news detection, and ablation studies and visualizations confirm its effectiveness and interpretability. Conclusion: By introducing mask-aware local semantic fusion and an active bidirectional verification mechanism, MaLSF significantly improves the accuracy and interpretability of multimodal misinformation detection and offers a new paradigm for the field. Abstract: As multimodal misinformation becomes more sophisticated, its detection and grounding are crucial. However, current multimodal verification methods, relying on passive holistic fusion, struggle with sophisticated misinformation. Due to 'feature dilution,' global alignments tend to average out subtle local semantic inconsistencies, effectively masking the very conflicts they are designed to find. We introduce MaLSF (Mask-aware Local Semantic Fusion), a novel framework that shifts the paradigm to active, bidirectional verification, mimicking human cognitive cross-referencing. MaLSF utilizes mask-label pairs as semantic anchors to bridge pixels and words. Its core mechanism features two innovations: 1) a Bidirectional Cross-modal Verification (BCV) module that acts as an interrogator, using parallel query streams (Text-as-Query and Image-as-Query) to explicitly pinpoint conflicts; and 2) a Hierarchical Semantic Aggregation (HSA) module that intelligently aggregates these multi-granularity conflict signals for task-specific reasoning. In addition, to extract fine-grained mask-label pairs, we introduce a set of diverse mask-label pair extraction parsers. MaLSF achieves state-of-the-art performance on both the DGM4 and multimodal fake news detection tasks. Extensive ablation studies and visualization results further verify its effectiveness and interpretability.
[90] Pioneering Perceptual Video Fluency Assessment: A Novel Task with Benchmark Dataset and Baseline
Qizhi Xie,Kun Yuan,Yunpeng Qu,Ming Sun,Chao Zhou,Jihong Zhu
Main category: cs.CV
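TL;DR: This paper proposes Video Fluency Assessment (VFA) as a standalone perceptual task, builds FluVid, the first fluency-oriented dataset, establishes a benchmark of 23 methods, and proposes the baseline model FluNet, significantly improving the modeling of video fluency.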
Details
Motivation: Existing video quality assessment (VQA) methods treat fluency merely as a sub-dimension of overall quality and cannot accurately reflect human subjective perception of video fluency (for example, motion consistency and frame continuity), which limits their use in scenarios such as streaming and gaming. Method: 1) Build the fluency-oriented dataset FluVid (4,606 in-the-wild videos), define fluency scoring criteria for the first time, and conduct a human subjective study; 2) establish the most comprehensive VFA benchmark to date (covering 23 methods); 3) propose the baseline model FluNet, which introduces temporal permuted self-attention (T-PSA) to strengthen long-range inter-frame interactions and the modeling of fluency information. Result: FluNet achieves state-of-the-art performance on FluVid; the benchmark experiments provide key insights for VFA-tailored model design; the FluVid dataset and benchmark help establish VFA as an independent research direction. Conclusion: This work pioneers video fluency assessment as a standalone task and systematically advances the field through a new dataset, benchmark, and model, providing the community with a reproducible research paradigm and a roadmap. Abstract: Accurately estimating humans' subjective feedback on video fluency, e.g., motion consistency and frame continuity, is crucial for various applications like streaming and gaming. Yet, it has long been overlooked, as prior work has focused on solving it in the video quality assessment (VQA) task, merely as a sub-dimension of overall quality. In this work, we conduct pilot experiments and reveal that current VQA predictions largely underrepresent fluency, thereby limiting their applicability. To this end, we pioneer Video Fluency Assessment (VFA) as a standalone perceptual task focused on the temporal dimension. To advance VFA research, 1) we construct a fluency-oriented dataset, FluVid, comprising 4,606 in-the-wild videos with balanced fluency distribution, featuring the first-ever scoring criteria and human study for VFA. 2) We develop a large-scale benchmark of 23 methods, the most comprehensive one thus far on FluVid, gathering insights for VFA-tailored model designs. 3) We propose a baseline model called FluNet, which deploys temporal permuted self-attention (T-PSA) to enrich input fluency information and enhance long-range inter-frame interactions. Our work not only achieves state-of-the-art performance but, more importantly, offers the community a roadmap to explore solutions for VFA.
[91] MuDD: A Multimodal Deception Detection Dataset and GSR-Guided Progressive Distillation for Non-Contact Deception Detection
Peiyuan Jiang,Yao Liu,Yanglei Gan,Jiaye Yang,Lu Liu,Daibing Yao,Qiao Liu
Main category: cs.CV
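TL;DR: This paper proposes GSR-guided Progressive Distillation (GPD), a progressive cross-modal knowledge distillation method that, using the newly built large-scale multimodal deception detection dataset MuDD, achieves state-of-the-art performance in non-contact deception detection.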
Details
Motivation: Non-contact automatic deception detection suffers from visual and auditory cues that are unstable across subjects, whereas physiological signals such as GSR are more reliable; however, there is no suitable multimodal dataset that supports using GSR to guide learning in non-contact modalities. Method: The authors build MuDD, a large-scale multimodal deception detection dataset containing video, audio, GSR, PPG, heart rate, and personality traits, and propose GPD, a GSR-guided progressive distillation framework that combines feature-level and digit-level distillation with a dynamic routing mechanism to adaptively regulate knowledge transfer. Result: GPD outperforms existing methods on both deception detection and concealed-digit identification, reaching state-of-the-art performance; experiments and visualizations confirm its effectiveness and stability. Conclusion: Using reliable physiological signals such as GSR to guide the modeling of non-contact modalities is a feasible and effective approach; the MuDD dataset and GPD framework provide a new paradigm and practical tools for cross-modal deception detection. Abstract: Non-contact automatic deception detection remains challenging because visual and auditory deception cues often lack stable cross-subject patterns. In contrast, galvanic skin response (GSR) provides more reliable physiological cues and has been widely used in contact-based deception detection. In this work, we leverage stable deception-related knowledge in GSR to guide representation learning in non-contact modalities through cross-modal knowledge distillation. A key obstacle, however, is the lack of a suitable dataset for this setting. To address this, we introduce MuDD, a large-scale Multimodal Deception Detection dataset containing recordings from 130 participants over 690 minutes. In addition to video, audio, and GSR, MuDD also provides photoplethysmography, heart rate, and personality traits, supporting broader scientific studies of deception. Based on this dataset, we propose GSR-guided Progressive Distillation (GPD), a cross-modal distillation framework for mitigating the negative transfer caused by the large modality mismatch between GSR and non-contact signals. The core innovation of GPD is the integration of progressive feature-level and digit-level distillation with dynamic routing, which allows the model to adaptively determine how teacher knowledge should be transferred during training, leading to more stable cross-modal knowledge transfer.
Extensive experiments and visualizations show that GPD outperforms existing methods and achieves state-of-the-art performance on both deception detection and concealed-digit identification.
[92] R-PGA: Robust Physical Adversarial Camouflage Generation via Relightable 3D Gaussian Splatting
Tianrui Lou,Siyuan Liang,Jiawei Liang,Yuze Gao,Xiaochun Cao
Main category: cs.CV
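TL;DR: This paper proposes R-PGA, a framework that improves the robustness and generalization of physical adversarial camouflage in complex dynamic scenes through relightable 3D Gaussian Splatting and hard physical configuration mining.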
Details
Motivation: Existing physical adversarial camouflage methods generalize poorly in complex dynamic scenes (for example, varying viewpoints, illumination, and atmospheric scattering) because of two fundamental flaws: simulation distortion and fragile optimization objectives. Method: The Relightable Physical 3D Gaussian Splatting (R-PGA) attack framework 1) uses 3DGS for high-fidelity reconstruction and decouples material from lighting; 2) adopts a hybrid rendering pipeline that combines a relightable 3DGS foreground with a matching background generated by image translation; 3) introduces a Hard Physical Configuration Mining (HPCM) module that actively mines the worst-case physical configurations and suppresses their loss peaks. Result: The method significantly improves the transferability and stability of adversarial textures across viewpoints, illumination, and weather conditions, and shows stronger cross-configuration robustness in CARLA and real-world road videos. Conclusion: By improving the physical fidelity of simulation and the smoothness of the optimization landscape, R-PGA offers a more reliable and more generalizable paradigm for physical-world adversarial attacks. Abstract: Physical adversarial camouflage poses a severe security threat to autonomous driving systems by mapping adversarial textures onto 3D objects. Nevertheless, current methods remain brittle in complex dynamic scenarios, failing to generalize across diverse geometric (e.g., viewing configurations) and radiometric (e.g., dynamic illumination, atmospheric scattering) variations. We attribute this deficiency to two fundamental limitations in simulation and optimization. First, the reliance on coarse, oversimplified simulations (e.g., via CARLA) induces a significant domain gap, confining optimization to a biased feature space. Second, standard strategies targeting average performance result in a rugged loss landscape, leaving the camouflage vulnerable to configuration shifts. To bridge these gaps, we propose the Relightable Physical 3D Gaussian Splatting (3DGS) based Attack framework (R-PGA). Technically, to address the simulation fidelity issue, we leverage 3DGS to ensure photo-realistic reconstruction and augment it with physically disentangled attributes to decouple intrinsic material from lighting. Furthermore, we design a hybrid rendering pipeline that leverages precise Relightable 3DGS for foreground rendering, while employing a pre-trained image translation model to synthesize plausible relighted backgrounds that align with the relighted foreground. To address the optimization robustness issue, we propose the Hard Physical Configuration Mining (HPCM) module, designed to actively mine worst-case physical configurations and suppress their corresponding loss peaks.
This strategy not only diminishes the overall loss magnitude but also effectively flattens the rugged loss landscape, ensuring consistent adversarial effectiveness and robustness across varying physical configurations.
[93] PAD-Hand: Physics-Aware Diffusion for Hand Motion Recovery
Elkhan Ismayilzada,Yufei Zhang,Zijun Cui
Main category: cs.CV
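TL;DR: This paper proposes a physics-aware conditional diffusion framework that generates physically plausible hand motion sequences from images and estimates per-joint, per-time physics-consistency variances of the motion.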
Details
Motivation: Existing hand reconstruction methods provide accurate single-frame estimates but lack physics consistency and cannot quantify how well the motion satisfies physical laws. Method: Building on a MeshCNN-Transformer backbone, the authors formulate Euler-Lagrange dynamics for hands, treat the dynamic residuals as virtual observables to better integrate physical constraints, and use a last-layer Laplace approximation to output per-joint, per-time physics-consistency variances. Result: On two mainstream hand datasets, the method outperforms strong image-based initializations and competitive video-based methods; qualitative results show that the estimated variances align closely with the physical plausibility of the image-based motion estimates. Conclusion: The method both improves the physical plausibility of hand motion and provides an interpretable uncertainty measure of physics consistency, offering a new paradigm for trustworthy hand reconstruction. Abstract: Significant advancements made in reconstructing hands from images have delivered accurate single-frame estimates, yet they often lack physics consistency and provide no notion of how confidently the motion satisfies physics. In this paper, we propose a novel physics-aware conditional diffusion framework that refines noisy pose sequences into physically plausible hand motion while estimating the physics variance in motion estimates. Building on a MeshCNN-Transformer backbone, we formulate Euler-Lagrange dynamics for articulated hands. Unlike prior works that enforce zero residuals, we treat the resulting dynamic residuals as virtual observables to more effectively integrate physics. Through a last-layer Laplace approximation, our method produces per-joint, per-time variances that measure physics consistency and offers interpretable variance maps indicating where physical consistency weakens. Experiments on two well-known hand datasets show consistent gains over strong image-based initializations and competitive video-based methods. Qualitative results confirm that our variance estimations are aligned with the physical plausibility of the motion in image-based estimates.
[94] MUST: Modality-Specific Representation-Aware Transformer for Diffusion-Enhanced Survival Prediction with Missing Modality
Kyungwon Kim,Dosik Hwang
Main category: cs.CV
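TL;DR: This paper proposes MUST, a framework that decomposes modality-specific representations through algebraic constraints and uses conditional latent diffusion models to generate high-quality representations of missing modalities, significantly improving survival prediction from multimodal medical data when modalities are missing.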
Details
Motivation: Multimodal medical data in clinical practice are often incomplete, and existing methods lack explicit modeling of each modality's unique contribution, making it hard to identify and compensate for the irreplaceable information carried by a missing modality. Method: MUST 1) decomposes each modality's representation, through algebraic constraints in a low-rank shared subspace, into a modality-specific component and a cross-modal contextualized component; 2) uses conditional latent diffusion models to generate the representation of a missing modality, conditioned on the recovered shared information and structural priors. Result: On five TCGA cancer datasets, MUST achieves state-of-the-art performance with complete data and maintains robust predictions when pathology or genomics is missing, with inference latency that meets clinical requirements. Conclusion: By explicitly modeling modality-specific information and using a generative compensation strategy, MUST effectively mitigates the impact of missing modalities on survival prediction and offers a new approach for the clinical deployment of precision oncology. Abstract: Accurate survival prediction from multimodal medical data is essential for precision oncology, yet clinical deployment faces a persistent challenge: modalities are frequently incomplete due to cost constraints, technical limitations, or retrospective data availability. While recent methods attempt to address missing modalities through feature alignment or joint distribution learning, they fundamentally lack explicit modeling of the unique contributions of each modality as opposed to the information derivable from other modalities. We propose MUST (Modality-Specific representation-aware Transformer), a novel framework that explicitly decomposes each modality's representation into modality-specific and cross-modal contextualized components through algebraic constraints in a learned low-rank shared subspace. This decomposition enables precise identification of what information is lost when a modality is absent. For the truly modality-specific information that cannot be inferred from available modalities, we employ conditional latent diffusion models to generate high-quality representations conditioned on recovered shared information and learned structural priors. Extensive experiments on five TCGA cancer datasets demonstrate that MUST achieves state-of-the-art performance with complete data while maintaining robust predictions in both missing pathology and missing genomics conditions, with clinically acceptable inference latency.
[95] When Identities Collapse: A Stress-Test Benchmark for Multi-Subject Personalization
Zhihan Chen,Yuhuan Zhao,Yijie Zhu,Xinyu Yao
Main category: cs.CV
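TL;DR: This paper reveals an "Illusion of Scalability" in current subject-driven text-to-image diffusion models when handling multiple interacting subjects, and proposes a new evaluation metric, SCR, to quantify subject identity collapse.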
Details
Motivation: Existing models suffer identity collapse when handling multiple interacting subjects, and conventional CLIP-based metrics do not reflect this accurately, so a more rigorous benchmark and a more suitable metric are needed. Method: The authors build a stress-test benchmark of 75 prompts covering different subject counts and interaction difficulties; propose the Subject Collapse Rate (SCR), a new metric based on DINOv2's structural priors, to detect local attention leakage and homogenization; and systematically evaluate state-of-the-art models including MOSAIC, XVerse, and PSR. Result: When the number of subjects rises to 6-10 or interactions become complex, identity fidelity drops sharply and SCR approaches 100%; CLIP metrics tend to misjudge identity-collapsed images as high quality; the collapse stems from semantic shortcuts in the global attention mechanism. Conclusion: Current multi-subject generation capability is overestimated, and explicit physical disentanglement mechanisms are urgently needed in generative architectures to improve subject identity consistency. Abstract: Subject-driven text-to-image diffusion models have achieved remarkable success in preserving single identities, yet their ability to compose multiple interacting subjects remains largely unexplored and highly challenging. Existing evaluation protocols typically rely on global CLIP metrics, which are insensitive to local identity collapse and fail to capture the severity of multi-subject entanglement. In this paper, we identify a pervasive "Illusion of Scalability" in current models: while they excel at synthesizing 2-4 subjects in simple layouts, they suffer from catastrophic identity collapse when scaled to 6-10 subjects or tasked with complex physical interactions. To systematically expose this failure mode, we construct a rigorous stress-test benchmark comprising 75 prompts distributed across varying subject counts and interaction difficulties (Neutral, Occlusion, Interaction). Furthermore, we demonstrate that standard CLIP-based metrics are fundamentally flawed for this task, as they often assign high scores to semantically correct but identity-collapsed images (e.g., generating generic clones). To address this, we introduce the Subject Collapse Rate (SCR), a novel evaluation metric grounded in DINOv2's structural priors, which strictly penalizes local attention leakage and homogenization. Our extensive evaluation of state-of-the-art models (MOSAIC, XVerse, PSR) reveals a precipitous drop in identity fidelity as scene complexity grows, with SCR approaching 100% at 10 subjects.
We trace this collapse to the semantic shortcuts inherent in global attention routing, underscoring the urgent need for explicit physical disentanglement in future generative architectures.
[96] Learnable Instance Attention Filtering for Adaptive Detector Distillation
Chen Liu,Qizhen Lan,Zhicheng Ding,Xinyu Chu,Qing Tian
Main category: cs.CV
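TL;DR: This paper proposes LIAF-KD, a knowledge distillation framework that uses learnable instance selectors to dynamically evaluate and reweight the importance of each instance in object detection, letting the student model take part in distillation according to its own learning state and significantly improving performance on the KITTI and COCO datasets.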
Details
Motivation: Existing feature-based knowledge distillation methods usually ignore instance-level differences, and attention filtering mechanisms are mostly heuristic or teacher-driven, lacking student-driven adaptivity. Method: The authors propose Learnable Instance Attention Filtering for Adaptive Detector Distillation (LIAF-KD), which introduces learnable instance selectors that dynamically adjust the weight of each detection instance in distillation according to the student model's current learning state. Result: The method is validated on KITTI and COCO; a GFL ResNet-50 student gains 2% without added computational complexity, outperforming state-of-the-art methods. Conclusion: LIAF-KD enables finer-grained, student-adaptive knowledge distillation for detection, improving the accuracy and practicality of lightweight detectors. Abstract: As deep vision models grow increasingly complex to achieve higher performance, deployment efficiency has become a critical concern. Knowledge distillation (KD) mitigates this issue by transferring knowledge from large teacher models to compact student models. While many feature-based KD methods rely on spatial filtering to guide distillation, they typically treat all object instances uniformly, ignoring instance-level variability. Moreover, existing attention filtering mechanisms are typically heuristic or teacher-driven, rather than learned with the student. To address these limitations, we propose Learnable Instance Attention Filtering for Adaptive Detector Distillation (LIAF-KD), a novel framework that introduces learnable instance selectors to dynamically evaluate and reweight instance importance during distillation. Notably, the student contributes to this process based on its evolving learning state. Experiments on the KITTI and COCO datasets demonstrate consistent improvements, with a 2% gain on a GFL ResNet-50 student without added complexity, outperforming state-of-the-art methods.
[97] CD-Buffer: Complementary Dual-Buffer Framework for Test-Time Adaptation in Adverse Weather Object Detection
Youngjun Song,Hyeongyu Kim,Dosik Hwang
Main category: cs.CV
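TL;DR: This paper proposes CD-Buffer, a framework that uses a unified discrepancy metric to adaptively coordinate subtractive (denoising) and additive (refinement) mechanisms, achieving robust test-time adaptation to domain shifts of varying severity.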
Details
Motivation: In existing test-time adaptation (TTA) methods, subtractive (channel removal) and additive (module addition) strategies are each effective only within a particular range of shift severity and do not generalize across diverse shift levels; a method is needed that automatically balances the two according to the degree of feature-level domain shift. Method: CD-Buffer, a complementary dual-buffer framework, introduces a unified feature-level discrepancy metric that drives the subtractive mechanism (removing domain-sensitive channels) and the additive mechanism (lightweight feature refinement) to work in opposite yet coordinated directions, achieving channel-wise adaptive balancing of the two strategies. Result: Experiments on KITTI, Cityscapes, and ACDC confirm the advantage of CD-Buffer, which reaches state-of-the-art performance across weather conditions and shift severities. Conclusion: Through discrepancy-driven coupling, CD-Buffer is the first to achieve adaptive, tuning-free, channel-level dynamic balancing for domain shifts of different severities, significantly improving the robustness and generalization of TTA methods. Abstract: Test-Time Adaptation (TTA) enables real-time adaptation to domain shifts without off-line retraining. Recent TTA methods have predominantly explored additive approaches that introduce lightweight modules for feature refinement. Recently, a subtractive approach that removes domain-sensitive channels has emerged as an alternative direction. We observe that these paradigms exhibit complementary effectiveness patterns: subtractive methods excel under severe shifts by removing corrupted features, while additive methods are effective under moderate shifts requiring refinement. However, each paradigm operates effectively only within limited shift severity ranges, failing to generalize across diverse corruption levels. This leads to the following question: can we adaptively balance both strategies based on measured feature-level domain shift? We propose CD-Buffer, a novel complementary dual-buffer framework where subtractive and additive mechanisms operate in opposite yet coordinated directions driven by a unified discrepancy metric. Our key innovation lies in discrepancy-driven coupling: the framework couples removal and refinement through a unified discrepancy metric, automatically balancing both strategies based on feature-level shift severity. This establishes automatic channel-wise balancing that adapts differentiated treatment to heterogeneous shift magnitudes without manual tuning.
Extensive experiments on KITTI, Cityscapes, and ACDC datasets demonstrate state-of-the-art performance, consistently achieving superior results across diverse weather conditions and severity levels.
[98] SDDF: Specificity-Driven Dynamic Focusing for Open-Vocabulary Camouflaged Object Detection
Jiaming Liang,Yifeng Zhan,Chunlin Liu,Weihua Zheng,Bingye Peng,Qiwei Liang,Boyang Cai,Xiaochun Mai,Qiang Nie
Main category: cs.CV
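TL;DR: This paper proposes a new open-vocabulary object detection method for camouflaged objects, builds the OVCOD-D benchmark, and improves detection performance through sub-description denoising and a regional weak alignment and dynamic focusing strategy.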
Details
Motivation: Existing open-vocabulary object detection (OVOD) methods perform poorly on camouflaged objects because their visual features are highly similar to the background, making them hard to distinguish and localize. Method: The authors build the camouflaged object detection benchmark OVCOD-D, design a sub-description principal component contrastive fusion strategy to remove textual noise, and propose a specificity-guided regional weak alignment and dynamic focusing method to strengthen discrimination between camouflaged objects and background. Result: Under open-set evaluation on the OVCOD-D benchmark, the proposed method achieves an AP of 56.4. Conclusion: The work effectively alleviates the difficulty of recognizing camouflaged objects in open-vocabulary detection and confirms the importance of combining fine-grained textual descriptions with visual specificity modeling. Abstract: Open-vocabulary object detection (OVOD) aims to detect known and unknown objects in the open world by leveraging text prompts. Benefiting from the emergence of large-scale vision-language pre-trained models, OVOD has demonstrated strong zero-shot generalization capabilities. However, when dealing with camouflaged objects, the detector often fails to distinguish and localize objects because the visual features of the objects and the background are highly similar. To bridge this gap, we construct a benchmark named OVCOD-D by augmenting carefully selected camouflaged object images with fine-grained textual descriptions. Due to the limited scale of available camouflaged object datasets, we adopt detectors pre-trained on large-scale object detection datasets as our baseline methods, as they possess stronger zero-shot generalization ability. In the specificity-aware sub-descriptions generated by multimodal large models, there still exist confusing and overly decorative modifiers. To mitigate such interference, we design a sub-description principal component contrastive fusion strategy that reduces noisy textual components. Furthermore, to address the challenge that the visual features of camouflaged objects are highly similar to those of their surrounding environment, we propose a specificity-guided regional weak alignment and dynamic focusing method, which aims to strengthen the detector's ability to discriminate camouflaged objects from background. Under the open-set evaluation setting, the proposed method achieves an AP of 56.4 on the OVCOD-D benchmark.
[99] SkinGPT-X: A Self-Evolving Collaborative Multi-Agent System for Transparent and Trustworthy Dermatological Diagnosis
Zhangtianyi Chen,Yuhao Shen,Florensia Widjaja,Yan Xu,Liyuan Sun,Zijian Wang,Hongyi Chen,Wufei Dai,Juexiao Zhou
Main category: cs.CV
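TL;DR: This paper presents SkinGPT-X, a multimodal collaborative multi-agent system with a self-evolving dermatological memory mechanism, aimed at improving the accuracy, interpretability, and rare-disease recognition of dermatological diagnosis.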
Details
Motivation: Monolithic large language models underperform on fine-grained multi-class skin disease diagnosis and rare disease recognition, and lack the interpretability and traceability that clinical reasoning requires; existing multi-agent systems are limited by static knowledge bases and adapt poorly to complex real-world clinical settings. Method: Builds SkinGPT-X, which simulates the diagnostic workflow of dermatologists and integrates multimodal collaborative multi-agents with a self-evolving dermatological memory mechanism. Result: Outperforms SOTA models on four public datasets (+9.6% accuracy on DDI31, +13% weighted F1 on Dermnet); validates fine-grained classification on a large-scale multi-class dataset with 498 categories; achieves +9.8% accuracy, +7.1% weighted F1, and +10% Cohen's Kappa on the first rare skin disease dataset (564 samples, 8 classes). Conclusion: SkinGPT-X markedly improves the accuracy, interpretability, and clinical applicability of diagnosing complex and rare skin diseases, offering a new paradigm for multi-agent-assisted diagnosis and treatment. Abstract: While recent advancements in Large Language Models have significantly advanced dermatological diagnosis, monolithic LLMs frequently struggle with fine-grained, large-scale multi-class diagnostic tasks and rare skin disease diagnosis owing to training data sparsity, while also lacking the interpretability and traceability essential for clinical reasoning. Although multi-agent systems can offer more transparent and explainable diagnostics, existing frameworks are primarily concentrated on Visual Question Answering and conversational tasks, and their heavy reliance on static knowledge bases restricts adaptability in complex real-world clinical settings. Here, we present SkinGPT-X, a multimodal collaborative multi-agent system for dermatological diagnosis integrated with a self-evolving dermatological memory mechanism. By simulating the diagnostic workflow of dermatologists and enabling continuous memory evolution, SkinGPT-X delivers transparent and trustworthy diagnostics for the management of complex and rare dermatological cases. To validate the robustness of SkinGPT-X, we design a three-tier comparative experiment. First, we benchmark SkinGPT-X against four state-of-the-art LLMs across four public datasets, demonstrating its state-of-the-art performance with a +9.6% accuracy improvement on DDI31 and +13% weighted F1 gain on Dermnet over the state-of-the-art model. Second, we construct a large-scale multi-class dataset covering 498 distinct dermatological categories to evaluate its fine-grained classification capabilities.
Finally, we curate the rare skin disease dataset, the first benchmark to address the scarcity of clinical rare skin diseases, which contains 564 clinical samples with eight rare dermatological diseases. On this dataset, SkinGPT-X achieves a +9.8% accuracy improvement, a +7.1% weighted F1 improvement, and a +10% Cohen's Kappa improvement.
[100] Beyond Where to Look: Trajectory-Guided Reinforcement Learning for Multimodal RLVR
Jinda Lu,Junkang Wu,Jinghan Li,Kexin Huang,Shuo Yang,Mingzhu Chen,Jiancan Wu,Kuien Liu,Xiang Wang
Main category: cs.CV
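TL;DR: This paper proposes TGRL, a new reinforcement learning method that uses expert reasoning trajectories to guide multimodal large language models to integrate visual evidence effectively during fine-grained reasoning, strengthening the link between visual perception and logical reasoning.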
Details
Motivation: Existing RLVR methods can attend to relevant visual regions but struggle to incorporate visual evidence into subsequent reasoning, so reasoning chains lack support from visual facts. Method: Proposes Trajectory-Guided Reinforcement Learning (TGRL), which uses expert reasoning trajectories generated by stronger models to guide the policy model; introduces token-level reweighting and trajectory filtering to keep policy optimization stable and effective. Result: Extensive experiments on multiple multimodal reasoning benchmarks show that TGRL consistently improves reasoning performance and bridges the gap between visual perception and logical reasoning. Conclusion: TGRL is an effective new method for improving vision-reasoning alignment in multimodal large language models and offers a new approach to the weak visual grounding problem. Abstract: Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) for multimodal large language models (MLLMs) have mainly focused on improving final answer correctness and strengthening visual grounding. However, a critical bottleneck remains: although models can attend to relevant visual regions, they often fail to effectively incorporate visual evidence into subsequent reasoning, leading to reasoning chains that are weakly grounded in visual facts. To address this issue, we propose Trajectory-Guided Reinforcement Learning (TGRL), which guides the policy model to integrate visual evidence into fine-grained reasoning processes using expert reasoning trajectories from stronger models. We further introduce token-level reweighting and trajectory filtering to ensure stable and effective policy optimization. Extensive experiments on multiple multimodal reasoning benchmarks demonstrate that TGRL consistently improves reasoning performance and effectively bridges the gap between visual perception and logical reasoning.
[101] TaxaAdapter: Vision Taxonomy Models are Key to Fine-grained Image Generation over the Tree of Life
Mridul Khurana,Amin Karimi Monsefi,Justin Lee,Medha Sawhney,David Carlyn,Julia Chae,Jianyang Gu,Rajiv Ramnath,Sara Beery,Wei-Lun Chao,Anuj Karpatne,Cheng Zhang
Main category: cs.CV
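TL;DR: This paper proposes TaxaAdapter, a lightweight method that uses vision taxonomy models (such as BioCLIP) to guide text-to-image diffusion models toward images with high species-level fidelity, markedly improving fine-grained morphology and species-identity accuracy and supporting few-shot and even zero-shot species generation.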
Details
Motivation: Existing text-to-image models struggle to capture the subtle visual traits that define species identity, especially across the highly biodiverse Tree of Life. Method: Injects embeddings from Vision Taxonomy Models (such as BioCLIP) into a frozen text-to-image diffusion model, raising species-level generation fidelity without breaking text control over pose, style, background, and similar attributes. Result: TaxaAdapter consistently outperforms strong baselines on morphology fidelity and species-identity accuracy; an interpretable evaluation metric based on a multimodal large language model is proposed; the method shows strong few-shot and zero-shot species generalization. Conclusion: Vision Taxonomy Models (VTMs) are a key component for scalable, fine-grained species image generation. Abstract: Accurately generating images across the Tree of Life is difficult: there are over 10M distinct species on Earth, many of which differ only by subtle visual traits. Despite the remarkable progress in text-to-image synthesis, existing models often fail to capture the fine-grained visual cues that define species identity, even when their outputs appear photo-realistic. To this end, we propose TaxaAdapter, a simple and lightweight approach that incorporates Vision Taxonomy Models (VTMs) such as BioCLIP to guide fine-grained species generation. Our method injects VTM embeddings into a frozen text-to-image diffusion model, improving species-level fidelity while preserving flexible text control over attributes such as pose, style, and background. Extensive experiments demonstrate that TaxaAdapter consistently improves morphology fidelity and species-identity accuracy over strong baselines, with a cleaner architecture and training recipe. To better evaluate these improvements, we also introduce a multimodal Large Language Model-based metric that summarizes trait-level descriptions from generated and real images, providing a more interpretable measure of morphological consistency. Beyond this, we observe that TaxaAdapter exhibits strong generalization capabilities, enabling species synthesis in challenging regimes such as few-shot species with only a handful of training images and even species unseen during training. Overall, our results highlight that VTMs are a key ingredient for scalable, fine-grained species generation.
[102] InstaVSR: Taming Diffusion for Efficient and Temporally Consistent Video Super-Resolution
Jintong Hu,Bin Chen,Zhenyu Hu,Jiayue Liu,Guo Wang,Lu Qi
Main category: cs.CV
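TL;DR: InstaVSR is a lightweight diffusion framework that uses a pruned one-step diffusion backbone, flow-guided recurrent training, and dual-space adversarial learning to markedly improve the efficiency and temporal stability of video super-resolution while preserving high perceptual quality.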
Details
Motivation: Diffusion models for video super-resolution face the twin challenges of temporal instability and high computational cost. Method: Proposes InstaVSR: (1) a pruned one-step diffusion backbone; (2) recurrent training with flow-guided temporal regularization; (3) dual-space adversarial learning in latent and pixel spaces. Result: On an RTX 4090, a 30-frame 2K×2K video is processed in under one minute using only 7 GB of memory, with smoother temporal transitions and well-preserved perceptual quality. Conclusion: InstaVSR strikes a better balance among efficiency, memory footprint, and temporal stability, opening a path toward practical diffusion-based video super-resolution. Abstract: Video super-resolution (VSR) seeks to reconstruct high-resolution frames from low-resolution inputs. While diffusion-based methods have substantially improved perceptual quality, extending them to video remains challenging for two reasons: strong generative priors can introduce temporal instability, and multi-frame diffusion pipelines are often too expensive for practical deployment. To address both challenges simultaneously, we propose InstaVSR, a lightweight diffusion framework for efficient video super-resolution. InstaVSR combines three ingredients: (1) a pruned one-step diffusion backbone that removes several costly components from conventional diffusion-based VSR pipelines, (2) recurrent training with flow-guided temporal regularization to improve frame-to-frame stability, and (3) dual-space adversarial learning in latent and pixel spaces to preserve perceptual quality after backbone simplification. On an NVIDIA RTX 4090, InstaVSR processes a 30-frame video at 2K$\times$2K resolution in under one minute with only 7 GB of memory usage, substantially reducing the computational cost compared to existing diffusion-based methods while maintaining favorable perceptual quality with significantly smoother temporal transitions.
[103] Efficient Few-Shot Learning for Edge AI via Knowledge Distillation on MobileViT
Shuhei Tsuyuki,Reda Bensaid,Jérémy Morlier,Mathieu Léonardon,Naoya Onizawa,Vincent Gripon,Takahiro Hanyu
Main category: cs.CV
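TL;DR: This paper proposes a MobileViT pre-training method for edge computing that uses knowledge distillation to improve lightweight models on few-shot learning, clearly outperforming the ResNet12 baseline on MiniImageNet while substantially reducing parameters, computation, and energy consumption.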
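As a rough illustration of the knowledge distillation idea summarized above, the sketch below computes a temperature-softened KL-divergence loss between teacher and student logits. This is the common Hinton-style formulation and is an assumption here: the entry does not state which distillation objective the authors use, and the function names, logits, and temperature are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumed (standard) form of the loss; not taken from the paper.
    """
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# Hypothetical logits for a 5-way few-shot episode.
teacher = [4.0, 1.0, 0.5, 0.2, -1.0]
student = [2.5, 1.5, 0.0, 0.3, -0.5]
loss_mismatch = distillation_loss(student, teacher)
loss_match = distillation_loss(teacher, teacher)
print(loss_mismatch, loss_match)
```

The loss is zero when the student reproduces the teacher's softened distribution and grows as the two diverge; in practice it is usually combined with a supervised loss on the labels.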
Details
Motivation: Edge devices are constrained by compute, power, and data acquisition cost, and need efficient deep learning models with low data dependence; few-shot learning eases the scarcity of labeled data, but existing methods still hit efficiency bottlenecks when deployed at the edge. Method: Uses knowledge distillation to transfer the generalization ability of a large-scale teacher model to a lightweight MobileViT student model, evaluates one-shot and five-shot classification on MiniImageNet, and measures power consumption and latency on a Jetson Orin Nano platform. Result: Compared with the ResNet12 baseline, one-shot and five-shot classification accuracy improve by 14% and 6.7%, respectively; parameters drop by 69%, FLOPs by 88%, and dynamic energy consumption by 37%, with an inference latency of 2.6 ms. Conclusion: The method markedly improves efficiency and energy efficiency while maintaining high accuracy, making it a practical and promising solution for deploying few-shot learning models on edge AI hardware. Abstract: Efficient and adaptable deep learning models are an important area of deep learning research, driven by the need for highly efficient models on edge devices. Few-shot learning enables the use of deep learning models in low-data regimes, a capability that is highly sought after in real-world applications where collecting large annotated datasets is costly or impractical. This challenge is particularly relevant in edge scenarios, where connectivity may be limited, low-latency responses are required, or energy consumption constraints are critical. We propose and evaluate a pre-training method for the MobileViT backbone designed for edge computing. Specifically, we employ knowledge distillation, which transfers the generalization ability of a large-scale teacher model to a lightweight student model. This method achieves accuracy improvements of 14% and 6.7% for one-shot and five-shot classification, respectively, on the MiniImageNet benchmark, compared to the ResNet12 baseline, while reducing by 69% the number of parameters and by 88% the computational complexity of the model, in FLOPs. Furthermore, we deployed the proposed models on a Jetson Orin Nano platform and measured power consumption directly at the power supply, showing that the dynamic energy consumption is reduced by 37% with a latency of 2.6 ms. These results demonstrate that the proposed method is a promising and practical solution for deploying few-shot learning models on edge AI hardware.
[104] IP-Bench: Benchmark for Image Protection Methods in Image-to-Video Generation Scenarios
Xiaofeng Li,Leyi Sheng,Zhen Sun,Zongmin Zhang,Jiaheng Wei,Xinlei He
Main category: cs.CV
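TL;DR: This paper presents IP-Bench, the first systematic benchmark for evaluating image protection methods in image-to-video (I2V) generation scenarios, covering 6 representative protection methods and 5 state-of-the-art I2V models, and assessing their robustness and transferability under practical attacks and cross-model/cross-modality settings.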
Details
Motivation: Existing image protection methods lack a unified benchmark and have not been systematically evaluated in I2V generation scenarios or under preprocessing attacks, making their real-world deployment effectiveness hard to gauge. Method: Builds the IP-Bench benchmark with multi-method, multi-model testing; designs two robustness attack strategies; analyzes the cross-model and cross-modality transferability of protection methods. Result: IP-Bench is the first systematic, reproducible, and extensible evaluation framework for I2V image protection, and it reveals the performance limits of existing methods under realistic attacks. Conclusion: IP-Bench fills the gap in I2V image protection evaluation and provides a standardized test platform and directions for improvement. Abstract: With the rapid advancement of image-to-video (I2V) generation models, their potential for misuse in creating malicious content has become a significant concern. For instance, a single image can be exploited to generate a fake video, which can be used to attract attention and gain benefits. This phenomenon is referred to as an I2V generation misuse. Existing image protection methods suffer from the absence of a unified benchmark, leading to an incomplete evaluation framework. Furthermore, these methods have not been systematically assessed in I2V generation scenarios and against preprocessing attacks, which complicates the evaluation of their effectiveness in real-world deployment scenarios. To address this challenge, we propose IP-Bench (Image Protection Bench), the first systematic benchmark designed to evaluate protection methods in I2V generation scenarios. This benchmark examines 6 representative protection methods and 5 state-of-the-art I2V models. Furthermore, our work systematically evaluates protection methods' robustness with two robustness attack strategies under practical scenarios and analyzes their cross-model & cross-modality transferability. Overall, IP-Bench establishes a systematic, reproducible, and extensible evaluation framework for image protection methods in I2V generation scenarios.
[105] Gaussian Shannon: High-Precision Diffusion Model Watermarking Based on Communication
Yi Zhang,Hongbo Huang,Liang-Jie Zhang
Main category: cs.CV
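TL;DR: This paper proposes Gaussian Shannon, a watermarking framework that treats the diffusion process as a noisy channel and embeds the watermark in the initial Gaussian noise, requiring no fine-tuning and causing no loss in image quality; it supports robust tracing and exact bit recovery, suiting offline verification and scenarios that need lossless metadata.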
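To make the "noisy channel plus exact bit recovery" idea concrete, here is a minimal sketch of majority voting over a repetition code, the simplest error-correcting scheme. It is only an illustration under stated assumptions: the entry does not specify which codes Gaussian Shannon uses or how bits map to the Gaussian noise, and the names `encode`, `decode`, and the payload are hypothetical.

```python
def encode(bits, repeat=5):
    """Repetition code: transmit every payload bit `repeat` times."""
    return [b for b in bits for _ in range(repeat)]

def decode(received, repeat=5):
    """Majority vote over each group of `repeat` received copies."""
    groups = [received[i:i + repeat] for i in range(0, len(received), repeat)]
    return [1 if 2 * sum(group) > len(group) else 0 for group in groups]

payload = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical watermark bits
codeword = encode(payload)

# Simulate local bit flips: corrupt two of the five copies of every bit.
corrupted = list(codeword)
for start in range(0, len(corrupted), 5):
    corrupted[start] ^= 1
    corrupted[start + 3] ^= 1

recovered = decode(corrupted)
print(recovered == payload)
```

With five copies per bit, up to two flips per group are corrected exactly, which is the bit-exact behaviour that threshold-based (fuzzy) detection cannot offer; stronger codes achieve the same with far less redundancy.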
Details
Motivation: Existing watermarking methods based on threshold detection support only fuzzy matching and cannot exactly recover structured watermark data, so they fall short of the lossless-metadata needs of offline verification and copyright licensing. Method: Models the diffusion process as a noisy communication channel; embeds the watermark in the initial Gaussian noise; identifies two types of channel interference, local bit flips and global stochastic distortions; uses a cascaded defense of error-correcting codes and majority voting for reliable end-to-end transmission of semantic payloads. Result: Across three Stable Diffusion variants and seven perturbation types, Gaussian Shannon reaches SOTA bit-level accuracy with a high true positive rate, supporting trustworthy copyright attribution. Conclusion: Gaussian Shannon delivers a diffusion-model watermarking scheme that needs no fine-tuning, loses no quality, and combines robustness with exact bit recovery, suitable for rights attribution and content provenance in real-world settings. Abstract: Diffusion models generate high-quality images but pose serious risks like copyright violation and disinformation. Watermarking is a key defense for tracing and authenticating AI-generated content. However, existing methods rely on threshold-based detection, which only supports fuzzy matching and cannot recover structured watermark data bit-exactly, making them unsuitable for offline verification or applications requiring lossless metadata (e.g., licensing instructions). To address this problem, in this paper, we propose Gaussian Shannon, a watermarking framework that treats the diffusion process as a noisy communication channel and enables both robust tracing and exact bit recovery. Our method embeds watermarks in the initial Gaussian noise without fine-tuning or quality loss. We identify two types of channel interference, namely local bit flips and global stochastic distortions, and design a cascaded defense combining error-correcting codes and majority voting. This ensures reliable end-to-end transmission of semantic payloads. Experiments across three Stable Diffusion variants and seven perturbation types show that Gaussian Shannon achieves state-of-the-art bit-level accuracy while maintaining a high true positive rate, enabling trustworthy rights attribution in real-world deployment. The source code has been made available at: https://github.com/Rambo-Yi/Gaussian-Shannon
[106] Provably Contractive and High-Quality Denoisers for Convergent Restoration
Shubhi Shukla,Pravin Nair
Main category: cs.CV
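TL;DR: This paper proposes provably contractive denoising networks with a global Lipschitz constant below 1. By combining proximal layers with Lipschitz-controlled convolutional refinements, they retain SOTA-level performance while markedly improving robustness and stability under small input shifts, and are shown to guarantee convergence in Plug-and-Play algorithms.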
Details
Motivation: Advanced image denoising networks such as DnCNN and Restormer perform well but offer no stability guarantee under small input perturbations, exposing a robustness-accuracy trade-off. Method: Builds proximal layers via unfolding techniques and adds Lipschitz-controlled convolutional refinement modules, yielding contractive denoisers whose global Lipschitz constant is strictly below 1. Result: The proposed model is competitive with unconstrained SOTA models on image denoising and is the best-performing model under a tight Lipschitz bound (1-Lipschitz); used as a regularizer, it provably guarantees convergence of Plug-and-Play algorithms; experiments show that a strong Lipschitz constraint does not necessarily harm reconstruction quality. Conclusion: Strict Lipschitz control and high-quality image restoration can coexist, challenging the field's assumption that stability implies degraded quality and advancing verifiable, stable vision models. Abstract: Image restoration, the recovery of clean images from degraded measurements, has applications in various domains like surveillance, defense, and medical imaging. Despite achieving state-of-the-art (SOTA) restoration performance, existing convolutional and attention-based networks lack stability guarantees under minor shifts in input, exposing a robustness-accuracy trade-off. We develop provably contractive (global Lipschitz $< 1$) denoiser networks that considerably reduce this gap. Our design composes proximal layers obtained from unfolding techniques, with Lipschitz-controlled convolutional refinements. By contractivity, our denoiser guarantees that input perturbations of strength $\|\delta\|\le\varepsilon$ induce at most $\varepsilon$ change at the output, while strong baselines such as DnCNN and Restormer can exhibit larger deviations under the same perturbations. On image denoising, the proposed model is competitive with unconstrained SOTA denoisers, reporting the tightest gap for a provably 1-Lipschitz model and establishing that such gaps are indeed achievable by contractive denoisers. Moreover, the proposed denoisers act as strong regularizers for image restoration that provably effect convergence in Plug-and-Play algorithms. Our results show that enforcing strict Lipschitz control does not inherently degrade output quality, challenging a common assumption in the literature and moving the field toward verifiable and stable vision models.
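The contraction guarantee stated in this abstract can be illustrated with a toy layer. The sketch below is not the authors' architecture: it assumes a single linear map followed by ReLU (which is 1-Lipschitz), and rescales the weights by a power-iteration estimate of the spectral norm so that the layer's Lipschitz constant is about 0.9. The names `spectral_norm`, `contractive_layer`, and the matrix `W` are hypothetical.

```python
import math

def matvec(M, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def norm(x):
    """Euclidean norm."""
    return math.sqrt(sum(xi * xi for xi in x))

def spectral_norm(W, iters=200):
    """Estimate the largest singular value of W by power iteration on W^T W."""
    Wt = [list(col) for col in zip(*W)]
    v = [1.0] * len(W[0])
    for _ in range(iters):
        v = matvec(Wt, matvec(W, v))
        n = norm(v)
        v = [vi / n for vi in v]
    return norm(matvec(W, v))

def contractive_layer(W, target=0.9):
    """Return x -> ReLU(W_scaled x) with the spectral norm of W_scaled near `target`."""
    scale = target / spectral_norm(W)
    Ws = [[scale * w for w in row] for row in W]
    return lambda x: [max(0.0, y) for y in matvec(Ws, x)]

W = [[2.0, 1.0], [0.0, 3.0]]           # toy weights, spectral norm about 3.26
f = contractive_layer(W, target=0.9)

x = [1.0, 2.0]
delta = [0.03, -0.04]                   # perturbation with norm 0.05
x_pert = [a + b for a, b in zip(x, delta)]
change = norm([a - b for a, b in zip(f(x_pert), f(x))])
bound = 0.9 * norm(delta)
print(change, bound)
```

The output change stays below 0.9 times the perturbation norm, which is the single-layer version of the bound; composing layers multiplies their Lipschitz constants, so a network remains contractive only if that product stays below 1.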
Codes and pretrained models are available at https://github.com/SHUBHI1553/Contractive-Denoisers
[107] CREval: An Automated Interpretable Evaluation for Creative Image Manipulation under Complex Instructions
Chonghuinan Wang,Zihan Chen,Yuxiang Wei,Tianyi Jiang,Xiaohe Wu,Fan Li,Wangmeng Zuo,Hongxun Yao
Main category: cs.CV
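TL;DR: This paper proposes CREval, a QA-based automated evaluation pipeline for complex creative image editing tasks, and builds the CREval-Bench benchmark, spanning three categories and nine dimensions with 800+ samples and 13K queries. A systematic evaluation of various SOTA models finds that closed-source models are slightly better but overall performance remains inadequate, and that the automated metrics agree closely with human evaluation.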
Details
Motivation: Existing evaluation methods lack a systematic, human-aligned framework and cannot effectively assess complex creative image editing tasks. Method: Proposes CREval, a fully automated QA-based evaluation pipeline that improves interpretability; builds the CREval-Bench benchmark covering three categories and nine dimensions, with 800+ editing samples and 13K evaluation queries; systematically evaluates a range of open- and closed-source multimodal large models. Result: Closed-source models generally outperform open-source ones on complex creative tasks, yet no model completes them satisfactorily; CREval's automated metrics are strongly consistent with human judgments. Conclusion: CREval offers a reliable, interpretable, human-aligned foundation for automated evaluation of complex creative image editing and exposes key weaknesses of current models and directions for future research. Abstract: Instruction-based multimodal image manipulation has recently made rapid progress. However, existing evaluation methods lack a systematic and human-aligned framework for assessing model performance on complex and creative editing tasks. To address this gap, we propose CREval, a fully automated question-answer (QA)-based evaluation pipeline that overcomes the incompleteness and poor interpretability of opaque Multimodal Large Language Models (MLLMs) scoring. Simultaneously, we introduce CREval-Bench, a comprehensive benchmark specifically designed for creative image manipulation under complex instructions. CREval-Bench covers three categories and nine creative dimensions, comprising over 800 editing samples and 13K evaluation queries. Leveraging this pipeline and benchmark, we systematically evaluate a diverse set of state-of-the-art open and closed-source models. The results reveal that while closed-source models generally outperform open-source ones on complex and creative tasks, all models still struggle to complete such edits effectively. In addition, user studies demonstrate strong consistency between CREval's automated metrics and human judgments. Therefore, CREval provides a reliable foundation for evaluating image editing models on complex and creative image manipulation tasks, and highlights key challenges and opportunities for future research.
[108] Consistency Beyond Contrast: Enhancing Open-Vocabulary Object Detection Robustness via Contextual Consistency Learning
Bozhao Li,Shaocong Wu,Tong Shao,Senqiao Yang,Qiben Shan,Zhuotao Tian,Jingyong Su
Main category: cs.CV
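TL;DR: This paper proposes the Contextual Consistency Learning (CCL) framework, which uses Contextual Bootstrapped Data Generation (CBDG) and a Contextual Consistency Loss (CCLoss) to improve the robustness and generalization of open-vocabulary object detection across scenes, achieving SOTA performance on the OmniLabel and D3 datasets.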
Details
Motivation: Existing open-vocabulary object detection methods overlook consistency within a single modality, especially the visual one, so detection of the same object degrades when the background or environment changes, revealing a robustness gap. Method: Proposes the Contextual Consistency Learning (CCL) framework with two parts: 1) Contextual Bootstrapped Data Generation (CBDG), which generates images of the same object placed in diverse backgrounds; 2) Contextual Consistency Loss (CCLoss), which constrains object features to remain invariant under environmental changes. Result: Gains of +16.3 AP on OmniLabel and +14.9 AP on D3, reaching state-of-the-art performance. Conclusion: Enforcing contextual consistency within a single modality markedly improves the generalization and robustness of open-vocabulary object detectors in diverse environments. Abstract: Recent advances in open-vocabulary object detection focus primarily on two aspects: scaling up datasets and leveraging contrastive learning to align language and vision modalities. However, these approaches often neglect internal consistency within a single modality, particularly when background or environmental changes occur. This lack of consistency leads to a performance drop because the model struggles to detect the same object in different scenes, which reveals a robustness gap. To address this issue, we introduce Contextual Consistency Learning (CCL), a novel framework that integrates two key strategies: Contextual Bootstrapped Data Generation (CBDG) and Contextual Consistency Loss (CCLoss). CBDG functions as a data generation mechanism, producing images that contain the same objects across diverse backgrounds. This is essential because existing datasets alone do not support our CCL framework. The CCLoss further enforces the invariance of object features despite environmental changes, thereby improving the model's robustness in different scenes. These strategies collectively form a unified framework for ensuring contextual consistency within the same modality. Our method achieves state-of-the-art performance, surpassing previous approaches by +16.3 AP on OmniLabel and +14.9 AP on D3. These results demonstrate the importance of enforcing intra-modal consistency, significantly enhancing model generalization in diverse environments. Our code is publicly available at: https://github.com/bozhao-li/CCL.
[109] GLINT: Modeling Scene-Scale Transparency via Gaussian Radiance Transport
Youngju Na,Jaeseong Yun,Soohyun Ryu,Hyunsu Kim,Sung-Eui Yoon,Suyong Yeon
Main category: cs.CV
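TL;DR: This paper proposes GLINT, a framework that models scene-scale transparency with an explicitly decomposed Gaussian representation, separating reflected and transmitted radiance to address the inability of 3D Gaussian splatting to handle transparent objects such as glass.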
Details
Motivation: 3D Gaussian splatting cannot model transparent objects such as glass because it struggles to decouple the radiance contributed by the transparent interface from the transmitted geometry seen through the glass. Method: GLINT uses an explicitly decomposed Gaussian representation that reconstructs the primary interface and models reflected and transmitted radiance independently; geometry-separation cues, together with geometry and material priors from a pre-trained video relighting model, guide the localization and optimization of transparent regions. Result: On complex transparent scene reconstruction, GLINT shows consistent and significant improvements over prior methods. Conclusion: GLINT brings transparency modeling into the 3D Gaussian splatting paradigm, achieving high-quality, consistent transparent scene reconstruction through radiance decoupling and prior guidance. Abstract: While 3D Gaussian splatting has emerged as a powerful paradigm, it fundamentally fails to model transparency such as glass panels. The core challenge lies in decoupling the intertwined radiance contributions from transparent interfaces and the transmitted geometry observed through the glass. We present GLINT, a framework that models scene-scale transparency through explicit decomposed Gaussian representation. GLINT reconstructs the primary interface and models reflected and transmitted radiance separately, enabling consistent radiance transport. During optimization, GLINT bootstraps transparency localization from geometry-separation cues induced by the decomposition, together with geometry and material priors from a pre-trained video relighting model. Extensive experiments demonstrate consistent improvements over prior methods for reconstructing complex transparent scenes.
[110] DUGAE: Unified Geometry and Attribute Enhancement via Spatiotemporal Correlations for G-PCC Compressed Dynamic Point Clouds
Pan Zhao,Hui Yuan,Chang Sun,Chongzhen Tian,Raouf Hamzaoui,Sam Kwong
Main category: cs.CV
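TL;DR: This paper proposes DUGAE, a unified geometry and attribute enhancement framework for G-PCC compressed dynamic point clouds, which explicitly models inter-frame spatiotemporal correlations to jointly enhance geometry and attributes, markedly improving reconstruction quality and reducing bitrate.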
Details
Motivation: Existing post-decoding quality enhancement methods target static point clouds and process each frame independently, so they cannot exploit the spatiotemporal correlations in dynamic point cloud sequences. Method: Proposes the DUGAE framework: 1) a dynamic geometry enhancement network (DGE-Net) based on sparse convolution and geometry motion compensation (GMC); 2) a detail-aware KNN recoloring module (DA-KNN) that accurately maps attributes onto the enhanced geometry; 3) a dynamic attribute enhancement network (DAE-Net) with temporal feature extraction and attribute motion compensation (AMC). Result: On seven dynamic point clouds from 8iVFB v2, Owlii, and MVUB, compared with GeS-TM v10, geometry (D1) gains 11.03 dB in average BD-PSNR with a 93.95% BD-bitrate reduction; the luma component gains 4.23 dB in BD-PSNR with a 66.61% BD-bitrate reduction; PCQM perceptual quality improves, and the method outperforms V-PCC. Conclusion: DUGAE is the first to jointly enhance the geometry and attributes of G-PCC compressed dynamic point clouds; explicit spatiotemporal modeling substantially improves reconstruction quality and coding efficiency, offering a new paradigm for dynamic point cloud post-processing. Abstract: Existing post-decoding quality enhancement methods for point clouds are designed for static data and typically process each frame independently. As a result, they cannot effectively exploit the spatiotemporal correlations present in point cloud sequences. We propose a unified geometry and attribute enhancement framework (DUGAE) for G-PCC compressed dynamic point clouds that explicitly exploits inter-frame spatiotemporal correlations in both geometry and attributes. First, a dynamic geometry enhancement network (DGE-Net) based on sparse convolution (SPConv) and feature-domain geometry motion compensation (GMC) aligns and aggregates spatiotemporal information. Then, a detail-aware k-nearest neighbors (DA-KNN) recoloring module maps the original attributes onto the enhanced geometry at the encoder side, improving mapping completeness and preserving attribute details. Finally, a dynamic attribute enhancement network (DAE-Net) with dedicated temporal feature extraction and feature-domain attribute motion compensation (AMC) refines attributes by modeling complex spatiotemporal correlations. On seven dynamic point clouds from the 8iVFB v2, Owlii, and MVUB datasets, DUGAE significantly enhanced the performance of the latest G-PCC geometry-based solid content test model (GeS-TM v10). For geometry (D1), it achieved an average BD-PSNR gain of 11.03 dB and a 93.95% BD-bitrate reduction. For the luma component, it achieved a 4.23 dB BD-PSNR gain with a 66.61% BD-bitrate reduction.
DUGAE also improved perceptual quality (as measured by PCQM) and outperformed V-PCC. Our source code will be released on GitHub at: https://github.com/yuanhui0325/DUGAE
[111] Progressive Learning with Anatomical Priors for Reliable Left Atrial Scar Segmentation from Late Gadolinium Enhancement MRI
Jing Zhang,Bastien Bergere,Emilie Bollache,Jonas Leite,Mikaël Laredo,Alban Redheuil,Nadjia Kachenoura
Main category: cs.CV
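TL;DR: This paper proposes a three-stage progressive learning framework inspired by the clinical workflow (based on SwinUNETR), combined with an anatomy-aware spatially weighted loss, to improve the accuracy and reliability of automatic left atrial (LA) scar segmentation in cardiac MRI LGE images.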
Details
Motivation: Automatic LA scar segmentation is hampered by low contrast, large annotation variability, and missing anatomical constraints, which lead to unreliable predictions; because the spatial distribution of scar is closely linked to atrial fibrillation severity and recurrence, more robust methods are needed. Method: Builds a three-stage progressive learning framework on SwinUNETR: 1) LA cavity pre-learning; 2) a dual-task model that jointly learns LA geometry and scar patterns; 3) fine-tuning for precise scar segmentation. An anatomy-aware spatially weighted loss incorporating clinical anatomical priors constrains scar predictions to anatomically plausible LA wall regions and mitigates annotation bias. Result: In 5-fold cross-validation on the public LASCARQS dataset, LA segmentation reaches a Dice of 0.94 and LA scar segmentation a Dice of 0.50 (versus 0.49 for the one-stage method), a Hausdorff distance of 11.84 mm (versus 13.02 mm), and an average surface distance of 1.80 mm (versus 1.96 mm). Conclusion: Explicitly embedding clinical anatomical priors and diagnostic logic into deep learning models improves the accuracy and reliability of LA scar segmentation in LGE images, confirming the importance of clinically informed modeling. Abstract: Cardiac MRI late gadolinium enhancement (LGE) enables non-invasive identification of left atrial (LA) scar, whose spatial distribution is strongly associated with atrial fibrillation (AF) severity and recurrence. However, automatic LA scar segmentation remains challenging due to low contrast, annotation variability, and the lack of anatomical constraints, often leading to unreliable predictions. Accordingly, our aim was to propose a progressive learning strategy, inspired by a clinical workflow, to segment LA scar from LGE images. A 3-stage framework based on SwinUNETR was implemented, comprising: 1) a first LA cavity pre-learning model, 2) a dual-task model which further learns the spatial relationship between LA geometry and scar patterns, and 3) fine-tuning on precise segmentation of the scar. Furthermore, we introduced an anatomy-aware spatially weighted loss that incorporates prior clinical knowledge by constraining scar predictions to anatomically plausible LA wall regions while mitigating annotation bias. In our preliminary results, obtained on validation LGE volumes from the public LASCARQS dataset after 5-fold cross-validation, LA segmentation had a Dice score of 0.94, and LA scar segmentation achieved a Dice score of 0.50, a Hausdorff Distance of 11.84 mm, and an Average Surface Distance of 1.80 mm, outperforming a one-stage scar segmentation with 0.49, 13.02 mm, and 1.96 mm, respectively.
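For readers unfamiliar with the Dice score reported above, the following minimal sketch computes it for two binary masks. It uses the standard definition 2|P ∩ T| / (|P| + |T|) on flattened masks; the function name and the toy masks are hypothetical and unrelated to the LASCARQS data.

```python
def dice_score(pred, target):
    """Dice coefficient between two flat binary masks (lists of 0/1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p & t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention: two empty masks are treated as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

# Toy example: 3 predicted voxels, 3 reference voxels, 2 in common.
pred_mask = [1, 1, 0, 0, 1, 0]
true_mask = [1, 0, 0, 1, 1, 0]
score = dice_score(pred_mask, true_mask)
print(score)
```

Because small, thin structures such as atrial scar contain few voxels, a handful of misplaced voxels moves the Dice score substantially, which helps explain why scar Dice values are much lower than cavity Dice values.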
By explicitly embedding clinical anatomical priors and diagnostic reasoning into deep learning, the proposed approach improved the accuracy and reliability of LA scar segmentation from LGE, revealing the importance of clinically informed model design.
[112] OSA: Echocardiography Video Segmentation via Orthogonalized State Update and Anatomical Prior-aware Feature Enhancement
Rui Wang,Huisi Wu,Jing Qin
Main category: cs.CV
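TL;DR: This paper proposes the OSA framework, whose Orthogonalized State Update (OSU) mechanism constrains state evolution to the Stiefel manifold to prevent rank collapse in recurrent models, and whose anatomical prior-aware feature enhancement module suppresses ultrasound speckle noise, achieving accurate and temporally consistent ventricle segmentation.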
Details
Motivation: Left ventricle segmentation in ultrasound videos must be both accurate and temporally consistent, but existing linear recurrent models suffer rank collapse from unconstrained state updates, and speckle noise and non-rigid deformation make modeling harder. Method: Proposes the OSA framework: 1) an Orthogonalized State Update (OSU) mechanism that performs Euclidean projected gradient descent on the Stiefel manifold to constrain state evolution; 2) an Anatomical Prior-aware Feature Enhancement module that separates anatomical structures from speckle noise through a physics-driven process. Result: Achieves SOTA segmentation accuracy and temporal stability on the CAMUS and EchoNet-Dynamic datasets while maintaining real-time inference efficiency. Conclusion: OSA mitigates rank collapse in recurrent models and improves the robustness and temporal consistency of ventricle segmentation in ultrasound videos, with potential for clinical deployment. Abstract: Accurate and temporally consistent segmentation of the left ventricle from echocardiography videos is essential for estimating the ejection fraction and assessing cardiac function. However, modeling spatiotemporal dynamics remains difficult due to severe speckle noise and rapid non-rigid deformations. Existing linear recurrent models offer efficient in-context associative recall for temporal tracking, but rely on unconstrained state updates, which cause progressive singular value decay in the state matrix, a phenomenon known as rank collapse, resulting in anatomical details being overwhelmed by noise. To address this, we propose OSA, a framework that constrains the state evolution on the Stiefel manifold. We introduce the Orthogonalized State Update (OSU) mechanism, which formulates the memory evolution as Euclidean projected gradient descent on the Stiefel manifold to prevent rank collapse and maintain stable temporal transitions. Furthermore, an Anatomical Prior-aware Feature Enhancement module explicitly separates anatomical structures from speckle noise through a physics-driven process, providing the temporal tracker with noise-resilient structural cues. Comprehensive experiments on the CAMUS and EchoNet-Dynamic datasets show that OSA achieves state-of-the-art segmentation accuracy and temporal stability, while maintaining real-time inference efficiency for clinical deployment. Codes are available at https://github.com/wangrui2025/OSA.
[113] Dual-Stage Invariant Continual Learning under Extreme Visual Sparsity
Rangya Zhang,Jiaping Xiao,Lu Bai,Yuhang Zhang,Mir Feroskhan
Main category: cs.CV
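TL;DR: This paper proposes a dual-stage invariant continual learning framework that uses joint distillation (backbone representations plus detection predictions) and a sparsity-aware data conditioning strategy to counter the representation drift caused by extreme sparsity in space object detection, improving mAP by 4.0 on an RSO detection dataset.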
Details
Motivation: Existing continual learning methods for object detection mostly assume relatively balanced visual conditions, but in extremely sparse scenarios such as resident space object (RSO) detection, background dominance causes unstable gradients and representation drift, exposing a structural flaw in methods that rely only on output-level distillation. Method: Proposes a dual-stage invariant continual learning framework: 1) joint distillation, constraining both intermediate backbone representations (structural consistency) and detection predictions (semantic consistency); 2) sparsity-aware data conditioning, combining patch-based sampling with distribution-aware augmentation to stabilize gradient statistics. Result: On a high-resolution space-based RSO detection dataset, the method achieves an absolute gain of +4.0 mAP over mainstream continual object detection methods under sequential domain shifts. Conclusion: Stability of intermediate representations is critical for continual object detection; joint distillation and sparsity-aware data conditioning suppress error propagation while preserving adaptability, offering a new paradigm for continual learning in extremely sparse scenarios. Abstract: Continual learning seeks to maintain stable adaptation under non-stationary environments, yet this problem becomes particularly challenging in object detection, where most existing methods implicitly assume relatively balanced visual conditions. In extreme-sparsity regimes, such as those observed in space-based resident space object (RSO) detection scenarios, foreground signals are overwhelmingly dominated by background observations. Under such conditions, we analytically demonstrate that background-driven gradients destabilize the feature backbone during sequential domain shifts, causing progressive representation drift. This exposes a structural limitation of continual learning approaches relying solely on output-level distillation, as they fail to preserve intermediate representation stability. To address this, we propose a dual-stage invariant continual learning framework via joint distillation, enforcing structural and semantic consistency on both backbone representations and detection predictions, respectively, thereby suppressing error propagation at its source while maintaining adaptability. Furthermore, to regulate gradient statistics under severe imbalance, we introduce a sparsity-aware data conditioning strategy combining patch-based sampling and distribution-aware augmentation. Experiments on a high-resolution space-based RSO detection dataset show consistent improvement over established continual object detection methods, achieving an absolute gain of +4.0 mAP under sequential domain shifts.
[114] HAD: Heterogeneity-Aware Distillation for Lifelong Heterogeneous Learning
Xuerui Zhang,Xuehao Wang,Zhan Zhuang,Linglan Zhao,Ziyue Li,Xinmin Zhang,Zhihuan Song,Yu Zhang
Main category: cs.CV
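TL;DR: This paper introduces lifelong heterogeneous learning (LHL) as a new paradigm, focuses on dense prediction tasks (LHL4DP), and designs the exemplar-free Heterogeneity-Aware Distillation (HAD) method, whose distribution-balanced and salience-guided distillation losses preserve knowledge of heterogeneous tasks without using past samples.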
Details
Motivation: Existing lifelong learning focuses mainly on homogeneous task streams (for example, classification only) and neglects learning over heterogeneous task sequences with differing output structures, lacking an effective mechanism for retaining knowledge across task types. Method: Proposes Heterogeneity-Aware Distillation (HAD): 1) a distribution-balanced heterogeneity-aware distillation loss that alleviates global imbalance in the prediction distribution; 2) a salience-guided distillation loss that focuses on key regions using edge pixels extracted with the Sobel operator; no samples from past tasks are stored at any point (exemplar-free). Result: In the LHL4DP setting, HAD significantly outperforms existing lifelong learning methods on multiple dense prediction task sequences, demonstrating its effectiveness and robustness for continual learning over heterogeneous tasks. Conclusion: Lifelong heterogeneous learning (LHL) is an extended paradigm closer to practical applications; HAD, an efficient exemplar-free distillation strategy, offers a feasible, high-performing solution for continual learning across tasks with differently structured outputs. Abstract: Lifelong learning aims to preserve knowledge acquired from previous tasks while incorporating knowledge from a sequence of new tasks. However, most prior work explores only streams of homogeneous tasks (e.g., only classification tasks) and neglects the scenario of learning across heterogeneous tasks that possess different structures of outputs. In this work, we formalize this broader setting as lifelong heterogeneous learning (LHL). Departing from conventional lifelong learning, the task sequence of LHL spans different task types, and the learner needs to retain heterogeneous knowledge for different output space structures. To instantiate the LHL, we focus on LHL in the context of dense prediction (LHL4DP), a realistic and challenging scenario. To this end, we propose the Heterogeneity-Aware Distillation (HAD) method, an exemplar-free approach that preserves previously gained heterogeneous knowledge by self-distillation in each training phase. The proposed HAD comprises two complementary components, including a distribution-balanced heterogeneity-aware distillation loss to alleviate the global imbalance of prediction distribution and a salience-guided heterogeneity-aware distillation loss that concentrates learning on informative edge pixels extracted with the Sobel operator. Extensive experiments demonstrate that the proposed HAD method significantly outperforms existing methods in this new scenario.
[115] MemCam: Memory-Augmented Camera Control for Consistent Video Generation
Xinhang Gao,Junlin Guan,Shuhan Luo,Wenzhuo Li,Guanghuan Tan,Jiacheng Wang
Main category: cs.CV
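TL;DR: MemCam is a memory-augmented interactive video generation method that treats previously generated frames as external memory and combines context compression with co-visibility-based selection, significantly improving scene consistency in long video generation under dynamic camera control.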
Details
Motivation: Existing interactive video generation methods struggle to maintain scene consistency when generating long videos under dynamic camera control, mainly because contextual information is limited. Method: Proposes MemCam, which uses historical frames as external memory for conditioning; a context compression module encodes memory frames into compact representations, and co-visibility-based selection dynamically retrieves the most relevant historical frames. Result: On interactive video generation tasks, MemCam significantly outperforms baselines and open-source state-of-the-art methods in scene consistency, particularly on long videos with large camera rotations. Conclusion: Through memory augmentation and context retrieval, MemCam addresses the scene-consistency problem of long video generation under dynamic camera control and offers a new direction for high-quality interactive video generation. Abstract: Interactive video generation has significant potential for scene simulation and video creation. However, existing methods often struggle with maintaining scene consistency during long video generation under dynamic camera control due to limited contextual information. To address this challenge, we propose MemCam, a memory-augmented interactive video generation approach that treats previously generated frames as external memory and leverages them as contextual conditioning to achieve controllable camera viewpoints with high scene consistency. To enable longer and more relevant context, we design a context compression module that encodes memory frames into compact representations and employs co-visibility-based selection to dynamically retrieve the most relevant historical frames, thereby reducing computational overhead while enriching contextual information. Experiments on interactive video generation tasks show that MemCam significantly outperforms existing baseline methods as well as open-source state-of-the-art approaches in terms of scene consistency, particularly in long video scenarios with large camera rotations.
[116] 4DRaL: Bridging 4D Radar with LiDAR for Place Recognition using Knowledge Distillation
Ningyuan Huang,Zhiheng Li,Zheng Fang
Main category: cs.CV
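TL;DR: This paper proposes 4DRaL, a framework that uses knowledge distillation to improve 4D millimeter-wave radar place recognition in adverse weather: a LiDAR model guides the training of the radar model, and the framework supports both radar-to-radar (R2R) and radar-to-LiDAR (R2L) tasks.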
Details
Motivation: Mainstream camera- and LiDAR-based place recognition is vulnerable to adverse weather; the emerging 4D millimeter-wave radar has all-weather potential, but the sparsity and noise of its data severely limit performance, so an effective enhancement method is needed. Method: Proposes the knowledge distillation (KD) based 4DRaL framework: a high-performance LiDAR-to-LiDAR (L2L) model serves as the teacher guiding the training of a 4D radar-to-radar (R2R) student model; it comprises three core modules (local image enhancement, feature distribution distillation, and response distillation) and can be reconfigured for the R2L task. Result: Experiments show that 4DRaL achieves state-of-the-art performance on both R2R and R2L tasks in normal and adverse weather. Conclusion: 4DRaL overcomes the sparsity and noise of 4D radar data, demonstrates the feasibility of knowledge distillation for cross-modal, all-weather place recognition, and offers a new approach to robust robot localization. Abstract: Place recognition is crucial for loop closure detection and global localization in robotics. Although mainstream algorithms typically rely on cameras and LiDAR, these sensors are susceptible to adverse weather conditions. Fortunately, the recently developed 4D millimeter-wave radar (4D radar) offers a promising solution for all-weather place recognition. However, the inherent noise and sparsity in 4D radar data significantly limit its performance. Thus, in this paper, we propose a novel framework called 4DRaL that leverages knowledge distillation (KD) to enhance the place recognition performance of 4D radar. Its core is to adopt a high-performance LiDAR-to-LiDAR (L2L) place recognition model as a teacher to guide the training of a 4D radar-to-4D radar (R2R) place recognition model. 4DRaL comprises three key KD modules: a local image enhancement module to handle the sparsity of raw 4D radar points, a feature distribution distillation module that ensures the student model generates more discriminative features, and a response distillation module to maintain consistency in feature space between the teacher and student models. More importantly, 4DRaL can also be trained for 4D radar-to-LiDAR (R2L) place recognition through different module configurations. Experimental results show that 4DRaL achieves state-of-the-art performance in both R2R and R2L tasks in normal and adverse weather alike.
[117] Towards GUI Agents: Vision-Language Diffusion Models for GUI Grounding
Shrinidhi Kumbhar,Haofu Liao,Srikar Appalaraju,Kunwar Yashraj Singh
Main category: cs.CV
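TL;DR: This paper explores the potential of discrete diffusion vision-language models (DVLMs) for GUI grounding, proposes a hybrid masking strategy to improve bounding-box prediction accuracy, and shows through experiments on multiple datasets that performance is comparable to autoregressive models.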
Details
Motivation: Although discrete diffusion vision-language models (DVLMs) have shown advantages in multimodal reasoning, their suitability for GUI grounding has not been studied; this paper evaluates whether DVLMs can replace mainstream autoregressive (AR) models for GUI grounding. Method: Adapts LLaDA-V to single-turn action and bounding-box prediction, framing GUI grounding as text generation from multimodal input; proposes a hybrid masking schedule combining linear and deterministic masking to better model bounding-box geometry; and systematically analyzes how diffusion steps, generation length, block length, and training-data expansion affect performance and latency. Result: Hybrid masking raises Step Success Rate (SSR) by up to 6.1 points; on four GUI datasets spanning web, desktop, and mobile, the method consistently outperforms the linear-masking baseline and is competitive with autoregressive models; expanding GUI-domain training data improves accuracy by 20 points on average and reduces latency by about 1.3 seconds; accuracy rises with the number of diffusion steps but eventually saturates. Conclusion: Discrete DVLMs are a promising modeling paradigm for GUI grounding and provide an important foundation for building diffusion-based GUI agents. Abstract: Autoregressive (AR) vision-language models (VLMs) have long dominated multimodal understanding, reasoning, and graphical user interface (GUI) grounding. Recently, discrete diffusion vision-language models (DVLMs) have shown strong performance in multimodal reasoning, offering bidirectional attention, parallel token generation, and iterative refinement. However, their potential for GUI grounding remains unexplored. In this work, we evaluate whether discrete DVLMs can serve as a viable alternative to AR models for GUI grounding. We adapt LLaDA-V for single-turn action and bounding-box prediction, framing the task as text generation from multimodal input. To better capture the hierarchical structure of bounding-box geometry, we propose a hybrid masking schedule that combines linear and deterministic masking, improving grounding accuracy by up to 6.1 points in Step Success Rate (SSR) over the GUI-adapted LLaDA-V trained with linear masking. Evaluations on four datasets spanning web, desktop, and mobile interfaces show that the adapted diffusion model with hybrid masking consistently outperforms the linear-masked variant and performs competitively with autoregressive counterparts despite limited pretraining. Systematic ablations reveal that increasing diffusion steps, generation length, and block length improves accuracy but also increases latency, with accuracy plateauing beyond a certain number of diffusion steps. 
Expanding the training data with diverse GUI domains further reduces latency by about 1.3 seconds and improves grounding accuracy by an average of 20 points across benchmarks. These results demonstrate that discrete DVLMs are a promising modeling framework for GUI grounding and represent an important step toward diffusion-based GUI agents.
[118] Real-Time Branch-to-Tool Distance Estimation for Autonomous UAV Pruning: Benchmarking Five DEFOM-Stereo Variants from Simulation to Jetson Deployment
Yida Lin,Bing Xue,Mengjie Zhang,Sam Schofield,Richard Green
Main category: cs.CV
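TL;DR: This paper proposes and evaluates several DEFOM-Stereo variants for real-time, accurate stereo depth estimation in autonomous UAV tree pruning; trained on synthetic data and deployed on a Jetson Orin platform, DEFOM-PrunePlus offers the best trade-off between accuracy and speed and is suited to practical deployment.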
Details
Motivation: Autonomous UAV pruning requires real-time, accurate metric distance estimation from the cutting tool to thin branches to avoid collisions and enable closed-loop control, but existing methods struggle to combine accuracy with real-time embedded performance. Method: Builds a synthetic multi-view stereo dataset in Unreal Engine 5 with 115 trees (5,520 image pairs plus exact EXR depth maps), trains five DEFOM-Stereo variants, and measures inference speed and depth accuracy (EPE, D1-all, delta-1, MAE) on an NVIDIA Jetson Orin Super; zero-shot transfer to real photographs is also checked. Result: DEFOM-Stereo ViT-S is the most accurate (depth MAE 23.40 cm) but runs at only 2.2 FPS; DEFOM-PrunePlus (~3.3 FPS, depth MAE 64.26 cm) balances safety and real-time operation at the 2 m working distance; the lighter models (PruneStereo/PruneNano) are faster but insufficiently accurate (MAE > 57 cm); zero-shot tests indicate that the larger models preserve branch geometry well. Conclusion: DEFOM-PrunePlus is the most practical option on current embedded platforms, while ViT-S can serve as the accuracy reference once hardware improves. Abstract: Autonomous tree pruning with unmanned aerial vehicles (UAVs) is a safety-critical real-world task: the onboard perception system must estimate the metric distance from a cutting tool to thin tree branches in real time so that the UAV can approach, align, and actuate the pruner without collision. We address this problem by training five variants of DEFOM-Stereo - a recent foundation-model-based stereo matcher - on a task-specific synthetic dataset and deploying the checkpoints on an NVIDIA Jetson Orin Super 16 GB. The training corpus is built in Unreal Engine 5 with a simulated ZED Mini stereo camera capturing 5,520 stereo pairs across 115 tree instances from three viewpoints at 2 m distance; dense EXR depth maps provide exact, spatially complete supervision for thin branches. On the synthetic test set, DEFOM-Stereo ViT-S achieves the best depth-domain accuracy (EPE 1.74 px, D1-all 5.81%, delta-1 95.90%, depth MAE 23.40 cm) but its Jetson inference speed of ~2.2 FPS (~450 ms per frame) remains too slow for responsive closed-loop tool control. A newly introduced balanced variant, DEFOM-PrunePlus (~21M backbone, ~3.3 FPS on Jetson), offers the best deployable accuracy-speed trade-off (EPE 5.87 px, depth MAE 64.26 cm, delta-1 87.59%): its frame rate is sufficient for real-time guidance and its depth accuracy supports safe branch approach planning at the 2 m operating range. 
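The abstract reports disparity-domain error (EPE, in pixels) alongside depth-domain error (MAE, in cm). For a rectified stereo pair the two are linked by Z = f * B / d, so a fixed disparity error costs more depth accuracy as range grows. The sketch below illustrates only that relation. The focal length and baseline values are illustrative assumptions, not the calibration used in the paper, and both function names are hypothetical.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Pinhole stereo relation for a rectified pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def depth_error_from_disparity_error(depth_m, disparity_err_px, focal_px, baseline_m):
    """First-order sensitivity of depth to disparity: |dZ| ~= Z^2 / (f * B) * |dd|."""
    return depth_m ** 2 / (focal_px * baseline_m) * abs(disparity_err_px)


# Assumed camera parameters, for illustration only (not taken from the paper).
FOCAL_PX = 700.0
BASELINE_M = 0.063

# Disparity that a point 2 m away would produce under these assumptions.
disparity_at_2m = FOCAL_PX * BASELINE_M / 2.0
depth = depth_from_disparity(disparity_at_2m, FOCAL_PX, BASELINE_M)

# Depth error caused by a 1 px disparity error at 2 m range.
err = depth_error_from_disparity_error(2.0, 1.0, FOCAL_PX, BASELINE_M)
print(f"depth = {depth:.2f} m, depth error per px of disparity = {err * 100:.1f} cm")
```

This is a per-pixel sensitivity at a single range. The reported depth MAE is aggregated over whole depth maps, so the sketch should not be expected to reproduce the paper's figures.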
The lightweight DEFOM-PruneStereo (~6.9 FPS) and DEFOM-PruneNano (~8.5 FPS) run fast but sacrifice substantial accuracy (depth MAE > 57 cm), making estimates too unreliable for safe actuation. Zero-shot inference on real photographs confirms that full-capacity models preserve branch geometry, validating the sim-to-real transfer. We conclude that DEFOM-PrunePlus provides the most practical accuracy-latency balance for onboard distance estimation, while ViT-S serves as the reference for future hardware.
[119] ARTA: Adaptive Mixed-Resolution Token Allocation for Efficient Dense Feature Extraction
David Hagerman,Roman Naeem,Erik Brorsson,Fredrik Kahl,Lennart Svensson
Main category: cs.CV
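TL;DR: ARTA is a mixed-resolution coarse-to-fine vision Transformer that uses a lightweight allocator to dynamically assign high-resolution tokens to semantic boundary regions for efficient dense feature extraction.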
Details
Motivation: Conventional models start from dense high-resolution tokens and are computationally expensive; ARTA aims to improve the balance between efficiency and accuracy in dense prediction tasks such as semantic segmentation, reducing redundant computation especially in boundary-sensitive scenes. Method: Proposes a coarse-to-fine mixed-resolution architecture: low-resolution tokens first model global structure, then a lightweight allocator iteratively assigns high-resolution tokens to key regions based on a semantic boundary score; mixed-resolution attention lets coarse and fine tokens interact and focuses computation on semantically complex regions. Result: Achieves state-of-the-art results on ADE20K and COCO-Stuff with substantially fewer FLOPs and competitive results on Cityscapes at lower compute; ARTA-Base reaches 54.6 mIoU on ADE20K in the ~100M-parameter class while using fewer FLOPs and less memory. Conclusion: Through semantics-aware dynamic token allocation and mixed-resolution modeling, ARTA improves the computational efficiency and boundary modeling of dense feature extraction and offers a new design for efficient vision Transformers. Abstract: We present ARTA, a mixed-resolution coarse-to-fine vision transformer for efficient dense feature extraction. Unlike models that begin with dense high-resolution (fine) tokens, ARTA starts with low-resolution (coarse) tokens and uses a lightweight allocator to predict which regions require more fine tokens. The allocator iteratively predicts a semantic (class) boundary score and allocates additional tokens to patches above a low threshold, concentrating token density near boundaries while maintaining high sensitivity to weak boundary evidence. This targeted allocation encourages tokens to represent a single semantic class rather than a mixture of classes. Mixed-resolution attention enables interaction between coarse and fine tokens, focusing computation on semantically complex areas while avoiding redundant processing in homogeneous regions. Experiments demonstrate that ARTA achieves state-of-the-art results on ADE20K and COCO-Stuff with substantially fewer FLOPs, and delivers competitive performance on Cityscapes at markedly lower compute. For example, ARTA-Base attains 54.6 mIoU on ADE20K in the ~100M-parameter class while using fewer FLOPs and less memory than comparable backbones.
[120] GeoGuide: Hierarchical Geometric Guidance for Open-Vocabulary 3D Semantic Segmentation
Xujing Tao,Chuxin Wang,Yubo Ai,Zhixin Cheng,Zhuoyuan Li,Liangsheng Liu,Yujia Chen,Xinjun Li,Qiao Li,Wenfei Yang,Tianzhu Zhang
Main category: cs.CV
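TL;DR: This paper proposes GeoGuide, a framework that leverages pretrained 3D models for open-vocabulary 3D semantic segmentation; uncertainty-guided superpoint distillation, instance-level mask reconstruction, and inter-instance relation consistency modules improve geometry-semantic consistency, and the method significantly outperforms existing approaches.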
Details
Motivation: Existing methods rely on distilling knowledge from 2D open-vocabulary models, but aligning 3D features to the 2D representation space restricts intrinsic 3D geometric learning and inherits 2D prediction errors. Method: Proposes GeoGuide with three core modules: 1) an uncertainty-based superpoint distillation module that fuses geometric and semantic features and adaptively weights 2D features; 2) an instance-level mask reconstruction module that uses geometric priors to strengthen semantic consistency within instances; 3) an inter-instance relation consistency module that aligns geometric and semantic similarity matrices to mitigate viewpoint-induced semantic drift. Result: Experiments on ScanNet v2, Matterport3D, and nuScenes show that GeoGuide significantly outperforms existing methods. Conclusion: By modeling multi-level consistency between 3D geometric priors and semantic information, GeoGuide overcomes the limitations of relying on 2D models and offers a new approach to open-vocabulary 3D semantic segmentation. Abstract: Open-vocabulary 3D semantic segmentation aims to segment arbitrary categories beyond the training set. Existing methods predominantly rely on distilling knowledge from 2D open-vocabulary models. However, aligning 3D features to the 2D representation space restricts intrinsic 3D geometric learning and inherits errors from 2D predictions. To address these limitations, we propose GeoGuide, a novel framework that leverages pretrained 3D models to integrate hierarchical geometry-semantic consistency for open-vocabulary 3D segmentation. Specifically, we introduce an Uncertainty-based Superpoint Distillation module to fuse geometric and semantic features for estimating per-point uncertainty, adaptively weighting 2D features within superpoints to suppress noise while preserving discriminative information to enhance local semantic consistency. Furthermore, our Instance-level Mask Reconstruction module leverages geometric priors to enforce semantic consistency within instances by reconstructing complete instance masks. Additionally, our Inter-Instance Relation Consistency module aligns geometric and semantic similarity matrices to calibrate cross-instance consistency for same-category objects, mitigating viewpoint-induced semantic drift. Extensive experiments on ScanNet v2, Matterport3D, and nuScenes demonstrate the superior performance of GeoGuide.
[121] GLASS: Geometry-aware Local Alignment and Structure Synchronization Network for 2D-3D Registration
Zhixin Cheng,Jiacheng Deng,Xinjun Li,Bohao Liao,Li Liu,Xiaotian Yin,Baoqun Yin,Tianzhu Zhang
Main category: cs.CV
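TL;DR: This paper proposes a new image-to-point-cloud registration method that improves matching accuracy through Local Geometry Enhancement (LGE) and Graph Distribution Consistency (GDC) modules, mitigating insufficient structural cues and structural inconsistency particularly in scenes with repetitive textures, and reaches state-of-the-art results on RGB-D Scenes v2 and 7-Scenes.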
Details
Motivation: In scenes with repetitive textures, existing image-to-point-cloud registration methods are prone to incorrect matches because images lack 3D structural cues, alignment is difficult, and structural consistency is overlooked. Method: Proposes two new modules: 1) Local Geometry Enhancement (LGE), which enhances image and point cloud features with normal vectors, injecting geometric structure into image features; 2) Graph Distribution Consistency (GDC), which builds a graph from matched points, updates features, and explicitly constrains similarity distributions. Result: Extensive experiments and ablations on two benchmarks, RGB-D Scenes v2 and 7-Scenes, show state-of-the-art performance in image-to-point-cloud registration. Conclusion: LGE and GDC jointly improve geometric awareness and structural-consistency modeling, reducing mismatches in scenes with repetitive patterns and offering a new approach to cross-modal image-to-point-cloud registration. Abstract: Image-to-point cloud registration methods typically follow a coarse-to-fine pipeline, extracting patch-level correspondences and refining them into dense pixel-to-point matches. However, in scenes with repetitive patterns, images often lack sufficient 3D structural cues and alignment with point clouds, leading to incorrect matches. Moreover, prior methods usually overlook structural consistency, limiting the full exploitation of correspondences. To address these issues, we propose two novel modules: the Local Geometry Enhancement (LGE) module and the Graph Distribution Consistency (GDC) module. LGE enhances both image and point cloud features with normal vectors, injecting geometric structure into image features to reduce mismatches. GDC constructs a graph from matched points to update features and explicitly constrain similarity distributions. Extensive experiments and ablations on two benchmarks, RGB-D Scenes v2 and 7-Scenes, demonstrate that our approach achieves state-of-the-art performance in image-to-point cloud registration.
[122] DRUM: Diffusion-based Raydrop-aware Unpaired Mapping for Sim2Real LiDAR Segmentation
Tomoya Miyawaki,Kazuto Nakashima,Yumi Iwashita,Ryo Kurazume
Main category: cs.CV
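TL;DR: This paper proposes DRUM, a framework that uses a pretrained diffusion model as a generative prior to translate synthetic LiDAR data into data closer to the real world; by modeling reflectance intensity and raydrop noise it narrows the sim-to-real domain gap and significantly improves semantic segmentation performance.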
Details
Motivation: Large-scale annotation of LiDAR point clouds is costly; simulated data provides labels, but a data-level domain gap makes models perform poorly on real data. Method: Proposes the DRUM Sim2Real translation framework: a diffusion model pre-trained on unlabeled real data serves as a generative prior that models two key measurement characteristics, reflectance intensity and raydrop noise; a raydrop-aware masked guidance mechanism balances consistency with the input synthetic data against preservation of realistic raydrop noise. Result: Experiments show that DRUM consistently improves Sim2Real semantic segmentation performance across multiple LiDAR data representations. Conclusion: By combining a diffusion prior with physically motivated noise modeling, DRUM narrows the domain gap between simulated and real LiDAR data and offers a feasible route to low-cost, high-quality LiDAR semantic segmentation. Abstract: LiDAR-based semantic segmentation is a key component for autonomous mobile robots, yet large-scale annotation of LiDAR point clouds is prohibitively expensive and time-consuming. Although simulators can provide labeled synthetic data, models trained on synthetic data often underperform on real-world data due to a data-level domain gap. To address this issue, we propose DRUM, a novel Sim2Real translation framework. We leverage a diffusion model pre-trained on unlabeled real-world data as a generative prior and translate synthetic data by reproducing two key measurement characteristics: reflectance intensity and raydrop noise. To improve sample fidelity, we introduce a raydrop-aware masked guidance mechanism that selectively enforces consistency with the input synthetic data while preserving realistic raydrop noise induced by the diffusion prior. Experimental results demonstrate that DRUM consistently improves Sim2Real performance across multiple representations of LiDAR data. The project page is available at https://miya-tomoya.github.io/drum.
[123] PhysVid: Physics Aware Local Conditioning for Generative Video Models
Saurabh Pathak,Elahe Arani,Mykola Pechenizkiy,Bahram Zonooz
Main category: cs.CV
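TL;DR: PhysVid proposes a physics-aware local conditioning method that improves the physical plausibility of generated video through physics-annotated frame chunks and negative physics prompts.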
Details
Motivation: Existing generative video models achieve high visual fidelity but often violate basic physical laws, and earlier physics-injection methods based on frame-level signals or global text prompts are either domain-specific and short-horizon or coarse and noisy. Method: Proposes PhysVid: during training, temporally contiguous frame chunks are annotated with fine-grained descriptions of physical states, interactions, and constraints, which are fused with the global prompt via chunk-aware cross-attention; at inference, negative physics prompts (describing local violations of physical laws) suppress implausible trajectories. Result: On the VideoPhy benchmark, PhysVid improves physical commonsense scores by about 33% over baseline models, and by up to about 8% on VideoPhy2. Conclusion: Local, physics-aware conditioning substantially improves the physical plausibility of generated video and is a step toward physics-grounded video generation models. Abstract: Generative video models achieve high visual fidelity but often violate basic physical principles, limiting reliability in real-world settings. Prior attempts to inject physics rely on conditioning: frame-level signals are domain-specific and short-horizon, while global text prompts are coarse and noisy, missing fine-grained dynamics. We present PhysVid, a physics-aware local conditioning scheme that operates over temporally contiguous chunks of frames. Each chunk is annotated with physics-grounded descriptions of states, interactions, and constraints, which are fused with the global prompt via chunk-aware cross-attention during training. At inference, we introduce negative physics prompts (descriptions of locally relevant law violations) to steer generation away from implausible trajectories. On VideoPhy, PhysVid improves physical commonsense scores by $\approx 33\%$ over baseline video generators, and by up to $\approx 8\%$ on VideoPhy2. These results show that local, physics-aware guidance substantially increases physical plausibility in generative video and marks a step toward physics-grounded video models.
[124] Preference-Aligned LoRA Merging: Preserving Subspace Coverage and Addressing Directional Anisotropy
Wooseong Jeong,Wonyoung Lee,Kuk-Jin Yoon
Main category: cs.CV
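TL;DR: This paper proposes TARA-Merging, which merges LoRA modules with direction-wise reweighting driven by a task-preference-weighted pseudo cross-entropy loss; it preserves task-relevant subspaces while broadening subspace coverage and mitigating the uneven influence of directions (anisotropy), yielding better multi-task generalization.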
Details
Motivation: In existing LoRA merging methods, module update directions lie in different subspaces and contribute unevenly, so critical task directions are weakened and minor ones amplified, harming multi-task representation. Method: Proposes TARA-Merging (Task-Rank Anisotropy Alignment), which, from the two perspectives of subspace coverage and anisotropy, aligns direction-wise merging weights using a preference-weighted cross-entropy pseudo-loss, preserving task-relevant subspaces while recalibrating direction importance. Result: On eight vision and six natural language inference benchmarks, TARA-Merging consistently outperforms baselines (both vanilla and LoRA-aware merging), showing stronger robustness and generalization. Conclusion: LoRA merging needs to jointly address subspace coverage and balanced directional influence (that is, mitigate anisotropy); TARA-Merging offers an effective way to do so. Abstract: Merging multiple Low-Rank Adaptation (LoRA) modules is promising for constructing general-purpose systems, yet challenging because LoRA update directions span different subspaces and contribute unevenly. When merged naively, such mismatches can weaken the directions most critical to certain task losses while overemphasizing relatively less important ones, ultimately reducing the model's ability to represent all tasks faithfully. We revisit this problem through two perspectives: subspace coverage, which captures how broadly LoRA directions cover diverse representational directions, and anisotropy, which reflects the imbalance of influence across those directions. We propose TARA-Merging (Task-Rank Anisotropy Alignment), which aligns merging weights using a preference-weighted cross-entropy pseudo-loss while preserving task-relevant LoRA subspaces. This ensures broad subspace coverage and mitigates anisotropy via direction-wise reweighting. Across eight vision and six NLI benchmarks, TARA-Merging consistently outperforms vanilla and LoRA-aware baselines, demonstrating strong robustness and generalization, and highlighting the importance of addressing both subspace coverage and anisotropy in LoRA merging.
[125] SALMUBench: A Benchmark for Sensitive Association-Level Multimodal Unlearning
Cai Selvas-Sala,Lei Kang,Lluis Gomez
Main category: cs.CV
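TL;DR: This paper proposes SALMUBench, a new benchmark for evaluating sensitive association-level unlearning in contrastively trained multimodal models such as CLIP; it comprises a synthetic dataset, compromised and clean models, and a new structured evaluation protocol, and reveals that current unlearning methods either forget too little or over-generalize.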
Details
Motivation: As multimodal models such as CLIP are widely used in downstream systems, sensitive information needs to be removed safely; yet machine unlearning for contrastively trained encoders is largely unexplored, and existing evaluations cannot diagnose fine-grained association-level forgetting. Method: Builds the SALMUBench benchmark: a synthetic dataset of 60K persona-attribute associations; two models trained from scratch on the same 400M-pair retain base (Compromised, with the sensitive data, and Clean, without it); and structured holdout sets (holdout identity, holdout association) for precisely measuring unlearning efficacy and side effects. Result: Utility-efficient deletion of sensitive information is feasible, but current unlearning methods show two typical failure modes: insufficient forgetting, or over-generalization that erases non-target information. Conclusion: SALMUBench provides the first comprehensive, reproducible evaluation standard for association-level machine unlearning in multimodal models, and all resources are released publicly to support further research. Abstract: As multimodal models like CLIP become integral to downstream systems, the need to remove sensitive information is critical. However, machine unlearning for contrastively trained encoders remains underexplored, and existing evaluations fail to diagnose fine-grained, association-level forgetting. We introduce SALMUBench (Sensitive Association-Level Multimodal Unlearning), a benchmark built upon a synthetic dataset of 60K persona-attribute associations and two foundational models: a Compromised model polluted with this data, and a Clean model without it. To isolate unlearning effects, both are trained from scratch on the same 400M-pair retain base, with the Compromised model additionally trained on the sensitive set. We propose a novel evaluation protocol with structured holdout sets (holdout identity, holdout association) to precisely measure unlearning efficacy and collateral damage. Our benchmark reveals that while utility-efficient deletion is feasible, current methods exhibit distinct failure modes: they either fail to forget effectively or over-generalize by erasing more than intended. SALMUBench sets a new standard for comprehensive unlearning evaluation, and we publicly release our dataset, models, evaluation scripts, and leaderboards to foster future research.
[126] Label-Free Cross-Task LoRA Merging with Null-Space Compression
Wonyoung Lee,Wooseong Jeong,Kuk-Jin Yoon
Main category: cs.CV
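TL;DR: This paper proposes Null-Space Compression (NSC) Merging, a label-free, output-agnostic technique for merging LoRA adapters that uses adapter geometry (specifically, how much the down-projection matrix A compresses its null space) as the signal for optimizing merge weights, and performs well on classification, regression, and generation tasks.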
Details
Motivation: Existing LoRA merging methods perform poorly on heterogeneous tasks (e.g., mixed classification and regression), and entropy-based surrogates do not apply to regression and are computationally costly; a general, efficient, task-agnostic merging strategy is needed. Method: Proposes Null-Space Compression (NSC) Merging: based on the observation that the down-projection matrix A compresses its null space during LoRA fine-tuning, the degree of compression is used as a performance-correlated signal to set each adapter's merge weight adaptively; the procedure needs no labels and does not depend on the output form. Result: Reaches state-of-the-art results on 20 heterogeneous vision tasks with balanced gains and without overfitting to task subsets; it also clearly outperforms baselines on six NLI benchmarks and on vision-language tasks such as VQA and image captioning, supporting its generality and scalability. Conclusion: NSC is a simple, geometry-driven, task-agnostic approach to LoRA merging that overcomes the limitations of existing methods in task heterogeneity and computational cost, offering a reliable and scalable option for multi-task model merging. Abstract: Model merging combines independently fine-tuned checkpoints without joint multi-task training. In the era of foundation models, fine-tuning with Low-Rank Adaptation (LoRA) is prevalent, making LoRA merging a promising target. Existing approaches can work in homogeneous settings where all target tasks are classification but often fail when tasks span classification and regression. Approaches using entropy-based surrogates do not apply to regression and are costly for large language models due to long token sequences. We introduce Null-Space Compression (NSC) Merging, a label-free, output-agnostic method that sets merge weights from adapter geometry. Our key observation is that during LoRA fine-tuning the down-projection factor $A$ in $\Delta W = BA$ compresses its null space, and the compression correlates with performance. NSC uses this as an optimization signal for merging that can generalize across classification, regression, and sequence generation. NSC achieves state-of-the-art performance across twenty heterogeneous vision tasks with balanced gains where prior methods overfit subsets of tasks. It also outperforms baselines on six NLI benchmarks and on vision-language evaluations for VQA and image captioning, demonstrating scalability and effectiveness.
[127] Verify Claimed Text-to-Image Models via Boundary-Aware Prompt Optimization
Zidong Zhao,Yihao Huang,Qing Guo,Tianlin Li,Anran Li,Kailong Wang,Jin Song Dong,Geguang Pu
Main category: cs.CV
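TL;DR: This paper proposes Boundary-aware Prompt Optimization (BPO), a reference-free verification method for text-to-image (T2I) models that identifies a model by exploiting differences in output instability near semantic boundaries across T2I models, significantly improving verification accuracy.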
Details
Motivation: Third-party platforms often falsely claim to use an official T2I model, harming user trust and model owners' reputations; existing verification methods rely on multiple reference models to generate verification prompts, which is computationally expensive and sensitive to the choice of reference models. Method: Proposes reference-free Boundary-aware Prompt Optimization (BPO), which searches the embedding space for prompts near the semantic boundary between two concepts (e.g., 'corgi' and 'bagel'); the target model produces unstable outputs in that region (fluctuating between the two classes) while other models remain stable, which yields model-specific verification cues. Result: Experiments on five mainstream T2I models against four baselines show that BPO significantly outperforms existing methods, with higher verification accuracy. Conclusion: BPO is an efficient, robust, reference-free approach to T2I model verification and offers a practical way to assess whether an API service is what it claims to be. Abstract: As Text-to-Image (T2I) generation becomes widespread, third-party platforms increasingly integrate multiple model APIs for convenient image creation. However, false claims of using official models can mislead users and harm model owners' reputations, making model verification essential to confirm whether an API's underlying model matches its claim. Existing methods address this by using verification prompts generated by official model owners, but the generation relies on multiple reference models for optimization, leading to high computational cost and sensitivity to model selection. To address this problem, we propose a reference-free T2I model verification method called Boundary-aware Prompt Optimization (BPO). It directly explores the intrinsic characteristics of the target model. The key insight is that although different T2I models produce similar outputs for normal prompts, their semantic boundaries in the embedding space (transition zones between two concepts such as "corgi" and "bagel") are distinct. Prompts near these boundaries generate unstable outputs (e.g., sometimes a corgi and sometimes a bagel) on the target model but remain stable on other models. By identifying such boundary-adjacent prompts, BPO captures model-specific behaviors that serve as reliable verification cues for distinguishing T2I models. Experiments on five T2I models and four baselines demonstrate that BPO achieves superior verification accuracy.
[128] Mitigating the Reasoning Tax in Vision-Language Fine-Tuning with Input-Adaptive Depth Aggregation
Yiming Ren,Yujiu Yang,Junjie Wang
Main category: cs.CV
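TL;DR: This paper proposes Input-Adaptive Depth Aggregation (IADA), a lightweight mechanism that restores access to cross-depth representations in vision-language models (VLMs), mitigating the reasoning degradation caused by supervised fine-tuning (SFT) and substantially raising reasoning scores while maintaining or even improving perception.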
Details
Motivation: Supervised fine-tuning (SFT) improves VLM perception but often degrades reasoning (the 'reasoning tax'); the authors investigate whether the cause is disrupted access to depth-wise representations. Method: After finding that even fixed cross-depth aggregation substantially restores reasoning, proposes Input-Adaptive Depth Aggregation (IADA), a lightweight cross-depth aggregation mechanism that is input-adaptive, modality-aware, and low-rank parameterized. Result: On Qwen3-VL-2B, IADA improves the average reasoning score by 9.5 points and the average perception score by 3.3 points over LoRA-only fine-tuning with only 0.14M additional parameters, with the largest gains in parameter-efficient low-rank settings. Conclusion: Preserving and strengthening access to cross-depth representations is key to improving reasoning in VLM fine-tuning; IADA offers an efficient, scalable way to mitigate the reasoning tax. Abstract: Supervised fine-tuning (SFT) on visual instruction data often improves perceptual capabilities in vision-language models (VLMs) while degrading reasoning performance, creating a persistent reasoning tax during post-training. We investigate whether this degradation is related to disrupted access to depth-wise representations, and find that even fixed cross-depth aggregation substantially restores reasoning, suggesting that preserved cross-depth access is an important missing factor in VLM fine-tuning. Building on this observation, we propose Input-Adaptive Depth Aggregation (IADA), a lightweight mechanism that makes cross-depth retrieval input-adaptive, modality-aware, and efficiently parameterized through a low-rank bottleneck. On Qwen3-VL-2B, IADA improves the average reasoning score by 9.5 points and the average perception score by 3.3 points over LoRA-only fine-tuning with only 0.14M additional parameters, with the strongest gains appearing in parameter-efficient low-rank settings.
[129] From Pixels to Privacy: Temporally Consistent Video Anonymization via Token Pruning for Privacy Preserving Action Recognition
Nazia Aslam,Abhisek Ray,Joakim Bruslund Haurum,Lukas Esterle,Kamal Nasrollahi
Main category: cs.CV
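TL;DR: This paper proposes an attention-based spatiotemporal video anonymization framework that adds two classification tokens to a ViT (an action CLS and a privacy CLS) to disentangle utility from privacy features and selectively keeps key tubelets according to a utility-privacy score, preserving action recognition performance while markedly reducing privacy leakage.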
Details
Motivation: Large-scale video models have improved video understanding but also amplify privacy risks (leakage of sensitive attributes such as face, race, and gender); video anonymization is far less studied than image anonymization, and spatiotemporal motion patterns in video can themselves be exploited as biometric identifiers. Method: Proposes an attention-driven spatiotemporal video anonymization framework: two task-specific classification tokens (an action CLS token and a privacy CLS token) are added to a ViT with a shared backbone; their attention distributions are contrasted to compute a utility-privacy score for each spatiotemporal tubelet; only the top-k highest-scoring tubelets are kept, selectively pruning privacy-sensitive tubelets. Result: Experiments show action recognition performance close to that of models trained on raw video, with substantially reduced privacy leakage, supporting the effectiveness of attention-driven spatiotemporal pruning for privacy-preserving video analytics. Conclusion: Attention can be explicitly structured to separate utility from privacy information; the dual CLS tokens and tubelet-level utility-privacy scoring provide an effective, principled approach to video privacy protection. Abstract: Recent advances in large-scale video models have significantly improved video understanding across domains such as surveillance, healthcare, and entertainment. However, these models also amplify privacy risks by encoding sensitive attributes, including facial identity, race, and gender. While image anonymization has been extensively studied, video anonymization remains relatively underexplored, even though modern video models can leverage spatiotemporal motion patterns as biometric identifiers. To address this challenge, we propose a novel attention-driven spatiotemporal video anonymization framework based on systematic disentanglement of utility and privacy features. Our key insight is that attention mechanisms in Vision Transformers (ViTs) can be explicitly structured to separate action-relevant information from privacy-sensitive content. Building on this insight, we introduce two task-specific classification tokens, an action CLS token and a privacy CLS token, that learn complementary representations within a shared Transformer backbone. We contrast their attention distributions to compute a utility-privacy score for each spatiotemporal tubelet, and keep the top-k tubelets with the highest scores. This selectively prunes tubelets dominated by privacy cues while preserving those most critical for action recognition. Extensive experiments demonstrate that our approach maintains action recognition performance comparable to models trained on raw videos, while substantially reducing privacy leakage. 
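The tubelet selection step described above can be sketched in a few lines. The abstract says only that the two attention distributions are contrasted, so the scoring rule used here (action attention minus privacy attention) is an assumption made for illustration. The function name and the toy attention values are likewise hypothetical.

```python
def select_tubelets(action_attn, privacy_attn, k):
    """Return indices of the k tubelets with the highest utility-privacy score.

    Assumption: score = action attention - privacy attention. The abstract
    only states that the two attention distributions are contrasted.
    """
    if len(action_attn) != len(privacy_attn):
        raise ValueError("attention vectors must have the same length")
    scores = [a - p for a, p in zip(action_attn, privacy_attn)]
    # Rank tubelet indices by descending score, keep the first k,
    # then return the kept indices in their original order.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy attention weights over five tubelets (made-up numbers).
action_attn = [0.40, 0.05, 0.30, 0.15, 0.10]
privacy_attn = [0.10, 0.50, 0.05, 0.25, 0.10]

kept = select_tubelets(action_attn, privacy_attn, k=2)
print(kept)
```

Under this scoring rule, tubelets that the privacy token attends to strongly and the action token attends to weakly receive the lowest scores and are pruned first.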
These results indicate that attention-driven spatiotemporal pruning offers an effective and principled solution for privacy-preserving video analytics.
[130] HINT: Composed Image Retrieval with Dual-path Compositional Contextualized Network
Mingyu Zhang,Zixu Li,Zhiwei Chen,Zhiheng Fu,Xiaowei Zhu,Jiajia Nie,Yinwei Wei,Yupeng Hu
Main category: cs.CV
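TL;DR: This paper proposes HINT, a dual-path contextualized network that improves the modeling of contextual information in composed image retrieval (CIR) and amplifies the similarity gap between matching and non-matching samples, achieving the best performance on two benchmark datasets.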
Details
Motivation: Existing CIR methods overlook the key role of contextual information in discriminating matching samples, and face two challenges: implicit dependencies and the lack of a differential amplification mechanism. Method: Proposes the dual-path compositional contextualized network (HINT), which performs contextualized encoding and amplifies similarity differences between matching and non-matching samples. Result: HINT achieves the best performance on all metrics across two CIR benchmark datasets. Conclusion: HINT addresses insufficient context modeling and weak similarity discrimination in CIR, improving retrieval performance in complex scenarios. Abstract: Composed Image Retrieval (CIR) is a challenging image retrieval paradigm. It aims to retrieve target images from large-scale image databases that are consistent with the modification semantics, based on a multimodal query composed of a reference image and modification text. Although existing methods have made significant progress in cross-modal alignment and feature fusion, a key flaw remains: the neglect of contextual information in discriminating matching samples. However, addressing this limitation is not an easy task due to two challenges: 1) implicit dependencies and 2) the lack of a differential amplification mechanism. To address these challenges, we propose a dual-patH composItional coNtextualized neTwork (HINT), which can perform contextualized encoding and amplify the similarity differences between matching and non-matching samples, thus raising the upper bound on the performance of CIR models in complex scenarios. HINT achieves the best performance on all metrics across two CIR benchmark datasets. Code is available at https://github.com/zh-mingyu/HINT.
[131] Reflect to Inform: Boosting Multimodal Reasoning via Information-Gain-Driven Verification
Shuai Lv,Chang Liu,Feng Tang,Yujie Yuan,Aojun Zhou,Kui Zhang,Xi Yang,Yangqiu Song
Main category: cs.CV
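TL;DR: This paper proposes the Visual Re-Examination (VRE) framework, in which the model performs autonomous visual reflection to mitigate hallucinations that arise when multimodal large language models drift away from image evidence during long-form generation.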
Details
Motivation: MLLMs tend to drift away from image evidence and rely on textual priors in long-form generation, which causes hallucinations; attention analysis, however, shows that they have a late-stage visual verification capability that is not being activated.
Method: VRE is a self-evolving training framework in which the MLLM autonomously performs visual introspection during reasoning and uses its own generated reflection traces to make visual information actionable, without additional visual inputs or distillation from a stronger teacher.
Result: Reasoning accuracy and perceptual reliability improve significantly on multiple multimodal benchmarks, and hallucinations in long-chain reasoning are reduced substantially.
Conclusion: VRE effectively activates the latent visual verification capability of MLLMs. It is a self-evolving enhancement method that needs no external supervision and offers a new paradigm for improving the faithfulness of multimodal generation.
Abstract: Multimodal Large Language Models (MLLMs) achieve strong multimodal reasoning performance, yet we identify a recurring failure mode in long-form generation: as outputs grow longer, models progressively drift away from image evidence and fall back on textual priors, resulting in ungrounded reasoning and hallucinations. Interestingly, attention analysis shows that MLLMs have a latent capability for late-stage visual verification that is present but not consistently activated. Motivated by this observation, we propose Visual Re-Examination (VRE), a self-evolving training framework that enables MLLMs to autonomously perform visual introspection during reasoning without additional visual inputs. Rather than distilling visual capabilities from a stronger teacher, VRE promotes iterative self-improvement by leveraging the model itself to generate reflection traces, making visual information actionable through information gain. Extensive experiments across diverse multimodal benchmarks demonstrate that VRE consistently improves reasoning accuracy and perceptual reliability, while substantially reducing hallucinations, especially in long-chain settings. Code is available at https://github.com/Xiaobu-USTC/VRE.
[132] DuSCN-FusionNet: An Interpretable Dual-Channel Structural Covariance Fusion Framework for ADHD Classification Using Structural MRI
Qurat Ul Ain,Alptekin Temizel,Soyiba Jawed
Main category: cs.CV
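TL;DR: This paper proposes DuSCN-FusionNet, an interpretable sMRI framework that classifies ADHD using dual-channel structural covariance networks (SCNs) combining regional intensity and heterogeneity features, and uses Grad-CAM to localize potential anatomical biomarkers.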
Details
Motivation: ADHD lacks reliable imaging biomarkers, anatomical markers in particular; existing deep learning models are mostly black boxes with insufficient clinical trustworthiness and interpretability.
Method: Dual-channel structural covariance networks (SCNs) are built from ROI-wise mean intensity and intra-regional variability and fed to an SCN-CNN encoder; auxiliary ROI variability features and global statistical descriptors are integrated via late fusion; evaluation uses stratified 10-fold cross-validation with a 5-seed ensemble strategy; Grad-CAM is adapted to SCNs to produce ROI-level importance scores.
Result: On the Peking University site of the ADHD-200 dataset, the model reaches a mean balanced accuracy of 80.59% and an AUC of 0.778, with precision, recall, and F1-score of 81.66%, 80.59%, and 80.27%, respectively; several structurally relevant brain regions are identified as potential biomarkers.
Conclusion: DuSCN-FusionNet balances classification performance and interpretability, offering a path toward neuroimaging-based ADHD diagnosis that is both clinically practical and biologically meaningful.
Abstract: Attention Deficit Hyperactivity Disorder (ADHD) is a highly prevalent neurodevelopmental condition; however, its neurobiological diagnosis remains challenging due to the lack of reliable imaging-based biomarkers, particularly anatomical markers. Structural MRI (sMRI) provides a non-invasive modality for investigating brain alterations associated with ADHD; nevertheless, most deep learning approaches function as black-box systems, limiting clinical trust and interpretability. In this work, we propose DuSCN-FusionNet, an interpretable sMRI-based framework for ADHD classification that leverages dual-channel Structural Covariance Networks (SCNs) to capture inter-regional morphological relationships. ROI-wise mean intensity and intra-regional variability descriptors are used to construct intensity-based and heterogeneity-based SCNs, which are processed through an SCN-CNN encoder. In parallel, auxiliary ROI-wise variability features and global statistical descriptors are integrated via late-stage fusion to enhance performance. The model is evaluated using stratified 10-fold cross-validation with a 5-seed ensemble strategy, achieving a mean balanced accuracy of 80.59% and an AUC of 0.778 on the Peking University site of the ADHD-200 dataset. DuSCN-FusionNet further achieves precision, recall, and F1-scores of 81.66%, 80.59%, and 80.27%, respectively.
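Balanced accuracy, the headline metric in this entry, is the mean of the per-class recalls, so it is not inflated by class imbalance the way plain accuracy can be. The sketch below shows the computation on made-up labels; the function name and the toy data are illustrative assumptions, not the authors' evaluation code.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical labels (1 = ADHD, 0 = control); not data from the paper.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

score = balanced_accuracy(y_true, y_pred)
plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Here the two classes have recalls of 3/4 and 4/6, so the balanced score differs from the plain accuracy of 7/10.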
Moreover, Grad-CAM is adapted to the SCN domain to derive ROI-level importance scores, enabling the identification of structurally relevant brain regions as potential biomarkers.
[133] Only Whats Necessary: Pareto Optimal Data Minimization for Privacy Preserving Video Anomaly Detection
Nazia Aslam,Abhisek Ray,Thomas B. Moeslund,Kamal Nasrollahi
Main category: cs.CV
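TL;DR: This paper proposes "Only What's Necessary", a privacy-by-design framework for video anomaly detection (VAD) that uses breadth-based and depth-based data minimization mechanisms to protect personally identifiable information (PII) while maintaining detection performance.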
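This entry's abstract selects operating points from the non-dominated (Pareto) frontier of privacy versus utility. A minimal sketch of that selection step follows, assuming each configuration is scored by a privacy leakage value (lower is better) and a detection utility value (higher is better). The configuration names and numbers are hypothetical, not results from the paper.

```python
def pareto_front(points):
    """Return the names of non-dominated configurations.

    Each point is (name, leakage, utility). A point is dominated when
    another point is at least as good on both axes and strictly better
    on one (lower leakage, higher utility).
    """
    front = []
    for name, leak, util in points:
        dominated = any(
            (l2 <= leak and u2 >= util) and (l2 < leak or u2 > util)
            for _, l2, u2 in points
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical minimization configurations: (name, leakage, utility).
configs = [
    ("raw", 0.90, 0.85),
    ("blur", 0.40, 0.82),
    ("mask", 0.35, 0.70),
    ("mask+blur", 0.45, 0.75),
    ("silhouette", 0.10, 0.72),
]

front = pareto_front(configs)
```

In this toy set, "mask" and "mask+blur" are dropped because another configuration leaks less while detecting at least as well; a "sweet spot" would then be picked among the remaining frontier points.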
Details
Motivation: VAD systems are widely deployed in safety-critical settings, but training and running them depends on large amounts of video containing sensitive personal information (such as faces and demographic attributes), which conflicts with the GDPR principle of data minimization.
Method: A privacy-by-design VAD framework combines breadth-based (e.g., region masking) and depth-based (e.g., feature perturbation) data minimization mechanisms; configurations are evaluated with both a VAD model and a privacy inference model, and ranking methods together with Pareto analysis characterize the privacy-utility trade-off and locate the best operating points.
Result: Experiments on public datasets show that the framework substantially reduces PII exposure with only limited loss in detection performance, and identifies several "sweet spots" between privacy protection and detection utility.
Conclusion: The "Only What's Necessary" framework achieves GDPR-compliant data minimization and provides a feasible, practical solution for video analytics in privacy-sensitive settings.
Abstract: Video anomaly detection (VAD) systems are increasingly deployed in safety-critical environments and require a large amount of data for accurate detection. However, such data may contain personally identifiable information (PII), including facial cues and sensitive demographic attributes, creating compliance challenges under the EU General Data Protection Regulation (GDPR). In particular, GDPR requires that personal data be limited to what is strictly necessary for a specified processing purpose. To address this, we introduce Only What's Necessary, a privacy-by-design framework for VAD that explicitly controls the amount and type of visual information exposed to the detection pipeline. The framework combines breadth-based and depth-based data minimization mechanisms to suppress PII while preserving cues relevant to anomaly detection. We evaluate a range of minimization configurations by feeding the minimized videos to both a VAD model and a privacy inference model. We employ two ranking-based methods, along with Pareto analysis, to characterize the resulting trade-off between privacy and utility. From the non-dominated frontier, we identify sweet-spot operating points that minimize personal data exposure with limited degradation in detection performance. Extensive experiments on publicly available datasets demonstrate the effectiveness of the proposed framework.
[134] From Pen to Pixel: Translating Hand-Drawn Plots into Graphical APIs via a Novel Benchmark and Efficient Adapter
Zhenghao Xu,Mengning Yang
Main category: cs.CV
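TL;DR: This paper introduces HDpy-13, a hand-drawn plot dataset, and Plot-Adapter, a lightweight adapter, to make graphical API recommendation from hand-drawn plots more accurate and efficient for non-expert users.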
Details
Motivation: Existing Plot2API methods mainly target standard plot images and struggle with hand-drawn plot images, which are easier to obtain; multimodal large models also perform poorly because of the domain gap and a lack of specialized knowledge.
Method: The hand-drawn plot dataset HDpy-13 is built, and Plot-Adapter is proposed; it includes a lightweight CNN block to strengthen local feature extraction and shares projection matrices to reduce the number of fine-tuned parameters.
Result: Experiments confirm the effectiveness of HDpy-13 and the efficiency of Plot-Adapter in multi-domain, multi-language settings, with markedly lower parameter growth and computational overhead.
Conclusion: Together, HDpy-13 and Plot-Adapter improve the practicality and scalability of recommending graphical APIs from hand-drawn plots, lowering the barrier to data visualization for non-experts.
Abstract: As plots play a critical role in modern data visualization and analysis, Plot2API was introduced to help non-experts and beginners create their desired plots by using neural networks to recommend graphical APIs directly from reference plot images. However, previous works on Plot2API have primarily focused on recommendation for standard plot images, while overlooking the hand-drawn plot images that are more accessible to non-experts and beginners. To make matters worse, both Plot2API models trained on standard plot images and powerful multi-modal large language models struggle to effectively recommend APIs for hand-drawn plot images due to the domain gap and lack of expertise. To support non-experts and beginners, we introduce a hand-drawn plot dataset named HDpy-13 to improve the performance of graphical API recommendations for hand-drawn plot images. Additionally, to alleviate the considerable strain of parameter growth and computational resource costs arising from multi-domain and multi-language challenges in Plot2API, we propose Plot-Adapter, which allows for the training and storage of separate adapters rather than requiring an entire model for each language and domain. In particular, Plot-Adapter incorporates a lightweight CNN block to improve the ability to capture local features and implements projection matrix sharing to further reduce the number of fine-tuning parameters. Experimental results demonstrate both the effectiveness of HDpy-13 and the efficiency of Plot-Adapter.
[135] MPDiT: Multi-Patch Global-to-Local Transformer Architecture For Efficient Flow Matching and Diffusion Model
Quan Dao,Dimitris Metaxas
Main category: cs.CV
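TL;DR: This paper proposes MPDiT, a multi-scale patch Transformer architecture in which early layers use large image patches to capture coarse global information and later layers use small patches to refine local details, cutting computational cost substantially (up to 50% fewer GFLOPs) while keeping strong generative performance; it also improves the time and class embedding designs to speed up training convergence.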
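The saving in a multi-patch design comes from the token count: doubling the patch side length quarters the number of tokens a block has to process. The short sketch below illustrates that arithmetic; the 32x32 latent grid and the patch sizes 4 and 2 are assumed values for illustration, not the configuration reported by the authors.

```python
def num_tokens(height, width, patch):
    """Number of patch tokens for a height x width grid and square patches."""
    if height % patch or width % patch:
        raise ValueError("patch size must divide the input size")
    return (height // patch) * (width // patch)


# Assumed 32x32 latent grid (illustrative only).
coarse = num_tokens(32, 32, 4)  # larger patches, as in early blocks
fine = num_tokens(32, 32, 2)    # smaller patches, as in later blocks
ratio = fine // coarse          # how many times more tokens the fine blocks see
```

Because self-attention cost grows faster than linearly in the number of tokens, running part of the network at the coarse resolution reduces total compute by more than the token ratio alone would suggest for those blocks.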
Details
Motivation: Existing Diffusion Transformers (DiTs) use an isotropic design in which every layer processes the same number of patch tokens, making training computationally expensive.
Method: A multi-patch Transformer (MPDiT) is proposed: early blocks use larger image patches to model global context, and later blocks progressively reduce the patch size to strengthen local detail modeling; the time and class embedding designs are also improved.
Result: On ImageNet, MPDiT reduces GFLOPs by up to 50% while matching or exceeding the generation quality of standard DiT, and training converges faster.
Conclusion: Through hierarchical patch scales and embedding optimizations, MPDiT maintains or even improves diffusion model generation performance at a much lower computational cost, making it an efficient and practical improvement on DiT.
Abstract: Transformer architectures, particularly Diffusion Transformers (DiTs), have become widely used in diffusion and flow-matching models due to their strong performance compared to convolutional UNets. However, the isotropic design of DiTs processes the same number of patchified tokens in every block, leading to relatively heavy computation during training. In this work, we introduce a multi-patch transformer design in which early blocks operate on larger patches to capture coarse global context, while later blocks use smaller patches to refine local details. This hierarchical design can reduce computational cost by up to 50% in GFLOPs while achieving good generative performance. In addition, we propose improved designs for time and class embeddings that accelerate training convergence. Extensive experiments on the ImageNet dataset demonstrate the effectiveness of our architectural choices. Code is released at https://github.com/quandao10/MPDiT.
[136] HandVQA: Diagnosing and Improving Fine-Grained Spatial Reasoning about Hands in Vision-Language Models
MD Khalequzzaman Chowdhury Sayem,Mubarrat Tajoar Chowdhury,Yihalem Yimolal Tiruneh,Muneeb A. Khan,Muhammad Salman Ali,Binod Bhattarai,Seungryul Baek
Main category: cs.CV
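TL;DR: This paper presents HandVQA, a benchmark for evaluating how well vision-language models (VLMs) understand fine-grained spatial relationships of the hand; it reveals systematic deficiencies in geometric reasoning and generalization, and shows that transferring 3D spatial knowledge significantly improves downstream tasks.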
Details
Motivation: Although current VLMs perform well on general vision-language tasks, they remain clearly inadequate in demanding scenarios that require fine-grained spatial reasoning, such as hand pose understanding, so a dedicated benchmark is needed for diagnosis and improvement.
Method: HandVQA, a large-scale diagnostic benchmark, is built on high-quality 3D hand datasets (FreiHAND, InterHand2.6M, and FPHA) and contains 1.6 million multiple-choice questions focused on spatial relationships between joints, such as angles, distances, and relative positions; mainstream VLMs such as LLaVA, DeepSeek, and Qwen-VL are fine-tuned with LoRA and evaluated systematically.
Result: Existing VLMs commonly hallucinate finger parts, misinterpret geometry, and generalize poorly; after training on HandVQA, models show zero-shot transfer gains on downstream hand gesture recognition (+10.33%) and hand-object interaction (+2.63%).
Conclusion: HandVQA exposes key deficiencies of VLMs in fine-grained hand spatial reasoning and provides a verifiable path toward introducing 3D structural priors and improving the spatial understanding of embodied AI.
Abstract: Understanding the fine-grained articulation of human hands is critical in high-stakes settings such as robot-assisted surgery, chip manufacturing, and AR/VR-based human-AI interaction. Despite achieving near-human performance on general vision-language benchmarks, current vision-language models (VLMs) struggle with fine-grained spatial reasoning, especially in interpreting complex and articulated hand poses. We introduce HandVQA, a large-scale diagnostic benchmark designed to evaluate VLMs' understanding of detailed hand anatomy through visual question answering. Built upon high-quality 3D hand datasets (FreiHAND, InterHand2.6M, FPHA), our benchmark includes over 1.6M controlled multiple-choice questions that probe spatial relationships between hand joints, such as angles, distances, and relative positions. We evaluate several state-of-the-art VLMs (LLaVA, DeepSeek, and Qwen-VL) in both base and fine-tuned settings, using lightweight fine-tuning via LoRA. Our findings reveal systematic limitations in current models, including hallucinated finger parts, incorrect geometric interpretations, and poor generalization. HandVQA not only exposes these critical reasoning gaps but also provides a validated path to improvement.
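The spatial quantities the benchmark questions probe (angles and distances between joints) can be derived directly from 3D keypoints. The sketch below shows one way to compute them; the joint names and coordinates are hypothetical, and the abstract does not specify how HandVQA actually turns such measurements into questions.

```python
import math


def joint_angle(a, b, c):
    """Angle in degrees at joint b formed by the 3D points a-b-c."""
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    n1 = math.sqrt(sum(x * x for x in v1))
    n2 = math.sqrt(sum(x * x for x in v2))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos = max(-1.0, min(1.0, dot / (n1 * n2)))
    return math.degrees(math.acos(cos))


def joint_distance(a, b):
    """Euclidean distance between two 3D joints."""
    return math.dist(a, b)


# Hypothetical finger joints (arbitrary units): a right-angle bend at `pip`.
mcp, pip, dip = (0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 2.0, 0.0)
bend = joint_angle(mcp, pip, dip)
length = joint_distance(mcp, pip)
```

A multiple-choice question could then bin `bend` into ranges such as "straight" or "bent", but that binning step is an assumption and is left out here.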
We demonstrate that the 3D-grounded spatial knowledge learned from our benchmark transfers in a zero-shot setting, significantly improving model accuracy on novel downstream tasks such as hand gesture recognition (+10.33%) and hand-object interaction (+2.63%).
[137] Dynamic Token Compression for Efficient Video Understanding through Reinforcement Learning
Shida Wang,YongXiang Hua,Zhou Tao,Haoyu Cao,Linli Xu
Main category: cs.CV
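TL;DR: This paper proposes SCORE, a framework that uses reinforcement learning to compress video tokens adaptively, substantially improving computational efficiency while maintaining high performance.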
Details
Motivation: Existing video understanding models suffer from high computational cost and "context rot", and conventional compression strategies do not adapt to the task.
Method: SCORE comprises a lightweight policy network based on a surprise-augmented state representation, optimized with group-wise reinforcement learning and a two-stage curriculum.
Result: SCORE significantly outperforms existing methods on multiple video understanding benchmarks, achieving a 16x prefill speedup at a 10% retention ratio while preserving 99.5% of the original performance.
Conclusion: SCORE offers a scalable solution for efficient long-video understanding.
Abstract: Multimodal Large Language Models have demonstrated remarkable capabilities in video understanding, yet face prohibitive computational costs and performance degradation from "context rot" due to massive visual token redundancy. Existing compression strategies typically rely on heuristics or fixed transformations that are often decoupled from the downstream task objectives, limiting their adaptability and effectiveness. To address this, we propose SCORE (Surprise-augmented token COmpression via REinforcement learning), a unified framework that learns an adaptive token compression policy. SCORE introduces a lightweight policy network conditioned on a surprise-augmented state representation that incorporates inter-frame residuals to explicitly capture temporal dynamics and motion saliency. We optimize this policy using a group-wise reinforcement learning scheme with a split-advantage estimator, stabilized by a two-stage curriculum transferring from static pseudo-videos to real dynamic videos. Extensive experiments on diverse video understanding benchmarks demonstrate that SCORE significantly outperforms state-of-the-art baselines. Notably, SCORE achieves a 16x prefill speedup while preserving 99.5% of original performance at a 10% retention ratio, offering a scalable solution for efficient long-form video understanding.
[138] Restore, Assess, Repeat: A Unified Framework for Iterative Image Restoration
I-Hsiang Chen,Isma Hadji,Enrique Sanchez,Adrian Bulat,Sy-Yen Kuo,Radu Timofte,Georgios Tzimiropoulos,Brais Martinez
Main category: cs.CV
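TL;DR: This paper proposes RAR (Restore, Assess and Repeat), a framework that models image quality assessment (IQA) and image restoration (IR) jointly, performing degradation identification, restoration, and quality verification together in the latent space to obtain end-to-end trainable, dynamically adaptive, and efficient iterative restoration.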
Details
Motivation: Existing image restoration methods generalize poorly and are inefficient when facing unknown or composite degradations, and keeping the IQA and IR modules separate causes latency and information loss.
Method: The RAR process jointly models degradation identification, image restoration, and quality assessment in the latent domain; it is end-to-end trainable, and IQA and IR are tightly coupled to avoid the redundancy of decoupled modules.
Result: RAR delivers consistent gains on single, unknown, and composite degradation tasks, setting a new state of the art.
Conclusion: By tightly integrating IQA and IR, RAR achieves more robust, efficient, and adaptive image restoration and offers a new paradigm for general-purpose image restoration.
Abstract: Image restoration aims to recover high-quality images from inputs degraded by various factors, such as adverse weather, blur, or low light. While recent studies have shown remarkable progress across individual or unified restoration tasks, they still suffer from limited generalization and inefficiency when handling unknown or composite degradations. To address these limitations, we propose RAR, a Restore, Assess and Repeat process that integrates Image Quality Assessment (IQA) and Image Restoration (IR) into a unified framework to iteratively and efficiently achieve high-quality image restoration. Specifically, we introduce a restoration process that operates entirely in the latent domain to jointly perform degradation identification, image restoration, and quality verification. The resulting model is fully trainable end to end and allows for an all-in-one assess-and-restore approach that dynamically adapts the restoration process. Also, the tight integration of IQA and IR into a unified model minimizes the latency and information loss that typically arise from keeping the two modules disjoint (e.g., during image and/or text decoding). Extensive experiments show that our approach yields consistent improvements under single, unknown, and composite degradations, thereby establishing a new state of the art.
[139] SHANDS: A Multi-View Dataset and Benchmark for Surgical Hand-Gesture and Error Recognition Toward Medical Training
Le Ma,Thiago Freitas dos Santos,Nadia Magnenat-Thalmann,Katarzyna Wac
Main category: cs.CV
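TL;DR: This paper presents Surgical-Hands (SHands), a dataset for recognizing surgical hand gestures and errors that supports AI-driven assessment of surgical training; it contains videos of 52 participants (20 experts and 32 trainees) performing linear incision and suturing captured from five views, annotated with 15 gesture primitives and 8 validated trainee error types, and defines single-view, multi-view, and cross-view generalization evaluation protocols.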
Details
Motivation: Expert assessment in surgical training is costly and hard to scale; existing AI-based assessment methods are limited by the lack of datasets containing realistic trainee errors and multi-view variation.
Method: SHands, a large-scale multi-view video dataset, is built from 52 participants (20 experts and 32 trainees) performing standardized incision and suturing tasks recorded by five RGB cameras; videos are annotated at the frame level with 15 gesture primitives and 8 clinically validated trainee error types, single-view, multi-view, and cross-view evaluation protocols are designed, and state-of-the-art deep learning models are benchmarked.
Result: SHands is released as the first large-scale hand-gesture and error recognition dataset for surgical training that contains realistic trainee errors and multi-view data, together with a systematic benchmark.
Conclusion: SHands fills the gap in high-quality, clinically relevant multi-view training data for AI-based surgical assessment and is a key resource for developing robust, scalable AI surgical training systems grounded in clinical knowledge.
Abstract: In surgical training for medical students, proficiency development relies on expert-led skill assessment, which is costly, time-limited, and difficult to scale, and whose expertise remains confined to institutions with available specialists. Automated AI-based assessment offers a viable alternative, but progress is constrained by the lack of datasets containing realistic trainee errors and the multi-view variability needed to train robust computer vision approaches. To address this gap, we present Surgical-Hands (SHands), a large-scale multi-view video dataset for surgical hand-gesture and error recognition for medical training. SHands captures linear incision and suturing using five RGB cameras from complementary viewpoints, performed by 52 participants (20 experts and 32 trainees), each completing three standardized trials per procedure. The videos are annotated at the frame level with 15 gesture primitives and include a validated taxonomy of 8 trainee error types, enabling both gesture recognition and error detection. We further define standardized evaluation protocols for single-view, multi-view, and cross-view generalization, and benchmark state-of-the-art deep learning models on the dataset. SHands is publicly released to support the development of robust and scalable AI systems for surgical training grounded in clinically curated domain knowledge.
[140] CPUBone: Efficient Vision Backbone Design for Devices with Low Parallelization Capabilities
Moritz Nottebaum,Matteo Dunnhofer,Christian Micheloni
Main category: cs.CV
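TL;DR: This paper proposes CPUBone, a vision backbone optimized for CPUs that uses grouped convolutions and smaller kernel sizes to reduce computation while retaining hardware efficiency, achieving the best speed-accuracy trade-off on CPUs.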
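Both modifications named in this entry reduce multiply-accumulate operations (MACs) in a predictable way: grouping divides the input channels each filter sees, and a smaller kernel shrinks the k x k window. The sketch below counts MACs for a single convolution layer; the layer shape (56x56 output, 64 channels) and the specific group and kernel values are illustrative assumptions, not CPUBone's configuration.

```python
def conv_macs(h_out, w_out, c_in, c_out, kernel, groups=1):
    """MACs for one 2D convolution with a square kernel (bias ignored)."""
    if c_in % groups or c_out % groups:
        raise ValueError("groups must divide both channel counts")
    return h_out * w_out * c_out * (c_in // groups) * kernel * kernel


def macs_per_second(macs, latency_seconds):
    """Hardware efficiency (MACpS) given a measured latency."""
    return macs / latency_seconds


# Hypothetical layer: 56x56 output, 64 -> 64 channels.
base = conv_macs(56, 56, 64, 64, kernel=3)
grouped = conv_macs(56, 56, 64, 64, kernel=3, groups=4)
small_kernel = conv_macs(56, 56, 64, 64, kernel=2)
```

Fewer MACs only translate into lower latency if MACpS stays high, and the latency needed for `macs_per_second` has to be measured on the target CPU; it cannot be derived from the layer shape.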
Details
Motivation: Existing vision backbones are mostly optimized for highly parallel hardware (such as GPUs and AI accelerators), whereas CPUs lack comparable parallelism and require a balance between the number of MACs and hardware execution efficiency (MACpS), so a design paradigm tailored to CPUs is needed.
Method: Two ways of lightening standard convolutions are investigated, grouped convolutions and smaller kernel sizes, and the CPUBone model family is built on them for CPU inference, emphasizing low latency and high MACpS.
Result: Experiments on a variety of CPU devices confirm that the proposed modifications retain high hardware efficiency; CPUBone achieves state-of-the-art speed-accuracy trade-offs (SATs) and transfers successfully to downstream tasks such as object detection and semantic segmentation.
Conclusion: Vision backbone design for CPUs should prioritize hardware execution efficiency (MACpS) rather than merely reducing MACs; CPUBone demonstrates that this approach works and offers an efficient, transferable architecture for edge CPU deployment.
Abstract: Recent research on vision backbone architectures has predominantly focused on optimizing efficiency for hardware platforms with high parallel processing capabilities. This category increasingly includes embedded systems such as mobile phones and embedded AI accelerator modules. In contrast, CPUs cannot parallelize operations in the same manner, so models benefit from a specific design philosophy that balances the number of operations (MACs) with hardware-efficient execution, that is, high MACs per second (MACpS). In pursuit of this, we investigate two modifications to standard convolutions, aimed at reducing computational cost: grouping convolutions and reducing kernel sizes. While both adaptations substantially decrease the total number of MACs required for inference, sustaining low latency necessitates preserving hardware efficiency. Our experiments across diverse CPU devices confirm that these adaptations successfully retain high hardware efficiency on CPUs. Based on these insights, we introduce CPUBone, a new family of vision backbone models optimized for CPU-based inference. CPUBone achieves state-of-the-art Speed-Accuracy Trade-offs (SATs) across a wide range of CPU devices and effectively transfers its efficiency to downstream tasks such as object detection and semantic segmentation. Models and code are available at https://github.com/altair199797/CPUBone.
[141] Image-based Quantification of Postural Deviations on Patients with Cervical Dystonia: A Machine Learning Approach Using Synthetic Training Data
Roland Stenger,Sebastian Löns,Nele Brügge,Feline Hamami,Alexander Münchau,Theresa Paulus,Anne Weissbach,Tatiana Usnich,Max Borsche,Martje G. Pauly,Lara M. Lange,Markus A. Hobert,Rebecca Herzog,Ana Luísa de Almeida Marcelino,Tina Mainka,Friederike Schumann,Lukas L. Goede,Johanna Reimer,Julienne Haas,Jos Becktepe,Alexander Baumann,Robin Wolke,Chi Wang Ip,Thorsten Odorfer,Daniel Zeller,Lisa Harder-Rauschenberger,John-Ih Lee,Philipp Albrecht,Tristan Kölsche,Joachim K. Krauss,Johanna M. Nagel,Joachim Runge,Johanna Doll-Lee,Simone Zittel,Kai Grimm,Pawel Tacik,André Lee,Tobias Bäumer,Sebastian Fudickar
Main category: cs.CV
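TL;DR: This study develops an automated image-based system for estimating head pose and shift to objectively assess rotational symptoms and lateral shift in patients with cervical dystonia (CD); a deep learning model is trained on synthetic data, and multicenter clinical validation shows results highly consistent with expert ratings.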
Details
Motivation: Current CD assessment relies on clinical rating scales (such as TWSTRS) that are highly subjective and have low reliability, and objective, repeatable quantification tools are lacking.
Method: A pretrained head-pose estimation algorithm (for rotational symptoms) is combined with a deep learning model trained specifically for lateral shift on about 16,000 synthetic avatar images; the system is validated in a multicenter study against the consensus ratings of 20 experts on real patient images (n=100) and labeled synthetic avatars (n=100).
Result: Rotational symptom estimates correlate strongly with expert ratings (torticollis r=0.91, laterocollis r=0.81, anteroretrocollis r=0.78); for lateral shift the correlation is moderate (r=0.55), but accuracy on the synthetic avatar benchmark is higher than that of human raters.
Conclusion: The system is the first objective CD posture assessment tool validated in a multicenter clinical study; trained on synthetic data, it generalizes to real patients and may support standardized clinical decision-making and clinical trial evaluation.
Abstract: Cervical dystonia (CD) is the most common form of dystonia, yet current assessment relies on clinical rating scales, such as the Toronto Western Spasmodic Torticollis Rating Scale (TWSTRS), which require expertise, are subjective, and show low inter-rater reliability for some items of the score. To address the lack of established objective tools for monitoring disease severity and treatment response, this study validates an automated image-based head pose and shift estimation system for patients with CD. We developed an assessment tool that combines a pretrained head-pose estimation algorithm for rotational symptoms with a deep learning model trained exclusively on ~16,000 synthetic avatar images to evaluate rare translational symptoms, specifically lateral shift. This synthetic data approach overcomes the scarcity of clinical training examples. The system's performance was validated in a multicenter study by comparing its predicted scores against the consensus ratings of 20 clinical experts using a dataset of 100 real patient images and 100 labeled synthetic avatars. The automated system demonstrated strong agreement with expert clinical ratings for rotational symptoms, achieving high correlations for torticollis (r=0.91), laterocollis (r=0.81), and anteroretrocollis (r=0.78). For lateral shift, the tool achieved a moderate correlation (r=0.55) with clinical ratings and demonstrated higher accuracy than human raters in controlled benchmark tests on avatars.
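The agreement figures above are correlation coefficients between system output and expert consensus. Assuming they are Pearson correlations (the abstract reports r without naming the estimator), the computation can be sketched as follows; the severity scores are invented for illustration and are not study data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical severity scores for six images (not study data).
expert_consensus = [0, 1, 1, 2, 3, 3]
system_score = [0, 1, 2, 2, 2, 3]

r = pearson_r(expert_consensus, system_score)
```

A value near 1 means the system ranks and spaces cases much like the experts do; it does not by itself show that the absolute scores match, which is why the study also reports accuracy on the avatar benchmark.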
By leveraging synthetic training data to bridge the clinical data gap, this model successfully generalizes to real-world patients, providing a validated, objective tool for CD postural assessment that can enable standardized clinical decision-making and trial evaluation.
[142] Meta-Learned Adaptive Optimization for Robust Human Mesh Recovery with Uncertainty-Aware Parameter Updates
Shaurjya Mandal,Nutan Sharma,John Galeotti
Main category: cs.CV
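TL;DR: This paper proposes a new framework combining meta-learning with uncertainty-aware adaptive optimization for single-image human mesh recovery, significantly improving accuracy, generalization, and uncertainty estimation.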
Details
Motivation: Single-image human mesh recovery faces depth ambiguity and poor cross-domain generalization, and existing regression-plus-optimization methods suffer from poor initialization and inefficient optimization.
Method: A meta-learning framework with three innovations is proposed: (1) simulating test-time optimization during training to learn optimization-friendly initializations; (2) a selective parameter caching mechanism that freezes converged joints to reduce computational overhead; (3) distribution-based adaptive updates that sample parameter changes from learned distributions, combined with stochastic approximation techniques to handle complex loss gradients.
Result: MPJPE is reduced by 10.3 on 3DPW and 8.0 on Human3.6M, reaching the state of the art; the method shows strong domain adaptation, and its uncertainty estimates correlate closely with actual errors.
Conclusion: Combining meta-learning with adaptive optimization yields accurate human mesh recovery while maintaining robust generalization and trustworthy uncertainty modeling in challenging scenarios.
Abstract: Human mesh recovery from single images remains challenging due to inherent depth ambiguity and limited generalization across domains. While recent methods combine regression and optimization approaches, they struggle with poor initialization for test-time refinement and inefficient parameter updates during optimization. We propose a novel meta-learning framework that trains models to produce optimization-friendly initializations while incorporating uncertainty-aware adaptive updates during test-time refinement. Our approach introduces three key innovations: (1) a meta-learning strategy that simulates test-time optimization during training to learn better parameter initializations, (2) a selective parameter caching mechanism that identifies and freezes converged joints to reduce computational overhead, and (3) distribution-based adaptive updates that sample parameter changes from learned distributions, enabling robust exploration while quantifying uncertainty. Additionally, we employ stochastic approximation techniques to handle intractable gradients in complex loss landscapes. Extensive experiments on standard benchmarks demonstrate that our method achieves state-of-the-art performance, reducing MPJPE by 10.3 on 3DPW and 8.0 on Human3.6M compared to strong baselines. Our approach shows superior domain adaptation capabilities with minimal performance degradation across different environmental conditions, while providing meaningful uncertainty estimates that correlate with actual prediction errors.
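MPJPE (mean per-joint position error), the metric quoted above, is the average Euclidean distance between predicted and ground-truth 3D joints. The abstract does not state units; on 3DPW and Human3.6M the metric is conventionally reported in millimetres. A minimal sketch with invented joint coordinates follows; it omits the root or Procrustes alignment that benchmark protocols usually apply first.

```python
import math


def mpjpe(pred, gt):
    """Mean per-joint position error over matching lists of 3D joints."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and equal in length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


# Hypothetical three-joint skeleton (coordinates in millimetres, illustrative).
gt_joints = [(0.0, 0.0, 0.0), (0.0, 100.0, 0.0), (0.0, 200.0, 0.0)]
pred_joints = [(3.0, 4.0, 0.0), (0.0, 100.0, 0.0), (0.0, 200.0, 12.0)]

error = mpjpe(pred_joints, gt_joints)
```

The per-joint distances here are 5, 0, and 12, so a single badly placed joint dominates the average; the same per-joint distances are what an uncertainty estimate would be compared against.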
Combining meta-learning and adaptive optimization enables accurate mesh recovery and robust generalization to challenging scenarios.
[143] HyVIC: A Metric-Driven Spatio-Spectral Hyperspectral Image Compression Architecture Based on Variational Autoencoders
Martin Hermann Paul Fuchs,Behnood Rasti,Begüm Demir
Main category: cs.CV
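TL;DR: This paper proposes HyVIC, a new variational compression architecture for hyperspectral images (HSIs) whose adjustable spatio-spectral encoder and decoder and hyperencoder-hyperdecoder design explicitly model the spatio-spectral redundancy specific to HSIs, significantly improving reconstruction quality on benchmark datasets (up to 4.66 dB in BD-PSNR).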
Details
Motivation: Existing learning-based HSI compression methods mostly reuse natural-image compression models directly and do not adequately model the spatio-spectral redundancy unique to HSIs; in particular, they lack explicit architectural designs for balancing spatial and spectral feature learning.
Method: The spatio-spectral variational HSI compression architecture HyVIC is proposed, with four core components: an adjustable spatio-spectral encoder, a spatio-spectral hyperencoder, a spatio-spectral hyperdecoder, and an adjustable spatio-spectral decoder; a metric-driven hyperparameter selection strategy is introduced to tune the trade-off between spatial and spectral feature learning.
Result: HyVIC is validated on two benchmark datasets, achieving high spatial and spectral reconstruction fidelity across a wide range of compression ratios and improving BD-PSNR over the current best methods by up to 4.66 dB.
Conclusion: Explicitly modeling joint spatio-spectral redundancy and balancing the two kinds of feature learning is crucial for HSI compression; HyVIC offers a new paradigm for learning-based variational HSI compression, and its code and pre-trained models are open-sourced.
Abstract: The rapid growth of hyperspectral data archives in remote sensing (RS) necessitates effective compression methods for storage and transmission. Recent advances in learning-based hyperspectral image (HSI) compression have significantly enhanced both reconstruction fidelity and compression efficiency. However, existing methods typically adapt variational image compression models designed for natural images, without adequately accounting for the distinct spatio-spectral redundancies inherent in HSIs. In particular, they lack explicit architectural designs to balance spatial and spectral feature learning, limiting their ability to effectively leverage the unique characteristics of hyperspectral data. To address this issue, we introduce a spatio-spectral variational hyperspectral image compression architecture (HyVIC). The proposed model comprises four main components: 1) adjustable spatio-spectral encoder; 2) spatio-spectral hyperencoder; 3) spatio-spectral hyperdecoder; and 4) adjustable spatio-spectral decoder. We demonstrate that the trade-off between spatial and spectral feature learning is crucial for reconstruction fidelity, and therefore present a metric-driven strategy to systematically select the hyperparameters of the proposed model. Extensive experiments on two benchmark datasets demonstrate the effectiveness of the proposed model, achieving high spatial and spectral reconstruction fidelity across a wide range of compression ratios (CRs) and improving the state of the art by up to 4.66 dB in terms of BD-PSNR.
Based on our results, we offer insights and derive practical guidelines for future research in learning-based variational HSI compression. Our code and pre-trained model weights are publicly available at https://git.tu-berlin.de/rsim/hyvic .
[144] SparseCam4D: Spatio-Temporally Consistent 4D Reconstruction from Sparse Cameras
Weihong Pan,Xiaoyu Zhang,Zhuang Zhang,Zhichao Ye,Nan Wang,Haomin Liu,Guofeng Zhang
Main category: cs.CV
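TL;DR: This paper proposes a framework for 4D reconstruction of dynamic scenes from sparse, uncalibrated cameras; it uses generative observations and a newly proposed spatio-temporal distortion field to model inconsistencies across space and time, achieving high-fidelity, spatio-temporally consistent rendering and significantly outperforming existing methods on multi-camera dynamic scene benchmarks.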
Details
Motivation: Existing high-quality 4D dynamic scene reconstruction relies on expensive, dense arrays of synchronized cameras, which severely limits practical scalability.
Method: A Spatio-Temporal Distortion Field is proposed to model, in a unified way, the inconsistencies of generative observations across space and time, and a complete pipeline is built for 4D reconstruction from sparse, uncalibrated camera inputs.
Result: The method achieves spatio-temporally consistent, high-fidelity rendering on multi-camera dynamic scene benchmarks and significantly outperforms existing methods.
Conclusion: The sparse-camera dynamic reconstruction framework reduces hardware dependence and improves the practicality and scalability of 4D reconstruction.
Abstract: High-quality 4D reconstruction enables photorealistic and immersive rendering of the dynamic real world. However, unlike static scenes that can be fully captured with a single camera, high-quality dynamic scenes typically require dense arrays of tens or even hundreds of synchronized cameras. The reliance on such costly lab setups severely limits practical scalability. To this end, we propose a sparse-camera dynamic reconstruction framework that exploits abundant yet inconsistent generative observations. Our key innovation is the Spatio-Temporal Distortion Field, which provides a unified mechanism for modeling inconsistencies in generative observations across both spatial and temporal dimensions. Building on this, we develop a complete pipeline that enables 4D reconstruction from sparse and uncalibrated camera inputs. We evaluate our method on multi-camera dynamic scene benchmarks, achieving spatio-temporally consistent high-fidelity renderings and significantly outperforming existing approaches.
[145] ClipTTT: CLIP-Guided Test-Time Training Helps LVLMs See Better
Mriganka Nath,Anurag Das,Jiahao Xie,Bernt Schiele
Main category: cs.CV
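TL;DR: This paper proposes ClipTTT, which uses the image-text alignment ability of a pre-trained CLIP model to adapt large vision-language models (LVLMs) at test time from a single sample, mitigating hallucinations caused by degraded visual inputs.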
Details
Motivation: LVLMs are prone to hallucination when visual inputs are corrupted at test time; such corruptions constitute an additional distribution shift that aggravates hallucination in real-world settings.
Method: CLIP-guided test-time training (ClipTTT) uses a pre-trained CLIP model as a stable guidance signal to identify reliable self-supervision targets from a single test sample, enabling rapid adaptation of LVLMs without altering the base model.
Result: On standard hallucination benchmarks with 15 common visual corruptions, ClipTTT significantly lowers hallucination rates and improves descriptive faithfulness.
Conclusion: ClipTTT is a lightweight, plug-and-play test-time adaptation method that improves the robustness and reliability of LVLMs under visual degradation.
Abstract: Large vision-language models (LVLMs) tend to hallucinate, especially when visual inputs are corrupted at test time. We show that such corruptions act as additional distribution shifts, significantly amplifying hallucination rates in real-world applications. To address this, we propose CLIP-guided Test-Time Training (ClipTTT), a method to adapt LVLMs under degraded conditions on the fly with a single test sample. Specifically, we leverage the image-text alignment strength of a pre-trained CLIP model as a stable guidance signal to identify reliable self-supervision targets, enabling rapid adaptation without altering the base LVLMs. Extensive experiments on standard hallucination benchmarks, with 15 common corruptions, demonstrate that ClipTTT effectively mitigates hallucinations and improves descriptive faithfulness under visual corruptions.
[146] Conditional Diffusion for 3D CT Volume Reconstruction from 2D X-rays
Martin Rath,Morteza Ghahremani,Yitong Li,Ashkan Taghipour,Marcus Makowski,Christian Wachinger
Main category: cs.CV
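TL;DR: This paper proposes AXON, a framework that uses a multi-stage diffusion model to reconstruct high-fidelity 3D CT volumes directly from real X-rays, improving diagnostic accessibility.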
Details
Motivation: CT provides rich 3D anatomical information but is limited by high radiation, high cost, and scarce equipment; X-rays are cheap and accessible but give only 2D projections with limited pathological information; existing 2D-to-3D reconstruction methods mostly rely on synthetic X-rays and generalize poorly to clinical data.
Method: AXON is a multi-stage diffusion framework: the first stage uses a Brownian Bridge diffusion model for coarse global structure synthesis; the second stage uses ControlNet for local intensity refinement; bi-planar X-ray input is supported to reduce depth ambiguity; an integrated super-resolution network raises the output to diagnostic-grade resolution.
Result: AXON significantly outperforms state-of-the-art methods on public and external datasets, improving PSNR by 11.9% and SSIM by 11.0%, and generalizes well across different clinical distributions.
Conclusion: AXON reconstructs high-quality CT end to end from real X-rays and may improve imaging-based diagnosis in primary care.
Abstract: Computed tomography (CT) provides rich 3D anatomical details but is often constrained by high radiation exposure, substantial costs, and limited availability. While standard chest X-rays are cost-effective and widely accessible, they only provide 2D projections with limited pathological information. Reconstructing 3D CT volumes from 2D X-rays offers a transformative solution to increase diagnostic accessibility, yet existing methods predominantly rely on synthetic X-ray projections, limiting clinical generalization. In this work, we propose AXON, a multi-stage diffusion-based framework that reconstructs high-fidelity 3D CT volumes directly from real X-rays. AXON employs a coarse-to-fine strategy, with a Brownian Bridge diffusion model-based initial stage for global structural synthesis, followed by a ControlNet-based refinement stage for local intensity optimization. It also supports bi-planar X-ray input to mitigate depth ambiguities inherent in 2D-to-3D reconstruction. A super-resolution network is integrated to upscale the generated volumes to achieve diagnostic-grade resolution. Evaluations on both public and external datasets demonstrate that AXON significantly outperforms state-of-the-art baselines, achieving an 11.9% improvement in PSNR and an 11.0% increase in SSIM with robust generalizability across disparate clinical distributions. Our code is available at https://github.com/ai-med/AXON.
[147] Learnable Quantum Efficiency Filters for Urban Hyperspectral Segmentation
Imad Ali Shah,Jiarong Li,Ethan Delaney,Enda Ward,Martin Glavin,Edward Jones,Brian Deegan
Main category: cs.CV
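TL;DR: This paper proposes Learnable Quantum Efficiency (LQE), a physics-inspired, interpretable dimensionality reduction method for hyperspectral urban driving scene understanding. It parameterizes smooth high-order spectral response functions that emulate sensor quantum efficiency curves, and improves mIoU across multiple datasets and semantic segmentation models while remaining parameter-efficient and interpretable.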
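To make the idea of a smooth, single-peak, bandwidth-bounded spectral response concrete, here is a minimal sketch in plain Python. The super-Gaussian form, the three parameters per filter (`center`, `width`, `order`), and the function names are assumptions chosen for illustration; the entry does not state the exact parameterization LQE uses.

```python
import math

def super_gaussian_filter(wavelengths, center, width, order):
    """Hypothetical response curve: exp(-|(w - center) / width|^(2 * order)).

    It has one peak at `center`, is smooth, and decays outside roughly
    `center +/- width`, which mirrors the constraints named in the entry.
    """
    return [math.exp(-abs((w - center) / width) ** (2 * order)) for w in wavelengths]

def reduce_spectrum(spectrum, wavelengths, filters):
    """Project one pixel's spectrum onto each filter as a normalized weighted sum."""
    channels = []
    for center, width, order in filters:
        response = super_gaussian_filter(wavelengths, center, width, order)
        weight_sum = sum(response)
        channels.append(sum(r * s for r, s in zip(response, spectrum)) / weight_sum)
    return channels

# Illustrative values only: 31 bands from 400 nm to 700 nm, reduced to 3 channels.
wavelengths = [400 + 10 * i for i in range(31)]
spectrum = [1.0] * len(wavelengths)          # a flat spectrum
filters = [(450.0, 40.0, 1), (550.0, 40.0, 2), (650.0, 40.0, 1)]
channels = reduce_spectrum(spectrum, wavelengths, filters)
```

In a trainable version the filter parameters would be learned by gradient descent together with the segmentation model; this sketch only shows the forward projection.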
Details
Motivation: Hyperspectral sensing provides rich spectral information, but its high dimensionality makes interpretation and efficient learning difficult, calling for a dimensionality reduction method that is both physically plausible and learnable.
Method: Proposes Learnable Quantum Efficiency (LQE), a differentiable, end-to-end trainable spectral dimensionality reduction method built on physical constraints (single dominant peak, smoothness, bounded bandwidth) that parameterizes high-order spectral response functions to model sensor quantum efficiency.
Result: On HyKo, HSI-Drive, and Hyperspectral City, LQE's average mIoU exceeds conventional methods by 2.45%, 0.45%, and 1.04%, and learnable methods by 1.18%, 1.56%, and 0.81%, respectively; it needs only 12–36 parameters, far fewer than competing approaches (51–22K), with competitive inference latency.
Conclusion: Physics-informed spectral learning can improve performance while enhancing interpretability, providing a principled bridge between hyperspectral perception and multispectral sensor design for autonomous driving.
Abstract: Hyperspectral sensing provides rich spectral information for scene understanding in urban driving, but its high dimensionality poses challenges for interpretation and efficient learning. We introduce Learnable Quantum Efficiency (LQE), a physics-inspired, interpretable dimensionality reduction (DR) method that parameterizes smooth high-order spectral response functions that emulate plausible sensor quantum efficiency curves. Unlike conventional methods or unconstrained learnable layers, LQE enforces physically motivated constraints, including a single dominant peak, smooth responses, and bounded bandwidth. This formulation yields a compact spectral representation that preserves discriminative information while remaining fully differentiable and end-to-end trainable within semantic segmentation models (SSMs). We conduct systematic evaluations across three publicly available multi-class hyperspectral urban driving datasets, comparing LQE against six conventional and seven learnable baseline DR methods across six SSMs. Averaged across all SSMs and configurations, LQE achieves the highest average mIoU, improving over conventional methods by 2.45%, 0.45%, and 1.04%, and over learnable methods by 1.18%, 1.56%, and 0.81% on HyKo, HSI-Drive, and Hyperspectral City, respectively. LQE maintains strong parameter efficiency (12–36 parameters compared to 51–22K for competing learnable approaches) and competitive inference latency. Ablation studies show that low-order configurations are optimal, while the learned spectral filters converge to dataset-intrinsic wavelength patterns. These results demonstrate that physics-informed spectral learning can improve both performance and interpretability, providing a principled bridge between hyperspectral perception and data-driven multispectral sensor design for automotive vision systems.
[148] OVI-MAP: Open-Vocabulary Instance-Semantic Mapping
Zilong Deng,Federico Tombari,Marc Pollefeys,Johanna Wald,Daniel Barath
Main category: cs.CV
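TL;DR: OVI-MAP decouples instance reconstruction from semantic inference: it builds a class-agnostic, incremental 3D instance map and uses vision-language models on multiple views for zero-shot semantic labeling, enabling real-time, stable, open-vocabulary online 3D semantic mapping.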
Details
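Motivation: Existing methods are limited by the closed-set assumption or by dense per-pixel language fusion, and struggle to combine robust instance segmentation, real-time processing, and open-set reasoning, so they cannot meet the needs of autonomous agents for incremental open-vocabulary 3D instance-semantic mapping in complex everyday environments.
Method: OVI-MAP decouples instance reconstruction from semantic inference: 1) it incrementally builds a class-agnostic 3D instance map from RGB-D input; 2) it extracts semantic features only from automatically selected key views and uses vision-language models for zero-shot semantic labeling.
Result: The system runs in real time, outperforms current state-of-the-art open-vocabulary mapping methods on standard benchmarks, and provides stable instance tracking and zero-shot semantic labeling.
Conclusion: The decoupled design markedly improves the scalability, temporal consistency, and practicality of open-vocabulary 3D semantic mapping, offering a new paradigm for long-term online perception by autonomous agents in real scenes.
Abstract: Incremental open-vocabulary 3D instance-semantic mapping is essential for autonomous agents operating in complex everyday environments. However, it remains challenging due to the need for robust instance segmentation, real-time processing, and flexible open-set reasoning. Existing methods often rely on the closed-set assumption or dense per-pixel language fusion, which limits scalability and temporal consistency. We introduce OVI-MAP, which decouples instance reconstruction from semantic inference. We propose to build a class-agnostic 3D instance map that is incrementally constructed from RGB-D input, while semantic features are extracted only from a small set of automatically selected views using vision-language models. This design enables stable instance tracking and zero-shot semantic labeling throughout online exploration. Our system operates in real time and outperforms state-of-the-art open-vocabulary mapping baselines on standard benchmarks.
[149] AutoWeather4D: Autonomous Driving Video Weather Conversion via G-Buffer Dual-Pass Editing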
Tianyu Liu,Weitao Xiong,Kunming Luo,Manyuan Zhang,Peng Liu,Yuan Liu,Ping Tan
Main category: cs.CV
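TL;DR: This paper proposes AutoWeather4D, a feed-forward 3D-aware weather editing framework that explicitly decouples geometry and illumination through a G-buffer dual-pass mechanism, enabling efficient, physically controllable synthesis of adverse-weather video without per-scene optimization, suited to autonomous driving data generation.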
Details
Motivation: Existing generative video models need large amounts of data to learn rare weather scenarios, while 3D-aware editing methods are held back by costly per-scene optimization and by entangled geometry and illumination.
Method: Proposes the AutoWeather4D framework, built around a G-buffer dual-pass editing mechanism: the Geometry Pass uses explicit structural foundations to support surface-anchored physical interactions; the Light Pass analytically resolves light transport, accumulating the contributions of local illuminants into the global illumination to enable dynamic 3D local relighting.
Result: Experiments show that AutoWeather4D matches generative baselines in photorealism and structural consistency while supporting fine-grained parametric physical control.
Conclusion: AutoWeather4D is a practical, efficient weather data engine for autonomous driving that overcomes the bottlenecks of earlier methods in data dependence, optimization cost, and geometry-illumination decoupling.
Abstract: Generative video models have significantly advanced the photorealistic synthesis of adverse weather for autonomous driving; however, they consistently demand massive datasets to learn rare weather scenarios. While 3D-aware editing methods alleviate these data constraints by augmenting existing video footage, they are fundamentally bottlenecked by costly per-scene optimization and suffer from inherent geometric and illumination entanglement. In this work, we introduce AutoWeather4D, a feed-forward 3D-aware weather editing framework designed to explicitly decouple geometry and illumination. At the core of our approach is a G-buffer Dual-pass Editing mechanism. The Geometry Pass leverages explicit structural foundations to enable surface-anchored physical interactions, while the Light Pass analytically resolves light transport, accumulating the contributions of local illuminants into the global illumination to enable dynamic 3D local relighting. Extensive experiments demonstrate that AutoWeather4D achieves comparable photorealism and structural consistency to generative baselines while enabling fine-grained parametric physical control, serving as a practical data engine for autonomous driving.
[150] Beyond MACs: Hardware Efficient Architecture Design for Vision Backbones
Moritz Nottebaum,Matteo Dunnhofer,Christian Micheloni
Main category: cs.CV
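TL;DR: This paper points out the limitations of MACs as an efficiency metric for vision backbones, especially on edge devices, and proposes LowFormer, a new efficient vision backbone built around the lightweight attention mechanism Lowtention, which achieves higher speed and better performance on ImageNet and other tasks.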
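As a reference point for what a MAC count measures, the sketch below applies the standard MAC formula for a 2D convolution to a standard 3x3 convolution and to a depthwise-separable replacement. The layer shapes are illustrative assumptions, not values from the paper.

```python
def conv2d_macs(h_out, w_out, c_in, c_out, kernel, groups=1):
    """Multiply-accumulate count of one 2D convolution (bias ignored).

    Each output element needs (c_in / groups) * kernel * kernel MACs.
    """
    return h_out * w_out * c_out * (c_in // groups) * kernel * kernel

# Illustrative layer: 56x56 output map, 64 -> 64 channels.
standard = conv2d_macs(56, 56, 64, 64, kernel=3)
depthwise = conv2d_macs(56, 56, 64, 64, kernel=3, groups=64)
pointwise = conv2d_macs(56, 56, 64, 64, kernel=1)
separable = depthwise + pointwise

print(standard, separable, round(standard / separable, 2))
```

On paper the separable variant needs roughly an eighth of the MACs. The entry's argument is that measured execution time need not shrink by the same factor, so a count like this is best treated as one input to a design decision and checked against timings on the target device.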
Details
Motivation: Existing work often uses MACs to predict model execution time, but this metric is clearly biased on edge devices, so efficiency evaluation and design methods closer to real hardware are needed.
Method: By contrasting the MAC count and real execution time of common architectural modules, the paper analyzes the key factors that affect efficiency on edge devices; based on these findings it designs the LowFormer backbone family, introduces the lightweight Lowtention as a replacement for Multi-Head Self-Attention, and develops an optimized version for edge GPUs.
Result: LowFormer outperforms existing SOTA backbones on ImageNet, achieves significant speed-ups on downstream tasks including object detection, semantic segmentation, image retrieval, and visual object tracking, and performs well across hardware platforms (edge GPU and desktop GPU).
Conclusion: MACs are not a reliable proxy for efficiency; LowFormer, designed from insights into real hardware performance, validates a backbone design paradigm that balances efficiency and accuracy and offers a practical solution for edge vision models.
Abstract: Vision backbone networks play a central role in modern computer vision. Enhancing their efficiency directly benefits a wide range of downstream applications. To measure efficiency, many publications rely on MACs (Multiply Accumulate operations) as a predictor of execution time. In this paper, we experimentally demonstrate the shortcomings of such a metric, especially in the context of edge devices. By contrasting the MAC count and execution time of common architectural design elements, we identify key factors for efficient execution and provide insights to optimize backbone design. Based on these insights, we present LowFormer, a novel vision backbone family. LowFormer features a streamlined macro and micro design that includes Lowtention, a lightweight alternative to Multi-Head Self-Attention. Lowtention not only proves more efficient, but also enables superior results on ImageNet. Additionally, we present an edge GPU version of LowFormer, which can further improve upon its baseline's speed on edge GPU and desktop GPU. We demonstrate LowFormer's wide applicability by evaluating it on smaller image classification datasets, as well as adapting it to several downstream tasks, such as object detection, semantic segmentation, image retrieval, and visual object tracking. LowFormer models consistently achieve remarkable speed-ups across various hardware platforms compared to recent state-of-the-art backbones. Code and models are available at https://github.com/altair199797/LowFormer/blob/main/Beyond_MACs.md.
[151] HolisticSemGes: Semantic Grounding of Holistic Co-Speech Gesture Generation with Contrastive Flow-Matching
Lanmiao Liu,Esam Ghaleb,Aslı Özyürek,Zerrin Yumak
Main category: cs.CV
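TL;DR: This paper proposes a co-speech gesture generation model based on Contrastive Flow Matching. It uses mismatched audio-text conditions as negative samples to improve the semantic accuracy of gestures and the modeling of sparse motion, and uses cross-modal contrastive and cosine objectives to embed text, audio, and holistic motion consistently.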
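The attract-and-repel idea can be written as a per-sample loss. The sketch below is a generic form under stated assumptions: a linear interpolation path (so the target velocity is `x1 - x0`), a single negative per sample, and a hypothetical weight `lam`. The entry does not give the exact objective, weighting, or how negatives are sampled.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def contrastive_fm_loss(v_pred, x0, x1, x0_neg, x1_neg, lam=0.05):
    """Pull the predicted velocity toward the matched trajectory and
    push it away from a mismatched one.

    v_pred          : velocity predicted by the model for the matched condition
    x0, x1          : noise sample and ground-truth motion for the matched pair
    x0_neg, x1_neg  : noise sample and motion from a mismatched (negative) pair
    lam             : hypothetical weight of the repulsion term
    """
    target = [b - a for a, b in zip(x0, x1)]
    negative_target = [b - a for a, b in zip(x0_neg, x1_neg)]
    return squared_distance(v_pred, target) - lam * squared_distance(v_pred, negative_target)

# Toy 2D example: a perfect prediction still earns a negative loss
# because it sits far from the mismatched trajectory.
loss = contrastive_fm_loss(
    v_pred=[1.0, 2.0],
    x0=[0.0, 0.0], x1=[1.0, 2.0],
    x0_neg=[0.0, 0.0], x1_neg=[-1.0, 0.0],
)
```

Keeping `lam` small matters in this form: the repulsion term is unbounded, so a large weight would let the model lower the loss by moving away from negatives instead of fitting the matched velocity.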
Details
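Motivation: Existing co-speech gesture generation methods have three problems: dependence on predefined linguistic rules leads to poor generalization; flow-matching methods are trained without negative samples and tend to learn rhythmic gestures instead of sparse iconic or metaphoric ones; and modeling body parts in isolation leads to cross-modal inconsistency.
Method: Proposes a contrastive flow matching framework: 1) mismatched audio-text pairs serve as negatives, and the velocity field is optimized to distinguish semantically congruent from incongruent motion trajectories; 2) cosine similarity and contrastive losses jointly embed text, audio, and holistic motion into a unified latent space to ensure cross-modal consistency.
Result: The method significantly outperforms SOTA methods on the BEAT2 and SHOW datasets; a user study confirms the advantage of the generated gestures in semantic accuracy and naturalness.
Conclusion: Contrastive flow matching combined with cross-modal joint embedding improves the semantic grounding, sparse-motion modeling, and cross-modal consistency of co-speech gesture generation, offering a new paradigm for more natural and interpretable gesture synthesis.
Abstract: While the field of co-speech gesture generation has seen significant advances, producing holistic, semantically grounded gestures remains a challenge. Existing approaches rely on external semantic retrieval methods, which limit their generalisation capability due to dependency on predefined linguistic rules. Flow-matching-based methods produce promising results; however, the network is optimised using only semantically congruent samples without exposure to negative examples, leading to learning rhythmic gestures rather than sparse motion, such as iconic and metaphoric gestures. Furthermore, by modelling body parts in isolation, the majority of methods fail to maintain cross-modal consistency. We introduce a Contrastive Flow Matching-based co-speech gesture generation model that uses mismatched audio-text conditions as negatives, training the velocity field to follow the correct motion trajectory while repelling semantically incongruent trajectories. Our model ensures cross-modal coherence by embedding text, audio, and holistic motion into a composite latent space via cosine and contrastive objectives. Extensive experiments and a user study demonstrate that our proposed approach outperforms state-of-the-art methods on two datasets, BEAT2 and SHOW.
[152] Generation Is Compression: Zero-Shot Video Coding via Stochastic Rectified Flow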
Ziyue Zeng,Xun Su,Haoyuan Liu,Bingyu Lu,Yui Tatsumi,Hiroshi Watanabe
Main category: cs.CV
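TL;DR: This paper proposes Generative Video Codec (GVC), a zero-shot framework that uses a pretrained video generative model directly as the codec, with no retraining. It converts the deterministic ODE into an equivalent SDE to obtain per-step stochastic injection points for codebook-driven compression, and designs three conditioning strategies (I2V, T2V, FLF2V) that trade off spatial fidelity, temporal coherence, and compression efficiency. Experiments show that GVC achieves high-quality reconstruction below 0.002 bpp and supports flexible bitrate control through a single hyperparameter.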
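To give a sense of scale for a rate of 0.002 bpp, the following sketch converts between bits per pixel and a total bit budget. The clip dimensions are a hypothetical example, not a configuration reported in the entry.

```python
def bits_per_pixel(total_bits, frames, height, width):
    """Bitrate expressed as transmitted bits per pixel of video."""
    return total_bits / (frames * height * width)

def bit_budget(bpp, frames, height, width):
    """Total number of bits available for a clip at a given bpp."""
    return bpp * frames * height * width

# Hypothetical clip: 30 frames at 1920x1080.
budget_bits = bit_budget(0.002, frames=30, height=1080, width=1920)
budget_bytes = budget_bits / 8
print(round(budget_bits), round(budget_bytes))
```

For this hypothetical clip the whole bitstream fits in about 15.5 kB, which illustrates why a codec operating at this rate has to lean on a strong generative prior instead of transmitting pixel-level residuals.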
Details
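Motivation: Existing generative video compression methods use generative models only as post-processing modules on top of conventional codecs, which fails to exploit their end-to-end modeling capability and limits further gains in compression efficiency and reconstruction quality.
Method: The deterministic rectified-flow ODE of a video foundation model is converted at inference time into an equivalent SDE, introducing per-step stochastic injection points for codebook-driven compression; on this unified backbone, three conditional generation strategies are designed: Image-to-Video (I2V) with adaptive tail-frame atom allocation, Text-to-Video (T2V) as a pure generative prior with near-zero side information, and First-Last-Frame-to-Video (FLF2V) with boundary-sharing GOP chaining for dual-anchor temporal control.
Result: On standard benchmarks, GVC achieves high-quality video reconstruction below 0.002 bpp, supports flexible bitrate control through a single hyperparameter, and its three variants form an interpretable trade-off space between spatial fidelity, temporal coherence, and compression efficiency.
Conclusion: GVC is the first zero-shot paradigm that uses a pretrained video generative model directly as the codec, without fine-tuning or retraining, providing a new architectural direction and a practical technical path for generative video compression.
Abstract: Existing generative video compression methods use generative models only as post-hoc reconstruction modules atop conventional codecs. We propose Generative Video Codec (GVC), a zero-shot framework that turns a pretrained video generative model into the codec itself: the transmitted bitstream directly specifies the generative decoding trajectory, with no retraining required. To enable this, we convert the deterministic rectified-flow ODE of modern video foundation models into an equivalent SDE at inference time, unlocking per-step stochastic injection points for codebook-driven compression. Building on this unified backbone, we instantiate three complementary conditioning strategies: Image-to-Video (I2V) with adaptive tail-frame atom allocation, Text-to-Video (T2V) operating at near-zero side information as a pure generative prior, and First-Last-Frame-to-Video (FLF2V) with boundary-sharing GOP chaining for dual-anchor temporal control. Together, these variants span a principled trade-off space between spatial fidelity, temporal coherence, and compression efficiency. Experiments on standard benchmarks show that GVC achieves high-quality reconstruction below 0.002 bpp while supporting flexible bitrate control through a single hyperparameter.
[153] Scene Grounding In the Wild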
Tamir Cohen,Leo Segre,Shay Shomer-Chai,Shai Avidan,Hadar Averbuch-Elor
Main category: cs.CV
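TL;DR: This paper proposes a framework that addresses globally consistent 3D reconstruction of large-scale scenes from non-overlapping images by aligning partial reconstructions to a pseudo-synthetic reference model generated with Google Earth Studio. The reference model is represented with semantically augmented 3D Gaussian Splatting, and an inverse feature-based optimization estimates 6DoF pose plus scale. The WikiEarth dataset is also released.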
Details
Motivation: When input images lack overlap, existing reconstruction methods tend to produce disconnected or wrongly merged geometry and struggle to reach a globally consistent alignment.
Method: Geospatially accurate pseudo-synthetic renderings from Google Earth Studio serve as a complete reference model; the model is represented with semantically augmented 3D Gaussian Splatting; an inverse feature-based optimization keeps the reference model fixed and estimates the global 6DoF pose and scale of each partial reconstruction relative to it.
Result: The approach significantly improves global alignment accuracy when initialized with various classical and learning-based reconstruction pipelines, mitigates failure modes of current end-to-end models, and is validated on the newly released WikiEarth dataset.
Conclusion: Shared semantics can bridge the large domain gap between real images and pseudo-synthetic renderings, making inverse feature alignment against a fixed reference model a viable route to consistent reconstruction of weakly overlapping scenes.
Abstract: Reconstructing accurate 3D models of large-scale real-world scenes from unstructured, in-the-wild imagery remains a core challenge in computer vision, especially when the input views have little or no overlap. In such cases, existing reconstruction pipelines often produce multiple disconnected partial reconstructions or erroneously merge non-overlapping regions into overlapping geometry. In this work, we propose a framework that grounds each partial reconstruction to a complete reference model of the scene, enabling globally consistent alignment even in the absence of visual overlap. We obtain reference models from dense, geospatially accurate pseudo-synthetic renderings derived from Google Earth Studio. These renderings provide full scene coverage but differ substantially in appearance from real-world photographs. Our key insight is that, despite this significant domain gap, both domains share the same underlying scene semantics. We represent the reference model using 3D Gaussian Splatting, augmenting each Gaussian with semantic features, and formulate alignment as an inverse feature-based optimization scheme that estimates a global 6DoF pose and scale while keeping the reference model fixed. Furthermore, we introduce the WikiEarth dataset, which registers existing partial 3D reconstructions with pseudo-synthetic reference models. We demonstrate that our approach consistently improves global alignment when initialized with various classical and learning-based pipelines, while mitigating failure modes of state-of-the-art end-to-end models. All code and data will be released.
[154] MA-Bench: Towards Fine-grained Micro-Action Understanding
Kun Li,Jihao Gu,Fei Wang,Zhiliang Wu,Hehe Fan,Dan Guo
Main category: cs.CV
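TL;DR: This paper presents MA-Bench, the first benchmark for evaluating multimodal large language models on micro-action understanding, with 1,000 videos and 12,000 QA pairs, together with the companion training set MA-Bench-Train (20.5K videos), which clearly improves the performance of Qwen3-VL-8B on micro-action reasoning and explanation tasks.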
Details
Motivation: Existing MLLMs lack a dedicated benchmark for micro-action understanding, which is important for human emotion analysis, and this has limited progress in the area.
Method: Builds the MA-Bench benchmark (three-tier evaluation architecture, 1,000 videos, 12,000 structured QA pairs), and further builds the large-scale training set MA-Bench-Train (20.5K videos with structured micro-action annotations) for fine-tuning MLLMs.
Result: 23 mainstream MLLMs perform poorly on MA-Bench, exposing weaknesses in modeling motion granularity and body-part dynamics; after fine-tuning on MA-Bench-Train, Qwen3-VL-8B improves clearly on micro-action reasoning and explanation tasks.
Conclusion: MA-Bench and its training set lay a foundation for advancing MLLMs' understanding of micro-actions and human behavior, filling the gap left by the absence of a benchmark in this area.
Abstract: With the rapid development of Multimodal Large Language Models (MLLMs), their potential in Micro-Action understanding, which plays a vital role in human emotion analysis, remains unexplored due to the absence of specialized benchmarks. To tackle this issue, we present MA-Bench, a benchmark comprising 1,000 videos and a three-tier evaluation architecture that progressively examines micro-action perception, relational comprehension, and interpretive reasoning. MA-Bench contains 12,000 structured question-answer pairs, enabling systematic assessment of both recognition accuracy and action interpretation. The results of 23 representative MLLMs reveal that there are significant challenges in capturing motion granularity and fine-grained body-part dynamics. To address these challenges, we further construct MA-Bench-Train, a large-scale training corpus with 20.5K videos annotated with structured micro-action captions for fine-tuning MLLMs. The results of Qwen3-VL-8B fine-tuned on MA-Bench-Train show clear performance improvements across micro-action reasoning and explanation tasks. Our work aims to establish a foundational benchmark for advancing MLLMs in understanding subtle micro-action and human-related behaviors. Project Page: https://MA-Bench.github.io
[155] From Synthetic Data to Real Restorations: Diffusion Model for Patient-specific Dental Crown Completion
Dávid Pukanec,Tibor Kubík,Michal Španěl
Main category: cs.CV
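TL;DR: ToothCraft is a diffusion-based method for contextual generation of tooth crowns. It is trained on an artificially constructed dataset of incomplete teeth and shows strong crown completion performance on both synthetic and real data.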
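Shape completion of this kind is typically scored with intersection over union (IoU) and Chamfer Distance, the two metrics this entry reports. The sketch below shows one common definition of each in plain Python. The conventions used here (occupied-voxel sets for IoU; squared distances, averaged per point and summed over both directions for Chamfer) are assumptions, since the entry does not say which variant was used.

```python
def voxel_iou(voxels_a, voxels_b):
    """IoU of two shapes given as collections of occupied voxel coordinates."""
    a, b = set(voxels_a), set(voxels_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two point sets.

    For each point, take the squared distance to its nearest neighbour in
    the other set, average per set, and add the two directions.
    """
    def one_way(src, dst):
        total = 0.0
        for s in src:
            total += min(sum((si - di) ** 2 for si, di in zip(s, d)) for d in dst)
        return total / len(src)
    return one_way(points_a, points_b) + one_way(points_b, points_a)

# Toy shapes: two voxels each, sharing one voxel.
pred = [(0, 0, 0), (1, 0, 0)]
truth = [(1, 0, 0), (2, 0, 0)]
iou = voxel_iou(pred, truth)
cd = chamfer_distance(pred, truth)
```

Because Chamfer values depend on the coordinate scale and on the squared-versus-unsquared convention, a number such as 0.00034 is only comparable between methods evaluated with the same normalization.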
Details
Motivation: Training data of defective teeth for automated tooth crown completion is lacking.
Method: Proposes ToothCraft, a 3D crown generation method based on a conditional diffusion model, and designs a data augmentation pipeline that synthesizes diverse incomplete tooth geometries from public datasets of complete dental arches (3DS, ODD).
Result: On synthetically damaged test samples the model reaches an IoU of 81.8% and a Chamfer Distance of 0.00034; it can be applied directly to real cases, and the generated crowns intersect minimally with the opposing dentition, reducing the risk of occlusal interference.
Conclusion: ToothCraft completes tooth crown reconstruction effectively and robustly, addresses the scarcity of training data, and shows potential for clinical use.
Abstract: We present ToothCraft, a diffusion-based model for the contextual generation of tooth crowns, trained on artificially created incomplete teeth. Building upon recent advancements in conditioned diffusion models for 3D shapes, we developed a model capable of automated tooth crown completion conditioned on local anatomical context. To address the lack of training data for this task, we designed an augmentation pipeline that generates incomplete tooth geometries from a publicly available dataset of complete dental arches (3DS, ODD). By synthesising a diverse set of training examples, our approach enables robust learning across a wide spectrum of tooth defects. Experimental results demonstrate the strong capability of our model to reconstruct complete tooth crowns, achieving an intersection over union (IoU) of 81.8% and a Chamfer Distance (CD) of 0.00034 on synthetically damaged testing restorations. Our experiments demonstrate that the model can be applied directly to real-world cases, effectively filling in incomplete teeth, while generated crowns show minimal intersection with the opposing dentition, thus reducing the risk of occlusal interference. Access to the code, model weights, and dataset information will be available at: https://github.com/ikarus1211/VISAPP_ToothCraft
[156] The Limits of Learning from Pictures and Text: Vision-Language Models and Embodied Scene Understanding
Gillian Rosenberg,Skylar Stadhard,Bruce C. Hansen,Michelle R. Greene
Main category: cs.CV
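TL;DR: By comparing 18 vision-language models (VLMs) with more than 2,000 human participants on 15 high-level scene understanding tasks, this paper finds that VLMs approach human-level performance on general knowledge tasks but show a stable, hard-to-eliminate deficit in understanding object affordances. The deficit stems from the lack of agent-centered functional language in training corpora, indicating that statistical image-text co-occurrence alone is insufficient for acquiring the functional knowledge that embodied cognition requires.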
Details
Motivation: To test the distributional hypothesis, that is, whether the statistical co-occurrence of language and images alone is enough to support every dimension of human scene understanding, in particular the understanding of affordances, which requires embodied experience.
Method: Two experiments are run. Experiment 1 compares descriptions generated by 18 VLMs and by 2,000+ humans on 15 types of scene understanding task (general knowledge, affordances, sensory, affective, prediction), introducing a new metric, Human-Calibrated Cosine Distance (HCD), to measure how similar model outputs are to the distribution of human responses. Experiment 2 tests six mechanistic hypotheses and uses corpus analysis to examine the scarcity of affordance language.
Result: VLMs approach human-level performance on general knowledge tasks but show a robust deficit on affordance tasks that is not improved by prompt engineering or newer model releases; the deficit is structural, not stylistic, and is not relieved by providing explicit spatial information; image captioning datasets do contain very little agent-oriented affordance language.
Conclusion: Distributional learning from paired images and text alone is insufficient for affordance-based scene understanding; some dimensions of human visual cognition, such as affordances, depend on embodied, three-dimensional, agent-centered interaction that static images and text cannot encode.
Abstract: What information is sufficient to learn the full richness of human scene understanding? The distributional hypothesis holds that the statistical co-occurrence of language and images captures the conceptual knowledge underlying visual cognition. Vision-language models (VLMs) are trained on massive paired text-image corpora but lack embodied experience, making them an ideal test of the distributional hypothesis. We report two experiments comparing descriptions generated by 18 VLMs to those of over 2000 human observers across 15 high-level scene understanding tasks, spanning general knowledge, affordances, sensory experiences, affective responses, and future prediction. Because many tasks lack ground truth answers, we developed a Human-Calibrated Cosine Distance (HCD) metric that measures VLM output similarity to the distribution of human responses, scaled by within-human variability. In Experiment 1, VLMs approached human-level performance on general knowledge tasks, but showed a robust deficit for affordance tasks that resisted prompt engineering and did not improve with newer model releases. In Experiment 2, we tested six mechanistic hypotheses for explaining this affordance gap, finding that the deficit was structural rather than stylistic and was not resolved by providing explicit spatial information. Corpus analyses revealed that image captioning datasets contain sparse agent-addressed affordance language, consistent with Gricean accounts of why embodied knowledge may be systematically underrepresented in language. Together, these findings suggest that distributional learning from images and text is insufficient for affordance-based scene understanding, implying that some dimensions of human visual cognition may require the kind of agent-centered, three-dimensional experience that no photograph or caption can encode.
[157] From Static to Dynamic: Exploring Self-supervised Image-to-Video Representation Transfer Learning
Yang Liu,Qianqian Xu,Peisong Wen,Siran Dai,Xilin Zhao,Qingming Huang
Main category: cs.CV
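TL;DR: This paper proposes the Co-Settle framework, which adds a lightweight projection layer on top of a frozen image-pretrained encoder and combines a temporal cycle consistency objective with a semantic separability constraint to ease the trade-off between temporal consistency and inter-video semantic separability in image-to-video transfer.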
Details
Motivation: When image-pretrained models are transferred to video tasks, complex temporal modules and video fine-tuning tend to harm inter-video semantic separability, whereas reducing the tunable parameters weakens intra-video temporal consistency; the two goals are in tension.
Method: Proposes the Co-Settle framework: a lightweight projection layer is added on top of the frozen image-pretrained encoder, and a temporal cycle consistency loss is optimized jointly with a semantic separability constraint; a theoretical analysis shows that this improves the trade-off.
Result: Validated on 8 image-pretrained models, with consistent gains across multiple levels of video tasks after only 5 epochs of self-supervised training.
Conclusion: A lightweight projection with two-objective optimization can reconcile intra-video temporal consistency and inter-video semantic separability without damaging the image prior, offering a new approach to image-to-video transfer.
Abstract: Recent studies have made notable progress in video representation learning by transferring image-pretrained models to video tasks, typically with complex temporal modules and video fine-tuning. However, fine-tuning heavy modules may compromise inter-video semantic separability, i.e., the essential ability to distinguish objects across videos. Reducing the tunable parameters, on the other hand, hinders intra-video temporal consistency, which is required for stable representations of the same object within a video. This dilemma indicates a potential trade-off between the intra-video temporal consistency and inter-video semantic separability during image-to-video transfer. To this end, we propose the Consistency-Separability Trade-off Transfer Learning (Co-Settle) framework, which applies a lightweight projection layer on top of the frozen image-pretrained encoder to adjust representation space with a temporal cycle consistency objective and a semantic separability constraint. We further provide theoretical support showing that the optimized projection yields a better trade-off between the two properties under appropriate conditions. Experiments on eight image-pretrained models demonstrate consistent improvements across multiple levels of video tasks with only five epochs of self-supervised training. The code is available at https://github.com/yafeng19/Co-Settle.
[158] VGGRPO: Towards World-Consistent Video Generation with 4D Latent Reward
Zhaochong An,Orest Kupyn,Théo Uscidda,Andrea Colaco,Karan Ahuja,Serge Belongie,Mar Gonzalez-Franco,Marta Tintore Gazulla
Main category: cs.CV
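TL;DR: This paper proposes the VGGRPO framework, which introduces a Latent Geometry Model (LGM) to connect the latent representation of a video diffusion model to a geometry foundation model and performs geometry-aware reinforcement learning in latent space, improving the geometric consistency of generated video without modifying the pretrained model's architecture, in particular for dynamic scenes.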
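The optimization step named in this entry, Group Relative Policy Optimization, scores each sample relative to the other samples generated for the same prompt. The sketch below shows that normalization in plain Python. The equal weighting of the two rewards and the use of the population standard deviation are assumptions; the entry does not say how the camera-smoothness and reprojection rewards are combined or normalized.

```python
import statistics

def combined_reward(smoothness, reprojection, w_smooth=0.5, w_reproj=0.5):
    """Hypothetical weighted sum of the two geometry rewards."""
    return w_smooth * smoothness + w_reproj * reprojection

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so samples
    are ranked against their siblings instead of against a learned critic.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Toy group of three generated videos with (smoothness, reprojection) scores.
scores = [(0.2, 0.4), (0.5, 0.5), (0.9, 0.7)]
rewards = [combined_reward(s, r) for s, r in scores]
advantages = group_relative_advantages(rewards)
```

Samples with above-average reward receive positive advantages and are reinforced, while below-average samples are discouraged. Computing the rewards from latents is what lets a method of this kind skip decoding every sample to RGB.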
Details
Motivation: Existing ways of improving geometric consistency in video generation either modify the model architecture, which harms the generalization of the pretrained model, or rely on RGB-space rewards, which are computationally expensive and limited to static scenes.
Method: Proposes VGGRPO: 1) a Latent Geometry Model (LGM) connects diffusion latents to a 4D geometry reconstruction model; 2) Group Relative Policy Optimization is carried out in latent space with a camera motion smoothness reward and a geometry reprojection consistency reward.
Result: Camera stability, geometric consistency, and video quality improve significantly on both static and dynamic benchmarks, and avoiding repeated VAE decoding greatly reduces compute overhead.
Conclusion: VGGRPO delivers efficient latent-space, geometry-guided video post-training that needs no architectural changes, does not depend on RGB space, and supports dynamic scenes, offering a new paradigm for world-consistent video generation.
Abstract: Large-scale video diffusion models achieve impressive visual quality, yet often fail to preserve geometric consistency. Prior approaches improve consistency either by augmenting the generator with additional modules or applying geometry-aware alignment. However, architectural modifications can compromise the generalization of internet-scale pretrained models, while existing alignment methods are limited to static scenes and rely on RGB-space rewards that require repeated VAE decoding, incurring substantial compute overhead and failing to generalize to highly dynamic real-world scenes. To preserve the pretrained capacity while improving geometric consistency, we propose VGGRPO (Visual Geometry GRPO), a latent geometry-guided framework for geometry-aware video post-training. VGGRPO introduces a Latent Geometry Model (LGM) that stitches video diffusion latents to geometry foundation models, enabling direct decoding of scene geometry from the latent space. By constructing LGM from a geometry model with 4D reconstruction capability, VGGRPO naturally extends to dynamic scenes, overcoming the static-scene limitations of prior methods. Building on this, we perform latent-space Group Relative Policy Optimization with two complementary rewards: a camera motion smoothness reward that penalizes jittery trajectories, and a geometry reprojection consistency reward that enforces cross-view geometric coherence. Experiments on both static and dynamic benchmarks show that VGGRPO improves camera stability, geometry consistency, and overall quality while eliminating costly VAE decoding, making latent-space geometry-guided reinforcement an efficient and flexible approach to world-consistent video generation.
[159] Think over Trajectories: Leveraging Video Generation to Reconstruct GPS Trajectories from Cellular Signaling
Ruixing Zhang,Hanzhang Jiang,Leilei Sun,Liangzhe Han,Jibin Wang,Weifeng Lv
Main category: cs.CV
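TL;DR: This paper proposes Sig2GPS, which recasts the reconstruction of GPS trajectories from cellular signaling records as an image-to-video generation task in the map-visual domain: signaling traces are rendered on a map and a video generation model is trained to draw the continuous GPS path, with reinforcement learning used to improve accuracy.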
Details
Motivation: Cellular signaling records have broad coverage but provide only coarse location information (such as serving-cell IDs), so they cannot directly support applications that need high-precision GPS trajectories.
Method: Sig2GPS is modeled as an image-to-video generation task: the signaling trace is first rendered as a map image, and a video generation model then produces a video of the corresponding GPS path; a paired signaling-to-trajectory video dataset is built to fine-tune an open-source video model, and a trajectory-aware reinforcement learning method is introduced to improve generation fidelity.
Result: The method significantly outperforms strong engineered and learning-based baselines on large-scale real-world datasets, and shows good scalability and cross-city transferability on next-GPS-point prediction.
Conclusion: The map-visual video generation paradigm provides a practical interface for trajectory data mining, supporting direct generation and refinement of continuous paths under map constraints.
Abstract: Mobile devices continuously interact with cellular base stations, generating massive volumes of signaling records that provide broad coverage for understanding human mobility. However, such records offer only coarse location cues (e.g., serving-cell identifiers) and therefore limit their direct use in applications that require high-precision GPS trajectories. This paper studies the Sig2GPS problem: reconstructing GPS trajectories from cellular signaling. Inspired by how domain experts often lay the signaling trace on a map and sketch the corresponding GPS route, and unlike conventional solutions that rely on complex multi-stage engineering pipelines or regress coordinates, Sig2GPS is reframed as an image-to-video generation task that directly operates in the map-visual domain: signaling traces are rendered on a map, and a video generation model is trained to draw a continuous GPS path. To support this paradigm, a paired signaling-to-trajectory video dataset is constructed to fine-tune an open-source video model, and a trajectory-aware reinforcement learning-based optimization method is introduced to improve generation fidelity via rewards. Experiments on large-scale real-world datasets show substantial improvements over strong engineered and learning-based baselines, while additional results on next GPS prediction indicate scalability and cross-city transferability. Overall, these results suggest that map-visual video generation provides a practical interface for trajectory data mining by enabling direct generation and refinement of continuous paths under map constraints.
[160] Drive-Through 3D Vehicle Exterior Reconstruction via Dynamic-Scene SfM and Distortion-Aware Gaussian Splatting
Nitin Kulkarni,Akhil Devarashetti,Charlie Cluss,Livio Forte,Philip Schneider,Chunming Qiao,Alina Vereshchaka
Main category: cs.CV
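TL;DR: This paper proposes an end-to-end high-fidelity 3D reconstruction method for real car dealership settings. Using a two-pillar camera rig, dynamic vehicle segmentation, distortion-robust feature matching, CAD-guided SfM optimization, and distortion-aware 3D Gaussian Splatting rendering, it produces high-quality, interactive 3D vehicle models without a professional studio.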
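This entry reports its main result as a PSNR gain in decibels over standard 3D-GS. Since PSNR is logarithmic, a dB difference corresponds to a multiplicative change in mean squared error; the sketch below shows the standard PSNR definition and that conversion. The function names are illustrative and the pixel range of 1.0 is an assumption.

```python
import math

def psnr(mse, max_value=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)

def mse_ratio_for_gain(db_gain):
    """Factor by which MSE shrinks when PSNR rises by `db_gain` dB."""
    return 10.0 ** (db_gain / 10.0)

# The gain quoted in this entry, expressed as an error reduction factor.
ratio = mse_ratio_for_gain(3.85)
print(round(ratio, 2))
```

Read this way, a gain of 3.85 dB means the mean squared error on held-out views is a bit more than halved (a factor of roughly 2.4), provided both methods are evaluated on the same images and pixel range.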
Details
Motivation: Online automotive marketplaces need high-fidelity 3D vehicle models to raise buyer confidence, but real dealership settings (moving vehicles, cluttered backgrounds, wide-angle distortion, specular paint, non-rigid wheel motion) severely challenge conventional reconstruction methods.
Method: Proposes a four-stage end-to-end pipeline: 1) dynamic vehicle instance segmentation based on SAM 3 and motion gating, explicitly masking non-rigid wheels; 2) robust feature matching with RoMa v2 on raw, distorted 4K imagery, guided by semantic confidence masks; 3) rig-aware SfM optimization with CAD-derived priors to eliminate scale drift; 4) distortion-aware 3D Gaussian Splatting (3DGUT) with a stochastic MCMC densification strategy to render reflective surfaces.
Result: On 25 real vehicles from 10 dealerships, the method reaches a PSNR of 28.66 dB, an SSIM of 0.89, and an LPIPS of 0.21, a 3.85 dB PSNR gain over standard 3D-GS, and produces interactive, inspection-grade 3D models.
Conclusion: The method overcomes the combined real-world constraints of motion, distortion, and reflection, achieving high-fidelity 3D vehicle reconstruction without a controlled studio and improving the visual experience and credibility of online car sales.
Abstract: High-fidelity 3D reconstruction of vehicle exteriors improves buyer confidence in online automotive marketplaces, but generating these models in cluttered dealership drive-throughs presents severe technical challenges. Unlike static-scene photogrammetry, this setting features a dynamic vehicle moving against heavily cluttered, static backgrounds. This problem is further compounded by wide-angle lens distortion, specular automotive paint, and non-rigid wheel rotations that violate classical epipolar constraints. We propose an end-to-end pipeline utilizing a two-pillar camera rig. First, we resolve dynamic-scene ambiguities by coupling SAM 3 for instance segmentation with motion-gating to cleanly isolate the moving vehicle, explicitly masking out non-rigid wheels to enforce strict epipolar geometry. Second, we extract robust correspondences directly on raw, distorted 4K imagery using the RoMa v2 learned matcher guided by semantic confidence masks. Third, these matches are integrated into a rig-aware SfM optimization that utilizes CAD-derived relative pose priors to eliminate scale drift. Finally, we use a distortion-aware 3D Gaussian Splatting framework (3DGUT) coupled with a stochastic Markov Chain Monte Carlo (MCMC) densification strategy to render reflective surfaces. Evaluations on 25 real-world vehicles across 10 dealerships demonstrate that our full pipeline achieves a PSNR of 28.66 dB, an SSIM of 0.89, and an LPIPS of 0.21 on held-out views, representing a 3.85 dB improvement over standard 3D-GS, delivering inspection-grade interactive 3D models without controlled studio infrastructure.
[161] Make Geometry Matter for Spatial Reasoning
Shihua Zhang,Qiuhong Shen,Shizun Wang,Tianbo Pan,Xinchao Wang
Main category: cs.CV
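TL;DR: This paper proposes the GeoSR framework, which uses Geometry-Unleashing Masking and Geometry-Guided Fusion to strengthen how vision-language models use geometric information, significantly improving spatial reasoning in both static and dynamic scenes.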
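The two mechanisms named in this entry can be pictured with a toy example: hide some 2D vision tokens during training, and let a gate decide how much of each geometry token is mixed in. The sketch below is a loose illustration only. Zeroing masked tokens, the sigmoid gate, and the additive fusion rule are all assumptions; the entry does not describe the actual masking strategy or routing design.

```python
import math
import random

def mask_vision_tokens(tokens, ratio, rng):
    """Zero out a random subset of 2D vision tokens (training-time only)."""
    n_mask = int(len(tokens) * ratio)
    masked = set(rng.sample(range(len(tokens)), n_mask))
    return [[0.0] * len(t) if i in masked else list(t) for i, t in enumerate(tokens)]

def gated_fusion(vision, geometry, gate_logits):
    """Add each geometry token to its vision token, scaled by a sigmoid gate."""
    fused = []
    for v, g, logit in zip(vision, geometry, gate_logits):
        gate = 1.0 / (1.0 + math.exp(-logit))
        fused.append([vi + gate * gi for vi, gi in zip(v, g)])
    return fused

rng = random.Random(0)
vision = [[1.0, 1.0] for _ in range(4)]       # four toy vision tokens
geometry = [[2.0, 4.0] for _ in range(4)]     # matching toy geometry tokens
masked_vision = mask_vision_tokens(vision, ratio=0.5, rng=rng)
fused = gated_fusion(masked_vision, geometry, gate_logits=[0.0, 0.0, 0.0, 0.0])
```

Where a vision token has been masked, the fused token carries only geometric evidence, which is the intuition behind forcing the model to consult geometry tokens. In a real model the gate logits would be predicted per region, not fixed.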
Details
Motivation: Existing vision-language models perform poorly at spatial reasoning; even when geometry tokens from 3D foundation models are injected, the models rely too heavily on 2D visual cues and fail to use the geometric information effectively.
Method: Proposes the GeoSR framework with two core components: (1) Geometry-Unleashing Masking, which masks part of the 2D vision tokens during training to weaken non-geometric shortcuts; (2) Geometry-Guided Fusion, a gated routing mechanism that adaptively strengthens the contribution of geometry tokens in critical regions.
Result: GeoSR significantly outperforms prior methods on multiple static and dynamic spatial reasoning benchmarks, reaching new SOTA performance.
Conclusion: GeoSR leads vision-language models to actively use geometry tokens for spatial reasoning, confirming the key role of geometric information in improving spatial understanding.
Abstract: Empowered by large-scale training, vision-language models (VLMs) achieve strong image and video understanding, yet their ability to perform spatial reasoning in both static scenes and dynamic videos remains limited. Recent advances try to handle this limitation by injecting geometry tokens from pretrained 3D foundation models into VLMs. Nevertheless, we observe that naive token fusion followed by standard fine-tuning in this line of work often leaves such geometric cues underutilized for spatial reasoning, as VLMs tend to rely heavily on 2D visual cues. In this paper, we propose GeoSR, a framework designed to make geometry matter by encouraging VLMs to actively reason with geometry tokens. GeoSR introduces two key components: (1) Geometry-Unleashing Masking, which strategically masks portions of 2D vision tokens during training to weaken non-geometric shortcuts and force the model to consult geometry tokens for spatial reasoning; and (2) Geometry-Guided Fusion, a gated routing mechanism that adaptively amplifies geometry token contributions in regions where geometric evidence is critical. Together, these designs unleash the potential of geometry tokens for spatial reasoning tasks. Extensive experiments on both static and dynamic spatial reasoning benchmarks demonstrate that GeoSR consistently outperforms prior methods and establishes new state-of-the-art performance by effectively leveraging geometric information. The project page is available at https://suhzhang.github.io/GeoSR/.
[162] Beyond Language: Grounding Referring Expressions with Hand Pointing in Egocentric Vision
Ling Li,Bowen Liu,Zinuo Zhan,Peng Jie,Jianhui Zhong,Kenglun Chang,Zhidong Deng
Main category: cs.CV
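TL;DR: This paper introduces EgoPoint-Ground, the first large-scale multimodal dataset for grounding referring expressions that combine hand pointing and speech in egocentric scenes, together with a benchmark; it also proposes SV-CoT, which casts grounding as a Visual Chain-of-Thought reasoning process that fuses gestural and linguistic cues, yielding an 11.7% absolute improvement over existing methods.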
Details
Motivation: Traditional visual grounding (VG) relies on textual descriptions, struggles with linguistic ambiguity, and ignores the non-verbal deictic cues (such as hand pointing) common in real interactions; in natural egocentric interaction, hand pointing combined with speech is the most intuitive way to refer to objects, so data and methods suited to this setting are needed. Method: Builds the EgoPoint-Ground dataset (over 15k samples with hand-target bounding box pairs and dense semantic captions) and establishes a benchmark for hand-pointing referring expression resolution; proposes the SV-CoT framework, which jointly models gestural and linguistic cues through Visual Chain-of-Thought reasoning to perform structured grounding inference. Result: SV-CoT achieves an 11.7% absolute improvement over existing methods on this task, mitigating semantic ambiguity and improving agents' understanding of multimodal physical intents. Conclusion: This work is the first to systematically advance egocentric, multimodal (hand plus speech) deictic visual grounding; its new dataset, benchmark, and method (SV-CoT) provide groundwork for natural human-machine interaction in embodied intelligence. Abstract: Traditional Visual Grounding (VG) predominantly relies on textual descriptions to localize objects, a paradigm that inherently struggles with linguistic ambiguity and often ignores non-verbal deictic cues prevalent in real-world interactions. In natural egocentric engagements, hand-pointing combined with speech forms the most intuitive referring mechanism. To bridge this gap, we introduce EgoPoint-Ground, the first large-scale multimodal dataset dedicated to egocentric deictic visual grounding. Comprising over 15k interactive samples in complex scenes, the dataset provides rich, multi-grained annotations including hand-target bounding box pairs and dense semantic captions. We establish a comprehensive benchmark for hand-pointing referring expression resolution, evaluating a wide spectrum of mainstream Multimodal Large Language Models (MLLMs) and state-of-the-art VG architectures. Furthermore, we propose SV-CoT, a novel baseline framework that reformulates grounding as a structured inference process, synergizing gestural and linguistic cues through a Visual Chain-of-Thought paradigm. Extensive experiments demonstrate that SV-CoT achieves an 11.7% absolute improvement over existing methods, effectively mitigating semantic ambiguity and advancing the capability of agents to comprehend multimodal physical intents. The dataset and code will be made publicly available.
[163] Tunable Soft Equivariance with Guarantees
Md Ashiqur Rahman,Lim Jun Hao,Jeremiah Jiang,Teck-Yian Lim,Raymond A. Yeh
Main category: cs.CV
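TL;DR: This paper proposes a general framework for building soft equivariant models by projecting model weights into a designed subspace; it applies to any pre-trained architecture and is validated on multiple tasks.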
Details
Motivation: Strict equivariance is rarely satisfied in real-world data, which can limit model performance, so the degree of equivariance needs to be controllable. Method: Constructs soft equivariant models by projecting the model weights into a designed subspace, and provides theoretical bounds on the induced equivariance error. Result: Validated on multiple pre-trained backbones, including ViT and ResNet, with gains on image classification, semantic segmentation, and human-trajectory prediction; on the ImageNet benchmark it improves performance while reducing equivariance error. Conclusion: The framework is a general and effective approach to soft equivariant modeling that balances performance gains with control over equivariance. Abstract: Equivariance is a fundamental property in computer vision models, yet strict equivariance is rarely satisfied in real-world data, which can limit a model's performance. Controlling the degree of equivariance is therefore desirable. We propose a general framework for constructing soft equivariant models by projecting the model weights into a designed subspace. The method applies to any pre-trained architecture and provides theoretical bounds on the induced equivariance error. Empirically, we demonstrate the effectiveness of our method on multiple pre-trained backbones, including ViT and ResNet, across image classification, semantic segmentation, and human-trajectory prediction tasks. Notably, our approach improves the performance while simultaneously reducing equivariance error on the competitive ImageNet benchmark.
[164] Zero-Shot Depth from Defocus
Yiming Zuo,Hongyu Wen,Venkat Subramanian,Patrick Chen,Karhan Kayan,Mario Bijelic,Felix Heide,Jia Deng
Main category: cs.CV
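TL;DR: This paper presents ZEDD, a new real-world benchmark for zero-shot Depth from Defocus (DfD); FOSSA, a Transformer-based architecture; and a synthetic data generation pipeline, achieving substantial gains on the DfD task.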
Details
Motivation: Addresses the overfitting of existing DfD methods to particular datasets and their poor generalization, focusing on the more challenging and practical setting of zero-shot generalization. Method: Builds ZEDD, a high-quality real-world DfD benchmark; designs FOSSA, a Transformer-based network with a stack attention layer that uses a focus distance embedding; develops a new training data pipeline that generates synthetic focus stacks from large-scale RGBD datasets. Result: Significantly outperforms baselines on ZEDD and other benchmarks, reducing errors by up to 55.7%; the ZEDD benchmark, code, and model checkpoints are publicly released. Conclusion: FOSSA, combined with the ZEDD benchmark and the synthetic data strategy, improves the zero-shot generalization of DfD models and their potential for practical deployment. Abstract: Depth from Defocus (DfD) is the task of estimating a dense metric depth map from a focus stack. Unlike previous works overfitting to a certain dataset, this paper focuses on the challenging and practical setting of zero-shot generalization. We first propose a new real-world DfD benchmark ZEDD, which contains 8.3x more scenes and significantly higher quality images and ground-truth depth maps compared to previous benchmarks. We also design a novel network architecture named FOSSA. FOSSA is a Transformer-based architecture with novel designs tailored to the DfD task. The key contribution is a stack attention layer with a focus distance embedding, allowing efficient information exchange across the focus stack. Finally, we develop a new training data pipeline allowing us to utilize existing large-scale RGBD datasets to generate synthetic focus stacks. Experiment results on ZEDD and other benchmarks show a significant improvement over the baselines, reducing errors by up to 55.7%. The ZEDD benchmark is released at https://zedd.cs.princeton.edu. The code and checkpoints are released at https://github.com/princeton-vl/FOSSA.
[165] GaussianGPT: Towards Autoregressive 3D Gaussian Scene Generation
Nicolas von Lützow,Barbara Rössle,Katharina Schmid,Matthias Nießner
Main category: cs.CV
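TL;DR: This paper proposes GaussianGPT, a Transformer-based, fully autoregressive 3D generative model that compresses Gaussian primitives into a discrete latent grid and models the serialized tokens, enabling controllable and scalable 3D scene generation.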
Details
Motivation: Explores a fully autoregressive paradigm as an alternative to the prevailing diffusion and flow-matching approaches, in order to support more flexible 3D generation tasks (such as completion, outpainting, and controllable sampling) while remaining compatible with neural rendering pipelines. Method: Uses a sparse 3D convolutional autoencoder with vector quantization to compress Gaussian primitives into a discrete latent grid; the resulting tokens are serialized and modeled autoregressively by a causal Transformer with 3D rotary positional embedding. Result: Generates 3D Gaussians via next-token prediction, supporting scene completion, outpainting, temperature-controlled sampling, and flexible generation horizons, while staying compatible with neural rendering. Conclusion: Autoregressive Transformers can serve as a paradigm complementary to diffusion models, offering a new route to controllable, context-aware 3D generation. Abstract: Most recent advances in 3D generative modeling rely on diffusion or flow-matching formulations. We instead explore a fully autoregressive alternative and introduce GaussianGPT, a transformer-based model that directly generates 3D Gaussians via next-token prediction, thus facilitating full 3D scene generation. We first compress Gaussian primitives into a discrete latent grid using a sparse 3D convolutional autoencoder with vector quantization. The resulting tokens are serialized and modeled using a causal transformer with 3D rotary positional embedding, enabling sequential generation of spatial structure and appearance. Unlike diffusion-based methods that refine scenes holistically, our formulation constructs scenes step-by-step, naturally supporting completion, outpainting, controllable sampling via temperature, and flexible generation horizons. This formulation leverages the compositional inductive biases and scalability of autoregressive modeling while operating on explicit representations compatible with modern neural rendering pipelines, positioning autoregressive transformers as a complementary paradigm for controllable and context-aware 3D generation.
[166] Detailed Geometry and Appearance from Opportunistic Motion
Ryosuke Hirai,Kohei Yamashita,Antoine Guédon,Ryo Kawahara,Vincent Lepetit,Ko Nishino
Main category: cs.CV
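TL;DR: This paper proposes a new method that uses object motion (for example, a person manipulating an object) to provide virtual viewpoints for static, sparse cameras; by jointly optimizing object pose and geometry and decomposing appearance into diffuse and specular components, it significantly improves 3D reconstruction accuracy under sparse views.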