cs.CL

[1] Trust, Safety, and Accuracy: Assessing LLMs for Routine Maternity Advice

V Sai Divya,A Bhanusree,Rimjhim,K Venkata Krishna Rao

Main category: cs.CL

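TL;DR: This study evaluates the potential of large language models (ChatGPT-4o, Perplexity AI, and GeminiAI) to provide reliable, understandable maternal health information to pregnant women in rural India. Perplexity came closest to expert answers semantically, while ChatGPT-4o did better on readability and use of medical terminology.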

Details

Motivation: Limited medical resources and infrastructure in rural India make it hard for pregnant women to obtain reliable maternal health information. Meanwhile, rising internet use among rural women creates new opportunities for health education through digital tools.

Method: Seventeen pregnancy-related questions were posed to ChatGPT-4o, Perplexity AI, and GeminiAI, and the answers were compared with responses from maternal health professionals. Content quality was assessed with semantic similarity, noun overlap, and readability metrics.

Result: Perplexity AI was closest to the expert answers at the semantic level; ChatGPT-4o produced clearer, more understandable text with more accurate use of medical terminology; GeminiAI performed comparatively worse.

Conclusion: LLMs could serve as scalable aids for maternal health education in under-resourced regions such as rural India, but they need to balance accuracy and understandability.

Abstract: Access to reliable maternal healthcare information is a major challenge in rural India due to limited medical resources and infrastructure. With over 830 million internet users and nearly half of rural women online, digital tools offer new opportunities for health education. This study evaluates large language models (LLMs) like ChatGPT-4o, Perplexity AI, and GeminiAI to provide reliable and understandable pregnancy-related information. Seventeen pregnancy-focused questions were posed to each model and compared with responses from maternal health professionals. Evaluations used semantic similarity, noun overlap, and readability metrics to measure content quality. Results show Perplexity closely matched expert semantics, while ChatGPT-4o produced clearer, more understandable text with better medical terminology. As internet access grows in rural areas, LLMs could serve as scalable aids for maternal health education. The study highlights the need for AI tools that balance accuracy and clarity to improve healthcare communication in underserved regions.

[2] Enhancing Financial Report Question-Answering: A Retrieval-Augmented Generation System with Reranking Analysis

Zhiyuan Cheng,Longying Lai,Yue Liu,Kai Cheng,Xiaoxi Qi

Main category: cs.CL

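TL;DR: This paper presents a retrieval-augmented generation (RAG) system for question answering over the 10-K reports of S&P 500 companies. Adding neural reranking substantially improves answer quality, raising correctness on the FinDER benchmark by 15.5 percentage points.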

Details

Motivation: Financial analysts must extract information from 10-K reports that often exceed 100 pages. Traditional approaches are inefficient and insufficiently accurate, so a more effective question-answering system is needed.

Method: Build a RAG system on hybrid retrieval (full-text plus semantic) and add a cross-encoder-based neural reranking stage after retrieval; evaluate systematically on the FinDER dataset (1,500 queries, five experimental groups).

Result: Reranking raises the share of answers scoring 8 or above from 33.5% to 49.0% and lowers the rate of completely incorrect answers from 35.3% to 22.5%, confirming its key contribution to financial RAG performance.

Conclusion: Neural reranking is a key component for improving answer quality in financial RAG systems; combined with modern language models and refined retrieval strategies, it yields clear gains over baseline methods.

Abstract: Financial analysts face significant challenges extracting information from lengthy 10-K reports, which often exceed 100 pages. This paper presents a Retrieval-Augmented Generation (RAG) system designed to answer questions about S&P 500 financial reports and evaluates the impact of neural reranking on system performance. Our pipeline employs hybrid search combining full-text and semantic retrieval, followed by an optional reranking stage using a cross-encoder model. We conduct systematic evaluation using the FinDER benchmark dataset, comprising 1,500 queries across five experimental groups. Results demonstrate that reranking significantly improves answer quality, achieving 49.0 percent correctness for scores of 8 or above compared to 33.5 percent without reranking, representing a 15.5 percentage point improvement. Additionally, the error rate for completely incorrect answers decreases from 35.3 percent to 22.5 percent. Our findings emphasize the critical role of reranking in financial RAG systems and demonstrate performance improvements over baseline methods through modern language models and refined retrieval strategies.
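The retrieve-then-rerank pipeline described for entry [2] can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how the full-text and semantic result lists are merged, so reciprocal rank fusion is an assumption here, and `toy_score` is a hypothetical stand-in for a real cross-encoder that would score each (query, passage) pair jointly.

```python
from typing import Callable, Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids into one ranking.

    Assumption: the paper does not name its fusion rule; RRF is used
    here only as a common, simple choice.
    """
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def rerank(query: str, candidates: List[str],
           score_pair: Callable[[str, str], float], top_n: int) -> List[str]:
    """Re-order candidates with a pairwise relevance scorer and keep top_n."""
    scored = [(score_pair(query, doc_id), doc_id) for doc_id in candidates]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_n]]


# Hypothetical toy corpus standing in for 10-K passages.
DOCS = {
    "d1": "revenue grew in fiscal 2023",
    "d2": "risk factors include litigation",
    "d3": "total revenue for fiscal 2023 was reported",
    "d4": "board of directors",
}


def toy_score(query: str, doc_id: str) -> float:
    """Stand-in for a cross-encoder: counts shared lowercase tokens."""
    shared = set(query.lower().split()) & set(DOCS[doc_id].lower().split())
    return float(len(shared))


full_text_hits = ["d1", "d2", "d3"]   # e.g. from keyword search
semantic_hits = ["d3", "d1", "d4"]    # e.g. from embedding search
fused = reciprocal_rank_fusion([full_text_hits, semantic_hits])
top = rerank("total revenue fiscal 2023", fused, toy_score, top_n=2)
print(fused, top)
```

The design point is that the first stage only has to be good at recall, because the more expensive pairwise scorer sees a short candidate list and decides the final order.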

[3] Rubric-Guided Fine-tuning of SpeechLLMs for Multi-Aspect, Multi-Rater L2 Reading-Speech Assessment

Aditya Kamlesh Parikh,Cristian Tejedor-Garcia,Catia Cucchiarini,Helmer Strik

Main category: cs.CL

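TL;DR: This paper proposes a rubric-guided, uncertainty-calibrated speech assessment framework that makes large speech-language models (SpeechLLMs) more reliable and interpretable for automatic scoring of second-language (L2) speech.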

Details

Motivation: SpeechLLMs struggle to align with the nuanced variability of human raters in automatic L2 speech assessment, so more reliable and interpretable assessment methods are needed.

Method: Build a reasoning framework around a three-aspect rubric (accuracy, fluency, and prosody); fine-tune Qwen2-Audio-7B-Instruct on multi-rater annotations; combine Gaussian uncertainty modeling with conformal calibration to obtain uncertainty-calibrated regression.

Result: The approach aligns with human ratings more closely than regression and classification baselines. It assesses fluency and prosody reliably, while accuracy remains difficult to assess. Its outputs come with interpretable confidence intervals.

Conclusion: Rubric-guided, uncertainty-calibrated reasoning offers a viable path toward trustworthy and explainable SpeechLLM-based speech assessment.

Abstract: Reliable and interpretable automated assessment of second-language (L2) speech remains a central challenge, as large speech-language models (SpeechLLMs) often struggle to align with the nuanced variability of human raters. To address this, we introduce a rubric-guided reasoning framework that explicitly encodes multi-aspect human assessment criteria: accuracy, fluency, and prosody, while calibrating model uncertainty to capture natural rating variability. We fine-tune the Qwen2-Audio-7B-Instruct model using multi-rater human judgments and develop an uncertainty-calibrated regression approach supported by conformal calibration for interpretable confidence intervals. Our Gaussian uncertainty modeling and conformal calibration approach achieves the strongest alignment with human ratings, outperforming regression and classification baselines. The model reliably assesses fluency and prosody while highlighting the inherent difficulty of assessing accuracy. Together, these results demonstrate that rubric-guided, uncertainty-calibrated reasoning offers a principled path toward trustworthy and explainable SpeechLLM-based speech assessment.
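To make the combination of Gaussian uncertainty and conformal calibration in entry [3] concrete, here is a small sketch of normalized split conformal prediction. It assumes the model emits a predicted score `mu` and a predicted standard deviation `sigma` per utterance; the abstract does not give the exact calibration recipe, so the nonconformity score `|y - mu| / sigma` and all function names are illustrative assumptions.

```python
import math
from typing import List, Tuple


def conformal_quantile(scores: List[float], alpha: float) -> float:
    """Finite-sample conformal quantile of the calibration scores.

    Uses rank ceil((n + 1) * (1 - alpha)); if that exceeds n, no finite
    threshold gives the requested coverage, so infinity is returned.
    """
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(scores)[rank - 1]


def calibrate(preds: List[float], sigmas: List[float],
              labels: List[float], alpha: float) -> float:
    """Compute the scaling factor q on a held-out calibration set.

    Assumed nonconformity score: absolute error divided by the model's
    own predicted standard deviation.
    """
    scores = [abs(y - mu) / s for mu, s, y in zip(preds, sigmas, labels)]
    return conformal_quantile(scores, alpha)


def interval(mu: float, sigma: float, q: float) -> Tuple[float, float]:
    """Prediction interval mu +/- q * sigma for a new utterance."""
    return (mu - q * sigma, mu + q * sigma)


# Hypothetical calibration data: predicted scores, predicted std devs,
# and human ratings for three utterances.
q = calibrate(preds=[3.0, 4.0, 2.0], sigmas=[1.0, 0.5, 4.0],
              labels=[3.5, 3.0, 4.0], alpha=0.5)
print(q, interval(3.0, 2.0, q))
```

Scaling the interval by the predicted `sigma` lets the width vary per utterance, which is one way an interval could reflect the rater variability the abstract mentions.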

[4] LLM NL2SQL Robustness: Surface Noise vs. Linguistic Variation in Traditional and Agentic Settings

Lifu Tu,Rongguang Wang,Tao Sheng,Sujjith Ravi,Dan Roth

Main category: cs.CL

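TL;DR: This paper introduces a robustness evaluation benchmark for NL2SQL systems covering about ten types of perturbations and evaluates several leading LLMs in both traditional and agentic settings. Models are sensitive to surface-level noise and to semantics-preserving linguistic variation, and the two kinds of perturbation affect the two settings differently.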

Details

Motivation: Real-world database environments are dynamic, noisy, and continuously evolving, whereas conventional NL2SQL benchmarks assume static schemas and well-formed user inputs and so do not measure robustness.

Method: Build a robustness benchmark with about ten perturbation types and evaluate state-of-the-art LLMs, including Grok-4.1, Gemini-3-Pro, Claude-Opus-4.6, and GPT-5.2, in both traditional and agentic settings.

Result: The models stay strong under several perturbations but degrade notably under character-level noise and under linguistic variation that preserves meaning. Surface-level noise hurts traditional pipelines more, while linguistic variation is the greater challenge in agentic settings.

Conclusion: Current NL2SQL systems still have a clear robustness bottleneck in handling linguistic variability, which calls for targeted improvement.

Abstract: Robustness evaluation for Natural Language to SQL (NL2SQL) systems is essential because real-world database environments are dynamic, noisy, and continuously evolving, whereas conventional benchmark evaluations typically assume static schemas and well-formed user inputs. In this work, we introduce a robustness evaluation benchmark containing approximately ten types of perturbations and conduct evaluations under both traditional and agentic settings. We assess multiple state-of-the-art large language models (LLMs), including Grok-4.1, Gemini-3-Pro, Claude-Opus-4.6, and GPT-5.2. Our results show that these models generally maintain strong performance under several perturbations; however, notable performance degradation is observed for surface-level noise (e.g., character-level corruption) and linguistic variation that preserves semantics while altering lexical or syntactic forms. Furthermore, we observe that surface-level noise causes larger performance drops in traditional pipelines, whereas linguistic variation presents greater challenges in agentic settings. These findings highlight the remaining challenges in achieving robust NL2SQL systems, particularly in handling linguistic variability.

[5] Evaluating Ill-Defined Tasks in Large Language Models

Yi Zhou,Basel Shbita

Main category: cs.CL

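TL;DR: This paper analyzes why current evaluations of LLMs on ill-defined tasks (such as complex instruction following and natural language to Mermaid diagrams) fall short: existing benchmarks and metrics suffer from limited coverage, inconsistent measurement, and unstable LLM judges. It argues for more robust, interpretable evaluation methods.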

Details

Motivation: LLM evaluations often target ill-defined tasks whose input and output spaces are unclear and whose success criteria are ambiguous, so the resulting signals are unreliable and non-diagnostic.

Method: Analyze two case studies: (1) Complex Instruction Following (CIF), identifying narrow coverage, sensitivity to phrasing, non-comparable metrics, and unstable LLM judges; and (2) Natural Language to Mermaid Sequence Diagrams (NL2Mermaid), showing how multi-faceted evaluation yields actionable insights beyond aggregate scores.

Result: Current evaluations frequently conflate distinct failure modes, producing scores that are unstable, non-diagnostic, and hard to act on.

Conclusion: Evaluation practice for ill-defined tasks has fundamental limitations, which motivates more robust, transparent, and interpretable evaluation designs.

Abstract: Many evaluations of Large Language Models (LLMs) target tasks that are inherently ill-defined, with unclear input and output spaces and ambiguous success criteria. We analyze why existing evaluation benchmarks and metrics fail to provide reliable or diagnostic signals of model capability for such tasks. We examine two case studies: Complex Instruction Following (CIF), where we identify recurring issues including limited coverage of real-world instruction complexity, sensitivity to instruction phrasing, inconsistent and non-comparable metrics, and instability introduced by LLM-based judges; and Natural Language to Mermaid Sequence Diagrams (NL2Mermaid), where we show how multi-faceted evaluation criteria can yield actionable insights beyond aggregate scores. Together, these case studies show that current evaluations frequently conflate distinct failure modes, yielding scores that are unstable, non-diagnostic, and difficult to act upon. Our findings expose fundamental limitations in existing evaluation practices for ill-defined tasks and motivate more robust, interpretable evaluation designs.

[6] Large Reasoning Models Struggle to Transfer Parametric Knowledge Across Scripts

Lucas Bandarkar,Alan Ansell,Trevor Cohn

Main category: cs.CL

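TL;DR: This paper shows that the cross-lingual knowledge transfer gap in modern large reasoning models stems primarily from a script barrier rather than from language or language-family differences. Empirical analysis and targeted interventions (source-language entity hints and training on synthetic SFT data) narrow the cross-script transfer gap.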

Details

Motivation: Identify the root cause of weak cross-lingual knowledge transfer in modern large reasoning models, and clarify whether performance drops are due to language or language-family differences or to other factors such as the writing system.

Method: (1) Observational data analysis and regression modeling on the ECLeKTic and MultiLoKo datasets; (2) experiments that provide the key entities of each question in the source language; (3) a synthetic generation pipeline that designs SFT samples to strengthen the model's reasoning about transliteration ambiguities.

Result: Script match is the primary predictor of knowledge transfer failure once model capability and question difficulty are accounted for. Source-language entity hints disproportionately improve cross-script questions. After targeted SFT, the cross-script transfer gap shrinks for two models.

Conclusion: The bottleneck in cross-lingual parametric knowledge transfer is mainly a script barrier, which can be reduced through targeted post-training such as reasoning-oriented fine-tuning.

Abstract: In this work, we analyze shortcomings in cross-lingual knowledge transfer in large, modern reasoning LLMs. We demonstrate that the perceived gap in knowledge transfer is primarily a script barrier. First, we conduct an observational data analysis on the performance of thinking models on two datasets with local knowledge from around the world, ECLeKTic and MultiLoKo. Our regression analysis shows that script match - not language or family - is the primary predictor of knowledge transfer failure once model capability and question difficulty are accounted for. We further this finding by providing the LLMs with the key entities of the questions in their source language and find that this disproportionately improves cross-script questions. We then posit that these LLMs could be reasoning better at test-time. To evaluate this, we develop a synthetic generation pipeline to design SFT samples to encourage the model to better reason about transliteration ambiguities when trying to fetch parametric knowledge at inference-time. We show that teaching two models to reason better reduces the cross-script transfer gap. As a result, we conclude that there is potential to improve cross-lingual parametric knowledge transfer during post-training.

[7] Ensemble Self-Training for Unsupervised Machine Translation

Ido Aharon,Jonathan Shaki,Sarit Kraus

Main category: cs.CL

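TL;DR: This paper proposes an ensemble-driven self-training framework for unsupervised neural machine translation (UNMT). It builds diverse models by varying an auxiliary language, generates pseudo-translations with token-level ensemble decoding, and uses them to further train each model, improving performance while keeping single-model inference cost.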

Details

Motivation: In UNMT, a single model offers limited diversity and produces pseudo-data of limited quality, which caps performance.

Method: Train several UNMT models that share the primary language pair but use different auxiliary languages, inducing structured diversity; generate pseudo-parallel data with token-level ensemble decoding (averaged predictions) in both directions; further train every model on this synthetic data; at deployment, keep only the single model with the best validation performance.

Result: Statistically significant improvements over single-model UNMT baselines, with mean gains of 1.7 chrF when translating from English and 0.67 chrF when translating into English.

Conclusion: Ensemble-driven self-training improves UNMT while balancing model diversity against inference efficiency.

Abstract: We present an ensemble-driven self-training framework for unsupervised neural machine translation (UNMT). Starting from a primary language pair, we train multiple UNMT models that share the same translation task but differ in an auxiliary language, inducing structured diversity across models. We then generate pseudo-translations for the primary pair using token-level ensemble decoding, averaging model predictions in both directions. These ensemble outputs are used as synthetic parallel data to further train each model, allowing the models to improve via shared supervision. At deployment time, we select a single model by validation performance, preserving single-model inference cost. Experiments show statistically significant improvements over single-model UNMT baselines, with mean gains of 1.7 chrF when translating from English and 0.67 chrF when translating into English.
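Token-level ensemble decoding, as used in entry [7] to produce pseudo-translations, can be illustrated with a toy greedy decoder. The sketch below assumes that each model exposes a next-token probability distribution and that the ensemble averages probabilities rather than log-probabilities; the abstract only says predictions are averaged, so both points are assumptions, and the toy models and vocabulary are hypothetical. Only one translation direction is shown.

```python
from typing import Callable, Dict, List

Distribution = Dict[str, float]
Model = Callable[[str, List[str]], Distribution]


def average_distributions(dists: List[Distribution]) -> Distribution:
    """Average next-token probabilities across models (missing tokens count as 0)."""
    vocab = set().union(*dists)
    return {tok: sum(d.get(tok, 0.0) for d in dists) / len(dists) for tok in vocab}


def ensemble_greedy_decode(models: List[Model], source: str,
                           max_len: int = 20, eos: str = "</s>") -> List[str]:
    """Greedy decoding where every step uses the averaged distribution."""
    prefix: List[str] = []
    for _ in range(max_len):
        avg = average_distributions([m(source, prefix) for m in models])
        # Iterate over sorted tokens so ties resolve deterministically.
        token = max(sorted(avg), key=lambda t: avg[t])
        if token == eos:
            break
        prefix.append(token)
    return prefix


def make_toy_model(table: List[Distribution]) -> Model:
    """Hypothetical model: looks up a fixed distribution by decoding step."""
    return lambda source, prefix: table[min(len(prefix), len(table) - 1)]


model_a = make_toy_model([{"hallo": 0.7, "hi": 0.3},
                          {"welt": 0.9, "</s>": 0.1},
                          {"</s>": 1.0}])
model_b = make_toy_model([{"hallo": 0.4, "hi": 0.6},
                          {"welt": 0.6, "</s>": 0.4},
                          {"</s>": 1.0}])

pseudo_translation = ensemble_greedy_decode([model_a, model_b], "hello world")
print(pseudo_translation)
```

In the toy data, `model_b` alone would start with "hi", but the averaged distribution prefers "hallo". That is the kind of shared supervision signal the ensemble outputs give back to each individual model during self-training.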

[8] Evaluating LLM-Simulated Conversations in Modeling Inconsistent and Uncollaborative Behaviors in Human Social Interaction

Ryo Kamoi,Ameya Godbole,Longqi Yang,Rui Zhang,Mengting Wan,Pei Zhou

Main category: cs.CL

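TL;DR: This paper introduces CoCoEval, an evaluation framework that detects 10 types of inconsistent and uncollaborative behaviors in LLM-generated conversations. It finds that current LLMs still fall well short of reproducing real human conversation, especially behaviors such as misunderstandings and interruptions, which limits their use as proxies for human social interaction.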

Details

Motivation: LLM-simulated conversations struggle to reproduce the inconsistent and uncollaborative behaviors inherent in real dialogue (such as misunderstandings and interruptions), yet these behaviors are essential for modeling complex social interaction. Systematic comparisons have been lacking.

Method: Build the CoCoEval framework, which uses an LLM-as-a-Judge to detect 10 types of inconsistent and uncollaborative behaviors at the turn level. Across academic, business, and governmental meetings as well as debates, compare behavior frequencies in conversations generated by GPT-4.1, GPT-5.1, and Claude Opus 4 with those in human conversations, and test the effect of different prompt-engineering and supervised fine-tuning strategies.

Result: (1) Under vanilla prompting, LLM-generated conversations show far fewer inconsistent and uncollaborative behaviors than human conversations. (2) Prompt engineering does not control these behaviors reliably and tends to produce too few or too many. (3) Supervised fine-tuning on human conversations can cause overproduction of a narrow set of behaviors, such as repetition.

Conclusion: Current LLMs cannot reliably simulate the complex social dynamics of human conversation, so using them directly as proxies for human social interaction carries real risks and limitations.

Abstract: Simulating human conversations using large language models (LLMs) has emerged as a scalable methodology for modeling human social interaction. However, simulating human conversations is challenging because they inherently involve inconsistent and uncollaborative behaviors, such as misunderstandings and interruptions. Analysis comparing inconsistent and uncollaborative behaviors in human- and LLM-generated conversations remains limited, although reproducing these behaviors is integral to simulating human-like and complex social interaction. In this work, we introduce CoCoEval, an evaluation framework that analyzes LLM-simulated conversations by detecting 10 types of inconsistent and uncollaborative behaviors at the turn level using an LLM-as-a-Judge. Using CoCoEval, we evaluate GPT-4.1, GPT-5.1, and Claude Opus 4 by comparing the frequencies of detected behaviors in conversations simulated by each model and in human conversations across academic, business, and governmental meetings, as well as debates. Our analysis shows that (1) under vanilla prompting, LLM-simulated conversations exhibit far fewer inconsistent and uncollaborative behaviors than human conversations; (2) prompt engineering does not provide reliable control over these behaviors, as our results show that different prompts lead to their under- or overproduction; and (3) supervised fine-tuning on human conversations can lead LLMs to overproduce a narrow set of behaviors, such as repetition. Our findings highlight the difficulty of simulating human conversations, raising concerns about the use of LLMs as a proxy for human social interaction.

[9] Knowledge Localization in Mixture-of-Experts LLMs Using Cross-Lingual Inconsistency

Lucas Bandarkar,Alan Ansell,Trevor Cohn

Main category: cs.CL

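TL;DR: This paper proposes a knowledge localization framework that exploits the inconsistency of LLM factual recall across languages to explain how specific knowledge is stored and retrieved in mixture-of-experts (MoE) models.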

Details

Motivation: Modern LLMs behave very differently across languages, for example recalling a fact correctly in some languages but not in others. This is usually treated as a defect to mitigate; the paper instead treats cross-lingual inconsistency as a usable interpretability signal.

Method: Query the model with difficult factual questions in many languages to build "success" and "failure" activation buckets; apply a statistical contrastive analysis to the MoE router logits to identify the experts that matter for a given piece of knowledge; then ablate those experts to verify that they are necessary.

Result: Deactivating only about 20 of 6,000 experts causes the model to stop answering correctly in over 40% of cases, indicating that the identified experts are key to retrieving the knowledge.

Conclusion: The method offers a realistic, scalable way to localize knowledge and helps explain how knowledge is organized inside increasingly complex MoE LLMs.

Abstract: Modern LLMs continue to exhibit significant variance in behavior across languages, such as being able to recall factual information in some languages but not others. While typically studied as a problem to be mitigated, in this work, we propose leveraging this cross-lingual inconsistency as a tool for interpretability in mixture-of-experts (MoE) LLMs. Our knowledge localization framework contrasts routing for sets of languages where the model correctly recalls information from languages where it fails. This allows us to isolate model components that play a functional role in answering about a piece of knowledge. Our method proceeds in two stages: (1) querying the model with difficult factual questions across a diverse set of languages to generate "success" and "failure" activation buckets and then (2) applying a statistical contrastive analysis to the MoE router logits to identify experts important for knowledge. To validate the necessity of this small number of experts for answering a knowledge question, we deactivate them and re-ask the question. We find that despite only deactivating about 20 out of 6000 experts, the model no longer answers correctly in over 40% of cases. Generally, this method provides a realistic and scalable knowledge localization approach to address increasingly complex LLMs.
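The contrastive step in entry [9] compares router logits between the "success" and "failure" buckets. One plausible way to do this is sketched below. The abstract does not name the statistic, so the Welch-style t score is an assumption, as are the function name and the toy logits. Each row holds one query's router logits, one value per expert; a real model would have thousands of experts across layers.

```python
import math
from statistics import mean, variance
from typing import List, Sequence


def contrast_experts(success: Sequence[Sequence[float]],
                     failure: Sequence[Sequence[float]],
                     top_k: int) -> List[int]:
    """Rank experts by how much higher their router logits are on successes.

    Assumed statistic: Welch-style t score per expert, i.e. the difference
    of bucket means divided by the combined standard error. Each bucket
    needs at least two rows so a sample variance exists.
    """
    num_experts = len(success[0])
    scored = []
    for expert in range(num_experts):
        s = [row[expert] for row in success]
        f = [row[expert] for row in failure]
        std_err = math.sqrt(variance(s) / len(s) + variance(f) / len(f))
        t_score = (mean(s) - mean(f)) / std_err if std_err > 0 else 0.0
        scored.append((t_score, expert))
    # Largest positive contrast first; ties broken by expert index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [expert for _, expert in scored[:top_k]]


# Hypothetical router logits for 4 experts over 3 queries per bucket.
success_bucket = [[0.1, 0.0, 2.0, 0.3],
                  [0.2, 0.1, 2.2, 0.2],
                  [0.0, -0.1, 1.8, 0.4]]
failure_bucket = [[0.1, 0.1, 0.1, 0.3],
                  [0.3, 0.0, -0.1, 0.1],
                  [0.0, -0.2, 0.0, 0.5]]

candidates = contrast_experts(success_bucket, failure_bucket, top_k=1)
print(candidates)
```

The experts returned this way are only candidates. The validation step described in the abstract, deactivating them and re-asking the question, is what establishes that they are necessary, and it is not covered by this sketch.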

[10] Exploiting the English Grammar Profile for L2 grammatical analysis with LLMs

Stefano Bannò,Penny Karanasou,Kate Knill,Mark Gales

Main category: cs.CL

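TL;DR: This paper proposes a new framework based on the English Grammar Profile (EGP) that detects L2 learners' attempts at grammatical constructs and classifies them as successful or unsuccessful, in order to provide fine-grained feedback and predict holistic CEFR proficiency. Comparing rule-based and LLM-based methods, it finds that LLMs do better on semantically and pragmatically nuanced constructs, while rule-based methods remain competitive on purely morphological or syntactic ones. A hybrid pipeline (rule-based pre-filter plus LLM) performs best for proficiency assessment, and a fully automated pipeline using automatic grammatical error correction comes close to systems based on manual corrections.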

Details

Motivation: Assessing L2 learners' grammatical competence is essential for targeted feedback and proficiency assessment, but existing methods focus on errors and overlook successful attempts, and they lack a fine-grained, developmental framework grounded in an authoritative grammar resource such as the EGP.

Method: Build an EGP-based framework for detecting and classifying grammatical constructs; compare rule-based systems with LLMs at identifying successful and unsuccessful attempts; use the detected attempts as features for predicting holistic CEFR level; design and evaluate rule-based, hybrid (rule-based pre-filter plus LLM), and fully automated (with automatic grammatical error correction) pipelines.

Result: LLMs outperform rule-based methods on semantically and pragmatically nuanced constructs; rule-based methods remain competitive on purely morphological or syntactic constructs; the hybrid pipeline is strongest for CEFR level prediction; the fully automated pipeline approaches the performance of systems based on manual corrections, particularly for detecting successful attempts.

Conclusion: By emphasizing learners' successful attempts, the framework supports positive, formative feedback; grounding it in the EGP keeps it pedagogically relevant; the hybrid and fully automated pipelines improve practicality and scalability.

Abstract: Evaluating the grammatical competence of second language (L2) learners is essential both for providing targeted feedback and for assessing proficiency. To achieve this, we propose a novel framework leveraging the English Grammar Profile (EGP), a taxonomy of grammatical constructs mapped to the proficiency levels of the Common European Framework of Reference (CEFR), to detect learners' attempts at grammatical constructs and classify them as successful or unsuccessful. This detection can then be used to provide fine-grained feedback. Moreover, the grammatical constructs are used as predictors of proficiency assessment by using automatically detected attempts as predictors of holistic CEFR proficiency. For the selection of grammatical constructs derived from the EGP, rule-based and LLM-based classifiers are compared. We show that LLMs outperform rule-based methods on semantically and pragmatically nuanced constructs, while rule-based approaches remain competitive for constructs that rely purely on morphological or syntactic features and do not require semantic interpretation. For proficiency assessment, we evaluate both rule-based and hybrid pipelines and show that a hybrid approach combining a rule-based pre-filter with an LLM consistently yields the strongest performance. Since our framework operates on pairs of original learner sentences and their corrected counterparts, we also evaluate a fully automated pipeline using automatic grammatical error correction. This pipeline closely approaches the performance of semi-automated systems based on manual corrections, particularly for the detection of successful attempts at grammatical constructs. Overall, our framework emphasises learners' successful attempts in addition to unsuccessful ones, enabling positive, formative feedback and providing actionable insights into grammatical development.

[11] Tabular LLMs for Interpretable Few-Shot Alzheimer's Disease Prediction with Multimodal Biomedical Data

Sophie Kearney,Shu Yang,Zixuan Wen,Weimin Lyu,Bojian Hou,Duy Duong-Tran,Tianlong Chen,Jason H. Moore,Marylyn D. Ritchie,Chao Chen,Li Shen

Main category: cs.CL

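TL;DR: This paper proposes TAP-GPT, a domain-adapted tabular LLM for Alzheimer's disease (AD) diagnosis that uses tabular prompts to deliver accurate, interpretable predictions from multimodal biomarkers in few-shot settings with missing data.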

Details

Motivation: Traditional deep learning performs poorly on small, incomplete clinical tabular data, whereas LLMs offer few-shot generalization, structured reasoning, and interpretable outputs that could reshape clinical prediction.

Method: Build the TAP-GPT framework on TableGPT2 and fine-tune it for the domain with tabular prompts (rather than plain text), supporting multimodal (MRI, PET, and others) and unimodal binary AD classification; use feature selection to mitigate degradation on high-dimensional inputs, and test robustness under simulated and real-world missing data.

Result: Across four ADNI-derived datasets, TAP-GPT improves on its backbone models and outperforms traditional machine learning baselines in the few-shot setting, while remaining competitive with general-purpose LLMs. It produces modality-aware reasoning consistent with established AD biology and is more stable under self-reflection.

Conclusion: According to the authors, this is the first systematic application of a tabular-specialized LLM to multimodal AD prediction. It shows that pretrained tabular LLMs can handle structured clinical prediction tasks and lays groundwork for multi-agent clinical decision-support systems.

Abstract: Accurate diagnosis of Alzheimer's disease (AD) requires handling tabular biomarker data, yet such data are often small and incomplete, where deep learning models frequently fail to outperform classical methods. Pretrained large language models (LLMs) offer few-shot generalization, structured reasoning, and interpretable outputs, providing a powerful paradigm shift for clinical prediction. We propose TAP-GPT (Tabular Alzheimer's Prediction GPT), a domain-adapted tabular LLM framework built on TableGPT2 and fine-tuned for few-shot AD classification using tabular prompts rather than plain texts. We evaluate TAP-GPT across four ADNI-derived datasets, including QT-PAD biomarkers and region-level structural MRI, amyloid PET, and tau PET for binary AD classification. Across multimodal and unimodal settings, TAP-GPT improves upon its backbone models and outperforms traditional machine learning baselines in the few-shot setting while remaining competitive with state-of-the-art general-purpose LLMs. We show that feature selection mitigates degradation in high-dimensional inputs and that TAP-GPT maintains stable performance under simulated and real-world missingness without imputation. Additionally, TAP-GPT produces structured, modality-aware reasoning aligned with established AD biology and shows greater stability under self-reflection, supporting its use in iterative multi-agent systems. To our knowledge, this is the first systematic application of a tabular-specialized LLM to multimodal biomarker-based AD prediction, demonstrating that such pretrained models can effectively address structured clinical prediction tasks and laying the foundation for tabular LLM-driven multi-agent clinical decision-support systems. The source code is publicly available on GitHub: https://github.com/sophie-kearney/TAP-GPT.

[12] CODMAS: A Dialectic Multi-Agent Collaborative Framework for Structured RTL Optimization

Che-Ming Chang,Prashanth Vijayaraghavan,Ashutosh Jadhav,Charles Mackin,Vandana Mukherjee,Hsinyu Tsai,Ehsan Degan

Main category: cs.CL

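TL;DR: This paper presents CODMAS, a framework that automates RTL code optimization with a dialectic multi-agent system, reducing critical path delay by about 25% and power by about 22%.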

Details

Motivation: RTL code optimization is critical in EDA for improving power, performance, and area (PPA), but existing methods fall short on automation, robustness, and scalability.

Method: Build CODMAS, a multi-agent system based on dialectic reasoning, comprising the Articulator (produces transformation plans), the Hypothesis Partner (predicts outcomes and reconciles deviations), a Domain-Specific Coding Agent (generates Verilog edits), and a Code Evaluation Agent (verifies syntax, functionality, and PPA); introduce the RTLOPT benchmark (120 Verilog triples) for evaluation.

Result: On pipelining and clock-gating tasks, CODMAS achieves about a 25% reduction in critical path delay and about a 22% reduction in power, with fewer functional and compilation failures than strong prompting and agentic baselines.

Conclusion: Structured multi-agent reasoning can substantially improve automated RTL optimization and has the potential to extend to more complex designs and broader optimization tasks.

Abstract: Optimizing Register Transfer Level (RTL) code is a critical step in Electronic Design Automation (EDA) for improving power, performance, and area (PPA). We present CODMAS (Collaborative Optimization via a Dialectic Multi-Agent System), a framework that combines structured dialectic reasoning with domain-aware code generation and deterministic evaluation to automate RTL optimization. At the core of CODMAS are two dialectic agents: the Articulator, inspired by rubber-duck debugging, which articulates stepwise transformation plans and exposes latent assumptions; and the Hypothesis Partner, which predicts outcomes and reconciles deviations between expected and actual behavior to guide targeted refinements. These agents direct a Domain-Specific Coding Agent (DCA) to generate architecture-aware Verilog edits and a Code Evaluation Agent (CEA) to verify syntax, functionality, and PPA metrics. We introduce RTLOPT, a benchmark of 120 Verilog triples (unoptimized, optimized, testbench) for pipelining and clock-gating transformations. Across proprietary and open LLMs, CODMAS achieves ~25% reduction in critical path delay for pipelining and ~22% power reduction for clock gating, while reducing functional and compilation failures compared to strong prompting and agentic baselines. These results demonstrate that structured multi-agent reasoning can significantly enhance automated RTL optimization and scale to more complex designs and broader optimization tasks.

[13] SYMDIREC: A Neuro-Symbolic Divide-Retrieve-Conquer Framework for Enhanced RTL Synthesis and Summarization

Prashanth Vijayaraghavan,Apoorva Nitsure,Luyao Shi,Charles Mackin,Ashutosh Jadhav,David Beymer,Ehsan Degan,Vandana Mukherjee

Main category: cs.CL

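TL;DR: This paper presents SYMDIREC, a neuro-symbolic framework that requires no LLM fine-tuning. It decomposes tasks into symbolic subgoals, retrieves code with a fine-tuned retriever, and assembles outputs through LLM reasoning, substantially improving RTL synthesis and summarization.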

Details

Motivation: RTL synthesis and summarization remain hard for LLMs because hardware description language (HDL) syntax is rigid, supervised data is limited, and alignment with natural language is weak. Existing prompting and RAG methods lack symbolic planning, which limits structural precision.

Method: The SYMDIREC neuro-symbolic framework decomposes RTL tasks into symbolic subgoals, retrieves relevant code with a fine-tuned retriever, and assembles verified outputs through LLM reasoning. It supports Verilog and VHDL and needs no LLM fine-tuning.

Result: Pass@1 rises by about 20% for RTL synthesis, and ROUGE-L improves by 15-20% for summarization, relative to prompting and RAG baselines.

Conclusion: Symbolic guidance improves the structural precision of LLMs on RTL tasks, and SYMDIREC offers a scalable approach to hardware design automation that needs no LLM fine-tuning.

Abstract: Register-Transfer Level (RTL) synthesis and summarization are central to hardware design automation but remain challenging for Large Language Models (LLMs) due to rigid HDL syntax, limited supervision, and weak alignment with natural language. Existing prompting and retrieval-augmented generation (RAG) methods have not incorporated symbolic planning, limiting their structural precision. We introduce SYMDIREC, a neuro-symbolic framework that decomposes RTL tasks into symbolic subgoals, retrieves relevant code via a fine-tuned retriever, and assembles verified outputs through LLM reasoning. Supporting both Verilog and VHDL without LLM fine-tuning, SYMDIREC achieves ~20% higher Pass@1 rates for synthesis and 15-20% ROUGE-L improvements for summarization over prompting and RAG baselines, demonstrating the benefits of symbolic guidance in RTL tasks.

[14] Anonymous-by-Construction: An LLM-Driven Framework for Privacy-Preserving Text

Federico Albanese,Pablo Ronco,Nicolás D'Ippolito

Main category: cs.CL

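TL;DR: This paper proposes a text anonymization pipeline built on local LLMs. Deployed inside the organization, it replaces personally identifiable information (PII) with realistic, type-consistent surrogates, protecting privacy while preserving semantic utility without any data leaving the premises. Experiments show state-of-the-art results on privacy, semantic fidelity, and downstream trainability.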

Details

Motivation: With LLMs in wide use, sensitive information such as PII must be protected without sacrificing data utility, and in particular without sending data to third-party APIs, to enable responsible AI deployment.

Method: Build an on-premise PII substitution pipeline based on local LLMs that performs type-consistent, semantically plausible replacement. Evaluate with multiple metrics against other techniques (Presidio, Google DLP, ZSTS), covering privacy, semantic utility, downstream fine-tuning (BERT+LoRA), and agentic Q&A performance (response quality of an LLM with the anonymization layer inserted in front of it).

Result: On the Action-Based Conversation Dataset, the method outperforms rule-based systems, NER baselines, and ZSTS variants on combined privacy, topical consistency, factual utility, and trainability, achieving a state-of-the-art balance of privacy, utility, and trainability.

Conclusion: Type-preserving substitution driven by local LLMs produces anonymized corpora that are both safe and useful, suitable for privacy-sensitive agentic pipelines and downstream fine-tuning.

Abstract: Responsible use of AI demands that we protect sensitive information without undermining the usefulness of data, an imperative that has become acute in the age of large language models. We address this challenge with an on-premise, LLM-driven substitution pipeline that anonymizes text by replacing personally identifiable information (PII) with realistic, type-consistent surrogates. Executed entirely within organizational boundaries using local LLMs, the approach prevents data egress while preserving fluency and task-relevant semantics. We conduct a systematic, multi-metric, cross-technique evaluation on the Action-Based Conversation Dataset, benchmarking against industry standards (Microsoft Presidio and Google DLP) and a state-of-the-art approach (ZSTS, in redaction-only and redaction-plus-substitution variants). Our protocol jointly measures privacy, semantic utility, and trainability under privacy via a lifecycle-ready criterion obtained by fine-tuning a compact encoder (BERT+LoRA) on sanitized text. In addition, we assess agentic Q&A performance by inserting an on-premise anonymization layer before the answering LLM and evaluating the quality of its responses. This intermediate, type-preserving substitution stage ensures that no sensitive content is exposed to third-party APIs, enabling responsible deployment of Q&A agents without compromising confidentiality. Our method attains state-of-the-art privacy, minimal topical drift, strong factual utility, and low trainability loss, outperforming rule-based approaches and named-entity recognition (NER) baselines and ZSTS variants on the combined privacy-utility-trainability frontier. These results show that local LLM substitution yields anonymized corpora that are both responsible to use and operationally valuable: safe for agentic pipelines and suitable for downstream fine-tuning with limited degradation.

[15] Alignment Makes Language Models Normative, Not Descriptive

Eilam Shapira,Moshe Tennenholtz,Roi Reichart

Main category: cs.CL

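TL;DR: This paper finds that post-training alignment improves a model's ability to predict human behavior in normative settings but sharply reduces prediction accuracy in settings dominated by descriptive dynamics, such as multi-round strategic games, revealing a fundamental trade-off between alignment and behavioral modeling.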

Details

Motivation: Post-training alignment aims to make language models match human preferences, but that objective is not the same as modeling actual human behavior. The authors test empirically whether alignment really improves prediction of human decisions.

Method: On more than 10,000 real human decisions, systematically compare 120 base-aligned model pairs at predicting behavior in multi-round strategic games (bargaining, persuasion, negotiation, and repeated matrix games), one-shot textbook games, and lottery choices.

Result: In multi-round strategic games, base models outperform their aligned counterparts by nearly 10:1. In one-shot textbook games and non-strategic choices, aligned models are better across the board, and they also lead at round one of the multi-round games, before any interaction history exists.

Conclusion: Alignment introduces a normative bias: it improves prediction when human behavior follows normative solutions but hurts prediction of behavior driven by descriptive dynamics such as reciprocity, retaliation, and history dependence. Optimizing models for human use therefore conflicts with using them to simulate human behavior.

Abstract: Post-training alignment optimizes language models to match human preference signals, but this objective is not equivalent to modeling observed human behavior. We compare 120 base-aligned model pairs on more than 10,000 real human decisions in multi-round strategic games - bargaining, persuasion, negotiation, and repeated matrix games. In these settings, base models outperform their aligned counterparts in predicting human choices by nearly 10:1, robustly across model families, prompt formulations, and game configurations. This pattern reverses, however, in settings where human behavior is more likely to follow normative predictions: aligned models dominate on one-shot textbook games across all 12 types tested and on non-strategic lottery choices - and even within the multi-round games themselves, at round one, before interaction history develops. This boundary-condition pattern suggests that alignment induces a normative bias: it improves prediction when human behavior is relatively well captured by normative solutions, but hurts prediction in multi-round strategic settings, where behavior is shaped by descriptive dynamics such as reciprocity, retaliation, and history-dependent adaptation. These results reveal a fundamental trade-off between optimizing models for human use and using them as proxies for human behavior.

[16] TharuChat: Bootstrapping Large Language Models for a Low-Resource Language via Synthetic Data and Human Validation

Prajwal Panth,Agniva Maiti

Main category: cs.CL

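TL;DR: This paper presents the Tharu-LLaMA (3B) model and the TharuChat dataset. It generates instruction-tuning data for the endangered Tharu language through an LLM-to-Human bootstrapping pipeline, shows that small-scale synthetic data is effective for low-resource language modeling, and yields a lightweight model that runs on consumer-grade hardware.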

Details

Motivation: The rise of LLMs has widened the digital divide for indigenous languages of the Global South. Tharu in particular suffers from data scarcity and dialectal fragmentation, so mainstream multilingual models ignore it or produce hallucinated output.

Method: Build the TharuChat dataset with prompt-engineered Gemini models fed with Rana Tharu grammar and folklore, using an LLM-to-Human bootstrapping pipeline; fine-tune a 3B-parameter LLaMA model for instruction following; run an ablation study on how dataset size affects perplexity.

Result: TharuChat is anchored in Rana Tharu (~70%) and includes Dangaura and Kochila dialects, with code-mixing and residual Awadhi/Hindi influence. Increasing the data volume from 25% to 100% lowers perplexity linearly from 6.42 to 2.88. The resulting Tharu-LLaMA (3B) model can run on consumer-grade hardware.

Conclusion: Small-scale synthetic data, even when noisy, substantially improves low-resource language modeling and offers a reproducible, low-cost route to preserving endangered Himalayan languages with AI.

Abstract: The rapid proliferation of Large Language Models (LLMs) has created a profound digital divide, effectively excluding indigenous languages of the Global South from the AI revolution. The Tharu language, an Indo-Aryan vernacular spoken by approximately 1.7 million people across the Terai belt of Nepal and India, exemplifies this crisis. Despite a rich oral tradition, Tharu suffers from severe data scarcity and linguistic fragmentation, causing state-of-the-art multilingual models to routinely "hallucinate" or default to dominant high-resource neighbors like Hindi and Nepali due to contamination in pre-training corpora. This paper presents Tharu-LLaMA (3B), a specialized instruction-following model designed to address this exclusion. We introduce TharuChat, a novel dataset constructed via an LLM-to-Human bootstrapping pipeline. We utilized prompt-engineered Gemini models, fed with Rana Tharu grammar and folklore, to synthesize training data. Unlike curated gold-standard corpora, TharuChat reflects the noisy, heterogeneous linguistic reality of the region: it is predominantly anchored in Rana Tharu (~70%) while integrating elements of Dangaura and Kochila dialects. We provide a transparent analysis of the dataset's limitations, including dialectal code-mixing and residual Awadhi/Hindi influence. Through a rigorous empirical ablation study, we demonstrate that despite these imperfections, small-scale synthetic data is highly effective: increasing the dataset volume from 25% to 100% results in a linear reduction in perplexity from 6.42 to 2.88. The resulting model serves as a proof-of-concept for the preservation of under-resourced Himalayan languages via generative AI, achievable on consumer-grade hardware.
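For readers less familiar with the metric reported in entry [16], perplexity is the exponential of the average negative log-likelihood per token. The sketch below shows that standard definition using natural logarithms; the helper names and the toy log-probabilities are illustrative and do not come from the paper's evaluation code.

```python
import math
from typing import List


def perplexity(token_log_probs: List[float]) -> float:
    """Perplexity from per-token natural-log probabilities.

    Standard definition: exp of the mean negative log-likelihood.
    """
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from `before` to `after`."""
    return (before - after) / before


# A model that assigns probability 0.5 to every token has perplexity 2:
# it is as uncertain as a fair coin flip at each step.
coin_flip = perplexity([math.log(0.5)] * 4)

# Relative drop implied by the two perplexities quoted in the abstract.
drop = relative_reduction(6.42, 2.88)
print(coin_flip, round(drop, 3))
```

Read this way, the move from 6.42 to 2.88 is a relative reduction of roughly 55%. Informally, the fine-tuned model is choosing among about three equally likely next tokens on average rather than about six.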

[17] Neuron-Level Emotion Control in Speech-Generative Large Audio-Language Models

Xiutian Zhao,Ismail Rasim Ulgen,Philipp Koehn,Björn Schuller,Berrak Sisman

Main category: cs.CL

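TL;DR: This paper presents the first neuron-level study of emotion control in large audio-language models (LALMs), identifying and localizing compact emotion-sensitive neurons (ESNs) and proposing a training-free method for steering emotion at inference time.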

Details Motivation: Existing large audio-language models can generate expressive speech but cannot reliably control its emotion: conversions often miss the target affect and tend to degrade linguistic fidelity (through refusals, hallucinations, or paraphrase). Method: Identify emotion-sensitive neurons (ESNs) via success-filtered activation aggregation that enforces both emotion realization and content preservation, then intervene on these neurons at inference time to achieve training-free emotion control. Result: ESN interventions are validated on three LALMs (Qwen2.5-Omni-7B, MiniCPM-o 4.5, Kimi-Audio); the emotion-control gains generalize across speakers and are confirmed by both automatic and human evaluation. Conclusion: The study establishes the first training-free mechanistic framework for emotion control in speech generation and identifies the key factors behind controllability (selector design, mask sparsity, filtering strategy, and intervention strength). Abstract: Large audio-language models (LALMs) can produce expressive speech, yet reliable emotion control remains elusive: conversions often miss the target affect and may degrade linguistic fidelity through refusals, hallucinations, or paraphrase. We present, to our knowledge, the first neuron-level study of emotion control in speech-generative LALMs and demonstrate that compact emotion-sensitive neurons (ESNs) are causally actionable, enabling training-free emotion steering at inference time. ESNs are identified via success-filtered activation aggregation enforcing both emotion realization and content preservation. Across three LALMs (Qwen2.5-Omni-7B, MiniCPM-o 4.5, Kimi-Audio), ESN interventions yield emotion-specific gains that generalize to unseen speakers and are supported by automatic and human evaluation. Controllability depends on selector design, mask sparsity, filtering, and intervention strength. Our results establish a mechanistic framework for training-free emotion control in speech generation.

[18] From Words to Worlds: Benchmarking Cross-Cultural Cultural Understanding in Machine Translation

Bangju Han,Yingqi Wang,Huang Qing,Tiyuan Li,Fengyi Yang,Ahtamjan Ahmat,Abibulla Atawulla,Yating Yang,Xi Zhou

Main category: cs.CL

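TL;DR: This paper introduces the CulT-Eval benchmark for systematically evaluating how machine translation models handle culture-loaded expressions (such as idioms, slang, and culture-specific items), and proposes a new metric that compensates for the blind spots of existing automatic metrics regarding deviations in cultural meaning.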

Details Motivation: Existing machine translation benchmarks are fragmented and lack a framework for systematically evaluating culture-loaded expressions (such as idioms, slang, and culture-specific items), so model failures in preserving cultural meaning are hard to identify and quantify. Method: Build the CulT-Eval benchmark of 7,959 instances covering multiple types of culture-loaded expressions, together with a fine-grained error taxonomy for culturally grounded expressions; based on large-scale model evaluation and manual analysis, propose a complementary metric that targets culturally induced meaning deviations. Result: Experiments show that current mainstream large language models perform poorly at preserving cultural connotation and capturing the nuances of cultural context; existing automatic metrics (e.g., BLEU, COMET) fail to reflect errors at the level of cultural meaning. Conclusion: Accurate translation of culture-loaded expressions remains a major challenge for machine translation; CulT-Eval provides the first systematic benchmark and a dedicated metric for this area, supporting the development of more culturally sensitive translation models. Abstract: Culture-expressions, such as idioms, slang, and culture-specific items (CSIs), are pervasive in natural language and encode meanings that go beyond literal linguistic form. Accurately translating such expressions remains challenging for machine translation systems. Despite this, existing benchmarks remain fragmented and do not provide a systematic framework for evaluating translation performance on culture-loaded expressions. To address this gap, we introduce CulT-Eval, a benchmark designed to evaluate how models handle different types of culturally grounded expressions. CulT-Eval comprises over 7,959 carefully curated instances spanning multiple types of culturally grounded expressions, with a comprehensive error taxonomy covering culturally grounded expressions. Through extensive evaluation of large language models and detailed analysis, we identify recurring and systematic failure modes that are not adequately captured by existing automatic metrics. Accordingly, we propose a complementary evaluation metric that targets culturally induced meaning deviations overlooked by standard MT metrics. The results indicate that current models struggle to preserve culturally grounded meaning and to capture the cultural and contextual nuances essential for accurate translation. Our benchmark and code are available at https://anonymous.4open.science/r/CulT-Eval-E75D/.

[19] Beyond bouba/kiki: Multidimensional semantic signals are deeply woven into the fabric of natural language

Gexin Zhao

Main category: cs.CL

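TL;DR: Using a minimal-pair paradigm and large language models, this paper shows systematically that every phoneme in English carries structured, multidimensional semantic signals, that these signals are systematically predicted by articulatory features (such as manner and place of articulation), and that behavioral experiments and preliminary cross-linguistic validation support them, indicating that sound-meaning iconicity is a pervasive and systematic property of the phonological signal.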

Details Motivation: Challenge the foundational linguistic assumption that sound and meaning are arbitrarily related by systematically exploring the associations between phonemes and multidimensional semantics. Method: Use a minimal-pair paradigm covering 220 pairwise letter contrasts, with three large language models analyzing nine perceptual dimensions; model the associations with articulatory-phonetic features; and validate them through a behavioral experiment with English speakers plus preliminary cross-linguistic validation in five typologically diverse languages. Result: Individual English phonemes carry structured, multidimensional semantic signals; these signals are systematically predicted by manner and place of articulation; the behavioral experiment confirms the patterns at a rate of 80.8%; cross-linguistic data suggest that the core mappings generalize. Conclusion: Sound-meaning iconicity is not an occasional phenomenon but a pervasive, systematic, computable intrinsic property of the phonological signal, one that large language models given only text input can recover. Abstract: A foundational assumption in linguistics holds that the relationship between a word's sound and its meaning is arbitrary. Accumulating evidence from sound symbolism challenges this view, yet no study has systematically mapped the multidimensional semantic profile of every phonological unit within a language. Here we show that individual letter-phonemes in English carry structured, multidimensional semantic signals. Using a minimal-pair paradigm spanning all 220 pairwise letter contrasts, three large language models independently recover consistent phoneme-meaning associations across nine perceptual dimensions. These associations are systematically predicted by articulatory-phonetic features, with manner and place of articulation mapping onto distinct semantic dimensions. Behavioral data from English speakers confirm these patterns at rates well above chance (80.8%), and preliminary cross-linguistic evidence from five typologically diverse languages suggests that core mappings generalize beyond English. Our findings indicate that sound-meaning iconicity is not an occasional curiosity but a pervasive, structured property of the phonological signal, one so systematic that large language models recover it when given only text input, without exposure to speech or articulation during the task.

[20] Ruyi2.5 Technical Report

Huan Song,Shuyu Tian,Qingfei Zhao,Wenhao Hong,Jiang Liu,Ting Long,Jiawei Shao,Xuelong Li

Main category: cs.CL

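TL;DR: Ruyi2.5 is a multimodal familial model built on the AI Flow framework that uses a shared backbone to train once and deploy across many scenarios; its derivative, Ruyi2.5-Camera, targets privacy-preserving camera applications and uses a two-stage edge-cloud recognition architecture to combine data de-identification with deep behavior analysis; the paper also proposes the BPPO algorithm to accelerate reinforcement learning fine-tuning, giving a 2 to 3 times training speedup; experiments show that Ruyi2.5 matches Qwen3-VL on general multimodal tasks and that Ruyi2.5-Camera substantially outperforms it on privacy-constrained surveillance tasks.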

Details Motivation: Address semantic inconsistency of multimodal models across deployment tiers (such as edge and cloud), the high risk of leaking raw video data in privacy-sensitive settings, and the low efficiency of reinforcement learning fine-tuning. Method: Build a shared-backbone architecture that co-trains models of several scales (Ruyi2.5); design the two-stage Ruyi2.5-Camera system, in which an edge model applies information-bottleneck-guided irreversible feature mapping and a cloud model performs deep behavior reasoning; propose Binary Prefix Policy Optimization (BPPO), which optimizes RL training through binary response selection and gradient updates focused on response prefixes. Result: Ruyi2.5 reaches Qwen3-VL's level on general multimodal benchmarks; Ruyi2.5-Camera substantially outperforms Qwen3-VL on privacy-constrained surveillance tasks; BPPO trains 2 to 3 times faster than GRPO. Conclusion: The Ruyi2.5 family demonstrates the effectiveness of combining a unified multimodal backbone, tiered deployment, and privacy-first design, offering a scalable technical path for edge intelligence and trustworthy AI. Abstract: We present Ruyi2.5, a multimodal familial model built on the AI Flow framework. Extending Ruyi2's "Train Once, Deploy Many" paradigm to the multimodal domain, Ruyi2.5 constructs a shared-backbone architecture that co-trains models of varying scales within a single unified pipeline, ensuring semantic consistency across all deployment tiers. Built upon Ruyi2.5, the Ruyi2.5-Camera model is developed as a privacy-preserving camera service system, which instantiates Ruyi2.5-Camera into a two-stage recognition pipeline: an edge model applies information-bottleneck-guided irreversible feature mapping to de-identify raw frames at the source, while a cloud model performs deep behavior reasoning. To accelerate reinforcement learning fine-tuning, we further propose Binary Prefix Policy Optimization (BPPO), which reduces sample redundancy via binary response selection and focuses gradient updates on response prefixes, achieving a 2 to 3 times training speedup over GRPO. Experiments show Ruyi2.5 matches Qwen3-VL on the general multimodal benchmarks, while Ruyi2.5-Camera substantially outperforms Qwen3-VL on privacy-constrained surveillance tasks.

[21] Grid Spatial Understanding: A Dataset for Textual Spatial Reasoning over Grids, Embodied Settings, and Coordinate Structures

Risham Sidhu,Julia Hockenmaier

Main category: cs.CL

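TL;DR: This paper introduces GSU, a text-only grid dataset for evaluating large language models on three core spatial reasoning tasks: navigation, object localization, and structure composition. Frontier models can solve these tasks, but small models reach comparable performance after full or LoRA fine-tuning, which suggests potential for specialized embodied agents.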

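Details Motivation: The spatial reasoning ability of existing models is hard to evaluate separately from perception, so a dataset is needed that tests spatial reasoning without relying on visual input. Method: Build the text-only grid dataset GSU, design three kinds of spatial reasoning task (navigation, object localization, structure composition), evaluate multiple LLMs and VLMs systematically, and compare different fine-tuning strategies. Result: Most models grasp basic grid concepts but do poorly with frames of reference relative to an embodied agent and with identifying 3D shapes from coordinate lists; multimodal training does not improve general understanding of 3D space; frontier models perform well, but small models can approach their performance after full or LoRA fine-tuning. Conclusion: Text-only spatial reasoning can be evaluated and improved without a visual modality; small language models with suitable fine-tuning can handle complex spatial reasoning tasks, offering a feasible path toward efficient embodied agents. Abstract: We introduce GSU, a text-only grid dataset to evaluate the spatial reasoning capabilities of LLMs over 3 core tasks: navigation, object localization, and structure composition. By forgoing visual inputs, isolating spatial reasoning from perception, we show that while most models grasp basic grid concepts, they struggle with frames of reference relative to an embodied agent and identifying 3D shapes from coordinate lists. We also find that exposure to a visual modality does not provide a generalizable understanding of 3D space that VLMs are able to utilize for these tasks. Finally, we show that while the very latest frontier models can solve the provided tasks (though harder variants may still stump them), fully fine-tuning a small LM or LoRA fine-tuning a small LLM shows potential to match frontier model performance, suggesting an avenue for specialized embodied agents.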

[22] PACE-RAG: Patient-Aware Contextual and Evidence-based Policy RAG for Clinical Drug Recommendation

Chaeyoung Huh,Hyunmin Hwang,Jung Hwan Shin,Jinse Park,Jong Chul Ye

Main category: cs.CL

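TL;DR: This paper proposes the PACE-RAG framework, which combines each patient's clinical context with the prescribing patterns of similar cases to improve the performance and explainability of personalized drug recommendation for Parkinson's disease, reaching state-of-the-art results on multiple benchmarks.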

Details Motivation: Existing large language models fail to capture the subtle nuances of real prescribing, while conventional RAG methods (such as guideline retrieval or similar-patient retrieval) struggle to balance individual clinical specificity with population-level experience, so recommendations are neither precise nor explainable enough. Method: Propose PACE-RAG (Patient-Aware Contextual and Evidence-based Policy RAG), which fuses individual patient context with the prescribing tendencies of similar cases and analyzes treatment patterns under specific clinical signals to produce an optimal drug recommendation and an explainable clinical summary. Result: On a Parkinson's cohort and the MIMIC-IV dataset, using Llama-3.1-8B and Qwen3-8B, F1 scores reach 80.84% and 47.22% respectively, significantly better than existing methods. Conclusion: PACE-RAG is a robust, clinically grounded solution for personalized prescribing decision support that bridges the gap between general medical knowledge and individualized prescribing practice. Abstract: Drug recommendation requires a deep understanding of individual patient context, especially for complex conditions like Parkinson's disease. While LLMs possess broad medical knowledge, they fail to capture the subtle nuances of actual prescribing patterns. Existing RAG methods also struggle with these complexities because guideline-based retrieval remains too generic and similar-patient retrieval often replicates majority patterns without accounting for the unique clinical nuances of individual patients. To bridge this gap, we propose PACE-RAG (Patient-Aware Contextual and Evidence-based Policy RAG), a novel framework designed to synthesize individual patient context with the prescribing tendencies of similar cases. By analyzing treatment patterns tailored to specific clinical signals, PACE-RAG identifies optimal prescriptions and generates an explainable clinical summary. Evaluated on a Parkinson's cohort and the MIMIC-IV benchmark using Llama-3.1-8B and Qwen3-8B, PACE-RAG achieved state-of-the-art performance, reaching F1 scores of 80.84% and 47.22%, respectively. These results validate PACE-RAG as a robust, clinically grounded solution for personalized decision support. Our code is available at: https://github.com/ChaeYoungHuh/PACE-RAG.

[23] SafeTutors: Benchmarking Pedagogical Safety in AI Tutoring Systems

Rima Hazra,Bikram Ghuku,Ilona Marchenko,Yaroslava Tokarieva,Sayan Layek,Somnath Banerjee,Julia Stoyanovich,Mykola Pechenizkiy

Main category: cs.CL

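TL;DR: This paper introduces the SafeTutors benchmark, which jointly evaluates the pedagogical effectiveness and safety of large language models acting as AI tutors, and shows that existing models widely harm learning through answer over-disclosure, misconception reinforcement, and missing scaffolding.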

Details Motivation: Current LLM evaluation paradigms assess problem-solving accuracy and generic safety in isolation and do not reflect whether an AI tutor is both pedagogically effective and safe in real student-tutor interaction; in tutoring, the core safety risk is implicit harm to learning rather than explicitly harmful content. Method: Build the SafeTutors benchmark, organized around a theory-driven risk taxonomy of 11 harm dimensions and 48 sub-risks drawn from the education literature, covering mathematics, physics, and chemistry, and jointly evaluate pedagogy and safety in single-turn and multi-turn dialogue. Result: All tested models show broad harm; larger scale does not reliably improve behavior; multi-turn dialogue sharply worsens pedagogical failure (from 17.7% to 77.8%); harm profiles differ by subject; single-turn "safe/helpful" results can mask systematic tutoring failure over extended interaction. Conclusion: AI tutor safety must be redefined as pedagogical safety; evaluation and mitigation should be grounded in educational principles and must be discipline-aware and sensitive to the interaction process. Abstract: Large language models are rapidly being deployed as AI tutors, yet current evaluation paradigms assess problem-solving accuracy and generic safety in isolation, failing to capture whether a model is simultaneously pedagogically effective and safe across student-tutor interaction. We argue that tutoring safety is fundamentally different from conventional LLM safety: the primary risk is not toxic content but the quiet erosion of learning through answer over-disclosure, misconception reinforcement, and the abdication of scaffolding. To systematically study this failure mode, we introduce SafeTutors, a benchmark that jointly evaluates safety and pedagogy across mathematics, physics, and chemistry. SafeTutors is organized around a theoretically grounded risk taxonomy comprising 11 harm dimensions and 48 sub-risks drawn from learning-science literature. We uncover that all models show broad harm; scale doesn't reliably help; and multi-turn dialogue worsens behavior, with pedagogical failures rising from 17.7% to 77.8%. Harms also vary by subject, so mitigations must be discipline-aware, and single-turn "safe/helpful" results can mask systematic tutor failure over extended interaction.

[24] Argument Reconstruction as Supervision for Critical Thinking in LLMs

Hyun Ryu,Gyouk Chu,Gregor Betz,Eunho Yang,Carolyn Rose,Sean Welleck

Main category: cs.CL

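TL;DR: This paper proposes GAAR, an engine for automatic argument reconstruction, uses it to build the high-quality Arguinas dataset, and shows empirically that learning argument reconstruction improves large language models on several critical thinking tasks.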

Details Motivation: Investigate whether large language models (LLMs) can strengthen their critical thinking ability by learning to reconstruct arguments, a question that has so far remained open. Method: Propose GAAR, an automatic argument reconstruction engine; use it to build Arguinas, a high-quality argument reconstruction dataset; and evaluate the effect of learning argument reconstruction on seven downstream critical thinking tasks. Result: Across all seven critical thinking tasks, models trained on argument reconstruction outperform models that are not, with the largest gains when training on the Arguinas dataset. Conclusion: Learning argument reconstruction effectively improves the critical thinking ability of LLMs, and the proposed framework and dataset provide a feasible path and a useful resource for doing so. Abstract: To think critically about arguments, human learners are trained to identify, reconstruct, and evaluate arguments. Argument reconstruction is especially important because it makes an argument's underlying inferences explicit. However, it remains unclear whether LLMs can similarly enhance their critical thinking ability by learning to reconstruct arguments. To address this question, we introduce a holistic framework with three contributions. We (1) propose an engine that automatically reconstructs arbitrary arguments (GAAR), (2) synthesize a new high-quality argument reconstruction dataset (Arguinas) using the GAAR engine, and (3) investigate whether learning argument reconstruction benefits downstream critical thinking tasks. Our experimental results show that, across seven critical thinking tasks, models trained to learn argument reconstruction outperform models that do not, with the largest performance gains observed when training on the proposed Arguinas dataset. The source code and dataset will be publicly available.

[25] TRiMS: Real-Time Tracking of Minimal Sufficient Length for Efficient Reasoning via RL

Tingcheng Bian,Jinchang Luo,Mingquan Cheng,Jinyu Zhang,Xiaoling Xia,Ni Li,Yan Tao,Haiwei Wang

Main category: cs.CL

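TL;DR: This paper introduces MSL (Minimal Sufficient Length), a theoretical metric for the shortest reasoning length that preserves answer correctness, gives it a recursive definition and proves that its limit exists, and builds on it the TRiMS method, which combines the GRPO algorithm with MSL estimation during training to cut chain-of-thought tokens by over 80% with a slight accuracy gain.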

Details Motivation: Large language models rely on long chain-of-thought (CoT) sequences for complex reasoning, which causes severe reasoning inflation and computational redundancy, so "Intelligence per Token" needs to improve. Method: Introduce the theoretical metric MSL, give it a recursive definition, and prove that its limit exists; analyze mainstream CoT compression strategies to identify the key structural factors that let a model approach MSL; on that basis design TRiMS, which combines the GRPO algorithm with MSL estimation and stabilizes training through dynamic batch aggregation and advantage computation based on batch-level standard deviation. Result: TRiMS cuts CoT tokens by over 80% on all benchmarks while giving a minor accuracy boost. Conclusion: MSL provides the first measurable lower bound for reasoning-chain compression; TRiMS shows that approaching MSL is feasible, markedly improving reasoning efficiency without sacrificing performance. Abstract: Large language models achieve breakthroughs in complex reasoning via long chain-of-thought sequences. However, this often leads to severe reasoning inflation, causing substantial computational redundancy. To maximize Intelligence per Token, we introduce a theoretical metric, MSL (Minimal Sufficient Length). MSL rigorously characterizes the shortest reasoning length that preserves answer correctness. We provide a recursive definition based on independently sampled sequences and prove the existence of its limit, establishing the first measurable lower bound for reasoning-chain compression. Building on an analysis of mainstream CoT compression strategies, we identify key structural factors enabling a model to approach MSL. Based on these insights, we propose TRiMS which employs the GRPO algorithm in conjunction with MSL-based estimation during training, while mitigating instabilities during the training process through dynamic batch aggregation and advantage computation using batch-level standard deviation. TRiMS achieves over 80% CoT token reduction with a minor accuracy boost across all benchmarks.
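
The abstract mentions advantage computation using batch-level standard deviation but gives no formula. The sketch below shows one plausible reading, assumed here for illustration only: rewards are centered on their per-prompt group mean, as in group-relative methods such as GRPO, but scaled by the standard deviation of the whole batch rather than of each group. The function name `advantages_batch_std` and the `eps` constant are hypothetical.

```python
import math


def advantages_batch_std(groups, eps=1e-6):
    """Group-centered advantages scaled by a batch-level std (assumed variant).

    groups: list of reward lists, one inner list per prompt.
    Returns a list of advantage lists with the same shape.
    """
    all_rewards = [r for group in groups for r in group]
    batch_mean = sum(all_rewards) / len(all_rewards)
    batch_var = sum((r - batch_mean) ** 2 for r in all_rewards) / len(all_rewards)
    batch_std = math.sqrt(batch_var)

    advantages = []
    for group in groups:
        group_mean = sum(group) / len(group)
        # Center within the group, scale by the spread of the whole batch.
        advantages.append([(r - group_mean) / (batch_std + eps) for r in group])
    return advantages


# Second group has identical rewards: its per-group std would be zero,
# while the batch-level std stays well defined.
adv = advantages_batch_std([[1.0, 0.0], [1.0, 1.0]])
print(adv)
```

The design point this illustrates is numerical: a group whose samples all receive the same reward has zero within-group variance, so dividing by a batch-wide statistic avoids amplifying noise from such groups.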

[26] Humans and transformer LMs: Abstraction drives language learning

Jasper Jian,Christopher D. Manning

Main category: cs.CL

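TL;DR: This paper studies how a transformer-based language model (LM) learns linguistic categories by comparing its behaviour during training with abstract feature-based and concrete exemplar-based accounts of human language acquisition; it finds that abstract class-level behaviour appears before lexical item-specific behaviour and that different linguistic behaviours emerge abruptly in sequence, indicating that abstraction plays a key role in LM learning.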

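Details Motivation: Investigate how a transformer-based language model learns linguistic categories and compare it with abstract feature-based and concrete exemplar-based accounts of human language acquisition, to understand whether LMs can serve as an existence proof for models of human acquisition. Method: Use GPT-2 small and novel divergence-based metrics that track learning trajectories through next-token distributions to analyze how lexical semantic and syntactic categories emerge. Result: (i) When a construction is learned, abstract class-level behaviour appears at earlier steps than lexical item-specific behaviour; (ii) different linguistic behaviours emerge abruptly, in sequence, at different points in training. Conclusion: Abstraction plays a key role in how language models learn, which supports treating LMs as an existence proof for certain theories of human language acquisition, especially those that emphasize abstraction. Abstract: Categorization is a core component of human linguistic competence. We investigate how a transformer-based language model (LM) learns linguistic categories by comparing its behaviour over the course of training to behaviours which characterize abstract feature-based and concrete exemplar-based accounts of human language acquisition. We investigate how lexical semantic and syntactic categories emerge using novel divergence-based metrics that track learning trajectories using next-token distributions. In experiments with GPT-2 small, we find that (i) when a construction is learned, abstract class-level behaviour is evident at earlier steps than lexical item-specific behaviour, and (ii) that different linguistic behaviours emerge abruptly in sequence at different points in training, revealing that abstraction plays a key role in how LMs learn. This result informs the models of human language acquisition that LMs may serve as an existence proof for.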
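
The abstract does not define its divergence-based metrics. As a stand-in, the sketch below computes the Jensen-Shannon divergence between two next-token distributions, a standard symmetric divergence that could be tracked across training checkpoints; the choice of this particular divergence and the function names are assumptions, not the authors' metric.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by ln(2)."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


# Hypothetical next-token distributions over a 3-word vocabulary
# at two checkpoints of the same model.
early = [0.6, 0.3, 0.1]
late = [0.1, 0.3, 0.6]
print(js_divergence(early, late))
```

Identical distributions give 0 and distributions with disjoint support give ln 2, so plotting the value between consecutive checkpoints would make abrupt behavioural changes visible as spikes.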

[27] Learning When to Attend: Conditional Memory Access for Long-Context LLMs

Sakshi Choudhary,Aditya Chattopadhyay,Luca Zancato,Elvis Nunez,Matthew Trager,Wei Xia,Stefano Soatto

Main category: cs.CL

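TL;DR: This paper proposes L2A (Learning To Attend), a token-wise conditional long-range memory access mechanism that preserves performance while sharply reducing global attention computation, markedly improving the training and inference efficiency of long-context language models.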

Details Motivation: Language models struggle to generalize beyond their pretraining context length, and continued long-context pretraining is expensive because attention scales quadratically; most tokens need only local context rather than global attention over the whole sequence. Method: Propose the L2A layer, which decides per token when to invoke global attention; design custom Triton kernels to implement this conditional attention efficiently; support post-training pruning of sparse global attention layers. Result: On Qwen 2.5 and Qwen 3, the effective context is extended from 32K to 128K tokens; performance comes within 3% of standard long-context training while skipping global attention for about 80% of tokens; training throughput and time-to-first-token improve by up to about 2x; KV cache memory drops by up to 50% with negligible performance loss. Conclusion: L2A is an efficient, scalable approach to long-context modeling that balances computation, memory, and performance well and suits long-horizon reasoning and retrieval with large language models. Abstract: Language models struggle to generalize beyond pretraining context lengths, limiting long-horizon reasoning and retrieval. Continued pretraining on long-context data can help but is expensive due to the quadratic scaling of Attention. We observe that most tokens do not require (Global) Attention over the entire sequence and can rely on local context. Based on this, we propose L2A (Learning To Attend), a layer that enables conditional (token-wise) long-range memory access by deciding when to invoke global attention. We evaluate L2A on Qwen 2.5 and Qwen 3 models, extending their effective context length from 32K to 128K tokens. L2A matches the performance of standard long-context training to within 3% while skipping Global Attention for ~80% of tokens, outperforming prior baselines. We also design custom Triton kernels to efficiently implement this token-wise conditional Attention on GPUs, achieving up to ~2x improvements in training throughput and time-to-first-token over FlashAttention. Moreover, L2A enables post-training pruning of highly sparse Global Attention layers, reducing KV cache memory by up to 50% with negligible performance loss.
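
To make the idea of token-wise conditional attention concrete, here is a toy sketch in plain Python. It is not the L2A layer: the per-token decision is passed in as a fixed boolean list `use_global` (in the paper it is learned), the local fallback is assumed to be a causal sliding window, and all names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(a * b for a, b in zip(query, k)) / scale for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def conditional_attention(queries, keys, values, use_global, window=2):
    """Causal attention where each token reads either the full prefix
    (use_global[t] is True) or only the last `window` positions."""
    outputs = []
    for t, query in enumerate(queries):
        start = 0 if use_global[t] else max(0, t - window + 1)
        outputs.append(attend(query, keys[start:t + 1], values[start:t + 1]))
    return outputs


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = conditional_attention(vectors, vectors, vectors,
                            use_global=[False, False, True], window=1)
print(out)
```

Only tokens flagged in `use_global` pay the cost of reading the whole prefix; the rest touch at most `window` keys, which is where the compute and KV-cache savings described in the abstract would come from.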

[28] Inducing Epistemological Humility in Large Language Models: A Targeted SFT Approach to Reducing Hallucination

Cem Uluoglakci,Tugba Taskaya Temizel

Main category: cs.CL

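TL;DR: This paper introduces the HypoTermInstruct dataset, which trains large models to recognize and admit ignorance of fictitious terms, strengthening their epistemological humility and significantly reducing hallucination while keeping performance on other tasks stable.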

Details Motivation: Large language models often hallucinate because supervised fine-tuning (SFT) implicitly rewards always answering; a training method is needed that explicitly cultivates epistemological humility, that is, recognizing the limits of one's knowledge and admitting uncertainty. Method: Build the HypoTermInstruct SFT dataset (11,151 questions about fictitious terms with 31,487 responses) and the multiply validated HypoTermQA-Enhanced benchmark; run 800 controlled LoRA fine-tuning experiments on Llama3.1-8B and Gemma3-4B, testing 100 configurations against paired controls. Result: Replacing generic instruction data with HypoTermInstruct raises the median HypoTerm Score by 0.19% to 25.91% and FactScore by 0.39% to 0.86%, while MMLU falls only slightly, by 0.26% to 0.35%. Conclusion: Targeted, high-quality SFT data can effectively reduce hallucination without preference modeling or reinforcement learning, offering a feasible and mechanistically interpretable path toward more reliable AI systems. Abstract: Large language models (LLMs) often hallucinate, producing fluent but false information, partly because supervised fine-tuning (SFT) implicitly rewards always responding. We introduce HypoTermInstruct, an SFT dataset (31,487 responses for 11,151 questions) designed to teach models epistemological humility, the ability to recognize the limits of their own knowledge and admit uncertainty. This is achieved through questions about non-existent "hypothetical" terms. We also release HypoTermQA-Enhanced, a benchmark for hallucination tendency strengthened through multiple validations. We conducted 800 controlled LoRA SFT runs across Llama3.1-8B and Gemma3-4B (base and instruct), testing 100 fine-tuning configurations with paired controls. Our results demonstrate that replacing generic instruction data with HypoTermInstruct significantly improves the HypoTerm Score (median increases of 0.19% to 25.91%) and FactScore (+0.39% to +0.86%), while maintaining stable performance on MMLU (minimal decreases of 0.26% to 0.35%). Our work demonstrates that targeted, high-quality SFT data teaching meta-cognitive skills can effectively reduce hallucination without preference/RL pipelines, providing mechanistic insights and a practical path toward more reliable AI systems.

[29] Language on Demand, Knowledge at Core: Composing LLMs with Encoder-Decoder Translation Models for Extensible Multilinguality

Mengyu Bu,Yang Feng

Main category: cs.CL

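TL;DR: This paper proposes the XBridge architecture, which hands multilingual understanding and generation to pretrained encoder-decoder translation models and lets the large language model (LLM) focus on English-centric knowledge processing, improving the LLM's multilingual ability on low-resource and unseen languages without retraining it.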

Details Motivation: Large language models (LLMs) generalize strongly but their multilingual performance is imbalanced, especially on low-resource or unseen languages, whereas pretrained encoder-decoder translation models already have balanced multilingual capability and are a natural complement to LLMs. Method: Propose XBridge, a compositional encoder-LLM-decoder architecture that offloads multilingual understanding and generation to external translation models, and introduce lightweight cross-model mapping layers and an optimal transport-based alignment objective to resolve representation misalignment between the models. Result: Experiments on four LLMs across multilingual understanding, reasoning, summarization, and generation show that XBridge outperforms strong baselines, especially on low-resource and unseen languages, without retraining the LLM. Conclusion: XBridge is an efficient, plug-and-play architecture that strengthens an LLM's multilingual ability without fine-tuning the LLM, offering a new approach to building truly multilingual large-model systems. Abstract: Large language models (LLMs) exhibit strong general intelligence, yet their multilingual performance remains highly imbalanced. Although LLMs encode substantial cross-lingual knowledge in a unified semantic space, they often struggle to reliably interface this knowledge with low-resource or unseen languages. Fortunately, pretrained encoder-decoder translation models already possess balanced multilingual capability, suggesting a natural complement to LLMs. In this work, we propose XBridge, a compositional encoder-LLM-decoder architecture that offloads multilingual understanding and generation to external pretrained translation models, while preserving the LLM as an English-centric core for general knowledge processing. To address the resulting representation misalignment across models, we introduce lightweight cross-model mapping layers and an optimal transport-based alignment objective, enabling fine-grained semantic consistency for multilingual generation. Experiments on four LLMs across multilingual understanding, reasoning, summarization, and generation indicate that XBridge outperforms strong baselines, especially on low-resource and previously unseen languages, without retraining the LLM.

[30] Detecting the Machine: A Comprehensive Benchmark of AI-Generated Text Detectors Across Architectures, Domains, and Adversarial Conditions

Madhav S. Baidya,S. S. Baidya,Chirag Chawla

Main category: cs.CL

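TL;DR: This paper presents a comprehensive benchmark that evaluates a range of machine-generated text detection methods across datasets and large language models, and finds that existing methods have marked limitations in cross-domain and cross-model generalization.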

Details Motivation: Existing benchmarks usually evaluate a single detector on a single dataset under ideal conditions and so do not reflect detection ability across domains, across large language models, or under adversarial conditions; a more comprehensive and challenging evaluation framework is needed. Method: Build a benchmark on two corpora, HC3 and ELI5, and systematically evaluate classical classifiers, fine-tuned transformer encoders (BERT, RoBERTa, and others), a CNN, an XGBoost stylometric model, perplexity-based methods, and LLM-as-detector prompting. Result: Transformer models are near-perfect in distribution but degrade markedly under domain shift; the XGBoost stylometric model performs comparably and is interpretable; LLMs used as detectors perform poorly and are affected by generator-detector identity bias; perplexity-based methods show polarity inversion but remain effective once corrected; no method generalizes robustly across domains and LLM sources. Conclusion: Current machine-generated text detection methods lack robust cross-domain and cross-model generalization, and further research is needed to improve their generality and reliability. Abstract: The rapid proliferation of large language models (LLMs) has created an urgent need for robust and generalizable detectors of machine-generated text. Existing benchmarks typically evaluate a single detector on a single dataset under ideal conditions, leaving open questions about cross-domain transfer, cross-LLM generalization, and adversarial robustness. We present a comprehensive benchmark evaluating diverse detection approaches across two corpora: HC3 (23,363 human-ChatGPT pairs) and ELI5 (15,000 human-Mistral-7B pairs). Methods include classical classifiers, fine-tuned transformer encoders (BERT, RoBERTa, ELECTRA, DistilBERT, DeBERTa-v3), a CNN, an XGBoost stylometric model, perplexity-based detectors, and LLM-as-detector prompting. Results show that transformer models achieve near-perfect in-distribution performance but degrade under domain shift. The XGBoost stylometric model matches performance while remaining interpretable. LLM-based detectors underperform and are affected by generator-detector identity bias. Perplexity-based methods exhibit polarity inversion, with modern LLM outputs showing lower perplexity than human text, but remain effective when corrected. No method generalizes robustly across domains and LLM sources.
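
The polarity inversion noted for perplexity-based detectors can be illustrated with a small threshold classifier. This is a sketch under stated assumptions rather than the benchmark's code: the function names, the threshold, and the perplexity values are all made up for illustration.

```python
def detect_machine(perplexities, threshold, machine_is_lower=True):
    """Flag texts as machine-generated by thresholding perplexity.

    machine_is_lower=True encodes the corrected polarity described in the
    abstract: modern LLM output tends to have lower perplexity than human text.
    """
    if machine_is_lower:
        return [ppl < threshold for ppl in perplexities]
    return [ppl > threshold for ppl in perplexities]


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Made-up perplexities; True marks machine-generated text.
ppl = [8.0, 9.5, 21.0, 30.0]
is_machine = [True, True, False, False]

inverted = detect_machine(ppl, threshold=15.0, machine_is_lower=False)
corrected = detect_machine(ppl, threshold=15.0, machine_is_lower=True)
print(accuracy(inverted, is_machine), accuracy(corrected, is_machine))
```

On this toy data the same threshold is either always wrong or always right depending only on the assumed direction, which is the sense in which such detectors "remain effective when corrected".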

[31] AURORA Model of Formant-to-Tongue Inversion for Didactic and Clinical Applications

Patrycja Strycharczuk,Sam Kirkham

Main category: cs.CL

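TL;DR: The AURORA model predicts tongue displacement and shape in vowel production from the first two formant values; it is intended as a didactic aid for explaining the relationship between formants and articulation and as a foundation for biofeedback applications.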

Details Motivation: Explain the relationship between formants and the underlying tongue articulation, and support phonetics teaching and biofeedback applications. Method: Build the AURORA model from ultrasound tongue imaging and acoustic data from 40 native speakers of English to predict tongue displacement and shape in vowel sounds. Result: A Shiny app and prototype software for real-time tongue biofeedback were developed, making the model more accessible to students of phonetics, linguists in adjacent fields, and speech and language therapy practitioners and clients. Conclusion: The AURORA model supports understanding in phonetics teaching and offers a feasible foundation for future biofeedback applications such as interventions for speech disorders. Abstract: This paper outlines the conceptual and computational foundations of the AURORA (Acoustic Understanding and Real-time Observation of Resonant Articulations) model. AURORA predicts tongue displacement and shape in vowel sounds based on the first two formant values. It is intended as a didactic aid helping to explain the relationship between formants and the underlying articulation, as well as a foundation for biofeedback applications. The model is informed by ultrasound tongue imaging and acoustic data from 40 native speakers of English. In this paper we discuss the motivation for the model, the modelling objectives as well as the model architecture. We provide a qualitative evaluation of the model, focusing on selected tongue features. We then present two tools developed to make the model more accessible to a wider audience, a Shiny app and a prototype software for real-time tongue biofeedback. Potential users include students of phonetics, linguists in fields adjacent to phonetics, as well as speech and language therapy practitioners and clients.

[32] Zipper-LoRA: Dynamic Parameter Decoupling for Speech-LLM based Multilingual Speech Recognition

Yuxiang Mei,Delai Qiu,Shengping Liu,Jiaen Liang,Yanhua Long

Main category: cs.CL

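TL;DR: This paper proposes Zipper-LoRA, a LoRA variant that addresses the stability-plasticity dilemma in multilingual ASR by using a language-conditioned router to combine shared and language-specific subspaces dynamically at the rank level, together with a two-stage training strategy that improves performance on low-resource languages.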

Details Motivation: In multilingual ASR with imbalanced data, fully shared PEFT causes negative cross-lingual interference, while fully independent tuning blocks cross-lingual knowledge transfer, so a method is needed that balances stability and plasticity. Method: Propose the Zipper-LoRA framework (with Static, Hard, and Soft variants), which dynamically synthesizes LoRA updates from shared and language-specific subspaces at the rank level; use a lightweight language-conditioned router to control each subspace's contribution; and add a two-stage training strategy with an Initial-B warm start to stabilize optimization. Result: In a 12-language mixed-resource setting, Zipper-LoRA consistently outperforms both the fully shared and the fully independent baselines, with especially clear gains in extremely low-resource scenarios, and remains robust under both chunked and non-chunked encoder configurations. Conclusion: Through fine-grained, language-aware parameter decoupling at the rank level, Zipper-LoRA eases the stability-plasticity conflict in multilingual ASR and offers a reliable, scalable solution for large-scale low-resource speech recognition. Abstract: Speech Large Language Models (Speech-LLMs) have emerged as a powerful approach for automatic speech recognition (ASR) by aligning speech encoders with large language models. However, adapting these systems to multilingual settings with imbalanced data distributions remains challenging. In such scenarios, a stability-plasticity dilemma often arises: fully shared Parameter-Efficient Fine-Tuning (PEFT) can cause negative inter-lingual interference for under-represented languages, while fully language-specific tuning limits the beneficial cross-lingual knowledge transfer needed for low-resource tasks. To address this, we propose Zipper-LoRA, a novel rank-level decoupling framework with three variants (Static, Hard, and Soft) that dynamically synthesizes LoRA updates from shared and language-specific subspaces. By using a lightweight language-conditioned router, Zipper-LoRA dynamically controls the contribution of each subspace at the LoRA rank level, enabling fine-grained sharing where languages are compatible and strict decoupling when conflicts occur. To further stabilize optimization under imbalanced data, we propose a two-stage training strategy with an Initial-B warm start that significantly accelerates convergence. Experiments on a 12-language mixed-resource setting show that Zipper-LoRA consistently outperforms both fully shared and independent baselines, particularly in extremely low-resource scenarios.
Moreover, we demonstrate that these gains are robust across both chunked and non-chunked encoder configurations, confirming the framework's reliability for practical, large-scale multilingual ASR. Our code and data will be available at https://github.com/YuCeong-May/Zipper-LoRA for reproducibility.
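
The abstract describes mixing shared and language-specific subspaces at the LoRA rank level but does not give the update rule. The sketch below assumes one simple form for illustration: each rank has a shared rank-1 factor and a language-specific rank-1 factor, and a per-rank gate in [0, 1] (which a router would produce) interpolates between them. The function `zipper_delta` and this interpolation rule are hypothetical, not the released implementation.

```python
def outer(column, row):
    """Outer product of two vectors as a nested list (rank-1 matrix)."""
    return [[c * r for r in row] for c in column]


def add_scaled(acc, matrix, scale):
    """Return acc + scale * matrix, element-wise."""
    return [[a + scale * m for a, m in zip(acc_row, m_row)]
            for acc_row, m_row in zip(acc, matrix)]


def zipper_delta(shared, specific, gates):
    """Assumed rank-level mix of shared and language-specific LoRA factors.

    shared, specific: lists of (b_column, a_row) pairs, one per rank.
    gates: per-rank weights in [0, 1]; 1.0 uses only the shared factor,
           0.0 only the language-specific one.
    """
    d_out, d_in = len(shared[0][0]), len(shared[0][1])
    delta = [[0.0] * d_in for _ in range(d_out)]
    for (b_sh, a_sh), (b_sp, a_sp), gate in zip(shared, specific, gates):
        delta = add_scaled(delta, outer(b_sh, a_sh), gate)
        delta = add_scaled(delta, outer(b_sp, a_sp), 1.0 - gate)
    return delta


# One rank, 2x2 weight update, hypothetical factor values.
shared = [([1.0, 0.0], [1.0, 2.0])]
specific = [([0.0, 1.0], [3.0, 4.0])]
print(zipper_delta(shared, specific, gates=[0.5]))
```

With gates fixed at 1.0 this reduces to an ordinary fully shared LoRA update and with gates at 0.0 to a fully language-specific one, the two baselines the abstract compares against.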

[33] KA2L: A Knowledge-Aware Active Learning Framework for LLMs

Haoxuan Yin,Bojian Liu,Chen Tang,Yangfan Wang,Lian Yan,Jingchi Jiang

Main category: cs.CL

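TL;DR: This paper proposes KA2L, a knowledge-aware active learning framework that analyzes how domain knowledge is distributed in an LLM's latent space, identifies knowledge the model lacks, and generates corresponding questions for efficient fine-tuning, markedly reducing annotation and computation costs while improving performance.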

Details Motivation: Existing work rarely assesses how deeply large language models understand domain-specific knowledge or how active learning can be used to improve their expertise in a targeted way. Method: Propose the KA2L framework: (1) a knowledge distribution probing technique that analyzes the hidden states of specific Transformer layers to separate known from unknown knowledge; (2) a hidden-state decoding method that generates unknown questions in natural language from the latent knowledge space; (3) active fine-tuning on hard samples selected by degree of knowledge mastery. Result: Validated on nine open-source LLMs, KA2L cuts annotation and computation costs by 50% across two open-domain datasets and one vertical-domain dataset while achieving better performance. Conclusion: KA2L offers a new paradigm for efficient, targeted knowledge enhancement of LLMs and shows that combining knowledge representation analysis with active learning is an effective way to improve model expertise. Abstract: Fine-tuning large language models (LLMs) with high-quality knowledge has been shown to enhance their performance effectively. However, there is a paucity of research on the depth of domain-specific knowledge comprehension by LLMs and the application of targeted active learning to improve their expertise. To address this gap, we introduce the Knowledge-Aware Active Learning (KA2L) framework. This framework assesses LLMs' mastery of specific knowledge points to aid in constructing unanswerable or unknowable questions through latent space analysis. This active learning strategy enhances training efficiency by focusing on knowledge the model has yet to master, thereby minimizing redundancy in learning already acquired information. This study innovatively employs a knowledge distribution probing technique to examine the hidden states of specific Transformer layers and identify the distribution of known and unknown knowledge within the LLM. Additionally, a hidden-state decoding method is proposed to generate numerous unknown questions in natural language from the latent knowledge space. In our experiments, we selected nine open-source LLMs to validate the effectiveness of the proposed framework. Results indicate that KA2L not only reduces annotation and computation costs by 50% across two open-domain datasets and one vertical-domain dataset but also achieves better performance, offering valuable insights into active learning strategies for LLMs. The code is available at https://anonymous.4open.science/r/KA2L-F15C.

[34] VeriAgent: A Tool-Integrated Multi-Agent System with Evolving Memory for PPA-Aware RTL Code Generation

Yaoxiang Wang,Qi Shi,ShangZhan Li,Qingguo Hu,Xinyu Yin,Bo Guo,Xu Han,Maosong Sun,Jinsong Su

Main category: cs.CL

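TL;DR: This paper proposes a PPA-aware (power, performance, area) multi-agent framework that integrates EDA tools into a closed-loop workflow, combines programmer, correctness, and PPA agents, and introduces an evolvable structured memory mechanism to jointly optimize functional correctness and physical design metrics in RTL code generation.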

Details Motivation: Existing large language models can achieve functional correctness in automatic RTL code generation but overlook key physical design objectives (Power, Performance, Area), so they fall short of real hardware design needs. Method: Build a tool-integrated multi-agent framework (Programmer Agent, Correctness Agent, PPA Agent) with EDA tools embedded in a closed loop; design an Evolved Memory Mechanism that uses structured memory nodes and a dynamic memory manager to enable continuous improvement without retraining. Result: Experiments show that the approach keeps functional correctness high while significantly improving PPA metrics, shifting RTL generation from one-shot reasoning to continual, feedback-driven optimization. Conclusion: The framework offers a scalable, continually improvable path for deploying LLMs in real-world hardware design flows. Abstract: LLMs have recently demonstrated strong capabilities in automatic RTL code generation, achieving high syntactic and functional correctness. However, most methods focus on functional correctness while overlooking critical physical design objectives, including Power, Performance, and Area. In this work, we propose a PPA-aware, tool-integrated multi-agent framework for high-quality Verilog code generation. Our framework explicitly incorporates EDA tools into a closed-loop workflow composed of a Programmer Agent, a Correctness Agent, and a PPA Agent, enabling joint optimization of functional correctness and physical metrics. To support continuous improvement without model retraining, we introduce an Evolved Memory Mechanism that externalizes optimization experience into structured memory nodes. A dedicated memory manager dynamically maintains the memory pool and allows the system to refine strategies based on historical execution trajectories. Extensive experiments demonstrate that our approach achieves strong functional correctness while delivering significant improvements in PPA metrics. By integrating tool-driven feedback with structured and evolvable memory, our framework transforms RTL generation from one-shot reasoning into a continual, feedback-driven optimization process, providing a scalable pathway for deploying LLMs in real-world hardware design flows.

[35] Do Language Models Encode Semantic Relations? Probing and Sparse Feature Analysis

Andor Diera,Ansgar Scherp

Main category: cs.CL

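TL;DR: This paper studies how language models of different scales (Pythia-70M, GPT-2, Llama 3.1 8B) represent four semantic relations (synonymy, antonymy, hypernymy, hyponymy). Combining linear probing with mechanistic interpretability methods (sparse autoencoders and activation patching), it finds a directional asymmetry in hierarchical relations (hypernymy is redundant and robust, hyponymy is compact and easily disrupted), stronger semantic signals in the mid-layers and MLP pathways, and a stable difficulty ordering (antonymy easiest, synonymy hardest). Causal interventions become more effective as model capacity grows.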

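Details Motivation: Understanding whether LLMs capture structured meaning requires examining how they represent relations between concepts, in particular how the internal encoding and robustness differ across semantic relations. Method: Combine linear probing with mechanistic interpretability techniques (sparse autoencoders, SAE, and activation patching) to analyze where four semantic relations are represented, which features contribute, and what causal effect they have, across three models of increasing scale. Result: (1) Hierarchical relations show a directional asymmetry: hypernymy is encoded redundantly and resists suppression, while hyponymy relies on compact, easily disrupted features. (2) Relation signals are diffuse overall but have stable profiles: they peak in the mid-layers and are stronger in MLP pathways than in attention. (3) Relation difficulty is consistent across models (antonymy easiest, synonymy hardest). (4) SAE-guided causal interventions work reliably only on Llama 3.1; on smaller models the effects are weak or unstable. Conclusion: Semantic relations have localizable, interpretable representation patterns in LLMs whose robustness is closely tied to model capacity; the work provides a reproducible framework for connecting sparse features to probe-level causal evidence. Abstract: Understanding whether large language models (LLMs) capture structured meaning requires examining how they represent concept relationships. In this work, we study three models of increasing scale: Pythia-70M, GPT-2, and Llama 3.1 8B, focusing on four semantic relations: synonymy, antonymy, hypernymy, and hyponymy. We combine linear probing with mechanistic interpretability techniques, including sparse autoencoders (SAE) and activation patching, to identify where these relations are encoded and how specific features contribute to their representation. Our results reveal a directional asymmetry in hierarchical relations: hypernymy is encoded redundantly and resists suppression, while hyponymy relies on compact features that are more easily disrupted by ablation. More broadly, relation signals are diffuse but exhibit stable profiles: they peak in the mid-layers and are stronger in post-residual/MLP pathways than in attention. Difficulty is consistent across models (antonymy easiest, synonymy hardest). Probe-level causality is capacity-dependent: on Llama 3.1, SAE-guided patching reliably shifts these signals, whereas on smaller models the shifts are weak or unstable. Our results clarify where and how reliably semantic relations are represented inside LLMs, and provide a reproducible framework for relating sparse features to probe-level causal evidence.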

[36] Adaptive Guidance for Retrieval-Augmented Masked Diffusion Models

Jaemin Kim,Jong Chul Ye

Main category: cs.CL

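TL;DR: This paper proposes ARAM, a framework that improves the question-answering performance of masked diffusion models (MDMs) in retrieval-augmented generation (RAG). It dynamically adjusts the guidance scale based on the signal-to-noise ratio (SNR) to mitigate retrieval-prior conflicts caused by noisy retrieved context.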

Details Motivation: In retrieval-augmented generation (RAG), retrieved context that is noisy, unreliable, or inconsistent with the model's parametric knowledge causes retrieval-prior conflicts that degrade generation quality. The problem has received little attention in diffusion language models, where iterative denoising poses its own challenges. Method: Propose ARAM, a training-free adaptive guidance framework for masked diffusion models (MDMs). During denoising it calibrates the guidance scale according to the SNR of the distributional shift induced by the retrieved context, strengthening guidance when the SNR is high and suppressing it when the SNR is low. Result: On multiple knowledge-intensive QA benchmarks, ARAM improves overall QA performance over competitive RAG baselines. Conclusion: ARAM gives diffusion language models an effective, lightweight, and adaptive way to integrate retrieved context in RAG settings, mitigating retrieval-prior conflicts and supporting the value of SNR-aware dynamic guidance. Abstract: Retrieval-Augmented Generation (RAG) improves factual grounding by incorporating external knowledge into language model generation. However, when retrieved context is noisy, unreliable, or inconsistent with the model's parametric knowledge, it introduces retrieval-prior conflicts that can degrade generation quality. While this problem has been studied in autoregressive language models, it remains largely unexplored in diffusion-based language models, where the iterative denoising process introduces unique challenges for integrating retrieved context. In this work, we propose Adaptive Retrieval-Augmented Masked Diffusion (ARAM), a training-free adaptive guidance framework for Masked Diffusion Models (MDMs) in RAG settings. ARAM dynamically calibrates the guidance scale during denoising according to the Signal-to-Noise Ratio (SNR) of the distributional shift induced by retrieved context. Intuitively, the model strengthens guidance when the retrieved context provides reliable corrective evidence and suppresses it when the contextual signal is noisy or non-supportive. Extensive experiments on multiple knowledge-intensive QA benchmarks show that ARAM improves overall QA performance over competitive RAG baselines.

[37] Harm or Humor: A Multimodal, Multilingual Benchmark for Overt and Covert Harmful Humor

Ahmed Sharshar,Hosam Elgendy,Saad El Dine Ahmed,Yasser Rohaim,Yuxia Wang

Main category: cs.CL

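TL;DR: This paper introduces a new multimodal, multilingual benchmark for detecting and understanding dark humor, targeting safety challenges from culturally implicit cues that static benchmarks miss. The dataset contains 3,000 texts, 6,000 images, and 1,200 videos covering English, Arabic, and language-independent contexts, annotated under strict guidelines that separate safe from harmful (explicit or implicit) humor. Experiments show that closed-source models significantly outperform open-source ones and that performance differs markedly between English and Arabic, underscoring the need for culturally grounded, reasoning-aware safety alignment.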

Details Motivation: Current static toxicity benchmarks struggle to capture the safety risks of dark humor, which depends on cultural background and implicit cues; a multimodal, multilingual benchmark is needed to evaluate models' deep contextual reasoning. Method: Build a manually annotated multimodal, multilingual dark-humor dataset (text, images, video) covering English, Arabic, and language-independent settings; design a three-way annotation scheme (Safe / Harmful-Explicit / Harmful-Implicit); systematically evaluate leading open- and closed-source models on every modality. Result: Closed-source models significantly outperform open-source ones overall; performance on English is generally better than on Arabic; all models are weak at recognizing implicit (covert) harmful humor, exposing gaps in deep reasoning and cultural understanding. Conclusion: Existing models have serious shortcomings in cross-lingual recognition of harmful humor, especially the implicit kind, which calls for safety alignment methods with cultural awareness and contextual reasoning. Abstract: Dark humor often relies on subtle cultural nuances and implicit cues that require contextual reasoning to interpret, posing safety challenges that current static benchmarks fail to capture. To address this, we introduce a novel multimodal, multilingual benchmark for detecting and understanding harmful and offensive humor. Our manually curated dataset comprises 3,000 texts and 6,000 images in English and Arabic, alongside 1,200 videos that span English, Arabic, and language-independent (universal) contexts. Unlike standard toxicity datasets, we enforce a strict annotation guideline: distinguishing Safe jokes from Harmful ones, with the latter further classified into Explicit (overt) and Implicit (covert) categories to probe deep reasoning. We systematically evaluate state-of-the-art (SOTA) open and closed-source models across all modalities. Our findings reveal that closed-source models significantly outperform open-source ones, with a notable difference in performance between the English and Arabic languages in both, underscoring the critical need for culturally grounded, reasoning-aware safety alignment. Warning: this paper contains example data that may be offensive, harmful, or biased.

[38] CoVerRL: Breaking the Consensus Trap in Label-Free Reasoning via Generator-Verifier Co-Evolution

Teng Pan,Yuchen Yan,Zixuan Wang,Ruiqing Zhang,Gaiyang Han,Wanqi Zhang,Weiming Lu,Jun Xiao,Yongliang Shen

Main category: cs.CL

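TL;DR: This paper proposes CoVerRL, a framework in which a single model alternates between generator and verifier roles. Majority voting supervises the training of the verifier, and the improving verifier filters self-consistent errors, which avoids the consensus trap and improves mathematical reasoning.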

Details Motivation: Existing label-free reinforcement learning methods suffer from a consensus trap: as the model over-optimizes for self-consistency, output diversity collapses and systematic errors go undetected. Method: Propose CoVerRL, in which the same model alternates between generator and verifier roles. Majority voting gives the verifier noisy but useful supervision, and the verifier progressively filters self-consistent errors out of the pseudo-labels, so the two capabilities co-evolve. Result: Across Qwen and Llama model families, CoVerRL outperforms label-free baselines by 4.7-5.9% on mathematical reasoning benchmarks; self-verification accuracy rises from around 55% to over 85%. Conclusion: Through generator-verifier co-evolution, CoVerRL breaks the consensus trap and substantially improves reasoning and self-verification without ground-truth labels. Abstract: Label-free reinforcement learning enables large language models to improve reasoning capabilities without ground-truth supervision, typically by treating majority-voted answers as pseudo-labels. However, we identify a critical failure mode: as training maximizes self-consistency, output diversity collapses, causing the model to confidently reinforce systematic errors that evade detection. We term this the consensus trap. To escape it, we propose CoVerRL, a framework where a single model alternates between generator and verifier roles, with each capability bootstrapping the other. Majority voting provides noisy but informative supervision for training the verifier, while the improving verifier progressively filters self-consistent errors from pseudo-labels. This co-evolution creates a virtuous cycle that maintains high reward accuracy throughout training. Experiments across Qwen and Llama model families demonstrate that CoVerRL outperforms label-free baselines by 4.7-5.9% on mathematical reasoning benchmarks. Moreover, self-verification accuracy improves from around 55% to over 85%, confirming that both capabilities genuinely co-evolve.
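
The interplay between majority-voted pseudo-labels and a verifier can be illustrated with a toy sketch. This is not the CoVerRL training procedure: the function names, the `min_agreement` threshold, and the arithmetic `toy_verifier` are hypothetical stand-ins that only show how a verifier check can drop a pseudo-label that every sample agrees on but that is still wrong.

```python
from collections import Counter


def majority_vote(answers):
    """Return (pseudo_label, agreement) for a list of sampled answers."""
    counts = Counter(answers)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(answers)


def filter_pseudo_labels(samples, verifier, min_agreement=0.5):
    """Keep pseudo-labels that win the vote and also pass the verifier.

    `samples` maps each question to its sampled answers; `verifier` is any
    callable (question, answer) -> bool. Both are illustrative assumptions.
    """
    kept = {}
    for question, answers in samples.items():
        label, agreement = majority_vote(answers)
        if agreement >= min_agreement and verifier(question, label):
            kept[question] = label
    return kept


def toy_verifier(question, answer):
    """Hypothetical verifier: re-computes 'a + b' or 'a * b' questions."""
    a, op, b = question.split()
    result = int(a) + int(b) if op == "+" else int(a) * int(b)
    return str(result) == answer


samples = {
    "2 + 2": ["4", "4", "5"],   # majority is correct
    "3 * 3": ["6", "6", "6"],   # unanimous but wrong: a self-consistent error
}
print(filter_pseudo_labels(samples, toy_verifier))
```

In this toy run the unanimous answer to `3 * 3` is rejected, which is the kind of self-consistent error that voting alone cannot detect.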

[39] Process Supervision for Chain-of-Thought Reasoning via Monte Carlo Net Information Gain

Corentin Royer,Debarun Bhattacharjya,Gaetano Rossiello,Andrea Giovannini,Mennatallah El-Assady

Main category: cs.CL

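TL;DR: This paper proposes an information-theoretic method for automatically generating step-level labels to train process reward models (PRMs), making multi-step LLM reasoning more reliable and its supervision more scalable.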

Details Motivation: Multi-step reasoning strengthens LLMs but lets errors propagate; existing PRM training depends on costly human annotation or computationally intensive automatic labeling, so a more efficient, scalable source of supervision is needed. Method: Use information theory to estimate how each reasoning step changes the probability of the correct answer and use that as a step-quality signal; the method reduces computational complexity from O(N log N) to O(N). Result: On diverse reasoning benchmarks, including mathematics, Python programming, SQL, and scientific question answering, the generated labels enable effective chain-of-thought selection in best-of-K evaluation. Conclusion: The method enables efficient, scalable supervision of LLM reasoning, particularly for tasks that are sensitive to error propagation. Abstract: Multi-step reasoning improves the capabilities of large language models (LLMs) but increases the risk of errors propagating through intermediate steps. Process reward models (PRMs) mitigate this by scoring each step individually, enabling fine-grained supervision and improved reliability. Existing methods for training PRMs rely on costly human annotations or computationally intensive automatic labeling. We propose a novel approach to automatically generate step-level labels using information theory. Our method estimates how each reasoning step affects the likelihood of the correct answer, providing a signal of step quality. Importantly, it reduces computational complexity to $\mathcal{O}(N)$, improving over the previous $\mathcal{O}(N \log N)$ methods. We demonstrate that these labels enable effective chain-of-thought selection in best-of-$K$ evaluation settings across diverse reasoning benchmarks, including mathematics, Python programming, SQL, and scientific question answering. This work enables scalable and efficient supervision of LLM reasoning, particularly for tasks where error propagation is critical.
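
One way to read "how each reasoning step affects the likelihood of the correct answer" is as a per-step change in log-probability. The sketch below assumes we already have estimates of P(correct answer | first k steps) for k = 0..N, for example from sampled rollouts. The function names and the zero threshold are illustrative assumptions, and the paper's actual estimator may differ.

```python
import math


def step_information_gain(correct_answer_probs):
    """Per-step gain in log-probability of the correct answer.

    correct_answer_probs[k] is an (assumed) estimate of
    P(correct answer | first k reasoning steps), for k = 0..N.
    One pass over the list, so the cost is linear in the number of steps.
    """
    gains = []
    for before, after in zip(correct_answer_probs, correct_answer_probs[1:]):
        gains.append(math.log(after) - math.log(before))
    return gains


def label_steps(correct_answer_probs, threshold=0.0):
    """Label a step 1 if it raises the correct answer's log-probability."""
    return [1 if g > threshold else 0
            for g in step_information_gain(correct_answer_probs)]


# Hypothetical estimates after 0, 1, 2, and 3 steps of a chain of thought.
probs = [0.2, 0.4, 0.1, 0.8]
print(label_steps(probs))
```

Because the gains telescope, their sum equals the total log-probability change from the empty prefix to the full chain, which makes the per-step labels easy to sanity-check.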

[40] Text-to-Stage: Spatial Layouts from Long-form Narratives

Jefferson Hernandez,Swarnadeep Saha,Chenxi Whitehouse,Sanjeel Parekh,Calvin Murdock,Yuliang Li,W. Owen Brimijoin,Vamsi Krishna Ithapu,Ishwarya Ananthabhotla

Main category: cs.CL

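TL;DR: This paper studies language models' ability to perform spatial reasoning from unstructured text. It introduces the narrative-to-play task (inferring stage layouts from narrative), a dramaturgy-inspired evaluation suite, and a training recipe that combines rejection supervised fine-tuning with reinforcement learning from verifiable rewards, validated on classical English literature.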

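Details Motivation: Probe whether language models can reason spatially over text that lacks explicit spatial, positional, or relational cues, mimicking human spatial understanding and supporting downstream media applications. Method: Define the narrative-to-play task; build a dramaturgy-inspired deterministic evaluation suite; combine rejection supervised fine-tuning (SFT) using Best-of-N sampling with reinforcement learning (RL) from verifiable rewards via GRPO for training and inference. Result: On a text-only corpus of classical English literature, the approach improves over vanilla models on several metrics (character attribution, spatial plausibility, movement economy) and is preferred by an LLM judge and by human raters. Conclusion: Task-specific modeling, tailored evaluation, and a hybrid training strategy can effectively improve implicit spatial reasoning in language models, offering a viable route from text to spatial structure. Abstract: In this work, we probe the ability of a language model to demonstrate spatial reasoning from unstructured text, mimicking human capabilities and automating a process that benefits many downstream media applications. Concretely, we study the narrative-to-play task: inferring stage-play layouts (scenes, speaker positions, movements, and room types) from text that lacks explicit spatial, positional, or relational cues. We then introduce a dramaturgy-inspired deterministic evaluation suite and, finally, a training and inference recipe that combines rejection SFT using Best-of-N sampling with RL from verifiable rewards via GRPO. Experiments on a text-only corpus of classical English literature demonstrate improvements over vanilla models across multiple metrics (character attribution, spatial plausibility, and movement economy), as well as alignment with an LLM-as-a-judge and subjective human preferences.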

[41] Event-Centric Human Value Understanding in News-Domain Texts: An Actor-Conditioned, Multi-Granularity Benchmark

Yao Wang,Xin Liu,Zhuochen Liu,Jiankang Chen,Adam Jatowt,Kyoungsook Kim,Noriko Kando,Haitao Yu

Main category: cs.CL

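TL;DR: This paper presents NEVU, a new benchmark for understanding human values in factual news that is event-centric, actor-conditioned, and direction-aware, together with a multi-level annotated dataset and unified baseline models.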

Details Motivation: Existing value datasets are limited for factual news understanding: they do not distinguish actors, rely on isolated utterances or synthetic scenarios, and lack explicit event structure and value direction. Method: Build the NEVU benchmark from 2,865 English news articles using an LLM-assisted annotation pipeline with staged verification and human auditing; label (unit, actor) pairs at four semantic unit levels (subevent, behavior-based composite event, story-based composite event, and article); use a hierarchical value space with 54 fine-grained values and 20 coarse-grained categories; provide unified baselines for proprietary and open-source LLMs and test lightweight LoRA fine-tuning. Result: NEVU contains 45,793 unit-actor pairs and 168,061 directed value instances; LoRA fine-tuning consistently improves open-source models on the task; NEVU serves as an evaluation benchmark and also supports supervised adaptation. Conclusion: NEVU fills the gap in event-driven, actor-sensitive, direction-aware value understanding data for factual news and supports more robust, interpretable value-aware NLP systems. Abstract: Existing human value datasets do not directly support value understanding in factual news: many are actor-agnostic, rely on isolated utterances or synthetic scenarios, and lack explicit event structure or value direction. We present NEVU (News Event-centric Value Understanding), a benchmark for actor-conditioned, event-centric, and direction-aware human value recognition in factual news. NEVU evaluates whether models can identify value cues, attribute them to the correct actor, and determine value direction from grounded evidence. Built from 2,865 English news articles, NEVU organizes annotations at four semantic unit levels (Subevent, behavior-based composite event, story-based composite event, and Article) and labels (unit, actor) pairs for fine-grained evaluation across local and composite contexts. The annotations are produced through an LLM-assisted pipeline with staged verification and targeted human auditing. Using a hierarchical value space with 54 fine-grained values and 20 coarse-grained categories, NEVU covers 45,793 unit-actor pairs and 168,061 directed value instances. We provide unified baselines for proprietary and open-source LLMs, and find that lightweight adaptation (LoRA) consistently improves open-source models, showing that although NEVU is designed primarily as a benchmark, it also supports supervised adaptation beyond prompting-only evaluation. Data availability is described in the paper's appendix.

[42] How do LLMs Compute Verbal Confidence

Dharshan Kumaran,Arthur Conmy,Federico Barbero,Simon Osindero,Viorica Patraucean,Petar Velickovic

Main category: cs.CL

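TL;DR: This paper studies how large language models (LLMs) generate verbal confidence. It finds that confidence is not computed just-in-time but is automatically cached once the answer has been generated, and that it draws on a broader evaluation of answer quality rather than token log-probabilities alone, indicating that LLMs perform automatic, sophisticated self-evaluation.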

Details Motivation: Verbal confidence is widely used to extract uncertainty estimates from black-box LLMs, but its internal mechanism is unclear: is it computed just-in-time or cached in advance, and does it reflect simple token probabilities or a richer evaluation of answer quality? Method: Study Gemma 3 27B and Qwen 2.5 7B with activation steering, patching, noising, and swap experiments, attention blocking, linear probing, and variance partitioning. Result: Confidence representations appear at answer-adjacent positions before they appear at the verbalization site. Information flows from the answer tokens, is cached at the first post-answer position, and is then retrieved for output. The cached representations explain substantial variance in verbal confidence beyond what token log-probabilities explain. Conclusion: Verbal confidence reflects an automatic, sophisticated self-evaluation performed during answer generation rather than a post-hoc reconstruction, which offers a new perspective on metacognition in LLMs and on improving calibration. Abstract: Verbal confidence, prompting LLMs to state their confidence as a number or category, is widely used to extract uncertainty estimates from black-box models. However, how LLMs internally generate such scores remains unknown. We address two questions: first, when confidence is computed - just-in-time when requested, or automatically during answer generation and cached for later retrieval; and second, what verbal confidence represents - token log-probabilities, or a richer evaluation of answer quality? Focusing on Gemma 3 27B and Qwen 2.5 7B, we provide convergent evidence for cached retrieval. Activation steering, patching, noising, and swap experiments reveal that confidence representations emerge at answer-adjacent positions before appearing at the verbalization site. Attention blocking pinpoints the information flow: confidence is gathered from answer tokens, cached at the first post-answer position, then retrieved for output. Critically, linear probing and variance partitioning reveal that these cached representations explain substantial variance in verbal confidence beyond token log-probabilities, suggesting a richer answer-quality evaluation rather than a simple fluency readout. These findings demonstrate that verbal confidence reflects automatic, sophisticated self-evaluation, not post-hoc reconstruction, with implications for understanding metacognition in LLMs and improving calibration.

[43] Mitigating LLM Hallucinations through Domain-Grounded Tiered Retrieval

Md. Asraful Haque,Aasar Mehdi,Maaz Mahboob,Tamkeen Fatima

Main category: cs.CL

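TL;DR: This paper proposes a domain-grounded tiered retrieval and verification architecture. A four-phase self-regulating pipeline (intrinsic verification, adaptive search routing, corrective document grading, and extrinsic regeneration with atomic claim-level verification) markedly improves the factual accuracy of LLMs, especially in time-sensitive and numerically precise settings, although a "False-Premise Overclaiming" failure mode remains.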

Details Motivation: LLMs are highly fluent yet prone to factual errors ("hallucinations"), which is especially serious in high-stakes domains, so their reliability and factual consistency need to improve. Method: Propose a four-phase self-regulating pipeline implemented with LangGraph: (I) intrinsic verification with early exit to save compute; (II) adaptive search routing using a domain detector; (III) corrective document grading (CRAG) to filter irrelevant context; (IV) extrinsic regeneration followed by atomic claim-level verification. Result: The pipeline outperforms zero-shot baselines on all five benchmarks (650 queries in total), with win rates of 83.7% on TimeQA v2 and 78.0% on MMLU Global Facts; groundedness scores for factual answers stay between 78.8% and 86.4%; "False-Premise Overclaiming" is identified as the main failure mode. Conclusion: The tiered RAG architecture provides a robust fail-safe against misleading output and an empirical characterization of multi-stage RAG behavior; future work should add pre-retrieval "answerability" nodes to further close the reliability gap in conversational AI. Abstract: Large Language Models (LLMs) have achieved unprecedented fluency but remain susceptible to "hallucinations" - the generation of factually incorrect or ungrounded content. This limitation is particularly critical in high-stakes domains where reliability is paramount. We propose a domain-grounded tiered retrieval and verification architecture designed to systematically intercept factual inaccuracies by shifting LLMs from stochastic pattern-matchers to verified truth-seekers. The proposed framework utilizes a four-phase, self-regulating pipeline implemented via LangGraph: (I) Intrinsic Verification with Early-Exit logic to optimize compute, (II) Adaptive Search Routing utilizing a Domain Detector to target subject-specific archives, (III) Corrective Document Grading (CRAG) to filter irrelevant context, and (IV) Extrinsic Regeneration followed by atomic claim-level verification. The system was evaluated across 650 queries from five diverse benchmarks: TimeQA v2, FreshQA v2, HaluEval General, MMLU Global Facts, and TruthfulQA. Empirical results demonstrate that the pipeline consistently outperforms zero-shot baselines across all environments. Win rates peaked at 83.7% in TimeQA v2 and 78.0% in MMLU Global Facts, confirming high efficacy in domains requiring granular temporal and numerical precision. Groundedness scores remained robustly stable between 78.8% and 86.4% across factual-answer rows. While the architecture provides a robust fail-safe for misinformation, a persistent failure mode of "False-Premise Overclaiming" was identified. These findings provide a detailed empirical characterization of multi-stage RAG behavior and suggest that future work should prioritize pre-retrieval "answerability" nodes to further bridge the reliability gap in conversational AI.

[44] DebugLM: Learning Traceable Training Data Provenance for LLMs

Wenjie Jacky Mo,Qin Liu,Xiaofei Wen,Wenxuan Zhou,Zhe Zhao,Muhao Chen

Main category: cs.CL

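TL;DR: This paper proposes DebugLM, a framework that builds data provenance into LLMs so that they can trace their behaviors back to specific training data sources, and that supports targeted test-time remediation without retraining.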

Details Motivation: Current LLM training pipelines offer no visibility into which data is responsible for a behavior, which makes debugging hard and lets failures recur, especially under distribution shift or after model updates. Method: Propose DebugLM, which introduces unique provenance tags so that the model explicitly associates its responses with the responsible training data source, and builds on this a test-time mechanism for targeted refusal on specified data sources. Result: Experiments show that DebugLM accurately traces behavior origins across multi-stage training and achieves effective targeted test-time remediation while preserving overall model performance. Conclusion: DebugLM gives LLMs interpretable, actionable data provenance, improving the controllability and robustness of model development and maintenance. Abstract: Large language models (LLMs) are trained through multi-stage pipelines over heterogeneous data sources, yet developers lack a principled way to pinpoint the specific data responsible for an observed behavior. This lack of observability reduces debugging to reactive patching and makes failures prone to recur under distribution shift or subsequent model updates. To address this limitation, we propose DebugLM, a framework that equips LLMs with built-in data provenance, enabling them to explicitly trace the origins of their behaviors to specific training data sources. Specifically, the model learns to associate its responses with unique provenance tags that indicate the responsible dataset, empowering developers to precisely identify where undesirable behaviors are learned. Building on this capability, DebugLM further supports targeted test-time remediation, enabling developers to selectively trigger targeted refusal for specified data sources without retraining or modifying model parameters. Experiments demonstrate that DebugLM provides accurate behavior tracing in multi-stage training pipelines and effective test-time remediation while preserving the general utility of the model.

[45] Pretrained Multilingual Transformers Reveal Quantitative Distance Between Human Languages

Yue Zhao,Jiatao Gu,Paloma Jeretič,Weijie Su

Main category: cs.CL

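TL;DR: This paper proposes Attention Transport Distance (ATD), a cross-linguistic distance measure based on the attention mechanisms of pretrained multilingual models. It uses optimal transport to quantify the geometric divergence between attention matrices, recovers language genealogy and geographic contact relationships, and improves low-resource machine translation.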

Details Motivation: Linguistics lacks a unified, scalable quantitative method for measuring language distance, and traditional qualitative analysis cannot support large-scale, computable language comparison. Method: Use the attention mechanisms that emerge spontaneously in pretrained multilingual models: treat attention matrices as probability distributions, measure their geometric divergence with optimal transport, and define the result as Attention Transport Distance (ATD), which quantifies the representational distance between languages during translation. Result: Across a large set of languages, ATD recovers known genealogical groupings and reveals patterns of geographic proximity and language contact; used as a regularizer, it improves transfer performance in low-resource machine translation. Conclusion: ATD provides a principled foundation for testing linguistic hypotheses with neural networks, making multilingual models useful tools for quantitative linguistic discovery and more equitable multilingual AI. Abstract: Understanding the distance between human languages is central to linguistics, anthropology, and tracing human evolutionary history. Yet, while linguistics has long provided rich qualitative accounts of cross-linguistic variation, a unified and scalable quantitative approach to measuring language distance remains lacking. In this paper, we introduce a method that leverages pretrained multilingual language models as systematic instruments for linguistic measurement. Specifically, we show that the spontaneously emerged attention mechanisms of these models provide a robust, tokenization-agnostic measure of cross-linguistic distance, termed Attention Transport Distance (ATD). By treating attention matrices as probability distributions and measuring their geometric divergence via optimal transport, we quantify the representational distance between languages during translation. Applying ATD to a large and diverse set of languages, we demonstrate that the resulting distances recover established linguistic groupings with high fidelity and reveal patterns aligned with geographic and contact-induced relationships. Furthermore, incorporating ATD as a regularizer improves transfer performance in low-resource machine translation. Our results establish a principled foundation for testing linguistic hypotheses using artificial neural networks. This framework transforms multilingual models into powerful tools for quantitative linguistic discovery, facilitating more equitable multilingual AI.
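
To make "treating attention as a probability distribution and comparing it via optimal transport" concrete, here is a minimal one-dimensional sketch. It compares two single attention rows over token positions with an |i - j| ground cost. Both choices are simplifying assumptions for illustration, not the ATD definition, which operates on attention matrices and may use a different cost.

```python
def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def wasserstein_1d(p, q):
    """Wasserstein-1 distance between two distributions on positions 0..n-1.

    With unit spacing and cost |i - j|, the optimal transport cost equals
    the sum of absolute differences between the two cumulative distributions.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    p, q = normalize(p), normalize(q)
    distance, cdf_gap = 0.0, 0.0
    for p_i, q_i in zip(p, q):
        cdf_gap += p_i - q_i
        distance += abs(cdf_gap)
    return distance


# Hypothetical attention rows: all mass on the first vs. the last position.
print(wasserstein_1d([1, 0, 0], [0, 0, 1]))          # mass 1 moved 2 positions
print(wasserstein_1d([0.5, 0.5, 0], [0, 0.5, 0.5]))  # every unit of mass shifts by 1
```

Unlike a pointwise divergence, the transport cost grows with how far attention mass has to move, which is why it is a natural fit for comparing alignment patterns.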

[46] IndicSafe: A Benchmark for Evaluating Multilingual LLM Safety in South Asia

Priyaranjan Pattnayak,Sanchari Chowdhuri

Main category: cs.CL

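TL;DR: This paper presents the first systematic evaluation of the safety behavior of 10 leading LLMs across 12 Indic languages. It finds large cross-language differences (agreement of only 12.8% and SAFE-rate variance above 17%), shows that safety alignment does not transfer naturally across languages, and releases IndicSafe, the first culturally grounded safety benchmark for Indic languages.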

Details Motivation: The safety behavior of LLMs in culturally diverse, low-resource settings is poorly understood; Indic languages in particular, spoken by over 1.2 billion people but severely underrepresented in training data, lack systematic evaluation. Method: Build a dataset of 6,000 prompts on culturally sensitive topics (caste, religion, gender, health, politics), translate and test them in 12 Indic languages, and evaluate the safety responses of 10 leading LLMs; quantify failure modes with prompt-level entropy, category bias scores, and multilingual consistency indices. Result: Cross-language agreement on safety judgments is very low (12.8%) and the SAFE rate varies by more than 17%; some models over-refuse benign prompts in low-resource scripts or overflag politically sensitive content, while others fail to flag unsafe generations. Conclusion: Multilingual LLMs show serious safety generalization gaps and safety alignment does not transfer across languages; language-aware alignment strategies grounded in regional harms are needed, along with culturally grounded benchmarks such as IndicSafe. Abstract: As large language models (LLMs) are deployed in multilingual settings, their safety behavior in culturally diverse, low-resource languages remains poorly understood. We present the first systematic evaluation of LLM safety across 12 Indic languages, spoken by over 1.2 billion people but underrepresented in LLM training data. Using a dataset of 6,000 culturally grounded prompts spanning caste, religion, gender, health, and politics, we assess 10 leading LLMs on translated variants of the prompts. Our analysis reveals significant safety drift: cross-language agreement is just 12.8%, and SAFE rate variance exceeds 17% across languages. Some models over-refuse benign prompts in low-resource scripts or overflag politically sensitive topics, while others fail to flag unsafe generations. We quantify these failures using prompt-level entropy, category bias scores, and multilingual consistency indices. Our findings highlight critical safety generalization gaps in multilingual LLMs and show that safety alignment does not transfer evenly across languages. We release IndicSafe, the first benchmark to enable culturally informed safety evaluation for Indic deployments, and advocate for language-aware alignment strategies grounded in regional harms.
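
Metrics such as cross-language agreement and prompt-level entropy can be sketched in a few lines. The definitions below are plausible readings rather than the benchmark's published formulas: agreement is taken as the share of prompts labeled identically in every language, and entropy is the Shannon entropy of one prompt's labels across languages. The label names other than `SAFE` and the example data are hypothetical.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy (bits) of the labels one prompt gets across languages."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def cross_language_agreement(labels_by_prompt):
    """Fraction of prompts whose label is identical in every language."""
    unanimous = sum(1 for labels in labels_by_prompt.values()
                    if len(set(labels)) == 1)
    return unanimous / len(labels_by_prompt)


# Hypothetical labels for two prompts, each judged in four languages.
labels_by_prompt = {
    "prompt_1": ["SAFE", "SAFE", "SAFE", "SAFE"],
    "prompt_2": ["SAFE", "UNSAFE", "SAFE", "UNSAFE"],
}
print(cross_language_agreement(labels_by_prompt))
print({p: label_entropy(l) for p, l in labels_by_prompt.items()})
```

Zero entropy means a prompt is treated the same way in every language; higher values flag prompts whose safety verdict depends on the language it was asked in.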

[47] Efficient Training-Free Multi-Token Prediction via Embedding-Space Probing

Raghavv Goel,Mukul Gagrani,Mingu Lee,Chris Lott

Main category: cs.CL

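TL;DR: This paper proposes a training-free multi-token prediction (MTP) method based on mask-token probing. It exploits the latent multi-step prediction ability of LLMs without modifying weights or adding auxiliary models, building and pruning a candidate token tree for parallel verification and lossless generation, which substantially improves throughput and acceptance length.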

Details Motivation: LLMs are trained only for next-token prediction yet show latent multi-token prediction ability; existing training-free methods perform poorly, so a more efficient, lossless MTP approach is needed. Method: Propose a training-free MTP method that injects mask tokens drawn from the embedding space on the fly at inference time, samples top-K candidates to build a speculative token tree, applies lightweight pruning to keep high-probability continuations, and then verifies candidate sequences in parallel. Result: Acceptance length increases by approximately 12% on LLaMA3 and 8-12% on Qwen3, with throughput gains of up to 15-19%; theory and experiments indicate that decoder layers naturally align mask-token representations with next-token states. Conclusion: LLMs already contain multi-step prediction ability that can be elicited directly; the probing-based MTP method is simple, effective, lossless, and general, and offers a new approach to efficient autoregressive generation. Abstract: Large language models (LLMs) exhibit latent multi-token prediction (MTP) capabilities despite being trained solely for next-token generation. We propose a simple, training-free MTP approach that probes an LLM using on-the-fly mask tokens drawn from its embedding space, enabling parallel prediction of future tokens without modifying model weights or relying on auxiliary draft models. Our method constructs a speculative token tree by sampling top-K candidates from mask-token logits and applies a lightweight pruning strategy to retain high-probability continuations. During decoding, candidate predictions are verified in parallel, resulting in lossless generation while substantially reducing the number of model calls and improving token throughput. Across benchmarks, our probing-based MTP consistently outperforms existing training-free baselines, increasing acceptance length by approximately 12% on LLaMA3 and 8-12% on Qwen3, and achieving throughput gains of up to 15-19%. Finally, we provide theoretical insights and empirical evidence showing that decoder layers naturally align mask-token representations with next-token states, enabling accurate multi-step prediction without retraining or auxiliary models.

[48] ShapleyLaw: A Game-Theoretic Approach to Multilingual Scaling Laws

Xuyang Cao,Qianying Liu,Chuan Xiao,Yusuke Oda,Pontus Stenetorp,Daisuke Kawahara,Makoto Onizuka,Sadao Kurohashi,Shuyuan Zheng

Main category: cs.CL

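TL;DR: This paper proposes ShapleyLaw, a multilingual pretraining scaling law based on cooperative game theory. It uses Shapley values to quantify cross-lingual transfer, which improves both model performance prediction and the optimization of language mixture ratios.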

Details Motivation: Existing multilingual scaling laws do not account for cross-lingual transfer, which leads to suboptimal language mixture ratios. Method: Model multilingual pretraining as a cooperative game in which each language is a player, use Shapley values to quantify each language's contribution to the reduction in test loss, and propose ShapleyLaw. Result: ShapleyLaw outperforms baseline methods in both model performance prediction and language mixture optimization. Conclusion: Taking a cooperative game-theoretic view and quantifying cross-lingual transfer makes multilingual scaling laws more effective and practical. Abstract: In multilingual pretraining, the test loss of a pretrained model is heavily influenced by the proportion of each language in the pretraining data, namely the language mixture ratios. Multilingual scaling laws can predict the test loss under different language mixture ratios and can therefore be used to estimate the optimal ratios. However, the current approaches to multilingual scaling laws do not measure the cross-lingual transfer effect, resulting in suboptimal mixture ratios. In this paper, we consider multilingual pretraining as a cooperative game in which each language acts as a player that jointly contributes to pretraining, gaining the resulting reduction in test loss as the payoff. Consequently, from the perspective of cooperative game theory, we quantify the cross-lingual transfer from each language by its contribution in the game, and propose a game-theoretic multilingual scaling law called ShapleyLaw. Our experiments show that ShapleyLaw outperforms baseline methods in model performance prediction and language mixture optimization.
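
The Shapley value that underlies this formulation can be computed exactly for a small game by averaging each player's marginal contribution over all orderings. The sketch below uses a made-up payoff function (`toy_loss_reduction`, with a hypothetical synergy between two languages) purely to show the mechanics; it does not reproduce ShapleyLaw's scaling-law fit.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings.

    `value` maps a frozenset of players to the coalition's payoff.
    Cost grows factorially, so this is only practical for a handful of players.
    """
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for ordering in orderings:
        coalition = frozenset()
        for player in ordering:
            with_player = coalition | {player}
            totals[player] += value(with_player) - value(coalition)
            coalition = with_player
    return {p: t / len(orderings) for p, t in totals.items()}


def toy_loss_reduction(coalition):
    """Hypothetical payoff: test-loss reduction from training on these languages."""
    standalone = {"en": 1.0, "de": 0.6, "sw": 0.2}
    gain = sum(standalone[lang] for lang in coalition)
    if {"en", "de"} <= coalition:
        gain += 0.4  # assumed cross-lingual transfer between en and de
    return gain


phi = shapley_values(["en", "de", "sw"], toy_loss_reduction)
print(phi)
```

In this toy game the 0.4 synergy is split equally between `en` and `de`, and the values sum to the payoff of the full coalition, which is the efficiency property that makes Shapley values a natural way to attribute a shared loss reduction to individual languages.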

[49] Gender Disambiguation in Machine Translation: Diagnostic Evaluation in Decoder-Only Architectures

Chiara Manna,Hosein Mohebbi,Afra Alishahi,Frédéric Blain,Eva Vanmassenhove

Main category: cs.CL

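TL;DR: This paper introduces a new gender bias measure, "Prior Bias", for evaluating gender bias in decoder-only machine translation models, and finds that post-training (such as instruction tuning) reduces this bias.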

Details Motivation: Existing benchmarks do not fully reflect the complexity of gender bias in modern machine translation, especially where implicit gender signals in the source must be made explicit as gender-marked forms in the target. Method: Introduce the "Prior Bias" measure to extend existing bias evaluation frameworks and apply it to decoder-only MT models; compare them with encoder-decoder architectures on gender-related metrics and examine the effect of post-training (such as instruction tuning). Result: Decoder-only models do not generally outperform encoder-decoder models on gender-specific metrics, but post-training (such as instruction tuning) improves contextual awareness and also substantially reduces the masculine Prior Bias. Conclusion: Prior Bias is a useful new dimension for measuring gender bias in MT models, and post-training helps mitigate it, which suggests that training strategy matters as much as model architecture. Abstract: While Large Language Models achieve state-of-the-art results across a wide range of NLP tasks, they remain prone to systematic biases. Among these, gender bias is particularly salient in MT, due to systematic differences across languages in whether and how gender is marked. As a result, translation often requires disambiguating implicit source signals into explicit gender-marked forms. In this context, standard benchmarks may capture broad disparities but fail to reflect the full complexity of gender bias in modern MT. In this paper, we extend recent frameworks on bias evaluation by: (i) introducing a novel measure coined "Prior Bias", capturing a model's default gender assumptions, and (ii) applying the framework to decoder-only MT models. Our results show that, despite their scale and state-of-the-art status, decoder-only models do not generally outperform encoder-decoder architectures on gender-specific metrics; however, post-training (e.g., instruction tuning) not only improves contextual awareness but also reduces the masculine Prior Bias.

[50] ConGA: Guidelines for Contextual Gender Annotation. A Framework for Annotating Gender in Machine Translation

Argentina Anna Rescigno,Eva Vanmassenhove,Johanna Monti

Main category: cs.CL

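TL;DR: This paper proposes ConGA, a framework for fine-grained gender annotation in machine translation. It targets gender bias when translating from languages without grammatical gender (such as English) into languages with it (such as Italian), and builds an evaluation benchmark on the gENder-IT dataset.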

Details Motivation: Languages such as English lack grammatical gender while languages such as Italian require grammatical agreement, so machine translation systems often default to masculine forms, which reinforces bias and reduces accuracy. Method: Propose the Contextual Gender Annotation (ConGA) framework, which defines word-level annotation guidelines for semantic gender in English (M/F/A) and grammatical gender in Italian (M/F) and adds entity-level identifiers for cross-sentence tracking; apply it to the gENder-IT dataset to build a gold-standard evaluation resource. Result: Experiments reveal systematic masculine overuse and inconsistent feminine realisation in current MT systems. Conclusion: ConGA provides a reproducible methodology and a benchmark for building more gender-aware, multilingual NLP systems. Abstract: Handling gender across languages remains a persistent challenge for Machine Translation (MT) and Large Language Models (LLMs), especially when translating from gender-neutral languages into morphologically gendered ones, such as English to Italian. English largely omits grammatical gender, while Italian requires explicit agreement across multiple grammatical categories. This asymmetry often leads MT systems to default to masculine forms, reinforcing bias and reducing translation accuracy. To address this issue, we present the Contextual Gender Annotation (ConGA) framework, a linguistically grounded set of guidelines for word-level gender annotation. The scheme distinguishes between semantic gender in English through three tags, Masculine (M), Feminine (F), and Ambiguous (A), and grammatical gender realisation in Italian (Masculine (M), Feminine (F)), combined with entity-level identifiers for cross-sentence tracking. We apply ConGA to the gENder-IT dataset, creating a gold-standard resource for evaluating gender bias in translation. Our results reveal systematic masculine overuse and inconsistent feminine realisation, highlighting persistent limitations of current MT systems. By combining fine-grained linguistic annotation with quantitative evaluation, this work offers both a methodology and a benchmark for building more gender-aware and multilingual NLP systems.

cs.CV [Back]

[51] Multi-Modal Multi-Agent Reinforcement Learning for Radiology Report Generation: Radiologist-Like Workflow with Clinically Verifiable Rewards

Kaito Baba,Satoshi Kodera

Main category: cs.CV

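TL;DR: This paper proposes MARL-Rad, a multi-modal multi-agent reinforcement learning framework for radiology report generation. Region-specific agents cooperate with a global integrating agent and are optimized with clinically verifiable rewards, which significantly improves clinical efficacy metrics.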

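Details Motivation: Existing methods mostly use single-model reinforcement learning or post-hoc agentization of independently trained models; they lack joint multi-agent training and end-to-end system-level optimization, which makes it hard to achieve both clinical accuracy and structural consistency in reports. Method: Propose MARL-Rad, with multiple region-specific agents and one global integrating agent trained jointly through multi-agent reinforcement learning, optimized end to end with clinically verifiable rewards (for example RadGraph, CheXbert, and GREEN-based rewards). Result: On the MIMIC-CXR and IU X-ray datasets, MARL-Rad reaches state-of-the-art clinical efficacy (CE) on RadGraph, CheXbert, and GREEN, and also improves laterality consistency and the accuracy of report details. Conclusion: Through multi-agent cooperation and clinically oriented reinforcement learning, MARL-Rad improves the clinical usefulness and reliability of radiology report generation and offers a new approach to medical AI report generation. Abstract: We propose MARL-Rad, a novel multi-modal multi-agent reinforcement learning framework for radiology report generation that coordinates region-specific agents and a global integrating agent, optimized via clinically verifiable rewards. Unlike prior single-model reinforcement learning or post-hoc agentization of independently trained models, our method jointly trains multiple agents and optimizes the entire agent system through reinforcement learning. Experiments on the MIMIC-CXR and IU X-ray datasets show that MARL-Rad consistently improves clinical efficacy (CE) metrics such as RadGraph, CheXbert, and GREEN scores, achieving state-of-the-art CE performance. Further analyses confirm that MARL-Rad enhances laterality consistency and produces more accurate, detail-informed reports.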

[52] Tokenization vs. Augmentation: A Systematic Study of Writer Variance in IMU-Based Online Handwriting Recognition

Jindong Li,Dario Zanca,Vincent Christlein,Tim Hamann,Jens Barth,Peter Kämpf,Björn Eskofier

Main category: cs.CV

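TL;DR: This paper studies how sub-word tokenization and concatenation-based data augmentation address inter-writer and intra-writer variance in inertial measurement unit (IMU)-based online handwriting recognition. Bigram tokenization clearly lowers the word error rate (WER) in the writer-independent setting, while concatenation-based augmentation substantially lowers character and word error rates in the writer-dependent setting; the effect of each strategy depends on the type of variance.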

Details Motivation: IMU-based online handwriting recognition faces two challenges, uneven character distributions and inter-writer variability, and needs targeted strategies to mitigate stylistic differences between writers and sample sparsity within a writer. Method: Systematically compares sub-word tokenization (Bigram tokenization in particular) with concatenation-based data augmentation on the writer-independent and writer-dependent splits of the OnHW-Words500 dataset, and analyzes the effects of token granularity and extended training. Result: Bigram tokenization reduces WER on the writer-independent split from 15.40% to 12.99%; concatenation-based augmentation reduces CER by 34.5% and WER by 25.4% on the writer-dependent split; short tokens work better; and the augmentation gain exceeds that of proportionally extended training. Conclusion: Sub-word tokenization mainly mitigates inter-writer stylistic variability, whereas concatenation-based augmentation mainly compensates for intra-writer distributional sparsity; the effects are variance-dependent, so the strategy should be chosen to match the task setting. Abstract: Inertial measurement unit-based online handwriting recognition enables the recognition of input signals collected across different writing surfaces but remains challenged by uneven character distributions and inter-writer variability. In this work, we systematically investigate two strategies to address these issues: sub-word tokenization and concatenation-based data augmentation. Our experiments on the OnHW-Words500 dataset reveal a clear dichotomy between handling inter-writer and intra-writer variance. On the writer-independent split, structural abstraction via Bigram tokenization significantly improves performance on unseen writing styles, reducing the word error rate (WER) from 15.40% to 12.99%. In contrast, on the writer-dependent split, tokenization degrades performance due to vocabulary distribution shifts between the training and validation sets. Instead, our proposed concatenation-based data augmentation acts as a powerful regularizer, reducing the character error rate by 34.5% and the WER by 25.4%. Further analysis shows that short, low-level tokens benefit model performance and that the performance gain from concatenation-based data augmentation surpasses that achieved by proportionally extended training. These findings reveal a clear variance-dependent effect: sub-word tokenization primarily mitigates inter-writer stylistic variability, whereas concatenation-based data augmentation effectively compensates for intra-writer distributional sparsity.
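
The two ideas in this abstract, splitting labels into bigram tokens and scoring with edit-distance-based error rates, can be sketched as follows. This is a minimal illustration and not the authors' code: the non-overlapping bigram split, the helper names, and the way the error rate is normalized are all assumptions made for the example.

```python
def bigram_tokenize(word):
    """Split a word into non-overlapping character bigrams.

    Assumption: the paper's exact bigram vocabulary is not given in the
    abstract; here the last token is a single character for odd lengths.
    """
    return [word[i:i + 2] for i in range(0, len(word), 2)]


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def error_rate(refs, hyps):
    """Total edit distance divided by total reference length.

    Pass lists of strings for a character error rate, or lists of word
    lists for a word error rate.
    """
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total


def relative_reduction(before, after):
    """Relative improvement when an error rate drops from `before` to `after`."""
    return (before - after) / before


# Relative WER reduction implied by the reported 15.40% -> 12.99% figures.
wer_gain = relative_reduction(15.40, 12.99)
```

Note that the 34.5% and 25.4% figures in the abstract are stated as reductions of the error rates, while the 15.40% to 12.99% change is given in absolute terms; `relative_reduction` puts the latter on the same footing.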

[53] Leveraging Large Vision Model for Multi-UAV Co-perception in Low-Altitude Wireless Networks

Yunting Xu,Jiacheng Wang,Ruichen Zhang,Changyuan Zhao,Yinqiu Liu,Dusit Niyato,Liang Yu,Haibo Zhou,Dong In Kim

Main category: cs.CV

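TL;DR: This paper proposes a base-station-helped multi-UAV cooperative perception framework (BHU) that combines Top-K pixel selection, sparsified transmission, BEV feature extraction and fusion with a Swin-large-based MaskDINO encoder, and a diffusion model-based deep reinforcement learning algorithm for joint optimization, reducing communication overhead by 85% while improving perception performance by over 5%.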

Details Motivation: Multi-UAV cooperative perception generates massive visual data, which creates challenges for communication latency and resource efficiency and calls for a framework that balances communication efficiency with perception performance. Method: Proposes the BHU framework: 1) a Top-K mechanism selects the most informative pixels from RGB images to sparsify the visual data; 2) the sparsified images are uploaded to a ground server via MU-MIMO; 3) a Swin-large-based MaskDINO encoder extracts and fuses BEV features; 4) a diffusion model-based DRL algorithm jointly optimizes UAV selection, sparsification ratios, and precoding matrices. Result: On the Air-Co-Pred dataset, perception performance improves by over 5% and communication overhead falls by 85% compared with CNN baselines. Conclusion: The BHU framework addresses the joint communication-perception optimization problem for multi-UAV cooperative perception in resource-constrained wireless environments and offers a scalable, efficient solution for low-altitude economy applications. Abstract: Multi-uncrewed aerial vehicle (UAV) cooperative perception has emerged as a promising paradigm for diverse low-altitude economy applications, where complementary multi-view observations are leveraged to enhance perception performance via wireless communications. However, the massive visual data generated by multiple UAVs poses significant challenges in terms of communication latency and resource efficiency. To address these challenges, this paper proposes a communication-efficient cooperative perception framework, termed Base-Station-Helped UAV (BHU), which reduces communication overhead while enhancing perception performance. Specifically, we employ a Top-K selection mechanism to identify the most informative pixels from UAV-captured RGB images, enabling sparsified visual transmission with reduced data volume and latency. The sparsified images are transmitted to a ground server via multi-user MIMO (MU-MIMO), where a Swin-large-based MaskDINO encoder extracts bird's-eye-view (BEV) features and performs cooperative feature fusion for ground vehicle perception. Furthermore, we develop a diffusion model-based deep reinforcement learning (DRL) algorithm to jointly select cooperative UAVs, sparsification ratios, and precoding matrices, achieving a balance between communication efficiency and perception utility.
Simulation results on the Air-Co-Pred dataset demonstrate that, compared with traditional CNN-based BEV fusion baselines, the proposed BHU framework improves perception performance by over 5% while reducing communication overhead by 85%, providing an effective solution for multi-UAV cooperative perception under resource-constrained wireless environments.
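
The Top-K sparsification step can be illustrated with a small sketch. It assumes that a per-pixel importance score is already available (the abstract does not say how informativeness is scored), and the function name, the zero fill value, and the single-channel image are choices made only for this example.

```python
def top_k_sparsify(image, scores, k):
    """Keep the k pixels with the highest importance score and zero the rest.

    `image` and `scores` are 2D lists of the same shape. How scores are
    computed is an assumption left outside this sketch.
    Returns the sparsified image and the achieved sparsification ratio.
    """
    rows, cols = len(image), len(image[0])
    ranked = sorted(
        ((r, c) for r in range(rows) for c in range(cols)),
        key=lambda rc: scores[rc[0]][rc[1]],
        reverse=True,
    )
    keep = set(ranked[:k])
    sparse = [
        [image[r][c] if (r, c) in keep else 0 for c in range(cols)]
        for r in range(rows)
    ]
    ratio = len(keep) / (rows * cols)  # fraction of pixels transmitted
    return sparse, ratio


# Toy 2x2 example: only the two highest-scoring pixels survive.
sparse_img, kept_ratio = top_k_sparsify(
    image=[[1, 2], [3, 4]],
    scores=[[0.1, 0.9], [0.8, 0.2]],
    k=2,
)
```

In the described system the sparsification ratio is one of the quantities chosen by the DRL agent, so `k` here stands in for that decision.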

[54] Facial beauty prediction fusing transfer learning and broad learning system

Junying Gan,Xiaoshan Xie,Yikui Zhai,Guohui He,Chaoyun Mai,Heng Luo

Main category: cs.CV

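TL;DR: This paper proposes a facial beauty prediction (FBP) method that fuses transfer learning with the broad learning system (BLS), yielding two models, E-BLS and ER-BLS, that improve prediction accuracy.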

Details Motivation: Facial beauty prediction suffers from overfitting caused by scarce data, from variable facial appearance, and from the complexity of human perception, so it needs an approach that is both data-efficient and fast to build. Method: Combines transfer learning (an EfficientNet-based feature extractor) with the broad learning system (BLS): E-BLS is built first, and a connection layer is then introduced to form ER-BLS. Result: E-BLS and ER-BLS achieve higher accuracy on FBP than previous BLS and CNN methods, supporting the effectiveness of the proposed method. Conclusion: A framework that fuses transfer learning with BLS can improve facial beauty prediction and may extend to pattern recognition, object detection, and image classification. Abstract: Facial beauty prediction (FBP) is an important and challenging problem in computer vision and machine learning. It is prone to overfitting because large-scale, effective data are lacking, and robust and effective facial beauty evaluation models are difficult to build quickly because of the variability of facial appearance and the complexity of human perception. Transfer learning can reduce the dependence on large amounts of data and help avoid overfitting. The broad learning system (BLS) can build and train models quickly. For these reasons, transfer learning is fused with BLS for FBP in this paper. First, a feature extractor for facial features is constructed from CNN models based on transfer learning, for which EfficientNets are used in this paper, and the fused facial beauty features it extracts are passed to BLS for FBP; this model is called E-BLS. Second, on the basis of E-BLS, a connection layer is designed to connect the feature extractor and BLS; this model is called ER-BLS. Finally, experimental results show that, compared with previous BLS and CNN methods, E-BLS and ER-BLS improve the accuracy of FBP, demonstrating the effectiveness and superiority of the presented method, which can also be widely applied to pattern recognition, object detection, and image classification.

[55] Script-to-Slide Grounding: Grounding Script Sentences to Slide Objects for Automatic Instructional Video Generation

Rena Suzuki,Masato Kikuchi,Tadachika Ozono

Main category: cs.CV

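TL;DR: This paper proposes and formalizes the Script-to-Slide Grounding (S2SG) task, which automatically links script sentences to their corresponding slide objects, and presents Text-S2SG, a large language model-based method that reaches an F1-score of 0.924 on text objects.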

Details Motivation: In slide-based video editing, aligning visual effects so that spoken content is grounded to slide objects is labor-intensive and inefficient, and it needs automated support. Method: Defines the Script-to-Slide Grounding (S2SG) task and designs Text-S2SG, which uses a large language model (LLM) to align script sentences with text objects on slides. Result: Text-S2SG achieves an F1-score of 0.924 in the experiments, indicating high accuracy on text objects. Conclusion: This work formalizes a previously implicit slide-based video editing process as a computable task, S2SG, laying groundwork for automatically generating educational and research videos. Abstract: While slide-based videos augmented with visual effects are widely utilized in education and research presentations, the video editing process, particularly applying visual effects to ground spoken content to slide objects, remains highly labor-intensive. This study aims to develop a system that automatically generates such instructional videos from slides and corresponding scripts. As a foundational step, this paper proposes and formulates Script-to-Slide Grounding (S2SG), defined as the task of grounding script sentences to their corresponding slide objects. Furthermore, as an initial step, we propose "Text-S2SG," a method that utilizes a large language model (LLM) to perform this grounding task for text objects. Our experiments demonstrate that the proposed method achieves high performance (F1-score: 0.924). The contribution of this work is the formalization of a previously implicit slide-based video editing process into a computable task, thereby paving the way for its automation.

[56] Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs

Nimrod Shabtay,Moshe Kimhi,Artem Spector,Sivan Haray,Ehud Rivlin,Chaim Baskin,Raja Giryes,Eli Schwartz

Main category: cs.CV

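TL;DR: AwaRes is a spatial-on-demand vision-language model framework that combines a low-resolution global view with high-resolution local crops retrieved on demand, improving computational efficiency while maintaining accuracy.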

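Details Motivation: Addresses the accuracy-efficiency trade-off in vision-language models, where high-resolution inputs are computationally expensive and low-resolution inputs tend to lose critical details such as small text. Method: Proposes the AwaRes framework, which processes a low-resolution global image and uses tool calling to fetch high-resolution local regions dynamically; supervised data are built automatically (a judge decides whether cropping is needed, and a grounding model localizes the evidence, which is mapped to discrete crop trajectories); training uses cold-start supervised fine-tuning (SFT) followed by multi-turn GRPO reinforcement learning, with a reward that combines semantic correctness and a crop-cost penalty. Result: On multiple visual question answering benchmarks, AwaRes clearly outperforms baselines at the same compute budget, keeping accuracy high while substantially reducing computational cost. Conclusion: AwaRes supports the effectiveness of the "high resolution on demand" paradigm and offers a new direction for efficient, high-fidelity visual understanding. Abstract: Vision-language models (VLMs) typically process images at native high resolution, forcing a trade-off between accuracy and computational efficiency: high-resolution inputs capture fine details but incur significant computational costs, while low-resolution inputs favor efficiency but may miss critical visual information, such as small text. We present AwaRes, a spatial-on-demand framework that resolves this accuracy-efficiency trade-off by operating on a low-resolution global view and using tool-calling to retrieve only high-resolution segments needed for a given query. We construct supervised data automatically: a judge compares low- vs. high-resolution answers to label whether cropping is needed, and an oracle grounding model localizes the evidence for the correct answer, which we map to a discrete crop set to form multi-turn tool-use trajectories. We train our framework with cold-start SFT followed by multi-turn GRPO with a composite reward that combines semantic answer correctness with explicit crop-cost penalties. Project page: https://nimrodshabtay.github.io/AwaRes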
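
To make the data-labeling rule and the composite reward concrete, here is a hedged sketch. The abstract gives neither the exact labeling criterion nor the reward weights, so the rule "crop only when the low-resolution answer is wrong and the high-resolution answer is right", the linear penalty, and the `crop_cost` value are all assumptions for illustration.

```python
def needs_crop(low_res_correct, high_res_correct):
    """Assumed labeling rule: cropping is worthwhile only when the
    low-resolution answer fails but the high-resolution answer succeeds."""
    return (not low_res_correct) and high_res_correct


def composite_reward(answer_correct, num_crops, crop_cost=0.1):
    """Assumed reward shape: +1 for a semantically correct answer,
    minus a fixed penalty per high-resolution crop requested."""
    correctness = 1.0 if answer_correct else 0.0
    return correctness - crop_cost * num_crops


# A correct answer that used two crops scores lower than one that used none.
reward_two_crops = composite_reward(True, num_crops=2)
reward_no_crops = composite_reward(True, num_crops=0)
```

A penalty of this kind gives the policy a reason to answer from the low-resolution view whenever that is sufficient, which is the behavior the abstract describes.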

[57] AgriChat: A Multimodal Large Language Model for Agriculture Image Understanding

Abderrahmene Boudiaf,Irfan Hussain,Sajid Javed

Main category: cs.CV

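TL;DR: This paper proposes the V2VK pipeline to generate the AgriMM benchmark dataset and builds on it a specialized multimodal large language model, AgriChat, which improves the performance and trustworthiness of agricultural AI on tasks such as fine-grained recognition and disease diagnosis.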

Details Motivation: Existing agricultural multimodal large language models are limited by the lack of large-scale, high-quality agricultural datasets and by the absence of verified domain expertise, which leads to weak generalization and biological hallucinations. Method: Proposes Vision-to-Verified-Knowledge (V2VK), a generative AI-driven annotation framework that combines visual captioning with web-augmented scientific retrieval to automatically generate the AgriMM benchmark, grounded in phytopathological literature (over 3,000 agricultural classes and more than 607k visual question-answer pairs); the specialized MLLM AgriChat is trained on this data. Result: AgriChat outperforms existing open-source MLLMs on tasks including plant species identification, disease symptom recognition, crop counting, and ripeness assessment; the experiments indicate that preserving visual detail together with web-verified knowledge is an effective path to robust, trustworthy agricultural AI. Conclusion: Generating verifiable data with V2VK and training a domain-specific MLLM can ease the twin bottlenecks of data scarcity and missing domain knowledge in agricultural AI. Abstract: The deployment of Multimodal Large Language Models (MLLMs) in agriculture is currently stalled by a critical trade-off: the existing literature lacks the large-scale agricultural datasets required for robust model development and evaluation, while current state-of-the-art models lack the verified domain expertise necessary to reason across diverse taxonomies. To address these challenges, we propose the Vision-to-Verified-Knowledge (V2VK) pipeline, a novel generative AI-driven annotation framework that integrates visual captioning with web-augmented scientific retrieval to autonomously generate the AgriMM benchmark, effectively eliminating biological hallucinations by grounding training data in verified phytopathological literature. The AgriMM benchmark contains over 3,000 agricultural classes and more than 607k VQAs spanning multiple tasks, including fine-grained plant species identification, plant disease symptom recognition, crop counting, and ripeness assessment. Leveraging this verifiable data, we present AgriChat, a specialized MLLM that exhibits broad knowledge across thousands of agricultural classes and provides detailed agricultural assessments with extensive explanations. Extensive evaluation across diverse tasks, datasets, and evaluation conditions reveals both the capabilities and limitations of current agricultural MLLMs, while demonstrating AgriChat's superior performance over other open-source models on both internal and external benchmarks.
The results validate that preserving visual detail combined with web-verified knowledge constitutes a reliable pathway toward robust and trustworthy agricultural AI. The code and dataset are publicly available at https://github.com/boudiafA/AgriChat .

[58] GenLie: A Global-Enhanced Lie Detection Network under Sparsity and Semantic Interference

Zongshun Zhang,Yao Liu,Qiao Liu,Xuefeng Peng,Peiyuan Jiang,Jiaye Yang,Daibing Yao,Wei Lin

Main category: cs.CV

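TL;DR: This paper proposes GenLie, a model that combines local feature modeling with global supervision to learn sparse yet discriminative representations for video-based lie detection, suppressing identity-related noise and outperforming existing methods on several public datasets.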

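Details Motivation: The core challenge of video-based lie detection is learning sparse yet discriminative representations: deceptive signals are weak and short-lived, are easily masked by redundant information, and individual and contextual differences introduce strong identity-related noise. Method: Proposes GenLie, a global-enhanced lie detection network that uses local modeling to capture sparse, subtle deceptive cues and relies on global supervision and optimization to suppress identity noise and make the representations more robust and discriminative. Result: On three public datasets covering high- and low-stakes scenarios, GenLie consistently outperforms state-of-the-art methods. Conclusion: By modeling locally under global supervision, GenLie addresses the key difficulties of extracting sparse signals and suppressing identity noise in video-based lie detection, and it improves detection performance. Abstract: Video-based lie detection aims to identify deceptive behaviors from visual cues. Despite recent progress, its core challenge lies in learning sparse yet discriminative representations. Deceptive signals are typically subtle and short-lived, easily overwhelmed by redundant information, while individual and contextual variations introduce strong identity-related noise. To address this issue, we propose GenLie, a Global-Enhanced Lie Detection Network that performs local feature modeling under global supervision. Specifically, sparse and subtle deceptive cues are captured at the local level, while global supervision and optimization ensure robust and discriminative representations by suppressing identity-related noise. Experiments on three public datasets, covering both high- and low-stakes scenarios, show that GenLie consistently outperforms state-of-the-art methods. Source code is available at https://github.com/AliasDictusZ1/GenLie.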

[59] TDMM-LM: Bridging Facial Understanding and Animation via Language Models

Luchuan Song,Pinxin Liu,Haiyang Liu,Zhenchao Jin,Yolo Yunlong Tang,Zichong Xu,Susan Liang,Jing Bi,Jason J Corso,Chenliang Xu

Main category: cs.CV

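TL;DR: This paper uses foundation generative models to synthesize a large-scale facial behavior dataset and casts facial-parameter modeling as a language problem, enabling bidirectional understanding and generation between text and facial motion.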

Details Motivation: Facial animation lags behind human body animation, mainly because high-quality, text-paired facial behavior corpora are scarce. Method: Designs a prompt suite covering emotions and head motions, synthesizes about 80 hours of facial video with multiple generative models, and fits per-frame 3D facial parameters to build a large-scale dataset of (text prompt, 3D parameter) pairs; on this basis, two complementary tasks, Motion2Language and Language2Motion, probe how well large language models understand and generate facial motion. Result: Experiments show that language models can interpret and synthesize facial motion in this setting with strong generalization; according to the authors, this is the first work to cast facial-parameter modeling as a language problem. Conclusion: The work establishes a unified path for text-driven facial animation and motion understanding and helps fill gaps in data and methods in this area. Abstract: Text-guided human body animation has advanced rapidly, yet facial animation lags due to the scarcity of well-annotated, text-paired facial corpora. To close this gap, we leverage foundation generative models to synthesize a large, balanced corpus of facial behavior. We design a prompt suite covering emotions and head motions, generate about 80 hours of facial videos with multiple generators, and fit per-frame 3D facial parameters, yielding large-scale (prompt and parameter) pairs for training. Building on this dataset, we probe language models for bidirectional competence over facial motion via two complementary tasks: (1) Motion2Language: given a sequence of 3D facial parameters, the model produces natural-language descriptions capturing content, style, and dynamics; and (2) Language2Motion: given a prompt, the model synthesizes the corresponding sequence of 3D facial parameters via quantized motion tokens for downstream animation. Extensive experiments show that in this setting language models can both interpret and synthesize facial motion with strong generalization. To the best of our knowledge, this is the first work to cast facial-parameter modeling as a language problem, establishing a unified path for text-conditioned facial animation and motion understanding.
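
The phrase "quantized motion tokens" suggests mapping each frame of continuous facial parameters to a discrete index that a language model can read and emit. The sketch below assumes a nearest-neighbor codebook lookup, as in vector quantization; the abstract does not describe the tokenizer, so the codebook, the distance measure, and the function names are hypothetical.

```python
def nearest_code(vector, codebook):
    """Index of the codebook entry closest to `vector` (squared Euclidean)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


def quantize_sequence(frames, codebook):
    """Turn per-frame parameter vectors into a sequence of discrete tokens."""
    return [nearest_code(frame, codebook) for frame in frames]


def dequantize_sequence(tokens, codebook):
    """Map tokens back to (approximate) parameter vectors for animation."""
    return [list(codebook[t]) for t in tokens]


# Toy codebook with two entries and a two-frame "motion".
toy_codebook = [[0.0, 0.0], [1.0, 1.0]]
toy_tokens = quantize_sequence([[0.1, 0.2], [0.9, 0.8]], toy_codebook)
toy_motion = dequantize_sequence(toy_tokens, toy_codebook)
```

Under this reading, Language2Motion amounts to generating a token sequence and decoding it, while Motion2Language consumes the same tokens as input.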

[60] Solution for 10th Competition on Ambivalence/Hesitancy (AH) Video Recognition Challenge using Divergence-Based Multimodal Fusion

Aislan Gabriel O. Souza,Agostinho Freire,Leandro Honorato Silva,Igor Lucas B. da Silva,João Vinícius R. de Andrade,Gabriel C. de Albuquerque,Lucas Matheus da S. Oliveira,Mário Stela Guerra,Luciana Machado

Main category: cs.CV

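TL;DR: This paper proposes a divergence-based multimodal fusion method for recognizing ambivalence/hesitancy in videos; it models cross-modal conflict explicitly by computing absolute differences between visual, audio, and text embeddings, and it clearly outperforms the baseline on the BAH dataset.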

Details Motivation: Addresses the Ambivalence/Hesitancy video recognition challenge of the ABAW competition by capturing inconsistency between modalities, a key characteristic of A/H. Method: Uses Action Units extracted with Py-Feat as visual features, Wav2Vec 2.0 for audio, and BERT for text; each modality passes through a BiLSTM with attention pooling and is projected into a shared embedding space; the fusion module computes absolute differences between pairs of modality embeddings to quantify conflict. Result: Macro F1 reaches 0.6808 on the BAH validation set, well above the baseline of 0.2827; statistical analysis indicates that the temporal variability of AUs is the dominant visual cue for distinguishing A/H. Conclusion: Cross-modal divergence is an effective signal for modeling A/H, the proposed fusion mechanism captures inconsistency between modalities, and the dynamics of visual AUs are important for A/H recognition. Abstract: We address the Ambivalence/Hesitancy (A/H) Video Recognition Challenge at the 10th ABAW Competition (CVPR 2026). We propose a divergence-based multimodal fusion that explicitly measures cross-modal conflict between visual, audio, and textual channels. Visual features are encoded as Action Units (AUs) extracted via Py-Feat, audio via Wav2Vec 2.0, and text via BERT. Each modality is processed by a BiLSTM with attention pooling and projected into a shared embedding space. The fusion module computes pairwise absolute differences between modality embeddings, directly capturing the incongruence that characterizes A/H. On the BAH dataset, our approach achieves a Macro F1 of 0.6808 on the validation test set, outperforming the challenge baseline of 0.2827. Statistical analysis across 1,132 videos confirms that temporal variability of AUs is the dominant visual discriminator of A/H.
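
The fusion step, pairwise absolute differences between modality embeddings, is simple enough to sketch directly. The version below assumes the three embeddings already live in a shared space of equal dimension; concatenating the difference vectors in alphabetical order of modality name is a choice made for this example and is not taken from the paper.

```python
from itertools import combinations


def divergence_features(embeddings):
    """Concatenate |e_a - e_b| for every pair of modalities.

    `embeddings` maps a modality name to its vector in the shared space.
    Pairs are visited in alphabetical order of modality name so the
    output layout is deterministic (an illustrative convention).
    """
    features = []
    for name_a, name_b in combinations(sorted(embeddings), 2):
        vec_a, vec_b = embeddings[name_a], embeddings[name_b]
        features.extend(abs(x - y) for x, y in zip(vec_a, vec_b))
    return features


# Toy 2-dimensional embeddings for the three modalities.
toy_fused = divergence_features({
    "visual": [1.0, 2.0],
    "audio": [0.0, 2.0],
    "text": [1.0, 0.0],
})
```

Large entries in the fused vector mark dimensions where two channels disagree, which is the incongruence signal the abstract associates with A/H.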

[61] KGS-GCN: Enhancing Sparse Skeleton Sensing via Kinematics-Driven Gaussian Splatting and Probabilistic Topology for Action Recognition

Yuhan Chen,Yicui Shi,Guofa Li,Liping Zhang,Jie Li,Jiaxin Gao,Wenbo Chu

Main category: cs.CV

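TL;DR: This paper proposes the KGS-GCN framework, which uses kinematics-driven Gaussian splatting and probabilistic graph topology modeling to turn sparse skeleton sequences into continuous multi-view heatmaps and to build joint adjacency matrices adaptively, improving recognition of dynamic actions.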

Details Motivation: Skeleton data produced by current sensor devices are sparse and constrained by a fixed physical topology, which makes fine-grained spatiotemporal details and long-range dependencies hard to model. Method: Proposes KGS-GCN: 1) a kinematics-driven Gaussian splatting module builds anisotropic covariance matrices from instantaneous joint velocities and renders continuous multi-view heatmaps rich in spatiotemporal semantics; 2) statistical correlations between joint Gaussian distributions are measured with the Bhattacharyya distance to construct a probabilistic, adaptive adjacency matrix; 3) a visual context gating mechanism modulates the GCN backbone. Result: Experiments indicate that KGS-GCN markedly strengthens the modeling of complex spatiotemporal dynamics and is more robust on low-fidelity sensor data. Conclusion: The method offers a practical path to more reliable perception in real-world sensing applications and overcomes input sparsity and topological rigidity. Abstract: Skeleton-based action recognition is widely utilized in sensor systems including human-computer interaction and intelligent surveillance. Nevertheless, current sensor devices typically generate sparse skeleton data as discrete coordinates, which inevitably discards fine-grained spatiotemporal details during highly dynamic movements. Moreover, the rigid constraints of predefined physical sensor topologies hinder the modeling of latent long-range dependencies. To overcome these limitations, we propose KGS-GCN, a graph convolutional network that integrates kinematics-driven Gaussian splatting with probabilistic topology. Our framework explicitly addresses the challenges of sensor data sparsity and topological rigidity by transforming discrete joints into continuous generative representations. Firstly, a kinematics-driven Gaussian splatting module is designed to dynamically construct anisotropic covariance matrices using instantaneous joint velocity vectors. This module enhances visual representation by rendering sparse skeleton sequences into multi-view continuous heatmaps rich in spatiotemporal semantics. Secondly, to transcend the limitations of fixed physical connections, a probabilistic topology construction method is proposed. This approach generates an adaptive prior adjacency matrix by quantifying statistical correlations via the Bhattacharyya distance between joint Gaussian distributions. Ultimately, the GCN backbone is adaptively modulated by the rendered visual features via a visual context gating mechanism. Empirical results demonstrate that KGS-GCN significantly enhances the modeling of complex spatiotemporal dynamics.
By addressing the inherent limitations of sparse inputs, our framework offers a robust solution for processing low-fidelity sensor data. This approach establishes a practical pathway for improving perceptual reliability in real-world sensing applications.
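
The probabilistic topology step relies on the Bhattacharyya distance between per-joint Gaussians. The sketch below uses the standard closed form for Gaussians, restricted to diagonal covariances so that it needs no matrix library; the paper uses anisotropic (generally non-diagonal) covariances, so this is a simplification. Turning distances into adjacency weights with `exp(-D)` is also an assumption, since the abstract does not specify that mapping.

```python
import math


def bhattacharyya_diag(mean1, var1, mean2, var2):
    """Bhattacharyya distance between two Gaussians with diagonal covariance.

    Per dimension, with v = (v1 + v2) / 2:
        D += (m1 - m2)^2 / (8 v) + 0.5 * ln(v / sqrt(v1 * v2))
    """
    dist = 0.0
    for m1, v1, m2, v2 in zip(mean1, var1, mean2, var2):
        v = 0.5 * (v1 + v2)
        dist += (m1 - m2) ** 2 / (8.0 * v)
        dist += 0.5 * math.log(v / math.sqrt(v1 * v2))
    return dist


def adjacency_from_joints(joints):
    """Build a symmetric affinity matrix from (mean, variance) pairs.

    Assumed mapping: weight = exp(-distance), so identical distributions
    get weight 1 and distant ones approach 0.
    """
    n = len(joints)
    return [
        [math.exp(-bhattacharyya_diag(*joints[i], *joints[j])) for j in range(n)]
        for i in range(n)
    ]


# Two toy joints with unit variance, two units apart along the x axis.
toy_joints = [([0.0, 0.0], [1.0, 1.0]), ([2.0, 0.0], [1.0, 1.0])]
toy_adjacency = adjacency_from_joints(toy_joints)
```

Because the weights depend on the estimated distributions and not on the skeleton's bone list, joints that are not physically connected can still receive a strong edge, which is the point of the adaptive prior described above.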

[62] Omni IIE Bench: Benchmarking the Practical Capabilities of Image Editing Models

Yujia Yang,Yuanxiang Wang,Zhenyu Guan,Tiankun Yang,Chenxi Bao,Haopeng Jin,Jinwen Luo,Xinyu Zuo,Lisheng Duan,Haijin Liang,Jin Ma,Xinming Wang,Ruiwen Tao,Hongzhu Yi

Main category: cs.CV

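TL;DR: This paper introduces Omni IIE Bench, a new benchmark for instruction-based image editing (IIE) that targets consistency across semantic scales, and shows that mainstream models degrade markedly on tasks that span semantic scales.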

Details Motivation: Existing IIE benchmarks emphasize task breadth and overlook how consistently a model performs across tasks of different semantic scales, a shortcoming that matters most in professional applications. Method: Builds the high-quality, human-annotated Omni IIE Bench with a dual-track diagnostic design: single-turn consistency (paired attribute-modification and entity-replacement tasks) and multi-turn coordination (continuous dialogues that traverse semantic scales), with quality ensured by multi-stage human filtering and professional review. Result: An evaluation of 8 mainstream IIE models quantifies for the first time that nearly all models show a significant performance drop when moving from low-semantic-scale to high-semantic-scale tasks. Conclusion: Omni IIE Bench provides diagnostic tools and practical insights for developing more reliable and stable next-generation IIE models. Abstract: While Instruction-based Image Editing (IIE) has achieved significant progress, existing benchmarks pursue task breadth via mixed evaluations. This paradigm obscures a critical failure mode crucial in professional applications: the inconsistent performance of models across tasks of varying semantic scales. To address this gap, we introduce Omni IIE Bench, a high-quality, human-annotated benchmark specifically designed to diagnose the editing consistency of IIE models in practical application scenarios. Omni IIE Bench features an innovative dual-track diagnostic design: (1) Single-turn Consistency, comprising shared-context task pairs of attribute modification and entity replacement; and (2) Multi-turn Coordination, involving continuous dialogue tasks that traverse semantic scales. The benchmark is constructed via an exceptionally rigorous multi-stage human filtering process, incorporating a quality standard enforced by computer vision graduate students and an industry relevance review conducted by professional designers. We perform a comprehensive evaluation of 8 mainstream IIE models using Omni IIE Bench. Our analysis quantifies, for the first time, a prevalent performance gap: nearly all models exhibit a significant performance degradation when transitioning from low-semantic-scale to high-semantic-scale tasks. Omni IIE Bench provides critical diagnostic tools and insights for the development of next-generation, more reliable, and stable IIE models.

[63] Joint Optimization of Storage and Loading for High-Performance 3D Point Cloud Data Processing

Ke Wang,Yanfei Cao,Xiangzhi Tao,Naijie Gu,Jun Yu,Zhengdong Wang,Shouyang Dong,Fan Yu,Cong Wang,Yang Luo

Main category: cs.CV

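TL;DR: This paper proposes a new point cloud storage format, .PcRecord, and a companion high-performance multi-stage parallel processing pipeline that together accelerate the loading and processing of large-scale point cloud data, with severalfold speedups on multiple mainstream datasets.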

Details Motivation: Point cloud data are large and come in many formats (such as PLY, XYZ, and BIN), which makes loading and processing inefficient; traditional methods struggle with large-scale datasets. Method: Proposes the unified .PcRecord storage format to reduce storage occupation, and designs a high-performance data processing pipeline based on a multi-stage parallel architecture to make better use of computational resources. Result: On ModelNet40, S3DIS, ShapeNet, Kitti, SUN RGB-D, and ScanNet, the system achieves average speedups of up to 8.07x with GPU and up to 25.4x with Ascend hardware. Conclusion: The .PcRecord format and its pipeline address the bottleneck in loading and processing point cloud data and improve the efficiency and scalability of large-scale point cloud processing. Abstract: With the rapid development of computer vision and deep learning, significant advancements have been made in 3D vision, particularly in autonomous driving, robotic perception, and augmented reality. 3D point cloud data, as a crucial representation of 3D information, has gained widespread attention. However, the vast scale and complexity of point cloud data present significant challenges for loading and processing, and traditional algorithms struggle to handle large-scale datasets. The diversity of storage formats for point cloud datasets (e.g., PLY, XYZ, BIN) adds complexity to data handling and results in inefficiencies in data preparation. Although binary formats like BIN and NPY have been used to speed up data access, they still do not fully address the time-consuming data loading and processing phase. To overcome these challenges, we propose the .PcRecord format, a unified data storage solution designed to reduce the storage occupation and accelerate the processing of point cloud data. We also introduce a high-performance data processing pipeline equipped with multiple modules. By leveraging a multi-stage parallel pipeline architecture, our system optimizes the use of computational resources, significantly improving processing speed and efficiency.
This paper details the implementation of this system and demonstrates its effectiveness in addressing the challenges of handling large-scale point cloud datasets. On average, our system achieves performance improvements of 6.61x (ModelNet40), 2.69x (S3DIS), 2.23x (ShapeNet), 3.09x (Kitti), 8.07x (SUN RGB-D), and 5.67x (ScanNet) with GPU and 6.9x, 1.88x, 1.29x, 2.28x, 25.4x, and 19.3x with Ascend.

[64] EmergeNav: Structured Embodied Inference for Zero-Shot Vision-and-Language Navigation in Continuous Environments

Kun Luo,Xiaoguang Ma

Main category: cs.CV

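TL;DR: This paper proposes the EmergeNav framework, which introduces structured execution mechanisms (stage-wise planning, goal-conditioned perceptual extraction, contrastive dual-memory reasoning, and dual-field-of-view sensing) and achieves strong zero-shot vision-and-language navigation performance without task-specific training.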

Details Motivation: Modern vision-language models carry useful semantic priors, but their open-ended reasoning does not translate directly into stable long-horizon embodied execution; the key bottleneck is the lack of an execution structure that organizes instruction following, perceptual grounding, temporal progress, and stage verification. Method: Proposes the zero-shot EmergeNav framework, which formulates continuous vision-and-language navigation as structured embodied inference: a Plan-Solve-Transition hierarchy for execution, GIPE for goal-conditioned perceptual extraction, contrastive dual-memory reasoning for progress grounding, and role-separated Dual-FOV sensing for time-aligned local control and boundary verification. Result: On the VLN-CE benchmark, using only open-source VLM backbones and no task-specific training, explicit maps, graph search, or waypoint predictors, it reaches a success rate (SR) of 30.00 with Qwen3-VL-8B and 37.00 with Qwen3-VL-32B. Conclusion: Explicit execution structure is a key ingredient for turning the priors of vision-language models into stable embodied navigation behavior. Abstract: Zero-shot vision-and-language navigation in continuous environments (VLN-CE) remains challenging for modern vision-language models (VLMs). Although these models encode useful semantic priors, their open-ended reasoning does not directly translate into stable long-horizon embodied execution. We argue that the key bottleneck is not missing knowledge alone, but missing an execution structure for organizing instruction following, perceptual grounding, temporal progress, and stage verification. We propose EmergeNav, a zero-shot framework that formulates continuous VLN as structured embodied inference. EmergeNav combines a Plan-Solve-Transition hierarchy for stage-structured execution, GIPE for goal-conditioned perceptual extraction, contrastive dual-memory reasoning for progress grounding, and role-separated Dual-FOV sensing for time-aligned local control and boundary verification. On VLN-CE, EmergeNav achieves strong zero-shot performance using only open-source VLM backbones and no task-specific training, explicit maps, graph search, or waypoint predictors, reaching 30.00 SR with Qwen3-VL-8B and 37.00 SR with Qwen3-VL-32B. These results suggest that explicit execution structure is a key ingredient for turning VLM priors into stable embodied navigation behavior.

[65] PhysQuantAgent: An Inference Pipeline of Mass Estimation for Vision-Language Models

Hisayuki Yokomizo,Taiki Miyanishi,Yan Gang,Shuhei Kurita,Nakamasa Inoue,Yusuke Iwasawa

Main category: cs.CV

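TL;DR: This paper proposes the PhysQuantAgent framework and the VisPhysQuant dataset for estimating the mass of real-world objects with vision-language models (VLMs), and introduces three visual prompting methods that improve VLM reasoning about physical quantities such as mass.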

Details Motivation: Current vision-language models have limited ability to infer physical properties such as the mass of real objects, and benchmarks that evaluate physical quantity estimation under realistic sensing conditions are lacking. Method: Proposes the PhysQuantAgent framework; builds VisPhysQuant, a new benchmark of multi-view RGB-D videos with precise mass annotations; and designs three visual prompting methods (object detection, scale estimation, and cross-sectional image generation) to help the VLM understand object size and internal structure. Result: Visual prompting significantly improves the accuracy of VLM mass estimation on real-world data, supporting the value of combining spatial reasoning with VLM knowledge for physical inference. Conclusion: Structured visual prompting can compensate for VLM weaknesses in physical quantity estimation and offers a feasible path toward physical perception in embodied robotics. Abstract: Vision-Language Models (VLMs) are increasingly applied to robotic perception and manipulation, yet their ability to infer physical properties required for manipulation remains limited. In particular, estimating the mass of real-world objects is essential for determining appropriate grasp force and ensuring safe interaction. However, current VLMs lack reliable mass reasoning capabilities, and most existing benchmarks do not explicitly evaluate physical quantity estimation under realistic sensing conditions. In this work, we propose PhysQuantAgent, a framework for real-world object mass estimation using VLMs, together with VisPhysQuant, a new benchmark dataset for evaluation. VisPhysQuant consists of RGB-D videos of real objects captured from multiple viewpoints, annotated with precise mass measurements. To improve estimation accuracy, we introduce three visual prompting methods that enhance the input image with object detection, scale estimation, and cross-sectional image generation to help the model comprehend the size and internal structure of the target object. Experiments show that visual prompting significantly improves mass estimation accuracy on real-world data, suggesting the efficacy of integrating spatial reasoning with VLM knowledge for physical inference.

[66] Behavior-Centric Extraction of Scenarios from Highway Traffic Data and their Domain-Knowledge-Guided Clustering using CVQ-VAE

Niklas Roßberg,Sinan Hasirlioglu,Mohamed Essayed Bouzouraa,Wolfgang Utschick,Michael Botsch

Main category: cs.CV

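TL;DR: This paper proposes a standardized scenario extraction method based on the Scenario-as-Specification concept, together with a scenario clustering process that incorporates domain knowledge, to make the validation of automated driving systems (ADS) more efficient and comparable.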

Details Motivation: Existing methods for extracting real-world traffic scenarios use inconsistent definitions, which makes scenarios hard to compare; for grouping, machine learning methods can handle the complexity but lack interpretability and may not align with domain knowledge, while rule-based methods struggle with that complexity. Method: Proposes a standardized scenario extraction framework based on Scenario-as-Specification and a domain-knowledge-guided scenario clustering process, validated experimentally on the highD dataset. Result: Experiments show that the method extracts scenarios reliably and integrates domain knowledge effectively into clustering, supporting a more standardized derivation of scenario categories from highway data. Conclusion: The proposed method makes scenario construction for ADS validation more standardized, interpretable, and aligned with domain knowledge, which helps validate automated vehicles more efficiently. Abstract: Approval of automated driving systems (ADS) depends on evaluating their behavior within representative real-world traffic scenarios. A common way to obtain such scenarios is to extract them from real-world data recordings. These can then be grouped and serve as the basis on which the ADS is subsequently tested. This poses two central challenges: how scenarios are extracted and how they are grouped. Existing extraction methods rely on heterogeneous definitions, hindering scenario comparability. For the grouping of scenarios, rule-based or ML-based methods can be utilized. However, while modern ML-based approaches can handle the complexity of traffic scenarios, unlike rule-based approaches, they lack interpretability and may not align with domain knowledge. This work contributes to a standardized scenario extraction based on the Scenario-as-Specification concept, as well as a domain-knowledge-guided scenario clustering process. Experiments on the highD dataset demonstrate that scenarios can be extracted reliably and that domain knowledge can be effectively integrated into the clustering process. As a result, the proposed methodology supports a more standardized process for deriving scenario categories from highway data recordings and thus enables a more efficient validation process of automated vehicles.

[67] CineSRD: Leveraging Visual, Acoustic, and Linguistic Cues for Open-World Visual Media Speaker Diarization

Liangbin Huang,Xiaohua Liao,Chaoqun Cui,Shijing Wang,Zhaolong Huang,Yanlong Du,Wenji Mao

Main category: cs.CV

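TL;DR: This paper proposes the CineSRD framework for speaker diarization in open-world settings such as films and TV series; it fuses visual, acoustic, and linguistic cues to handle long videos, many speakers, and audio-visual asynchrony, and it builds a Chinese and English speaker diarization benchmark for visual media.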

Details Motivation: Traditional speaker diarization systems are limited to constrained scenarios such as meetings and interviews and struggle with open-world visual media, where videos are long, speakers are numerous, audio and visual cues are asynchronous, and in-the-wild conditions vary widely. Method: Proposes CineSRD, a unified multimodal framework that combines visual cues from video, acoustic cues from speech, and linguistic cues from subtitles: visual anchor clustering first registers initial speakers, and an audio language model then performs speaker turn detection to refine annotations and add unregistered off-screen speakers. Result: CineSRD performs best on the newly built Chinese and English visual media benchmark and is competitive on conventional datasets, supporting its robustness and generalizability in open-world visual media. Conclusion: CineSRD extends speaker diarization to complex visual media and offers a new paradigm for open-world multimodal speaker understanding. Abstract: Traditional speaker diarization systems have primarily focused on constrained scenarios such as meetings and interviews, where the number of speakers is limited and acoustic conditions are relatively clean. To explore open-world speaker diarization, we extend this task to the visual media domain, encompassing complex audiovisual programs such as films and TV series. This new setting introduces several challenges, including long-form video understanding, a large number of speakers, cross-modal asynchrony between audio and visual cues, and uncontrolled in-the-wild variability. To address these challenges, we propose Cinematic Speaker Registration & Diarization (CineSRD), a unified multimodal framework that leverages visual, acoustic, and linguistic cues from video, speech, and subtitles for speaker annotation. CineSRD first performs visual anchor clustering to register initial speakers and then integrates an audio language model for speaker turn detection, refining annotations and supplementing unregistered off-screen speakers. Furthermore, we construct and release a dedicated speaker diarization benchmark for visual media that includes Chinese and English programs. Experimental results demonstrate that CineSRD achieves superior performance on the proposed benchmark and competitive results on conventional datasets, validating its robustness and generalizability in open-world visual media settings.

[68] MSRAMIE: Multimodal Structured Reasoning Agent for Multi-instruction Image Editing

Zhaoyuan Qiu,Ken Chen,Xiangwei Wang,Yu Xia,Sachith Seneviratne,Saman Halgamuge

Main category: cs.CV

TL;DR: This paper proposes MSRAMIE, a training-free framework for multi-instruction image editing built on a multimodal large language model (MLLM). Structured reasoning over a Tree-of-States and a Graph-of-References substantially improves instruction following and completion rates on complex instructions.

Details Motivation: Existing instruction-based image editing models degrade on complex instructions that are multi-step, lengthy, and interdependent. The main cause is the scarcity of training data with complex multi-instruction annotations, and collecting such data and retraining models is costly. Method: Proposes MSRAMIE, a training-free agent framework centered on an MLLM that uses existing editing models as plug-ins. It designs an Instructor-Actor collaboration mechanism and introduces two new reasoning topologies, Tree-of-States and Graph-of-References, which support instruction decomposition, state transitions, cross-step information aggregation, and recall of the original input. Result: As instruction complexity grows, MSRAMIE improves instruction following by over 15% relative to baselines and raises the probability of completing all modifications in a single run by over 100%, while preserving perceptual quality and visual consistency. Conclusion: MSRAMIE is an efficient, interpretable, and controllable training-free approach that extends the ability of existing image editing models to handle complex real-world instructions. Abstract: Existing instruction-based image editing models perform well with simple, single-step instructions but degrade in realistic scenarios that involve multiple, lengthy, and interdependent directives. A main cause is the scarcity of training data with complex multi-instruction annotations. However, it is costly to collect such data and retrain these models. To address this challenge, we propose MSRAMIE, a training-free agent framework built on a Multimodal Large Language Model (MLLM). MSRAMIE takes existing editing models as plug-in components and handles multi-instruction tasks via structured multimodal reasoning. It orchestrates iterative interactions between an MLLM-based Instructor and an image editing Actor, introducing a novel reasoning topology that comprises the proposed Tree-of-States and Graph-of-References. During inference, complex instructions are decomposed into multiple editing steps that support state transitions, cross-step information aggregation, and original input recall, enabling systematic exploration of the image editing space and flexible, progressive output refinement. The visualizable inference topology further provides interpretable and controllable decision pathways. Experiments show that as the instruction complexity increases, MSRAMIE improves instruction following by over 15% and increases the probability of finishing all modifications in a single run by over 100%, while preserving perceptual quality and maintaining visual consistency.

[69] Continual Multimodal Egocentric Activity Recognition via Modality-Aware Novel Detection

Wonseon Lim,Hyejeong Im,Dae-Won Kim

Main category: cs.CV

TL;DR: This paper proposes MAND, a framework for multimodal egocentric open-world continual learning. Modality-aware Adaptive Scoring (MoAS) and Modality-wise Representation Stabilization Training (MoRST) improve both novel activity detection and known-class classification.

Details Motivation: Existing open-world continual learning methods score novelty from the main logits alone, ignoring complementary information from modalities such as IMU. RGB dominance and catastrophic forgetting further unbalance how the modalities are used. Method: Proposes the MAND framework: 1) at inference, Modality-aware Adaptive Scoring (MoAS) estimates the reliability of each modality from energy scores and adaptively fuses the modality logits; 2) during training, Modality-wise Representation Stabilization Training (MoRST) preserves the discriminability of each modality through auxiliary heads and modality-wise logit distillation. Result: On a public multimodal egocentric benchmark, novel activity detection AUC improves by up to 10% and known-class classification accuracy by up to 2.8%. Conclusion: MAND mitigates modality imbalance and catastrophic forgetting in multimodal open-world continual learning and improves both novelty detection and classification. Abstract: Multimodal egocentric activity recognition integrates visual and inertial cues for robust first-person behavior understanding. However, deploying such systems in open-world environments requires detecting novel activities while continuously learning from non-stationary streams. Existing methods rely on the main logits for novelty scoring, without fully exploiting the complementary evidence available from individual modalities. Because these logits are often dominated by RGB, cues from other modalities, particularly IMU, remain underutilized, and this imbalance worsens over time under catastrophic forgetting. To address this, we propose MAND, a modality-aware framework for multimodal egocentric open-world continual learning. At inference, Modality-aware Adaptive Scoring (MoAS) estimates sample-wise modality reliability from energy scores and adaptively integrates modality logits to better exploit complementary modality cues for novelty detection. During training, Modality-wise Representation Stabilization Training (MoRST) preserves modality-specific discriminability across tasks via auxiliary heads and modality-wise logit distillation. Experiments on a public multimodal egocentric benchmark show that MAND improves novel activity detection AUC by up to 10% and known-class classification accuracy by up to 2.8% over state-of-the-art baselines.
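
As a rough illustration of the energy-based scoring idea behind MoAS, the sketch below computes a free-energy score per modality and uses it to weight that modality's logits before fusion. The softmax-over-negative-energy weighting, the function names, and the example logits are assumptions made for illustration only; the abstract does not specify the exact fusion rule.

```python
import numpy as np

def energy_score(logits, temperature=1.0):
    """Free-energy score -T * logsumexp(logits / T); lower means more confident."""
    z = np.asarray(logits, dtype=float) / temperature
    m = z.max(axis=-1, keepdims=True)          # subtract max for numerical stability
    lse = m.squeeze(-1) + np.log(np.exp(z - m).sum(axis=-1))
    return -temperature * lse

def fuse_modalities(logits_by_modality):
    """Fuse per-modality logits for one sample.

    ASSUMPTION: reliability weights are a softmax over negative energy,
    so more confident modalities contribute more. This is a stand-in,
    not the rule used in the paper.
    """
    names = sorted(logits_by_modality)
    stacked = np.stack([np.asarray(logits_by_modality[n], dtype=float) for n in names])
    neg_energy = -np.array([energy_score(row) for row in stacked])
    weights = np.exp(neg_energy - neg_energy.max())
    weights /= weights.sum()
    fused = (weights[:, None] * stacked).sum(axis=0)
    return fused, dict(zip(names, weights))

# Hypothetical logits: RGB is confident, IMU is nearly uniform.
fused, weights = fuse_modalities({"rgb": [5.0, 0.0, 0.0], "imu": [0.1, 0.0, 0.0]})
novelty = energy_score(fused)  # higher energy suggests a more novel sample
```

Under this assumed rule, a sample whose fused energy exceeds a chosen threshold would be flagged as a novel activity.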

[70] Are a Thousand Words Better Than a Single Picture? Beyond Images -- A Framework for Multi-Modal Knowledge Graph Dataset Enrichment

Pengyu Zhang,Klim Zaporojets,Jie Liu,Jia-Hong Huang,Paul Groth

Main category: cs.CV

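TL;DR: This paper proposes Beyond Images, an automatic, data-centric enrichment method for multi-modal knowledge graphs (MMKGs). It retrieves entity-related images at scale, converts visual inputs into textual descriptions, and fuses them with a large language model into entity-aligned summaries, improving MMKG completion without changing model architectures, with especially large gains on visually ambiguous entities.

Details Motivation: Existing MMKGs depend on high-quality images, but large-scale image collection is difficult and tends to exclude ambiguous yet relevant visuals (such as logos, symbols, and abstract scenes), so available information is underused. Method: Proposes a three-stage automatic enrichment pipeline: (1) large-scale retrieval of entity-related images; (2) conversion of all images into textual descriptions, so that ambiguous images do not introduce noise; (3) fusion of the multi-source descriptions with a large language model into concise, entity-aligned summaries that replace or augment the original text modality. A lightweight text-image consistency check interface supports human auditing. Result: Across three public MMKG datasets and multiple baseline models, overall Hits@1 improves by up to 7%. On a hard subset with ambiguous logos and symbols, MRR improves by 201.35% and Hits@1 by 333.33%. The consistency interface improves description quality and dataset reliability. Conclusion: Expanding image coverage and converting images into high-quality textual descriptions is a practical, effective, and architecture-agnostic path to better MMKG completion. Abstract: Multi-Modal Knowledge Graphs (MMKGs) benefit from visual information, yet large-scale image collections are hard to curate and often exclude ambiguous but relevant visuals (e.g., logos, symbols, abstract scenes). We present Beyond Images, an automatic data-centric enrichment pipeline with optional human auditing. This pipeline operates in three stages: (1) large-scale retrieval of additional entity-related images, (2) conversion of all visual inputs into textual descriptions to ensure that ambiguous images contribute usable semantics rather than noise, and (3) fusion of multi-source descriptions using a large language model (LLM) to generate concise, entity-aligned summaries. These summaries replace or augment the text modality in standard MMKG models without changing their architectures or loss functions. Across three public MMKG datasets and multiple baseline models, we observe consistent gains (up to 7% Hits@1 overall). Furthermore, on a challenging subset of entities with visually ambiguous logos and symbols, converting images into text yields large improvements (201.35% MRR and 333.33% Hits@1). Additionally, we release a lightweight Text-Image Consistency Check Interface for optional targeted audits, improving description quality and dataset reliability. Our results show that scaling image coverage and converting ambiguous visuals into text is a practical path to stronger MMKG completion. Code, datasets, and supplementary materials are available at https://github.com/pengyu-zhang/Beyond-Images.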
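
For readers unfamiliar with the link-prediction metrics quoted above, here is a minimal sketch of how Hits@k and MRR are typically computed from the rank of the correct entity, plus a helper for relative gains. It assumes the 201.35% and 333.33% figures are relative improvements, which the abstract does not state explicitly; the ranks and counts below are made up for illustration.

```python
def hits_at_k(ranks, k=1):
    """Fraction of queries whose correct entity is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)

def mean_reciprocal_rank(ranks):
    """Average of 1 / rank of the correct entity (ranks start at 1)."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def relative_gain_percent(baseline, improved):
    """Relative improvement over a baseline, in percent."""
    return 100.0 * (improved - baseline) / baseline

# Hypothetical ranks of the correct entity for four test triples.
ranks = [1, 2, 5, 1]
h1 = hits_at_k(ranks, k=1)
mrr = mean_reciprocal_rank(ranks)

# Hypothetical counts: 3 correct top-1 answers before, 13 after.
gain = relative_gain_percent(3, 13)
```

On a small, hard subset, a jump from a handful of correct answers to a few more can produce relative gains of several hundred percent, so such figures are best read alongside the absolute values.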

[71] Empirical Recipes for Efficient and Compact Vision-Language Models

Jiabo Huang,Zhizhong Li,Sina Sajadmanesh,Weiming Zhuang,Lingjuan Lyu

Main category: cs.CV

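TL;DR: Through an end-to-end efficiency analysis, this paper identifies the inference bottlenecks of compact vision-language models (VLMs) and proposes targeted optimizations that cut latency substantially while preserving accuracy (TTFT drops by 53% on InternVL3-2B and by 93% on SmolVLM-256M). It also introduces ArgusVLM, a compact model family with structured perception outputs that combines efficiency with strong performance across multiple benchmarks.

Details Motivation: Existing compact VLMs often fail to deliver the inference speedups their smaller parameter counts suggest in resource-constrained settings, so the efficiency bottlenecks of real inference need to be understood and addressed. Method: Conducts an empirical end-to-end efficiency analysis and systematic inference profiling to identify the dominant bottlenecks, designs general optimization recipes for compact VLMs based on the findings, and extends the models with structured perception outputs to build the ArgusVLM family. Result: The optimizations reduce time to first token (TTFT) by 53% on InternVL3-2B and by 93% on SmolVLM-256M. ArgusVLM is compact and efficient while performing strongly across diverse benchmarks. Conclusion: The inference efficiency of compact VLMs depends not only on parameter count but also on how software, hardware, and the computation pipeline work together. The proposed recipes and the ArgusVLM family provide reusable methodology and a model basis for building practical, efficient VLM systems. Abstract: Deploying vision-language models (VLMs) in resource-constrained settings demands low latency and high throughput, yet existing compact VLMs often fall short of the inference speedups their smaller parameter counts suggest. To explain this discrepancy, we conduct an empirical end-to-end efficiency analysis and systematically profile inference to identify the dominant bottlenecks. Based on these findings, we develop optimization recipes tailored to compact VLMs that substantially reduce latency while preserving accuracy. These techniques cut time to first token (TTFT) by 53% on InternVL3-2B and by 93% on SmolVLM-256M. Our recipes are broadly applicable across both VLM architectures and common serving frameworks, providing practical guidance for building efficient VLM systems. Beyond efficiency, we study how to extend compact VLMs with structured perception outputs and introduce the resulting model family, ArgusVLM. Across diverse benchmarks, ArgusVLM achieves strong performance while maintaining a compact and efficient design.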

[72] HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning

Shenzhi Wang,Shixuan Liu,Jing Zhou,Chang Gao,Xiong-Hui Chen,Binghai Wang,An Yang,Shiji Song,Bowen Yu,Gao Huang,Junyang Lin

Main category: cs.CV

TL;DR: This paper proposes HopChain, a framework that synthesizes multi-hop vision-language reasoning data to strengthen VLMs on long-chain reasoning. It improves 20 of 24 benchmarks and shows that the full multi-hop structure is key to long-chain reasoning.

Details Motivation: Existing vision-language data lacks complex reasoning chains that rely on visual evidence throughout, so it fails to expose the compounding perception, reasoning, knowledge, and hallucination errors that VLMs make in long chain-of-thought (CoT) reasoning. Method: Proposes HopChain, which automatically generates logically dependent, instance-grounded multi-hop vision-language query chains. Each hop establishes the instances or conditions that later hops need, and the final answer is a specific, verifiable number. The synthesized data is added to the RLVR training data of Qwen3.5 models. Result: Across 24 benchmarks spanning multiple domains, adding HopChain data improves 20. In ablations, replacing full chains with half-multi-hop or single-hop variants lowers average accuracy by 5.3 and 7.0 points, respectively. In the ultra-long-CoT regime, accuracy gains exceed 50 points. Conclusion: HopChain is an effective, scalable multi-hop data synthesis framework that improves the vision-language reasoning of VLMs in a generalizable way, especially on long-chain reasoning tasks. Abstract: VLMs show strong multimodal capabilities, but they still struggle with fine-grained vision-language reasoning. We find that long CoT reasoning exposes diverse failure modes, including perception, reasoning, knowledge, and hallucination errors, which can compound across intermediate steps. However, most existing vision-language data used for RLVR does not involve complex reasoning chains that rely on visual evidence throughout, leaving these weaknesses largely unexposed. We therefore propose HopChain, a scalable framework for synthesizing multi-hop vision-language reasoning data specifically for RLVR training of VLMs. Each synthesized multi-hop query forms a logically dependent chain of instance-grounded hops, where earlier hops establish the instances, sets, or conditions needed for later hops, while the final answer remains a specific, unambiguous number suitable for verifiable rewards. We add the multi-hop data synthesized by HopChain to the original RLVR data used to train Qwen3.5-35B-A3B and Qwen3.5-397B-A17B, and compare against RLVR on the original RLVR data alone across 24 benchmarks spanning STEM and Puzzle, General VQA, Text Recognition and Document Understanding, and Video Understanding. Although this multi-hop data is not synthesized to target any specific benchmark, adding it improves 20 out of 24 benchmarks on both models, indicating broad and generalizable gains. To demonstrate that full chained queries are important, we replace them with half-multi-hop or single-hop variants, reducing the 24-benchmark average accuracy by 5.3 and 7.0 points, respectively.
Multi-hop training also strengthens long-CoT vision-language reasoning, with gains peaking at more than 50 accuracy points in the ultra-long-CoT regime. These experiments establish HopChain as an effective, scalable framework for synthesizing multi-hop data that improves generalizable vision-language reasoning.

[73] OpenQlaw: An Agentic AI Assistant for Analysis of 2D Quantum Materials

Sankalp Pandey,Xuan-Bac Nguyen,Hoang-Quan Nguyen,Tim Faltermeier,Nicholas Borys,Hugh Churchill,Khoa Luu

Main category: cs.CV

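TL;DR: This paper proposes OpenQlaw, an agentic system for analyzing 2D quantum materials. It combines the lightweight agent framework NanoBot with the physics-aware multimodal platform QuPAINT, decoupling visual identification from physical reasoning and supporting dynamic query handling, scale-aware computation, and persistent memory of physical parameters, which makes interaction more efficient and practical for experimentalists during device fabrication.

Details Motivation: Existing multimodal large models for 2D quantum materials are physics-aware, but their outputs are verbose and cognitively demanding, and they offer little direct support for real-time interaction with experimentalists or for practical device fabrication. Method: Builds the OpenQlaw agent system on the lightweight NanoBot agent framework and integrates QuPAINT as a physics-aware multimodal expert node. A core LLM agent orchestrates tasks, parses spatial data, and calls on the expert. A persistent memory module stores physical scale ratios and sample preparation methods. Result: Visual identification is decoupled from physical reasoning. The system supports dynamic tasks such as scale-aware physical computation and generation of isolated visual annotations, can be deployed to the lab floor through multiple messaging channels, and gives more context-aware, practical responses in high-throughput device fabrication. Conclusion: OpenQlaw turns isolated multimodal inference into a context-aware experimental assistant with persistent memory, offering a scalable, interactive agent paradigm for moving from optical identification to device fabrication. Abstract: The transition from optical identification of 2D quantum materials to practical device fabrication requires dynamic reasoning beyond detection accuracy. While recent domain-specific Multimodal Large Language Models (MLLMs) successfully ground visual features using physics-informed reasoning, their outputs are optimized for step-by-step cognitive transparency. This yields verbose candidate enumerations followed by dense reasoning that, while accurate, may induce cognitive overload and lack immediate utility for real-world interaction with researchers. To address this challenge, we introduce OpenQlaw, an agentic orchestration system for analyzing 2D materials. The architecture is built upon NanoBot, a lightweight agentic framework inspired by OpenClaw, and QuPAINT, one of the first Physics-Aware Instruction Multi-modal platforms for Quantum Material Discovery. This makes the system accessible from the lab floor via a variety of messaging channels. OpenQlaw allows the core Large Language Model (LLM) agent to orchestrate a domain-expert MLLM, QuPAINT, as a specialized node, successfully decoupling visual identification from reasoning and deterministic image rendering. By parsing spatial data from the expert, the agent can dynamically process user queries, such as performing scale-aware physical computation or generating isolated visual annotations, and answer in a naturalistic manner.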
Crucially, the system features a persistent memory that enables the agent to save physical scale ratios (e.g., 1 pixel = 0.25 μm) for area computations and store sample preparation methods for efficacy comparison. The application of an agentic architecture, together with the extension that uses the core agent as an orchestrator for domain-specific experts, transforms isolated inferences into a context-aware assistant capable of accelerating high-throughput device fabrication.
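
The scale-aware area computation mentioned above reduces to simple arithmetic once a scale ratio is remembered. The sketch below is a hypothetical illustration, not OpenQlaw's API: the `ScaleMemory` class, its method names, and the sample identifier are invented here, and only the 1 pixel = 0.25 μm ratio comes from the abstract.

```python
class ScaleMemory:
    """Toy stand-in for a persistent memory of physical scale ratios (hypothetical)."""

    def __init__(self):
        self._um_per_pixel = {}

    def save_scale(self, sample_id, um_per_pixel):
        self._um_per_pixel[sample_id] = float(um_per_pixel)

    def area_um2(self, sample_id, pixel_count):
        # Area scales with the square of the linear pixel size.
        scale = self._um_per_pixel[sample_id]
        return pixel_count * scale ** 2

memory = ScaleMemory()
memory.save_scale("flake-01", 0.25)          # 1 pixel = 0.25 um, as in the abstract
area = memory.area_um2("flake-01", 1600)     # a 1600-pixel region covers 100.0 um^2
```

Storing the ratio once lets later queries about the same sample reuse it without asking the user again.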

[74] Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models

Songchun Zhang,Zeyue Xue,Siming Fu,Jie Huang,Xianghao Kong,Y Ma,Haoyang Huang,Nan Duan,Anyi Rao

Main category: cs.CV

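TL;DR: This paper proposes Astrolabe, an efficient online reinforcement learning framework designed for distilled autoregressive video models. It uses forward-process reinforcement learning, streaming training, and a multi-reward objective to improve generation quality and maintain long-range coherence.

Details Motivation: Distilled autoregressive video models support efficient streaming generation but are often misaligned with human visual preferences. Existing reinforcement learning methods fit these models poorly, requiring either costly re-distillation or expensive reverse-process optimization. Method: Proposes a forward-process RL formulation based on negative-aware fine-tuning that contrasts positive and negative samples directly at inference endpoints. A streaming training scheme with a rolling KV-cache applies RL updates only to local clip windows while conditioning on prior context to stay coherent. A multi-reward objective, stabilized by uncertainty-aware selective regularization and dynamic reference updates, suppresses reward hacking. Result: Validated on multiple distilled AR video models, where the method improves generation quality and proves robust and scalable. Conclusion: Astrolabe is a practical reinforcement learning solution for distilled AR video models that needs no re-distillation, has low memory overhead, and supports alignment of long videos. Abstract: Distilled autoregressive (AR) video models enable efficient streaming generation but frequently misalign with human visual preferences. Existing reinforcement learning (RL) frameworks are not naturally suited to these architectures, typically requiring either expensive re-distillation or solver-coupled reverse-process optimization that introduces considerable memory and computational overhead. We present Astrolabe, an efficient online RL framework tailored for distilled AR models. To overcome existing bottlenecks, we introduce a forward-process RL formulation based on negative-aware fine-tuning. By contrasting positive and negative samples directly at inference endpoints, this approach establishes an implicit policy improvement direction without requiring reverse-process unrolling. To scale this alignment to long videos, we propose a streaming training scheme that generates sequences progressively via a rolling KV-cache, applying RL updates exclusively to local clip windows while conditioning on prior context to ensure long-range coherence. Finally, to mitigate reward hacking, we integrate a multi-reward objective stabilized by uncertainty-aware selective regularization and dynamic reference updates. Extensive experiments demonstrate that our method consistently enhances generation quality across multiple distilled AR video models, serving as a robust and scalable alignment solution.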

[75] PaAgent: Portrait-Aware Image Restoration Agent via Subjective-Objective Reinforcement Learning

Yijian Wang,Qingsen Yan,Jiantao Zhou,Duwei Dai,Wei Dong

Main category: cs.CV

TL;DR: This paper proposes PaAgent, a portrait-aware image restoration (IR) agent. A self-evolving portrait bank of IR tools combined with retrieval-augmented generation (RAG) makes tool selection more efficient, and a subjective-objective reinforcement learning strategy helps the agent perceive degradation more accurately in complex scenes.

Details Motivation: Existing image restoration agents lack a mechanism for summarizing insights from past interactions, so selecting an IR tool is inefficient and requires exhaustive search. Method: Proposes PaAgent: 1) builds and continuously evolves a portrait bank that characterizes IR tools from degraded images, restored images, and the tools selected; 2) uses RAG to retrieve from the portrait bank the IR tool best matched to the current input; 3) designs a subjective-objective reinforcement learning strategy whose reward combines image quality scores with semantic insights, so that partial and non-uniform degradation is identified accurately. Result: Experiments on 8 IR benchmarks, covering six single-degradation and eight mixed-degradation scenarios, show that PaAgent outperforms existing methods, particularly under complex degradation. Conclusion: By combining a portrait bank, RAG, and subjective-objective RL, PaAgent addresses two bottlenecks of IR agents, inefficient tool selection and inaccurate degradation perception, and offers a new paradigm for multimodal IR agent design. Abstract: Image Restoration (IR) agents, leveraging multimodal large language models to perceive degradation and invoke restoration tools, have shown promise in automating IR tasks. However, existing IR agents typically lack an insight summarization mechanism for past interactions, which results in an exhaustive search for the optimal IR tool. To address this limitation, we propose a portrait-aware IR agent, dubbed PaAgent, which incorporates a self-evolving portrait bank for IR tools and Retrieval-Augmented Generation (RAG) to select a suitable IR tool for input. Specifically, to construct and evolve the portrait bank, the PaAgent continuously enriches it by summarizing the characteristics of various IR tools with restored images, selected IR tools, and degraded images. In addition, the RAG is employed to select the optimal IR tool for the input image by retrieving relevant insights from the portrait bank. Furthermore, to enhance PaAgent's ability to perceive degradation in complex scenes, we propose a subjective-objective reinforcement learning strategy that considers both image quality scores and semantic insights in reward generation, which accurately provides the degradation information even under partial and non-uniform degradation. Extensive experiments across 8 IR benchmarks, covering six single-degradation and eight mixed-degradation scenarios, validate PaAgent's superiority in addressing complex IR tasks. Our project page is https://wyjgr.github.io/PaAgent.html.

[76] DesertFormer: Transformer-Based Semantic Segmentation for Off-Road Desert Terrain Classification in Autonomous Navigation Systems

Yasaswini Chebolu

Main category: cs.CV

TL;DR: This paper proposes DesertFormer, a semantic segmentation model based on SegFormer B2 and designed for unstructured off-road environments such as deserts. It recognizes 10 terrain classes, reaches 64.4% mIoU on a purpose-built dataset (a reported 24.2% improvement over a DeepLabV3 baseline), and comes with open-source code and an interactive inference tool.

Details Motivation: Desert terrain has low color contrast, extreme lighting variation, and sparse vegetation, which cause standard road-scene segmentation models to fail, so reliable terrain perception adapted to off-road environments is needed. Method: Builds on the SegFormer B2 architecture with a hierarchical Mix Transformer (MiT-B2) backbone, constructs a dedicated desert off-road dataset of 4,176 annotated 512×512 images, and proposes class-weighted training and copy-paste augmentation to address rare classes. Result: Achieves 64.4% mIoU and 86.1% pixel accuracy on the purpose-built dataset, well above a DeepLabV3 MobileNetV2 baseline (41.0% mIoU). A systematic failure analysis identifies Ground Clutter versus Landscape and Dry Grass versus Landscape as the main confusion patterns. Conclusion: DesertFormer improves terrain semantic segmentation in desert off-road environments and supports safe path planning. The dataset, method improvements, and open-source resources provide a basis for further work on off-road visual perception. Abstract: Reliable terrain perception is a fundamental requirement for autonomous navigation in unstructured, off-road environments. Desert landscapes present unique challenges due to low chromatic contrast between terrain categories, extreme lighting variability, and sparse vegetation that defy the assumptions of standard road-scene segmentation models. We present DesertFormer, a semantic segmentation pipeline for off-road desert terrain analysis based on SegFormer B2 with a hierarchical Mix Transformer (MiT-B2) backbone. The system classifies terrain into ten ecologically meaningful categories -- Trees, Lush Bushes, Dry Grass, Dry Bushes, Ground Clutter, Flowers, Logs, Rocks, Landscape, and Sky -- enabling safety-aware path planning for ground robots and autonomous vehicles. Trained on a purpose-built dataset of 4,176 annotated off-road images at 512x512 resolution, DesertFormer achieves a mean Intersection-over-Union (mIoU) of 64.4% and pixel accuracy of 86.1%, representing a +24.2% absolute improvement over a DeepLabV3 MobileNetV2 baseline (41.0% mIoU). We further contribute a systematic failure analysis identifying the primary confusion patterns -- Ground Clutter to Landscape and Dry Grass to Landscape -- and propose class-weighted training and copy-paste augmentation for rare terrain categories. Code, checkpoints, and an interactive inference dashboard are released at https://github.com/Yasaswini-ch/Vision-based-Desert-Terrain-Segmentation-using-SegFormer.
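
The two headline metrics above, mIoU and pixel accuracy, can both be read off a confusion matrix. The sketch below shows one common way to compute them with NumPy; it is a generic illustration rather than the evaluation code from the DesertFormer repository, and the tiny label arrays are invented.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are ground-truth classes, columns are predicted classes."""
    idx = num_classes * np.asarray(y_true).ravel() + np.asarray(y_pred).ravel()
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def miou_and_pixel_accuracy(cm):
    """Mean IoU over classes that appear at all, and overall pixel accuracy."""
    tp = np.diag(cm).astype(float)
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp   # predicted + actual - overlap
    present = union > 0                            # skip classes absent from both
    miou = (tp[present] / union[present]).mean()
    pixel_acc = tp.sum() / cm.sum()
    return miou, pixel_acc

# Hypothetical 4-pixel example with two classes.
cm = confusion_matrix([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
miou, acc = miou_and_pixel_accuracy(cm)
```

The off-diagonal entries of the same matrix are what a confusion analysis such as Ground Clutter versus Landscape would inspect.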

[77] TrackDeform3D: Markerless and Autonomous 3D Keypoint Tracking and Dataset Collection for Deformable Objects

Yeheng Zong,Yizhou Chen,Alexander Bowler,Chia-Tung Yang,Ram Vasudevan

Main category: cs.CV

TL;DR: This paper proposes TrackDeform3D, a low-cost, fully automatic RGB-D data collection framework for building large-scale 3D datasets of deformable objects. Motion consistency constraints make 3D keypoint detection and tracking robust, and the authors release a high-quality dataset covering 6 objects with 110 minutes of trajectory data in total.

Details Motivation: Current methods are not robust to complex deformations, and large-scale 3D data collection is expensive, relying on manual annotation or motion capture, or on assumptions that break down in unstructured environments. As a result, large-scale 3D datasets and benchmarks for deformable objects are scarce. Method: Proposes TrackDeform3D, an autonomous collection framework using only RGB-D cameras that identifies 3D keypoints and tracks them with motion consistency constraints, producing robust, temporally smooth, and geometrically coherent trajectories. Result: Outperforms several state-of-the-art tracking methods across diverse deformable object categories in both geometric and tracking accuracy, and yields a high-quality, large-scale 3D dataset of 6 objects totaling 110 minutes of trajectories. Conclusion: TrackDeform3D offers a scalable, low-cost data collection paradigm for deformable objects that supports downstream tasks such as dynamics modeling and motion planning. Abstract: Structured 3D representations such as keypoints and meshes offer compact, expressive descriptions of deformable objects, jointly capturing geometric and topological information useful for downstream tasks such as dynamics modeling and motion planning. However, robustly extracting such representations remains challenging, as current perception methods struggle to handle complex deformations. Moreover, large-scale 3D data collection remains a bottleneck: existing approaches either require prohibitive data collection efforts, such as labor-intensive annotation or expensive motion capture setups, or rely on simplifying assumptions that break down in unstructured environments. As a result, large-scale 3D datasets and benchmarks for deformable objects remain scarce. To address these challenges, this paper presents an affordable and autonomous framework for collecting 3D datasets of deformable objects using only RGB-D cameras. The proposed method identifies 3D keypoints and robustly tracks their trajectories, incorporating motion consistency constraints to produce temporally smooth and geometrically coherent data. TrackDeform3D is evaluated against several state-of-the-art tracking methods across diverse object categories and demonstrates consistent improvements in both geometric and tracking accuracy. Using this framework, this paper presents a high-quality, large-scale dataset consisting of 6 deformable objects, totaling 110 minutes of trajectory data.

[78] Edge-Efficient Two-Stream Multimodal Architecture for Non-Intrusive Bathroom Fall Detection

Haitian Wang,Yiren Wang,Xinyu Wang,Sheldon Fung,Atif Mansoor

Main category: cs.CV

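TL;DR: This paper proposes a two-stream model that fuses mmWave radar and floor vibration for fall detection in wet bathroom environments. A Motion-Mamba branch models long-range motion patterns, an Impact-Griffin branch models transient impact features, and cross-conditioned fusion improves causal alignment and resistance to confounders, giving accurate, low-latency, low-power fall detection on an edge device.

Details Motivation: Existing fall detection methods treat motion and impact as loosely coupled signal streams, rely on coarse temporal alignment and amplitude thresholds, and do not explicitly model the causal link between radar-observed collapse and floor impact. They also struggle with timing drift, object-drop confounders, and the latency and energy limits of edge devices. Method: Proposes a two-stream architecture: a Motion-Mamba branch encodes mmWave radar signals to capture long-range motion patterns, and an Impact-Griffin branch processes triaxial vibration signals, emphasizing impact transients and cross-axis coupling. Cross-conditioned fusion with low-rank bilinear interaction and a Switch-MoE head aligns motion and impact tokens and suppresses object-drop confounders. The model is deployed on a Raspberry Pi 4B gateway. Result: On a purpose-built bathroom fall detection benchmark (more than 3 hours of synchronized radar and vibration data recorded under running water), the test split yields 96.1% accuracy, 94.8% precision, 88.0% recall, 91.1% macro F1, and 0.968 AUC. Compared with the strongest baseline, accuracy rises by 2.0 percentage points and fall recall by 1.3 percentage points, latency drops from 35.9 ms to 15.8 ms, and energy per 2.56 s window drops from 14200 mJ to 10750 mJ. Conclusion: The method models the causal relationship between motion and impact, improves fall detection performance and robustness, and meets the real-time and energy requirements of edge deployment, offering a feasible route to privacy-preserving home health monitoring. Abstract: Falls in wet bathroom environments are a major safety risk for seniors living alone. Recent work has shown that mmWave-only, vibration-only, and existing multimodal schemes, such as vibration-triggered radar activation, early feature concatenation, and decision-level score fusion, can support privacy-preserving, non-intrusive fall detection. However, these designs still treat motion and impact as loosely coupled streams, depending on coarse temporal alignment and amplitude thresholds, and do not explicitly encode the causal link between radar-observed collapse and floor impact or address timing drift, object-drop confounders, and latency and energy constraints on low-power edge devices. To this end, we propose a two-stream architecture that encodes radar signals with a Motion-Mamba branch for long-range motion patterns and processes floor vibration with an Impact-Griffin branch that emphasizes impact transients and cross-axis coupling. Cross-conditioned fusion uses low-rank bilinear interaction and a Switch-MoE head to align motion and impact tokens and suppress object-drop confounders. The model keeps inference cost suitable for real-time execution on a Raspberry Pi 4B gateway.
We construct a bathroom fall detection benchmark dataset with frame-level annotations, comprising more than 3 h of synchronized mmWave radar and triaxial vibration recordings across eight scenarios under running water, together with subject-independent training, validation, and test splits. On the test split, our model attains 96.1% accuracy, 94.8% precision, 88.0% recall, a 91.1% macro F1 score, and an AUC of 0.968. Compared with the strongest baseline, it improves accuracy by 2.0 percentage points and fall recall by 1.3 percentage points, while reducing latency from 35.9 ms to 15.8 ms and lowering energy per 2.56 s window from 14200 mJ to 10750 mJ on the Raspberry Pi 4B gateway.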

[79] ACE-LoRA: Graph-Attentive Context Enhancement for Parameter-Efficient Adaptation of Medical Vision-Language Models

M. Arda Aydın,Melih B. Yilmaz,Aykut Koç,Tolga Çukur

Main category: cs.CV

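TL;DR: This paper proposes ACE-LoRA, a parameter-efficient adaptation framework that lets generalist medical vision-language models (VLMs) absorb fine-grained diagnostic cues while keeping their zero-shot generalization. It combines LoRA modules with an Attention-based Context Enhancement Hypergraph Neural Network (ACE-HGNN) and adds a label-guided InfoNCE loss for better cross-modal alignment, outperforming existing methods on multiple medical downstream tasks.

Details Motivation: Existing medical VLMs face a trade-off between specialization (single-domain) and generalization (multi-domain): specialist models generalize poorly, while generalist models lose fine-grained diagnostic cues. An efficient adaptation method that keeps the strengths of both is needed. Method: Proposes the ACE-LoRA framework: 1) inserts LoRA modules into frozen image-text encoders; 2) designs an ACE-HGNN module that models higher-order contextual interactions and incorporates localized diagnostic cues; 3) uses a label-guided InfoNCE loss to suppress false negatives between semantically related image-text pairs. Result: With only 0.95M additional trainable parameters, ACE-LoRA consistently outperforms state-of-the-art medical VLMs and PEFT baselines on zero-shot classification, segmentation, and detection benchmarks spanning multiple medical domains. Conclusion: ACE-LoRA eases the tension between specialization and generalization in medical VLMs, showing that lightweight structural enhancement and finer alignment can improve fine-grained understanding and zero-shot generalization together. Abstract: The success of CLIP-like vision-language models (VLMs) on natural images has inspired medical counterparts, yet existing approaches largely fall into two extremes: specialist models trained on single-domain data, which capture domain-specific details but generalize poorly, and generalist medical VLMs trained on multi-domain data, which retain broad semantics but dilute fine-grained diagnostic cues. Bridging this specialization-generalization trade-off remains challenging. To address this problem, we propose ACE-LoRA, a parameter-efficient adaptation framework for generalist medical VLMs that maintains robust zero-shot generalization. ACE-LoRA integrates Low-Rank Adaptation (LoRA) modules into frozen image-text encoders and introduces an Attention-based Context Enhancement Hypergraph Neural Network (ACE-HGNN) module that captures higher-order contextual interactions beyond pairwise similarity to enrich global representations with localized diagnostic cues, addressing a key limitation of prior Parameter-Efficient Fine-Tuning (PEFT) methods that overlook fine-grained details. To further enhance cross-modal alignment, we formulate a label-guided InfoNCE loss to effectively suppress false negatives between semantically related image-text pairs. Despite adding only 0.95M trainable parameters, ACE-LoRA consistently outperforms state-of-the-art medical VLMs and PEFT baselines across zero-shot classification, segmentation, and detection benchmarks spanning multiple domains. Our code is available at https://github.com/icon-lab/ACE-LoRA.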
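
To make the LoRA component concrete, the sketch below shows a generic low-rank adapter around a frozen linear layer in NumPy. It illustrates standard LoRA only, not the ACE-HGNN module or the authors' implementation; the class name, rank, scaling factor, and layer sizes are assumptions chosen for the example.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                                # frozen, shape (d_out, d_in)
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))   # trainable down-projection
        self.B = np.zeros((d_out, rank))                    # trainable, zero-initialized
        self.scale = alpha / rank

    def __call__(self, x):
        # x has shape (batch, d_in); the adapter adds a rank-r correction.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def num_trainable(self):
        return self.A.size + self.B.size

# Hypothetical 6x5 frozen layer with a rank-2 adapter.
layer = LoRALinear(np.ones((6, 5)), rank=2)
out = layer(np.ones((3, 5)))   # equals the frozen layer's output while B is zero
```

Because B starts at zero, the adapted layer initially reproduces the frozen model exactly, and only the small A and B matrices are updated during fine-tuning, which is why the trainable parameter count stays low.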

[80] Accurate Shift Invariant Convolutional Neural Networks Using Gaussian-Hermite Moments

Jaspreet Singh,Petra Bosilj,Grzegorz Cielniak

Main category: cs.CV

TL;DR: This paper proposes Gaussian-Hermite Sampling (GHS), a new downsampling strategy that makes CNNs invariant to spatial shifts without architectural changes or extra training cost, and validates its effectiveness and accuracy on several datasets.

Details Motivation: CNNs are not inherently shift invariant, largely because downsampling breaks this property. Downsampling is nevertheless essential for computational efficiency and a larger receptive field, so a downsampling method that preserves shift invariance without sacrificing performance is needed. Method: Proposes Gaussian-Hermite Sampling (GHS), which uses Gaussian-Hermite polynomials to sample in a shift-consistent way, so that CNN layers are invariant to arbitrary spatial shifts even before training. GHS can be dropped directly into standard CNN architectures. Result: On CIFAR-10, CIFAR-100, and MNIST-rot, GHS achieves 100% classification consistency under shifts and improves classification accuracy. Conclusion: GHS is an effective plug-and-play downsampling method that strengthens shift invariance and classification performance without changing the CNN architecture or training procedure. Abstract: Convolutional neural networks (CNNs) are not inherently shift invariant or equivariant. The downsampling operation used in CNNs is one of the key reasons the shift-invariance property breaks. At the same time, downsampling is important for improving computational efficiency and enlarging the receptive field to capture more contextual information. In this work, we propose Gaussian-Hermite Sampling (GHS), a novel downsampling strategy designed to achieve accurate shift invariance. GHS leverages Gaussian-Hermite polynomials to perform shift-consistent sampling, enabling CNN layers to maintain invariance to arbitrary spatial shifts prior to training. When integrated into standard CNN architectures, the proposed method embeds shift invariance directly at the layer level without requiring architectural modifications or additional training procedures. We evaluate the proposed approach on CIFAR-10, CIFAR-100, and MNIST-rot datasets. Experimental results demonstrate that GHS significantly improves shift consistency, achieving 100% classification consistency under spatial shifts, while also improving classification accuracy compared to baseline CNN models.

[81] LLM-Powered Flood Depth Estimation from Social Media Imagery: A Vision-Language Model Framework with Mechanistic Interpretability for Transportation Resilience

Nafis Fuad,Xiaodong Qian

Main category: cs.CV

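TL;DR: This paper proposes FloodLlama, a vision-language model that estimates flood depth at centimeter resolution from a single street-level image. Trained on synthetic data and supported by crowd-sourced TikTok data, it provides accurate, robust, lightweight real-time sensing of urban flooding in support of electric vehicle safety and autonomous driving decisions.

Details Motivation: Urban flooding increasingly threatens the continuity of transportation networks, yet no operational system provides real-time, street-level flood depth at centimeter resolution to support dynamic routing, electric vehicle safety, and autonomous vehicle operations. Method: Builds FloodLlama by fine-tuning LLaMA 3.2-11B Vision with QLoRA on approximately 190,000 synthetic images (7 vehicle types, 4 weather conditions, 41 depth levels), alongside crowd-sourced real-world TikTok data. Uses progressive curriculum training and a depth-dependent prompting strategy (simple prompts for shallow water, chain-of-thought reasoning for deep water). A five-phase mechanistic interpretability framework locates the critical layer L23 and enables selective fine-tuning. Result: Across 34,797 trials, mean absolute error (MAE) is below 0.97 cm, and Acc@5cm is above 93.7% for deep flooding and above 96.8% for shallow flooding. The Tier 3 configuration reaches 98.62% accuracy on real-world data and is robust to visual occlusion. The TikTok data pipeline is validated on 676 annotated frames from Detroit. Conclusion: FloodLlama provides a scalable, real-time urban flood monitoring solution that needs no dedicated infrastructure, with direct benefits for EV wading safety, AV adaptability, and resilient transportation management. Abstract: Urban flooding poses an escalating threat to transportation network continuity, yet no operational system currently provides real-time, street-level flood depth information at the centimeter resolution required for dynamic routing, electric vehicle (EV) safety, and autonomous vehicle (AV) operations. This study presents FloodLlama, a fine-tuned open-source vision-language model (VLM) for continuous flood depth estimation from single street-level images, supported by a multimodal sensing pipeline using TikTok data. A synthetic dataset of approximately 190000 images was generated, covering seven vehicle types, four weather conditions, and 41 depth levels (0-40 cm at 1 cm resolution). Progressive curriculum training enabled coarse-to-fine learning, while LLaMA 3.2-11B Vision was fine-tuned using QLoRA. Evaluation across 34797 trials reveals a depth-dependent prompt effect: simple prompts perform better for shallow flooding, whereas chain-of-thought (CoT) reasoning improves performance at greater depths. FloodLlama achieves a mean absolute error (MAE) below 0.97 cm and Acc@5cm above 93.7% for deep flooding, exceeding 96.8% for shallow depths. A five-phase mechanistic interpretability framework identifies layer L23 as the critical depth-encoding transition and enables selective fine-tuning that reduces trainable parameters by 76-80% while maintaining accuracy. The Tier 3 configuration achieves 98.62% accuracy on real-world data and shows strong robustness under visual occlusion.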
A TikTok-based data pipeline, validated on 676 annotated flood frames from Detroit, demonstrates the feasibility of real-time, crowd-sourced flood sensing. The proposed framework provides a scalable, infrastructure-free solution with direct implications for EV safety, AV deployment, and resilient transportation management.
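
The depth metrics quoted above can be sketched as follows. This assumes that Acc@5cm means the fraction of predictions within 5 cm of the ground-truth depth, which is a natural reading but is not defined in the abstract; the depth values are invented for illustration.

```python
import numpy as np

def mae_cm(pred_cm, true_cm):
    """Mean absolute error between predicted and true depths, in centimeters."""
    err = np.abs(np.asarray(pred_cm, dtype=float) - np.asarray(true_cm, dtype=float))
    return float(err.mean())

def acc_within(pred_cm, true_cm, tol_cm=5.0):
    """ASSUMED definition of Acc@5cm: share of predictions within tol_cm of the truth."""
    err = np.abs(np.asarray(pred_cm, dtype=float) - np.asarray(true_cm, dtype=float))
    return float((err <= tol_cm).mean())

# Hypothetical predicted and true flood depths in centimeters.
pred = [10, 22, 31, 4]
true = [12, 20, 40, 4]
mae = mae_cm(pred, true)        # errors are 2, 2, 9, 0
acc5 = acc_within(pred, true)   # three of four predictions fall within 5 cm
```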

[82] Pixel-level Counterfactual Contrastive Learning for Medical Image Segmentation

Marceau Lafargue-Hauret,Raghav Mehta,Fabio De Sousa Ribeiro,Mélanie Roschewitz,Ben Glocker

Main category: cs.CV

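TL;DR: This paper proposes a method that combines counterfactual generation with dense contrastive learning (DVD-CL/MVD-CL), supports both annotation-free and silver-standard-label settings, and introduces the CHRO-map visualization tool. Experiments show that it outperforms existing methods on image segmentation and, in particular, improves robustness to acquisition and pathological variations.

Details Motivation: Image segmentation depends on large manually annotated datasets, which are costly and slow to produce. Silver-standard labels are easy to obtain but can be biased. Self-supervised learning needs no labels, but existing methods that combine counterfactuals with contrastive learning do not transfer directly to pixel-level tasks. Method: Proposes the dense contrastive learning frameworks DVD-CL (dual-view) and MVD-CL (multi-view), which incorporate counterfactual image generation, designs supervised variants that use silver-standard labels, and introduces CHRO-map, a color-coded high-resolution overlay visualization. Result: Annotation-free DVD-CL outperforms other dense contrastive learning methods. The supervised variants using silver-standard labels outperform direct supervised training on those labels, reaching about 94% DSC, and the approach is more robust to acquisition and pathological variations. Conclusion: Pixel-level contrastive learning combined with counterfactual generation and silver-standard annotations improves segmentation performance and generalization, offering a new way to reduce reliance on manual annotation. Abstract: Image segmentation relies on large annotated datasets, which are expensive and slow to produce. Silver-standard (AI-generated) labels are easier to obtain, but they risk introducing bias. Self-supervised learning, needing only images, has become key for pre-training. Recent work combining contrastive learning with counterfactual generation improves representation learning for classification but does not readily extend to pixel-level tasks. We propose a pipeline combining counterfactual generation with dense contrastive learning via Dual-View (DVD-CL) and Multi-View (MVD-CL) methods, along with supervised variants that utilize available silver-standard annotations. A new visualisation algorithm, the Color-coded High Resolution Overlay map (CHRO-map), is also introduced. Experiments show annotation-free DVD-CL outperforms other dense contrastive learning methods, while supervised variants using silver-standard labels outperform training on the silver-standard labeled data directly, achieving ~94% DSC on challenging data. These results highlight that pixel-level contrastive learning, enhanced by counterfactuals and silver-standard annotations, improves robustness to acquisition and pathological variations.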
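
The DSC figure above refers to the Dice similarity coefficient, which measures overlap between a predicted and a reference mask. A minimal sketch for binary masks follows; the convention of returning 1.0 when both masks are empty is an assumption made here, and the masks are invented.

```python
import numpy as np

def dice_coefficient(pred_mask, target_mask):
    """Dice similarity coefficient 2|A ∩ B| / (|A| + |B|) for binary masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    target = np.asarray(target_mask, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0  # ASSUMED convention: two empty masks count as a perfect match
    return 2.0 * np.logical_and(pred, target).sum() / denom

# Hypothetical 2x2 masks: one overlapping pixel, three foreground pixels in total.
dsc = dice_coefficient([[1, 1], [0, 0]], [[1, 0], [0, 0]])
```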

[83] Hidden Clones: Exposing and Fixing Family Bias in Vision-Language Model Ensembles

Zacharie Bugaud

Main category: cs.CV

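TL;DR: This paper studies the performance bottleneck in vision-language model (VLM) ensembles caused by correlated errors among models of the same family, and proposes three family-aware ensembling methods (HFV, QualRCCV, LCS) that improve accuracy on misleading questions and overall benchmark performance.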

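Details Motivation: Ensembling vision-language models (VLMs) from different providers raises benchmark accuracy, but models from the same architectural family make correlated errors that standard voting cannot mitigate; on some questions (the Misleading tier) these errors produce a collectively wrong majority and drive accuracy to zero. Method: Proposes three family-aware ensembling methods: (1) Hierarchical Family Voting (HFV), which aggregates within each family before voting across families; (2) QualRCCV, a training-free weighting scheme that combines model calibration, family quality, and inverse family size; (3) Learned Candidate Scoring (LCS), which uses a cross-validated classifier to re-score candidate answers from support breadth, family diversity, and model quality. Result: HFV gains 18-26 percentage points on the Misleading tier; QualRCCV is the first method to significantly beat calibrated voting on all three benchmarks (p<0.05); LCS gains 0.68%, 0.61%, and 2.45% on VQAv2, TextVQA, and GQA respectively without degrading any benchmark; LCS reaches 87.83% on VQAv2 test-standard (12 models). Conclusion: Model family structure strongly affects ensemble quality, and modeling family correlation explicitly improves robustness and accuracy; the proposed methods help most on the misleading questions that standard ensembles get wrong, suggesting a new approach to VLM ensembling. Abstract: Ensembling Vision-Language Models (VLMs) from different providers maximizes benchmark accuracy, yet models from the same architectural family share correlated errors that standard voting ignores. We study this structure across 17 VLMs from 8 families on VQAv2, TextVQA, and GQA. Family-correlated errors reduce effective ensemble dimensionality to 2.5-3.6 independent voters and create a Misleading tier (1.5-6.5% of questions) where correlated majority errors destroy accuracy to 0% despite the best model being correct. We propose three family-aware methods. Hierarchical Family Voting (HFV) aggregates within families before voting across them, recovering +18-26 pp on the Misleading tier. QualRCCV, a training-free method weighting models by calibration, family quality, and inverse family size, is the first to beat calibrated voting on all three benchmarks (p<0.05). Learned Candidate Scoring (LCS) trains a cross-validated classifier to re-rank candidate answers using support breadth, family diversity, and model quality, achieving the largest gains: +0.68% VQAv2, +0.61% TextVQA, +2.45% GQA -- all significant -- and is the only learned method that never degrades any benchmark. On VQAv2 test-standard (EvalAI), LCS reaches 87.83% with 12 models, confirming generalization.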
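
The idea behind Hierarchical Family Voting, aggregating within families before voting across them, can be illustrated with a small sketch. The abstract does not give the exact aggregation or tie-breaking rules, so the plain majority vote, the first-seen tie-break, and the family names below are assumptions for illustration only.

```python
from collections import Counter


def majority(answers):
    """Most common answer; ties go to the answer seen first (assumed rule)."""
    counts = Counter(answers)
    best = max(counts.values())
    for answer in answers:
        if counts[answer] == best:
            return answer


def flat_vote(predictions):
    """Standard voting: every model counts once, regardless of family."""
    return majority([answer for _, answer in predictions])


def hierarchical_family_vote(predictions):
    """Vote inside each family first, then vote across the family-level answers."""
    by_family = {}
    for family, answer in predictions:
        by_family.setdefault(family, []).append(answer)
    family_answers = [majority(answers) for answers in by_family.values()]
    return majority(family_answers)


# Hypothetical question: three models of one family share the same wrong answer.
predictions = [
    ("family_a", "cat"), ("family_a", "cat"), ("family_a", "cat"),
    ("family_b", "dog"),
    ("family_c", "dog"),
]
flat_answer = flat_vote(predictions)                  # "cat": 3 votes against 2
family_answer = hierarchical_family_vote(predictions)  # "dog": 2 families against 1
```

In this toy case the large family outvotes everyone under flat voting, while the family-level vote counts it once. This is the kind of correlated-majority error the Misleading tier describes.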

[84] MosaicMem: Hybrid Spatial Memory for Controllable Video World Models

Wei Yu,Runjia Qian,Yumeng Li,Liquan Wang,Songheng Yin,Sri Siddarth Chakaravarthy P,Dennis Anthony,Yang Ye,Yidi Li,Weiwei Wan,Animesh Garg

Main category: cs.CV

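TL;DR: This paper proposes Mosaic Memory (MosaicMem), a hybrid spatial memory mechanism that improves the spatial consistency of video diffusion models under camera motion, scene revisits, and interventions, balancing 3D localization accuracy with the ability to generate dynamic content.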

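Details Motivation: Spatial memory is a bottleneck for existing video diffusion models: explicit 3D structures struggle to model moving objects, while implicit memory often yields inaccurate camera motion; both spatial consistency and dynamic generation are needed. Method: Proposes MosaicMem, a hybrid spatial memory that lifts image patches into 3D for reliable localization and targeted retrieval while using the model's native conditioning to preserve prompt following; incorporates PRoPE camera conditioning and two new memory alignment methods; composes spatially aligned patches in the target view through a patch-and-compose interface, preserving static content while letting the model inpaint the dynamic parts. Result: Experiments show that MosaicMem improves camera pose adherence compared with implicit memory and models dynamic content better than explicit baselines; it supports minute-level navigation, memory-based scene editing, and autoregressive video rollout. Conclusion: MosaicMem addresses the respective weaknesses of explicit and implicit spatial memory, supporting flexible, controllable, long-horizon video generation and interaction while maintaining world consistency. Abstract: Video diffusion models are moving beyond short, plausible clips toward world simulators that must remain consistent under camera motion, revisits, and intervention. Yet spatial memory remains a key bottleneck: explicit 3D structures can improve reprojection-based consistency but struggle to depict moving objects, while implicit memory often produces inaccurate camera motion even with correct poses. We propose Mosaic Memory (MosaicMem), a hybrid spatial memory that lifts patches into 3D for reliable localization and targeted retrieval, while exploiting the model's native conditioning to preserve prompt-following generation. MosaicMem composes spatially aligned patches in the queried view via a patch-and-compose interface, preserving what should persist while allowing the model to inpaint what should evolve. With PRoPE camera conditioning and two new memory alignment methods, experiments show improved pose adherence compared to implicit memory and stronger dynamic modeling than explicit baselines. MosaicMem further enables minute-level navigation, memory-based scene editing, and autoregressive rollout.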

[85] SMAL-pets: SMAL Based Avatars of Pets from Single Image

Piotr Borycki,Joanna Waczyńska,Yizhe Zhu,Yongqiang Gao,Przemysław Spurek

Main category: cs.CV

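TL;DR: This paper proposes SMAL-pets, a framework that combines 3D Gaussian Splatting with the SMAL parametric model to generate high-fidelity, editable, animatable 3D pet avatars from a single image, with text-driven editing of appearance and motion.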

Details Motivation: Existing 3D animal reconstruction faces scarce data, large morphological diversity, fur textures that are hard to model, and editing and animation workflows that depend on manual work. Method: Combines 3D Gaussian Splatting (for visual fidelity) with the SMAL parametric model (for anatomical plausibility) in a hybrid architecture; designs a multimodal editing suite that supports appearance adjustment and complex animation from natural-language prompts. Result: Generates high-quality, editable, drivable 3D dog models from a single image, improving fur rendering and cross-breed generalization, and supports intuitive text-controlled editing and animation. Conclusion: SMAL-pets bridges the gap between reconstruction and generative modeling, offering a flexible, robust, and user-friendly end-to-end solution for pet avatars. Abstract: Creating high-fidelity, animatable 3D dog avatars remains a formidable challenge in computer vision. Unlike human digital doubles, animal reconstruction faces a critical shortage of large-scale, annotated datasets for specialized applications. Furthermore, the immense morphological diversity across species, breeds, and crosses, which varies significantly in size, proportions, and features, complicates the generalization of existing models. Current reconstruction methods often struggle to capture realistic fur textures. Additionally, ensuring these avatars are fully editable and capable of performing complex, naturalistic movements typically necessitates labor-intensive manual mesh manipulation and expert rigging. This paper introduces SMAL-pets, a comprehensive framework that generates high-quality, editable animal avatars from a single input image. Our approach bridges the gap between reconstruction and generative modeling by leveraging a hybrid architecture. Our method integrates 3D Gaussian Splatting with the SMAL parametric model to provide a representation that is both visually high-fidelity and anatomically grounded. We introduce a multimodal editing suite that enables users to refine the avatar's appearance and execute complex animations through direct textual prompts. By allowing users to control both the aesthetic and behavioral aspects of the model via natural language, SMAL-pets provides a flexible, robust tool for animation and virtual reality.

[86] BEV-SLD: Self-Supervised Scene Landmark Detection for Global Localization with LiDAR Bird's-Eye View Images

David Skuddis,Vincent Ress,Wei Zhang,Vincent Ofosu Nyako,Norbert Haala

Main category: cs.CV

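TL;DR: BEV-SLD is a LiDAR global localization method based on Scene Landmark Detection (SLD). It uses bird's-eye-view (BEV) images to discover scene-specific landmarks in a self-supervised way and applies a consistency loss for stable detection across frames, performing well across a variety of environments.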

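Details Motivation: Conventional methods are mostly scene-agnostic and cannot model scene-specific structure; this work uses self-supervision to extract discriminative scene landmarks from BEV images at a prescribed spatial density in order to improve localization robustness. Method: Proposes the BEV-SLD framework: LiDAR point clouds are converted to BEV images and a network predicts a landmark heatmap for each frame; a consistency loss aligns learnable global landmark coordinates with the per-frame heatmaps, yielding stable, scene-adaptive landmark detection and matching. Result: Validated in real-world campus, industrial, and forest environments, with robust localization and strong performance compared with state-of-the-art methods. Conclusion: BEV-SLD shows that self-supervised, scene-aware landmark learning is effective for LiDAR global localization, offering a new approach to localization that does not depend on high-definition maps. Abstract: We present BEV-SLD, a LiDAR global localization method building on the Scene Landmark Detection (SLD) concept. Unlike scene-agnostic pipelines, our self-supervised approach leverages bird's-eye-view (BEV) images to discover scene-specific patterns at a prescribed spatial density and treat them as landmarks. A consistency loss aligns learnable global landmark coordinates with per-frame heatmaps, yielding consistent landmark detections across the scene. Across campus, industrial, and forest environments, BEV-SLD delivers robust localization and achieves strong performance compared to state-of-the-art methods.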

[87] GazeOnce360: Fisheye-Based 360° Multi-Person Gaze Estimation with Global-Local Feature Fusion

Zhuojiang Cai,Zhenghui Sun,Feng Lu

Main category: cs.CV

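TL;DR: This paper proposes GazeOnce360, an end-to-end model that estimates the 3D gaze directions of multiple people from a single upward-facing fisheye camera, and builds the synthetic dataset MPSGaze360 to support research in this new setting.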

Details Motivation: Conventional gaze estimation relies on forward-facing cameras with constrained viewpoints; this work addresses the underexplored but practical problem of estimating multi-person 3D gaze across a 360° scene from an upward-facing fisheye camera. Method: Proposes the GazeOnce360 model, which uses rotational convolutions and eye landmark supervision to handle fisheye distortion and perspective variation, and a dual-resolution architecture that fuses global low-resolution context with local high-resolution eye detail. Result: Experiments on the self-built synthetic dataset MPSGaze360 confirm the effectiveness of each component and demonstrate the feasibility and potential of fisheye-based 360° multi-person gaze estimation. Conclusion: This work opens a new direction for fisheye-based gaze estimation in natural multi-user interaction and provides a technical basis for applications such as smart meetings and collaboration analysis. Abstract: We present GazeOnce360, a novel end-to-end model for multi-person gaze estimation from a single tabletop-mounted upward-facing fisheye camera. Unlike conventional approaches that rely on forward-facing cameras in constrained viewpoints, we address the underexplored setting of estimating the 3D gaze direction of multiple people distributed across a 360° scene from an upward fisheye perspective. To support research in this setting, we introduce MPSGaze360, a large-scale synthetic dataset rendered using Unreal Engine, featuring diverse multi-person configurations with accurate 3D gaze and eye landmark annotations. Our model tackles the severe distortion and perspective variation inherent in fisheye imagery by incorporating rotational convolutions and eye landmark supervision. To better capture fine-grained eye features crucial for gaze estimation, we propose a dual-resolution architecture that fuses global low-resolution context with high-resolution local eye regions. Experimental results demonstrate the effectiveness of each component in our model. This work highlights the feasibility and potential of fisheye-based 360° gaze estimation in practical multi-person scenarios. Project page: https://caizhuojiang.github.io/GazeOnce360/.

[88] Generalist Multimodal LLMs Gain Biometric Expertise via Human Salience

Jacob Piland,Byron Dowling,Christopher Sweet,Adam Czajka

Main category: cs.CV

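TL;DR: This paper examines whether general-purpose multimodal large language models (MLLMs), combined with human expert knowledge, can perform iris presentation attack detection (PAD) under privacy constraints. Pre-trained vision encoders naturally cluster many iris attack types, and structured prompts that incorporate human salience descriptions improve discrimination on ambiguous classes. In experiments, Gemini 2.5 Pro with expert-informed prompts outperforms a CNN baseline and human examiners, while locally deployed Llama 3.2-Vision reaches near-human performance.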

Details Motivation: Iris PAD faces practical obstacles: data is hard to obtain (future attacks cannot be anticipated, and diverse data is expensive), biometric data is privacy-sensitive (it cannot be sent to external services), and attacks evolve quickly, so solutions that balance performance and privacy are needed. Method: Analyzes how vision-encoder embeddings of iris images cluster in pre-trained MLLMs (Gemini 2.5 Pro and Llama 3.2-Vision) and designs structured prompts that incorporate human verbal descriptions of attack indicators, performing PAD without sending raw biometric data to public cloud services. Result: On an IRB-restricted dataset of 224 images covering seven attack types, Gemini with expert-informed prompts is more accurate than both the CNN baseline and human examiners; locally deployed Llama 3.2-Vision reaches near-human performance. Vision-encoder embeddings naturally cluster attack types, and structured prompts help resolve overlap between classes. Conclusion: MLLMs that comply with institutional privacy rules (approved cloud services or local deployment) are a viable option for iris PAD, requiring no task-specific training while balancing security and adaptability. Abstract: Iris presentation attack detection (PAD) is critical for secure biometric deployments, yet developing specialized models faces significant practical barriers: collecting data representing future unknown attacks is impossible, and collecting diverse-enough data, yet still limited in terms of its predictive power, is expensive. Additionally, sharing biometric data raises privacy concerns. Due to the rapid emergence of new attack vectors demanding adaptable solutions, we thus investigate in this paper whether general-purpose multimodal large language models (MLLMs) can perform iris PAD when augmented with human expert knowledge, operating under strict privacy constraints that prohibit sending biometric data to public cloud MLLM services. Through analysis of vision encoder embeddings applied to our dataset, we demonstrate that pre-trained vision transformers in MLLMs inherently cluster many iris attack types despite never being explicitly trained for this task. However, where clustering shows overlap between attack classes, we find that structured prompts incorporating human salience (verbal descriptions from subjects identifying attack indicators) enable these models to resolve ambiguities. 
Testing on an IRB-restricted dataset of 224 iris images spanning seven attack types, using only university-approved services (Gemini 2.5 Pro) or locally-hosted models (e.g., Llama 3.2-Vision), we show that Gemini with expert-informed prompts outperforms both a specialized convolutional neural network (CNN)-based baseline and human examiners, while the locally-deployable Llama achieves near-human performance. Our results establish that MLLMs deployable within institutional privacy constraints offer a viable path for iris PAD.

[89] Patient4D: Temporally Consistent Patient Body Mesh Recovery from Monocular Operating Room Video

Mingxiao Tu,Hoijoon Jung,Alireza Moghadam,Andre Kyme,Jinman Kim

Main category: cs.CV

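TL;DR: This paper proposes Patient4D, which exploits a patient-stationarity prior to make 3D human mesh reconstruction from monocular video more robust in surgical AR, particularly under occlusion from surgical draping and the continuously changing viewpoint of a head-mounted camera.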

Details Motivation: Existing monocular human mesh recovery (HMR) methods degrade markedly in surgical AR (a stationary patient, heavy occlusion by draping, a continuously moving head-mounted camera) and do not make effective use of the stationarity prior. Method: Proposes Patient4D, a stationarity-constrained reconstruction pipeline that combines image-level foundation models with lightweight geometric mechanisms; its core components are Pose Locking (anchoring pose parameters using stable keyframes) and Rigid Fallback (silhouette-guided rigid alignment to handle severe occlusion). Result: Evaluated on 4,680 synthetic surgical sequences and three public HMR video benchmarks; under drape occlusion, mean IoU reaches 0.75 and the failure-frame rate drops from 30.5% to 1.3%. Conclusion: Modeling the patient-stationarity prior explicitly makes monocular 3D reconstruction substantially more robust and practical in clinical AR, offering a feasible technical path for real surgical environments. Abstract: Recovering a dense 3D body mesh from monocular video remains challenging under occlusion from draping and continuously moving camera viewpoints. This configuration arises in surgical augmented reality (AR), where an anesthetized patient lies under surgical draping while a surgeon's head-mounted camera continuously changes viewpoint. Existing human mesh recovery (HMR) methods are typically trained on upright, moving subjects captured from relatively stable cameras, leading to performance degradation under such conditions. To address this, we present Patient4D, a stationarity-constrained reconstruction pipeline that explicitly exploits the stationarity prior. The pipeline combines image-level foundation models for perception with lightweight geometric mechanisms that enforce temporal consistency across frames. Two key components enable robust reconstruction: Pose Locking, which anchors pose parameters using stable keyframes, and Rigid Fallback, which recovers meshes under severe occlusion through silhouette-guided rigid alignment. Together, these mechanisms stabilize predictions while remaining compatible with off-the-shelf HMR models. We evaluate Patient4D on 4,680 synthetic surgical sequences and three public HMR video benchmarks. Under surgical drape occlusion, Patient4D achieves a 0.75 mean IoU, reducing failure frames from 30.5% to 1.3% compared to the best baseline. Our findings demonstrate that exploiting stationarity priors can substantially improve monocular reconstruction in clinical AR scenarios.

[90] Visual Product Search Benchmark

Karthik Sulthanpete Govindappa

Main category: cs.CV

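TL;DR: This paper builds a benchmark of visual embedding models for instance-level image retrieval aimed at industrial applications. It evaluates open-source foundation models, proprietary multi-modal systems, and domain-specific vision models on real industrial datasets, with an emphasis on no post-processing, diverse imaging conditions, and exact instance matching.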

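Details Motivation: In industrial and commercial settings such as maintenance and procurement, products must be identified reliably from images, and incorrect matches can cause serious downstream failures; visual search systems must retrieve the exact target instance from large, continuously updated catalogs under diverse imaging conditions, yet a unified benchmark for industrial scenarios has been lacking. Method: Designs and runs a unified image-to-image retrieval protocol to evaluate several classes of visual embedding models (open-source foundation models, proprietary multi-modal systems, domain-specific vision models) on industrial datasets from production deployments in manufacturing, automotive, DIY, and retail, plus public benchmarks, with no post-processing so that each model's intrinsic retrieval capability is isolated. Result: Shows that current foundation and unified embedding models transfer only partially to fine-grained instance retrieval and lag clearly behind models trained specifically for industrial use; identifies each model's strengths and limitations under realistic constraints (heterogeneous image quality, exact-match requirements). Conclusion: The benchmark provides empirical grounding for selecting and developing industrial product identification systems, stresses the need to adapt embedding models to the domain, and encourages visual retrieval evaluation that is closer to production needs. Abstract: Reliable product identification from images is a critical requirement in industrial and commercial applications, particularly in maintenance, procurement, and operational workflows where incorrect matches can lead to costly downstream failures. At the core of such systems lies the visual search component, which must retrieve and rank the exact object instance from large and continuously evolving catalogs under diverse imaging conditions. This report presents a structured benchmark of modern visual embedding models for instance-level image retrieval, with a focus on industrial applications. A curated set of open-source foundation embedding models, proprietary multi-modal embedding systems, and domain-specific vision-only models are evaluated under a unified image-to-image retrieval protocol. The benchmark covers curated datasets, including industrial datasets derived from production deployments in Manufacturing, Automotive, DIY, and Retail, as well as established public benchmarks. Evaluation is conducted without post-processing, isolating the retrieval capability of each model. The results provide insight into how well contemporary foundation and unified embedding models transfer to fine-grained instance retrieval tasks, and how they compare to models explicitly trained for industrial applications. 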
By emphasizing realistic constraints, heterogeneous image conditions, and exact instance matching requirements, this benchmark aims to inform both practitioners and researchers about the strengths and limitations of current visual embedding approaches in production-level product identification systems. An interactive companion website presenting the benchmark results, evaluation details, and additional visualizations is available at https://benchmark.nyris.io.

[91] SA-CycleGAN-2.5D: Self-Attention CycleGAN with Tri-Planar Context for Multi-Site MRI Harmonization

Ishrith Gowda,Chunwei Liu

Main category: cs.CV

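TL;DR: This paper proposes SA-CycleGAN-2.5D, a domain-adaptation-based harmonization method for multi-site neuroimaging. It uses 2.5D tri-planar injection, a U-ResNet generator with global self-attention, and a spectrally normalized discriminator to substantially reduce scanner-induced distribution shift and improve radiomic reproducibility.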

Details Motivation: Multi-site neuroimaging analysis is severely confounded by scanner-induced covariate shift, which makes voxel intensity distributions vary non-linearly across devices, while existing methods (such as ComBat or CNNs) are limited in spatial modeling or in modeling global correlations. Method: Proposes the SA-CycleGAN-2.5D framework: (1) a 2.5D tri-planar manifold injection that preserves through-plane (z-direction) gradients; (2) a U-ResNet generator with dense voxel-level self-attention that goes beyond the receptive field limit of CNNs; (3) a spectrally normalized discriminator that enforces Lipschitz continuity and stabilizes training. Result: On 654 glioma patients (BraTS and UPenn-GBM), MMD drops by 99.1% (1.729→0.015) and domain classifier accuracy falls to near chance (59.7%); ablation shows that global attention is essential for heterogeneous-to-homogeneous translation (Cohen's d=1.32, p<0.001). Conclusion: SA-CycleGAN-2.5D bridges 2D efficiency and 3D consistency, producing voxel-level harmonized images that preserve tumor pathophysiology and support reproducible multi-center radiomic analysis. Abstract: Multi-site neuroimaging analysis is fundamentally confounded by scanner-induced covariate shifts, where the marginal distribution of voxel intensities $P(\mathbf{x})$ varies non-linearly across acquisition protocols while the conditional anatomy $P(\mathbf{y}|\mathbf{x})$ remains constant. This is particularly detrimental to radiomic reproducibility, where acquisition variance often exceeds biological pathology variance. Existing statistical harmonization methods (e.g., ComBat) operate in feature space, precluding spatial downstream tasks, while standard deep learning approaches are theoretically bounded by local effective receptive fields (ERF), failing to model the global intensity correlations characteristic of field-strength bias. We propose SA-CycleGAN-2.5D, a domain adaptation framework motivated by the $\mathcal{H}\Delta\mathcal{H}$-divergence bound of Ben-David et al., integrating three architectural innovations: (1) A 2.5D tri-planar manifold injection preserving through-plane gradients $\nabla_z$ at $O(HW)$ complexity; (2) A U-ResNet generator with dense voxel-to-voxel self-attention, surpassing the $O(\sqrt{L})$ receptive field limit of CNNs to model global scanner field biases; and (3) A spectrally-normalized discriminator constraining the Lipschitz constant ($K_D \le 1$) for stable adversarial optimization. 
Evaluated on 654 glioma patients across two institutional domains (BraTS and UPenn-GBM), our method reduces Maximum Mean Discrepancy (MMD) by 99.1% ($1.729 \to 0.015$) and degrades domain classifier accuracy to near-chance (59.7%). Ablation confirms that global attention is statistically essential (Cohen's $d = 1.32$, $p < 0.001$) for the harder heterogeneous-to-homogeneous translation direction. By bridging 2D efficiency and 3D consistency, our framework yields voxel-level harmonized images that preserve tumor pathophysiology, enabling reproducible multi-center radiomic analysis.
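
To make the MMD figure concrete, the sketch below computes a (biased) squared Maximum Mean Discrepancy between two 1-D samples with a Gaussian kernel and reproduces the 99.1% relative reduction from the two reported values. The abstract does not state which kernel, bandwidth, or estimator the authors use, so those choices and the toy intensity samples are assumptions.

```python
import math


def rbf_kernel(a, b, sigma=1.0):
    """Gaussian (RBF) kernel between two scalars; sigma is an assumed bandwidth."""
    return math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2))


def mmd_squared(xs, ys, sigma=1.0):
    """Biased estimate of squared MMD: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    def mean_kernel(us, vs):
        total = sum(rbf_kernel(u, v, sigma) for u in us for v in vs)
        return total / (len(us) * len(vs))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Hypothetical voxel intensities from two scanners, before harmonization.
site_a = [0.0, 0.5, 1.0]
site_b = [3.0, 3.5, 4.0]
shifted = mmd_squared(site_a, site_b)    # clearly positive: distributions differ
identical = mmd_squared(site_a, site_a)  # zero: distributions match

# Relative reduction implied by the reported values 1.729 -> 0.015.
reduction_percent = round((1.729 - 0.015) / 1.729 * 100, 1)  # 99.1
```

An MMD near zero means a kernel two-sample test can no longer tell the sites apart, which is consistent with the near-chance domain classifier accuracy reported alongside it.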

[92] Adaptive Anchor Policies for Efficient 4D Gaussian Streaming

Ashim Dahal,Rabab Abdelfattah,Nick Rahimi

Main category: cs.CV

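TL;DR: This paper proposes Efficient Gaussian Streaming (EGS), a budget-aware anchor sampling method that replaces conventional Farthest Point Sampling (FPS) with a reinforcement-learned policy. With the Gaussian streaming reconstruction backbone unchanged, it dynamically selects fewer but more informative anchors, improving the quality-efficiency trade-off in dynamic scene reconstruction.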

Details Motivation: Existing Gaussian Splatting methods for dynamic scene reconstruction typically select a fixed number of anchors (e.g., 8,192) with a fixed rule such as FPS, which wastes computation or degrades performance under tight compute budgets and cannot adapt to scene complexity or real-time requirements. Method: Proposes the EGS framework, which casts anchor selection as a reinforcement learning problem under discrete constraints and jointly optimizes the anchor budget and the specific anchor subset; spatial features of the Gaussian representation serve as the state input, and the policy is trained with a reward that weights reconstruction quality against runtime; it is a plug-in replacement for the existing FPS module. Result: On dynamic multi-view datasets such as N3DV and MeetingRoom, EGS in fast-rendering mode uses only 256 anchors (32× fewer than 8,192), improves PSNR by 0.52-0.61 dB, and runs 1.29-1.35× faster than IGS@8192; in high-quality refinement mode it stays comparable to the baseline at far lower anchor budgets. Conclusion: EGS is a lightweight, general, budget-aware anchor sampler that mitigates the resource inefficiency of fixed anchor policies in dynamic Gaussian streaming reconstruction and achieves a better balance between quality and efficiency. Abstract: Dynamic scene reconstruction with Gaussian Splatting has enabled efficient streaming for real-time rendering and free-viewpoint video. However, most pipelines rely on fixed anchor selection such as Farthest Point Sampling (FPS), typically using 8,192 anchors regardless of scene complexity, which over-allocates computation under strict budgets. We propose Efficient Gaussian Streaming (EGS), a plug-in, budget-aware anchor sampler that replaces FPS with a reinforcement-learned policy while keeping the Gaussian streaming reconstruction backbone unchanged. The policy jointly selects an anchor budget and a subset of informative anchors under discrete constraints, balancing reconstruction quality and runtime using spatial features of the Gaussian representation. We evaluate EGS in two settings: fast rendering, which prioritizes runtime efficiency, and high-quality refinement, which enables additional optimization. Experiments on dynamic multi-view datasets show consistent improvements in the quality-efficiency trade-off over FPS sampling. On unseen data, in fast rendering at 256 anchors ($32\times$ fewer than 8,192), EGS improves PSNR by $+0.52$--$0.61$ dB while running $1.29$--$1.35\times$ faster than IGS@8192 (N3DV and MeetingRoom). In high-quality refinement, EGS remains competitive with the full-anchor baseline at substantially lower anchor budgets. Code and pretrained checkpoints will be released upon acceptance. Keywords: 4D Gaussian Splatting, 4D Gaussian Streaming, Reinforcement Learning.
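
For context, the fixed baseline that EGS replaces, Farthest Point Sampling, can be sketched in a few lines. This is the textbook greedy algorithm, not the authors' implementation or their learned policy; the starting index and the toy points are assumptions.

```python
import math


def farthest_point_sampling(points, k, start=0):
    """Greedy FPS: repeatedly pick the point farthest from all points chosen so far.

    Requires 1 <= k <= len(points). Returns the indices of the selected anchors.
    """
    selected = [start]
    # Distance from every point to its nearest selected anchor.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return selected


# Hypothetical 2-D Gaussian centers; real scenes use 3-D positions.
centers = [(0, 0), (1, 0), (10, 0), (0, 10), (5, 5)]
anchors = farthest_point_sampling(centers, k=3)  # indices [0, 2, 3]

# The budget ratio quoted in the abstract: 8,192 anchors versus 256.
budget_ratio = 8192 // 256  # 32
```

FPS always returns exactly `k` anchors spread by geometry alone. The abstract's point is that a learned policy can choose both a smaller `k` and which anchors matter, instead of fixing `k` at 8,192.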

[93] From Drop-off to Recovery: A Mechanistic Analysis of Segmentation in MLLMs

Boyong Wu,Sanghwan Kim,Zeynep Akata

Main category: cs.CV

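TL;DR: Using layerwise linear probing, attention interventions, and a bidirectional attention analysis, this paper reveals how multimodal large language models (MLLMs) handle spatial understanding in pixel-level segmentation: the adapter causes a drop in representation quality, LLM layers progressively recover it through attention, the recovery is limited by causal attention, and bidirectional attention among image tokens alleviates that limit.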

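Details Motivation: Multimodal large language models (MLLMs) are widely applied to pixel-level vision tasks, but their intrinsic capacity for spatial understanding is still unclear. Method: Uses layerwise linear probing to assess segmentation capacity across the whole MLLM pipeline (vision encoder, adapter, LLM); runs an intervention-based attention knockout analysis to test whether cross-token attention progressively refines visual representations; evaluates how bidirectional attention among image tokens contributes to spatial consistency. Result: The adapter introduces a drop-off in segmentation representation; LLM layers progressively recover it through attention-mediated refinement, in which correctly classified tokens steer misclassified neighbors toward the correct label; recovery at early image token positions is bounded by causal attention, and bidirectional attention among image tokens alleviates this limit. Conclusion: The study gives a mechanistic account of how MLLMs process visual information for segmentation and offers guidance for designing future segmentation-capable models. Abstract: Multimodal Large Language Models (MLLMs) are increasingly applied to pixel-level vision tasks, yet their intrinsic capacity for spatial understanding remains poorly understood. We investigate segmentation capacity through a layerwise linear probing evaluation across the entire MLLM pipeline: vision encoder, adapter, and LLM. We further conduct an intervention-based attention knockout analysis to test whether cross-token attention progressively refines visual representations, and an evaluation of bidirectional attention among image tokens on spatial consistency. Our analysis reveals that the adapter introduces a segmentation representation drop-off, but LLM layers progressively recover through attention-mediated refinement, where correctly classified tokens steer misclassified neighbors toward the correct label. At early image token positions, this recovery is bounded by causal attention, which bidirectional attention among image tokens alleviates. These findings provide a mechanistic account of how MLLMs process visual information for segmentation, informing the design of future segmentation-capable models.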

[94] GigaWorld-Policy: An Efficient Action-Centered World-Action Model

Angen Ye,Boyuan Wang,Chaojun Ni,Guan Huang,Guosheng Zhao,Hao Li,Hengtao Li,Jie Li,Jindi Lv,Jingyu Liu,Min Cao,Peng Li,Qiuping Deng,Wenjun Mei,Xiaofeng Wang,Xinze Chen,Xinyu Zhou,Yang Wang,Yifan Chang,Yifan Li,Yukun Zhou,Yun Ye,Zhichao Liu,Zheng Zhu

Main category: cs.CV

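TL;DR: This paper proposes GigaWorld-Policy, an action-centered World-Action Model (WAM) that decouples action prediction from video generation and uses a causal structure for efficient inference, improving both speed and task success rate on real robot platforms.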

Details Motivation: Existing World-Action Models (WAMs) built on pre-trained video generation backbones have two bottlenecks: jointly reasoning over future visual dynamics and actions makes inference expensive, and entangled visual and motion representations make action prediction quality depend heavily on video prediction accuracy. Method: Proposes GigaWorld-Policy, which formulates policy training as two coupled components: (1) predicting future action sequences from the current observation, and (2) generating future video from the predicted actions and the current observation; a causal architecture prevents future-video tokens from influencing action tokens, so video generation is optional at inference time; a large-scale robot dataset is curated to pre-train the action-centered video generation model. Result: On real robot platforms, GigaWorld-Policy runs 9x faster than Motus, the leading WAM baseline, and improves task success rate by 7%; on RoboTwin 2.0 it improves performance by 95% over pi-0.5. Conclusion: An action-centered, decoupled, causally constrained WAM design delivers both efficient inference and strong policy performance, offering a new approach to deploying world models in embodied AI. Abstract: World-Action Models (WAM) initialized from pre-trained video generation backbones have demonstrated remarkable potential for robot policy learning. However, existing approaches face two critical bottlenecks that hinder performance and deployment. First, jointly reasoning over future visual dynamics and corresponding actions incurs substantial inference overhead. Second, joint modeling often entangles visual and motion representations, making motion prediction accuracy heavily dependent on the quality of future video forecasts. To address these issues, we introduce GigaWorld-Policy, an action-centered WAM that learns 2D pixel-action dynamics while enabling efficient action decoding, with optional video generation. Specifically, we formulate policy training into two coupled components: the model predicts future action sequences conditioned on the current observation, and simultaneously generates future videos conditioned on the predicted actions and the same observation. The policy is supervised by both action prediction and video generation, providing richer learning signals and encouraging physically plausible actions through visual-dynamics constraints. With a causal design that prevents future-video tokens from influencing action tokens, explicit future-video generation is optional at inference time, allowing faster action prediction during deployment. To support this paradigm, we curate a diverse, large-scale robot dataset to pre-train an action-centered video generation model, which is then adapted as the backbone for robot policy learning. 
Experimental results on real-world robotic platforms show that GigaWorld-Policy runs 9x faster than the leading WAM baseline, Motus, while improving task success rates by 7%. Moreover, compared with pi-0.5, GigaWorld-Policy improves performance by 95% on RoboTwin 2.0.
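
The causal design described above, in which future-video tokens may not influence action tokens, can be pictured as a block attention mask. The sketch below is only one plausible reading: the token ordering (observation, then action, then video) and the block-level granularity are assumptions, since the abstract does not specify the layout.

```python
def build_attention_mask(n_obs, n_act, n_vid):
    """mask[q][k] is True when query token q may attend to key token k.

    Assumed layout: observation tokens, then action tokens, then future-video tokens.
    """
    kinds = ["obs"] * n_obs + ["act"] * n_act + ["vid"] * n_vid
    allowed_keys = {
        "obs": {"obs"},                # observations see only observations
        "act": {"obs", "act"},         # actions never see future video
        "vid": {"obs", "act", "vid"},  # video is conditioned on actions
    }
    n = len(kinds)
    return [[kinds[k] in allowed_keys[kinds[q]] for k in range(n)] for q in range(n)]


# Hypothetical sizes: 2 observation, 3 action, 4 future-video tokens.
full = build_attention_mask(2, 3, 4)
action_only = build_attention_mask(2, 3, 0)

# Dropping the video tokens leaves the observation/action block untouched.
same_without_video = [row[:5] for row in full[:5]] == action_only
```

Because no action row has a `True` entry in a video column, the action outputs are identical whether or not the video tokens are computed. This is why video generation can be skipped at inference time.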

[95] LED: A Benchmark for Evaluating Layout Error Detection in Document Analysis

Inbum Heo,Taewook Hwang,Jeesu Jung,Sangkeun Jung

Main category: cs.CV

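TL;DR: This paper proposes the Layout Error Detection (LED) benchmark for evaluating the structural reasoning of models in Document Layout Analysis (DLA). It defines eight standardized error types and builds LED-Dataset and three evaluation tasks, enabling fine-grained, interpretable diagnosis of the structural robustness and reasoning capability of document understanding models.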

Details Motivation: Existing overlap-based metrics (e.g., IoU, mAP) cannot capture logical, structural errors in document layout analysis (such as region merging, splitting, and omission), so a benchmark that evaluates structural reasoning is needed. Method: Proposes the LED benchmark, defines eight standardized layout error types, and designs quantitative rules and error-injection algorithms to simulate realistic errors; builds LED-Dataset and three evaluation tasks: document-level error detection, document-level error-type classification, and element-level error-type classification. Result: Experiments on several state-of-the-art multimodal models show that LED enables fine-grained, interpretable evaluation of structural understanding and reveals clear weaknesses in structural reasoning across modalities and architectures. Conclusion: LED establishes a unified, explainable benchmark for diagnosing the structural robustness and reasoning capability of document understanding models, pushing DLA toward deeper logical understanding. Abstract: Recent advances in Large Language Models (LLMs) and Large Multimodal Models (LMMs) have improved Document Layout Analysis (DLA), yet structural errors such as region merging, splitting, and omission remain persistent. Conventional overlap-based metrics (e.g., IoU, mAP) fail to capture such logical inconsistencies. To overcome this limitation, we propose Layout Error Detection (LED), a benchmark that evaluates structural reasoning in DLA predictions beyond surface-level accuracy. LED defines eight standardized error types (Missing, Hallucination, Size Error, Split, Merge, Overlap, Duplicate, and Misclassification) and provides quantitative rules and injection algorithms for realistic error simulation. Using these definitions, we construct LED-Dataset and design three evaluation tasks: document-level error detection, document-level error-type classification, and element-level error-type classification. Experiments with state-of-the-art multimodal models show that LED enables fine-grained and interpretable assessment of structural understanding, revealing clear weaknesses across modalities and architectures. Overall, LED establishes a unified and explainable benchmark for diagnosing the structural robustness and reasoning capability of document understanding models.

[96] ConfusionBench: An Expert-Validated Benchmark for Confusion Recognition and Localization in Educational Videos

Lu Dong,Xiao Wang,Mark Frank,Srirangaraj Setlur,Venu Govindaraju,Ifeoma Nwogu

Main category: cs.CV

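TL;DR: This paper proposes a multi-stage filtering pipeline for building ConfusionBench, a high-quality benchmark for recognizing and localizing student confusion, and runs zero-shot evaluations of an open-source and a proprietary model, revealing their performance differences and limitations in recognizing student confusion in educational videos.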

Details Motivation: Existing confusion datasets have noisy labels, coarse temporal annotations, and limited expert validation, which makes them unsuitable for fine-grained recognition and temporal localization. Method: Designs a multi-stage filtering pipeline that combines model-assisted screening, researcher curation, and expert validation; builds the ConfusionBench benchmark, which contains a balanced recognition dataset and a video localization dataset; runs zero-shot experiments on clip-level recognition and long-video localization; develops a confusion report visualization tool. Result: The proprietary model performs better overall but tends to over-predict transitional segments, while the open-source model is more conservative and misses more detections; ConfusionBench improves data quality; the visualization tool supports intervention decisions by educational experts. Conclusion: A high-quality data construction pipeline and benchmark are essential for advancing the understanding of student confusion in educational AI; future work should improve temporal modeling and annotation consistency. Abstract: Recognizing and localizing student confusion from video is an important yet challenging problem in educational AI. Existing confusion datasets suffer from noisy labels, coarse temporal annotations, and limited expert validation, which hinder reliable fine-grained recognition and temporally grounded analysis. To address these limitations, we propose a practical multi-stage filtering pipeline that integrates two stages of model-assisted screening, researcher curation, and expert validation to build a higher-quality benchmark for confusion understanding. Based on this pipeline, we introduce ConfusionBench, a new benchmark for educational videos consisting of a balanced confusion recognition dataset and a video localization dataset. We further provide zero-shot baseline evaluations of a representative open-source model and a proprietary model on clip-level confusion recognition and long-video confusion localization tasks. Experimental results show that the proprietary model performs better overall but tends to over-predict transitional segments, while the open-source model is more conservative and more prone to missed detections. In addition, the proposed student confusion report visualization can support educational experts in making intervention decisions and adapting learning plans accordingly. All datasets and related materials will be made publicly available on our project page.

[97] DANCE: Dynamic 3D CNN Pruning: Joint Frame, Channel, and Feature Adaptation for Energy Efficiency on the Edge

Mohamed Mejri,Ashiqur Rasul,Abhijit Chatterjee

Main category: cs.CV

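TL;DR: This paper proposes DANCE, a fine-grained, input-aware dynamic pruning framework for 3D CNNs. Its two steps, activation variability amplification (AVA) and adaptive activation pruning (AAP), substantially improve energy efficiency with almost no loss of accuracy.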

Details Motivation: Modern CNNs cannot adapt dynamically to the computational complexity of each input sample, which leads to high energy consumption. Method: A two-step approach: in the first step, AVA, the model is retrained to increase the variance of neuron activation magnitudes; in the second step, AAP, a lightweight controller network is trained to dynamically prune frames, channels, and features based on statistics of the first layer's outputs. Result: Speedups of 1.37X on the Jetson Nano and 2.22X on the Snapdragon 8 Gen 1, with up to 1.47X higher energy efficiency. Conclusion: DANCE substantially reduces computation and memory-access overhead while maintaining performance, improving the power efficiency of 3D CNNs. Abstract: Modern convolutional neural networks (CNNs) are workhorses for video and image processing, but fail to adapt to the computational complexity of input samples in a dynamic manner to minimize energy consumption. In this research, we propose DANCE, a fine-grained, input-aware, dynamic pruning framework for 3D CNNs to maximize power efficiency with negligible to zero impact on performance. In the first step of the proposed two-step approach, called activation variability amplification (AVA), the 3D CNN model is retrained to increase the variance of the magnitude of neuron activations across the network, facilitating pruning decisions across diverse CNN input scenarios. In the second step, called adaptive activation pruning (AAP), a lightweight activation controller network is trained to dynamically prune frames, channels, and features of 3D convolutional layers of the network (different for each layer), based on statistics of the outputs of the first layer of the network. Our method achieves substantial savings in multiply-accumulate (MAC) operations and memory accesses by introducing sparsity within convolutional layers. Hardware validation on the NVIDIA Jetson Nano GPU and the Qualcomm Snapdragon 8 Gen 1 platform demonstrates respective speedups of 1.37X and 2.22X, achieving up to 1.47X higher energy efficiency compared to the state of the art.

[98] Directing the Narrative: A Finetuning Method for Controlling Coherence and Style in Story Generation

Jianzhang Zhang,Yijing Tian,Jiwang Qu,Chuang Liu

Main category: cs.CV

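TL;DR: This paper proposes a two-stage framework that uses Group-Shared Attention (GSA) to strengthen cross-frame character identity consistency and Direct Preference Optimization (DPO) to improve the aesthetic and narrative quality of generated images, improving character identity and style consistency metrics on ViStoryBench.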

Details Motivation: Existing story visualization methods fall short on character identity consistency and long-narrative coherence, and identity drift is especially common in complex interactions and long story arcs. Method: The first stage introduces Group-Shared Attention (GSA), which enables lossless cross-sample information sharing inside attention layers to model identity correspondence across frames structurally; the second stage applies Direct Preference Optimization (DPO), jointly optimizing visual fidelity and identity consistency from holistic human preference data. Result: Sets a new state of the art on the ViStoryBench benchmark: character identity consistency (CIDS) improves by +10.0 and style consistency (CSD) by +18.7, while high-fidelity generation is preserved. Conclusion: The two-stage framework addresses identity drift and style inconsistency in story visualization and shows that combining GSA with DPO is effective for multi-frame semantic and visual alignment. Abstract: Story visualization requires generating sequential imagery that aligns semantically with evolving narratives while maintaining rigorous consistency in character identity and visual style. However, existing methodologies often struggle with subject inconsistency and identity drift, particularly when depicting complex interactions or extended narrative arcs. To address these challenges, we propose a cohesive two-stage framework designed for robust and consistent story generation. First, we introduce Group-Shared Attention (GSA), a mechanism that fosters intrinsic consistency by enabling lossless cross-sample information flow within attention layers. This allows the model to structurally encode identity correspondence across frames without relying on external encoders. Second, we leverage Direct Preference Optimization (DPO) to align generated outputs with human aesthetic and narrative standards. Unlike conventional methods that rely on conflicting auxiliary losses, our approach simultaneously enhances visual fidelity and identity preservation by learning from holistic preference data. Extensive evaluations on the ViStoryBench benchmark demonstrate that our method establishes a new state-of-the-art, significantly outperforming strong baselines with gains of +10.0 in Character Identity (CIDS) and +18.7 in Style Consistency (CSD), all while preserving high-fidelity generation.
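
For readers unfamiliar with DPO, the sketch below evaluates the standard DPO loss for a single preference pair, as originally formulated for language models. How the authors adapt it to an image generator is not described in the abstract, so the scalar log-probabilities, the `beta` value, and the function name are illustrative assumptions.

```python
import math


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    loss = -log sigmoid(beta * [(logp_win - ref_logp_win) - (logp_lose - ref_logp_lose)])
    """
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(m)) equals softplus(-m); branch to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-likelihoods under the trained policy and a frozen reference.
agrees = dpo_loss(logp_win=-1.0, logp_lose=-3.0, ref_logp_win=-2.0, ref_logp_lose=-2.0)
disagrees = dpo_loss(logp_win=-3.0, logp_lose=-1.0, ref_logp_win=-2.0, ref_logp_lose=-2.0)
neutral = dpo_loss(0.0, 0.0, 0.0, 0.0)  # log(2): policy equals the reference
```

The loss falls as the policy raises the preferred sample's likelihood relative to the reference and lowers the rejected one's. A single objective of this kind is how preference data can replace separate auxiliary losses.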

[99] 3D MRI-Based Alzheimer's Disease Classification Using Multi-Modal 3D CNN with Leakage-Aware Subject-Level Evaluation

Md Sifat,Sania Akter,Akif Islam,Md. Ekramul Hamid,Abu Saleh Musa Miah,Najmul Hassan,Md Abdur Rahim,Jungpil Shin

Main category: cs.CV

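TL;DR: This paper proposes a multimodal 3D convolutional neural network for Alzheimer's disease (AD) classification that uses raw 3D MRI volumes from the OASIS-1 dataset together with tissue probability maps from FSL FAST segmentation, reaching 72.34% accuracy and 0.778 AUC under subject-level 5-fold cross-validation; Grad-CAM indicates that the model attends to anatomical regions known to be associated with AD.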

Details Motivation: Most existing studies work on 2D MRI slices, whereas clinical practice relies on the full three-dimensional structure of the brain; volumetric analysis may better capture the spatial relationships among brain regions that are relevant to disease progression. Method: Builds a multimodal 3D CNN that fuses T1-weighted MRI with gray matter, white matter, and cerebrospinal fluid probability maps, evaluated with subject-level 5-fold cross-validation and supplemented by Grad-CAM visualization and slice-level control experiments. Result: On the OASIS-1 dataset, mean accuracy is 72.34% ± 4.66% and ROC AUC is 0.7781 ± 0.0365; Grad-CAM shows the model focusing on AD-related regions such as the medial temporal lobe and the ventricles. Conclusion: The multimodal 3D framework establishes a reproducible subject-level AD classification benchmark, supports the potential of volumetric MRI analysis for AD classification, and underlines how strongly data representation and evaluation strategy affect reported performance. Abstract: Deep learning has become an important tool for Alzheimer's disease (AD) classification from structural MRI. Many existing studies analyze individual 2D slices extracted from MRI volumes, while clinical neuroimaging practice typically relies on the full three-dimensional structure of the brain. From this perspective, volumetric analysis may better capture spatial relationships among brain regions that are relevant to disease progression. Motivated by this idea, this work proposes a multimodal 3D convolutional neural network for AD classification using raw OASIS-1 MRI volumes. The model combines structural T1 information with gray matter, white matter, and cerebrospinal fluid probability maps obtained through FSL FAST segmentation in order to capture complementary neuroanatomical information. The proposed approach is evaluated on the clinically labelled OASIS-1 cohort using 5-fold subject-level cross-validation, achieving a mean accuracy of 72.34% ± 4.66% and a ROC AUC of 0.7781 ± 0.0365. Grad-CAM visualizations further indicate that the model focuses on anatomically meaningful regions, including the medial temporal lobe and ventricular areas that are known to be associated with Alzheimer's-related structural changes. To better understand how data representation and evaluation strategies may influence reported performance, additional diagnostic experiments were conducted on a slice-based version of the dataset under both slice-level and subject-level protocols. These observations help provide context for the volumetric results. Overall, the proposed multimodal 3D framework establishes a reproducible subject-level benchmark and highlights the potential benefits of volumetric MRI analysis for Alzheimer's disease classification.
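
The subject-level protocol mentioned in the entry above can be illustrated with a small sketch. This is not the authors' code: the function name `subject_level_folds`, the round-robin assignment, and the toy subject IDs are all assumptions made for illustration. The point it shows is that all samples (for example, slices) from one subject land in the same fold, so no subject appears in both training and test data.

```python
import random
from collections import defaultdict


def subject_level_folds(subject_ids, n_folds=5, seed=0):
    """Group sample indices into folds so that no subject spans two folds.

    subject_ids[i] is the subject that sample i belongs to (hypothetical input).
    """
    by_subject = defaultdict(list)
    for idx, sid in enumerate(subject_ids):
        by_subject[sid].append(idx)

    # Shuffle subjects (not samples) so the split is made at subject level.
    subjects = sorted(by_subject)
    random.Random(seed).shuffle(subjects)

    # Round-robin assignment of whole subjects to folds.
    folds = [[] for _ in range(n_folds)]
    for i, sid in enumerate(subjects):
        folds[i % n_folds].extend(by_subject[sid])
    return folds


# Hypothetical toy data: 10 subjects with 3 slices each.
subject_ids = [f"sub-{i:02d}" for i in range(10) for _ in range(3)]
folds = subject_level_folds(subject_ids)
```

A slice-level split would instead shuffle individual samples, which lets slices of the same brain appear on both sides of the split; that is the kind of leakage a subject-level protocol is meant to rule out.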

[100] Symphony: A Cognitively-Inspired Multi-Agent System for Long-Video Understanding

Haiyang Yan,Hongyun Zhou,Peng Xu,Xiaoxue Feng,Mengyi Liu

Main category: cs.CV

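TL;DR: This paper proposes Symphony, a multi-agent system that decomposes long-video understanding by emulating human cognition patterns and adds a deep reasoning collaboration mechanism and a VLM-based grounding method, substantially improving long-video understanding performance.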

Details Motivation: Existing MLLM agents perform poorly on long-form video understanding (LVU), which combines high information density with long temporal spans; simple task decomposition and collaboration mechanisms are insufficient for long-chain reasoning, and embedding-based retrieval tends to lose key information. Method: Symphony uses fine-grained subtask decomposition, a reflection-enhanced deep reasoning collaboration mechanism, and a VLM-based method for assessing the relevance of video segments and grounding them. Result: Symphony reaches SOTA performance on LVBench, LongVideoBench, VideoMME, and MLVU, with a 5.0% improvement over the previous best method on LVBench. Conclusion: By modeling human-like cognition and coordinating multiple agents, Symphony alleviates the information-density and temporal-span challenges of long-video understanding and offers a new paradigm for complex video reasoning tasks. Abstract: Despite rapid developments and widespread applications of MLLM agents, they still struggle with long-form video understanding (LVU) tasks, which are characterized by high information density and extended temporal spans. Recent research on LVU agents demonstrates that simple task decomposition and collaboration mechanisms are insufficient for long-chain reasoning tasks. Moreover, directly reducing the time context through embedding-based retrieval may lose key information of complex problems. In this paper, we propose Symphony, a multi-agent system, to alleviate these limitations. By emulating human cognition patterns, Symphony decomposes LVU into fine-grained subtasks and incorporates a deep reasoning collaboration mechanism enhanced by reflection, effectively improving the reasoning capability. Additionally, Symphony provides a VLM-based grounding approach to analyze LVU tasks and assess the relevance of video segments, which significantly enhances the ability to locate complex problems with implicit intentions and large temporal spans. Experimental results show that Symphony achieves state-of-the-art performance on LVBench, LongVideoBench, VideoMME, and MLVU, with a 5.0% improvement over the prior state-of-the-art method on LVBench. Code is available at https://github.com/Haiyang0226/Symphony.

[101] Recurrent Reasoning with Vision-Language Models for Estimating Long-Horizon Embodied Task Progress

Yuelin Zhang,Sijie Cheng,Chen Li,Zongzhao Li,Yuxin Huang,Yang Liu,Wenbing Huang

Main category: cs.CV

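TL;DR: This paper proposes R²VLM, which uses a recurrent reasoning framework and Chain of Thought (CoT) to explicitly model task decomposition and completion status, enabling efficient and accurate progress estimation for long-horizon embodied tasks while balancing reasoning ability and computational cost.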

Details Motivation: Existing task progress estimation methods based on vision-language models (VLMs) overlook the models' complex reasoning ability, and processing long video trajectories is computationally expensive, which hinders real-world deployment. Method: Proposes the Recurrent Reasoning Vision-Language Model (R²VLM), which processes local video snippets recurrently and maintains global context through a continually evolving Chain of Thought (CoT) that explicitly records task decomposition, key steps, and completion status. Result: Trained on data from ALFRED and Ego4D, R²VLM sets a new SOTA in long-horizon task progress estimation and shows strong performance and generalization in downstream applications (policy learning, reward modeling for reinforcement learning, proactive assistance). Conclusion: R²VLM balances reasoning depth against computational efficiency and offers a new paradigm for long-horizon task understanding and planning in embodied intelligence. Abstract: Accurately estimating task progress is critical for embodied agents to plan and execute long-horizon, multi-step tasks. Despite promising advances, existing Vision-Language Model (VLM)-based methods primarily leverage their video understanding capabilities, while neglecting their complex reasoning potential. Furthermore, processing long video trajectories with VLMs is computationally prohibitive for real-world deployment. To address these challenges, we propose the Recurrent Reasoning Vision-Language Model ($\text{R}^2$VLM). Our model features a recurrent reasoning framework that processes local video snippets iteratively, maintaining a global context through an evolving Chain of Thought (CoT). This CoT explicitly records task decomposition, key steps, and their completion status, enabling the model to reason about complex temporal dependencies. This design avoids the high cost of processing long videos while preserving essential reasoning capabilities. We train $\text{R}^2$VLM on large-scale, automatically generated datasets from ALFRED and Ego4D. Extensive experiments on progress estimation and downstream applications, including progress-enhanced policy learning, reward modeling for reinforcement learning, and proactive assistance, demonstrate that $\text{R}^2$VLM achieves strong performance and generalization, achieving a new state-of-the-art in long-horizon task progress estimation. The models and benchmarks are publicly available on Hugging Face at https://huggingface.co/collections/zhangyuelin/r2vlm.

[102] A Proposal-Free Query-Guided Network for Grounded Multimodal Named Entity Recognition

Hongbing Li,Jiamin Liu,Shuo Zhang,Bo Xiao

Main category: cs.CV

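TL;DR: This paper proposes a proposal-free Query-Guided Network (QGN) that unifies multimodal reasoning and decoding through text guidance and cross-modal interaction, improving the accuracy of entity recognition and image region grounding in GMNER.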

Details Motivation: Existing GMNER methods depend on pre-trained general-purpose object detectors, which tend to miss the fine-grained image regions needed to match textual entities, causing cross-modal misalignment and degraded performance. Method: Proposes the proposal-free Query-Guided Network (QGN), which uses text queries to directly guide cross-modal interaction and joint decoding, avoiding a separate two-stage pipeline. Result: On widely used GMNER benchmarks, QGN achieves the best performance among the compared models, with accurate grounding and robust behavior in open-domain scenarios. Conclusion: Through end-to-end, text-guided cross-modal modeling, QGN mitigates the semantic misalignment between detectors and entities and provides a more unified and precise solution for GMNER. Abstract: Grounded Multimodal Named Entity Recognition (GMNER) identifies named entities, including their spans and types, in natural language text and grounds them to the corresponding regions in associated images. Most existing approaches split this task into two steps: they first detect objects using a pre-trained general-purpose detector and then match named entities to the detected objects. However, these methods face a major limitation. Because pre-trained general-purpose object detectors operate independently of textual entities, they tend to detect common objects and frequently overlook specific fine-grained regions required by named entities. This misalignment between object detectors and entities introduces imprecision and can impair overall system performance. In this paper, we propose a proposal-free Query-Guided Network (QGN) that unifies multimodal reasoning and decoding through text guidance and cross-modal interaction. QGN enables accurate grounding and robust performance in open-domain scenarios. Extensive experiments demonstrate that QGN achieves top performance among compared GMNER models on widely used benchmarks.

[103] MedSAD-CLIP: Supervised CLIP with Token-Patch Cross-Attention for Medical Anomaly Detection and Segmentation

Thuy Truong Tran,Minh Kha Do,Phuc Nguyen Duy,Min Hun Lee

Main category: cs.CV

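TL;DR: This paper proposes MedSAD-CLIP, a supervised CLIP adaptation for medical anomaly detection and segmentation that, with limited labeled data, combines fine-grained text-visual cues via Token-Patch Cross-Attention (TPCA), lightweight image adapters, learnable prompt tokens, and a margin-based contrastive loss, substantially improving lesion localization and segmentation accuracy.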

Details Motivation: Existing CLIP-based medical anomaly detection methods rely on global representations and weak supervision, which leads to coarse localization and poor segmentation quality; yet clinical settings often provide a small but meaningful amount of labeled abnormal data that should be exploited. Method: MedSAD-CLIP introduces Token-Patch Cross-Attention (TPCA) to model fine-grained text-image alignment, uses lightweight image adapters and learnable prompt tokens to adapt the pretrained CLIP encoder to the medical domain efficiently, and adds a margin-based image-text contrastive loss to sharpen the separation between normal and abnormal representations. Result: On four medical datasets (Brain, Retina, Lung, and Breast), both pixel-level segmentation and image-level classification exceed existing SOTA methods. Conclusion: Supervised CLIP adaptation is a unified and scalable paradigm for medical anomaly understanding that combines accurate localization and segmentation with strong generalization. Abstract: Medical anomaly detection (MAD) and segmentation play a critical role in assisting clinical diagnosis by identifying abnormal regions in medical images and localizing pathological regions. Recent CLIP-based studies are promising for anomaly detection in zero-/few-shot settings, but they typically rely on global representations and weak supervision, often producing coarse localization and limited segmentation quality. In this work, we study supervised adaptation of CLIP for MAD under a realistic clinical setting where a limited yet meaningful amount of labeled abnormal data is available. Our model MedSAD-CLIP leverages fine-grained text-visual cues via the Token-Patch Cross-Attention (TPCA) to improve lesion localization while preserving the generalization capability of CLIP representations. Lightweight image adapters and learnable prompt tokens efficiently adapt the pretrained CLIP encoder to the medical domain while preserving its rich semantic alignment. Furthermore, a Margin-based image-text Contrastive Loss is designed to enhance global feature discrimination between normal and abnormal representations. Extensive experiments on four diverse benchmarks (Brain, Retina, Lung, and Breast datasets) demonstrate the effectiveness of our approach, achieving superior performance in both pixel-level segmentation and image-level classification over state-of-the-art methods. Our results highlight the potential of supervised CLIP adaptation as a unified and scalable paradigm for medical anomaly understanding. Code will be made available at https://github.com/thuy4tbn99/MedSAD-CLIP

[104] FineViT: Progressively Unlocking Fine-Grained Perception with Dense Recaptions

Peisen Zhao,Xiaopeng Zhang,Mingxing Xu,Ruoyu Sun,Zewei Du,Dunzheng Wang,Guanghao Zheng,Haohang Xu,Zhibo Zhang,Yuhang Zhang,Yi Ai,Lin Liu,Qi Tian

Main category: cs.CV

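TL;DR: This paper proposes FineViT, a vision encoder trained with a progressive strategy of high-resolution training followed by LLM alignment, improving the performance of MLLMs on fine-grained visual perception tasks.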

Details Motivation: Existing CLIP-based vision encoders lose detail in dense spatial tasks because of low-resolution pretraining and noisy web-crawled image-text pairs, making them a bottleneck for MLLM performance. Method: FineViT is first trained from scratch at high resolution on billions of globally recaptioned image-text pairs, and is then aligned with an LLM on FineCap-450M, a dataset of over 450 million high-quality local captions, to strengthen local perception. Result: FineViT reaches SOTA zero-shot recognition and retrieval, especially long-context retrieval, and when integrated into MLLMs it consistently outperforms multimodal visual encoders such as SigLIP2 and Qwen-ViT. Conclusion: FineViT provides a strong new baseline for fine-grained visual perception and may help advance the visual understanding of MLLMs. Abstract: While Multimodal Large Language Models (MLLMs) have experienced rapid advancements, their visual encoders frequently remain a performance bottleneck. Conventional CLIP-based encoders struggle with dense spatial tasks due to the loss of visual details caused by low-resolution pretraining and the reliance on noisy, coarse web-crawled image-text pairs. To overcome these limitations, we introduce FineViT, a novel vision encoder specifically designed to unlock fine-grained perception. By replacing coarse web data with dense recaptions, we systematically mitigate information loss through a progressive training paradigm: first, the encoder is trained from scratch at a high native resolution on billions of global recaptioned image-text pairs, establishing a robust, detail-rich semantic foundation. Subsequently, we further enhance its local perception through LLM alignment, utilizing our curated FineCap-450M dataset that comprises over 450 million high-quality local captions. Extensive experiments validate the effectiveness of the progressive strategy. FineViT achieves state-of-the-art zero-shot recognition and retrieval performance, especially in long-context retrieval, and consistently outperforms multimodal visual encoders such as SigLIP2 and Qwen-ViT when integrated into MLLMs. We hope FineViT could serve as a powerful new baseline for fine-grained visual perception.

[105] EvoGuard: An Extensible Agentic RL-based Framework for Practical and Evolving AI-Generated Image Detection

Chenyang Zhu,Maorong Wang,Jun Liu,Ching-Chun Chang,Isao Echizen

Main category: cs.CV

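TL;DR: This paper proposes EvoGuard, an agentic AIGI detection framework that dynamically coordinates multiple existing detectors (MLLM and non-MLLM) through autonomous planning, reflection, and multi-turn reasoning, reaching SOTA performance with only low-cost binary labels and supporting plug-and-play extension.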

Details Motivation: The spread of AI-generated images (AIGIs) creates serious misinformation risks. Traditional methods rely on low-level features, while recent MLLM-based methods generalize better but extend poorly and need expensive annotations, so they struggle with complex, changing real-world conditions. Method: EvoGuard wraps multiple SOTA detectors as callable tools and has an agent with planning and reflection abilities coordinate them through a capability-aware dynamic orchestration mechanism; the agent is optimized with GRPO-based agentic reinforcement learning using only binary labels. Result: Reaches SOTA accuracy on multiple benchmarks, mitigates the bias between positive and negative samples, and supports training-free, plug-and-play integration of new detectors to raise overall performance. Conclusion: EvoGuard offers an efficient, practical, and continually evolvable paradigm for AIGI detection that moves beyond single-model limits while balancing performance, generalization, and deployment flexibility. Abstract: The rapid proliferation of AI-Generated Images (AIGIs) has introduced severe risks of misinformation, making AIGI detection a critical yet challenging task. While traditional detection paradigms mainly rely on low-level features, recent research increasingly focuses on leveraging the general understanding ability of Multimodal Large Language Models (MLLMs) to achieve better generalization, but these methods still suffer from limited extensibility and expensive training data annotations. To better address complex and dynamic real-world environments, we propose EvoGuard, a novel agentic framework for AIGI detection. It encapsulates various state-of-the-art (SOTA) off-the-shelf MLLM and non-MLLM detectors as callable tools, and coordinates them through a capability-aware dynamic orchestration mechanism. Empowered by the agent's capacities for autonomous planning and reflection, it intelligently selects suitable tools for given samples, reflects on intermediate results, and decides the next action, reaching a final conclusion through multi-turn invocation and reasoning. This design effectively exploits the complementary strengths among heterogeneous detectors, transcending the limits of any single model. Furthermore, optimized by a GRPO-based Agentic Reinforcement Learning algorithm using only low-cost binary labels, it eliminates the reliance on fine-grained annotations. Extensive experiments demonstrate that EvoGuard achieves SOTA accuracy while mitigating the bias between positive and negative samples. More importantly, it allows the plug-and-play integration of new detectors to boost overall performance in a train-free manner, offering a highly practical, long-term solution to ever-evolving AIGI threats. Source code will be publicly available upon acceptance.

[106] OnlineHMR: Video-based Online World-Grounded Human Mesh Recovery

Yiwen Zhao,Ce Zheng,Yufu Wang,Hsueh-Han Daniel Yang,Liting Wen,Laszlo A. Jeni

Main category: cs.CV

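TL;DR: This paper proposes OnlineHMR, a fully online human mesh recovery framework that satisfies four requirements of online processing (causality, faithfulness, temporal consistency, and efficiency) and supports real-time interactive scenarios such as AR/VR.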

Details Motivation: Most existing HMR methods are offline and depend on future frames or global optimization, which makes them hard to use in scenarios that need real-time feedback, such as AR/VR and telepresence. Method: Proposes a two-branch architecture that enables streaming inference through a causal key-value cache design and a sliding-window learning strategy, and introduces a human-centric incremental SLAM for online world-coordinate alignment with physically plausible trajectory correction. Result: On the EMDB benchmark and on highly dynamic custom videos, performance is comparable to existing chunk-based methods, while uniquely supporting fully online processing. Conclusion: OnlineHMR delivers online reconstruction of world-coordinate human motion that is high-quality, low-latency, and physically plausible, extending the reach of HMR in real-time interactive systems. Abstract: Human mesh recovery (HMR) models the 3D human body from monocular videos, with recent works extending it to world-coordinate human trajectory and motion reconstruction. However, most existing methods remain offline, relying on future frames or global optimization, which limits their applicability in interactive feedback and perception-action loop scenarios such as AR/VR and telepresence. To address this, we propose OnlineHMR, a fully online framework that jointly satisfies four essential criteria of online processing, including system-level causality, faithfulness, temporal consistency, and efficiency. Built upon a two-branch architecture, OnlineHMR enables streaming inference via a causal key-value cache design and a curated sliding-window learning strategy. Meanwhile, a human-centric incremental SLAM provides online world-grounded alignment under physically plausible trajectory correction. Experimental results show that our method achieves performance comparable to existing chunk-based approaches on the standard EMDB benchmark and highly dynamic custom videos, while uniquely supporting online processing. Page and code are available at https://tsukasane.github.io/Video-OnlineHMR/.

[107] A 3D Reconstruction Benchmark for Asset Inspection

James L. Gray,Nikolai Goncharov,Alexandre Cardaillac,Ryan Griffiths,Jack Naylor,Donald G. Dansereau

Main category: cs.CV

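TL;DR: This paper presents a new synthetic 3D reconstruction dataset for asset inspection, comprising three scenes with ground truth depth maps, camera poses, and mesh models, and evaluates the bottlenecks of current SOTA methods under dense trajectories and non-Lambertian surfaces.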

Details Motivation: Existing 3D reconstruction datasets lack samples with the real conditions of asset inspection, such as high image overlap, millimetre-scale detail, and complex appearance (reflections, transparency), which makes it hard to benchmark methods for this task. Method: Builds a synthetic dataset with ground truth depth maps, camera poses, and mesh models that simulates inspection flight trajectories and varying surface conditions on non-Lambertian materials, and systematically evaluates mainstream 3D reconstruction methods on it. Result: Current SOTA methods degrade significantly on dense image sequences and complex surface conditions, exposing a scalability bottleneck. Conclusion: The dataset reveals key challenges for 3D reconstruction in practical asset inspection and provides new research directions and a reliable benchmark for follow-up work. Abstract: Asset management requires accurate 3D models to inform the maintenance, repair, and assessment of buildings, maritime vessels, and other key structures as they age. These downstream applications rely on high-fidelity models produced from aerial surveys in close proximity to the asset, enabling operators to locate and characterise deterioration or damage and plan repairs. Captured images typically have high overlap between adjacent camera poses, sufficient detail at millimetre scale, and challenging visual appearances such as reflections and transparency. However, existing 3D reconstruction datasets lack examples of these conditions, making it difficult to benchmark methods for this task. We present a new dataset with ground truth depth maps, camera poses, and mesh models of three synthetic scenes with simulated inspection trajectories and varying levels of surface condition on non-Lambertian scene content. We evaluate state-of-the-art reconstruction methods on this dataset. Our results demonstrate that current approaches struggle significantly with the dense capture trajectories and complex surface conditions inherent to this domain, exposing a critical scalability gap and pointing toward new research directions for deployable 3D reconstruction in asset inspection. Project page: https://roboticimaging.org/Projects/asset-inspection-dataset/

[108] UniSAFE: A Comprehensive Benchmark for Safety Evaluation of Unified Multimodal Models

Segyu Lee,Boryeong Cho,Hojung Jung,Seokhyun An,Juhyeong Kim,Jaehyun Kwak,Yongjin Yang,Sangwon Jang,Youngrok Park,Wonjun Chang,Se-Young Yun

Main category: cs.CV

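TL;DR: This paper introduces UniSAFE, the first system-level safety evaluation benchmark for unified multimodal models (UMMs), covering 7 I/O modality combinations with 6,802 instances; it is used to evaluate the safety vulnerabilities of 15 SOTA UMMs and reveals pronounced risks in settings such as multi-image composition and multi-turn interaction.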

Details Motivation: Existing safety benchmarks are fragmented and lack system-level evaluation across tasks and modalities, so they cannot fully measure the complex safety risks of UMMs. Method: Builds the UniSAFE benchmark with a shared-target design, covering 7 I/O modality combinations and a range of risk scenarios, and uses 6,802 curated instances to systematically evaluate the safety of 15 mainstream UMMs. Result: Current UMMs show serious safety vulnerabilities, with higher violation rates in multi-image composition and multi-turn settings; image-output tasks fail on safety more often than text-output tasks. Conclusion: UMMs need stronger system-level safety alignment, and UniSAFE provides a standardized evaluation tool and empirical basis for that work. Abstract: Unified Multimodal Models (UMMs) offer powerful cross-modality capabilities but introduce new safety risks not observed in single-task models. Despite their emergence, existing safety benchmarks remain fragmented across tasks and modalities, limiting the comprehensive evaluation of complex system-level vulnerabilities. To address this gap, we introduce UniSAFE, the first comprehensive benchmark for system-level safety evaluation of UMMs across 7 I/O modality combinations, spanning conventional tasks and novel multimodal-context image generation settings. UniSAFE is built with a shared-target design that projects common risk scenarios across task-specific I/O configurations, enabling controlled cross-task comparisons of safety failures. UniSAFE comprises 6,802 curated instances, which we use to evaluate 15 state-of-the-art UMMs, both proprietary and open-source. Our results reveal critical vulnerabilities across current UMMs, including elevated safety violations in multi-image composition and multi-turn settings, with image-output tasks consistently more vulnerable than text-output tasks. These findings highlight the need for stronger system-level safety alignment for UMMs. Our code and data are publicly available at https://github.com/segyulee/UniSAFE

[109] MCoT-MVS: Multi-level Vision Selection by Multi-modal Chain-of-Thought Reasoning for Composed Image Retrieval

Xuri Ge,Chunhao Wang,Xindi Wang,Zheyun Qin,Zhumin Chen,Xin Xin

Main category: cs.CV

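TL;DR: This paper proposes MCoT-MVS, which uses chain-of-thought reasoning from a multimodal large language model to generate textual cues that guide multi-level visual attention selection, and aligns the composed query with target images through a weighted hierarchical combination module, reaching SOTA performance on the CIRR and FashionIQ datasets.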

Details Motivation: Existing CIR methods struggle to extract from the reference image the semantic cues that match the intent of the textual modification, and are easily disturbed by irrelevant visual noise. Method: The multi-level vision selection method MCoT-MVS has three parts: 1) a multimodal large language model (MLLM) performs chain-of-thought reasoning on the composed image-text input and generates retained, removed, and target-inferred texts; 2) guided by these textual cues, two reference-image visual attention selection modules extract discriminative patch-level and instance-level semantics; 3) a weighted hierarchical combination module fuses the multi-granular visual cues with the modified text and the target description, aligning the composed query with target images in a unified embedding space. Result: Clearly outperforms existing methods on the two CIR benchmarks CIRR and FashionIQ, setting a new SOTA. Code and trained models are publicly released. Conclusion: By combining multi-level visual selection guided by multimodal chain-of-thought reasoning with hierarchical fusion, MCoT-MVS improves the accuracy and robustness of image-text semantic alignment in composed image retrieval. Abstract: Composed Image Retrieval (CIR) aims to retrieve target images based on a reference image and modified texts. However, existing methods often struggle to extract the correct semantic cues from the reference image that best reflect the user's intent under textual modification prompts, resulting in interference from irrelevant visual noise. In this paper, we propose a novel Multi-level Vision Selection by Multi-modal Chain-of-Thought Reasoning (MCoT-MVS) for CIR, integrating attention-aware multi-level vision features guided by reasoning cues from a multi-modal large language model (MLLM). Specifically, we leverage an MLLM to perform chain-of-thought reasoning on the multimodal composed input, generating the retained, removed, and target-inferred texts. These textual cues subsequently guide two reference visual attention selection modules to selectively extract discriminative patch-level and instance-level semantics from the reference image. Finally, to effectively fuse these multi-granular visual cues with the modified text and the imagined target description, we design a weighted hierarchical combination module to align the composed query with target images in a unified embedding space. Extensive experiments on two CIR benchmarks, namely CIRR and FashionIQ, demonstrate that our approach consistently outperforms existing methods and achieves new state-of-the-art performance. Code and trained models are publicly released.

[110] Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models

Kevin Qu,Haozhe Qi,Mihai Dusmanu,Mahdi Rad,Rui Wang,Marc Pollefeys

Main category: cs.CV

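TL;DR: This paper proposes Loc3R-VLM, a framework that uses two joint objectives, global layout reconstruction and explicit situation modeling, to give 2D vision-language models the ability to understand 3D space from monocular video, with lightweight camera pose priors ensuring geometric consistency; it substantially improves language-based localization and 3D question answering.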

Details Motivation: Existing multimodal large language models remain weak at spatial understanding and viewpoint-aware reasoning, and their 3D understanding needs to be strengthened. Method: The Loc3R-VLM framework combines two joint objectives, global layout reconstruction and explicit situation modeling, with lightweight camera pose priors from a pre-trained 3D foundation model to ensure geometric consistency and metric-scale alignment. Result: Reaches SOTA performance on language-based localization and on situated and general 3D question-answering benchmarks, outperforming existing 2D- and video-based methods. Conclusion: The proposed spatial supervision framework enables 2D vision-language models to achieve strong 3D understanding. Abstract: Multimodal Large Language Models (MLLMs) have made impressive progress in connecting vision and language, but they still struggle with spatial understanding and viewpoint-aware reasoning. Recent efforts aim to augment the input representations with geometric cues rather than explicitly teaching models to reason in 3D space. We introduce Loc3R-VLM, a framework that equips 2D Vision-Language Models with advanced 3D understanding capabilities from monocular video input. Inspired by human spatial cognition, Loc3R-VLM relies on two joint objectives: global layout reconstruction to build a holistic representation of the scene structure, and explicit situation modeling to anchor egocentric perspective. These objectives provide direct spatial supervision that grounds both perception and language in a 3D context. To ensure geometric consistency and metric-scale alignment, we leverage lightweight camera pose priors extracted from a pre-trained 3D foundation model. Loc3R-VLM achieves state-of-the-art performance in language-based localization and outperforms existing 2D- and video-based approaches on situated and general 3D question-answering benchmarks, demonstrating that our spatial supervision framework enables strong 3D understanding. Project page: https://kevinqu7.github.io/loc3r-vlm

[111] Material Magic Wand: Material-Aware Grouping of 3D Parts in Untextured Meshes

Umangi Jain,Vladimir Kim,Matheus Gadelha,Igor Gilitschenski,Zhiqin Chen

Main category: cs.CV

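TL;DR: This paper proposes Material Magic Wand, a tool that uses a material-aware part encoder to automatically identify and group 3D mesh parts that share a material but differ in geometry, substantially speeding up material assignment.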

Details Motivation: Many real-world shapes (such as pinecone scales or building windows) contain repeated structures that share a material but vary in geometry, and identifying and selecting these parts one by one is slow and laborious. Method: Proposes a material-aware part encoder that produces part embeddings accounting for both local geometry and global context; it is trained with a supervised contrastive loss that pulls embeddings of same-material parts together and pushes those of different materials apart, so that grouping reduces to retrieving parts whose embeddings are close to that of the selected part. Result: The method is validated on a curated dataset of 100 shapes with 241 part-level queries, and its practical value is shown in an interactive material assignment application. Conclusion: Material Magic Wand performs material-aware part grouping on untextured meshes efficiently and accurately, giving 3D content creators a practical, user-friendly tool. Abstract: We introduce the problem of material-aware part grouping in untextured meshes. Many real-world shapes, such as scales of pinecones or windows of buildings, contain repeated structures that share the same material but exhibit geometric variations. When assigning materials to such meshes, these repeated parts often require piece-by-piece manual identification and selection, which is tedious and time-consuming. To address this, we propose Material Magic Wand, a tool that allows artists to select part groups based on their estimated material properties -- when one part is selected, our algorithm automatically retrieves all other parts likely to share the same material. The key component of our approach is a part encoder that generates a material-aware embedding for each 3D part, accounting for both local geometry and global context. We train our model with a supervised contrastive loss that brings embeddings of material-consistent parts closer while separating those of different materials; therefore, part grouping can be achieved by retrieving embeddings that are close to the embedding of the selected part. To benchmark this task, we introduce a curated dataset of 100 shapes with 241 part-level queries. We verify the effectiveness of our method through extensive experiments and demonstrate its practical value in an interactive material assignment application.
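
The retrieval step described in the entry above (select one part, return every part whose embedding is close to it) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embeddings are made-up 2D vectors, and the use of cosine similarity, the function name `select_same_material`, and the `threshold` value are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_same_material(embeddings, query_idx, threshold=0.9):
    """Return indices of parts whose embedding is close to the selected part.

    embeddings: one vector per mesh part (hypothetical encoder output).
    threshold:  assumed similarity cutoff; the source does not specify one.
    """
    query = embeddings[query_idx]
    return [
        i
        for i, emb in enumerate(embeddings)
        if i != query_idx and cosine(query, emb) >= threshold
    ]


# Toy embeddings: parts 0, 1, and 3 point in nearly the same direction.
parts = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.95, 0.05]]
group = select_same_material(parts, query_idx=0)
```

In this toy case the query part 0 retrieves parts 1 and 3, while part 2, whose embedding is orthogonal, is left out. A real tool would presumably let the artist adjust the cutoff interactively.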

Zhihua Wei,Qiang Li,Jian Ruan,Zhenxin Qin,Leilei Wen,Dongrui Liu,Wen Shen

Main category: cs.CV

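TL;DR: This paper finds that the safety alignment of large vision-language models (VLMs) weakens once an image is added, and that the root cause is not a failure to recognize harmful intent but an image-induced shift of representations along a specific "jailbreak direction"; it accordingly proposes a defense that removes this jailbreak-related shift at inference time (JRS-Rem), improving safety without harming performance on benign tasks.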

Details Motivation: The safety alignment of large vision-language models (VLMs) degrades once the visual modality is introduced, and images can markedly raise jailbreak success rates; the underlying mechanism needs to be understood and an effective defense proposed. Method: By analyzing the internal representation space of VLMs, the authors identify a "jailbreak direction" that separates jailbreak samples from refusal samples, define the jailbreak-related shift (JRS) as the component of the image-induced representation shift along this direction, and design JRS-Rem to remove that shift at inference time. Result: Experiments show that JRS-Rem provides strong defense across multiple jailbreak scenarios while preserving performance on benign tasks. Conclusion: VLM jailbreaks do not stem from a failure to recognize harmful intent but from visual input shifting representations toward a specific jailbreak state; removing this shift is an effective way to improve VLM safety. Abstract: Large vision-language models (VLMs) often exhibit weakened safety alignment with the integration of the visual modality. Even when text prompts contain explicit harmful intent, adding an image can substantially increase jailbreak success rates. In this paper, we observe that VLMs can clearly distinguish benign inputs from harmful ones in their representation space. Moreover, even among harmful inputs, jailbreak samples form a distinct internal state that is separable from refusal samples. These observations suggest that jailbreaks do not arise from a failure to recognize harmful intent. Instead, the visual modality shifts representations toward a specific jailbreak state, thereby leading to a failure to trigger refusal. To quantify this transition, we identify a jailbreak direction and define the jailbreak-related shift as the component of the image-induced representation shift along this direction. Our analysis shows that the jailbreak-related shift reliably characterizes jailbreak behavior, providing a unified explanation for diverse jailbreak scenarios. Finally, we propose a defense method that enhances VLM safety by removing the jailbreak-related shift (JRS-Rem) at inference time. Experiments show that JRS-Rem provides strong defense across multiple scenarios while preserving performance on benign tasks.
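
The definition in the entry above (the jailbreak-related shift is the component of the image-induced representation shift along a jailbreak direction) corresponds to a vector projection, which the following sketch makes concrete. It is only an illustration under assumptions: the function name, the toy vectors, and the choice of the text-only representation as the reference point are hypothetical, and how the paper estimates the direction or which layer it operates on is not stated in this listing.

```python
import math


def remove_jailbreak_shift(h_with_image, h_text_only, direction):
    """Subtract the component of the image-induced shift along `direction`.

    h_with_image: representation for the text+image input (hypothetical).
    h_text_only:  representation for the text-only input (assumed reference).
    direction:    jailbreak direction, assumed to be given; normalized here.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]

    # Image-induced shift and its scalar projection onto the direction.
    shift = [a - b for a, b in zip(h_with_image, h_text_only)]
    coeff = sum(s * u for s, u in zip(shift, unit))

    # Remove only the projected component; the rest of the shift is kept.
    return [a - coeff * u for a, u in zip(h_with_image, unit)]


# Toy 3D example: the shift is [2.0, 0.5, 1.0] and the direction is the x-axis.
h_text = [1.0, 2.0, 0.0]
h_img = [3.0, 2.5, 1.0]
h_clean = remove_jailbreak_shift(h_img, h_text, direction=[2.0, 0.0, 0.0])
```

After the projection is removed, the representation no longer differs from the text-only one along the chosen direction, while the orthogonal part of the image-induced shift is untouched, which is consistent with the stated aim of preserving behavior on benign tasks.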

[113] Shot-Aware Frame Sampling for Video Understanding

Mengyu Zhao,Di Fu,Yongyu Xie,Jiaxing Zhang,Zhigang Yuan,Shirin Jalali,Yong Cao

Main category: cs.CV

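TL;DR: This paper proposes InfoShot, a task-agnostic, shot-aware video frame sampler that uses semantic shot partitioning and an information-theoretic objective to select representative and anomalous keyframes, improving the capture of key events in long-video understanding, and introduces the synthetic benchmark SynFlash for evaluation.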

Details Motivation: When the frame budget is tight, existing video frame samplers struggle to balance global coverage with brief but critical events, which makes downstream predictions unreliable. Method: InfoShot first partitions the video into semantically consistent shots and then selects two complementary keyframes per shot, one representing the main content and one capturing unusual within-shot changes; the selection is guided by an information-theoretic objective that accounts for both shot structure and sparse deviations. A synthetic benchmark, SynFlash, is built to evaluate sub-second anomalies. Result: Under frame-count constraints, InfoShot improves anomaly hit rate and Video-QA accuracy, and it matches or outperforms strong baselines on standard video understanding benchmarks. Conclusion: InfoShot is an efficient frame sampler that needs no retraining and generalizes well, balancing overall context modeling in long videos with retention of brief critical events. Abstract: Video frame sampling is essential for efficient long-video understanding with Vision-Language Models (VLMs), since dense inputs are costly and often exceed context limits. Yet when only a small number of frames can be retained, existing samplers often fail to balance broad video coverage with brief but critical events, which can lead to unreliable downstream predictions. To address this issue, we present InfoShot, a task-agnostic, shot-aware frame sampler for long-video understanding. InfoShot first partitions a video into semantically consistent shots, and then selects two complementary keyframes from each shot: one to represent the main content and one to capture unusual within-shot changes. This design is guided by an information-theoretic objective that encourages the sampled set to retain high information about both shot structure and sparse within-shot deviations. In this way, it improves the chance of preserving both overall video context and short decision-critical moments without requiring any retraining. To better evaluate such short-lived events, we further introduce SynFlash, a synthetic benchmark with controllable sub-second anomaly patterns and frame-level ground truth, and we also evaluate InfoShot on existing anomaly datasets and general video understanding tasks. Experiments show that InfoShot improves anomaly hit rate and downstream Video-QA accuracy under frame number constraints, while matching or outperforming strong baselines on standard video understanding benchmarks.
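
The two-keyframes-per-shot idea in the entry above can be illustrated with a toy sketch. Everything here is an assumption for illustration: the shot boundaries are taken as given, the per-frame features are made-up scalars, and distance to the shot mean is used as a simple stand-in for the paper's information-theoretic objective, which this listing does not spell out.

```python
def pick_keyframes(features, shots):
    """For each shot, pick a representative frame and an outlier frame.

    features: one feature vector per frame (hypothetical embeddings).
    shots:    list of (start, end) frame ranges, end exclusive (assumed given).
    Returns a list of (representative_idx, outlier_idx) pairs, one per shot.
    """
    picks = []
    for start, end in shots:
        frames = range(start, end)
        dim = len(features[start])

        # Mean feature of the shot, used as a proxy for its main content.
        mean = [sum(features[i][k] for i in frames) / len(frames) for k in range(dim)]

        def dist(i):
            # Squared Euclidean distance of frame i to the shot mean.
            return sum((features[i][k] - mean[k]) ** 2 for k in range(dim))

        representative = min(frames, key=dist)  # closest to the shot mean
        outlier = max(frames, key=dist)         # largest within-shot deviation
        picks.append((representative, outlier))
    return picks


# Toy 1D features: frame 3 is a brief spike inside the first shot.
features = [[0.0], [0.1], [0.0], [5.0], [0.1], [2.0], [2.1], [2.0]]
shots = [(0, 5), (5, 8)]
picks = pick_keyframes(features, shots)
```

With uniform sampling at the same budget, a one-frame spike like frame 3 is easy to miss; selecting an explicit outlier per shot is one simple way to keep such moments in the sampled set.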

[114] Stereo World Model: Camera-Guided Stereo Video Generation

Yang-Tian Sun,Zehuan Huang,Yifan Niu,Lin Ma,Yan-Pei Cao,Yuewen Ma,Xiaojuan Qi

Main category: cs.CV

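TL;DR: StereoWorld is a camera-conditioned stereo world model that generates stereo video end to end in the RGB modality; by introducing a camera-aware RoPE and a stereo-aware attention decomposition, it improves stereo consistency, disparity accuracy, and camera-motion fidelity, and supports VR rendering and embodied policy learning.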

Details Motivation: Existing methods mostly generate monocular video and then convert it to stereo, or require extra depth estimation or inpainting, and do not model binocular geometry directly; StereoWorld aims to learn appearance and stereo geometry jointly, end to end, from RGB input alone. Method: Two core designs: (1) a unified camera-frame RoPE (rotary positional encoding) that enables camera-aware latent modeling consistent across views and time; (2) a stereo-aware attention decomposition that factors 4D attention into 3D intra-view attention plus horizontal row attention, using the epipolar prior to model disparity-aligned correspondences efficiently. Result: Across benchmarks, StereoWorld clearly outperforms monocular-then-convert pipelines in stereo consistency, disparity accuracy, and camera-motion fidelity, with more than 3x faster generation and a 5% gain in viewpoint consistency; it also supports end-to-end binocular VR rendering without depth estimation, metric-scale depth grounding for embodied policy learning, and long-video distillation. Conclusion: StereoWorld shows that binocular geometry and appearance can be modeled end to end from RGB input alone, offering a new paradigm for efficient, high-fidelity stereo video generation and downstream applications such as VR and robotics. Abstract: We present StereoWorld, a camera-conditioned stereo world model that jointly learns appearance and binocular geometry for end-to-end stereo video generation. Unlike monocular RGB or RGBD approaches, StereoWorld operates exclusively within the RGB modality, while simultaneously grounding geometry directly from disparity. To efficiently achieve consistent stereo generation, our approach introduces two key designs: (1) a unified camera-frame RoPE that augments latent tokens with camera-aware rotary positional encoding, enabling relative, view- and time-consistent conditioning while preserving pretrained video priors via a stable attention initialization; and (2) a stereo-aware attention decomposition that factors full 4D attention into 3D intra-view attention plus horizontal row attention, leveraging the epipolar prior to capture disparity-aligned correspondences with substantially lower compute. Across benchmarks, StereoWorld improves stereo consistency, disparity accuracy, and camera-motion fidelity over strong monocular-then-convert pipelines, achieving more than 3x faster generation with an additional 5% gain in viewpoint consistency. Beyond benchmarks, StereoWorld enables end-to-end binocular VR rendering without depth estimation or inpainting, enhances embodied policy learning through metric-scale depth grounding, and is compatible with long-video distillation for extended interactive stereo synthesis.

[115] VisionNVS: Self-Supervised Inpainting for Novel View Synthesis under the Virtual-Shift Paradigm

Hongbo Lu,Liang Yao,Chenghao He,Fan Liu,Wenlong Liao,Tao He,Pai Peng

Main category: cs.CV

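TL;DR: This paper proposes VisionNVS, a camera-only novel view synthesis framework. A Virtual-Shift strategy recasts view synthesis as a self-supervised inpainting task, and Pseudo-3D Seam Synthesis improves spatial consistency, yielding better geometric fidelity and visual quality without relying on LiDAR.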

Details Motivation: Novel view synthesis (NVS) for autonomous driving faces a fundamental bottleneck: during training there are no ground-truth images to supervise the novel trajectories. Method: The VisionNVS framework (1) adopts a "Virtual-Shift" strategy that uses monocular depth estimates to simulate occlusion patterns and map them onto the original view, turning NVS into a self-supervised inpainting task; and (2) introduces a Pseudo-3D Seam Synthesis strategy that integrates visual data from adjacent cameras during training to model photometric discrepancies and calibration errors. Result: Experiments show that VisionNVS outperforms LiDAR-dependent baselines in geometric fidelity and visual quality and is suitable for scalable driving simulation. Conclusion: By eliminating the supervision domain gap and strengthening spatial consistency, VisionNVS achieves high-performance camera-only NVS and offers a robust solution for LiDAR-free driving simulation. Abstract: A fundamental bottleneck in Novel View Synthesis (NVS) for autonomous driving is the inherent supervision gap on novel trajectories: models are tasked with synthesizing unseen views during inference, yet lack ground truth images for these shifted poses during training. In this paper, we propose VisionNVS, a camera-only framework that fundamentally reformulates view synthesis from an ill-posed extrapolation problem into a self-supervised inpainting task. By introducing a "Virtual-Shift" strategy, we use monocular depth proxies to simulate occlusion patterns and map them onto the original view. This paradigm shift allows the use of raw, recorded images as pixel-perfect supervision, effectively eliminating the domain gap inherent in previous approaches. Furthermore, we address spatial consistency through a Pseudo-3D Seam Synthesis strategy, which integrates visual data from adjacent cameras during training to explicitly model real-world photometric discrepancies and calibration errors. Experiments demonstrate that VisionNVS achieves superior geometric fidelity and visual quality compared to LiDAR-dependent baselines, offering a robust solution for scalable driving simulation.

[116] Toward Phonology-Guided Sign Language Motion Generation: A Diffusion Baseline and Conditioning Analysis

Rui Hong,Jana Kosecka

Main category: cs.CV

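TL;DR: This paper proposes a diffusion-based method for 3D sign language motion generation conditioned on ASL-LEX 2.0 phonological attributes (such as hand shape, location, and movement), and systematically compares CLIP and T5 text encoders and different conditioning input formats. Mapping symbolic phonological annotations to natural language is essential for CLIP, whereas T5 is more robust; the best model outperforms the existing state-of-the-art method SignAvatar on all metrics.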

Details Motivation: Generating natural, accurate, and visually smooth 3D sign language avatar motion from text remains very challenging, and effective modeling of and conditioning on phonological features is particularly lacking. Method: An MDM-style diffusion model with SMPL-X representation is built as a strong baseline; the influence of the text encoder (CLIP vs. T5), the conditioning mode (gloss-only vs. gloss plus phonological attributes), and the attribute notation format (symbolic vs. natural language) is studied systematically; a mapping from ASL-LEX symbols to natural language is designed specifically to suit CLIP. Result: The diffusion baseline already outperforms SignAvatar (a CVAE method) on gloss discriminability metrics; CLIP with mapped phonological attributes outperforms SignAvatar across all metrics; T5 is insensitive to the attribute notation format, whereas CLIP needs the natural-language mapping to use phonological information effectively. Conclusion: Input representation is a critical factor for text-encoder-based attribute conditioning; structured conditioning, in which gloss and phonological attributes are encoded through independent pathways, should be used to improve sign language motion generation. Abstract: Generating natural, correct, and visually smooth 3D avatar sign language motion conditioned on text inputs continues to be very challenging. In this work, we train a generative model of 3D body motion and explore the role of phonological attribute conditioning for sign language motion generation, using ASL-LEX 2.0 annotations such as hand shape, hand location and movement. We first establish a strong diffusion baseline using an MDM-style human motion diffusion model with SMPL-X representation, which outperforms SignAvatar, a state-of-the-art CVAE method, on gloss discriminability metrics. We then systematically study the role of text conditioning using different text encoders (CLIP vs. T5), conditioning modes (gloss-only vs. gloss+phonological attributes), and attribute notation format (symbolic vs. natural language). Our analysis reveals that translating symbolic ASL-LEX notations to natural language is a necessary condition for effective CLIP-based attribute conditioning, while T5 is largely unaffected by this translation. Furthermore, our best-performing variant (CLIP with mapped attributes) outperforms SignAvatar across all metrics. These findings highlight input representation as a critical factor for text-encoder-based attribute conditioning, and motivate structured conditioning approaches where gloss and phonological attributes are encoded through independent pathways.

[117] Harnessing the Power of Foundation Models for Accurate Material Classification

Qingran Lin,Fengwei Yang,Chaolun Zhu

Main category: cs.CV

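TL;DR: This paper proposes a new framework that uses vision-language foundation models (VLMs) to ease the scarcity of annotated data in material classification. Its two innovations, synthetic data generation with automatic labeling and VLM prior incorporation with joint fine-tuning, significantly improve classification accuracy and generalization.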

Details Motivation: Material classification suffers from a scarcity of annotated data, which limits model accuracy and generalization, and existing VLM-based methods still perform poorly on this task. Method: Two components are proposed: (a) a material-centric image generation and auto-labeling pipeline that produces high-quality synthetic data and labels it automatically through text prompts that fuse object semantics and material attributes; (b) a strategy that distills priors from VLMs and jointly fine-tunes a vision foundation model, adapting to material-specific features while preserving generality. Result: Performance improves significantly on multiple datasets; the synthetic data is shown to reflect real material characteristics, and incorporating VLM priors markedly improves final classification performance. Conclusion: The framework overcomes the data bottleneck in material classification and shows that combining generated synthetic data with VLM priors is an effective way to improve material recognition when labeled data is scarce. Abstract: Material classification has emerged as a critical task in computer vision and graphics, supporting the assignment of accurate material properties to a wide range of digital and real-world applications. While traditionally framed as an image classification task, this domain faces significant challenges due to the scarcity of annotated data, limiting the accuracy and generalizability of trained models. Recent advances in vision-language foundation models (VLMs) offer promising avenues to address these issues, yet existing solutions leveraging these models still exhibit unsatisfactory results in material recognition tasks. In this work, we propose a novel framework that effectively harnesses foundation models to overcome data limitations and enhance classification accuracy. Our method integrates two key innovations: (a) a robust image generation and auto-labeling pipeline that creates a diverse and high-quality training dataset with material-centric images, and automatically assigns labels by fusing object semantics and material attributes in text prompts; (b) a prior incorporation strategy to distill information from VLMs, combined with a joint fine-tuning method that optimizes a pre-trained vision foundation model alongside VLM-derived priors, preserving broad generalizability while adapting to material-specific features. Extensive experiments demonstrate significant improvements on multiple datasets. We show that our synthetic dataset effectively captures the characteristics of real-world materials, and the integration of priors from vision-language models significantly enhances the final performance.
The source code and dataset will be released.

[118] Gesture-Aware Pretraining and Token Fusion for 3D Hand Pose Estimation

Rui Hong,Jana Kosecka

Main category: cs.CV

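TL;DR: This paper proposes a two-stage framework that uses gesture semantics as an inductive bias to improve 3D hand pose estimation from monocular RGB images. It combines gesture-aware pretraining with a Transformer regressor guided by gesture embeddings, and experiments show that it outperforms the EANet baseline on InterHand2.6M.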

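Details Motivation: 3D hand pose estimation from monocular RGB images is essential for AR/VR, human-computer interaction, and sign language understanding, yet existing methods do not fully exploit the semantic information carried by available discrete gesture labels. Method: A two-stage framework is proposed. The first stage is gesture-aware pretraining on InterHand2.6M, which learns a discriminative gesture embedding space. The second stage is a per-joint token Transformer guided by the gesture embeddings that regresses MANO hand parameters, trained with a layered loss over parameters, joints, and structural constraints. Result: Experiments on InterHand2.6M show that gesture-aware pretraining consistently improves single-hand pose accuracy over the state-of-the-art EANet baseline, and that the gain transfers across network architectures without any structural modification. Conclusion: Gesture semantics are an effective and transferable inductive bias for 3D hand pose estimation, and the two-stage framework offers a new way to combine high-level semantics with low-level geometric modeling. Abstract: Estimating 3D hand pose from monocular RGB images is fundamental for applications in AR/VR, human-computer interaction, and sign language understanding. In this work we focus on a scenario where a discrete set of gesture labels is available and show that gesture semantics can serve as a powerful inductive bias for 3D pose estimation. We present a two-stage framework: gesture-aware pretraining that learns an informative embedding space using coarse and fine gesture labels from InterHand2.6M, followed by a per-joint token Transformer guided by gesture embeddings as intermediate representations for final regression of MANO hand parameters. Training is driven by a layered objective over parameters, joints, and structural constraints. Experiments on InterHand2.6M demonstrate that gesture-aware pretraining consistently improves single-hand accuracy over the state-of-the-art EANet baseline, and that the benefit transfers across architectures without any modification.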

[119] Motion-Adaptive Temporal Attention for Lightweight Video Generation with Stable Diffusion

Rui Hong,Shuxue Quan

Main category: cs.CV

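TL;DR: This paper proposes a motion-adaptive temporal attention mechanism for parameter-efficient video generation on top of a frozen Stable Diffusion model, achieving competitive results while adding only 2.9% trainable parameters.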

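Details Motivation: Existing video generation methods usually treat all video content uniformly and do not model motion variation dynamically, which loses detail in high-motion content or leaves low-motion content inconsistent. Method: A motion-adaptive temporal attention mechanism adjusts the temporal attention receptive field according to estimated motion intensity (local attention for high motion, global attention for low motion). Lightweight temporal attention modules are injected into all UNet transformer blocks through a cascaded strategy: global attention in the down-sampling and middle blocks to stabilize semantics, and motion-adaptive attention in the up-sampling blocks for fine-grained refinement. This is combined with temporally correlated noise initialization and motion-aware gating. Result: The system adds only 25.8M trainable parameters (2.9% of the base UNet) and achieves competitive results on WebVid validation; the standard denoising objective alone is shown to provide sufficient temporal regularization, outperforming approaches that add explicit temporal consistency losses; ablations reveal a clear trade-off between noise correlation and motion amplitude. Conclusion: Motion-adaptive temporal attention is an efficient and effective approach to video generation that needs no extra temporal loss, balances parameter efficiency against generation quality, and allows generation behavior to be controlled at inference time. Abstract: We present a motion-adaptive temporal attention mechanism for parameter-efficient video generation built upon frozen Stable Diffusion models. Rather than treating all video content uniformly, our method dynamically adjusts temporal attention receptive fields based on estimated motion content: high-motion sequences attend locally across frames to preserve rapidly changing details, while low-motion sequences attend globally to enforce scene consistency. We inject lightweight temporal attention modules into all UNet transformer blocks via a cascaded strategy: global attention in down-sampling and middle blocks for semantic stabilization, motion-adaptive attention in up-sampling blocks for fine-grained refinement. Combined with temporally correlated noise initialization and motion-aware gating, the system adds only 25.8M trainable parameters (2.9% of the base UNet) while achieving competitive results on WebVid validation when trained on 100K videos. We demonstrate that the standard denoising objective alone provides sufficient implicit temporal regularization, outperforming approaches that add explicit temporal consistency losses. Our ablation studies reveal a clear trade-off between noise correlation and motion amplitude, providing a practical inference-time control for diverse generation behaviors.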
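
The idea of a motion-dependent temporal receptive field can be sketched as a frame-to-frame attention mask whose window narrows as motion grows. This is a minimal illustration only: the abstract does not say how motion is estimated or how the window is parameterized, so the normalized `motion` score in [0, 1], the linear mapping to a window radius, and both function names are assumptions.

```python
def temporal_window_radius(motion, num_frames, min_radius=1):
    """Map a motion score in [0, 1] to a temporal attention radius.

    Hypothetical linear rule: motion 0.0 gives a global window,
    motion 1.0 gives the narrowest window (`min_radius`).
    """
    motion = min(max(motion, 0.0), 1.0)  # clamp to [0, 1]
    max_radius = num_frames - 1          # radius that covers every frame
    min_radius = min(min_radius, max_radius)
    return round(max_radius - motion * (max_radius - min_radius))


def temporal_attention_mask(motion, num_frames, min_radius=1):
    """Boolean mask: mask[q][k] is True if frame q may attend to frame k."""
    radius = temporal_window_radius(motion, num_frames, min_radius)
    return [
        [abs(q - k) <= radius for k in range(num_frames)]
        for q in range(num_frames)
    ]


high_motion_mask = temporal_attention_mask(motion=1.0, num_frames=5)
low_motion_mask = temporal_attention_mask(motion=0.0, num_frames=5)
```

With five frames, the high-motion mask lets each frame see only its immediate neighbors, while the low-motion mask is fully connected. A learned or soft window would be a natural alternative to this hard mask.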

[120] Joint Degradation-Aware Arbitrary-Scale Super-Resolution for Variable-Rate Extreme Image Compression

Xinning Chai,Zhengxue Cheng,Xin Li,Rong Xie,Li Song

Main category: cs.CV

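TL;DR: This paper proposes ASSR-EIC, a framework that uses arbitrary-scale super-resolution (ASSR) so that a single model supports variable-rate extreme image compression, combining high compression ratios with high-quality reconstruction.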

Details Motivation: Existing diffusion-based methods must train a separate model for each target bitrate, which is computationally expensive; joint super-resolution methods, in turn, struggle at ultra-low bitrates because of severe information loss, and their fixed scales prevent flexible adaptation to multiple bitrates. Method: An end-to-end framework is designed with an arbitrary-scale downsampling module at the encoder and a diffusion-based, degradation-aware ASSR decoder; it introduces a global compression-rescaling adaptor and a local compression-rescaling modulator, together with a dual semantic-enhanced design. Result: The method reaches state-of-the-art performance in extreme image compression while supporting flexible bitrate control and rate-adaptive reconstruction. Conclusion: ASSR-EIC unifies ultra-low-bitrate compression with arbitrary-scale reconstruction, improving both deployment practicality and reconstruction quality. Abstract: Recent diffusion-based extreme image compression methods have demonstrated remarkable performance at ultra-low bitrates. However, most approaches require training separate diffusion models for each target bitrate, resulting in substantial computational overhead and hindering practical deployment. Meanwhile, recent studies have shown that joint super-resolution can serve as an effective approach for enhancing low-bitrate reconstruction. However, when moving toward ultra-low bitrate regimes, these methods struggle due to severe information loss, and their reliance on fixed super-resolution scales prevents flexible adaptation across diverse bitrates. To address these limitations, we propose ASSR-EIC, a novel image compression framework that leverages arbitrary-scale super-resolution (ASSR) to support variable-rate extreme image compression (EIC). An arbitrary-scale downsampling module is introduced at the encoder side to provide controllable rate reduction, while a diffusion-based, joint degradation-aware ASSR decoder enables rate-adaptive reconstruction within a single model. We exploit the compression- and rescaling-aware diffusion prior to guide the reconstruction, yielding high-fidelity and highly realistic restoration across diverse compression and rescaling settings. Specifically, we design a global compression-rescaling adaptor that offers holistic guidance for rate adaptation, and a local compression-rescaling modulator that dynamically balances generative and fidelity-oriented behaviors to achieve fine-grained, bitrate-adaptive detail restoration. To further enhance reconstruction quality, we introduce a dual semantic-enhanced design.
Extensive experiments demonstrate that ASSR-EIC delivers state-of-the-art performance in extreme image compression while simultaneously supporting flexible bitrate control and adaptive rate-dependent reconstruction.

[121] Mutually Causal Semantic Distillation Network for Zero-Shot Learning

Shiming Chen,Shuhuang Chen,Guo-Sen Xie,Xinge You

Main category: cs.CV

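TL;DR: This paper proposes a mutually causal semantic distillation network (MSDN++) that uses bidirectional causal attention to learn the intrinsic semantic associations between visual and attribute features, improving zero-shot learning performance.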

Details Motivation: Existing zero-shot learning methods use only unidirectional, weakly supervised attention, which struggles to uncover the intrinsic and sufficient semantic knowledge (e.g., attribute semantics) between visual and attribute features and therefore limits semantic transfer. Method: MSDN++ contains two causal attention sub-nets: an attribute→visual sub-net that learns attribute-based visual features and a visual→attribute sub-net that learns visual-based attribute features; guided by a semantic distillation loss, the two train collaboratively and teach each other. Result: The method significantly outperforms strong baselines on the four benchmark datasets CUB, SUN, AWA2, and FLO, reaching new state-of-the-art performance. Conclusion: Bidirectional causal attention combined with mutual distillation models reliable causal vision-attribute associations and improves the generalization of semantic knowledge in zero-shot learning. Abstract: Zero-shot learning (ZSL) aims to recognize unseen classes in the open world, guided by side information (e.g., attributes). Its key task is to infer the latent semantic knowledge between visual and attribute features on seen classes, and thus to transfer semantic knowledge from seen classes to unseen ones. Prior works simply utilize unidirectional attention in a weakly supervised manner to learn spurious and limited latent semantic representations, which fail to effectively discover the intrinsic semantic knowledge (e.g., attribute semantics) between visual and attribute features. To solve the above challenges, we propose a mutually causal semantic distillation network (termed MSDN++) to distill the intrinsic and sufficient semantic representations for ZSL. MSDN++ consists of an attribute→visual causal attention sub-net that learns attribute-based visual features, and a visual→attribute causal attention sub-net that learns visual-based attribute features. The causal attentions encourage the two sub-nets to learn causal vision-attribute associations for representing reliable features with causal visual/attribute learning. With the guidance of a semantic distillation loss, the two mutual attention sub-nets learn collaboratively and teach each other throughout the training process. Extensive experiments on four widely used benchmark datasets (CUB, SUN, AWA2, and FLO) show that our MSDN++ yields significant improvements over the strong baselines, leading to new state-of-the-art performance.

[122] Towards Motion-aware Referring Image Segmentation

Chaeyun Kim,Seunghoon Yi,Yejin Kim,Yohan Jo,Joonseok Lee

Main category: cs.CV

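TL;DR: This paper proposes a referring image segmentation (RIS) method aimed at motion-related descriptions. Data augmentation with motion-centric phrases and Multimodal Radial Contrastive Learning (MRaCL) substantially improve the model's understanding of motion queries, and a new motion-oriented benchmark, M-Bench, is introduced.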

Details Motivation: Existing RIS methods perform markedly worse on motion-related text queries than on appearance-based ones, so models need a better grasp of actions and dynamic semantics. Method: (1) A data augmentation scheme extracts motion-centric phrases from the original captions, enriching motion expressions without additional annotation; (2) Multimodal Radial Contrastive Learning (MRaCL), performed on fused image-text embeddings, strengthens context-aware cross-modal alignment. Result: On the new motion-centric test split and the M-Bench benchmark, the method substantially improves the segmentation performance of multiple RIS models on motion queries while remaining competitive on appearance-based queries. Conclusion: Modeling motion semantics calls for context-aware multimodal contrastive learning together with targeted data augmentation, and MRaCL offers an effective way to improve dynamic understanding in RIS. Abstract: Referring Image Segmentation (RIS) requires identifying objects from images based on textual descriptions. We observe that existing methods significantly underperform on motion-related queries compared to appearance-based ones. To address this, we first introduce an efficient data augmentation scheme that extracts motion-centric phrases from original captions, exposing models to more motion expressions without additional annotations. Second, since the same object can be described differently depending on the context, we propose Multimodal Radial Contrastive Learning (MRaCL), performed on fused image-text embeddings rather than unimodal representations. For comprehensive evaluation, we introduce a new test split focusing on motion-centric queries, and introduce a new benchmark called M-Bench, where objects are distinguished primarily by actions. Extensive experiments show our method substantially improves performance on motion-centric queries across multiple RIS models, while maintaining competitive results on appearance-based descriptions. Code is available at https://github.com/snuviplab/MRaCL

[123] SHIFT: Motion Alignment in Video Diffusion Models with Adversarial Hybrid Fine-Tuning

Xi Ye,Wenjia Yang,Yangyang Xu,Xiaoyang Liu,Duo Su,Mengfei Xia,Jun Zhu

Main category: cs.CV

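TL;DR: This paper proposes SHIFT, a reward-driven fine-tuning framework that uses pixel-motion rewards and a hybrid fine-tuning strategy to improve the motion fidelity and temporal consistency of video diffusion models.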

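Details Motivation: Image-conditioned video diffusion models achieve strong visual realism but weak motion fidelity (for example, reduced motion dynamics and poor long-term coherence), especially after fine-tuning; this paper focuses on motion alignment during post-training. Method: Pixel-motion rewards based on pixel flux dynamics capture both instantaneous and long-term motion consistency; the Smooth Hybrid Fine-tuning (SHIFT) framework fuses supervised fine-tuning with advantage-weighted fine-tuning and introduces novel adversarial advantages to speed up convergence and suppress reward hacking. Result: Experiments show that SHIFT efficiently mitigates the dynamic-degree collapse that occurs in supervised fine-tuning of modern video diffusion models, improving motion dynamics and long-term coherence. Conclusion: SHIFT is a scalable, robust reward-driven fine-tuning method that offers a new approach to improving motion quality in video generation models. Abstract: Image-conditioned video diffusion models achieve impressive visual realism but often suffer from weakened motion fidelity, e.g., reduced motion dynamics or degraded long-term temporal coherence, especially after fine-tuning. We study the problem of motion alignment in video diffusion models post-training. To address this, we introduce pixel-motion rewards based on pixel flux dynamics, capturing both instantaneous and long-term motion consistency. We further propose Smooth Hybrid Fine-tuning (SHIFT), a scalable reward-driven fine-tuning framework for video diffusion models. SHIFT fuses standard supervised fine-tuning and advantage-weighted fine-tuning into a unified framework. Benefiting from novel adversarial advantages, SHIFT improves convergence speed and mitigates reward hacking. Experiments show that our approach efficiently resolves dynamic-degree collapse in supervised fine-tuning of modern video diffusion models.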

[124] ECHO: Towards Emotionally Appropriate and Contextually Aware Interactive Head Generation

Xiangyu Kong,Xiaoyu Jin,Yihan Pan,Haoqin Sun,Hengde Zhu,Xiaoming Xu,Xiaoming Wei,Lu Liu,Siyang Song

Main category: cs.CV

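TL;DR: This paper proposes ECHO, a framework whose Long-range Contextual Understanding (LCU) component and Spatial-aware Decoupled Cross-attention Modulation (SDCM) module address two problems in interactive head generation: facial behaviors that lack contextual appropriateness and poor lip synchronization. Generation quality improves substantially.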

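Details Motivation: Existing interactive head generation (IHG) methods rely on short-window behavioral cues, and their fusion of dual-track signals is prone to cross-signal interference, so they struggle to deliver both contextually appropriate facial behaviors and accurate lip synchronization. Method: The ECHO framework comprises (1) a Long-range Contextual Understanding (LCU) component that models behavioral dynamics and linguistic affective semantics; (2) a block-wise Spatial-aware Decoupled Cross-attention Modulation (SDCM) module that separates audio-driven lip motion from user-behavior-guided motion of non-lip facial regions; and (3) a two-stage training paradigm that optimizes them jointly. Result: Experiments show that ECHO outperforms existing methods in contextual appropriateness, emotional rationality, lip-sync accuracy, and visual fidelity. Conclusion: Through decoupled modeling and long-range contextual understanding, ECHO improves the realism and naturalness of interactive head generation and offers a new direction for avatar synthesis in real-time human-computer interaction. Abstract: In natural face-to-face interaction, participants seamlessly alternate between speaking and listening, producing facial behaviors (FBs) that are finely informed by long-range context and naturally exhibit contextual appropriateness and emotional rationality. Interactive Head Generation (IHG) aims to synthesize lifelike avatar head video emulating such capabilities. Existing IHG methods typically condition on dual-track signals (i.e., the human user's behaviors and pre-defined audio for the avatar) within a short temporal window, jointly driving generation of the avatar's audio-aligned lip articulation and non-verbal FBs. However, two main challenges persist in these methods: (i) the reliance on short-clip behavioral cues without long-range contextual modeling leads them to produce facial behaviors lacking contextual appropriateness; and (ii) the entangled, role-agnostic fusion of dual-track signals empirically introduces cross-signal interference, potentially compromising lip-region synchronization during speaking.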
To this end, we propose ECHO, a novel IHG framework comprising two key components: a Long-range Contextual Understanding (LCU) component that facilitates contextual understanding of both behavior-grounded dynamics and linguistic-driven affective semantics to promote contextual appropriateness and emotional rationality of synthesized avatar FBs; and a block-wise Spatial-aware Decoupled Cross-attention Modulation (SDCM) module, that preserves self-audio-driven lip articulation while adaptively integrating user contextual behavioral cues for non-lip facial regions, complemented by our designed two-stage training paradigm, to jointly enhance lip synchronization and visual fidelity. Extensive experiments demonstrate the effectiveness of proposed components and ECHO's superior IHG performance.

[125] AdaZoom-GUI: Adaptive Zoom-based GUI Grounding with Instruction Refinement

Siqi Pei,Liang Tang,Tiaonan Duan,Long Chen,Shuxian Li,Kaer Huang,Yanzhe Jing,Yiqiang Yan,Bo Zhang,Chenghao Jiang,Borui Zhang,Jiwen Lu

Main category: cs.CV

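TL;DR: This paper proposes AdaZoom-GUI, a framework that uses instruction rewriting and an adaptive zoom strategy to improve the accuracy of grounding natural language instructions to UI elements. With a high-quality dataset and GRPO training, it reaches state-of-the-art performance on public benchmarks.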

Details Motivation: Grounding on GUI screenshots is hard because of high-resolution images, small UI elements, and ambiguous user instructions, so the grounding and understanding abilities of vision-language models on GUIs need to improve. Method: The AdaZoom-GUI framework includes (1) an instruction refinement module that rewrites natural language instructions into more explicit and detailed descriptions; (2) a conditional zoom-in strategy that runs a second, fine-grained inference stage on predicted small elements; and (3) a high-quality GUI grounding dataset, with the model trained by Group Relative Policy Optimization (GRPO) to predict both click coordinates and bounding boxes. Result: On public GUI grounding benchmarks, AdaZoom-GUI achieves the best performance among models with comparable or even larger parameter counts, confirming its effectiveness for high-resolution GUI understanding and practical GUI agent deployment. Conclusion: By refining instructions and allocating computation adaptively, AdaZoom-GUI improves GUI grounding accuracy and robustness without a significant increase in overhead, offering a practical design for GUI agents. Abstract: GUI grounding is a critical capability for vision-language models (VLMs) that enables automated interaction with graphical user interfaces by locating target elements from natural language instructions. However, grounding on GUI screenshots remains challenging due to high-resolution images, small UI elements, and ambiguous user instructions. In this work, we propose AdaZoom-GUI, an adaptive zoom-based GUI grounding framework that improves both localization accuracy and instruction understanding. Our approach introduces an instruction refinement module that rewrites natural language commands into explicit and detailed descriptions, allowing the grounding model to focus on precise element localization. In addition, we design a conditional zoom-in strategy that selectively performs a second-stage inference on predicted small elements, improving localization accuracy while avoiding unnecessary computation and context loss on simpler cases. To support this framework, we construct a high-quality GUI grounding dataset and train the grounding model using Group Relative Policy Optimization (GRPO), enabling the model to predict both click coordinates and element bounding boxes. Experiments on public benchmarks demonstrate that our method achieves state-of-the-art performance among models with comparable or even larger parameter sizes, highlighting its effectiveness for high-resolution GUI understanding and practical GUI agent deployment.
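
The conditional zoom-in step can be illustrated with plain geometry: decide whether a first-stage box is small enough to warrant a second pass, pick a crop window around it, and map the second-stage click back to screen coordinates. This is a sketch under stated assumptions, not the paper's procedure: the area-ratio threshold, the fixed crop size, the `scale` upsampling factor, and all function names are hypothetical.

```python
def needs_zoom(box, screen_w, screen_h, area_ratio_threshold=0.001):
    """Return True if the predicted box covers a tiny fraction of the screen.

    `box` is (x0, y0, x1, y1) in screen pixels; the threshold is an
    illustrative value, not one taken from the paper.
    """
    x0, y0, x1, y1 = box
    area = max(0, x1 - x0) * max(0, y1 - y0)
    return area / (screen_w * screen_h) < area_ratio_threshold


def crop_window(box, screen_w, screen_h, crop_w, crop_h):
    """Top-left corner of a crop centered on the box, clamped to the screen."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    left = min(max(cx - crop_w / 2, 0), screen_w - crop_w)
    top = min(max(cy - crop_h / 2, 0), screen_h - crop_h)
    return left, top


def to_screen_coords(point_in_crop, window_origin, scale=1.0):
    """Map a second-stage point back to the full screenshot.

    `scale` is the factor by which the crop was enlarged before the
    second-stage inference (1.0 means no resizing).
    """
    px, py = point_in_crop
    left, top = window_origin
    return px / scale + left, py / scale + top


first_stage_box = (2000, 1000, 2020, 1012)  # small element on a 4K screenshot
zoom = needs_zoom(first_stage_box, 3840, 2160)
origin = crop_window(first_stage_box, 3840, 2160, crop_w=640, crop_h=360)
click = to_screen_coords((640, 360), origin, scale=2.0)
```

Large elements skip the second pass entirely, which is how the strategy avoids unnecessary computation on simpler cases.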

[126] FACE-net: Factual Calibration and Emotion Augmentation for Retrieval-enhanced Emotional Video Captioning

Weidong Chen,Cheng Ye,Zhendong Mao,Peipei Song,Xinyan Liu,Lei Zhang,Xiaojun Chang,Yongdong Zhang

Main category: cs.CV

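TL;DR: This paper proposes FACE-net, a framework that combines retrieval enhancement, factual calibration, and emotion augmentation to address the factual-emotional bias in emotional video captioning.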

Details Motivation: Existing methods mine and coordinate factual and emotional cues insufficiently, so they cope poorly with the factual-emotional bias that arises because different samples call for different balances of fact and emotion. Method: The retrieval-enhanced FACE-net framework introduces an external repository for semantic augmentation; extracts subject-predicate-object triplets from the retrieved sentences and self- and cross-calibrates them via uncertainty estimation; uses the calibrated factual semantics as experts that interact with the video content and an emotion dictionary to augment emotions progressively; and adds a dynamic bias adjustment routing module that predicts and adjusts each sample's degree of bias. Result: The method significantly improves emotional video captioning on multiple benchmark datasets, mitigates the factual-emotional bias, and generates more accurate, adaptive descriptions that blend fact and emotion. Conclusion: Through collaborative mining and adaptive guidance, FACE-net overcomes the tendency of earlier methods to compromise between factual and emotional description, providing a unified, flexible, and efficient solution for EVC. Abstract: Emotional Video Captioning (EVC) is an emerging task that aims to describe factual content together with the intrinsic emotions expressed in videos. Existing works perceive global emotional cues and then combine them with video content to generate descriptions. However, insufficient mining and coordination of factual and emotional cues during generation make it difficult for these methods to deal with the factual-emotional bias, which refers to the factual and emotional requirements differing across samples during generation. To this end, we propose a retrieval-enhanced framework with FActual Calibration and Emotion augmentation (FACE-net), which through a unified architecture collaboratively mines factual-emotional semantics and provides adaptive and accurate guidance for generation, breaking through the compromising tendency of factual-emotional descriptions in all-sample learning. Technically, we first introduce an external repository and retrieve the sentences most relevant to the video content to augment the semantic information. Subsequently, our factual calibration via uncertainty estimation module splits the retrieved information into subject-predicate-object triplets, and self-refines and cross-refines different components through video content to effectively mine the factual semantics; while our progressive visual emotion augmentation module leverages the calibrated factual semantics as experts, interacts with the video content and emotion dictionary to generate visual queries and candidate emotions, and then aggregates them to adaptively augment emotions for each factual semantic.
Moreover, to alleviate the factual-emotional bias, we design a dynamic bias adjustment routing module to predict and adjust the degree of bias of a sample.

[127] AR-CoPO: Align Autoregressive Video Generation with Contrastive Policy Optimization

Dailan He,Guanlin Feng,Xingtong Ge,Yi Zhang,Bingqi Ma,Guanglu Song,Yu Liu,Hongsheng Li

Main category: cs.CV

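TL;DR: This paper proposes AR-CoPO, a framework that uses chunk-level contrastive policy optimization and semi-on-policy training to improve the alignment and generalization of streaming autoregressive video generators under RLHF.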

Details Motivation: Existing SDE-based GRPO methods are poorly suited to low-latency, few-step streaming autoregressive video generation, whose trajectories are short, low in stochasticity, and sensitive to initialization noise, rendering intermediate SDE exploration ineffective. Method: AR-CoPO (1) uses a chunk-level forking mechanism that constructs neighborhood candidates at a randomly selected video chunk; (2) assigns sequence-level rewards; (3) performs localized GRPO updates; and (4) adds semi-on-policy training that combines on-policy exploration with exploitation over a replay buffer of reference rollouts. Result: On Self-Forcing, AR-CoPO improves both out-of-domain generalization and in-domain human preference alignment, with evidence of genuine alignment rather than reward hacking. Conclusion: AR-CoPO provides a more robust and effective RLHF alignment scheme for streaming AR video generation, meeting low-latency requirements while learning from human feedback. Abstract: Streaming autoregressive (AR) video generators combined with few-step distillation achieve low-latency, high-quality synthesis, yet remain difficult to align via reinforcement learning from human feedback (RLHF). Existing SDE-based GRPO methods face challenges in this setting: few-step ODEs and consistency model samplers deviate from standard flow-matching ODEs, and their short, low-stochasticity trajectories are highly sensitive to initialization noise, rendering intermediate SDE exploration ineffective. We propose AR-CoPO (AutoRegressive Contrastive Policy Optimization), a framework that adapts the Neighbor GRPO contrastive perspective to streaming AR generation. AR-CoPO introduces chunk-level alignment via a forking mechanism that constructs neighborhood candidates at a randomly selected chunk, assigns sequence-level rewards, and performs localized GRPO updates. We further propose a semi-on-policy training strategy that complements on-policy exploration with exploitation over a replay buffer of reference rollouts, improving generation quality across domains. Experiments on Self-Forcing demonstrate that AR-CoPO improves both out-of-domain generalization and in-domain human preference alignment over the baseline, providing evidence of genuine alignment rather than reward hacking.
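
GRPO-style updates rest on a group-relative advantage: each candidate's reward is normalized against the other candidates in its group. The sketch below shows that normalization for a group of candidates forked at one chunk. It is the commonly used GRPO formula, not necessarily the exact variant in AR-CoPO; the function name, the use of the population standard deviation, and the reward values are assumptions for illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize sequence-level rewards within one group of candidates.

    Each candidate's advantage is (reward - group mean) / (group std + eps).
    The population standard deviation is used here as an assumption.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for three candidates that share a prefix and were
# forked at the same randomly selected chunk.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

Candidates above the group mean get positive advantages and those below get negative ones, so the update only needs rewards that rank candidates within a group, not an absolute scale.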

[128] VirPro: Visual-referred Probabilistic Prompt Learning for Weakly-Supervised Monocular 3D Detection

Chupeng Liu,Jiyong Rao,Shangquan Sun,Runkai Zhao,Weidong Cai

Main category: cs.CV

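TL;DR: This paper proposes VirPro, a visual-referred probabilistic prompt learning method for weakly supervised pretraining in monocular 3D object detection, which improves performance through adaptive multi-modal prompt modeling and contrastive alignment.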

Details Motivation: Weak supervision signals based on hand-crafted textual descriptions struggle to capture the visual diversity of individual objects across scenes, limiting the model's ability to learn scene-aware representations. Method: Visual-referred Probabilistic Prompt Learning (VirPro) builds an Adaptive Prompt Bank (APB), introduces Multi-Gaussian Prompt Modeling (MGPM) that fuses visual features into text embeddings to express visual uncertainty, and aligns the modalities through RoI-level contrastive matching. Result: Experiments on KITTI show an average precision improvement of up to 4.8% over the baseline. Conclusion: VirPro is a plug-and-play multi-modal pretraining paradigm that improves the performance and semantic coherence of weakly supervised monocular 3D detection. Abstract: Monocular 3D object detection typically relies on pseudo-labeling techniques to reduce dependency on real-world annotations. Recent advances demonstrate that deterministic linguistic cues can serve as effective auxiliary weak supervision signals, providing complementary semantic context. However, hand-crafted textual descriptions struggle to capture the inherent visual diversity of individuals across scenes, limiting the model's ability to learn scene-aware representations. To address this challenge, we propose Visual-referred Probabilistic Prompt Learning (VirPro), an adaptive multi-modal pretraining paradigm that can be seamlessly integrated into diverse weakly supervised monocular 3D detection frameworks. Specifically, we generate a diverse set of learnable, instance-conditioned prompts across scenes and store them in an Adaptive Prompt Bank (APB). Subsequently, we introduce Multi-Gaussian Prompt Modeling (MGPM), which incorporates scene-based visual features into the corresponding textual embeddings, allowing the text prompts to express visual uncertainties. Then, from the fused vision-language embeddings, we decode a prompt-targeted Gaussian, from which we derive a unified object-level prompt embedding for each instance. RoI-level contrastive matching is employed to enforce modality alignment, bringing embeddings of co-occurring objects within the same scene closer in the latent space, thus enhancing semantic coherence. Extensive experiments on the KITTI benchmark demonstrate that integrating our pretraining paradigm consistently yields substantial performance gains, achieving up to a 4.8% average precision improvement over the baseline.

[129] Revisiting Cross-Attention Mechanisms: Leveraging Beneficial Noise for Domain-Adaptive Learning

Zelin Zang,Yehui Yang,Fei Wang,Liangyu Li,Baigui Sun

Main category: cs.CV

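TL;DR: This paper proposes the DACSM framework, which regularizes cross-attention with beneficial noise and combines a domain-adaptive Transformer with a cross-scale matching module to improve content consistency and scale robustness in unsupervised domain adaptation, reaching state-of-the-art performance on several benchmarks.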

Details Motivation: Existing cross-attention-based transformers struggle to preserve semantic consistency under large appearance and scale variations, and domain and scale gaps degrade performance in UDA. Method: The concept of beneficial noise is introduced to regularize cross-attention; the DACSM framework consists of a Domain-Adaptive Transformer (DAT) that disentangles content from style and a Cross-Scale Matching (CSM) module that aligns features across multiple resolutions. Result: DACSM reaches state-of-the-art performance on VisDA-2017, Office-Home, and DomainNet, improving on CDTrans by up to 2.3% on VisDA-2017 and by 5.9% on the "truck" class. Conclusion: Combining domain translation, beneficial-noise-enhanced attention, and scale-aware alignment markedly improves the robustness and generalization of cross-domain representation learning. Abstract: Unsupervised Domain Adaptation (UDA) seeks to transfer knowledge from a labeled source domain to an unlabeled target domain but often suffers from severe domain and scale gaps that degrade performance. Existing cross-attention-based transformers can align features across domains, yet they struggle to preserve content semantics under large appearance and scale variations. To explicitly address these challenges, we introduce the concept of beneficial noise, which regularizes cross-attention by injecting controlled perturbations, encouraging the model to ignore style distractions and focus on content. We propose the Domain-Adaptive Cross-Scale Matching (DACSM) framework, which consists of a Domain-Adaptive Transformer (DAT) for disentangling domain-shared content from domain-specific style, and a Cross-Scale Matching (CSM) module that adaptively aligns features across multiple resolutions. DAT incorporates beneficial noise into cross-attention, enabling progressive domain translation with enhanced robustness, yielding content-consistent and style-invariant representations. Meanwhile, CSM ensures semantic consistency under scale changes. Extensive experiments on VisDA-2017, Office-Home, and DomainNet demonstrate that DACSM achieves state-of-the-art performance, with up to +2.3% improvement over CDTrans on VisDA-2017. Notably, DACSM achieves a +5.9% gain on the challenging "truck" class of VisDA, evidencing the strength of beneficial noise in handling scale discrepancies.
These results highlight the effectiveness of combining domain translation, beneficial-noise-enhanced attention, and scale-aware alignment for robust cross-domain representation learning.
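
One way to picture "injecting controlled perturbations" into cross-attention is to add small Gaussian noise to the attention logits before the softmax. The abstract does not specify where or how DACSM injects its noise, so the injection point, the `sigma` scale, and the function names below are assumptions made purely for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def noisy_attention_weights(query, keys, sigma=0.1, rng=None):
    """Scaled dot-product attention weights with Gaussian noise on the logits.

    Assumption: the perturbation is additive, zero-mean, and applied before
    the softmax; `sigma` controls its strength (0.0 disables it).
    """
    rng = rng or random.Random(0)
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    perturbed = [x + sigma * rng.gauss(0.0, 1.0) for x in logits]
    return softmax(perturbed)


query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
clean = noisy_attention_weights(query, keys, sigma=0.0)
noisy = noisy_attention_weights(query, keys, sigma=0.1)
```

Setting `sigma` to zero recovers ordinary attention, which makes the perturbation easy to switch off at inference time if it is meant only as a training-time regularizer.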

[130] UAV-CB: A Complex-Background RGB-T Dataset and Local Frequency Bridge Network for UAV Detection

Shenghui Huang,Menghao Hu,Longkun Zou,Hongyu Chi,Zekai Li,Feng Gao,Fan Yang,Qingyao Wu,Ke Chen

Main category: cs.CV

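TL;DR: This paper introduces the UAV-CB dataset and the LFBNet network to address UAV detection in low-altitude environments, where complex backgrounds and camouflage make the task hard. LFBNet fuses RGB-T modalities by modeling features in a local frequency space and reaches state-of-the-art performance in camouflaged and cluttered scenes.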

Details Motivation: Existing UAV detection datasets are not designed around the challenges of complex low-altitude backgrounds and camouflage, which limits progress toward robust real-world perception. Method: UAV-CB, an RGB-T UAV detection dataset that emphasizes complex backgrounds and camouflage, is constructed; the Local Frequency Bridge Network (LFBNet) models features in a localized frequency space to fuse frequency and spatial information and to reduce the cross-modality discrepancy between RGB and thermal inputs. Result: Experiments on UAV-CB and public benchmarks show that LFBNet achieves state-of-the-art detection performance and strong robustness under camouflaged and cluttered conditions. Conclusion: LFBNet offers a frequency-aware perspective on real-world multimodal UAV perception and improves detection in complex low-altitude environments. Abstract: Detecting Unmanned Aerial Vehicles (UAVs) in low-altitude environments is essential for perception and defense systems but remains highly challenging due to complex backgrounds, camouflage, and multimodal interference. In real-world scenarios, UAVs are frequently visually blended with surrounding structures such as buildings, vegetation, and power lines, resulting in low contrast, weak boundaries, and strong confusion with cluttered background textures. Existing UAV detection datasets, though diverse, are not specifically designed to capture these camouflage and complex-background challenges, which limits progress toward robust real-world perception. To fill this gap, we construct UAV-CB, a new RGB-T UAV detection dataset deliberately curated to emphasize complex low-altitude backgrounds and camouflage characteristics. Furthermore, we propose the Local Frequency Bridge Network (LFBNet), which models features in localized frequency space to bridge both the frequency-spatial fusion gap and the cross-modality discrepancy gap in RGB-T fusion. Extensive experiments on UAV-CB and public benchmarks demonstrate that LFBNet achieves state-of-the-art detection performance and strong robustness under camouflaged and cluttered conditions, offering a frequency-aware perspective on multimodal UAV perception in real-world applications.

[131] Omni-I2C: A Holistic Benchmark for High-Fidelity Image-to-Code Generation

Jiawei Zhou,Chi Zhang,Xiang Feng,Qiming Zhang,Haibo Qiu,Lihuo He,Dengpan Ye,Xinbo Gao,Jing Zhang

Main category: cs.CV

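TL;DR: Omni-I2C is a comprehensive benchmark for evaluating how well Large Multimodal Models (LMMs) convert complex digital graphics into executable code. It emphasizes the synergy between visual perception and code generation and reveals marked shortcomings of current models in preserving structural integrity.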

Details Motivation: Current LMMs struggle to convert complex digital graphics into executable code accurately, a task that requires both high-fidelity visual understanding and precise code generation, and traditional tasks do not adequately expose their structural failures. Method: Build the Omni-I2C benchmark of 1080 carefully curated samples spanning multiple subjects, image modalities, and programming languages; incorporate authentic user-sourced cases; design a decoupled evaluation framework that measures perceptual fidelity and symbolic precision separately. Result: Experiments show that leading LMMs struggle to preserve structural integrity in complex scenarios, with a substantial performance gap among them, exposing deeper bottlenecks such as perceptual hallucination and logic errors. Conclusion: Multimodal code generation remains a major challenge, and Omni-I2C provides a systematic evaluation tool and benchmark for advancing the field. Abstract: We present Omni-I2C, a comprehensive benchmark designed to evaluate the capability of Large Multimodal Models (LMMs) in converting complex, structured digital graphics into executable code. We argue that this task represents a non-trivial challenge for the current generation of LMMs: it demands an unprecedented synergy between high-fidelity visual perception -- to parse intricate spatial hierarchies and symbolic details -- and precise generative expression -- to synthesize syntactically sound and logically consistent code. Unlike traditional descriptive tasks, Omni-I2C requires a holistic understanding where any minor perceptual hallucination or coding error leads to a complete failure in visual reconstruction. Omni-I2C features 1080 meticulously curated samples, defined by its breadth across subjects, image modalities, and programming languages. By incorporating authentic user-sourced cases, the benchmark spans a vast spectrum of digital content -- from scientific visualizations to complex symbolic notations -- each paired with executable reference code. To complement this diversity, our evaluation framework provides necessary depth; by decoupling performance into perceptual fidelity and symbolic precision, it transcends surface-level accuracy to expose the granular structural failures and reasoning bottlenecks of current LMMs. Our evaluation reveals a substantial performance gap among leading LMMs; even state-of-the-art models struggle to preserve structural integrity in complex scenarios, underscoring that multimodal code generation remains a formidable challenge. Data and code are available at https://github.com/MiliLab/Omni-I2C.

[132] EI: Early Intervention for Multimodal Imaging based Disease Recognition

Qijie Wei,Hailan Lin,Xirong Li

Main category: cs.CV

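TL;DR: This paper proposes a framework called Early Intervention (EI), which introduces high-level semantic tokens from reference modalities at an early stage to steer the target modality's embedding, combined with a parameter-efficient fine-tuning method, MoR, to tackle two challenges in multimodal medical imaging based disease recognition: insufficient information fusion and scarce labeled data.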

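Details Motivation: Multimodal medical imaging based disease recognition currently faces two major challenges: first, the prevailing "fusion after unimodal image embedding" paradigm cannot fully exploit the complementary and correlated information across modalities; second, labeled multimodal medical images are scarce and exhibit a significant domain shift from natural images, which limits the use of Vision Foundation Models (VFMs). Method: Propose the Early Intervention (EI) framework: treating one modality as the target and the rest as reference, it uses high-level semantic tokens from the reference modalities as intervention tokens to steer the target modality's embedding at an early stage; also design Mixture of Low-varied-Ranks Adaptation (MoR), a parameter-efficient fine-tuning method with low-rank adapters of varied ranks and a weight-relaxed router, for VFM adaptation. Result: On three public datasets covering retinal disease, skin lesion, and knee anomaly classification, the method significantly outperforms several strong baselines. Conclusion: The EI framework and the MoR method improve multimodal medical imaging based disease recognition, addressing insufficient information fusion and the difficulty of transferring VFMs, and offer a new direction for multimodal medical learning. Abstract: Current methods for multimodal medical imaging based disease recognition face two major challenges. First, the prevailing "fusion after unimodal image embedding" paradigm cannot fully leverage the complementary and correlated information in the multimodal data. Second, the scarcity of labeled multimodal medical images, coupled with their significant domain shift from natural images, hinders the use of cutting-edge Vision Foundation Models (VFMs) for medical image embedding. To jointly address the challenges, we propose a novel Early Intervention (EI) framework. Treating one modality as target and the rest as reference, EI harnesses high-level semantic tokens from the reference as intervention tokens to steer the target modality's embedding process at an early stage. Furthermore, we introduce Mixture of Low-varied-Ranks Adaptation (MoR), a parameter-efficient fine-tuning method that employs a set of low-rank adapters with varied ranks and a weight-relaxed router for VFM adaptation. Extensive experiments on three public datasets for retinal disease, skin lesion, and knee anomaly classification verify the effectiveness of the proposed method against a number of competitive baselines.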

[133] UniSem: Generalizable Semantic 3D Reconstruction from Sparse Unposed Images

Guibiao Liao,Qian Ren,Kaimin Liao,Hua Wang,Zhi Chen,Luchao Wang,Yaohua Tang

Main category: cs.CV

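TL;DR: This paper proposes the UniSem framework, which uses two components, Error-aware Gaussian Dropout and a Mix-training Curriculum, to jointly improve the depth estimation accuracy of 3D Gaussian Splatting from sparse images and the generalization of open-vocabulary 3D semantic segmentation.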

Details Motivation: Existing semantic-aware 3D reconstruction methods based on feed-forward 3D Gaussian Splatting suffer, under sparse unposed images, from redundant Gaussian primitives that cause unstable geometry and poor depth quality, and from relying solely on 2D segmenter features for semantic lifting, which yields incomplete 3D semantics and weak generalization. Method: Propose the unified UniSem framework: 1) Error-aware Gaussian Dropout (EGD), which uses rendering error cues to suppress redundant Gaussians, improving geometric stability and depth accuracy; 2) a Mix-training Curriculum (MTC), which progressively blends 2D segmenter-lifted semantics with the model's own emergent 3D semantic priors and uses object-level prototype alignment to enhance semantic coherence and completeness. Result: Experiments on ScanNet and Replica show that UniSem outperforms strong baselines across different numbers of input views: with 16 views, the relative depth error (Rel) drops by 15.2% and the open-vocabulary 3D segmentation mean accuracy (mAcc) improves by 3.7%. Conclusion: UniSem addresses the twin challenges of unstable geometry and limited semantic generalization in sparse-view 3D Gaussian Splatting and offers a new paradigm for lightweight semantic-aware 3D reconstruction. Abstract: Semantic-aware 3D reconstruction from sparse, unposed images remains challenging for feed-forward 3D Gaussian Splatting (3DGS). Existing methods often predict an over-complete set of Gaussian primitives under sparse-view supervision, leading to unstable geometry and inferior depth quality. Meanwhile, they rely solely on 2D segmenter features for semantic lifting, which provides weak 3D-level and limited generalizable supervision, resulting in incomplete 3D semantics in novel scenes. To address these issues, we propose UniSem, a unified framework that jointly improves depth accuracy and semantic generalization via two key components. First, Error-aware Gaussian Dropout (EGD) performs error-guided capacity control by suppressing redundancy-prone Gaussians using rendering error cues, producing meaningful, geometrically stable Gaussian representations for improved depth estimation. Second, we introduce a Mix-training Curriculum (MTC) that progressively blends 2D segmenter-lifted semantics with the model's own emergent 3D semantic priors, implemented with object-level prototype alignment to enhance semantic coherence and completeness. Extensive experiments on ScanNet and Replica show that UniSem achieves superior performance in depth prediction and open-vocabulary 3D segmentation across varying numbers of input views. Notably, with 16-view inputs, UniSem reduces depth Rel by 15.2% and improves open-vocabulary segmentation mAcc by 3.7% over strong baselines.

[134] PCA-Seg: Revisiting Cost Aggregation for Open-Vocabulary Semantic and Part Segmentation

Jianjian Yin,Tao Chen,Yi Chen,Gensheng Pei,Xiangbo Shu,Yazhou Yao,Fumin Shen

Main category: cs.CV

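TL;DR: This paper proposes a parallel cost aggregation (PCA-Seg) paradigm that uses an expert-driven perceptual learning (EPL) module and a feature orthogonalization decoupling (FOD) strategy to alleviate knowledge interference between semantics and spatial context in vision-language alignment, markedly improving open-vocabulary semantic and part segmentation (OSPS) performance.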

Details Motivation: Existing VLM-based methods for open-vocabulary semantic and part segmentation aggregate spatial and class information serially, which causes knowledge interference between class-level semantics and spatial context. Method: Propose the parallel cost aggregation (PCA-Seg) paradigm; design an expert-driven perceptual learning (EPL) module with a multi-expert parser and a coefficient mapper; introduce a feature orthogonalization decoupling (FOD) strategy to reduce redundancy between the semantic and contextual streams. Result: Achieves state-of-the-art OSPS performance on eight benchmarks, with each parallel block adding only 0.35M parameters. Conclusion: A parallel structure with orthogonalized decoupling models vision-language alignment more effectively, improving segmentation accuracy and generalization while keeping the model lightweight. Abstract: Recent advances in vision-language models (VLMs) have garnered substantial attention in open-vocabulary semantic and part segmentation (OSPS). However, existing methods extract image-text alignment cues from cost volumes through a serial structure of spatial and class aggregations, leading to knowledge interference between class-level semantics and spatial context. Therefore, this paper proposes a simple yet effective parallel cost aggregation (PCA-Seg) paradigm to alleviate the above challenge, enabling the model to capture richer vision-language alignment information from cost volumes. Specifically, we design an expert-driven perceptual learning (EPL) module that efficiently integrates semantic and contextual streams. It incorporates a multi-expert parser to extract complementary features from multiple perspectives. In addition, a coefficient mapper is designed to adaptively learn pixel-specific weights for each feature, enabling the integration of complementary knowledge into a unified and robust feature embedding. Furthermore, we propose a feature orthogonalization decoupling (FOD) strategy to mitigate redundancy between the semantic and contextual streams, which allows the EPL module to learn diverse knowledge from orthogonalized features. Extensive experiments on eight benchmarks show that each parallel block in PCA-Seg adds merely 0.35M parameters while achieving state-of-the-art OSPS performance.

[135] MM-OVSeg: Multimodal Optical-SAR Fusion for Open-Vocabulary Segmentation in Remote Sensing

Yimin Wei,Aoran Xiao,Hongruixuan Chen,Junshi Xia,Naoto Yokoya

Main category: cs.CV

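TL;DR: This paper proposes MM-OVSeg, a multimodal open-vocabulary segmentation framework that fuses optical and SAR imagery to improve the robustness and generalization of remote sensing segmentation under adverse weather such as cloud and haze.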

Details Motivation: Existing open-vocabulary segmentation methods rely mainly on clear-sky optical data and perform poorly under cloud, haze, and similar interference, whereas remote sensing applications need robust all-weather pixel-level semantic understanding. Method: Propose a cross-modal unification process that aligns optical and SAR features, and design a dual-encoder fusion module that integrates hierarchical features from multiple vision foundation models for text-aligned multimodal segmentation. Result: Across diverse cloud-cover conditions, MM-OVSeg shows superior robustness and generalization and outperforms existing methods. Conclusion: Complementary fusion of optical and SAR modalities mitigates weather interference and offers a reliable, scalable paradigm for open-vocabulary remote sensing segmentation. Abstract: Open-vocabulary segmentation enables pixel-level recognition from an open set of textual categories, allowing generalization beyond fixed classes. Despite great potential in remote sensing, progress in this area remains largely limited to clear-sky optical data and struggles under cloudy or haze-contaminated conditions. We present MM-OVSeg, a multimodal Optical-SAR fusion framework for resilient open-vocabulary segmentation under adverse weather conditions. MM-OVSeg leverages the complementary strengths of the two modalities--optical imagery provides rich spectral semantics, while synthetic aperture radar (SAR) offers cloud-penetrating structural cues. To address the cross-modal domain gap and the limited dense prediction capability of current vision-language models, we propose two key designs: a cross-modal unification process for multi-sensor representation alignment, and a dual-encoder fusion module that integrates hierarchical features from multiple vision foundation models for text-aligned multimodal segmentation. Extensive experiments demonstrate that MM-OVSeg achieves superior robustness and generalization across diverse cloud conditions. The source dataset and code are available here.

[136] AdapTS: Lightweight Teacher-Student Approach for Multi-Class and Continual Visual Anomaly Detection

Manuel Barusco,Davide Dalle Pezze,Francesco Borsatti,Gian Antonio Susto

Main category: cs.CV

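TL;DR: This paper proposes AdapTS, a lightweight teacher-student framework for multi-class and continual learning scenarios. It uses a shared frozen backbone with pluggable adapters for efficient edge deployment, sharply reducing memory overhead while maintaining performance.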

Details Motivation: Most existing visual anomaly detection methods are limited to single-category scenarios and cannot meet the multi-class and continual learning demands of industrial environments; teacher-student architectures are efficient but remain unexplored in the continual setting. Method: Propose the AdapTS framework: a single shared frozen backbone with lightweight trainable adapters injected into the student pathway; training enhanced with a segmentation-guided loss and synthetic Perlin noise; a prototype-based task identification mechanism that dynamically selects adapters at inference with 99% accuracy. Result: On MVTec AD and VisA, AdapTS matches the performance of existing TS methods in multi-class and continual learning tasks; the lightest variant, AdapTS-S, needs only 8 MB of additional memory, far less than STFPM (95 MB), RD4AD (360 MB), and DeSTSeg (1120 MB). Conclusion: AdapTS is an efficient, scalable solution for multi-class continual visual anomaly detection on edge devices, balancing performance and resource efficiency. Abstract: Visual Anomaly Detection (VAD) is crucial for industrial inspection, yet most existing methods are limited to single-category scenarios, failing to address the multi-class and continual learning demands of real-world environments. While Teacher-Student (TS) architectures are efficient, they remain unexplored for the Continual Setting. To bridge this gap, we propose AdapTS, a unified TS framework designed for multi-class and continual settings, optimized for edge deployment. AdapTS eliminates the need for two different architectures by utilizing a single shared frozen backbone and injecting lightweight trainable adapters into the student pathway. Training is enhanced via a segmentation-guided objective and synthetic Perlin noise, while a prototype-based task identification mechanism dynamically selects adapters at inference with 99% accuracy. Experiments on MVTec AD and VisA demonstrate that AdapTS matches the performance of existing TS methods across multi-class and continual learning scenarios, while drastically reducing memory overhead. Our lightest variant, AdapTS-S, requires only 8 MB of additional memory, 13x less than STFPM (95 MB), 48x less than RD4AD (360 MB), and 149x less than DeSTSeg (1120 MB), making it a highly scalable solution for edge deployment in complex industrial environments.
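The prototype-based task identification mentioned in the abstract can be illustrated with a nearest-prototype lookup. The sketch below is a minimal illustration under stated assumptions, not the AdapTS implementation: the function names, the use of mean backbone features as prototypes, the cosine similarity measure, and the toy feature vectors are all hypothetical.

```python
import math

def mean_vector(vectors):
    """Average equal-length feature vectors into a single prototype."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def build_prototypes(features_by_task):
    """One prototype per task (assumption: the mean of that task's features)."""
    return {task: mean_vector(feats) for task, feats in features_by_task.items()}

def identify_task(feature, prototypes):
    """Return the task whose prototype is most similar to the query feature."""
    return max(prototypes, key=lambda task: cosine(feature, prototypes[task]))

# Toy features standing in for frozen-backbone embeddings (hypothetical values).
features_by_task = {
    "bottle": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]],
    "cable": [[0.0, 1.0, 0.2], [0.1, 0.8, 0.3]],
}
prototypes = build_prototypes(features_by_task)
selected = identify_task([0.8, 0.3, 0.0], prototypes)
print(selected)  # name of the adapter that would be loaded for this input
```

In a pipeline of this shape, the selected task name would key into a table of per-task adapters, so only one small adapter needs to be active at inference time.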

[137] Rel-Zero: Harnessing Patch-Pair Invariance for Robust Zero-Watermarking Against AI Editing

Pengzhen Chen,Yanwei Liu,Xiaoyan Gu,Xiaojun Chen,Wu Liu,Weiping Wang

Main category: cs.CV

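TL;DR: This paper proposes a zero-watermarking framework called Relational Zero-Watermarking (Rel-Zero), which exploits the invariance of relational distances between image patches under AI editing to derive a robust zero-watermark without modifying the original image, substantially improving resistance to a range of AI editing operations.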

Details Motivation: Diffusion-based image editing threatens the authenticity of digital content; traditional embedding-based watermarks degrade visual quality, and existing zero-watermarking methods rely on global features and withstand editing poorly. Method: Observe that the relational distance between patch pairs stays relatively invariant under AI editing, and design the Rel-Zero framework accordingly: it derives an editing-invariant zero-watermark from patch relations without modifying the original image, authenticating content by intrinsic structural consistency rather than absolute appearance. Result: Across diverse AI editing models and operations, Rel-Zero shows substantially improved robustness over prior zero-watermarking methods. Conclusion: Rel-Zero provides a non-invasive, robust content authentication mechanism and a new approach to the authenticity challenges raised by AI-generated and AI-edited content. Abstract: Recent advancements in diffusion-based image editing pose a significant threat to the authenticity of digital visual content. Traditional embedding-based watermarking methods often introduce perceptible perturbations to maintain robustness, inevitably compromising visual fidelity. Meanwhile, existing zero-watermarking approaches, typically relying on global image features, struggle to withstand sophisticated manipulations. In this work, we uncover a key observation: while individual image patches undergo substantial alterations during AI-based editing, the relational distance between patch pairs remains relatively invariant. Leveraging this property, we propose Relational Zero-Watermarking (Rel-Zero), a novel framework that requires no modification to the original image but derives a unique zero-watermark from these editing-invariant patch relations. By grounding the watermark in intrinsic structural consistency rather than absolute appearance, Rel-Zero provides a non-invasive yet resilient mechanism for content authentication. Extensive experiments demonstrate that Rel-Zero achieves substantially improved robustness across diverse editing models and manipulations compared to prior zero-watermarking approaches.
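The idea of deriving a zero-watermark from patch-pair relations rather than from absolute appearance can be sketched on a toy image. This is an illustrative simplification and not the Rel-Zero method: the patch descriptor (mean intensity), the distance (absolute difference), the median thresholding, the XOR key construction, and every function name are assumptions. The "edit" here is a uniform brightness shift, which changes every pixel while leaving all patch-pair distances unchanged.

```python
import random

def patch_means(image, patch):
    """Mean intensity of each non-overlapping patch x patch block (row-major)."""
    h, w = len(image), len(image[0])
    means = []
    for r in range(0, h - patch + 1, patch):
        for c in range(0, w - patch + 1, patch):
            block = [image[r + i][c + j] for i in range(patch) for j in range(patch)]
            means.append(sum(block) / len(block))
    return means

def relation_bits(image, patch, pairs):
    """One bit per patch pair: 1 if the pair's distance exceeds the median distance."""
    m = patch_means(image, patch)
    dists = [abs(m[a] - m[b]) for a, b in pairs]
    threshold = sorted(dists)[len(dists) // 2]
    return [1 if d > threshold else 0 for d in dists]

def make_key(image, watermark, patch, pairs):
    """Zero-watermark key: relation bits XOR watermark; the image is not modified."""
    return [b ^ w for b, w in zip(relation_bits(image, patch, pairs), watermark)]

def recover(image, key, patch, pairs):
    """Recover the watermark from a (possibly edited) image and the stored key."""
    return [b ^ k for b, k in zip(relation_bits(image, patch, pairs), key)]

rng = random.Random(0)
image = [[rng.randrange(200) for _ in range(8)] for _ in range(8)]
pairs = [(rng.randrange(16), rng.randrange(16)) for _ in range(8)]  # 16 patches of 2x2
watermark = [1, 0, 1, 1, 0, 0, 1, 0]
key = make_key(image, watermark, 2, pairs)

edited = [[p + 10 for p in row] for row in image]  # every pixel changes
recovered = recover(edited, key, 2, pairs)
print(recovered == watermark)
```

Real AI edits are far less benign than a brightness shift, so a practical scheme would need learned patch features and a tolerance for some bit errors during verification.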

[138] Learning Coordinate-based Convolutional Kernels for Continuous SE(3) Equivariant and Efficient Point Cloud Analysis

Jaein Kim,Hee Bin Yoo,Dong-Sig Han,Byoung-Tak Zhang

Main category: cs.CV

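TL;DR: This paper proposes ECKConv, an equivariant kernel convolution method built on coordinate-based networks that achieves rigorous equivariance on the SE(3) group while remaining scalable and efficient, and performs strongly across a range of point cloud tasks.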

Details Motivation: Existing group convolution methods struggle to guarantee both rigorous SE(3) symmetry and large-scale scalability, so a more advanced kernel architecture is needed to resolve this trade-off. Method: Propose Equivariant Coordinate-based Kernel Convolution (ECKConv), which defines the kernel domain in a double coset space to obtain SE(3) equivariance and adopts an explicit kernel design using coordinate-based networks to improve learning capability and memory efficiency. Result: On point cloud classification, pose registration, part segmentation, and large-scale semantic segmentation, ECKConv shows rigorous rigid equivariance, good memory scalability, and performance that surpasses state-of-the-art equivariant methods. Conclusion: Through the intertwiner framework and a coordinate-based kernel design, ECKConv bridges the gap between rigorous SE(3) equivariance and practical scalability, offering an efficient and theoretically grounded paradigm for 3D point cloud learning. Abstract: A symmetry on rigid motion is one of the salient factors in efficient learning of 3D point cloud problems. Group convolution has been a representative method to extract equivariant features, but its realizations have struggled to retain both rigorous symmetry and scalability simultaneously. We advocate utilizing the intertwiner framework to resolve this trade-off, but previous works on it, which did not achieve complete SE(3) symmetry or scalability to large-scale problems, necessitate a more advanced kernel architecture. We present Equivariant Coordinate-based Kernel Convolution, or ECKConv. It acquires SE(3) equivariance from the kernel domain defined in a double coset space, and its explicit kernel design using coordinate-based networks enhances its learning capability and memory efficiency. The experiments on diverse point cloud tasks, e.g., classification, pose registration, part segmentation, and large-scale semantic segmentation, validate the rigid equivariance, memory scalability, and outstanding performance of ECKConv compared to state-of-the-art equivariant methods.

[139] Temporal Gains, Spatial Costs: Revisiting Video Fine-Tuning in Multimodal Large Language Models

Linghao Zhang,Jungang Li,Yonghua Hei,Sicheng Tao,Song Dai,Yibo Yan,Zihao Dongfang,Weiting Liu,Chenxi Qin,Hanqian Li,Xin Zou,Jiahao Zhang,Shuhang Xun,Haiyun Jiang,Xuming Hu

Main category: cs.CV

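TL;DR: This paper systematically studies the fine-grained effect of video supervised fine-tuning (Video-SFT) on how the visual capabilities of multimodal large language models (MLLMs) evolve. It finds that Video-SFT improves video understanding but often harms static image understanding, that this trade-off is closely tied to the number of sampled frames (the temporal budget), and it proposes an instruction-aware hybrid-frame strategy to mitigate the problem.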

Details Motivation: Existing work lacks an understanding of how Video-SFT shapes the balance between spatial and temporal visual capabilities in MLLMs, and the tendency for spatial understanding to weaken during joint image-video training needs investigation. Method: Systematically evaluate the effect of Video-SFT on video and static image benchmarks across architectures, parameter scales, and frame sampling settings; analyze the relationship between temporal budget (frame count) and performance; propose and validate an instruction-aware Hybrid-Frame strategy that allocates frames adaptively. Result: Video-SFT generally improves video performance but often leaves static image performance flat or degraded; increasing the frame count improves video performance but not image performance; the Hybrid-Frame strategy partially mitigates the image-video trade-off. Conclusion: Video-SFT is not a "free lunch" for the visual capabilities of MLLMs, and preserving spatial understanding while strengthening video capability remains a central challenge in joint image-video training. Abstract: Multimodal large language models (MLLMs) are typically trained in multiple stages, with video-based supervised fine-tuning (Video-SFT) serving as a key step for improving visual understanding. Yet its effect on the fine-grained evolution of visual capabilities, particularly the balance between spatial and temporal understanding, remains poorly understood. In this paper, we systematically study how Video-SFT reshapes visual capabilities in MLLMs. Across architectures, parameter scales, and frame sampling settings, we observe a consistent pattern: Video-SFT reliably improves video performance, but often yields limited gains or even degradation on static image benchmarks. We further show that this trade-off is closely tied to temporal budget: increasing the number of sampled frames generally improves video performance, but does not reliably improve static image performance. Motivated by this finding, we study an instruction-aware Hybrid-Frame strategy that adaptively allocates frame counts and partially mitigates the image-video trade-off. Our results indicate that Video-SFT is not a free lunch for MLLMs, and preserving spatial understanding remains a central challenge in joint image-video training.

[140] ProGVC: Progressive-based Generative Video Compression via Auto-Regressive Context Modeling

Daowen Li,Ruixiao Dong,Ying Chen,Kai Li,Ding Ding,Li Li

Main category: cs.CV

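TL;DR: ProGVC is a generative video compression framework based on progressive transmission. It uses multi-scale residual token maps and a Transformer autoregressive context model to unify flexible rate adaptation, efficient entropy coding, and detail synthesis.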

Details Motivation: Existing perceptual video codecs lack native support for variable bitrate and progressive delivery, and their generative modules are only weakly coupled with entropy coding, which limits bitrate reduction. Method: Propose the ProGVC framework: encode video into hierarchical multi-scale residual token maps that support coarse-to-fine progressive transmission; use a Transformer-based multi-scale autoregressive context model to estimate token probabilities, which serve both entropy coding and decoder-side prediction of truncated fine-scale tokens to restore perceptual details. Result: Experiments show that ProGVC delivers strong perceptual compression performance at low bitrates while also offering practical scalability. Conclusion: As a new coding paradigm, ProGVC combines progressive transmission, efficient entropy coding, and generative detail synthesis, improving the practicality and performance of low-bitrate video compression. Abstract: Perceptual video compression leverages generative priors to reconstruct realistic textures and motions at low bitrates. However, existing perceptual codecs often lack native support for variable bitrate and progressive delivery, and their generative modules are weakly coupled with entropy coding, limiting bitrate reduction. Inspired by the next-scale prediction in the Visual Auto-Regressive (VAR) models, we propose ProGVC, a Progressive-based Generative Video Compression framework that unifies progressive transmission, efficient entropy coding, and detail synthesis within a single codec. ProGVC encodes videos into hierarchical multi-scale residual token maps, enabling flexible rate adaptation by transmitting a coarse-to-fine subset of scales in a progressive manner. A Transformer-based multi-scale autoregressive context model estimates token probabilities, utilized both for efficient entropy coding of the transmitted tokens and for predicting truncated fine-scale tokens at the decoder to restore perceptual details. Extensive experiments demonstrate that as a new coding paradigm, ProGVC delivers promising perceptual compression performance at low bitrates while offering practical scalability at the same time.

[141] Prompt-Free Universal Region Proposal Network

Qihong Tang,Changhan Liu,Shaofeng Zhang,Wenbin Li,Qi Fan,Yang Gao

Main category: cs.CV

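TL;DR: This paper proposes PF-RPN, a universal region proposal network that needs no external prompts (such as images or text). Three modules (SIA, CSP, CG-QS) enable flexible, adaptive localization of potential objects, and the model can be trained on a small amount of data and generalizes to many detection scenarios.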

Details Motivation: Existing methods rely on exemplar images, predefined categories, or textual descriptions to localize potential objects, which limits flexibility and adaptability in real-world scenarios. Method: Propose the Prompt-Free Universal Region Proposal Network (PF-RPN), comprising: 1) the Sparse Image-Aware Adapter (SIA) module, which performs initial localization using a learnable query embedding that is dynamically updated; 2) the Cascade Self-Prompt (CSP) module, which finds the remaining objects by aggregating visual features in a self-prompted cascade; 3) the Centerness-Guided Query Selection (CG-QS) module, which selects high-quality query embeddings using centerness scores. Result: Validated on 19 datasets; trainable with only 5% of MS COCO data; directly applicable without fine-tuning to domains such as underwater detection, industrial defect detection, and remote sensing image detection. Conclusion: PF-RPN delivers prompt-free, lightweight, and strongly generalizing region proposals, offering a new paradigm for open-world object recognition. Abstract: Identifying potential objects is critical for object recognition and analysis across various computer vision applications. Existing methods typically localize potential objects by relying on exemplar images, predefined categories, or textual descriptions. However, their reliance on image and text prompts often limits flexibility, restricting adaptability in real-world scenarios. In this paper, we introduce a novel Prompt-Free Universal Region Proposal Network (PF-RPN), which identifies potential objects without relying on external prompts. First, the Sparse Image-Aware Adapter (SIA) module performs initial localization of potential objects using a learnable query embedding dynamically updated with visual features. Next, the Cascade Self-Prompt (CSP) module identifies the remaining potential objects by leveraging the self-prompted learnable embedding, autonomously aggregating informative visual features in a cascading manner. Finally, the Centerness-Guided Query Selection (CG-QS) module facilitates the selection of high-quality query embeddings using a centerness scoring network. Our method can be optimized with limited data (e.g., 5% of MS COCO data) and applied directly to various object detection application domains for identifying potential objects without fine-tuning, such as underwater object detection, industrial defect detection, and remote sensing image object detection. Experimental results across 19 datasets validate the effectiveness of our method. Code is available at https://github.com/tangqh03/PF-RPN.

[142] FrescoDiffusion: 4K Image-to-Video with Prior-Regularized Tiled Diffusion

Hugo Caselles-Dupré,Mathis Koroglu,Guillaume Jeanneret,Arnaud Dapogny,Matthieu Cord

Main category: cs.CV

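TL;DR: This paper proposes FrescoDiffusion, a training-free image-to-video generation method that introduces a precomputed low-resolution latent prior into tiled denoising, improving global spatio-temporal consistency and local detail preservation in ultra-high-resolution (e.g., 4K) video generation. It is especially suited to animating structurally complex frescoes.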

Details Motivation: Existing diffusion models struggle to balance global layout consistency with local detail in ultra-high-resolution image-to-video generation; in fresco animation in particular, where images contain many characters and sub-scenes, tiled generation easily breaks spatio-temporal coherence. Method: Propose the training-free FrescoDiffusion method: first generate a low-resolution video and upsample its latent trajectory as a global prior; during 4K tiled denoising, at every step fuse the per-tile noise predictions with this prior in model-output space through a weighted least-squares objective, and introduce a spatial regularization variable for region-level motion control. Result: On VBench-I2V and a self-built fresco I2V dataset, the method improves global consistency and visual fidelity over tiled baselines while remaining computationally efficient, and it supports explicit control of the trade-off between creativity and consistency. Conclusion: FrescoDiffusion addresses the global-local coordination problem in ultra-high-resolution I2V generation and offers a practical, controllable, and efficient training-free paradigm for generating high-quality video from complex static images. Abstract: Diffusion-based image-to-video (I2V) models are increasingly effective, yet they struggle to scale to ultra-high-resolution inputs (e.g., 4K). Generating videos at the model's native resolution often loses fine-grained structure, whereas high-resolution tiled denoising preserves local detail but breaks global layout consistency. This failure mode is particularly severe in the fresco animation setting: monumental artworks containing many distinct characters, objects, and semantically different sub-scenes that must remain spatially coherent over time. We introduce FrescoDiffusion, a training-free method for coherent large-format I2V generation from a single complex image. The key idea is to augment tiled denoising with a precomputed latent prior: we first generate a low-resolution video at the underlying model resolution and upsample its latent trajectory to obtain a global reference that captures long-range temporal and spatial structure. For 4K generation, we compute per-tile noise predictions and fuse them with this reference at every diffusion timestep by minimizing a single weighted least-squares objective in model-output space. The objective combines a standard tile-merging criterion with our regularization term, yielding a closed-form fusion update that strengthens global coherence while retaining fine detail. We additionally provide a spatial regularization variable that enables region-level control over where motion is allowed.
Experiments on the VBench-I2V dataset and our proposed fresco I2V dataset show improved global consistency and fidelity over tiled baselines, while being computationally efficient. Our regularization enables explicit controllability of the trade-off between creativity and consistency.
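The closed-form fusion described in the abstract can be illustrated in one dimension. The sketch below assumes a specific per-position objective, J(x) = sum_i w_i (x - e_i)^2 + lam (x - p)^2, where e_i are the overlapping tile predictions, w_i their blending weights, p the upsampled prior, and lam a per-position regularization strength; its minimizer is x = (sum_i w_i e_i + lam p) / (sum_i w_i + lam). The exact objective, the weights, and all names here are assumptions made for illustration and are not taken from the paper.

```python
def fuse_tiles(length, tiles, prior, lam):
    """Fuse overlapping 1D tile predictions with a global prior.

    tiles: list of (start, predictions, weights) triples.
    prior: list of `length` reference values (e.g. an upsampled low-res signal).
    lam:   list of `length` non-negative regularization strengths; a larger value
           pulls that position toward the prior (assumed region-level control).
    Every position must have a positive total weight (tile weights plus lam).
    """
    num = [lam[i] * prior[i] for i in range(length)]
    den = [lam[i] for i in range(length)]
    for start, preds, weights in tiles:
        for k, (e, w) in enumerate(zip(preds, weights)):
            num[start + k] += w * e
            den[start + k] += w
    # Per-position minimizer of the weighted least-squares objective.
    return [n / d for n, d in zip(num, den)]

tiles = [
    (0, [1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]),  # tile covering positions 0..3
    (2, [3.0, 3.0, 3.0, 3.0], [1.0, 1.0, 1.0, 1.0]),  # tile covering positions 2..5
]
prior = [2.0] * 6

fused_no_prior = fuse_tiles(6, tiles, prior, [0.0] * 6)    # plain tile averaging
fused_with_prior = fuse_tiles(6, tiles, prior, [2.0] * 6)  # pulled toward the prior
print(fused_no_prior)
print(fused_with_prior)
```

With lam set to zero the update reduces to ordinary weighted tile merging; raising lam in a region trades local tile detail for agreement with the global reference there.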

[143] Face anonymization preserving facial expressions and photometric realism

Luigi Celona,Simone Bianco,Raimondo Schettini

Main category: cs.CV

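TL;DR: This paper proposes a feature-preserving face anonymization framework that extends DeepPrivacy with dense facial landmarks to retain expressions and adds lightweight post-processing modules to keep lighting direction and skin tone consistent. It also designs new evaluation metrics that quantify expression fidelity, lighting consistency, and color preservation; experiments on CelebA-HQ show the method outperforms existing state-of-the-art approaches.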

Details Motivation: Existing generative face anonymization methods focus on identity removal and image realism and neglect expression and photometric attributes such as illumination and skin tone that matter for downstream tasks (e.g., relighting, medical analysis), making it hard to balance privacy protection with utility. Method: Extend DeepPrivacy with dense facial landmark guidance to better retain expressions; add lightweight post-processing modules that enforce consistency in lighting direction and skin tone; design dedicated evaluation metrics that quantify expression fidelity, lighting consistency, and color preservation. Result: Experiments on CelebA-HQ show that the proposed method outperforms state-of-the-art baselines in image realism, expression fidelity, lighting consistency, and skin tone preservation. Conclusion: Feature-aware anonymization is a key step toward more useful, fair, and trustworthy facial data and supports a more responsible approach to privacy protection. Abstract: The widespread sharing of face images on social media platforms and in large-scale datasets raises pressing privacy concerns, as biometric identifiers can be exploited without consent. Face anonymization seeks to generate realistic facial images that irreversibly conceal the subject's identity while preserving their usefulness for downstream tasks. However, most existing generative approaches focus on identity removal and image realism, often neglecting facial expressions as well as photometric consistency -- specifically attributes such as illumination and skin tone -- that are critical for applications like relighting, color constancy, and medical or affective analysis. In this work, we propose a feature-preserving anonymization framework that extends DeepPrivacy by incorporating dense facial landmarks to better retain expressions, and by introducing lightweight post-processing modules that ensure consistency in lighting direction and skin color. We further establish evaluation metrics specifically designed to quantify expression fidelity, lighting consistency, and color preservation, complementing standard measures of image realism, pose accuracy, and re-identification resistance. Experiments on the CelebA-HQ dataset demonstrate that our method produces anonymized faces with improved realism and significantly higher fidelity in expression, illumination, and skin tone compared to state-of-the-art baselines. These results underscore the importance of feature-aware anonymization as a step toward more useful, fair, and trustworthy privacy-preserving facial data.

[144] PanoVGGT: Feed-Forward 3D Reconstruction from Panoramic Imagery

Yijing Guo,Mengjun Chao,Luo Wang,Tianyang Zhao,Haizhao Dai,Yingliang Zhang,Jingyi Yu,Yujiao Shi

Main category: cs.CV

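TL;DR: This paper proposes PanoVGGT, a permutation-equivariant Transformer framework for panoramic imagery that jointly estimates camera poses, depth maps, and 3D point clouds, and it contributes PanoCity, a large-scale outdoor panoramic dataset.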

Details Motivation: Panoramic images exhibit non-pinhole distortions, so existing feed-forward models built for perspective cameras generalize poorly and struggle with joint pose estimation and 3D reconstruction. Method: Propose the PanoVGGT framework, which introduces spherical-aware positional embeddings, a panorama-specific three-axis SO(3) rotation augmentation, and a stochastic anchoring strategy during training to resolve global-frame ambiguity. Result: Experiments on the self-built PanoCity dataset and standard benchmarks show that PanoVGGT achieves competitive accuracy, strong robustness, and improved cross-domain generalization. Conclusion: PanoVGGT provides an effective paradigm for joint geometric understanding from panoramic imagery, and the accompanying dataset and code are intended to advance the field. Abstract: Panoramic imagery offers a full 360° field of view and is increasingly common in consumer devices. However, it introduces non-pinhole distortions that challenge joint pose estimation and 3D reconstruction. Existing feed-forward models, built for perspective cameras, generalize poorly to this setting. We propose PanoVGGT, a permutation-equivariant Transformer framework that jointly predicts camera poses, depth maps, and 3D point clouds from one or multiple panoramas in a single forward pass. The model incorporates spherical-aware positional embeddings and a panorama-specific three-axis SO(3) rotation augmentation, enabling effective geometric reasoning in the spherical domain. To resolve inherent global-frame ambiguity, we further introduce a stochastic anchoring strategy during training. In addition, we contribute PanoCity, a large-scale outdoor panoramic dataset with dense depth and 6-DoF pose annotations. Extensive experiments on PanoCity and standard benchmarks demonstrate that PanoVGGT achieves competitive accuracy, strong robustness, and improved cross-domain generalization. Code and dataset will be released.

[145] LoGSAM: Parameter-Efficient Cross-Modal Grounding for MRI Segmentation

Mohammad Robaitul Islam Bhuiyan,Sheethal Bhat,Melika Qahqaie,Tri-Thien Nguyen,Paula Andrea Pérez Toro,Tomas Arias Vergara,Andreas Maier

Main category: cs.CV

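TL;DR: LoGSAM is a parameter-efficient, detection-driven framework that turns radiologist dictation into text prompts for foundation-model-based brain tumor localization and segmentation, reaching an 80.32% Dice score on BRISC 2025 without additional fine-tuning of the segmentation model.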

Details Motivation: Existing MRI brain tumor segmentation methods depend on large annotated datasets and task-specific supervised models, yet clinical annotations are scarce; a new paradigm is needed that works with few samples, few parameter updates, and the clinical workflow (e.g., dictated reports). Method: Propose LoGSAM: 1) transcribe dictation with Whisper and extract tumor text prompts with clinical NLP (including negation detection); 2) adapt Grounding DINO with LoRA (only 5% of parameters) for text-guided tumor localization; 3) feed the predicted boxes as prompts to a frozen MedSAM to generate pixel-level masks. Result: Achieves 80.32% Dice on BRISC 2025 and 91.7% case-level accuracy on German dictations for 12 unseen MRI scans. Conclusion: The work demonstrates the feasibility of building a modular speech-to-segmentation pipeline from pretrained multimodal foundation models, achieving high-performing, clinically deployable brain tumor analysis with minimal parameter updates. Abstract: Precise localization and delineation of brain tumors using Magnetic Resonance Imaging (MRI) are essential for planning therapy and guiding surgical decisions. However, most existing approaches rely on task-specific supervised models and are constrained by the limited availability of annotated data. To address this, we propose LoGSAM, a parameter-efficient, detection-driven framework that transforms radiologist dictation into text prompts for foundation-model-based localization and segmentation. Radiologist speech is first transcribed and translated using a pretrained Whisper ASR model, followed by negation-aware clinical NLP to extract tumor-specific textual prompts. These prompts guide text-conditioned tumor localization via a LoRA-adapted vision-language detection model, Grounding DINO (GDINO). The LoRA adaptation updates only 5% of the model parameters, thereby enabling computationally efficient domain adaptation while preserving pretrained cross-modal knowledge. The predicted bounding boxes are used as prompts for MedSAM to generate pixel-level tumor masks without any additional fine-tuning. Conditioning the frozen MedSAM on LoGSAM-derived priors yields a state-of-the-art Dice score of 80.32% on BRISC 2025. In addition, we evaluate the full pipeline using German dictations from a board-certified radiologist on 12 unseen MRI scans, achieving 91.7% case-level accuracy. These results highlight the feasibility of constructing a modular, speech-to-segmentation pipeline by intelligently leveraging pretrained foundation models with minimal parameter updates.
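For readers unfamiliar with the Dice score reported above, the standard definition for binary masks is Dice = 2|P ∩ T| / (|P| + |T|), where P is the predicted mask and T the ground truth. The sketch below is a plain-Python illustration of that formula on toy masks; the function name and the convention of returning 1.0 when both masks are empty are choices made here and do not describe the paper's evaluation code.

```python
def dice_score(pred, target):
    """Dice coefficient for two binary masks given as lists of rows of 0/1."""
    intersection = 0
    total = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += p & t   # pixel counted only if set in both masks
            total += p + t          # |P| + |T|
    # Convention chosen here: two empty masks agree perfectly.
    return 1.0 if total == 0 else 2.0 * intersection / total

pred = [
    [0, 1, 1],
    [0, 1, 0],
]
target = [
    [0, 1, 0],
    [0, 1, 1],
]
score = dice_score(pred, target)
print(f"{score:.4f}")
```

Here the masks share 2 pixels out of 3 each, so the score is 2·2 / (3 + 3) = 2/3.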

[146] Edit-As-Act: Goal-Regressive Planning for Open-Vocabulary 3D Indoor Scene Editing

Seongrae Noh,SeungWon Seo,Gyeong-Moon Park,HyeongYeop Kang

Main category: cs.CV

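TL;DR: This paper proposes the Edit-As-Act framework, which models natural-language editing of 3D indoor scenes as a goal-based symbolic action planning problem. By designing the EditLang action language and a validation mechanism, it produces edits that are faithful to the instruction, semantically consistent, and physically plausible.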

Details Motivation: Existing open-vocabulary scene editing methods mostly rely on regeneration or image-space edits, which tend to disrupt spatial structure and physical consistency; the authors argue that editing should be the minimal sequence of actions that reaches the world state defined by the user instruction rather than a purely generative task. Method: Propose the Edit-As-Act framework: 1) predict symbolic goal predicates; 2) perform goal-regressive planning in EditLang, a custom PDDL-style action language that explicitly models geometric relations such as support, contact, and collision; 3) a language-driven planner proposes actions and a validator enforces goal-directedness, monotonicity, and physical feasibility. Result: On the E2A-Bench benchmark of 63 tasks across 9 indoor environments, Edit-As-Act significantly outperforms prior methods across all edit types and scene categories. Conclusion: Separating high-level symbolic reasoning from low-level generation is key to achieving instruction fidelity, semantic consistency, and physical plausibility together; Edit-As-Act provides a workable framework for this and opens a new paradigm for 3D scene editing. Abstract: Editing a 3D indoor scene from natural language is conceptually straightforward but technically challenging. Existing open-vocabulary systems often regenerate large portions of a scene or rely on image-space edits that disrupt spatial structure, resulting in unintended global changes or physically inconsistent layouts. These limitations stem from treating editing primarily as a generative task. We take a different view. A user instruction defines a desired world state, and editing should be the minimal sequence of actions that makes this state true while preserving everything else. This perspective motivates Edit-As-Act, a framework that performs open-vocabulary scene editing as goal-regressive planning in 3D space. Given a source scene and free-form instruction, Edit-As-Act predicts symbolic goal predicates and plans in EditLang, a PDDL-inspired action language that we design with explicit preconditions and effects encoding support, contact, collision, and other geometric relations. A language-driven planner proposes actions, and a validator enforces goal-directedness, monotonicity, and physical feasibility, producing interpretable and physically coherent transformations. By separating reasoning from low-level generation, Edit-As-Act achieves instruction fidelity, semantic consistency, and physical plausibility - three criteria that existing paradigms cannot satisfy together. On E2A-Bench, our benchmark of 63 editing tasks across 9 indoor environments, Edit-As-Act significantly outperforms prior approaches across all edit types and scene categories.
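Goal-regressive planning over actions with preconditions and effects can be illustrated with classical STRIPS-style regression. The sketch below is a generic textbook construction, not EditLang or the Edit-As-Act planner: the predicates, the two actions, and the depth-limited search are hypothetical. An action is regressed only if it achieves part of the goal (a rough analogue of goal-directedness) and deletes none of it (a rough analogue of monotonicity).

```python
from collections import namedtuple

# pre, add, delete are frozensets of ground predicates written as strings.
Action = namedtuple("Action", ["name", "pre", "add", "delete"])

def regress(goal, action):
    """Return the subgoal that must hold before `action` for `goal` to hold after it.

    Returns None if the action achieves no goal predicate or deletes one.
    """
    if not (action.add & goal) or (action.delete & goal):
        return None
    return (goal - action.add) | action.pre

def plan_backward(initial, goal, actions, max_depth=5):
    """Depth-limited backward search; returns action names in execution order, or None."""
    def search(subgoal, depth):
        if subgoal <= initial:
            return []  # every remaining predicate already holds in the scene
        if depth == 0:
            return None
        for action in actions:
            previous = regress(subgoal, action)
            if previous is None:
                continue
            prefix = search(previous, depth - 1)
            if prefix is not None:
                return prefix + [action.name]
        return None
    return search(frozenset(goal), max_depth)

# Hypothetical scene: a lamp on the floor and a book occupying the table.
initial = frozenset({"on(lamp, floor)", "on(book, table)"})
goal = {"on(lamp, table)"}
actions = [
    Action("remove_book",
           pre=frozenset({"on(book, table)"}),
           add=frozenset({"clear(table)"}),
           delete=frozenset({"on(book, table)"})),
    Action("place_lamp_on_table",
           pre=frozenset({"on(lamp, floor)", "clear(table)"}),
           add=frozenset({"on(lamp, table)"}),
           delete=frozenset({"on(lamp, floor)", "clear(table)"})),
]
plan = plan_backward(initial, goal, actions)
print(plan)
```

Searching backward from the goal keeps the plan focused on predicates the instruction actually requires, which is one reason regression suits edits that should leave the rest of the scene untouched.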

[147] Trust the Unreliability: Inward Backward Dynamic Unreliability Driven Coreset Selection for Medical Image Classification

Yan Liang,Ziyuan Yang,Zhuxin Lei,Mengyu Sun,Yingyu Chen,Yi Zhang

Main category: cs.CV

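TL;DR: This paper proposes Dynamic Unreliability-Driven Coreset Selection (DUCS) for medical imaging data. By measuring each sample's confidence fluctuation and forgetting frequency during training, it selects unreliable samples near the decision boundary, improving model performance on small data budgets and outperforming existing methods, especially at high compression rates.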

Details Motivation: Medical images exhibit large intra-class variation and high inter-class similarity, which limits conventional coreset selection. In addition, neural networks stably remember samples near class centers, yet those samples contribute little to modeling the decision boundary, whereas unreliable samples are more informative. Method: DUCS combines two mechanisms: (1) Inward Self-Awareness, which quantifies per-sample uncertainty from the evolution of confidence during training; and (2) Backward Memory Tracking, which counts how often a sample is forgotten to assess how well the model retains it. Samples with both large confidence fluctuation and frequent forgetting are selected to form the coreset. Result: Experiments on several public medical imaging datasets show that DUCS significantly outperforms state-of-the-art (SOTA) coreset methods at high compression rates, improving downstream training efficiency and generalization. Conclusion: Unreliable samples carry more boundary information; by dynamically modeling uncertainty and forgetting behavior, DUCS builds more effective medical imaging coresets and offers a new option for training medical AI under limited resources. Abstract: Efficiently managing and utilizing large-scale medical imaging datasets with limited resources presents significant challenges. While coreset selection helps reduce computational costs, its effectiveness in medical data remains limited due to inherent complexity, such as large intra-class variation and high inter-class similarity. To address this, we revisit the training process and observe that neural networks consistently produce stable confidence predictions and better remember samples near class centers in training. However, concentrating on these samples may complicate the modeling of decision boundaries. Hence, we argue that the more unreliable samples are, in fact, the more informative in helping build the decision boundary. Based on this, we propose the Dynamic Unreliability-Driven Coreset Selection (DUCS) strategy. Specifically, we introduce an inward-backward unreliability assessment perspective: 1) Inward Self-Awareness: The model introspects its behavior by analyzing the evolution of confidence during training, thereby quantifying the uncertainty of each sample. 2) Backward Memory Tracking: The model reflects on its training by tracking the frequency of forgetting samples, thus evaluating its retention ability for each sample. Next, we select unreliable samples that exhibit substantial confidence fluctuations and are repeatedly forgotten during training. This selection process ensures that the chosen samples are near the decision boundary, thereby aiding the model in refining the boundary.
Extensive experiments on public medical datasets demonstrate our superior performance compared to state-of-the-art (SOTA) methods, particularly at high compression rates.
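The selection rule described above (keep samples whose confidence fluctuates and that are repeatedly forgotten) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the use of standard deviation as the fluctuation measure, and the equal-weight sum of the two normalized scores are all assumptions.

```python
from statistics import pstdev


def unreliability_scores(confidences, correct):
    """Return (fluctuation, forgetting_count) per sample.

    confidences[i][t]: confidence in the true class of sample i at epoch t.
    correct[i][t]: whether sample i was classified correctly at epoch t.
    """
    scores = []
    for conf_hist, corr_hist in zip(confidences, correct):
        # Assumed fluctuation measure: population std of the confidence curve.
        fluctuation = pstdev(conf_hist)
        # A forgetting event is a correct -> incorrect transition.
        forgetting = sum(
            1 for prev, cur in zip(corr_hist, corr_hist[1:]) if prev and not cur
        )
        scores.append((fluctuation, forgetting))
    return scores


def select_coreset(confidences, correct, budget):
    """Pick the `budget` most unreliable samples (hypothetical equal weighting)."""
    scores = unreliability_scores(confidences, correct)
    max_fluct = max(f for f, _ in scores) or 1.0
    max_forget = max(g for _, g in scores) or 1
    combined = [f / max_fluct + g / max_forget for f, g in scores]
    ranked = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return sorted(ranked[:budget])


confidences = [
    [0.90, 0.95, 0.97, 0.98],  # stable, near a class center
    [0.40, 0.80, 0.30, 0.70],  # strongly fluctuating
    [0.50, 0.60, 0.55, 0.60],  # mildly fluctuating
]
correct = [
    [True, True, True, True],
    [False, True, False, True],
    [True, True, False, False],
]
print(select_coreset(confidences, correct, budget=2))
```

In this toy input the stable sample 0 is dropped, which mirrors the claim that samples near class centers add little boundary information.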

[148] ReLaGS: Relational Language Gaussian Splatting

Yaxu Xie,Abdalla Arafa,Alireza Javanmardi,Christen Millerdurai,Jia Cheng Hu,Shaoxiang Wang,Alain Pagani,Didier Stricker

Main category: cs.CV

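TL;DR: This paper proposes a unified 3D perception and reasoning framework that needs no scene-specific training. It builds a language-distilled Gaussian scene and a 3D semantic scene graph to support open-vocabulary segmentation, scene graph generation, and relation-guided retrieval.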

Details Motivation: Existing methods for unified 3D perception and reasoning (segmentation, retrieval, relation understanding) are limited: they are either object-centric or depend on costly training for inter-object reasoning. Method: The framework builds a hierarchical language-distilled Gaussian scene and a 3D semantic scene graph. Gaussian pruning refines the geometry, a robust multi-view language alignment strategy aggregates 2D features into accurate 3D object embeddings, and relational reasoning is performed with vision-language-derived annotations and a graph neural network. Result: The method is validated on open-vocabulary segmentation, scene graph generation, and relation-guided retrieval, enabling efficient and scalable open-vocabulary 3D reasoning. Conclusion: By jointly modeling hierarchical semantics and inter/intra-object relationships, the framework achieves unified open-vocabulary 3D perception and reasoning without scene-specific training. Abstract: Achieving unified 3D perception and reasoning across tasks such as segmentation, retrieval, and relation understanding remains challenging, as existing methods are either object-centric or rely on costly training for inter-object reasoning. We present a novel framework that constructs a hierarchical language-distilled Gaussian scene and its 3D semantic scene graph without scene-specific training. A Gaussian pruning mechanism refines scene geometry, while a robust multi-view language alignment strategy aggregates noisy 2D features into accurate 3D object embeddings. On top of this hierarchy, we build an open-vocabulary 3D scene graph with vision-language-derived annotations and Graph Neural Network-based relational reasoning. Our approach enables efficient and scalable open-vocabulary 3D reasoning by jointly modeling hierarchical semantics and inter/intra-object relationships, validated across tasks including open-vocabulary segmentation, scene graph generation, and relation-guided retrieval. Project page: https://dfki-av.github.io/ReLaGS/

[149] S-VGGT: Structure-Aware Subscene Decomposition for Scalable 3D Foundation Models

Xinze Li,Pengxu Chen,Yiyuan Wang,Weifeng Su,Wentao Cheng

Main category: cs.CV

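TL;DR: This paper proposes S-VGGT, which removes redundancy at the structural frame level: it builds a scene graph, softly partitions frames into subscenes, and processes them efficiently in parallel through a shared reference frame. This cuts the cost of global attention at its source and is orthogonal to token-level acceleration methods.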

Details Motivation: Feed-forward 3D foundation models suffer from the quadratic cost of global attention. Existing token-level acceleration methods such as token merging add overhead through nearest-neighbor search and do not address the structural redundancy of densely captured data. Method: S-VGGT first builds a dense scene graph from initial features to characterize structural redundancy. Using this graph, it softly assigns frames to a small number of subscenes while keeping groups balanced and geometric transitions smooth. All subscenes share one reference frame, forming a parallel geometric bridge that allows independent, efficient processing without explicit geometric alignment. Result: The method substantially reduces the cost of global attention and provides strong intrinsic acceleration. It is fully orthogonal to token-level acceleration, so the two can be combined for compounded speedups without harming reconstruction fidelity. Conclusion: By reorganizing the input at the structural frame level, S-VGGT relieves the core computational bottleneck of 3D foundation models and offers a new approach to efficient, high-fidelity 3D understanding. Abstract: Feed-forward 3D foundation models face a key challenge: the quadratic computational cost introduced by global attention, which severely limits scalability as input length increases. Concurrent acceleration methods, such as token merging, operate at the token level. While they offer local savings, the required nearest-neighbor searches introduce undesirable overhead. Consequently, these techniques fail to tackle the fundamental issue of structural redundancy dominant in dense capture data. In this work, we introduce S-VGGT, a novel approach that addresses redundancy at the structural frame level, drastically shifting the optimization focus. We first leverage the initial features to build a dense scene graph, which characterizes structural scene redundancy and guides the subsequent scene partitioning. Using this graph, we softly assign frames to a small number of subscenes, guaranteeing balanced groups and smooth geometric transitions. The core innovation lies in designing the subscenes to share a common reference frame, establishing a parallel geometric bridge that enables independent and highly efficient processing without explicit geometric alignment. This structural reorganization provides strong intrinsic acceleration by cutting the global attention cost at its source. Crucially, S-VGGT is entirely orthogonal to token-level acceleration methods, allowing the two to be seamlessly combined for compounded speedups without compromising reconstruction fidelity. Code is available at https://github.com/Powertony102/S-VGGT.
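To see why splitting frames into subscenes reduces the quadratic attention cost, the following back-of-the-envelope sketch counts token pairs. It is only an illustration under simplifying assumptions: frames are split into hard, equally sized groups (the paper describes a soft assignment), every group additionally contains one shared reference frame, and the frame and token counts are made up.

```python
def attention_pairs(num_frames, tokens_per_frame):
    """Token pairs scored by one global attention pass over all frames."""
    n = num_frames * tokens_per_frame
    return n * n


def subscene_attention_pairs(num_frames, tokens_per_frame, num_subscenes):
    """Token pairs when frames are split into subscenes sharing one reference frame.

    Assumption: a hard, balanced split of the non-reference frames; each
    subscene then runs its own attention over its frames plus the reference.
    """
    others = num_frames - 1  # one frame is reserved as the shared reference
    base, extra = divmod(others, num_subscenes)
    sizes = [base + (1 if k < extra else 0) + 1 for k in range(num_subscenes)]
    return sum(attention_pairs(size, tokens_per_frame) for size in sizes)


full = attention_pairs(num_frames=101, tokens_per_frame=10)
split = subscene_attention_pairs(num_frames=101, tokens_per_frame=10, num_subscenes=4)
print(full, split, round(full / split, 2))
```

With these made-up numbers the split needs roughly a quarter of the token pairs, which is the intuition behind cutting the global attention cost at the frame level rather than the token level.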

[150] A Multi-Agent System for Building-Age Cohort Mapping to Support Urban Energy Planning

Kundan Thota,Thorsten Schlachter,Veit Hagenmeyer

Main category: cs.CV

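TL;DR: This paper combines a multi-agent large language model (LLM) system with BuildingAgeCNN, a satellite image classifier, to estimate the age distribution of urban buildings in support of sustainable heat planning. The system fuses heterogeneous data from the census, OpenStreetMap, and monument registers, classifies with an augmented ConvNeXt architecture, and adds calibrated confidence estimates to manage the risk arising from class imbalance.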

Details Motivation: Knowing the age distribution of the urban building stock is essential for sustainable municipal heat planning and for prioritizing upgrades, but existing approaches rely on sensor or remote sensing data that are inconsistent and incomplete. Method: A multi-agent LLM system with three agents (Zensus, OSM, and Monument) gathers the data, and a data orchestrator geocodes and deduplicates the records. On top of the fused data, the authors design BuildingAgeCNN, a satellite image classifier built from a ConvNeXt backbone with FPN, CoordConv, and SE blocks, and integrate confidence calibration with low-confidence flagging. Result: Under spatial cross-validation, BuildingAgeCNN reaches an overall accuracy of 90.69% but a macro-F1 of only 67.25%, indicating strong class imbalance and confusion between adjacent historical cohorts. The system supports address-level prediction with calibrated confidence to assist manual review. Conclusion: The multi-agent LLM system integrates heterogeneous building-age data efficiently and gives energy planners technical support for optimizing district-heating networks and moving toward low-carbon, sustainable energy systems. Abstract: Determining the age distribution of the urban building stock is crucial for sustainable municipal heat planning and upgrade prioritization. However, existing approaches often rely on datasets gathered via sensors or remote sensing techniques, leaving inconsistencies and gaps in data. We present a multi-agent LLM system comprising three key agents, the Zensus agent, the OSM agent, and the Monument agent, that fuse data from heterogeneous sources. A data orchestrator and harmonizer geocodes and deduplicates building imprints. Using this fused ground truth, we introduce BuildingAgeCNN, a satellite-only classifier based on a ConvNeXt backbone augmented with a Feature Pyramid Network (FPN), CoordConv spatial channels, and Squeeze-and-Excitation (SE) blocks. Under spatial cross validation, BuildingAgeCNN attains an overall accuracy of 90.69% but a modest macro-F1 of 67.25%, reflecting strong class imbalance and persistent confusions between adjacent historical cohorts. To mitigate risk for planning applications, the address-to-prediction pipeline includes calibrated confidence estimates and flags low-confidence cases for manual review. This multi-agent LLM system not only assists in gathering structured data but also helps energy demand planners optimize district-heating networks and target low-carbon sustainable energy systems.
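The gap between high accuracy and modest macro-F1 reported above is typical of imbalanced classes. The toy example below shows the effect with two made-up cohorts; the labels and counts are hypothetical and unrelated to the paper's data.

```python
def accuracy_and_macro_f1(y_true, y_pred):
    """Compute overall accuracy and the unweighted mean of per-class F1."""
    labels = sorted(set(y_true) | set(y_pred))
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return correct / len(y_true), sum(f1_scores) / len(f1_scores)


# Hypothetical cohorts: 90 "modern" buildings, 10 "historic" ones.
y_true = ["modern"] * 90 + ["historic"] * 10
# The classifier gets every modern building right but only 2 of 10 historic ones.
y_pred = ["modern"] * 90 + ["modern"] * 8 + ["historic"] * 2

acc, macro_f1 = accuracy_and_macro_f1(y_true, y_pred)
print(f"accuracy={acc:.2f}, macro-F1={macro_f1:.2f}")
```

Accuracy is dominated by the majority cohort, while macro-F1 weights the rare cohort equally, so errors on rare classes pull it down sharply.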

[151] Part-Aware Open-Vocabulary 3D Affordance Grounding via Prototypical Semantic and Geometric Alignment

Dongqiang Gou,Xuming He

Main category: cs.CV

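TL;DR: This paper proposes a two-stage cross-modal framework for language-driven 3D affordance grounding, that is, locating the functional regions of 3D objects from natural language. It targets open-vocabulary generalization, fine-grained geometric alignment, and part-level semantic consistency.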

Details Motivation: Existing methods still struggle with open-vocabulary generalization, fine-grained geometric alignment, and part-level semantic consistency. Method: A two-stage cross-modal framework. In the first stage, large language models generate part-aware instructions to recover missing semantics. In the second stage, Affordance Prototype Aggregation (APA) and Intra-Object Relational Modeling (IORM) improve geometric consistency and differentiation. Result: Experiments on a newly built benchmark and two existing benchmarks show that the method outperforms existing approaches. Conclusion: The framework improves both semantic and geometric representations for open-vocabulary 3D affordance grounding. Abstract: Grounding natural language questions to functionally relevant regions in 3D objects (termed language-driven 3D affordance grounding) is essential for embodied intelligence and human-AI interaction. Existing methods, while progressing from label-based to language-driven approaches, still face challenges in open-vocabulary generalization, fine-grained geometric alignment, and part-level semantic consistency. To address these issues, we propose a novel two-stage cross-modal framework that enhances both semantic and geometric representations for open-vocabulary 3D affordance grounding. In the first stage, large language models generate part-aware instructions to recover missing semantics, enabling the model to link semantically similar affordances. In the second stage, we introduce two key components: Affordance Prototype Aggregation (APA), which captures cross-object geometric consistency for each affordance, and Intra-Object Relational Modeling (IORM), which refines geometric differentiation within objects to support precise semantic alignment. We validate the effectiveness of our method through extensive experiments on a newly introduced benchmark, as well as two existing benchmarks, demonstrating superior performance in comparison with existing methods.

[152] Anchoring and Rescaling Attention for Semantically Coherent Inbetweening

Tae Eun Choi,Sumin Shim,Junhyeok Kim,Seong Jae Hwang

Main category: cs.CV

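TL;DR: This paper proposes a text-conditioned generative inbetweening (GI) method. Keyframe-anchored Attention Bias and Rescaled Temporal RoPE improve semantic fidelity, pace stability, and frame consistency, and the authors introduce TGI-Bench, the first benchmark designed for this task.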

Details Motivation: With sparse keyframes and large motions, existing generative inbetweening (GI) models produce inconsistent frames, unstable pacing, and semantic misalignment; clearer semantic and temporal guidance from the keyframes and text is needed. Method: Keyframe-anchored Attention Bias applies semantic information from the keyframes and text to every intermediate frame as an attention bias. Rescaled Temporal RoPE lets self-attention attend to the keyframes more faithfully in time. TGI-Bench provides fine-grained, challenge-targeted evaluation. Result: Without additional training, the method reaches state-of-the-art performance on short and long sequences across diverse challenges, with clear gains in frame consistency, semantic fidelity, and pace stability. Conclusion: Joint semantic and temporal guidance addresses the core difficulties of generative inbetweening, and TGI-Bench provides a standardized basis for evaluating the task. Abstract: Generative inbetweening (GI) seeks to synthesize realistic intermediate frames between the first and last keyframes beyond mere interpolation. As sequences become sparser and motions larger, previous GI models struggle with inconsistent frames, unstable pacing, and semantic misalignment. Since GI involves fixed endpoints and numerous plausible paths, this task requires additional guidance gained from the keyframes and text to specify the intended path. Thus, we give semantic and temporal guidance from the keyframes and text onto each intermediate frame through Keyframe-anchored Attention Bias. We also better enforce frame consistency with Rescaled Temporal RoPE, which allows self-attention to attend to keyframes more faithfully. TGI-Bench, the first benchmark specifically designed for text-conditioned GI evaluation, enables challenge-targeted evaluation to analyze GI models. Without additional training, our method achieves state-of-the-art frame consistency, semantic fidelity, and pace stability for both short and long sequences across diverse challenges.

[153] Interpretable Cross-Domain Few-Shot Learning with Rectified Target-Domain Local Alignment

Yaze Zhao,Yixiong Zou,Yuhua Li,Ruixuan Li

Main category: cs.CV

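TL;DR: This paper proposes CC-CDFSL, which uses cycle-consistency self-supervision and a Semantic Anchor mechanism to address CLIP's difficulty with local vision-language alignment in cross-domain few-shot learning, improving both the interpretability and the performance of fine-grained recognition.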

Details Motivation: Existing CLIP-based cross-domain few-shot learning methods struggle to focus on the fine-grained visual cues that downstream domains such as medical diagnosis require, and the domain gap and data scarcity make this local alignment problem (the local misalignment problem) worse. Method: CC-CDFSL (1) applies cycle-consistency self-supervision, translating local visual features into text features and back (and vice versa) under a reconstruction constraint; and (2) introduces a Semantic Anchor mechanism that first augments visual features to enlarge the corpus for the text-to-image mapping and then shrinks the image features to filter out irrelevant image-to-text mappings, reducing noise from the visual modality. Result: Across multiple benchmarks, backbones, and fine-tuning methods, the approach (1) clearly improves local vision-language alignment, (2) makes learned patterns and model decisions more interpretable through patch visualization, and (3) achieves state-of-the-art performance. Conclusion: Local misalignment is a key bottleneck when applying CLIP to cross-domain few-shot learning, and self-supervision based on cycle consistency and semantic anchors mitigates it while balancing alignment accuracy, interpretability, and generalization. Abstract: Cross-Domain Few-Shot Learning (CDFSL) adapts models trained with large-scale general data (source domain) to downstream target domains with only scarce training data, where the research on vision-language models (e.g., CLIP) is still in the early stages. Typical downstream domains, such as medical diagnosis, require fine-grained visual cues for interpretable recognition, but we find that current fine-tuned CLIP models can hardly focus on these cues, although they can roughly focus on important regions in source domains. Although current works have demonstrated CLIP's shortcomings in capturing local subtle patterns, in this paper, we find that the domain gap and scarce training data further exacerbate such shortcomings, much more than that of holistic patterns, which we call the local misalignment problem in CLIP-based CDFSL. To address this problem, due to the lack of supervision in aligning local visual features and text semantics, we turn to self-supervision information. Inspired by the translation task, we propose the CC-CDFSL method with cycle consistency, which translates local visual features into text features and then translates them back into visual features (and vice versa), and constrains the original features close to the translated back features. To reduce the noise imported by richer information in the visual modality, we further propose a Semantic Anchor mechanism, which first augments visual features to provide a larger corpus for the text-to-image mapping, and then shrinks the image features to filter out irrelevant image-to-text mapping.
Extensive experiments on various benchmarks, backbones, and fine-tuning methods show we can (1) effectively improve the local vision-language alignment, (2) enhance the interpretability of learned patterns and model decisions by visualizing patches, and (3) achieve state-of-the-art performance.
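The cycle-consistency constraint described in the abstract (translate a visual feature to the text space, translate it back, and keep the result close to the original) can be sketched as follows. This is a minimal sketch, not the paper's model: the translators are assumed to be plain linear maps, the loss is assumed to be mean squared error, and the matrices are toy values.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def cycle_consistency_loss(visual, visual_to_text, text_to_visual):
    """Mean squared error between a visual feature and its round-trip translation.

    Assumption: both translators are linear maps given as matrices.
    """
    text_like = matvec(visual_to_text, visual)           # visual -> text space
    reconstructed = matvec(text_to_visual, text_like)    # text space -> visual
    return sum((a - b) ** 2 for a, b in zip(visual, reconstructed)) / len(visual)


visual_feature = [1.0, 2.0]
to_text = [[2.0, 0.0], [0.0, 2.0]]
good_inverse = [[0.5, 0.0], [0.0, 0.5]]   # undoes to_text exactly
bad_inverse = [[1.0, 0.0], [0.0, 1.0]]    # leaves the feature scaled by 2

print(cycle_consistency_loss(visual_feature, to_text, good_inverse))
print(cycle_consistency_loss(visual_feature, to_text, bad_inverse))
```

The symmetric direction (text to visual and back) would use the same loss with the roles of the two maps swapped.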

[154] FINER: MLLMs Hallucinate under Fine-grained Negative Queries

Rui Xiao,Sanghwan Kim,Yongqin Xian,Zeynep Akata,Stephan Alaniz

Main category: cs.CV

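TL;DR: This paper introduces the FINER benchmarks and FINER-Tuning, which target hallucinations of multimodal large language models (MLLMs) under fine-grained queries. Direct Preference Optimization substantially reduces these hallucinations while improving general multimodal capabilities.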

Details Motivation: Existing benchmarks focus on coarse image questions and under-represent MLLM hallucinations under fine-grained queries, so finer evaluation and mitigation methods are needed. Method: The authors build the FINER benchmarks (FINER-CompreCap and FINER-DOCCI), covering multi-object, multi-attribute, multi-relation, and "what" questions, and propose FINER-Tuning, which fine-tunes frontier MLLMs with Direct Preference Optimization (DPO) on FINER-inspired data. Result: Fine-tuning four frontier MLLMs with FINER-Tuning yields gains of up to 24.2% (InternVL3.5-14B) on hallucinations in the FINER benchmarks, while also improving performance on eight existing hallucination suites and six general multimodal benchmarks. Conclusion: MLLMs tend to hallucinate when fine-grained mismatches co-occur with elements that are genuinely present in the image; FINER-Tuning is an effective mitigation strategy that generalizes well. Abstract: Multimodal large language models (MLLMs) struggle with hallucinations, particularly with fine-grained queries, a challenge underrepresented by existing benchmarks that focus on coarse image-related questions. We introduce FIne-grained NEgative queRies (FINER), alongside two benchmarks: FINER-CompreCap and FINER-DOCCI. Using FINER, we analyze hallucinations across four settings: multi-object, multi-attribute, multi-relation, and "what" questions. Our benchmarks reveal that MLLMs hallucinate when fine-grained mismatches co-occur with genuinely present elements in the image. To address this, we propose FINER-Tuning, leveraging Direct Preference Optimization (DPO) on FINER-inspired data. Finetuning four frontier MLLMs with FINER-Tuning yields up to 24.2% gains (InternVL3.5-14B) on hallucinations from our benchmarks, while simultaneously improving performance on eight existing hallucination suites, and enhancing general multimodal capabilities across six benchmarks. Code, benchmark, and models are available at https://explainableml.github.io/finer-project/.
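For readers unfamiliar with Direct Preference Optimization, the sketch below computes the standard DPO loss for a single preference pair. It illustrates the general objective only; how FINER-Tuning builds its preference pairs is not shown, and the log-probabilities and `beta` value are made-up inputs.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r)))),
    where each argument is the summed log-probability of a response.
    """
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(m)) equals softplus(-m); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities: the policy prefers the non-hallucinated answer.
aligned = dpo_loss(-1.0, -5.0, -2.0, -2.0)
# The policy prefers the hallucinated answer instead.
misaligned = dpo_loss(-5.0, -1.0, -2.0, -2.0)
print(aligned, misaligned)
```

The loss shrinks as the policy raises the chosen response's likelihood relative to the reference model and lowers that of the rejected one.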

[155] Few-Step Diffusion Sampling Through Instance-Aware Discretizations

Liangyu Yuan,Ruoyu Wang,Tong Zhao,Dingwen Fu,Mingkun Lei,Beier Zhu,Chi Zhang

Main category: cs.CV

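TL;DR: This paper proposes an instance-aware discretization framework for sampling from diffusion and flow matching models. It adapts the timestep allocation to input-dependent priors, improving generation quality.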

Details Motivation: Most existing methods use one globally shared timestep schedule and ignore sample-specific complexity in the generative process, which limits performance. Controlled experiments on synthetic data show that global schedules are suboptimal under instance-specific dynamics. Method: An instance-aware discretization framework that extends gradient-based discretization search to the conditional generative setting, so that timestep allocation depends on input-dependent priors. Result: Across synthetic data, pixel-space diffusion, and latent-space image and video flow matching models, the method consistently improves generation quality with low tuning cost and negligible inference overhead. Conclusion: Instance-aware discretization adapts better to the generation dynamics of individual samples and is a promising direction for improving the sampling efficiency and quality of diffusion and flow matching models. Abstract: Diffusion and flow matching models generate high-fidelity data by simulating paths defined by Ordinary or Stochastic Differential Equations (ODEs/SDEs), starting from a tractable prior distribution. The probability flow ODE formulation enables the use of advanced numerical solvers to accelerate sampling. Orthogonal yet vital to solver design is the discretization strategy. While early approaches employed handcrafted heuristics and recent methods adopt optimization-based techniques, most existing strategies enforce a globally shared timestep schedule across all samples. This uniform treatment fails to account for instance-specific complexity in the generative process, potentially limiting performance. Motivated by controlled experiments on synthetic data, which reveal the suboptimality of global schedules under instance-specific dynamics, we propose an instance-aware discretization framework. Our method learns to adapt timestep allocations based on input-dependent priors, extending gradient-based discretization search to the conditional generative setting. Empirical results across diverse settings, including synthetic data, pixel-space diffusion, latent-space images and video flow matching models, demonstrate that our method consistently improves generation quality with marginal tuning cost compared to training and negligible inference overhead.

[156] DeepCORO-CLIP: A Multi-View Foundation Model for Comprehensive Coronary Angiography Video-Text Analysis and External Validation

Sarra Harrabi,Yichen Wu,Geoffrey H. Tison,Minhaj Ansari,Milos Vukadinovic,David Ouyang,Joshua P. Barrios,Jacques Delfrate,Robert Avram

Main category: cs.CV

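TL;DR: DeepCORO-CLIP is a multi-view coronary angiography foundation model trained with video-text contrastive learning. It detects lesions such as stenosis, chronic total occlusion, thrombus, and calcification, and supports prognosis prediction and disease progression assessment.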

Details Motivation: Existing AI methods mostly analyze single frames or single projections and focus on stenosis detection, which prevents comprehensive coronary assessment; visual interpretation by human readers varies between observers. Method: DeepCORO-CLIP is trained with video-text contrastive learning on more than 200,000 angiography videos, and integrates multiple projections with attention-based pooling for study-level assessment. Result: For significant stenosis detection, AUROC is 0.888 internally and 0.89 externally. Quantitative error against core-laboratory QCA is 13.6%, lower than the 19.0% of clinical reports. Detection of CTO, thrombus, and calcification is strong. After transfer learning, the model predicts one-year MACE (AUROC 0.79) and estimates LVEF (MAE 7.3%). The embeddings capture disease progression, and mean inference time in deployment is 4.2 seconds. Conclusion: DeepCORO-CLIP is a foundation model for comprehensive coronary angiography interpretation with potential for clinical deployment; code, sample data, model weights, and deployment infrastructure are publicly released. Abstract: Coronary angiography is the reference standard for evaluating coronary artery disease, yet visual interpretation remains variable between readers. Existing artificial intelligence methods typically analyze single frames or projections and focus mainly on stenosis, limiting comprehensive coronary assessment. We present DeepCORO-CLIP, a multi-view foundation model trained with video-text contrastive learning on 203,808 angiography videos from 28,117 patients across 32,473 studies at the Montreal Heart Institute and externally validated on 4,249 studies from the University of California, San Francisco. DeepCORO-CLIP integrates multiple projections with attention-based pooling for study-level assessment across diagnostic, prognostic, and disease progression tasks. For significant stenosis detection, the model achieved an AUROC of 0.888 internally and 0.89 on external validation. Mean absolute error against core laboratory quantitative coronary angiography was 13.6%, lower than clinical reports at 19.0%. The model also performed strongly for chronic total occlusion, intracoronary thrombus, and coronary calcification detection. Transfer learning enabled prediction of one-year major adverse cardiovascular events with AUROC 0.79 and estimation of left ventricular ejection fraction with mean absolute error 7.3%. Embeddings also captured disease progression across serial examinations. With a mean inference time of 4.2 seconds in hospital deployment, DeepCORO-CLIP provides a foundation for automated coronary angiography interpretation at the point of care.
Code, sample data, model weights, and deployment infrastructure are publicly released.
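Attention-based pooling, which the abstract uses to combine several projections into one study-level representation, can be illustrated with the generic sketch below. It is not the model's actual pooling layer: scoring each view by a dot product with a single query vector is an assumption, and the embeddings are toy values.

```python
import math


def attention_pool(view_embeddings, query):
    """Combine per-view embeddings into one vector with softmax attention weights.

    Assumption: each view is scored by its dot product with `query`
    (a learned vector in a real model, a fixed toy vector here).
    """
    scores = [sum(q * x for q, x in zip(query, emb)) for emb in view_embeddings]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for stability
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(query)
    pooled = [
        sum(w * emb[d] for w, emb in zip(weights, view_embeddings))
        for d in range(dim)
    ]
    return pooled, weights


views = [[1.0, 0.0], [0.0, 1.0]]  # two hypothetical projection embeddings
pooled_uniform, weights_uniform = attention_pool(views, query=[0.0, 0.0])
pooled_biased, weights_biased = attention_pool(views, query=[3.0, 0.0])
print(weights_uniform, weights_biased)
```

A zero query weights all views equally, while a query aligned with the first view shifts most of the weight toward it.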

[157] Illumination-Aware Contactless Fingerprint Spoof Detection via Paired Flash-Non-Flash Imaging

Roja Sahoo,Anoop Namboodiri

Main category: cs.CV

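TL;DR: This paper studies paired flash/non-flash imaging as a lightweight active sensing mechanism. By analyzing illumination-induced differences such as inter-channel correlation and specular reflection characteristics, it aims to improve the robustness and interpretability of contactless fingerprint spoof detection.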

Details Motivation: Contactless fingerprint recognition lacks physical contact and traditional liveness cues, and existing single-image, appearance-based methods generalize poorly across devices, capture conditions, and spoof materials. Method: Paired flash and non-flash acquisition, analyzed with interpretable metrics (inter-channel correlation, specular reflection, texture realism, and differential imaging) that expose illumination-induced differences in material and structure. Result: These features help distinguish genuine fingerprints from printed, digital-display, and molded spoofs. The study also identifies limits of paired acquisition: sensitivity to imaging settings, dataset scale, and high-fidelity spoofs. Conclusion: Illumination-aware analysis can improve the robustness and interpretability of contactless fingerprint presentation attack detection, and motivates future work on paired acquisition and physics-informed feature design. Abstract: Contactless fingerprint recognition enables hygienic and convenient biometric authentication but poses new challenges for spoof detection due to the absence of physical contact and traditional liveness cues. Most existing methods rely on single-image acquisition and appearance-based features, which often generalize poorly across devices, capture conditions, and spoof materials. In this work, we study paired flash-non-flash contactless fingerprint acquisition as a lightweight active sensing mechanism for spoof detection. Through a preliminary empirical analysis, we show that flash illumination accentuates material- and structure-dependent properties, including ridge visibility, subsurface scattering, micro-geometry, and surface oils, while non-flash images provide a baseline appearance context. We analyze lighting-induced differences using interpretable metrics such as inter-channel correlation, specular reflection characteristics, texture realism, and differential imaging. These complementary features help discriminate genuine fingerprints from printed, digital, and molded presentation attacks. We further examine the limitations of paired acquisition, including sensitivity to imaging settings, dataset scale, and emerging high-fidelity spoofs. Our findings demonstrate the potential of illumination-aware analysis to improve robustness and interpretability in contactless fingerprint presentation attack detection, motivating future work on paired acquisition and physics-informed feature design. Code is available in the repository.

[158] WeatherReasonSeg: A Benchmark for Weather-Aware Reasoning Segmentation in Visual Language Models

Wanjun Du,Zifeng Yuan,Tingting Chen,Fucai Ke,Beibei Lin,Shunli Zhang

Main category: cs.CV

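TL;DR: This paper introduces WeatherReasonSeg, a benchmark for evaluating how robustly vision-language models (VLMs) perform reasoning segmentation under adverse weather such as rain, snow, and fog. It combines controllable synthetic data with real adverse-weather data and reports two key findings: performance degrades monotonically as weather worsens, and different weather types induce distinct vulnerability patterns.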

Details Motivation: Vision-language models perform well on idealized, high-quality images, but it is unclear whether their reasoning segmentation remains reliable when rain, snow, or fog severely degrades visual cues; a dedicated benchmark is needed. Method: WeatherReasonSeg (1) builds a controllable reasoning dataset by applying synthetic weather degradation at varying severity to existing segmentation datasets; (2) builds a real-world adverse-weather reasoning segmentation dataset with semantically consistent queries generated by mask-guided LLM prompting; and (3) broadens evaluation across five dimensions: functionality, application scenarios, structural attributes, interactions, and requirement matching. Result: Experiments on diverse VLMs show that (1) performance degrades monotonically with increasing weather severity, and (2) different weather types (rain, snow, fog) induce distinct vulnerability patterns. Conclusion: WeatherReasonSeg provides a systematic benchmark for evaluating and advancing robust reasoning segmentation under adverse weather, and may help advance weather-aware vision-language understanding. Abstract: Existing vision-language models (VLMs) have demonstrated impressive performance in reasoning-based segmentation. However, current benchmarks are primarily constructed from high-quality images captured under idealized conditions. This raises a critical question: when visual cues are severely degraded by adverse weather conditions such as rain, snow, or fog, can VLMs sustain reliable reasoning segmentation capabilities? In response to this challenge, we introduce WeatherReasonSeg, a benchmark designed to evaluate VLM performance in reasoning-based segmentation under adverse weather conditions. It consists of two complementary components. First, we construct a controllable reasoning dataset by applying synthetic weather with varying severity levels to existing segmentation datasets, enabling fine-grained robustness analysis. Second, to capture real-world complexity, we curate a real-world adverse-weather reasoning segmentation dataset with semantically consistent queries generated via mask-guided LLM prompting. We further broaden the evaluation scope across five reasoning dimensions, including functionality, application scenarios, structural attributes, interactions, and requirement matching. Extensive experiments across diverse VLMs reveal two key findings: (1) VLM performance degrades monotonically with increasing weather severity, and (2) different weather types induce distinct vulnerability patterns. We hope WeatherReasonSeg will serve as a foundation for advancing robust, weather-aware reasoning.

[159] Does YOLO Really Need to See Every Training Image in Every Epoch?

Xingxing Xie,Jiahua Dong,Junwei Han,Gong Cheng

Main category: cs.CV

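TL;DR: This paper proposes an Anti-Forgetting Sampling Strategy (AFSS). It measures the learning sufficiency of each training image as the minimum of its detection recall and precision, sorts images into easy, medium, and hard levels, and samples them differently: easy images are sparsely resampled, medium images are partially sampled, and hard images are fully sampled. This speeds up YOLO training while improving accuracy.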

Details Motivation: YOLO training is slow because every epoch processes every image, even though many images are already sufficiently learned, which runs against the efficiency implied by "You Only Look Once". The paper asks whether every image really needs to be seen in every epoch. Method: AFSS measures learning sufficiency per image as min(recall, precision) and dynamically divides images into easy, medium, and hard. Easy images are sparsely resampled, with priority given to those unused for the longest time; medium images are partially sampled, prioritizing recently unused ones and filling the rest at random; hard images are sampled in full every epoch. Learning sufficiency is updated periodically so that training adaptively focuses on informative images. Result: On MS COCO 2017, PASCAL VOC 2007, DOTA-v1.0, and DIOR-R, AFSS gives YOLO-series detectors more than 1.43× training speedup while also improving detection accuracy. Conclusion: YOLO does not need to see every training image in every epoch; the dynamic, tiered, anti-forgetting sampling of AFSS improves both training efficiency and model performance, supporting selective training as a practical approach. Abstract: YOLO detectors are known for their fast inference speed, yet training them remains unexpectedly time-consuming due to their exhaustive pipeline that processes every training image in every epoch, even when many images have already been sufficiently learned. This stands in clear contrast to the efficiency suggested by the "You Only Look Once" philosophy. This naturally raises an important question: Does YOLO really need to see every training image in every epoch? To explore this, we propose an Anti-Forgetting Sampling Strategy (AFSS) that dynamically determines which images should be used and which can be skipped during each epoch, allowing the detector to learn more effectively and efficiently. Specifically, AFSS measures the learning sufficiency of each training image as the minimum of its detection recall and precision, and dynamically categorizes training images into easy, medium, or hard levels accordingly. Easy training images are sparsely resampled during training in a continuous review manner, with priority given to those that have not been used for a long time to reduce redundancy and prevent forgetting. Medium training images are partially selected, prioritizing recently unused ones and randomly choosing the rest from unselected images to ensure coverage and prevent forgetting. Hard training images are fully sampled in every epoch to ensure sufficient learning.
The learning sufficiency of each training image is periodically updated, enabling detectors to adaptively shift their focus toward the informative training images over time while progressively discarding redundant ones. On widely used natural image detection benchmarks (MS COCO 2017 and PASCAL VOC 2007) and remote sensing detection datasets (DOTA-v1.0 and DIOR-R), AFSS achieves more than 1.43× training speedup for YOLO-series detectors while also improving accuracy.
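The tiered sampling described above can be sketched as follows. This is a simplified illustration rather than the authors' code: the thresholds (0.85 and 0.55) and sampling ratios are hypothetical, and the medium tier is chosen purely by staleness, without the random fill-in the abstract mentions.

```python
import math


def categorize(sufficiency, easy_threshold=0.85, hard_threshold=0.55):
    """Map a learning-sufficiency score to a tier (thresholds are hypothetical)."""
    if sufficiency >= easy_threshold:
        return "easy"
    if sufficiency < hard_threshold:
        return "hard"
    return "medium"


def select_epoch_images(stats, last_used, epoch, easy_ratio=0.1, medium_ratio=0.5):
    """Choose the images to train on in this epoch.

    stats: {image_id: (recall, precision)} from the latest evaluation.
    last_used: {image_id: epoch when the image was last sampled}; updated in place.
    """
    groups = {"easy": [], "medium": [], "hard": []}
    for image_id, (recall, precision) in stats.items():
        # Learning sufficiency is the minimum of recall and precision.
        groups[categorize(min(recall, precision))].append(image_id)

    def stalest(ids, ratio):
        # Prefer images that have gone unused the longest (never used comes first).
        count = math.ceil(len(ids) * ratio)
        return sorted(ids, key=lambda i: (last_used.get(i, -1), i))[:count]

    chosen = (
        groups["hard"]                            # hard images: always sampled
        + stalest(groups["medium"], medium_ratio)  # medium images: partial
        + stalest(groups["easy"], easy_ratio)      # easy images: sparse review
    )
    for image_id in chosen:
        last_used[image_id] = epoch
    return sorted(chosen)


stats = {"a": (0.95, 0.90), "b": (0.90, 0.92), "c": (0.70, 0.80),
         "d": (0.60, 0.90), "e": (0.30, 0.80)}
last_used = {"a": 3, "b": 1, "c": 2, "d": 0}
selected = select_epoch_images(stats, last_used, epoch=5)
print(selected)
```

Because `last_used` is updated after each selection, an easy image that was skipped for several epochs eventually moves to the front of the queue, which is the anti-forgetting part of the idea.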

[160] Learning Transferable Temporal Primitives for Video Reasoning via Synthetic Videos

Songtao Jiang,Sibo Song,Chenyi Zhou,Yuan Wang,Ruizhe Chen,Tongkun Guan,Ruilin Luo,Yan Zhang,Zhihang Tang,Yuchong Sun,Hang Zhang,Zhibo Yang,Shuai Bai,Junyang Lin,Zuozhu Liu

Main category: cs.CV

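TL;DR: This paper proposes SynRL, a framework that teaches models temporal primitives (direction, speed, state tracking) from programmatically generated synthetic videos. It substantially improves video understanding, and its 7.7K synthetic CoT samples outperform Video-R1, which uses 165K real-world samples.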

Details Motivation: Vision-language models struggle to model temporal dynamics in video (motion trajectories, speed changes, state transitions). Current post-training methods are limited because datasets lack temporal-centricity and because generated training data contain systematic temporal perception errors. Method: SynRL is a post-training framework that decomposes temporal understanding into short-term perceptual primitives (speed, direction) and long-term cognitive primitives. Code-based video generation of simple geometric shapes produces 7.7K CoT and 7K RL samples with ground-truth frame-level annotations, on which the model learns the temporal primitives. Result: Performance improves substantially on 15 video understanding benchmarks spanning temporal grounding, complex reasoning, and general video understanding. The 7.7K synthetic CoT samples alone outperform Video-R1 with 165K real-world samples, showing that temporal skills learned from abstract synthetic patterns transfer to real scenes. Conclusion: Temporal learning from carefully designed synthetic data is a lower-cost, more efficient paradigm for video post-training. Abstract: The transition from image to video understanding requires vision-language models (VLMs) to shift from recognizing static patterns to reasoning over temporal dynamics such as motion trajectories, speed changes, and state transitions. Yet current post-training methods fall short due to two critical limitations: (1) existing datasets often lack temporal-centricity, where answers can be inferred from isolated keyframes rather than requiring holistic temporal integration; and (2) training data generated by proprietary models contains systematic errors in fundamental temporal perception, such as confusing motion directions or misjudging speeds. We introduce SynRL, a post-training framework that teaches models temporal primitives, the fundamental building blocks of temporal understanding including direction, speed, and state tracking. Our key insight is that these abstract primitives, learned from programmatically generated synthetic videos, transfer effectively to real-world scenarios. We decompose temporal understanding into short-term perceptual primitives (speed, direction) and long-term cognitive primitives, constructing 7.7K CoT and 7K RL samples with ground-truth frame-level annotations through code-based video generation. Despite training on simple geometric shapes, SynRL achieves substantial improvements across 15 benchmarks spanning temporal grounding, complex reasoning, and general video understanding. Remarkably, our 7.7K synthetic CoT samples outperform Video-R1 with 165K real-world samples.
We attribute this to fundamental temporal skills, such as tracking frame-by-frame changes and comparing velocity, that transfer effectively from abstract synthetic patterns to complex real-world scenarios. This establishes a new paradigm for video post-training: video temporal learning through carefully designed synthetic data provides a more cost-efficient scaling path.

[161] Parameter-Efficient Modality-Balanced Symmetric Fusion for Multimodal Remote Sensing Semantic Segmentation

Haocheng Li,Juepeng Zheng,Shuangxi Miao,Ruibo Lu,Guosheng Cai,Haohuan Fu,Jianxi Huang

Main category: cs.CV

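TL;DR: This paper proposes MoBaNet, a parameter-efficient, modality-balanced symmetric fusion framework for multimodal remote sensing semantic segmentation. With a frozen vision foundation model backbone, a Cross-modal Prompt-Injected Adapter (CPIA), a Difference-Guided Gated Fusion Module (DGFM), and a Modality-Conditional Random Masking (MCRM) strategy, it greatly reduces trainable parameters, mitigates modality imbalance, and reaches state-of-the-art performance on the ISPRS Vaihingen and Potsdam datasets.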

Details Motivation: Adapting pretrained Vision Foundation Models (VFMs) to multimodal remote sensing semantic segmentation faces two challenges: heavy computational overhead and modality imbalance, where the contribution of auxiliary modalities is suppressed. Method: MoBaNet comprises (1) a symmetric dual-stream architecture on a frozen VFM backbone; (2) a Cross-modal Prompt-Injected Adapter (CPIA) that generates shared prompts and injects them into bottleneck adapters for deep semantic interaction; (3) a Difference-Guided Gated Fusion Module (DGFM) that adaptively fuses stage features using cross-modal discrepancy; and (4) a Modality-Conditional Random Masking (MCRM) strategy that randomly masks one modality during training and applies hard-pixel auxiliary supervision. Result: MoBaNet achieves state-of-the-art performance on the ISPRS Vaihingen and Potsdam benchmarks with significantly fewer trainable parameters than full fine-tuning, supporting its robust and balanced multimodal fusion. Conclusion: Through parameter-efficient design and modality-balancing mechanisms, MoBaNet improves the performance and generalization of multimodal remote sensing semantic segmentation and offers a new approach to lightweight, balanced fusion of multi-source remote sensing data. Abstract: Multimodal remote sensing semantic segmentation enhances scene interpretation by exploiting complementary physical cues from heterogeneous data. Although pretrained Vision Foundation Models (VFMs) provide strong general-purpose representations, adapting them to multimodal tasks often incurs substantial computational overhead and is prone to modality imbalance, where the contribution of auxiliary modalities is suppressed during optimization. To address these challenges, we propose MoBaNet, a parameter-efficient and modality-balanced symmetric fusion framework. Built upon a largely frozen VFM backbone, MoBaNet adopts a symmetric dual-stream architecture to preserve generalizable representations while minimizing the number of trainable parameters. Specifically, we design a Cross-modal Prompt-Injected Adapter (CPIA) to enable deep semantic interaction by generating shared prompts and injecting them into bottleneck adapters under the frozen backbone. To obtain compact and discriminative multimodal representations for decoding, we further introduce a Difference-Guided Gated Fusion Module (DGFM), which adaptively fuses paired stage features by explicitly leveraging cross-modal discrepancy to guide feature selection. Furthermore, we propose a Modality-Conditional Random Masking (MCRM) strategy to mitigate modality imbalance by masking one modality only during training and imposing hard-pixel auxiliary supervision on modality-specific branches.
Extensive experiments on the ISPRS Vaihingen and Potsdam benchmarks demonstrate that MoBaNet achieves state-of-the-art performance with significantly fewer trainable parameters than full fine-tuning, validating its effectiveness for robust and balanced multimodal fusion. The source code in this work is available at https://github.com/sauryeo/MoBaNet.

[162] Eye image segmentation using visual and concept prompts with Segment Anything Model 3 (SAM3)

Diederick C. Niehorster,Marcus Nyström

Main category: cs.CV

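TL;DR: This paper evaluates the latest Segment Anything Model (SAM3) on eye image segmentation and finds that it does not outperform SAM2; since SAM2 is also faster, it remains the better choice for eye image segmentation.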

Details Motivation: To test whether SAM3 outperforms SAM2 on eye image segmentation and to explore its new concept (text) prompting mode. Method: SAM3 (with visual and concept prompts) is compared against SAM2 on diverse datasets, including high-resolution lab videos and the in-the-wild TEyeD dataset; code that adapts SAM3 to process videos of arbitrary duration is also provided. Result: In most cases, SAM3 with either visual or concept prompts did not outperform SAM2 on eye image segmentation, and SAM2 ran faster. Conclusion: SAM2 remains the best option for eye image segmentation. Abstract: Previous work has reported that vision foundation models show promising zero-shot performance in eye image segmentation. Here we examine whether the latest iteration of the Segment Anything Model, SAM3, offers better eye image segmentation performance than SAM2, and explore the performance of its new concept (text) prompting mode. Eye image segmentation performance was evaluated using diverse datasets encompassing both high-resolution high-quality videos from a lab environment and the TEyeD dataset consisting of challenging eye videos acquired in the wild. Results show that in most cases SAM3 with either visual or concept prompts did not perform better than SAM2, for both lab and in-the-wild datasets. Since SAM2 not only performed better but was also faster, we conclude that SAM2 remains the best option for eye image segmentation. We provide our adaptation of SAM3's codebase that allows processing videos of arbitrary duration.

[163] DiffVP: Differential Visual Semantic Prompting for LLM-Based CT Report Generation

Yuhe Tian,Kun Zhang,Haoran Ma,Rui Yan,Yingtai Li,Rongsheng Wang,Shaohua Kevin Zhou

Main category: cs.CV

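TL;DR: This paper proposes Differential Visual Prompting (DiffVP), which extracts semantic differences between a CT scan and a reference as conditioning signals to improve LLM-based CT report generation, raising generation accuracy and clinical efficacy without explicit lesion localization.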

Details Motivation: Existing CT report generation methods typically encode 3D volumes holistically and struggle to separate key diagnostic cues from redundant anatomical background; inspired by radiologists' cognitive subtraction, the authors model the semantic differences between a scan and a reference. Method: The DiffVP framework comprises a hierarchical difference extractor (capturing global and local semantic discrepancies) and a difference-to-prompt generator (mapping the differences to learnable visual prefix tokens), whose outputs serve as structured conditioning input for the LLM. Result: DiffVP significantly outperforms prior methods on two large-scale benchmarks, improving the average BLEU-1 to BLEU-4 by +10.98 and +4.36, respectively; on RadGenome-ChestCT it reaches an F1 score of 0.421, improving clinical efficacy. Conclusion: Through difference-driven visual prompting, DiffVP suppresses invariant anatomy and amplifies diagnostically relevant visual evidence, improving the quality and clinical utility of CT report generation without relying on lesion localization. Abstract: While large language models (LLMs) have advanced CT report generation, existing methods typically encode 3D volumes holistically, failing to distinguish informative cues from redundant anatomical background. Inspired by radiological cognitive subtraction, we propose Differential Visual Prompting (DiffVP), which conditions report generation on explicit, high-level semantic scan-to-reference differences rather than solely on absolute visual features. DiffVP employs a hierarchical difference extractor to capture complementary global and local semantic discrepancies into a shared latent space, along with a difference-to-prompt generator that transforms these signals into learnable visual prefix tokens for LLM conditioning. These difference prompts serve as structured conditioning signals that implicitly suppress invariant anatomy while amplifying diagnostically relevant visual evidence, thereby facilitating accurate report generation without explicit lesion localization. On two large-scale benchmarks, DiffVP consistently outperforms prior methods, improving the average BLEU-1-4 by +10.98 and +4.36, respectively, and further boosts clinical efficacy on RadGenome-ChestCT (F1 score 0.421). All codes will be released at https://github.com/ArielTYH/DiffVP/.

[164] SARE: Sample-wise Adaptive Reasoning for Training-free Fine-grained Visual Recognition

Jingxiao Yang,DaLin He,Miao Pan,Ge Su,Wenqi Zhang,Yifeng Hu,Tangwei Li,Yuke Li,Xuhong Zhang

Main category: cs.CV

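TL;DR: This paper proposes SARE, a framework that combines a sample-adaptive cascaded design (fast candidate retrieval plus fine-grained reasoning) with a self-reflective experience mechanism to significantly improve the accuracy and efficiency of fine-grained visual recognition without any training.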

Details Motivation: Existing uses of LVLMs for fine-grained visual recognition have two fundamental limitations: they apply the same inference pipeline to all samples regardless of recognition difficulty, and they lack a mechanism for reusing error-specific experience, so similar hard samples fail repeatedly. Method: SARE adopts a cascaded structure that first performs fast candidate retrieval and triggers fine-grained reasoning only when necessary; during reasoning, a self-reflective experience mechanism draws on past failure cases to provide transferable discriminative guidance, without updating any parameters. Result: Extensive experiments on 14 datasets show that SARE achieves state-of-the-art performance while substantially reducing computational overhead. Conclusion: Through sample-adaptive reasoning and reuse of error experience, SARE mitigates the visual ambiguity LVLMs face in fine-grained visual recognition and delivers accurate, efficient training-free recognition. Abstract: Recent advances in Large Vision-Language Models (LVLMs) have enabled training-free Fine-Grained Visual Recognition (FGVR). However, effectively exploiting LVLMs for FGVR remains challenging due to the inherent visual ambiguity of subordinate-level categories. Existing methods predominantly adopt either retrieval-oriented or reasoning-oriented paradigms to tackle this challenge, but both are constrained by two fundamental limitations: (1) They apply the same inference pipeline to all samples without accounting for uneven recognition difficulty, thereby leading to suboptimal accuracy and efficiency; (2) The lack of mechanisms to consolidate and reuse error-specific experience causes repeated failures on similar challenging cases. To address these limitations, we propose SARE, a Sample-wise Adaptive REasoning framework for training-free FGVR. Specifically, SARE adopts a cascaded design that combines fast candidate retrieval with fine-grained reasoning, invoking the latter only when necessary. In the reasoning process, SARE incorporates a self-reflective experience mechanism that leverages past failures to provide transferable discriminative guidance during inference, without any parameter updates. Extensive experiments across 14 datasets substantiate that SARE achieves state-of-the-art performance while substantially reducing computational overhead.
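
The cascaded idea described in the SARE abstract (cheap retrieval first, costly reasoning only when needed) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the margin-based trigger, the threshold value, and the `retrieve` / `reason` callables (including the toy stand-ins) are all hypothetical, since the abstract does not specify how SARE decides that reasoning is necessary.

```python
from typing import Callable, Dict, List, Tuple


def cascade_predict(
    sample: str,
    retrieve: Callable[[str], Dict[str, float]],
    reason: Callable[[str, List[str]], str],
    margin_threshold: float = 0.2,  # hypothetical trigger value
    top_k: int = 3,
) -> Tuple[str, bool]:
    """Return (label, used_reasoning) for one sample."""
    scores = retrieve(sample)
    if not scores:
        raise ValueError("retrieval returned no candidates")
    # Rank candidate labels by retrieval score, best first.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    # Easy sample: the top candidate is clearly ahead, so stop here.
    if len(ranked) == 1 or ranked[0][1] - ranked[1][1] >= margin_threshold:
        return ranked[0][0], False
    # Hard sample: hand the shortlist to the expensive reasoning step.
    candidates = [label for label, _ in ranked[:top_k]]
    return reason(sample, candidates), True


def toy_retrieve(sample: str) -> Dict[str, float]:
    """Stand-in for a fast similarity-based retriever (assumed)."""
    table = {
        "easy": {"sparrow": 0.9, "finch": 0.1},
        "hard": {"sparrow": 0.45, "finch": 0.55},
    }
    return table[sample]


def toy_reason(sample: str, candidates: List[str]) -> str:
    """Stand-in for LVLM reasoning (assumed); picks alphabetically."""
    return sorted(candidates)[0]


print(cascade_predict("easy", toy_retrieve, toy_reason))
print(cascade_predict("hard", toy_retrieve, toy_reason))
```

The second return value makes it easy to measure how often the expensive branch fires, which is where any compute saving in such a cascade would come from.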

[165] TAPESTRY: From Geometry to Appearance via Consistent Turntable Videos

Yan Zeng,Haoran Jiang,Kaixin Yao,Qixuan Zhang,Longwen Zhang,Lan Xu,Jingyi Yu

Main category: cs.CV

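TL;DR: This paper proposes TAPESTRY, a framework that uses a geometry-conditioned video diffusion model to generate high-fidelity, geometrically consistent 360-degree turntable videos (TTVs) from untextured 3D meshes, which are then used for high-quality texture synthesis and neural rendering.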

Details Motivation: Existing general-purpose video generation models struggle to maintain strict geometric consistency and appearance stability across the full range of views, so they cannot meet the needs of high-quality 3D reconstruction; meanwhile, automatically generating photorealistic, self-consistent appearances for untextured 3D models is a key challenge in digital content creation. Method: 3D appearance generation is reframed as a geometry-conditioned video diffusion problem: multi-modal geometric features of the input 3D mesh are rendered and encoded to constrain video generation with pixel-level precision; a multi-stage downstream reconstruction pipeline with 3D-aware inpainting then completes self-occluded regions by rotating the model and performing a context-aware secondary generation. Result: The generated TTVs serve both as high-quality dynamic previews and as a 3D-aware intermediate representation that can be back-projected into UV textures or used to supervise neural rendering methods such as 3DGS; experiments show the method outperforms existing approaches in both video consistency and final reconstruction quality. Conclusion: TAPESTRY enables fully automated creation of production-ready, complete 3D assets from untextured meshes, bridging video generation and 3D reconstruction via geometry-conditioned diffusion. Abstract: Automatically generating photorealistic and self-consistent appearances for untextured 3D models is a critical challenge in digital content creation. The advancement of large-scale video generation models offers a natural approach: directly synthesizing 360-degree turntable videos (TTVs), which can serve not only as high-quality dynamic previews but also as an intermediate representation to drive texture synthesis and neural rendering. However, existing general-purpose video diffusion models struggle to maintain strict geometric consistency and appearance stability across the full range of views, making their outputs ill-suited for high-quality 3D reconstruction. To this end, we introduce TAPESTRY, a framework for generating high-fidelity TTVs conditioned on explicit 3D geometry. We reframe the 3D appearance generation task as a geometry-conditioned video diffusion problem: given a 3D mesh, we first render and encode multi-modal geometric features to constrain the video generation process with pixel-level precision, thereby enabling the creation of high-quality and consistent TTVs. Building upon this, we also design a method for downstream reconstruction tasks from the TTV input, featuring a multi-stage pipeline with 3D-Aware Inpainting. By rotating the model and performing a context-aware secondary generation, this pipeline effectively completes self-occluded regions to achieve full surface coverage.
The videos generated by TAPESTRY are not only high-quality dynamic previews but also serve as a reliable, 3D-aware intermediate representation that can be seamlessly back-projected into UV textures or used to supervise neural rendering methods like 3DGS. This enables the automated creation of production-ready, complete 3D assets from untextured meshes. Experimental results demonstrate that our method outperforms existing approaches in both video consistency and final reconstruction quality.

[166] Concept-to-Pixel: Prompt-Free Universal Medical Image Segmentation

Haoyun Chen,Fenghe Tang,Wenxin Ma,Shaohua Kevin Zhou

Main category: cs.CV

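TL;DR: This paper proposes Concept-to-Pixel (C2P), a universal medical image segmentation framework that needs no manual prompts. It decouples anatomical knowledge into geometric and semantic representations, uses a multimodal large language model to produce Semantic Tokens, explicitly supervises Geometric Tokens, and combines dynamic kernel prediction with a Geometry-Aware Inference Consensus mechanism, significantly improving cross-modal, zero-shot, and cross-task generalization.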

Details Motivation: Existing universal medical image segmentation methods rely on manual visual prompts or reference images, which limits automation and robustness; in addition, naive joint training struggles with the large domain shifts between modalities. Method: The C2P framework: 1) decouples anatomical knowledge into Geometric Tokens (explicitly supervised) and Semantic Tokens (high-level concepts distilled by an MLLM); 2) lets both interact deeply with image features to generate input-adaptive dynamic convolution kernels; 3) introduces a Geometry-Aware Inference Consensus mechanism to assess prediction reliability and suppress outliers. Result: On a unified benchmark spanning seven modalities and eight datasets, C2P significantly outperforms single-modality and standalone universal models; it also performs strongly on zero-shot and cross-modal transfer tasks, showing strong generalization. Conclusion: By decoupling and jointly modeling geometric and semantic priors, C2P achieves prompt-free, robust, and generalizable universal medical image segmentation, offering a new paradigm for building unified medical AI foundation models. Abstract: Universal medical image segmentation seeks to use a single foundational model to handle diverse tasks across multiple imaging modalities. However, existing approaches often rely heavily on manual visual prompts or retrieved reference images, which limits their automation and robustness. In addition, naive joint training across modalities often fails to address large domain shifts. To address these limitations, we propose Concept-to-Pixel (C2P), a novel prompt-free universal segmentation framework. C2P explicitly separates anatomical knowledge into two components: Geometric and Semantic representations. It leverages Multimodal Large Language Models (MLLMs) to distill abstract, high-level medical concepts into learnable Semantic Tokens and introduces explicitly supervised Geometric Tokens to enforce universal physical and structural constraints. These disentangled tokens interact deeply with image features to generate input-specific dynamic kernels for precise mask prediction. Furthermore, we introduce a Geometry-Aware Inference Consensus mechanism, which utilizes the model's predicted geometric constraints to assess prediction reliability and suppress outliers. Extensive experiments and analysis on a unified benchmark comprising eight diverse datasets across seven modalities demonstrate the significant superiority of our jointly trained approach, compared to universe- or single-model approaches. Remarkably, our unified model demonstrates strong generalization, achieving impressive results not only on zero-shot tasks involving unseen cases but also in cross-modal transfers across similar tasks.
Code is available at: https://github.com/Yundi218/Concept-to-Pixel

[167] PC-CrossDiff: Point-Cluster Dual-Level Cross-Modal Differential Attention for Unified 3D Referring and Segmentation

Wenbin Tan,Jiawen Lin,Fangyong Wang,Yuan Xie,Yong Xie,Yachao Zhang,Yanyun Qu

Main category: cs.CV

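TL;DR: This paper proposes PC-CrossDiff, a framework that uses point-level and cluster-level cross-modal differential attention to improve 3D visual grounding in complex multi-object scenes, with a notable gain in understanding implicit spatial cues.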

Details Motivation: Existing 3D visual grounding methods degrade significantly in complex multi-object scenes because they struggle to parse implicit localization cues and to suppress dynamic spatial interference, which limits practical use. Method: PC-CrossDiff is a unified dual-task framework comprising Point-Level Differential Attention (PLDA) modules, which apply bidirectional differential modeling between text and point clouds to extract implicit cues, and Cluster-Level Differential Attention (CLDA) modules, which build a hierarchical attention mechanism to enhance relevant spatial relationships and suppress interference. Result: The method achieves state-of-the-art performance on the ScanRefer, NR3D, and SR3D benchmarks; on the Implicit subsets of ScanRefer, it improves the Overall@0.50 score for the 3DREC task by +10.16%. Conclusion: Through dual-level differential attention, PC-CrossDiff addresses both implicit-cue parsing and spatial-interference suppression in complex scenes, significantly improving the robustness and accuracy of 3DREC and 3DRES. Abstract: 3D Visual Grounding (3DVG) aims to localize the referent of natural language referring expressions through two core tasks: Referring Expression Comprehension (3DREC) and Segmentation (3DRES). While existing methods achieve high accuracy in simple, single-object scenes, they suffer from severe performance degradation in complex, multi-object scenes that are common in real-world settings, hindering practical deployment. Existing methods face two key challenges in complex, multi-object scenes: inadequate parsing of implicit localization cues critical for disambiguating visually similar objects, and ineffective suppression of dynamic spatial interference from co-occurring objects, resulting in degraded grounding accuracy. To address these challenges, we propose PC-CrossDiff, a unified dual-task framework with a dual-level cross-modal differential attention architecture for 3DREC and 3DRES. Specifically, the framework introduces: (i) Point-Level Differential Attention (PLDA) modules that apply bidirectional differential attention between text and point clouds, adaptively extracting implicit localization cues via learnable weights to improve discriminative representation; (ii) Cluster-Level Differential Attention (CLDA) modules that establish a hierarchical attention mechanism to adaptively enhance localization-relevant spatial relationships while suppressing ambiguous or irrelevant spatial relations through a localization-aware differential attention block. Our method achieves state-of-the-art performance on the ScanRefer, NR3D, and SR3D benchmarks.
Notably, on the Implicit subsets of ScanRefer, it improves the Overall@0.50 score by +10.16% for the 3DREC task, highlighting its strong ability to parse implicit spatial cues.

[168] Evidence Packing for Cross-Domain Image Deepfake Detection with LVLMs

Yuxin Liu,Fei Wang,Kun Li,Yiqi Nie,Junjie Chen,Zhangling Duan,Zhaohong Jia

Main category: cs.CV

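TL;DR: This paper proposes the Semantic Consistent Evidence Pack (SCEP), a training-free framework that uses large vision-language models (LVLMs) for image deepfake detection. It mines suspicious image patches and fuses semantic and noise features to improve detection, requires no fine-tuning, and generalizes well.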

Details Motivation: Existing LVLM-based image deepfake detection methods usually require costly fine-tuning and generalize poorly to diverse, evolving manipulations. Method: Instead of whole-image inference, SCEP mines suspicious patch tokens that reveal manipulation cues. It uses the vision encoder's CLS token as a global reference, clusters patch features, and scores patches by combining CLS-guided semantic mismatch with frequency- and noise-based anomalies; it then samples high-confidence patches per cluster and applies grid-based NMS, producing a compact evidence pack that conditions a frozen LVLM for prediction. Result: On multiple diverse benchmarks, SCEP significantly outperforms strong baselines without fine-tuning the LVLM. Conclusion: SCEP is an efficient, general, training-free LVLM adaptation framework that offers a new paradigm for image deepfake detection. Abstract: Image Deepfake Detection (IDD) separates manipulated images from authentic ones by spotting artifacts of synthesis or tampering. Although large vision-language models (LVLMs) offer strong image understanding, adapting them to IDD often demands costly fine-tuning and generalizes poorly to diverse, evolving manipulations. We propose the Semantic Consistent Evidence Pack (SCEP), a training-free LVLM framework that replaces whole-image inference with evidence-driven reasoning. SCEP mines a compact set of suspicious patch tokens that best reveal manipulation cues. It uses the vision encoder's CLS token as a global reference, clusters patch features into coherent groups, and scores patches with a fused metric combining CLS-guided semantic mismatch with frequency- and noise-based anomalies. To cover dispersed traces and avoid redundancy, SCEP samples a few high-confidence patches per cluster and applies grid-based NMS, producing an evidence pack that conditions a frozen LVLM for prediction. Experiments on diverse benchmarks show SCEP outperforms strong baselines without LVLM fine-tuning.
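
The SCEP abstract mentions grid-based NMS over scored patches but gives no details. One plausible reading is sketched below: patches are visited in descending score order, and a patch is dropped if it lies within a fixed Chebyshev radius (in patch-grid cells) of an already kept patch. The function name, the `(row, col, score)` layout, the radius, and the `max_keep` cap are assumptions for illustration, not the paper's procedure.

```python
from typing import List, Tuple

Patch = Tuple[int, int, float]  # (grid row, grid col, suspicion score)


def grid_nms(patches: List[Patch], radius: int = 1, max_keep: int = 8) -> List[Patch]:
    """Greedy non-maximum suppression on a patch grid (assumed variant)."""
    kept: List[Patch] = []
    # Visit the most suspicious patches first.
    for row, col, score in sorted(patches, key=lambda p: p[2], reverse=True):
        if len(kept) >= max_keep:
            break
        # Suppress patches whose grid distance to a kept patch is <= radius.
        too_close = any(
            max(abs(row - r), abs(col - c)) <= radius for r, c, _ in kept
        )
        if not too_close:
            kept.append((row, col, score))
    return kept


scored = [(0, 0, 0.9), (0, 1, 0.8), (3, 3, 0.7), (4, 4, 0.6), (0, 3, 0.5)]
evidence = grid_nms(scored)
print(evidence)
```

With the toy scores above, the neighbors of the two strongest patches are suppressed, so the kept set spreads across the grid instead of clustering around one artifact.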

[169] CrowdGaussian: Reconstructing High-Fidelity 3D Gaussians for Human Crowd from a Single Image

Yizheng Song,Yiyu Zhuang,Qipeng Xu,Haixiang Wang,Jiahe Zhu,Jing Tian,Siyu Zhu,Hao Zhu

Main category: cs.CV

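TL;DR: This paper proposes CrowdGaussian, a framework that reconstructs multi-person 3D Gaussian Splatting (3DGS) models from a single image. It uses self-supervised adaptation and Self-Calibrated Learning to handle heavy occlusion, low clarity, and diverse appearances, yielding geometrically consistent and photorealistic multi-person 3D reconstructions.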

Details Motivation: Existing single-view 3D human reconstruction methods mainly target clear, close-up images of a single person and perform poorly in the more common multi-person scenes with heavy occlusion, low clarity, and diverse appearances, so a method dedicated to crowded scenes is needed. Method: CrowdGaussian is a unified framework with: 1) a self-supervised adaptation pipeline built on a pretrained large human model to handle heavy occlusion; 2) Self-Calibrated Learning (SCL), in which a single-step diffusion model blends identity-preserving samples with clean/corrupted image pairs to refine coarse renderings, and the refined results are distilled back into the 3DGS representation. Result: The method produces geometrically coherent, photorealistic 3D reconstructions of multi-person scenes, and experiments show it significantly outperforms existing methods. Conclusion: CrowdGaussian offers an effective new paradigm for single-image multi-person 3D reconstruction, addressing the core challenges of complex real-world scenes by jointly improving rendering quality and the 3D representation. Abstract: Single-view 3D human reconstruction has garnered significant attention in recent years. Despite numerous advancements, prior research has concentrated on reconstructing 3D models from clear, close-up images of individual subjects, often yielding subpar results in the more prevalent multi-person scenarios. Reconstructing 3D human crowd models is a highly intricate task, laden with challenges such as: 1) extensive occlusions, 2) low clarity, and 3) numerous and various appearances. To address this task, we propose CrowdGaussian, a unified framework that directly reconstructs multi-person 3D Gaussian Splatting (3DGS) representations from single-image inputs. To handle occlusions, we devise a self-supervised adaptation pipeline that enables the pretrained large human model to reconstruct complete 3D humans with plausible geometry and appearance from heavily occluded inputs. Furthermore, we introduce Self-Calibrated Learning (SCL). This training strategy enables single-step diffusion models to adaptively refine coarse renderings to optimal quality by blending identity-preserving samples with clean/corrupted image pairs. The outputs can be distilled back to enhance the quality of multi-person 3DGS representations. Extensive experiments demonstrate that CrowdGaussian generates photorealistic, geometrically coherent reconstructions of multi-person scenes.

[170] Exploring parameter-efficient fine-tuning (PEFT) of billion-parameter vision models with QLoRA and DoRA: insights into generalization for limited-data image classification under a 98:1 test-to-train regime

Haiyu Yang,Sumit Sharma,Enhong Liu,Miel Hostens

Main category: cs.CV

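TL;DR: This study systematically compares three approaches to automated behavior classification (training from scratch, frozen feature extraction, and parameter-efficient fine-tuning, PEFT) and finds that PEFT, especially QLoRA, clearly outperforms the others with limited labeled data, balancing high accuracy against low computational cost.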

Details Motivation: To address the high computational cost and scarcity of labeled data in automated behavior classification for precision livestock farming. Method: The study compares training from scratch (ResNet-18, ViT-Small), frozen DINOv3 feature extraction, and parameter-efficient fine-tuning of DINOv3 (QLoRA/DoRA), evaluating performance across different ranks and target-module configurations. Result: QLoRA (all-linear layers, rank=64) reached 83.16% test accuracy with 2.72% of the parameters and 5.8 hours of training, clearly ahead of ResNet-18 (72.87%), ViT-Small (61.91%), and frozen DINOv3 (76.56%); increasing adapter capacity consistently improved generalization, indicating that underfitting rather than overfitting is the main challenge for agricultural imagery. Conclusion: Parameter-efficient fine-tuning, QLoRA in particular, is an effective and practical way to deploy billion-parameter vision foundation models in livestock farming, and the study offers practical guidance for similar low-data agricultural vision tasks. Abstract: Automated behavior classification is essential for precision livestock farming but faces challenges of high computational costs and limited labeled data. This study systematically compared three approaches: training from scratch (ResNet-18, ViT-Small), frozen feature extraction, and parameter-efficient fine-tuning (PEFT) of the DINOv3 foundation model (6.7 billion parameters). We evaluated QLoRA and DoRA across multiple configurations varying rank (8, 16, 64) and target modules (q_proj versus all-linear layers). With 2,160 verified training images, we assessed generalization of our model on 211,800 test samples, which is essentially a 98:1 test-to-train ratio. Results demonstrated that PEFT substantially outperformed alternatives, where the best QLoRA configuration (all-linear layers and rank=64) achieved 83.16% test accuracy with only 2.72% parameters (183.0M) in 5.8 hours, compared to 72.87% for ResNet-18 (16.8 hours), 61.91% for ViT-Small (18.7 hours), and 76.56% for frozen DINOv3 (17.5 hours). DoRA achieved comparable accuracy (83.14%) but with longer training time (11.0 hours). Notably, increasing adapter capacity consistently improved generalization while simultaneously not causing overfitting: reducing rank from 16 to 8 decreased test accuracy from 78.38% to 77.17%, while expanding from q_proj-only to all-linear layers with rank=64 improved accuracy from 78.38% to 83.16%. This suggests underfitting, instead of overfitting, is the primary challenge when adapting foundation models to agricultural imagery.
Our findings provide guidelines for deploying billion-parameter vision models with PEFT in agricultural livestock applications.
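
The headline figures in this abstract can be checked with a few lines of arithmetic. The sketch below only reuses numbers quoted above; the variable names are illustrative. Note that dividing 183.0M by the rounded 6.7 billion gives about 2.73%, so the small gap to the reported 2.72% presumably comes from that rounding (an assumption, since the exact parameter count is not stated here).

```python
# Dataset sizes quoted in the abstract.
train_images = 2_160
test_samples = 211_800
test_to_train_ratio = test_samples / train_images  # roughly 98:1

# Trainable adapter parameters versus the (rounded) backbone size.
trainable_params = 183.0e6
total_params = 6.7e9  # rounded figure from the abstract
trainable_pct = 100 * trainable_params / total_params

# Accuracy gain of the best QLoRA run over the ResNet-18 baseline,
# in percentage points, and the training-time ratio.
gain_over_resnet18 = 83.16 - 72.87
time_ratio_resnet18 = 16.8 / 5.8

print(f"test:train ratio    = {test_to_train_ratio:.2f}:1")
print(f"trainable share     = {trainable_pct:.2f}%")
print(f"gain vs ResNet-18   = {gain_over_resnet18:.2f} points")
print(f"ResNet-18 time / QLoRA time = {time_ratio_resnet18:.1f}x")
```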

[171] ResNet-50 with Class Reweighting and Anatomy-Guided Temporal Decoding for Gastrointestinal Video Analysis

Romil Imtiaz,Dimitris K. Iakovidis

Main category: cs.CV

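TL;DR: This paper presents a multi-label gastrointestinal video analysis pipeline based on ResNet-50 and anatomy-guided temporal decoding; with an improved loss function and temporal modeling strategy, it raises temporal mAP on the challenge test set from 0.3801 to 0.4303.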

Details Motivation: To address severe class imbalance in gastrointestinal videos (especially for rare pathology labels) and the mismatch between temporal event annotations and frame-level predictions. Method: A ResNet-50 frame-level classifier takes 336x336 inputs; a clipped class-wise weighted loss mitigates class imbalance; the temporal stage combines GT-style framewise event composition, anatomy vote smoothing, anatomy-based pathology gating, and a conservative hysteresis decoder. Result: Temporal mAP on the challenge test set improves from 0.3801 to 0.4303. Conclusion: Combining anatomical priors with temporal modeling significantly improves multi-label gastrointestinal video analysis and is more robust for rare pathologies. Abstract: We developed a multi-label gastrointestinal video analysis pipeline based on a ResNet-50 frame classifier followed by anatomy-guided temporal event decoding. The system predicts 17 labels, including 5 anatomy classes and 12 pathology classes, from frames resized to 336x336. A major challenge was severe class imbalance, particularly for rare pathology labels. To address this, we used clipped class-wise positive weighting in the training loss, which improved rare-class learning while maintaining stable optimization. At the temporal stage, we found that direct frame-to-event conversion produced fragmented mismatches with the official ground truth. The final submission therefore combined GT-style framewise event composition, anatomy vote smoothing, and anatomy-based pathology gating with a conservative hysteresis decoder. This design improved the final temporal mAP from 0.3801 to 0.4303 on the challenge test set.
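
Two ingredients named in this abstract, clipped class-wise positive weighting and hysteresis decoding, can be illustrated with a small sketch. The abstract gives neither formulas nor thresholds, so everything concrete here is an assumption: the weight is taken as the negative-to-positive ratio capped at `max_weight`, and the decoder opens an event when a frame probability reaches `high` and closes it only once the probability drops below `low`.

```python
from typing import List, Optional, Tuple


def clipped_pos_weights(
    pos_counts: List[int], total: int, max_weight: float = 10.0
) -> List[float]:
    """Per-class positive weight = min(neg / pos, max_weight) (assumed form)."""
    weights = []
    for pos in pos_counts:
        neg = total - pos
        # Classes with no positives fall back to the cap.
        raw = neg / pos if pos > 0 else max_weight
        weights.append(min(raw, max_weight))
    return weights


def hysteresis_events(
    probs: List[float], high: float = 0.6, low: float = 0.4
) -> List[Tuple[int, int]]:
    """Turn frame probabilities into (start, end) frame-index events."""
    events: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for i, p in enumerate(probs):
        if start is None and p >= high:
            start = i  # open an event only on a confident frame
        elif start is not None and p < low:
            events.append((start, i - 1))  # close once clearly below `low`
            start = None
    if start is not None:
        events.append((start, len(probs) - 1))
    return events


weights = clipped_pos_weights([500, 10, 0], total=1000)
events = hysteresis_events([0.1, 0.7, 0.5, 0.45, 0.3, 0.8, 0.9])
print(weights)
print(events)
```

Using two thresholds instead of one keeps an event alive through brief dips in confidence, which is one way to reduce the fragmented events the abstract describes. In a PyTorch setup, weights of this kind would typically be passed as `pos_weight` to `torch.nn.BCEWithLogitsLoss`, though the paper's actual loss implementation is not described.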

[172] Fine-Grained Post-Training Quantization for Large Vision Language Models with Quantization-Aware Integrated Gradients

Ziwei Xiang,Fanhu Zeng,Hongjian Fang,Rui-Qi Wang,Renxing Chen,Yanan Zhu,Yi Chen,Peipei Yang,Xu-Yao Zhang

Main category: cs.CV

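TL;DR: This paper proposes a fine-grained quantization method based on Quantization-aware Integrated Gradients (QIG) that moves sensitivity analysis from the modality level to the token level, significantly improving the accuracy of large vision-language models (LVLMs) under low-bit quantization with negligible latency overhead.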

Details Motivation: Existing LVLM quantization methods measure token sensitivity only at the modality level, which fails to capture complex cross-token interactions and cannot quantify quantization error at the token level; because token interactions gradually blur modality boundaries inside LVLMs, a finer-grained calibration strategy is needed. Method: Inspired by axiomatic attribution in mechanistic interpretability, the authors propose Quantization-aware Integrated Gradients (QIG), which uses integrated gradients to quantify the sensitivity of each token, enabling calibration at the token level rather than the modality level and reflecting both inter-modality and intra-modality dynamics. Result: Experiments on multiple LVLMs under W4A8 and W3A16 settings show accuracy gains across models and benchmarks with negligible latency overhead; for example, under 3-bit weight-only quantization, the average accuracy of LLaVA-onevision-7B improves by 1.60%, narrowing the gap to full precision to only 1.33%. Conclusion: Token-level fine-grained quantization is an effective route to better low-bit compression of LVLMs; by using integrated gradients, QIG models sensitivity more precisely and strikes a better balance between accuracy and efficiency. Abstract: Large Vision Language Models (LVLMs) have achieved remarkable success in a range of downstream tasks that require multimodal interaction, but their capabilities come with substantial computational and memory overhead, which hinders practical deployment. Among numerous acceleration techniques, post-training quantization is a popular and effective strategy for reducing memory cost and accelerating inference. However, existing LVLM quantization methods typically measure token sensitivity at the modality level, which fails to capture the complex cross-token interactions and falls short in quantitatively measuring the quantization error at the token level. As tokens interact within the model, the distinction between modalities gradually diminishes, suggesting the need for fine-grained calibration. Inspired by axiomatic attribution in mechanistic interpretability, we introduce a fine-grained quantization strategy on Quantization-aware Integrated Gradients (QIG), which leverages integrated gradients to quantitatively evaluate token sensitivity and push the granularity from modality level to token level, reflecting both inter-modality and intra-modality dynamics. Extensive experiments on multiple LVLMs under both W4A8 and W3A16 settings show that our method improves accuracy across models and benchmarks with negligible latency overhead. For example, under 3-bit weight-only quantization, our method improves the average accuracy of LLaVA-onevision-7B by 1.60%, reducing the gap to its full-precision counterpart to only 1.33%.
The code is available at https://github.com/ucas-xiang/QIG.
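
For readers unfamiliar with the attribution method that QIG builds on, here is a minimal sketch of plain integrated gradients on a toy function. It is not the QIG method: how the paper makes the attribution quantization-aware is not described in the abstract, and the toy function, its hand-written gradient, and the midpoint-rule integration are assumptions chosen to keep the example dependency-free.

```python
from typing import Callable, List, Sequence


def integrated_gradients(
    grad_fn: Callable[[Sequence[float]], List[float]],
    x: Sequence[float],
    baseline: Sequence[float],
    steps: int = 50,
) -> List[float]:
    """Approximate IG_i = (x_i - b_i) * integral over alpha in [0, 1] of
    df/dx_i evaluated at b + alpha * (x - b), using the midpoint rule."""
    diffs = [xi - bi for xi, bi in zip(x, baseline)]
    grad_sums = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval
        point = [bi + alpha * d for bi, d in zip(baseline, diffs)]
        for i, g in enumerate(grad_fn(point)):
            grad_sums[i] += g
    return [d * s / steps for d, s in zip(diffs, grad_sums)]


def toy_f(v: Sequence[float]) -> float:
    """Toy scalar function standing in for a model output (assumed)."""
    return v[0] ** 2 + 3.0 * v[0] * v[1]


def toy_grad(v: Sequence[float]) -> List[float]:
    """Analytic gradient of toy_f."""
    return [2.0 * v[0] + 3.0 * v[1], 3.0 * v[0]]


x, baseline = [1.0, 2.0], [0.0, 0.0]
attributions = integrated_gradients(toy_grad, x, baseline)
print(attributions)
print(sum(attributions), toy_f(x) - toy_f(baseline))
```

The last line illustrates the completeness axiom: the attributions sum to the difference between the function value at the input and at the baseline. That property is what makes integrated gradients attractive as a quantitative per-input (here, per-token) sensitivity score.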

[173] ChopGrad: Pixel-Wise Losses for Latent Video Diffusion via Truncated Backpropagation

Dmitriy Rivkin,Parker Ewen,Lili Gao,Julian Ost,Stefanie Walz,Rasika Kangutkar,Mario Bijelic,Felix Heide

Main category: cs.CV

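TL;DR: This paper proposes ChopGrad, a truncated backpropagation scheme for video decoding that restricts gradient computation to local frame windows while maintaining global consistency, substantially reducing memory use and enabling efficient fine-tuning.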

Details Motivation: When existing video diffusion models are trained in the pixel domain, recurrent frame processing makes memory cost grow linearly with video length, which makes fine-tuning with pixel-wise losses impractical for long or high-resolution videos. Method: ChopGrad applies truncated backpropagation within local frame windows and is supported by a theoretical analysis of global consistency. Result: ChopGrad reduces training memory from linear in the number of frames to constant, and outperforms existing state-of-the-art methods on video super-resolution, video inpainting, enhancement of neural-rendered scenes, and controlled driving video generation. Conclusion: ChopGrad removes the training-memory bottleneck of video diffusion models and offers a practical route to efficient fine-tuning of video generation under pixel-level supervision. Abstract: Recent video diffusion models achieve high-quality generation through recurrent frame processing where each frame generation depends on previous frames. However, this recurrent mechanism means that training such models in the pixel domain incurs prohibitive memory costs, as activations accumulate across the entire video sequence. This fundamental limitation also makes fine-tuning these models with pixel-wise losses computationally intractable for long or high-resolution videos. This paper introduces ChopGrad, a truncated backpropagation scheme for video decoding, limiting gradient computation to local frame windows while maintaining global consistency. We provide a theoretical analysis of this approximation and show that it enables efficient fine-tuning with frame-wise losses. ChopGrad reduces training memory from scaling linearly with the number of video frames (full backpropagation) to constant memory, and compares favorably to existing state-of-the-art video diffusion models across a suite of conditional video generation tasks with pixel-wise losses, including video super-resolution, video inpainting, video enhancement of neural-rendered scenes, and controlled driving video generation.

[174] M2P: Improving Visual Foundation Models with Mask-to-Point Weakly-Supervised Learning for Dense Point Tracking

Qiangqiang Wu,Tianyu Yang,Bo Fang,Jia Wan,Matias Di Martino,Guillermo Sapiro,Antoni B. Chan

Main category: cs.CV

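TL;DR: This paper proposes Mask-to-Point (M2P) learning, which uses video object segmentation (VOS) mask annotations and three mask-driven weakly-supervised constraints (local structure consistency, mask label consistency, and a mask boundary constraint) to improve vision foundation models such as DINOv2/v3, significantly boosting Tracking Any Point (TAP) performance in videos.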

Details Motivation: Existing Vision Foundation Models (VFMs) pretrained on static images struggle to model dense temporal correspondence in videos, which limits their performance on Tracking Any Point. Method: The Mask-to-Point (M2P) learning framework introduces three mask-driven weakly-supervised constraints: 1) a local structure consistency loss based on Procrustes analysis; 2) a mask label consistency (MLC) loss that ensures foreground points match foreground regions across frames; 3) a mask boundary constraint that explicitly supervises boundary points. Training is efficient, using only 3.6K VOS videos. Result: On the TAP-Vid-DAVIS benchmark, M2P improves over DINOv2-B/14 and DINOv3-B/16 by 12.8% and 14.6%, respectively; it can serve as a general pretrained backbone for both test-time optimized and offline fine-tuned TAP tasks. Conclusion: M2P is an effective weakly-supervised representation learning paradigm that significantly strengthens the temporal modeling of VFMs for dense point tracking, providing a better pretrained foundation for video understanding. Abstract: Tracking Any Point (TAP) has emerged as a fundamental tool for video understanding. Current approaches adapt Vision Foundation Models (VFMs) like DINOv2 via offline finetuning or test-time optimization. However, these VFMs rely on static image pre-training, which is inherently sub-optimal for capturing dense temporal correspondence in videos. To address this, we propose Mask-to-Point (M2P) learning, which leverages rich video object segmentation (VOS) mask annotations to improve VFMs for dense point tracking. Our M2P introduces three new mask-based constraints for weakly-supervised representation learning. First, we propose a local structure consistency loss, which leverages Procrustes analysis to model the cohesive motion of points lying within a local structure, achieving more reliable point-to-point matching learning. Second, we propose a mask label consistency (MLC) loss, which enforces that sampled foreground points strictly match foreground regions across frames. The proposed MLC loss can be regarded as a regularization, which stabilizes training and prevents convergence to trivial solutions. Finally, a mask boundary constraint is applied to explicitly supervise boundary points. We show that our weakly-supervised M2P models significantly outperform baseline VFMs with efficient training by using only 3.6K VOS training videos. Notably, M2P achieves 12.8% and 14.6% performance gains over DINOv2-B/14 and DINOv3-B/16 on the TAP-Vid-DAVIS benchmark, respectively.
Moreover, the proposed M2P models are used as pre-trained backbones for both test-time optimized and offline fine-tuned TAP tasks, demonstrating their potential to serve as general pre-trained models for point tracking. Code will be made publicly available upon acceptance.

[175] Steering Video Diffusion Transformers with Massive Activations

Xianhang Cheng,Yujian Zheng,Zhenyu Xie,Tingting Liao,Hao Li

Main category: cs.CV

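TL;DR: This paper finds that video diffusion transformers exhibit Massive Activations (MAs) whose magnitudes follow a structured pattern across tokens at different temporal positions. Building on this, it proposes Structured Activation Steering (STAS), a training-free method that steers the MA values of tokens at key positions to improve video quality and temporal consistency at minimal overhead.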

Details Motivation: Despite rapid progress in video diffusion transformers, how to exploit their internal model signals with minimal overhead to improve video generation quality remains underexplored. Method: Structured Activation Steering (STAS) is a training-free, self-guidance-like method that steers the MA values of first-frame and latent-frame boundary tokens toward a scaled global maximum reference magnitude. Result: STAS consistently improves video quality and temporal coherence across multiple text-to-video models while adding negligible computational overhead. Conclusion: The structured distribution of MAs in video diffusion transformers reveals an implicit preference for token positions aligned with latent temporal chunking; STAS shows that lightweight activation steering alone can noticeably improve generation. Abstract: Despite rapid progress in video diffusion transformers, how their internal model signals can be leveraged with minimal overhead to enhance video generation quality remains underexplored. In this work, we study the role of Massive Activations (MAs), which are rare, high-magnitude hidden state spikes in video diffusion transformers. We observed that MAs emerge consistently across all visual tokens, with a clear magnitude hierarchy: first-frame tokens exhibit the largest MA magnitudes, latent-frame boundary tokens (the head and tail portions of each temporal chunk in the latent space) show elevated but slightly lower MA magnitudes than the first frame, and interior tokens within each latent frame remain elevated, yet are comparatively moderate in magnitude. This structured pattern suggests that the model implicitly prioritizes token positions aligned with the temporal chunking in the latent space. Based on this observation, we propose Structured Activation Steering (STAS), a training-free self-guidance-like method that steers MA values at first-frame and boundary tokens toward a scaled global maximum reference magnitude. STAS achieves consistent improvements in terms of video quality and temporal coherence across different text-to-video models, while introducing negligible computational overhead.

[176] TINA: Text-Free Inversion Attack for Unlearned Text-to-Image Diffusion Models

Qianlong Xiang,Miao Zhang,Haoyu Zhang,Kun Wang,Junhui Hou,Liqiang Nie

Main category: cs.CV

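TL;DR: This paper proposes TINA (Text-free INversion Attack), an inversion attack that needs no text guidance, to expose a flaw in current concept erasure techniques for text-to-image diffusion models: although the text mapping is severed, visual knowledge remains, and TINA bypasses text-centric defenses to recover erased concepts, showing that existing methods obscure concepts rather than truly remove them.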

Details Motivation: Existing concept erasure techniques focus on severing the text-to-image mapping and overlook the visual knowledge that persists inside the model; whether erasure truly works needs to be verified from a visual perspective. Method: TINA performs DDIM inversion under a null-text condition to avoid triggering text-centric defenses, and adds an optimization procedure to correct the approximation errors that accumulate without text guidance. Result: TINA regenerates erased concepts from models treated with state-of-the-art unlearning, showing that current erasure methods obscure rather than eliminate visual knowledge. Conclusion: The current concept erasure paradigm has a fundamental flaw, and new paradigms that directly operate on and modify the model's internal visual knowledge are urgently needed. Abstract: Although text-to-image diffusion models exhibit remarkable generative power, concept erasure techniques are essential for their safe deployment to prevent the creation of harmful content. This has fostered a dynamic interplay between the development of erasure defenses and the adversarial probes designed to bypass them, and this co-evolution has progressively enhanced the efficacy of erasure methods. However, this adversarial co-evolution has converged on a narrow, text-centric paradigm that equates erasure with severing the text-to-image mapping, ignoring that the underlying visual knowledge related to undesired concepts still persists. To substantiate this claim, we investigate from a visual perspective, leveraging DDIM inversion to probe whether a generative pathway for the erased concept can still be found. However, identifying such a visual generative pathway is challenging because standard text-guided DDIM inversion is actively resisted by text-centric defenses within the erased model. To address this, we introduce TINA, a novel Text-free INversion Attack, which enforces this visual-only probe by operating under a null-text condition, thereby avoiding existing text-centric defenses. Moreover, TINA integrates an optimization procedure to overcome the accumulating approximation errors that arise when standard inversion operates without its usual textual guidance. Our experiments demonstrate that TINA regenerates erased concepts from models treated with state-of-the-art unlearning. The success of TINA proves that current methods merely obscure concepts, highlighting an urgent need for paradigms that operate directly on internal visual knowledge.

[177] Video Understanding: From Geometry and Semantics to Unified Models

Zhaochong An,Zirui Li,Mingqiao Ye,Feng Qiao,Jiaang Li,Zongwei Wu,Vishal Thengane,Chengzu Li,Lei Li,Luc Van Gool,Guolei Sun,Serge Belongie

Main category: cs.CV

TL;DR: This survey provides a structured overview of video understanding research, categorizing it into low-level geometry, high-level semantics, and unified models, while emphasizing the shift toward adaptable, unified paradigms and identifying challenges for future video foundation models.

Details Motivation: Video understanding is more challenging than image understanding due to the need for spatiotemporal reasoning; a systematic, structured survey is needed to map progress and guide future development of unified video foundation models. Method: The paper organizes existing literature into three complementary perspectives—low-level video geometry understanding, high-level semantic understanding, and unified video understanding models—and analyzes trends toward unified, adaptable modeling paradigms. Result: A coherent taxonomy and synthesis of video understanding research, highlighting key modeling trends, design principles, and the transition from task-specific pipelines to unified frameworks. Conclusion: The survey establishes a comprehensive landscape of video understanding, underscoring the importance of unified modeling and outlining open challenges for building robust, scalable, and general-purpose video foundation models. Abstract: Video understanding aims to enable models to perceive, reason about, and interact with the dynamic visual world. In contrast to image understanding, video understanding inherently requires modeling temporal dynamics and evolving visual context, placing stronger demands on spatiotemporal reasoning and making it a foundational problem in computer vision. In this survey, we present a structured overview of video understanding by organizing the literature into three complementary perspectives: low-level video geometry understanding, high-level semantic understanding, and unified video understanding models. We further highlight a broader shift from isolated, task-specific pipelines toward unified modeling paradigms that can be adapted to diverse downstream objectives, enabling a more systematic view of recent progress. 
By consolidating these perspectives, this survey provides a coherent map of the evolving video understanding landscape, summarizes key modeling trends and design principles, and outlines open challenges toward building robust, scalable, and unified video foundation models.

[178] Omni-3DEdit: Generalized Versatile 3D Editing in One-Pass

Chen Liyi,Wang Pengfei,Zhang Guowen,Ma Zhiyuan,Zhang Lei

Main category: cs.CV

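TL;DR: Omni-3DEdit is a unified, learning-based 3D editing model. By constructing a paired multi-view dataset and adapting the SEVA backbone with a dual-stream LoRA module, it completes a variety of 3D editing tasks in a single forward pass, cutting inference time from tens of minutes to about two minutes.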

Details Motivation: Existing 3D editing methods guided by 2D models have two problems: the task design is not universal (geometry manipulation rules must be tailored to each task), and the optimization is slow (thousands of iterations). Method: Builds a high-quality dataset of paired multi-view editing samples; uses the pre-trained generative model SEVA as the backbone, concatenating source view latents with conditional tokens in sequence space; proposes a dual-stream LoRA module to disentangle different view cues. Result: Omni-3DEdit completes various 3D editing tasks in one forward pass, reduces inference time from tens of minutes to about two minutes, and shows strong effectiveness and efficiency in experiments. Conclusion: Omni-3DEdit offers an efficient, general, implicit paradigm for 3D editing that overcomes the generalization and compute bottlenecks of explicit optimization methods. Abstract: Most instruction-driven 3D editing methods rely on 2D models to guide the explicit and iterative optimization of 3D representations. This paradigm, however, suffers from two primary drawbacks. First, it lacks a universal design for different 3D editing tasks because the explicit manipulation of 3D geometry necessitates task-dependent rules, e.g., 3D appearance editing demands inherent source 3D geometry, while 3D removal alters source geometry. Second, the iterative optimization process is highly time-consuming, often requiring thousands of invocations of 2D/3D updating. We present Omni-3DEdit, a unified, learning-based model that generalizes various 3D editing tasks implicitly. One key challenge in achieving our goal is the scarcity of paired source-edited multi-view assets for training. To address this issue, we construct a data pipeline, synthesizing a relatively large number of high-quality paired multi-view editing samples. Subsequently, we adapt the pre-trained generative model SEVA as our backbone by concatenating source view latents along with conditional tokens in sequence space. A dual-stream LoRA module is proposed to disentangle different view cues, largely enhancing our model's representational learning capability. As a learning-based model, our model is free of the time-consuming online optimization, and it can complete various 3D editing tasks in one forward pass, reducing the inference time from tens of minutes to approximately two minutes. Extensive experiments demonstrate the effectiveness and efficiency of Omni-3DEdit.

[179] Revisiting foundation models for cell instance segmentation

Anwai Archit,Constantin Pape

Main category: cs.CV

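TL;DR: This paper comprehensively evaluates SAM-based foundation models for cell segmentation (CellPoseSAM, CellSAM, μSAM) and general-purpose segmentation models (SAM, SAM2, SAM3) on a range of microscopy datasets, and proposes an automatic prompt generation (APG) strategy that improves the instance segmentation results of μSAM to a level competitive with CellPoseSAM.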

Details Motivation: Most existing foundation models for cell segmentation are SAM variants, but a systematic comparison that includes the latest general-purpose models (SAM2, SAM3) and the specialized models is missing; more effective strategies for adapting such models to microscopy images also need to be explored. Method: Evaluates multiple SAM-family models under one protocol on diverse microscopy datasets (cells, nuclei, organoids); proposes automatic prompt generation (APG) to improve instance segmentation with SAM-style models on microscopy images, validated with μSAM as the base model. Result: APG consistently improves the segmentation results of μSAM and is competitive with the state-of-the-art model CellPoseSAM; the evaluation shows where each model does well or poorly across microscopy tasks. Conclusion: APG is a simple and effective adaptation strategy for microscopy images; the study provides lessons for adapting SAM-style models to microscopy and a strategy for building stronger microscopy foundation models. Abstract: Cell segmentation is a fundamental task in microscopy image analysis. Several foundation models for cell segmentation have been introduced; virtually all of them are extensions of the Segment Anything Model (SAM), improving it for microscopy data. Recently, SAM2 and SAM3 have been published, further improving and extending the capabilities of general-purpose segmentation foundation models. Here, we comprehensively evaluate foundation models for cell segmentation (CellPoseSAM, CellSAM, μSAM) and for general-purpose segmentation (SAM, SAM2, SAM3) on a diverse set of (light) microscopy datasets, for tasks including cell, nucleus and organoid segmentation. Furthermore, we introduce a new instance segmentation strategy called automatic prompt generation (APG) that can be used to further improve SAM-based microscopy foundation models. APG consistently improves segmentation results for μSAM, which is used as the base model, and is competitive with the state-of-the-art model CellPoseSAM. Moreover, our work provides important lessons for adaptation strategies of SAM-style models to microscopy and provides a strategy for creating even more powerful microscopy foundation models. Our code is publicly available at https://github.com/computational-cell-analytics/micro-sam.

[180] VISER: Visually-Informed System for Enhanced Robustness in Open-Set Iris Presentation Attack Detection

Byron Dowling,Eleanor Frederick,Jacob Piland,Adam Czajka

Main category: cs.CV

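TL;DR: This paper compares several forms of saliency guidance (hand annotations, eye tracking heatmaps, segmentation masks, and DINOv2 embeddings) for open-set iris presentation attack detection (PAD) and finds that denoised eye tracking heatmaps give the largest improvement in AUROC and APCER.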

Details Motivation: Which form of human saliency (hand annotations, eye tracking heatmaps, and so on) works best for open-set iris presentation attack detection (PAD) has not been studied thoroughly. Method: In a leave-one-attack-type-out protocol for open-set iris PAD, compares hand annotations, eye tracking heatmaps, segmentation masks, and DINOv2 embeddings against a deep learning baseline trained with cross entropy. Result: Denoised eye tracking heatmaps give the best generalization improvement over the cross-entropy baseline in AUROC and in APCER at BPCER = 1%. Conclusion: Denoised eye tracking heatmaps are the most effective saliency source for generalization in open-set iris PAD; the authors release models, code, and saliency maps to support reproduction and follow-up research. Abstract: Human perceptual priors have shown promise in saliency-guided deep learning training, particularly in the domain of iris presentation attack detection (PAD). Common saliency approaches include hand annotations obtained via mouse clicks and eye gaze heatmaps derived from eye tracking data. However, the most effective form of human saliency for open-set iris PAD remains underexplored. In this paper, we conduct a series of experiments comparing hand annotations, eye tracking heatmaps, segmentation masks, and DINOv2 embeddings to a state-of-the-art deep learning-based baseline on the task of open-set iris PAD. Results for open-set PAD in a leave-one-attack-type-out paradigm indicate that denoised eye tracking heatmaps show the best generalization improvement over cross entropy in terms of Area Under the ROC curve (AUROC) and Attack Presentation Classification Error Rate (APCER) at Bona Fide Presentation Classification Error Rate (BPCER) of 1%. Along with this paper, we offer trained models, code, and saliency maps for reproducibility and to facilitate follow-up research efforts.
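The "APCER at BPCER of 1%" metric above can be computed from raw detector scores roughly as follows. This is a sketch under an assumed convention (higher score means more attack-like, and a sample is flagged when its score exceeds the threshold); the paper's own evaluation code may differ, and the function name is illustrative.

```python
import numpy as np

def apcer_at_bpcer(bona_fide_scores, attack_scores, target_bpcer=0.01):
    """Return (APCER, achieved BPCER, threshold) at a fixed BPCER budget.

    Assumed convention: higher score = more attack-like, and a sample is
    classified as an attack when score > threshold.
    """
    bona = np.sort(np.asarray(bona_fide_scores, dtype=float))
    attack = np.asarray(attack_scores, dtype=float)
    # How many bona fide samples may be wrongly flagged within the budget.
    allowed = int(np.floor(target_bpcer * bona.size + 1e-9))
    # Pick the threshold so that at most `allowed` bona fide scores exceed it.
    threshold = bona[bona.size - allowed - 1]
    bpcer = float(np.mean(bona > threshold))      # bona fide flagged as attack
    apcer = float(np.mean(attack <= threshold))   # attacks accepted as bona fide
    return apcer, bpcer, threshold
```

In a leave-one-attack-type-out setting, `attack_scores` would hold only the scores of the held-out attack type, so the number reflects generalization to an unseen attack.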

[181] Edit Spillover as a Probe: Do Image Editing Models Implicitly Understand World Relations?

Guandong Li,Zhaobin Chu

Main category: cs.CV

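TL;DR: This paper proposes EditSpilloverProbe, a framework that treats the "edit spillover" common in image editing models as a natural probe of their world knowledge. With a spillover taxonomy, a detection pipeline, and the benchmark dataset EditSpilloverBench, it evaluates 5 representative models and finds that spillover rates vary widely, that editing control trades off against world understanding, and that semantic spillover reflects genuine world knowledge rather than attention leakage.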

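Details Motivation: To determine what "edit spillover" in image editing models really is: a sign of implicit world understanding, or merely leakage caused by flaws in the attention mechanism. Method: Proposes the EditSpilloverProbe framework, comprising a spillover taxonomy (spatial, semantic, mixed, random), an automated detection-and-classification pipeline, and the benchmark dataset EditSpilloverBench built from real-world Chinese text editing tasks; evaluates 5 representative editing models, analyzing spillover rate, semantic spillover quantity, and spatial decay. Result: (1) Spillover rates differ by a factor of 3.3 across models (3.49% to 11.46%); (2) the amount of semantic spillover reflects world understanding: nano_banana produces the most semantic spillover (27.8 per image), while qwen_2511 edits most precisely but produces less semantic spillover (16.3 per image), revealing a trade-off between control and understanding; (3) spillover density decays exponentially with distance, while the share of semantically relevant spillover stays within 40%-58%, indicating that it stems from genuine world knowledge. Conclusion: Semantic edit spillover is a reliable indicator of the world knowledge inside image editing models, not mere attention leakage; it can be used systematically to evaluate and improve commonsense reasoning in these models. Abstract: Instruction-following image editing models are expected to modify only the specified region while keeping the rest of the image unchanged. However, in practice, we observe a pervasive phenomenon -- edit spillover: models alter semantically related but unspecified content outside the edit region. This raises a fundamental question -- does spillover reflect genuine implicit world understanding, or is it merely attention leakage? We propose EditSpilloverProbe, a systematic framework that repurposes edit spillover as a natural probe for world knowledge in image editing models. We introduce a spillover taxonomy (spatial, semantic, mixed, random), an automated detection-and-classification pipeline, and a benchmark dataset constructed from real-world Chinese text editing tasks, EditSpilloverBench. 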
Systematic evaluation of 5 representative editing models reveals three core findings: (1) spillover rates vary dramatically across architectures, from 3.49% to 11.46%, with a 3.3x ratio; (2) absolute semantic spillover quantity reveals models' world understanding capability -- nano_banana produces the most semantic spillover (27.8 per image), while qwen_2511 has the most precise editing control but lower semantic spillover (16.3 per image), revealing a trade-off between editing control and world understanding; (3) spatial decay analysis shows spillover area density decays exponentially with distance, but the proportion of semantically relevant spillover remains constant (40%-58%), providing direct evidence that semantic spillover reflects genuine world understanding rather than spatial diffusion.
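The spatial decay claim in finding (3) can be checked on measured data with a log-linear fit. The sketch below assumes a model of the form density ≈ a · exp(−b · distance); the function name and any distance bins are illustrative and not taken from the paper.

```python
import numpy as np

def fit_exponential_decay(distance, density):
    """Fit density ~ a * exp(-b * distance) by least squares on log(density).

    Requires strictly positive densities. Returns (a, b).
    """
    distance = np.asarray(distance, dtype=float)
    density = np.asarray(density, dtype=float)
    # A straight line in log space: log(density) = log(a) - b * distance.
    slope, intercept = np.polyfit(distance, np.log(density), 1)
    return float(np.exp(intercept)), float(-slope)
```

If spillover were pure spatial diffusion, the semantically relevant share would be expected to fall along with the density; the abstract reports that the density decays while that share stays roughly constant.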

[182] Differential Attention-Augmented BiomedCLIP with Asymmetric Focal Optimization for Imbalanced Multi-Label Video Capsule Endoscopy Classification

Podakanti Satyajith Chary,Nagarajan Ganapathy

Main category: cs.CV

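TL;DR: This paper proposes a multi-label classification framework for video capsule endoscopy (VCE) that combines a modified BiomedCLIP (with a differential attention mechanism), several imbalance-handling strategies (sqrt-frequency weighted sampling, asymmetric focal loss, and others), and temporal coherence processing, and reports its temporal mAP on the RARE-VISION test set.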

Details Motivation: To handle the extreme class imbalance of the Galar dataset, where pathological findings make up less than 0.1% of frames, and to improve detection of rare lesions in VCE video. Method: Modifies BiomedCLIP by replacing standard multi-head self-attention with a differential attention mechanism; uses a sqrt-frequency weighted sampler, asymmetric focal loss, mixup regularization, and per-class threshold optimization; applies median-filter smoothing and gap merging for temporal coherence. Result: On the RARE-VISION test set (161,025 frames), the pipeline reaches a temporal mAP@0.5 of 0.2456 and mAP@0.95 of 0.2353, with inference taking about 8.6 minutes on a single GPU. Conclusion: The framework addresses extremely imbalanced multi-label VCE classification with temporally coherent event output, offering a feasible route to clinical decision support. Abstract: This work presents a multi-label classification framework for video capsule endoscopy (VCE) that addresses the extreme class imbalance inherent in the Galar dataset through a combination of architectural and optimization-level strategies. Our approach modifies BiomedCLIP, a biomedical vision-language foundation model, by replacing its standard multi-head self-attention with a differential attention mechanism that computes the difference between two softmax attention maps to suppress attention noise. To counteract the skewed label distribution, where pathological findings constitute less than 0.1% of all annotated frames, a sqrt-frequency weighted sampler, asymmetric focal loss, mixup regularization, and per-class threshold optimization are employed. Temporal coherence is enforced through median-filter smoothing and gap merging prior to event-level JSON generation. On the held-out RARE-VISION test set comprising three NaviCam examinations (161,025 frames), the pipeline achieves an overall temporal mAP@0.5 of 0.2456 and mAP@0.95 of 0.2353, with total inference completed in approximately 8.6 minutes on a single GPU.
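The differential attention mechanism described above ("the difference between two softmax attention maps") can be sketched for a single head in NumPy. The scalar weight `lam` on the second map is an assumption borrowed from the general differential-attention formulation; the abstract does not say how the two maps are weighted, and the paper's multi-head BiomedCLIP integration is not reproduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def differential_attention(q1, k1, q2, k2, v, lam=0.5):
    """Single-head differential attention.

    q1, k1 and q2, k2 are two separate query/key projections of the same
    tokens, each of shape (tokens, dim). `lam` (assumed scalar) weights the
    second map, which acts as an estimate of attention noise to subtract.
    """
    d = q1.shape[-1]
    a1 = softmax(q1 @ k1.T / np.sqrt(d))  # first attention map
    a2 = softmax(q2 @ k2.T / np.sqrt(d))  # second attention map
    return (a1 - lam * a2) @ v            # attention weights shared by both maps cancel
```

With `lam=0` this reduces to standard scaled dot-product attention; when the two maps are identical and `lam=1`, the output is zero, which illustrates how common-mode attention is cancelled.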

[183] Identity as Presence: Towards Appearance and Voice Personalized Joint Audio-Video Generation

Yingjie Chen,Shilun Lin,Cai Xing,Qixin Yan,Wenjing Wang,Dingming Liu,Hao Liu,Chen Li,Jing Lyu

Main category: cs.CV

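TL;DR: This paper proposes a unified and scalable framework for identity-aware joint audio-video generation that supports fine-grained control over facial appearance and voice timbre for multiple identities, using automated data curation, a flexible identity injection mechanism, and a multi-stage training strategy to achieve high-fidelity, consistent personalization.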

Details Motivation: No openly accessible, unified framework offers fine-grained control over facial appearance and voice timbre across multiple identities. Method: Proposes a data curation pipeline that automatically extracts identity-bearing information across audio and visual modalities; designs a flexible, scalable identity injection mechanism for single- and multi-subject scenarios; adopts a multi-stage training strategy to handle modality disparity, accelerate convergence, and strengthen cross-modal coherence. Result: Experiments show that the framework outperforms existing methods. Conclusion: The framework provides an efficient, controllable, and scalable solution for identity-aware joint audio-video generation and advances personalized content creation. Abstract: Recent advances have demonstrated compelling capabilities in synthesizing real individuals into generated videos, reflecting the growing demand for identity-aware content creation. Nevertheless, an openly accessible framework enabling fine-grained control over facial appearance and voice timbre across multiple identities remains unavailable. In this work, we present a unified and scalable framework for identity-aware joint audio-video generation, enabling high-fidelity and consistent personalization. Specifically, we introduce a data curation pipeline that automatically extracts identity-bearing information with paired annotations across audio and visual modalities, covering diverse scenarios from single-subject to multi-subject interactions. We further propose a flexible and scalable identity injection mechanism for single- and multi-subject scenarios, in which both facial appearance and vocal timbre act as identity-bearing control signals. Moreover, in light of modality disparity, we design a multi-stage training strategy to accelerate convergence and enforce cross-modal coherence. Experiments demonstrate the superiority of the proposed framework. For more details and qualitative results, please refer to our webpage, Identity-as-Presence: https://chen-yingjie.github.io/projects/Identity-as-Presence.

[184] A Creative Agent is Worth a 64-Token Template

Ruixiao Shi,Fu Feng,Yucheng Xie,Xu Yang,Jing Wang,Xin Geng

Main category: cs.CV

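TL;DR: This paper proposes the CAT framework, whose Creative Tokenizer maps the embeddings of fuzzy prompts to a reusable creative token template that is injected directly into T2I models to make generation more creative without repeated reasoning or prompt augmentation, improving both efficiency and quality on architecture, furniture, and nature mixture design tasks.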

Details Motivation: Text-to-image models are constrained by discrete natural language prompts and struggle to infer the intent behind fuzzy creative prompts (such as "a creative vinyl record-inspired skyscraper"); current reasoning- or agent-based methods improve creativity but are costly in compute and money and are not reusable. Method: Proposes CAT (Creative Agent Tokenization), built around a Creative Tokenizer trained via creative semantic disentanglement, which uses relations among partially overlapping concept pairs to capture the agent's latent creative representations; the tokenizer takes fuzzy prompt embeddings and generates a reusable creative token template that is concatenated directly with the original prompt embeddings. Result: On architecture, furniture, and nature mixture design tasks, CAT achieves a 3.7× speedup and a 4.8× reduction in computational cost, while producing images with better human preference and text-image alignment than state-of-the-art T2I models and creative generation methods. Conclusion: CAT offers a scalable, efficient, and reusable paradigm for creative enhancement, moving creative generation away from manual prompt design toward automated injection of creative semantics. Abstract: Text-to-image (T2I) models have substantially improved image fidelity and prompt adherence, yet their creativity remains constrained by reliance on discrete natural language prompts. When presented with fuzzy prompts such as "a creative vinyl record-inspired skyscraper", these models often fail to infer the underlying creative intent, leaving creative ideation and prompt design largely to human users. Recent reasoning- or agent-driven approaches iteratively augment prompts but incur high computational and monetary costs, as their instance-specific generation makes "creativity" costly and non-reusable, requiring repeated queries or reasoning for subsequent generations. To address this, we introduce CAT, a framework for Creative Agent Tokenization that encapsulates agents' intrinsic understanding of "creativity" through a Creative Tokenizer. Given the embeddings of fuzzy prompts, the tokenizer generates a reusable token template that can be directly concatenated with them to inject creative semantics into T2I models without repeated reasoning or prompt augmentation. To enable this, the tokenizer is trained via creative semantic disentanglement, leveraging relations among partially overlapping concept pairs to capture the agent's latent creative representations. 
Extensive experiments on Architecture Design, Furniture Design, and Nature Mixture tasks demonstrate that CAT provides a scalable and effective paradigm for enhancing creativity in T2I generation, achieving a 3.7× speedup and a 4.8× reduction in computational cost, while producing images with superior human preference and text-image alignment compared to state-of-the-art T2I models and creative generation methods.

[185] SpiderCam: Low-Power Snapshot Depth from Differential Defocus

Marcos A. Ferreira,Tianao Li,John Mamish,Josiah Hester,Yaman Sangar,Qi Guo,Emma Alexander

Main category: cs.CV

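TL;DR: SpiderCam is an FPGA-based snapshot depth-from-defocus camera that produces 480×400 sparse depth maps in real time (32.5 FPS) over a 52 cm working range with a total power draw of only 624 mW, which the authors report as the first sub-watt total power measurement for a passive FPGA-based 3D camera in the literature.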

Details Motivation: Low-power, real-time, compact 3D sensing on resource-constrained embedded platforms requires overcoming noisy low-power sensors and very small on-chip memory. Method: Builds a custom camera that simultaneously captures two differently focused images, and implements an improved depth from differential defocus (DfDD) algorithm in SystemVerilog on a low-power FPGA; proposes algorithmic changes for low-power sensors and a memory-local streaming architecture that avoids storing a full image pair. Result: 480×400 sparse depth maps at 32.5 FPS over a 52 cm working range with 624 mW total power, the first measured sub-watt (<1 W) total power for a passive FPGA-based 3D camera in the literature. Conclusion: SpiderCam shows that co-designing algorithm and hardware can deliver real-time depth sensing under very tight FPGA resource and power limits, giving a practical path to low-power 3D vision at the edge. Abstract: We introduce SpiderCam, an FPGA-based snapshot depth-from-defocus camera which produces 480x400 sparse depth maps in real time at 32.5 FPS over a working range of 52 cm while consuming 624 mW of power in total. SpiderCam comprises a custom camera that simultaneously captures two differently focused images of the same scene, processed with a SystemVerilog implementation of depth from differential defocus (DfDD) on a low-power FPGA. To achieve state-of-the-art power consumption, we present algorithmic improvements to DfDD that overcome challenges caused by low-power sensors, and design a memory-local implementation for streaming depth computation on a device that is too small to store even a single image pair. We report the first sub-Watt total power measurement for passive FPGA-based 3D cameras in the literature.

[186] Noise-Aware Misclassification Attack Detection in Collaborative DNN Inference

Shima Yousefi,Saptarshi Debroy

Main category: cs.CV

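TL;DR: This paper proposes a semi-gray-box, noise-aware anomaly detection framework based on a variational autoencoder (VAE) to detect stealthy misclassifications caused by malicious data injection in edge collaborative inference, reaching up to 90% AUROC under realistic noisy conditions.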

Details Motivation: Edge collaborative inference is vulnerable to malicious data injection, which causes stealthy misclassifications that are hard to detect, especially when environmental noise is present. Method: Proposes a semi-gray-box, noise-aware anomaly detection framework that uses a variational autoencoder (VAE) to capture deviations caused by adversarial manipulation and adds a robust noise-aware feature to separate adversarial perturbations from environmental noise. Result: Evaluated on popular object classification DNNs, detection is robust under realistic noisy conditions, with up to 90% AUROC; the evaluation also shows that feature similarity and elevated noise levels limit performance. Conclusion: The VAE-driven, noise-aware framework improves detection of adversarial attacks in edge collaborative inference, but high noise and feature similarity remain open challenges. Abstract: Collaborative inference of object classification deep neural networks (DNNs), where resource-constrained end-devices offload partially processed data to remote edge servers to complete end-to-end processing, is becoming a key enabler of edge-AI. However, such edge-offloading is vulnerable to malicious data injections leading to stealthy misclassifications that are tricky to detect, especially in the presence of environmental noise. In this paper, we propose a semi-gray-box and noise-aware anomaly detection framework fueled by a variational autoencoder (VAE) to capture deviations caused by adversarial manipulation. The proposed framework incorporates a robust noise-aware feature that captures the characteristic behavior of environmental noise to improve detection accuracy while reducing false alarm rates. Our evaluation with popular object classification DNNs demonstrates the robustness of the proposed detection (up to 90% AUROC across DNN configurations) under realistic noisy conditions while revealing limitations caused by feature similarity and elevated noise levels.

[187] SegFly: A 2D-3D-2D Paradigm for Aerial RGB-Thermal Semantic Segmentation at Scale

Markus Gross,Sai Bharadhwaj Matha,Rui Song,Viswanathan Muthuveerappan,Conrad Christoph,Julius Huber,Daniel Cremers

Main category: cs.CV

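TL;DR: This paper proposes a geometry-driven 2D-3D-2D paradigm for efficiently generating RGB and thermal (RGB-T) semantic segmentation labels for UAV imagery without large-scale manual annotation, and builds the large-scale multimodal dataset SegFly.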

Details Motivation: Existing UAV semantic segmentation datasets are small, lack diversity, and are expensive to annotate, and RGB-T alignment is difficult, which limits progress in multimodal aerial scene understanding. Method: Exploits the multi-view redundancy of high-overlap aerial imagery: a small set of manually annotated RGB images is lifted into a semantic 3D point cloud and reprojected into all views, producing pseudo labels for both RGB and thermal images with pixel-level cross-modal alignment; RGB-T registration uses 3D geometry as the intermediate space and needs no hardware synchronization. Result: With fewer than 3% of RGB images annotated, 97% of RGB labels and 100% of thermal labels are generated automatically, with 91% and 88% annotation accuracy; RGB-T registration accuracy reaches 87%; the resulting SegFly benchmark contains over 20,000 RGB images and more than 15,000 RGB-T pairs across varied environments and seasons. Conclusion: The geometry-driven 2D-3D-2D paradigm improves annotation efficiency and alignment accuracy for multimodal UAV imagery; SegFly and the Firefly baseline show benefits for both conventional architectures and vision foundation models, opening a path to scalable multimodal scene understanding. Abstract: Semantic segmentation for uncrewed aerial vehicles (UAVs) is fundamental for aerial scene understanding, yet existing RGB and RGB-T datasets remain limited in scale, diversity, and annotation efficiency due to the high cost of manual labeling and the difficulties of accurate RGB-T alignment on off-the-shelf UAVs. To address these challenges, we propose a scalable geometry-driven 2D-3D-2D paradigm that leverages multi-view redundancy in high-overlap aerial imagery to automatically propagate labels from a small subset of manually annotated RGB images to both RGB and thermal modalities within a unified framework. By lifting less than 3% of RGB images into a semantic 3D point cloud and reprojecting it into all views, our approach enables dense pseudo ground-truth generation across large image collections, automatically producing 97% of RGB labels and 100% of thermal labels while achieving 91% and 88% annotation accuracy without any 2D manual refinement. We further extend this 2D-3D-2D paradigm to cross-modal image registration, using 3D geometry as an intermediate alignment space to obtain fully automatic, strong pixel-level RGB-T alignment with 87% registration accuracy and no hardware-level synchronization. Applying our framework to existing geo-referenced aerial imagery, we construct SegFly, a large-scale benchmark with over 20,000 high-resolution RGB images and more than 15,000 geometrically aligned RGB-T pairs spanning diverse urban, industrial, and rural environments across multiple altitudes and seasons. 
On SegFly, we establish the Firefly baseline for RGB and thermal semantic segmentation and show that both conventional architectures and vision foundation models benefit substantially from SegFly supervision, highlighting the potential of geometry-driven 2D-3D-2D pipelines for scalable multi-modal scene understanding. Data and Code available at https://github.com/markus-42/SegFly.
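The "3D back to 2D" half of the 2D-3D-2D paradigm amounts to projecting labeled 3D points into each camera view. The sketch below assumes a plain pinhole model with known intrinsics `K` and pose `(R, t)` and keeps the nearest point per pixel; the function name is illustrative, and the released SegFly code may handle occlusion, point density, and lens distortion quite differently.

```python
import numpy as np

def project_labels(points_world, labels, K, R, t, height, width, ignore=255):
    """Render a sparse label map by projecting labeled 3D points into one view.

    Pinhole model (assumed): x_cam = R @ x_world + t, pixel = (K @ x_cam) / z.
    Pixels that receive no point keep the `ignore` value.
    """
    cam = points_world @ R.T + t
    z = cam[:, 2]
    keep = z > 1e-6                          # drop points behind the camera
    cam, z, lab = cam[keep], z[keep], labels[keep]
    uvw = cam @ K.T
    u = np.round(uvw[:, 0] / z).astype(int)
    v = np.round(uvw[:, 1] / z).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z, lab = u[inside], v[inside], z[inside], lab[inside]
    order = np.argsort(z)                    # nearest points first
    flat = (v * width + u)[order]
    lab = lab[order]
    # The first occurrence of each pixel index is its nearest point.
    _, first = np.unique(flat, return_index=True)
    label_map = np.full(height * width, ignore, dtype=np.uint8)
    label_map[flat[first]] = lab[first]
    return label_map.reshape(height, width)
```

Because the same 3D points can be projected with the thermal camera's intrinsics and pose, one labeled point cloud can supply labels for both modalities, which is the idea behind using 3D geometry as the shared alignment space.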

Javier Venema,Stefano De Luca,Pablo Mesejo,Óscar Ibáñez

Main category: cs.CV

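TL;DR: This paper proposes an interpretable, multi-stage method for legal age estimation from clavicle CT scans that combines automatic detection, gradient-guided slice selection, and conformal prediction, achieving state-of-the-art performance with an MAE of 1.55 years and supporting uncertainty quantification.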

Details Motivation: Legal age estimation is critical in forensic and medico-legal settings and must be accurate, robust, reproducible, and accompanied by explicit uncertainty quantification; clavicle CT is documented to be effective for age estimation, yet prior AI work has focused on hand radiographs or dental imaging. Method: Proposes an interpretable multi-stage pipeline: (i) a feature-based connected-component method for automatic clavicle detection with minimal manual annotation; (ii) an Integrated Gradients-guided slice selection strategy that builds the input for a multi-slice CNN; (iii) conformal prediction intervals that meet established international forensic protocols. Result: On the New Mexico Decedent Image Database (1,158 full-body post-mortem CT scans), the model reaches an MAE of 1.55 ± 0.16 years on a held-out test set, better than human experts (about 1.90 years) and previous methods (MAE above 1.75 years); conformal prediction supports configurable coverage levels; attribution maps show that the model focuses on the medial clavicular epiphysis. Conclusion: The method is accurate, interpretable, and uncertainty-aware, and is being added to the Skeleton-ID software as a decision-support component within multi-factorial forensic workflows. Abstract: Legal age estimation plays a critical role in forensic and medico-legal contexts, where decisions must be supported by accurate, robust, and reproducible methods with explicit uncertainty quantification. While prior artificial intelligence (AI)-based approaches have primarily focused on hand radiographs or dental imaging, clavicle computed tomography (CT) scans remain underexplored despite their documented effectiveness for legal age estimation. In this work, we present an interpretable, multi-stage pipeline for legal age estimation from clavicle CT scans. The proposed framework combines (i) a feature-based connected-component method for automatic clavicle detection that requires minimal manual annotation, (ii) an Integrated Gradients-guided slice selection strategy used to construct the input data for a multi-slice convolutional neural network that estimates legal age, and (iii) conformal prediction intervals to support uncertainty-aware decisions in accordance with established international protocols. The pipeline is evaluated on 1,158 full-body post-mortem CT scans from a public forensic dataset (the New Mexico Decedent Image Database). The final model achieves state-of-the-art performance with a mean absolute error (MAE) of 1.55 ± 0.16 years on a held-out test set, outperforming both human experts (MAE of approximately 1.90 years) and previous methods (MAEs above 1.75 years on the same dataset). 
Furthermore, conformal prediction enables configurable coverage levels aligned with forensic requirements. Attribution maps indicate that the model focuses on anatomically relevant regions of the medial clavicular epiphysis. The proposed method, which is currently being added as part of the Skeleton-ID software (https://skeleton-id.com/skeleton-id/), is intended as a decision-support component within multi-factorial forensic workflows.
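To show how conformal prediction yields intervals with a configurable coverage level, here is a minimal split conformal sketch for a regression output such as estimated age. It assumes absolute residuals as the nonconformity score and a separate calibration set; the abstract does not specify which conformal variant the authors use, so this is an illustration rather than their procedure.

```python
import numpy as np

def conformal_interval(cal_true, cal_pred, test_pred, alpha=0.1):
    """Split conformal intervals with target coverage 1 - alpha.

    Nonconformity score (assumed): absolute error on a calibration set that
    was not used to train the model.
    """
    scores = np.sort(np.abs(np.asarray(cal_true, dtype=float)
                            - np.asarray(cal_pred, dtype=float)))
    n = scores.size
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = int(np.ceil((n + 1) * (1.0 - alpha)))
    if rank > n:
        raise ValueError("too few calibration samples for this alpha")
    q = scores[rank - 1]
    test_pred = np.asarray(test_pred, dtype=float)
    return test_pred - q, test_pred + q
```

Lowering `alpha` widens every interval by the same amount, which is one way to match a required coverage level; for an age estimate the interval reads as "predicted age ± q years".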

Jingchun Yang,Jinchang Zhang

Main category: cs.CV

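TL;DR: This paper proposes C-TRAIL, a multimodal legal dataset, together with a two-stage framework that aligns dashcam videos with responsibility modes under Chinese traffic regulations, enabling interpretable determination of accident responsibility.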

Details Motivation: There is a clear gap between dashcam video understanding and legal reasoning: video perception research is not connected to the law, while legal LLMs rarely incorporate video evidence. Method: Builds C-TRAIL, a Chinese multimodal legal dataset that aligns videos and textual descriptions with responsibility modes and their corresponding statutes; designs a two-stage framework in which an accident understanding module first generates textual video descriptions, and a legal multi-agent framework then outputs responsibility modes, applicable statutes, and complete judgment reports. Result: On C-TRAIL and MM-AU, the method outperforms general and legal LLMs as well as existing agent-based approaches, while providing a transparent, interpretable legal reasoning process. Conclusion: The work bridges traffic video understanding and legal reasoning and supports the feasibility of multimodal legal intelligence for traffic accident responsibility determination. Abstract: The widespread adoption of dashcams has made video evidence in traffic accidents increasingly abundant, yet transforming "what happened in the video" into "who is responsible under which legal provisions" still relies heavily on human experts. Existing ego-view traffic accident studies mainly focus on perception and semantic understanding, while LLM-based legal methods are mostly built on textual case descriptions and rarely incorporate video evidence, leaving a clear gap between the two. We first propose C-TRAIL, a multimodal legal dataset that, under the Chinese traffic regulation system, explicitly aligns dashcam videos and textual descriptions with a closed set of responsibility modes and their corresponding Chinese traffic statutes. On this basis, we introduce a two-stage framework: (1) a traffic accident understanding module that generates textual video descriptions; and (2) a legal multi-agent framework that outputs responsibility modes, statute sets, and complete judgment reports. Experimental results on C-TRAIL and MM-AU show that our method outperforms general and legal LLMs, as well as existing agent-based approaches, while providing a transparent and interpretable legal reasoning process.

[190] TransText: Transparency Aware Image-to-Video Typography Animation

Fei Zhang,Zijian Zhou,Bohao Tang,Sen He,Hang Li,Zhe Wang,Soubhik Sanyal,Pengfei Liu,Viktar Atliha,Tao Xiang,Frost Xu,Semih Gunel

Main category: cs.CV

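TL;DR: This paper proposes TransText, the first framework (to the authors' knowledge) to adapt image-to-video models to layer-aware text (glyph) animation; its Alpha-as-RGB paradigm avoids retraining the VAE and produces high-quality transparent animations while preserving pre-trained semantic priors.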

Details Motivation: Existing methods treat transparency (the alpha channel) as an extra latent dimension outside RGB space, which requires rebuilding the RGB-centric VAE; high-quality transparent glyph data is scarce, and retraining is expensive and can erode existing semantic priors and cause latent pattern mixing. Method: Proposes TransText, based on an Alpha-as-RGB paradigm that embeds the alpha channel as an RGB-compatible visual signal through latent spatial concatenation, jointly modeling appearance and transparency without modifying the pre-trained generative manifold, while explicitly enforcing RGB-Alpha consistency and preventing feature entanglement. Result: Experiments show that TransText significantly outperforms baselines and generates coherent, high-fidelity transparent animations with diverse, fine-grained effects. Conclusion: TransText gives image-to-video models layer-aware glyph animation capability and resolves the tension between modeling transparency and preserving pre-trained priors, making dynamic visual design more practical. Abstract: We introduce the first method, to the best of our knowledge, for adapting image-to-video models to layer-aware text (glyph) animation, a capability critical for practical dynamic visual design. Existing approaches predominantly handle the transparency encoding (alpha channel) as an extra latent dimension appended to the RGB space, necessitating the reconstruction of the underlying RGB-centric variational autoencoder (VAE). However, given the scarcity of high-quality transparent glyph data, retraining the VAE is computationally expensive and may erode the robust semantic priors learned from massive RGB corpora, potentially leading to latent pattern mixing. To mitigate these limitations, we propose TransText, a framework based on a novel Alpha-as-RGB paradigm to jointly model appearance and transparency without modifying the pre-trained generative manifold. TransText embeds the alpha channel as an RGB-compatible visual signal through latent spatial concatenation, explicitly ensuring strict cross-modal (RGB-and-Alpha) consistency while preventing feature entanglement. Our experiments demonstrate that TransText significantly outperforms baselines, generating coherent, high-fidelity transparent animations with diverse, fine-grained effects.

[191] VideoAtlas: Navigating Long-Form Video in Logarithmic Compute

Mohamed Eltahir,Ali Habibullah,Yazan Alshoibi,Lama Ayash,Tanveer Hussain,Naeemullah Khan

Main category: cs.CV

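TL;DR: This paper proposes VideoAtlas, a lossless, navigable, and scalable hierarchical grid representation of video, and combines it with Recursive Language Models (RLMs) to build the Video-RLM architecture for efficient, robust long-video understanding.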

Details Motivation: Video language models face two challenges: video representations rely on lossy approximations, and caption- or agent-based pipelines for long video lose visual fidelity. Method: Proposes VideoAtlas, a task-agnostic environment that represents video as a hierarchical grid supporting lossless, recursive zooming with one uniform visual representation; formulates it as a Markov Decision Process and designs the parallel Master-Worker Video-RLM architecture, in which a Master coordinates global exploration while Workers drill into assigned regions concurrently. Result: (1) Compute grows logarithmically with video duration, with a 30-60% multimodal cache hit rate from the grid's structural reuse; (2) the maximum exploration depth acts as a controllable compute-accuracy hyperparameter; (3) the model adaptively allocates compute to match question granularity; Video-RLM remains the most duration-robust method when scaling from 1-hour to 10-hour benchmarks. Conclusion: Structured environment navigation is a viable and scalable paradigm for video understanding; VideoAtlas and Video-RLM together address the representation and long-context bottlenecks. Abstract: Extending language models to video introduces two challenges: representation, where existing methods rely on lossy approximations, and long-context, where caption- or agent-based pipelines collapse video into text and lose visual fidelity. To overcome this, we introduce VideoAtlas, a task-agnostic environment to represent video as a hierarchical grid that is simultaneously lossless, navigable, scalable, caption- and preprocessing-free. An overview of the video is available at a glance, and any region can be recursively zoomed into, with the same visual representation used uniformly for the video, intermediate investigations, and the agent's memory, eliminating lossy text conversion end-to-end. This hierarchical structure ensures access depth grows only logarithmically with video length. For long-context, Recursive Language Models (RLMs) recently offered a powerful solution for long text, but extending them to the visual domain requires a structured environment to recurse into, which VideoAtlas provides. Formulating VideoAtlas as a Markov Decision Process unlocks Video-RLM: a parallel Master-Worker architecture where a Master coordinates global exploration while Workers concurrently drill into assigned regions to accumulate lossless visual evidence. We demonstrate three key findings: (1) logarithmic compute growth with video duration, further amplified by a 30-60% multimodal cache hit rate arising from the grid's structural reuse; 
(2) environment budgeting, where bounding the maximum exploration depth provides a principled compute-accuracy hyperparameter; and (3) emergent adaptive compute allocation that scales with question granularity. When scaling from 1-hour to 10-hour benchmarks, Video-RLM remains the most duration-robust method with minimal accuracy degradation, demonstrating that structured environment navigation is a viable and scalable paradigm for video understanding.
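The claim that access depth grows only logarithmically with video length can be illustrated with a toy hierarchical grid. In this sketch each zoom level splits the current frame range into `grid_cells` equal parts; the cell count, frame rates, and function names are assumptions for illustration, not VideoAtlas parameters.

```python
def access_depth(num_frames, grid_cells):
    """Number of zoom levels needed until one grid cell maps to one frame."""
    depth, covered = 0, 1
    while covered < num_frames:
        covered *= grid_cells
        depth += 1
    return depth

def zoom_path(frame_index, num_frames, grid_cells):
    """Cell index chosen at each zoom level to reach a given frame."""
    lo, hi, path = 0, num_frames, []
    while hi - lo > 1:
        span = -(-(hi - lo) // grid_cells)   # integer ceiling division
        cell = (frame_index - lo) // span
        path.append(cell)
        lo = lo + cell * span
        hi = min(lo + span, hi)
    return path
```

For example, with 64 cells per grid and one frame per second, a 1-hour video (3,600 frames) needs 2 zoom levels and a 10-hour video (36,000 frames) needs 3, so a tenfold increase in length adds a single level. Capping the path length is the kind of depth budget that finding (2) describes.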

[192] LaDe: Unified Multi-Layered Graphic Media Generation and Decomposition

Vlad-Constantin Lungu-Stan,Ionut Mironica,Mariana-Iuliana Georgescu

Main category: cs.CV

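TL;DR: LaDe is a latent diffusion framework for layered media design that generates a variable number of semantically meaningful RGBA layers from natural language prompts; it supports text-to-image generation, text-to-layers generation, and design decomposition, and outperforms Qwen-Image-Layered on text-to-layer alignment.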

Details Motivation: Existing methods either fix the number of layers or require each layer to be a spatially continuous region, so the layer count grows linearly with design complexity and flexible, semantically rich, editable layered designs are hard to generate. Method: Proposes LaDe with three components: (1) an LLM-based prompt expander that turns a short user intent into structured per-layer descriptions; (2) a Latent Diffusion Transformer with 4D RoPE positional encoding that jointly generates the full media design and its RGBA layers; (3) an RGBA VAE that decodes each layer with alpha-channel support; conditioning on layer samples during training lets one framework cover all three tasks. Result: On the Crello test set, LaDe is compared with Qwen-Image-Layered on text-to-layers and image-to-layers tasks; it outperforms Qwen-Image-Layered in text-to-layers generation by improving text-to-layer alignment, as judged by two VLM evaluators (GPT-4o mini and Qwen3-VL). Conclusion: By combining structured prompt expansion, diffusion modeling with 4D RoPE, and joint RGBA decoding, LaDe enables flexible, semantically controllable generation of layered media designs and improves the quality of text-guided layer generation. Abstract: Media design layer generation enables the creation of fully editable, layered design documents such as posters, flyers, and logos using only natural language prompts. Existing methods either restrict outputs to a fixed number of layers or require each layer to contain only spatially continuous regions, causing the layer count to scale linearly with design complexity. We propose LaDe (Layered Media Design), a latent diffusion framework that generates a flexible number of semantically meaningful layers. LaDe combines three components: an LLM-based prompt expander that transforms a short user intent into structured per-layer descriptions that guide the generation, a Latent Diffusion Transformer with a 4D RoPE positional encoding mechanism that jointly generates the full media design and its constituent RGBA layers, and an RGBA VAE that decodes each layer with full alpha-channel support. By conditioning on layer samples during training, our unified framework supports three tasks: text-to-image generation, text-to-layers media design generation, and media design decomposition. We compare LaDe to Qwen-Image-Layered on text-to-layers and image-to-layers tasks on the Crello test set. LaDe outperforms Qwen-Image-Layered in text-to-layers generation by improving text-to-layer alignment, as validated by two VLM-as-a-judge evaluators (GPT-4o mini and Qwen3-VL).

[193] Robust-ComBat: Mitigating Outlier Effects in Diffusion MRI Data Harmonization

Yoan David,Pierre-Marc Jodoin,Alzheimer's Disease Neuroimaging Initiative,The TRACK-TBI Investigators

Main category: cs.CV

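TL;DR: This paper proposes Robust-ComBat, which uses an MLP to compensate for pathological outliers, making cross-site dMRI harmonization more robust on cohorts that contain many patients with neurological disorders and outperforming conventional ComBat and its variants.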

Details Motivation: Conventional ComBat assumes that subject data follow a Gaussian distribution, but in clinical practice the diffusion metrics of patients with neurological disorders are often pathological outliers that distort site-effect estimation, and these still-undiagnosed patients are hard to exclude in advance. Method: The authors evaluate combinations of 10 outlier rejection methods with 4 ComBat variants and find that many strategies fail when pathology is present; they instead use a simple multilayer perceptron (MLP) for outlier compensation, which forms the Robust-ComBat framework. Result: On real multi-site cohorts covering 7 neurological conditions, with up to 80% of subjects being patients, Robust-ComBat achieves lower harmonization error than conventional statistical baselines across all ComBat variants. Conclusion: MLP-based outlier compensation improves the robustness of cross-site dMRI harmonization, especially in realistic clinical settings with many undiagnosed pathological cases. Abstract: Harmonization methods such as ComBat and its variants are widely used to mitigate diffusion MRI (dMRI) site-specific biases. However, ComBat assumes that subject distributions exhibit a Gaussian profile. In practice, patients with neurological disorders often present diffusion metrics that deviate markedly from those of healthy controls, introducing pathological outliers that distort site-effect estimation. This problem is particularly challenging in clinical practice as most patients undergoing brain imaging have an underlying and yet undiagnosed condition, making it difficult to exclude them from harmonization cohorts, as their scans were precisely prescribed to establish a diagnosis. In this paper, we show that harmonizing data to a normative reference population with ComBat while including pathological cases induces significant distortions. Across 7 neurological conditions, we evaluated 10 outlier rejection methods with 4 ComBat variants over a wide range of scenarios, revealing that many filtering strategies fail in the presence of pathology. In contrast, a simple MLP provides robust outlier compensation enabling reliable harmonization while preserving disease-related signal. Experiments on both control and real multi-site cohorts, comprising up to 80% of subjects with neurological disorders, demonstrate that Robust-ComBat consistently outperforms conventional statistical baselines with lower harmonization error across all ComBat variants.
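
For readers unfamiliar with ComBat, the Gaussian assumption mentioned above can be made concrete with the location-scale model commonly used in the ComBat literature. The notation below is the conventional one and is an assumption here; the paper's own variants and its MLP-based compensation may differ.

```latex
% Commonly cited ComBat model for site i, subject j, feature v
% (conventional notation, not taken from this paper).
y_{ijv} = \alpha_v + X_{ij}^{\top}\beta_v + \gamma_{iv} + \delta_{iv}\,\varepsilon_{ijv},
\qquad \varepsilon_{ijv} \sim \mathcal{N}(0, \sigma_v^{2})

% Harmonized value after estimating the additive (\gamma^{*}_{iv}) and
% multiplicative (\delta^{*}_{iv}) site effects.
y_{ijv}^{\mathrm{ComBat}} =
\frac{y_{ijv} - \hat{\alpha}_v - X_{ij}^{\top}\hat{\beta}_v - \gamma^{*}_{iv}}{\delta^{*}_{iv}}
+ \hat{\alpha}_v + X_{ij}^{\top}\hat{\beta}_v
```

Under this model, pathological outliers at one site shift the estimates of $\gamma^{*}_{iv}$ and $\delta^{*}_{iv}$, which is the distortion the abstract refers to.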

[194] AHOY! Animatable Humans under Occlusion from YouTube Videos with Gaussian Splatting and Video Diffusion Priors

Aymen Mir,Riza Alp Guler,Xiangjun Tang,Peter Wonka,Gerard Pons-Moll

Main category: cs.CV

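TL;DR: AHOY reconstructs complete, animatable 3D Gaussian avatars from in-the-wild monocular video, even under heavy occlusion. It addresses missing observations and multi-view inconsistency with diffusion-generated supervision, a two-stage architecture, map-pose/LBS-pose decoupling, and a head/body split supervision strategy.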

Details Motivation: Existing methods assume unoccluded input (a fully visible subject, often in a canonical pose) and cannot handle real-world videos in which people are frequently occluded by furniture, objects, or other people, which limits their use in real scenes. Method: Four techniques are proposed: (i) hallucination-as-supervision, which uses identity-finetuned diffusion models to generate dense supervision for unobserved regions; (ii) a two-stage canonical-to-pose-dependent architecture that bootstraps from sparse observations to full pose-dependent Gaussian maps; (iii) decoupling of the map pose from the linear blend skinning (LBS) pose to absorb multi-view inconsistencies in the generated data; and (iv) a head/body split supervision strategy that preserves facial identity. Result: On YouTube videos and multi-view capture data with significant occlusion, the method achieves state-of-the-art reconstruction quality; the resulting Gaussian avatars can be robustly animated with novel poses and composited into 3DGS scenes captured with cell-phone video. Conclusion: AHOY overcomes the reconstruction challenges caused by heavy occlusion in monocular video and yields high-quality, animatable, compositable 3D Gaussian avatars, extending 3D human modeling to real-world footage. Abstract: We present AHOY, a method for reconstructing complete, animatable 3D Gaussian avatars from in-the-wild monocular video despite heavy occlusion. Existing methods assume unoccluded input (a fully visible subject, often in a canonical pose), excluding the vast majority of real-world footage where people are routinely occluded by furniture, objects, or other people. Reconstructing from such footage poses fundamental challenges: large body regions may never be observed, and multi-view supervision per pose is unavailable. We address these challenges with four contributions: (i) a hallucination-as-supervision pipeline that uses identity-finetuned diffusion models to generate dense supervision for previously unobserved body regions; (ii) a two-stage canonical-to-pose-dependent architecture that bootstraps from sparse observations to full pose-dependent Gaussian maps; (iii) a map-pose/LBS-pose decoupling that absorbs multi-view inconsistencies from the generated data; (iv) a head/body split supervision strategy that preserves facial identity. We evaluate on YouTube videos and on multi-view capture data with significant occlusion and demonstrate state-of-the-art reconstruction quality. We also demonstrate that the resulting avatars are robust enough to be animated with novel poses and composited into 3DGS scenes captured using cell-phone video. Our project page is available at https://miraymen.github.io/ahoy/

[195] AdaRadar: Rate Adaptive Spectral Compression for Radar-based Perception

Jinho Park,Se Young Chun,Mingoo Seok

Main category: cs.CV

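TL;DR: This paper proposes an online adaptive compression scheme for radar data in autonomous driving. It adjusts the compression rate dynamically using a proxy gradient of detection confidence, selectively prunes discrete cosine transform coefficients, and applies scaled quantization, achieving over 100x feature size reduction with a performance drop of only about 1 percentage point.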

Details Motivation: Radar data are high-dimensional and voluminous and easily saturate low-bandwidth communication links; existing image-domain compression methods use fixed compression ratios and cannot adapt to varying or adversarial conditions, and a general-purpose codec for radar data is missing. Method: A radar data compression method with adaptive feedback is proposed: (1) a zeroth-order gradient approximation estimates the proxy gradient of detection confidence with respect to the compression rate, which is used to adjust the compression ratio dynamically; (2) the discrete cosine transform (DCT) is applied to radar data cubes and coefficients are selectively pruned; (3) scaled quantization preserves the dynamic range of each radar patch. Result: Validated on the RADIal, CARRADA, and Radatron datasets, the scheme achieves over 100x feature size reduction with a detection performance drop of only about 1 percentage point (~1%p). Conclusion: The proposed online adaptive compression scheme eases the mismatch between high-dimensional radar data and low-bandwidth links while balancing compression rate and perception performance, making it suitable for real in-vehicle deployment. Abstract: Radar is a critical perception modality in autonomous driving systems due to its all-weather characteristics and ability to measure range and Doppler velocity. However, the sheer volume of high-dimensional raw radar data saturates the communication link to the computing engine (e.g., an NPU), which is often a low-bandwidth interface with data rate provisioned only for a few low-resolution range-Doppler frames. A generalized codec for utilizing high-dimensional radar data is notably absent, while existing image-domain approaches are unsuitable, as they typically operate at fixed compression ratios and fail to adapt to varying or adversarial conditions. In light of this, we propose radar data compression with adaptive feedback. It dynamically adjusts the compression ratio by performing gradient descent from the proxy gradient of detection confidence with respect to the compression rate. We employ a zeroth-order gradient approximation as it enables gradient computation even with non-differentiable core operations (pruning and quantization). This also avoids transmitting the gradient tensors over the band-limited link, which, if estimated, would be as large as the original radar data. In addition, we have found that radar feature maps are heavily concentrated on a few frequency components. Thus, we apply the discrete cosine transform to the radar data cubes and selectively prune out the coefficients effectively. We preserve the dynamic range of each radar patch through scaled quantization. Combining those techniques, our proposed online adaptive compression scheme achieves over 100x feature size reduction at minimal performance drop (~1%p). We validate our results on the RADIal, CARRADA, and Radatron datasets.
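
The three ingredients named in the abstract (DCT, coefficient pruning, scaled quantization) and the zeroth-order rate update can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' codec: the magnitude-based pruning rule, the per-patch max-abs scale, the bit width, and the hypothetical `loss_fn` (standing in for an objective derived from detection confidence) are all choices made for this example.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)
    return m

def compress_patch(patch, keep_ratio, bits=8):
    """2D DCT, keep the largest-magnitude coefficients, quantize with a per-patch scale."""
    d_r, d_c = dct_matrix(patch.shape[0]), dct_matrix(patch.shape[1])
    coeffs = d_r @ patch @ d_c.T
    n_keep = max(1, int(round(keep_ratio * coeffs.size)))
    thresh = np.sort(np.abs(coeffs).ravel())[-n_keep]
    pruned = np.where(np.abs(coeffs) >= thresh, coeffs, 0.0)  # assumed pruning rule
    scale = float(np.max(np.abs(pruned))) or 1.0  # per-patch scale keeps dynamic range
    levels = 2 ** (bits - 1) - 1
    q = np.round(pruned / scale * levels).astype(np.int32)
    return q, scale

def decompress_patch(q, scale, bits=8):
    """Invert the quantization and the orthonormal 2D DCT."""
    levels = 2 ** (bits - 1) - 1
    coeffs = q.astype(np.float64) / levels * scale
    d_r, d_c = dct_matrix(q.shape[0]), dct_matrix(q.shape[1])
    return d_r.T @ coeffs @ d_c

def zeroth_order_step(rate, loss_fn, lr=0.05, eps=0.02):
    """One descent step using a two-point finite-difference gradient estimate.

    loss_fn is a hypothetical black box mapping a rate to a scalar loss, so no
    gradient has to flow through the non-differentiable pruning/quantization.
    """
    grad = (loss_fn(rate + eps) - loss_fn(rate - eps)) / (2.0 * eps)
    return float(np.clip(rate - lr * grad, 0.01, 1.0))
```

Only the scalar loss values are needed for the update, which is consistent with the abstract's point that no gradient tensors have to be sent over the band-limited link.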

[196] Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding

Shuyao Shi,Kang G. Shin

Main category: cs.CV

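TL;DR: This paper proposes Motion-MLLM, a framework that augments multimodal large language models with egomotion data from IMUs. Using motion-visual keyframe filtering and asymmetric cross-modal fusion, it improves spatial reasoning in 3D scenes and outperforms existing methods in accuracy and cost-effectiveness.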

Details Motivation: For 3D spatial reasoning, existing MLLMs either rely on computationally expensive 3D representations (such as point clouds or BEV maps) or lack physical grounding, which makes ambiguities in scale and size hard to resolve. Method: The Motion-MLLM framework has two components: (1) a cascaded motion-visual keyframe filtering module that combines IMU data and visual features to select a sparse, representative set of keyframes; and (2) an asymmetric cross-modal fusion module in which motion tokens act as intermediaries that inject egomotion cues and cross-frame visual context into the visual representation. Result: Performance improves significantly on several 3D scene understanding and spatial reasoning tasks; compared with SOTA methods based on video frames and on explicit 3D data, accuracy is similar or higher, with 1.40× and 1.63× higher cost-effectiveness, respectively. Conclusion: Adding a physically grounded egomotion modality strengthens the spatial reasoning of MLLMs, particularly for absolute scale and spatial relationships across the scene, while remaining efficient and practical. Abstract: Recent Multimodal Large Language Models (MLLMs) have shown high potential for spatial reasoning within 3D scenes. However, they typically rely on computationally expensive 3D representations like point clouds or reconstructed Bird's-Eye View (BEV) maps, or lack physical grounding to resolve ambiguities in scale and size. This paper significantly enhances MLLMs with egomotion modality data, captured by Inertial Measurement Units (IMUs) concurrently with the video. In particular, we propose a novel framework, called Motion-MLLM, introducing two key components: (1) a cascaded motion-visual keyframe filtering module that leverages both IMU data and visual features to efficiently select a sparse yet representative set of keyframes, and (2) an asymmetric cross-modal fusion module where motion tokens serve as intermediaries that channel egomotion cues and cross-frame visual context into the visual representation. By grounding visual content in physical egomotion trajectories, Motion-MLLM can reason about absolute scale and spatial relationships across the scene. Our extensive evaluation shows that Motion-MLLM makes significant improvements in various tasks related to 3D scene understanding and spatial reasoning. Compared to state-of-the-art (SOTA) methods based on video frames and explicit 3D data, Motion-MLLM exhibits similar or even higher accuracy with significantly less overhead (i.e., 1.40$\times$ and 1.63$\times$ higher cost-effectiveness, respectively).

[197] Versatile Editing of Video Content, Actions, and Dynamics without Training

Vladimir Kulikov,Roni Paiss,Andrey Voynov,Inbar Mosseri,Tali Dekel,Tomer Michaeli

Main category: cs.CV

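TL;DR: This paper proposes DynaEdit, a training-free video editing method that uses pretrained text-to-video models to perform diverse edits such as modifying complex actions, inserting interacting objects, and adding global effects.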

Details Motivation: Existing video editing methods struggle to edit actions and dynamic events or to insert content that affects the behavior of other objects: trained models are limited by the difficulty of collecting relevant training data, and training-free methods are restricted to structure- and motion-preserving edits. Method: Building on the inversion-free approach, the authors design new mechanisms that resolve low-frequency misalignment and high-frequency jitter, yielding model-agnostic, general video editing. Result: DynaEdit achieves state-of-the-art results on complex text-based video editing tasks, including modifying actions, inserting interacting objects, and introducing global effects. Conclusion: DynaEdit broadens what training-free video editing can do, improving editing flexibility and realism, and because it does not intervene in model internals it is model-agnostic with respect to the pretrained text-to-video flow model. Abstract: Controlled video generation has seen drastic improvements in recent years. However, editing actions and dynamic events, or inserting contents that should affect the behaviors of other objects in real-world videos, remains a major challenge. Existing trained models struggle with complex edits, likely due to the difficulty of collecting relevant training data. Similarly, existing training-free methods are inherently restricted to structure- and motion-preserving edits and do not support modification of motion or interactions. Here, we introduce DynaEdit, a training-free editing method that unlocks versatile video editing capabilities with pretrained text-to-video flow models. Our method relies on the recently introduced inversion-free approach, which does not intervene in the model internals, and is thus model-agnostic. We show that naively attempting to adapt this approach to general unconstrained editing results in severe low-frequency misalignment and high-frequency jitter. We explain the sources for these phenomena and introduce novel mechanisms for overcoming them. Through extensive experiments, we show that DynaEdit achieves state-of-the-art results on complex text-based video editing tasks, including modifying actions, inserting objects that interact with the scene, and introducing global effects.

[198] GMT: Goal-Conditioned Multimodal Transformer for 6-DOF Object Trajectory Synthesis in 3D Scenes

Huajian Zeng,Abhishek Saroha,Daniel Cremers,Xi Wang

Main category: cs.CV

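TL;DR: This paper proposes GMT, a multimodal Transformer framework for generating controllable 6-DOF object manipulation trajectories in 3D environments. It fuses 3D bounding boxes, point clouds, semantic categories, and target poses, and substantially outperforms existing methods in spatial accuracy and orientation control.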

Details Motivation: Existing methods mostly rely on 2D or partial 3D representations, which cannot fully capture scene geometry and therefore limit trajectory precision, whereas real robot manipulation requires spatial reasoning, physical feasibility, and multimodal scene understanding. Method: GMT is a multimodal Transformer framework that represents trajectories as continuous 6-DOF pose sequences and uses a tailored conditioning strategy to fuse 3D bounding box geometry, point cloud context, semantic categories, and target end poses. Result: On synthetic and real-world benchmarks, GMT outperforms state-of-the-art human motion and human-object interaction baselines such as CHOIS and GIMO, with substantial gains in spatial accuracy and orientation control, and it generalizes well to diverse objects and cluttered 3D environments. Conclusion: GMT sets a new benchmark for learning-based manipulation planning and shows that jointly exploiting multimodal 3D information is effective for precise, controllable trajectory generation. Abstract: Synthesizing controllable 6-DOF object manipulation trajectories in 3D environments is essential for enabling robots to interact with complex scenes, yet remains challenging due to the need for accurate spatial reasoning, physical feasibility, and multimodal scene understanding. Existing approaches often rely on 2D or partial 3D representations, limiting their ability to capture full scene geometry and constraining trajectory precision. We present GMT, a multimodal transformer framework that generates realistic and goal-directed object trajectories by jointly leveraging 3D bounding box geometry, point cloud context, semantic object categories, and target end poses. The model represents trajectories as continuous 6-DOF pose sequences and employs a tailored conditioning strategy that fuses geometric, semantic, contextual, and goal-oriented information. Extensive experiments on synthetic and real-world benchmarks demonstrate that GMT outperforms state-of-the-art human motion and human-object interaction baselines, such as CHOIS and GIMO, achieving substantial gains in spatial accuracy and orientation control. Our method establishes a new benchmark for learning-based manipulation planning and shows strong generalization to diverse objects and cluttered 3D environments. Project page: https://huajian-zeng.github.io/projects/gmt/.

[199] LoST: Level of Semantics Tokenization for 3D Shapes

Niladri Shekhar Dutt,Zifan Shi,Paul Guerrero,Chun-Hao Paul Huang,Duygu Ceylan,Niloy J. Mitra,Xuelin Chen

Main category: cs.CV

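TL;DR: This paper proposes Level-of-Semantics Tokenization (LoST), a semantics-oriented hierarchical tokenization method for autoregressive 3D generation. It orders tokens by semantic salience and introduces a Relational Inter-Distance Alignment (RIDA) loss that aligns the relational structure of the 3D shape latent space with that of a semantic feature space, substantially improving reconstruction quality and generation efficiency.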

Details Motivation: Existing 3D tokenizers based on geometric level-of-detail (LoD) hierarchies are token-inefficient and lack semantic coherence for autoregressive modeling, and they do not model semantic structure effectively. Method: Level-of-Semantics Tokenization (LoST) orders tokens by semantic salience; the Relational Inter-Distance Alignment (RIDA) loss aligns the relative distance structure of the 3D shape latent space with that of the DINO semantic feature space. Result: LoST surpasses previous LoD-based methods by large margins on geometric and semantic reconstruction metrics, achieves efficient, high-quality autoregressive 3D generation, enables downstream tasks such as semantic retrieval, and needs only 0.1%-10% of the tokens used by prior AR models. Conclusion: Semantics-driven tokenization suits autoregressive 3D generation better than geometric-hierarchy tokenization, and LoST offers a more compact, semantically consistent, and scalable representation for 3D generation. Abstract: Tokenization is a fundamental technique in the generative modeling of various modalities. In particular, it plays a critical role in autoregressive (AR) models, which have recently emerged as a compelling option for 3D generation. However, optimal tokenization of 3D shapes remains an open question. State-of-the-art (SOTA) methods primarily rely on geometric level-of-detail (LoD) hierarchies, originally designed for rendering and compression. These spatial hierarchies are often token-inefficient and lack semantic coherence for AR modeling. We propose Level-of-Semantics Tokenization (LoST), which orders tokens by semantic salience, such that early prefixes decode into complete, plausible shapes that possess principal semantics, while subsequent tokens refine instance-specific geometric and semantic details. To train LoST, we introduce Relational Inter-Distance Alignment (RIDA), a novel 3D semantic alignment loss that aligns the relational structure of the 3D shape latent space with that of the semantic DINO feature space. Experiments show that LoST achieves SOTA reconstruction, surpassing previous LoD-based 3D shape tokenizers by large margins on both geometric and semantic reconstruction metrics. Moreover, LoST achieves efficient, high-quality AR 3D generation and enables downstream tasks like semantic retrieval, while using only 0.1%-10% of the tokens needed by prior AR models.

[200] The Unreasonable Effectiveness of Text Embedding Interpolation for Continuous Image Steering

Yigit Ekin,Yossi Gandelsman

Main category: cs.CV

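TL;DR: This paper proposes a training-free framework for continuous, controllable image editing at test time. It controls generation through simple steering in the text-embedding space, uses a large language model to generate contrastive prompt pairs from which a steering vector is built, and adds an elastic range search to obtain smooth, continuous edits.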

Details Motivation: Existing methods depend on additional training or manual intervention and lack a lightweight, general, continuously controllable way to edit at test time. Method: A large language model automatically generates debiased contrastive prompt pairs, from which a semantic steering vector is computed in the text-encoder space; the vector is added to the input text embedding with an adjustable magnitude; an elastic range search automatically finds the effective interval of magnitudes, avoiding both under-steering and over-steering; because only the text representation is modified, the method carries over to text-conditioned modalities such as image and video generation. Result: On an evaluation of continuous editing quality (using a newly proposed metric for the uniformity of semantic change), the method is comparable to training-based approaches and outperforms other training-free methods; it also generalizes across modalities. Conclusion: Simple steering in the text-embedding space, combined with automatic prompt construction and elastic scaling, is enough for high-quality, continuous, controllable, training-free image editing. Abstract: We present a training-free framework for continuous and controllable image editing at test time for text-conditioned generative models. In contrast to prior approaches that rely on additional training or manual user intervention, we find that a simple steering in the text-embedding space is sufficient to produce smooth edit control. Given a target concept (e.g., enhancing photorealism or changing facial expression), we use a large language model to automatically construct a small set of debiased contrastive prompt pairs, from which we compute a steering vector in the generator's text-encoder space. We then add this vector directly to the input prompt representation to control generation along the desired semantic axis. To obtain a continuous control, we propose an elastic range search procedure that automatically identifies an effective interval of steering magnitudes, avoiding both under-steering (no-edit) and over-steering (changing other attributes). Adding the scaled versions of the same vector within this interval yields smooth and continuous edits. Since our method modifies only textual representations, it naturally generalizes across text-conditioned modalities, including image and video generation. To quantify the steering continuity, we introduce a new evaluation metric that measures the uniformity of semantic change across edit strengths. We compare the continuous editing behavior across methods and find that, despite its simplicity and lightweight design, our approach is comparable to training-based alternatives, outperforming other training-free methods.
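
The core steering step described above can be sketched with plain arrays. The abstract does not say how the steering vector is computed from the prompt pairs, so the mean-difference rule below is an assumption, as are the function names `steering_vector` and `steer`; the elastic range search is not reproduced, and the strength interval is simply passed in.

```python
import numpy as np

def steering_vector(pos_embs, neg_embs):
    """Assumed rule: mean embedding difference over contrastive prompt pairs.

    pos_embs[i] and neg_embs[i] are text-encoder embeddings of the i-th pair
    (for example, a prompt with and without the target concept).
    """
    pos = np.asarray(pos_embs, dtype=np.float64)
    neg = np.asarray(neg_embs, dtype=np.float64)
    return np.mean(pos - neg, axis=0)

def steer(prompt_emb, vector, strengths):
    """Return one steered prompt embedding per steering magnitude."""
    prompt_emb = np.asarray(prompt_emb, dtype=np.float64)
    return [prompt_emb + s * vector for s in strengths]

# Usage with toy 2-D embeddings; real embeddings would come from the
# generator's text encoder, and the interval from the range search.
v = steering_vector([[1.0, 0.0], [3.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
steered = steer([1.0, 1.0], v, np.linspace(0.0, 1.0, 3))
```

Each steered embedding would then be fed to the generator in place of the original prompt embedding, giving one output per edit strength.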

[201] EchoGen: Cycle-Consistent Learning for Unified Layout-Image Generation and Understanding

Kai Zou,Hongbo Liu,Dian Zheng,Jianxiong Gao,Zhiwei Zhao,Bin Liu

Main category: cs.CV

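TL;DR: This paper proposes EchoGen, a unified framework for layout-to-image generation and image grounding that jointly optimizes the two tasks through progressive training strategies (PMTP, DJO, Cycle RL), so that each task improves the other.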

Details Motivation: Layout-to-image generation is limited in its understanding of text and layout (for example, spatial relationships), whereas image grounding has strong text and layout understanding; training the two jointly can combine their strengths and improve robustness and diversity, but it raises optimization challenges. Method: A three-stage progressive training strategy is proposed: (1) Parallel Multi-Task Pre-training (PMTP), which uses shared tokens to accelerate training; (2) Dual Joint Optimization (DJO), which exploits task duality to integrate the two tasks sequentially; and (3) Cycle RL, which uses consistency constraints as rewards and the GRPO strategy to remove the dependence on visual supervision. Result: The model reaches state-of-the-art performance on both layout-to-image generation and image grounding benchmarks, and experiments show clear synergistic gains from optimizing the two tasks together. Conclusion: Jointly modeling layout-to-image generation and image grounding is feasible and effective, and the progressive training strategy overcomes the difficulties of joint optimization, improving the model's unified capabilities. Abstract: In this work, we present EchoGen, a unified framework for layout-to-image generation and image grounding, capable of generating images with accurate layouts and high fidelity to text descriptions (e.g., spatial relationships), while grounding the image robustly at the same time. We believe that image grounding possesses strong text and layout understanding abilities, which can compensate for the corresponding limitations in layout-to-image generation. At the same time, images generated from layouts exhibit high diversity in content, thereby enhancing the robustness of image grounding. Jointly training both tasks within a unified model can promote performance improvements for each. However, we identify that this joint training paradigm encounters several optimization challenges and results in restricted performance. To address these issues, we propose progressive training strategies. First, the Parallel Multi-Task Pre-training (PMTP) stage equips the model with basic abilities for both tasks, leveraging shared tokens to accelerate training. Next, the Dual Joint Optimization (DJO) stage exploits task duality to sequentially integrate the two tasks, enabling unified optimization. Finally, the Cycle RL stage eliminates reliance on visual supervision by using consistency constraints as rewards, significantly enhancing the model's unified capabilities via the GRPO strategy. Extensive experiments demonstrate state-of-the-art results on both layout-to-image generation and image grounding benchmarks, and reveal clear synergistic gains from optimizing the two tasks together.
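
The Cycle RL stage uses consistency constraints as rewards together with GRPO. The abstract does not define the reward, so the sketch below makes two assumptions that should be read as illustrative only: the consistency reward is taken to be the IoU between an input layout box and the box grounded back from the generated image, and the advantage is the group-normalized reward commonly associated with GRPO. The function names are hypothetical.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (zero mean, unit std)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Usage: one input layout box, three sampled generations whose grounded
# boxes (hypothetical values) are scored against it.
layout_box = (0.0, 0.0, 2.0, 2.0)
grounded = [(0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0), (5.0, 5.0, 6.0, 6.0)]
rewards = [box_iou(layout_box, g) for g in grounded]
advantages = group_relative_advantages(rewards)
```

Because the reward only compares a layout with its re-grounded counterpart, no ground-truth image is needed, which matches the abstract's claim that the stage removes reliance on visual supervision.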

[202] Universal Skeleton Understanding via Differentiable Rendering and MLLMs

Ziyi Wang,Peiming Li,Xinshun Wang,Yang Tang,Kai-Kuang Ma,Mengyuan Liu

Main category: cs.CV

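TL;DR: This paper proposes SkeletonLLM, which uses DrAction, a differentiable, format-agnostic renderer, to convert arbitrary skeleton sequences into image sequences that an MLLM can process. Combined with causal reasoning distillation and discriminative finetuning, it enables general understanding of a non-native modality (human skeletons).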

Details Motivation: Existing methods struggle to let multimodal large language models (MLLMs) directly and effectively process structured, non-visual data such as human skeletons, and they suffer from information loss or poor generalization. Method: The SkeletonLLM framework consists of (1) DrAction, an end-to-end differentiable, format-agnostic renderer that converts skeletal kinematics into image sequences; and (2) a cooperative training strategy that combines Causal Reasoning Distillation (transferring structured reasoning from a teacher model) with Discriminative Finetuning (sharpening decision boundaries between confusable actions). Result: The model generalizes strongly across diverse tasks, including action recognition, captioning, reasoning, and cross-format transfer. Conclusion: SkeletonLLM offers a viable and effective general approach for extending MLLMs to non-native modalities, particularly structured temporal data. Abstract: Multimodal large language models (MLLMs) exhibit strong visual-language reasoning, yet remain confined to their native modalities and cannot directly process structured, non-visual data such as human skeletons. Existing methods either compress skeleton dynamics into lossy feature vectors for text alignment, or quantize motion into discrete tokens that generalize poorly across heterogeneous skeleton formats. We present SkeletonLLM, which achieves universal skeleton understanding by translating arbitrary skeleton sequences into the MLLM's native visual modality. At its core is DrAction, a differentiable, format-agnostic renderer that converts skeletal kinematics into compact image sequences. Because the pipeline is end-to-end differentiable, MLLM gradients can directly guide the rendering to produce task-informative visual tokens. To further enhance reasoning capabilities, we introduce a cooperative training strategy: Causal Reasoning Distillation transfers structured, step-by-step reasoning from a teacher model, while Discriminative Finetuning sharpens decision boundaries between confusable actions. SkeletonLLM demonstrates strong generalization on diverse tasks including recognition, captioning, reasoning, and cross-format transfer, suggesting a viable path for applying MLLMs to non-native modalities. Code will be released upon acceptance.

[203] Unified Spatio-Temporal Token Scoring for Efficient Video VLMs

Jianrui Zhang,Yue Yang,Rohun Tripathi,Winson Han,Ranjay Krishna,Christopher Clark,Yong Jae Lee,Sangho Lee

Main category: cs.CV

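TL;DR: This paper proposes Spatio-Temporal Token Scoring (STTS), a lightweight module for vision-language models (VLMs) that prunes vision tokens jointly across space and time in a unified, end-to-end manner, without text conditioning or token merging, substantially improving computational efficiency with only a small performance drop.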

Details Motivation: Existing token pruning methods either operate only inside the ViT without adapting to downstream vision-language tasks, or operate only inside the LLM and depend on complex text-conditioned mechanisms; video tasks contain substantial spatio-temporal redundancy and need a more general, efficient, end-to-end trainable pruning scheme. Method: The STTS module learns temporal token scores through an auxiliary loss and spatial scores through downstream LLM gradients and, together with an efficient packing algorithm, prunes vision tokens across both the ViT and the LLM, with no text conditioning, no token merging, and full support for end-to-end training. Result: Across 13 short and long video QA tasks, pruning 50% of vision tokens improves training and inference efficiency by 62% with only a 0.7% drop in average performance; efficiency gains grow with the number of sampled frames, and test-time scaling for long-video QA adds a further 0.5-1% performance gain over the baseline. Conclusion: STTS is a novel, simple, and effective technique for unified, architecture-wide vision token pruning that combines high efficiency with broad compatibility, offering a practical route to lighter video VLMs. Abstract: Token pruning is essential for enhancing the computational efficiency of vision-language models (VLMs), particularly for video-based tasks where temporal redundancy is prevalent. Prior approaches typically prune tokens either (1) within the vision transformer (ViT) exclusively for unimodal perception tasks such as action recognition and object segmentation, without adapting to downstream vision-language tasks; or (2) only within the LLM while leaving the ViT output intact, often requiring complex text-conditioned token selection mechanisms. In this paper, we introduce Spatio-Temporal Token Scoring (STTS), a simple and lightweight module that prunes vision tokens across both the ViT and the LLM without text conditioning or token merging, and is fully compatible with end-to-end training. By learning how to score temporally via an auxiliary loss and spatially via LLM downstream gradients, aided by our efficient packing algorithm, STTS prunes 50% of vision tokens throughout the entire architecture, resulting in a 62% improvement in efficiency during both training and inference with only a 0.7% drop in average performance across 13 short and long video QA tasks. Efficiency gains increase with more sampled frames per video. Applying test-time scaling for long-video QA further yields performance gains of 0.5-1% compared to the baseline. Overall, STTS represents a novel, simple yet effective technique for unified, architecture-wide vision token pruning.
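
Score-based pruning of the kind described above reduces, at inference time, to keeping the highest-scoring tokens. The sketch below shows only that selection step with NumPy; how STTS actually learns its scores, combines temporal and spatial scores, and packs the surviving tokens is not reproduced, and the function name `prune_tokens` and the top-k rule are assumptions for illustration.

```python
import numpy as np

def prune_tokens(tokens, scores, keep_ratio=0.5):
    """Keep the highest-scoring tokens while preserving their original order.

    tokens: array of shape (N, D); scores: array of shape (N,), one
    importance score per token (in STTS these would be learned).
    Returns the kept tokens and their indices in the original sequence.
    """
    scores = np.asarray(scores, dtype=np.float64)
    n_keep = max(1, int(round(keep_ratio * len(scores))))
    keep = np.sort(np.argsort(scores)[-n_keep:])
    return tokens[keep], keep

# Usage: six toy tokens, half of them pruned.
tokens = np.arange(12, dtype=np.float64).reshape(6, 2)
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7]
kept_tokens, kept_idx = prune_tokens(tokens, scores, keep_ratio=0.5)
```

Keeping the surviving indices sorted preserves the spatio-temporal order of the tokens, which matters if positional information is attached to them downstream.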