cs.CL

[1] Contextual Earnings-22: A Speech Recognition Benchmark with Custom Vocabulary in the Wild

Berkin Durmus, Chen Cen, Eduardo Pacheco, Arda Okan, Atila Orhon

Main category: cs.CL

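TL;DR: This paper introduces Contextual Earnings-22, a dataset that fills the gap left by the lack of a standardized benchmark for contextual speech-to-text. Six strong baselines across two families of methods, keyword prompting and keyword boosting, show substantial accuracy gains in realistic custom-vocabulary settings.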

Details

Motivation: Speech recognition accuracy has plateaued on academic benchmarks, while industrial benchmarks and high-stakes domains suggest continued progress. The authors argue that the key difference is contextual conditioning, in particular the ability to recognize rare, context-defined custom vocabulary, and that existing research lacks a standardized benchmark for it.

Method: Build the open dataset Contextual Earnings-22 (on top of Earnings-22) with realistic custom vocabulary contexts, and establish six strong baselines covering the two dominant approaches to contextual speech recognition: keyword prompting and keyword boosting.

Result: Both approaches reach comparable and significantly improved accuracy once scaled up to large-scale systems, revealing underestimated latent progress in contextual speech recognition.

Conclusion: Contextual Earnings-22 provides the first realistic, open, standardized benchmark for contextual speech recognition. The experiments indicate that contextual modeling at scale improves recognition of custom vocabulary and moves the field toward practical use.

Abstract: The accuracy frontier of speech-to-text systems has plateaued on academic benchmarks. In contrast, industrial benchmarks and adoption in high-stakes domains suggest otherwise. We hypothesize that the primary difference between the two is contextual conditioning: Academic benchmarks are dominated by frequently encountered general vocabulary that is relatively easy to recognize compared with rare and context-defined custom vocabulary that has disproportionate impact on the usability of speech transcripts. Despite progress on contextual speech-to-text, there is no standardized benchmark. We introduce Contextual Earnings-22, an open dataset built upon Earnings-22, with realistic custom vocabulary contexts to foster research and reveal latent progress. We set six strong baselines for two dominant approaches: keyword prompting and keyword boosting. Experiments show both reach comparable and significantly improved accuracy when scaled from proof-of-concept to large-scale systems.

[2] Hybrid CNN-Transformer Architecture for Arabic Speech Emotion Recognition

Youcef Soufiane Gheffari, Oussama Mustapha Benouddane, Samiya Silarbi

Main category: cs.CL

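TL;DR: This paper presents an Arabic speech emotion recognition (SER) system based on a hybrid CNN-Transformer architecture. It reaches 97.8% accuracy and a macro F1-score of 0.98 on the EYASE dataset, supporting the effectiveness of the approach for low-resource languages.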

Details

Motivation: Research on Arabic speech emotion recognition is scarce, mainly because annotated datasets are limited.

Method: A hybrid CNN-Transformer architecture: the CNN extracts discriminative spectral features from Mel-spectrograms, and the Transformer encoder models long-range temporal dependencies in speech.

Result: 97.8% accuracy and a macro F1-score of 0.98 on the EYASE (Egyptian Arabic speech emotion) corpus.

Conclusion: Combining convolutional feature extraction with attention-based modeling is very effective for Arabic SER, and Transformer-based methods show promise for low-resource languages.

Abstract: Recognizing emotions from speech using machine learning has become an active research area due to its importance in building human-centered applications. However, while many studies have been conducted in English, German, and other European and Asian languages, research in Arabic remains scarce because of the limited availability of annotated datasets. In this paper, we present an Arabic Speech Emotion Recognition (SER) system based on a hybrid CNN-Transformer architecture. The model leverages convolutional layers to extract discriminative spectral features from Mel-spectrogram inputs and Transformer encoders to capture long-range temporal dependencies in speech. Experiments were conducted on the EYASE (Egyptian Arabic speech emotion) corpus, and the proposed model achieved 97.8% accuracy and a macro F1-score of 0.98. These results demonstrate the effectiveness of combining convolutional feature extraction with attention-based modeling for Arabic SER and highlight the potential of Transformer-based approaches in low-resource languages.

[3] Cross-Tokenizer LLM Distillation through a Byte-Level Interface

Avyav Kumar Singh, Yen-Chen Wu, Alexandru Cioba, Alberto Bernacchia, Davide Buffelli

Main category: cs.CL

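TL;DR: This paper proposes Byte-Level Distillation (BLD), a simple and effective method that addresses cross-tokenizer distillation (CTD) by distilling at the byte level. It avoids the complex vocabulary alignment strategies of earlier methods and performs well on several benchmarks.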

Details

Motivation: Existing cross-tokenizer distillation (CTD) methods rely on heuristic vocabulary alignment strategies that are complex and limited in effectiveness, so a simpler and more general solution is needed.

Method: Byte-Level Distillation (BLD) converts the teacher model's output distribution to byte-level probabilities, attaches a lightweight byte-level decoder head to the student, and distills at the byte level.

Result: Across a range of distillation tasks with models from 1B to 8B parameters, BLD matches or surpasses more sophisticated CTD methods, supporting the byte level as a general interface.

Conclusion: The byte level is a natural common ground for cross-tokenizer knowledge transfer, but CTD is not yet fully solved: consistent improvements across tasks and benchmarks remain a challenge.

Abstract: Cross-tokenizer distillation (CTD), the transfer of knowledge from a teacher to a student language model when the two use different tokenizers, remains a largely unsolved problem. Existing approaches rely on heuristic strategies to align mismatched vocabularies, introducing considerable complexity. In this paper, we propose a simple but effective baseline called Byte-Level Distillation (BLD) which enables CTD by operating at a common interface across tokenizers: the byte level. In more detail, we convert the teacher's output distribution to byte-level probabilities, attach a lightweight byte-level decoder head to the student, and distill through this shared byte-level interface. Despite its simplicity, BLD performs competitively with, and on several benchmarks surpasses, significantly more sophisticated CTD methods, across a range of distillation tasks with models from 1B to 8B parameters. Our results suggest that the byte level is a natural common ground for cross-tokenizer knowledge transfer, while also highlighting that consistent improvements across all tasks and benchmarks remain elusive, underscoring that CTD is still an open problem.
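To make the phrase "convert the teacher's output distribution to byte-level probabilities" concrete, here is a minimal sketch of one possible reading: marginalizing a next-token distribution onto the first UTF-8 byte of each token. The toy vocabulary, the function name `first_byte_distribution`, and the choice to look only at the first byte are assumptions for illustration; the abstract does not specify the actual conversion used in BLD.

```python
from collections import defaultdict


def first_byte_distribution(token_probs):
    """Marginalize a next-token distribution onto the next byte.

    token_probs: dict mapping token string -> probability
                 (a hypothetical toy vocabulary, not a real tokenizer).
    Returns a dict mapping byte value (0-255) -> probability.
    """
    byte_probs = defaultdict(float)
    for token, prob in token_probs.items():
        encoded = token.encode("utf-8")
        if not encoded:
            continue  # an empty token emits no byte
        # Every token starting with this byte contributes its mass.
        byte_probs[encoded[0]] += prob
    total = sum(byte_probs.values())
    # Renormalize in case empty tokens were skipped.
    return {byte: prob / total for byte, prob in byte_probs.items()}


# Toy teacher distribution over four tokens (assumed values).
teacher = {"the": 0.5, "then": 0.2, "a": 0.2, "an": 0.1}
dist = first_byte_distribution(teacher)
```

In this toy case the tokens "the" and "then" pool their mass on the byte for "t" (about 0.7) and "a" and "an" pool theirs on the byte for "a" (about 0.3). Because bytes are shared by every tokenizer, a student with a different vocabulary could be trained against such a distribution without any token alignment step.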

[4] Lexical Tone is Hard to Quantize: Probing Discrete Speech Units in Mandarin and Yorùbá

Opeyemi Osakuade, Simon King

Main category: cs.CL

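TL;DR: This paper finds that discrete speech units (DSUs) encode suprasegmental information such as lexical tone less reliably than segmental structure, even though the self-supervised learning (SSL) latent representations themselves do encode tone. The authors propose a staged residual clustering method to improve tone encoding.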

Details

Motivation: DSUs are widely used in speech tasks such as TTS and multimodal dialogue, but they encode suprasegmental features (tone, intonation) unreliably, which calls for improvement.

Method: On two tone languages (Mandarin and Yorùbá), compare how well SSL latent representations and several quantization methods (including K-means and residual K-means) encode tone.

Result: SSL latent representations do encode tone, but conventional DSU quantization (including K-means) favors segmental structure and loses tone; residual K-means preserves tone information better.

Conclusion: Current DSU quantization strategies are poorly suited to suprasegmental modeling, and new tone-aware or prosody-aware representation learning methods are needed; residual quantization is one feasible direction.

Abstract: Discrete speech units (DSUs) are derived by quantising representations from models trained using self-supervised learning (SSL). They are a popular representation for a wide variety of spoken language tasks, including those where prosody matters. DSUs are especially convenient for tasks where text and speech are jointly modelled, such as text-to-speech and multimodal dialogue systems. But we have found that DSUs encode suprasegmental information less reliably than segmental structure, which we demonstrate in this work using lexical tone, though this limitation likely extends to other suprasegmental features such as prosody. Our investigations using the tone languages Mandarin and Yorùbá show that the SSL latent representations themselves do encode tone, yet DSUs obtained using quantisation tend to prioritise phonetic structure, which makes lexical tone less reliably encoded. This remains true for a variety of quantisation methods, not only the most common, K-means. We conclude that current DSU quantisation strategies have limitations for suprasegmental features, which suggests a need for new, tone-aware (or prosody-aware) techniques in speech representation learning. We point towards a potential form of the solution by performing K-means clustering once to encode phonetic information, then again on the residual representation, which better encodes lexical tone.
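The two-pass idea in the last sentence of the abstract (cluster once, then cluster the residuals) can be sketched on toy data. Everything below is an illustrative assumption: the 2-D points stand in for SSL frames, the first coordinate plays the role of a large "phonetic" difference and the second a small "tone" difference, and the deterministic farthest-point initialization is chosen only to keep the example reproducible. It is not the authors' implementation.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Plain K-means with deterministic farthest-point initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Pick the point farthest from all centroids chosen so far.
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    labels = []
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(d) / len(members) for d in zip(*members)))
            else:
                updated.append(centroids[j])  # keep an empty cluster in place
        if updated == centroids:
            break  # converged: labels match the current centroids
        centroids = updated
    return centroids, labels


def residual_kmeans(points, k_first, k_second):
    """Cluster once, subtract each point's centroid, then cluster the residuals."""
    centroids, first_codes = kmeans(points, k_first)
    residuals = [
        tuple(x - c for x, c in zip(p, centroids[lab]))
        for p, lab in zip(points, first_codes)
    ]
    _, second_codes = kmeans(residuals, k_second)
    return first_codes, second_codes


# Toy frames: x separates two "phones" (0 vs 10), y separates two "tones" (+1 vs -1).
frames = [(0, 1), (0, -1), (10, 1), (10, -1)]
phone_codes, tone_codes = residual_kmeans(frames, k_first=2, k_second=2)
```

On this toy input the first pass groups frames by the large x difference (codes `[0, 0, 1, 1]`) and the second pass, run on what is left after removing the centroid, recovers the small y difference (codes `[0, 1, 0, 1]`). A single K-means pass with two clusters would have discarded the second distinction entirely, which mirrors the abstract's observation that quantisation prioritises phonetic structure.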

[5] Enabling Intrinsic Reasoning over Dense Geospatial Embeddings with DFR-Gemma

Xuechen Zhang, Aviv Slobodkin, Joydeep Paul, Mandar Sharma, Samet Oymak, Shravya Shetty, Gautam Prasad

Main category: cs.CL

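TL;DR: This paper proposes the DFR-Gemma framework, which lets large language models reason directly over dense geospatial embeddings, avoiding the redundancy and distortion introduced by conversion to text. Its zero-shot reasoning ability and efficiency advantage are validated on a multi-task geospatial benchmark.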

Details

Motivation: Current ways of integrating embeddings from geospatial foundation models (such as PDFM) with large language models (LLMs) suffer from redundancy, token inefficiency, and numerical inaccuracy, so a more direct and efficient integration paradigm is needed.

Method: Direct Feature Reasoning-Gemma (DFR-Gemma) aligns high-dimensional geospatial embeddings with the LLM's latent space through a lightweight projector, so that the embeddings are fed in as semantic tokens alongside natural language instructions and can be reasoned over directly.

Result: On a newly built multi-task geospatial benchmark (feature querying, comparison, semantic description), DFR-Gemma shows strong zero-shot reasoning, decodes latent spatial patterns accurately, and is significantly more efficient at inference than text-based baselines.

Conclusion: Treating geospatial embeddings as primary data inputs rather than intermediate text representations is a more direct, efficient, and scalable route to multimodal geospatial intelligence.

Abstract: Representation learning for geospatial and spatio-temporal data plays a critical role in enabling general-purpose geospatial intelligence. Recent geospatial foundation models, such as the Population Dynamics Foundation Model (PDFM), encode complex population and mobility dynamics into compact embeddings. However, their integration with Large Language Models (LLMs) remains limited. Existing approaches to LLM integration treat these embeddings as retrieval indices or convert them into textual descriptions for reasoning, introducing redundancy, token inefficiency, and numerical inaccuracies. We propose Direct Feature Reasoning-Gemma (DFR-Gemma), a novel framework that enables LLMs to reason directly over dense geospatial embeddings. DFR aligns high-dimensional embeddings with the latent space of an LLM via a lightweight projector, allowing embeddings to be injected as semantic tokens alongside natural language instructions. This design eliminates the need for intermediate textual representations and enables intrinsic reasoning over spatial features. To evaluate this paradigm, we introduce a multi-task geospatial benchmark that pairs embeddings with diverse question-answer tasks, including feature querying, comparison, and semantic description. Experimental results show that DFR allows LLMs to decode latent spatial patterns and perform accurate zero-shot reasoning across tasks, while significantly improving efficiency compared to text-based baselines. Our results demonstrate that treating embeddings as primary data inputs provides a more direct, efficient, and scalable approach to multimodal geospatial intelligence.

[6] Decompose, Look, and Reason: Reinforced Latent Reasoning for VLMs

Mengdan Zhu, Senhao Cheng, Liang Zhao

Main category: cs.CL

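TL;DR: This paper proposes DLR (Decompose, Look, and Reason), a reinforced latent reasoning framework that improves vision-language models on complex visual reasoning tasks. It dynamically decomposes the question, extracts premise-conditioned continuous visual latents, and generates visually grounded reasoning steps, improving both accuracy and interpretability.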

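Details

Motivation: Existing vision-language models lose visual information in complex visual reasoning because of textual chain-of-thought (CoT); tool calls are costly, and localized patch embeddings cannot support multi-step semantic reasoning.

Method: The DLR framework (1) dynamically decomposes the query into textual premises, (2) extracts premise-conditioned continuous visual latents, and (3) reasons over visual evidence. A three-stage training pipeline and a Spherical Gaussian Latent Policy strengthen exploration in the latent space.

Result: On several vision-centric benchmarks, DLR consistently outperforms strong baselines (text-only CoT, interleaved multimodal CoT, and latent reasoning methods) while offering better stepwise interpretability.

Conclusion: By combining structured question decomposition, conditioned visual latents, and reinforced latent-space exploration, DLR mitigates visual information loss and offers an efficient, interpretable paradigm for complex visual reasoning.

Abstract: Vision-Language Models often struggle with complex visual reasoning due to the visual information loss in textual CoT. Existing methods either add the cost of tool calls or rely on localized patch-based embeddings that are insufficient to extract semantics in multi-step reasoning. We propose "Decompose, Look, and Reason" (DLR), a reinforced latent reasoning framework that dynamically decomposes queries into textual premises, extracts premise-conditioned continuous visual latents, and deduces answers through grounded rationales. We introduce a three-stage training pipeline and propose a novel Spherical Gaussian Latent Policy to enable effective exploration in the latent space. Extensive experiments on vision-centric benchmarks show that DLR consistently outperforms strong baselines, including text-only, interleaved multimodal CoT, and latent reasoning methods, while providing superior stepwise interpretability.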

[7] EMSDialog: Synthetic Multi-person Emergency Medical Service Dialogue Generation from Electronic Patient Care Reports via Multi-LLM Agents

Xueren Ge, Sahil Murtaza, Anthony Cortez, Homa Alemzadeh

Main category: cs.CL

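TL;DR: This paper proposes a multi-agent generation pipeline grounded in electronic patient care reports (ePCRs) and topic flow, and uses it to build the EMSDialog dataset, which improves diagnosis prediction in emergency medical service conversations.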

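Details

Motivation: Existing medical dialogue corpora are mostly two-party and lack the structure and annotations needed for dynamic diagnosis prediction in multi-role, multi-stage clinical conversations.

Method: An ePCR-grounded, topic-flow-guided multi-agent generation pipeline that iteratively plans, generates, and self-refines dialogues, combined with rule-based checks for factuality and topic coherence, to produce high-quality synthetic conversations.

Result: The resulting EMSDialog dataset contains 4,414 synthetic multi-speaker EMS conversations annotated with 43 diagnoses, speaker roles, and turn-level topics. Human and LLM evaluations confirm its quality and realism, and training augmented with EMSDialog improves the accuracy, timeliness, and stability of diagnosis prediction.

Conclusion: EMSDialog fills the gap in datasets for diagnosis prediction in multi-role clinical conversations, and the generation method and dataset together advance the task.

Abstract: Conversational diagnosis prediction requires models to track evolving evidence in streaming clinical conversations and decide when to commit to a diagnosis. Existing medical dialogue corpora are largely dyadic or lack the multi-party workflow and annotations needed for this setting. We introduce an ePCR-grounded, topic-flow-based multi-agent generation pipeline that iteratively plans, generates, and self-refines dialogues with rule-based factual and topic flow checks. The pipeline yields EMSDialog, a dataset of 4,414 synthetic multi-speaker EMS conversations based on a real-world ePCR dataset, annotated with 43 diagnoses, speaker roles, and turn-level topics. Human and LLM evaluations confirm high quality and realism of EMSDialog using both utterance- and conversation-level metrics. Results show that EMSDialog-augmented training improves accuracy, timeliness, and stability of EMS conversational diagnosis prediction.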

[8] TR-EduVSum: A Turkish-Focused Dataset and Consensus Framework for Educational Video Summarization

Figen Eğin, Aytuğ Onan

Main category: cs.CL

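TL;DR: This paper proposes AutoMUP, a method that automatically generates gold-standard summaries of Turkish educational videos from multiple human summaries. It produces graded summaries through embedding-based clustering and consensus modeling, and is validated on the newly built TR-EduVSum dataset.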

Details

Motivation: Educational video summarization lacks an automatic, reproducible way to build gold-standard summaries, a gap that is especially pronounced for lower-resource languages such as Turkish.

Method: AutoMUP extracts meaning units from multiple human summaries, clusters them using embeddings, statistically models inter-participant agreement, and generates graded summaries according to consensus weight; the gold summary is the highest-consensus configuration.

Result: AutoMUP summaries show high semantic overlap with summaries from strong LLMs such as Flash 2.5 and GPT-5.1. Ablations show that consensus weight and clustering play a decisive role in summary quality. The method can be extended to other Turkic languages at low cost.

Conclusion: AutoMUP offers a scalable, reproducible framework for generating gold-standard summaries of educational videos in low-resource languages, making summary evaluation more objective and reliable.

Abstract: This study presents a framework for generating the gold-standard summary fully automatically and reproducibly based on multiple human summaries of Turkish educational videos. Within the scope of the study, a new dataset called TR-EduVSum was created, encompassing 82 Turkish course videos in the field of "Data Structures and Algorithms" and containing a total of 3281 independent human summaries. Inspired by existing pyramid-based evaluation approaches, the AutoMUP (Automatic Meaning Unit Pyramid) method is proposed, which extracts consensus-based content from multiple human summaries. AutoMUP clusters the meaning units extracted from human summaries using embeddings, statistically models inter-participant agreement, and generates graded summaries based on consensus weight. In this framework, the gold summary corresponds to the highest-consensus AutoMUP configuration, constructed from the most frequently supported meaning units across human summaries. Experimental results show that AutoMUP summaries exhibit high semantic overlap with robust LLM (Large Language Model) summaries such as Flash 2.5 and GPT-5.1. Furthermore, ablation studies clearly demonstrate the decisive role of consensus weight and clustering in determining summary quality. The proposed approach can be generalized to other Turkic languages at low cost.

[9] Reasoning-Based Refinement of Unsupervised Text Clusters with LLMs

Tunazzina Islam

Main category: cs.CL

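TL;DR: This paper proposes a framework that refines unsupervised clustering results through large language model (LLM) reasoning. Three stages (coherence verification, redundancy adjudication, and label grounding) improve cluster coherence, interpretability, and human alignment without labeled data.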

Details

Motivation: Unsupervised semantic clustering often yields incoherent, redundant, or poorly grounded clusters that are hard to validate without labeled data.

Method: A three-stage LLM reasoning framework: (i) coherence verification, which judges whether a cluster summary is supported by its member texts; (ii) redundancy adjudication, which merges or rejects candidate clusters based on semantic overlap; (iii) label grounding, which generates interpretable cluster labels fully unsupervised. The design decouples representation learning from structural validation.

Result: On real-world corpora from two different social platforms, the framework improves cluster coherence and human-judged label quality over classical topic models and recent representation-based baselines. Human evaluation shows strong agreement with LLM-generated labels, and a cross-platform robustness analysis supports its stability.

Conclusion: LLMs can serve as a general mechanism for validating and refining semantic structure, making unsupervised analysis of large text collections more reliable and interpretable.

Abstract: Unsupervised methods are widely used to induce latent semantic structure from large text collections, yet their outputs often contain incoherent, redundant, or poorly grounded clusters that are difficult to validate without labeled data. We propose a reasoning-based refinement framework that leverages large language models (LLMs) not as embedding generators, but as semantic judges that validate and restructure the outputs of arbitrary unsupervised clustering algorithms. Our framework introduces three reasoning stages: (i) coherence verification, where LLMs assess whether cluster summaries are supported by their member texts; (ii) redundancy adjudication, where candidate clusters are merged or rejected based on semantic overlap; and (iii) label grounding, where clusters are assigned interpretable labels in a fully unsupervised manner. This design decouples representation learning from structural validation and mitigates common failure modes of embedding-only approaches. We evaluate the framework on real-world social media corpora from two platforms with distinct interaction models, demonstrating consistent improvements in cluster coherence and human-aligned labeling quality over classical topic models and recent representation-based baselines. Human evaluation shows strong agreement with LLM-generated labels, despite the absence of gold-standard annotations. We further conduct robustness analyses under matched temporal and volume conditions to assess cross-platform stability. Beyond empirical gains, our results suggest that LLM-based reasoning can serve as a general mechanism for validating and refining unsupervised semantic structure, enabling more reliable and interpretable analyses of large text collections without supervision.

[10] CAMO: A Class-Aware Minority-Optimized Ensemble for Robust Language Model Evaluation on Imbalanced Data

Mohamed Ehab, Ali Hamdi, Khaled Shaban

Main category: cs.CL

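TL;DR: This paper proposes CAMO (Class-Aware Minority-Optimized), a new ensemble method for class-imbalanced problems. A hierarchical procedure combines vote distributions, confidence calibration, and inter-model uncertainty to dynamically strengthen minority-class predictions, yielding clear macro F1 gains on several imbalanced benchmarks.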

Details

Motivation: Real-world classification is severely hampered by class imbalance; traditional ensembles favor majority classes, hurting minority-class performance and overall F1.

Method: CAMO is a hierarchical ensemble method that integrates vote-distribution modeling, confidence calibration, and inter-model uncertainty estimation to dynamically raise the weight of minority-class predictions.

Result: On two highly imbalanced, domain-specific datasets (DIAR-AI/Emotion and BEA 2025), under zero-shot and fine-tuned settings and with eight language models (three LLMs and five SLMs), CAMO consistently achieves the highest strict macro F1-score, ahead of seven baseline ensemble methods.

Conclusion: CAMO is a reliable, domain-neutral ensemble framework for imbalanced classification. Its advantage works in concert with model adaptation, indicating that the best ensemble strategy depends on the properties of the base models.

Abstract: Real-world categorization is severely hampered by class imbalance because traditional ensembles favor majority classes, which lowers minority performance and overall F1-score. We provide a unique ensemble technique for imbalanced problems called CAMO (Class-Aware Minority-Optimized). Through a hierarchical procedure that incorporates vote distributions, confidence calibration, and inter-model uncertainty, CAMO dynamically boosts underrepresented classes while preserving and amplifying minority forecasts. We verify CAMO on two highly unbalanced, domain-specific benchmarks: the DIAR-AI/Emotion dataset and the ternary BEA 2025 dataset. We benchmark against seven proven ensemble algorithms using eight different language models (three LLMs and five SLMs) under zero-shot and fine-tuned settings. With refined models, CAMO consistently earns the greatest strict macro F1-score, setting a new benchmark. Its benefit works in concert with model adaptation, showing that the best ensemble choice depends on model properties. This proves that CAMO is a reliable, domain-neutral framework for unbalanced categorization.

[11] ADAG: Automatically Describing Attribution Graphs

Aryaman Arora, Zhengxuan Wu, Jacob Steinhardt, Sarah Schwettmann

Main category: cs.CL

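TL;DR: This paper proposes ADAG, an end-to-end automated pipeline for circuit-tracing analysis. Using attribution profiles, a new clustering algorithm, and an LLM explainer-simulator setup, it produces interpretable descriptions of a language model's internal computation paths, and is applied both to known tasks and to locating a harmful jailbreak behavior in Llama 3.1.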

Details

Motivation: Existing circuit-tracing methods depend heavily on manual interpretation of what each feature does; they lack automation and scalability and cannot readily support interpretability research on large models.

Method: Introduce attribution profiles, which quantify a feature's input and output gradient effects; design a new feature clustering algorithm; and build an LLM explainer-simulator system that automatically generates and scores natural-language explanations of feature function.

Result: The system recovers interpretable circuits on circuit-tracing tasks previously analysed by humans, and identifies steerable feature clusters responsible for a harmful-advice jailbreak in Llama 3.1 8B Instruct.

Conclusion: ADAG fully automates circuit tracing and improves its interpretability, offering a scalable and verifiable paradigm for analysing the internal mechanisms of large models.

Abstract: In language model interpretability research, circuit tracing aims to identify which internal features causally contributed to a particular output and how they affected each other, with the goal of explaining the computations underlying some behaviour. However, all prior circuit tracing work has relied on ad-hoc human interpretation of the role that each feature in the circuit plays, via manual inspection of data artifacts such as the dataset examples the component activates on. We introduce ADAG, an end-to-end pipeline for describing these attribution graphs which is fully automated. To achieve this, we introduce attribution profiles which quantify the functional role of a feature via its input and output gradient effects. We then introduce a novel clustering algorithm for grouping features, and an LLM explainer-simulator setup which generates and scores natural-language explanations of the functional role of these feature groups. We run our system on known human-analysed circuit-tracing tasks and recover interpretable circuits, and further show ADAG can find steerable clusters which are responsible for a harmful advice jailbreak in Llama 3.1 8B Instruct.

[12] DIVERSED: Relaxed Speculative Decoding via Dynamic Ensemble Verification

Ziyi Wang, Siva Rajesh Kasa, Ankith M S, Santhosh Kumar Kasa, Jiaru Zou, Sumit Negi, Ruqi Zhang, Nan Jiang, Qifan Song

Main category: cs.CL

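TL;DR: This paper proposes DIVERSED, which dynamically relaxes the verification step of speculative decoding to improve efficiency while preserving generation quality.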

Details

Motivation: The verification step in existing speculative decoding is too strict, which lowers the acceptance rate and limits the achievable speedup.

Method: An ensemble-based dynamic verifier that adaptively blends the draft and target model probability distributions depending on task and context.

Result: Theoretical analysis supports the method, and experiments show substantially higher inference efficiency than standard speculative decoding.

Conclusion: DIVERSED speeds up large language model inference without sacrificing generation quality.

Abstract: Speculative decoding is an effective technique for accelerating large language model inference by drafting multiple tokens in parallel. In practice, its speedup is often bottlenecked by a rigid verification step that strictly enforces the accepted token distribution to exactly match the target model. This constraint leads to the rejection of many plausible tokens, lowering the acceptance rate and limiting overall time speedup. To overcome this limitation, we propose Dynamic Verification Relaxed Speculative Decoding (DIVERSED), a relaxed verification framework that improves time efficiency while preserving generation quality. DIVERSED learns an ensemble-based verifier that blends the draft and target model distributions with a task-dependent and context-dependent weight. We provide theoretical justification for our approach and demonstrate empirically that DIVERSED achieves substantially higher inference efficiency compared to standard speculative decoding methods. Code is available at: https://github.com/comeusr/diversed.
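The effect of blending the two distributions can be illustrated with the standard speculative-decoding acceptance rule, in which a drafted token x is accepted with probability min(1, p_target(x) / p_draft(x)). The sketch below simply swaps the target distribution for a blend w * p_draft + (1 - w) * p_target. This is one plausible reading of "blends the draft and target model distributions", not the rule from the paper: the fixed weight `w`, the function names, and the toy probabilities are all assumptions, and DIVERSED learns its weight per task and context rather than fixing it.

```python
def blend(draft, target, w):
    """Mix two token distributions: w * draft + (1 - w) * target."""
    tokens = set(draft) | set(target)
    return {t: w * draft.get(t, 0.0) + (1 - w) * target.get(t, 0.0) for t in tokens}


def accept_prob(token, draft, verifier):
    """Standard acceptance rule, evaluated against an arbitrary verifier."""
    q = draft.get(token, 0.0)
    if q == 0.0:
        return 0.0  # the draft model can never propose this token
    return min(1.0, verifier.get(token, 0.0) / q)


def expected_acceptance(draft, verifier):
    """Probability that a token sampled from the draft is accepted."""
    return sum(q * accept_prob(t, draft, verifier) for t, q in draft.items())


# Toy next-token distributions (assumed values).
draft = {"cat": 0.6, "dog": 0.4}
target = {"cat": 0.3, "dog": 0.7}

strict_rate = expected_acceptance(draft, target)
relaxed_rate = expected_acceptance(draft, blend(draft, target, w=0.5))
```

With these toy numbers the strict verifier accepts a drafted token about 70% of the time, and the half-and-half blend raises that to about 85%. The price is that accepted tokens no longer follow the target distribution exactly, which is why the weight has to be chosen so that generation quality is preserved.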

[13] Efficient and Effective Internal Memory Retrieval for LLM-Based Healthcare Prediction

Mingchen Li, Jiatan Huang, Zonghai Yao, Hong Yu

Main category: cs.CL

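TL;DR: This paper proposes the Keys to Knowledge (K2K) framework, which encodes clinical knowledge into the model's parameter space so that it can be retrieved quickly from an internal key-value memory. This avoids the high latency of conventional RAG and reaches state-of-the-art performance on four healthcare outcome prediction datasets.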

Details

Motivation: Large language models are not reliable enough in healthcare because of hallucinations and a lack of fine-grained medical context. Existing RAG methods depend on retrieval from large external knowledge bases, which causes high latency and cannot meet real-time clinical needs.

Method: The K2K framework encodes key clinical information into the model's parameter space to form an internal key-value memory, and improves retrieval quality with activation-guided probe construction and cross-attention reranking.

Result: State-of-the-art performance on four benchmark healthcare outcome prediction datasets.

Conclusion: By replacing external retrieval with internal knowledge access, K2K keeps latency low while improving the reliability and performance of LLMs on clinical prediction tasks, offering a new paradigm for high-stakes medical AI applications.

Abstract: Large language models (LLMs) hold significant promise for healthcare, yet their reliability in high-stakes clinical settings is often compromised by hallucinations and a lack of granular medical context. While Retrieval Augmented Generation (RAG) can mitigate these issues, standard supervised pipelines require computationally intensive searches over massive external knowledge bases, leading to high latency that is impractical for time-sensitive care. To address this, we introduce Keys to Knowledge (K2K), a novel framework that replaces external retrieval with internal, key-based knowledge access. By encoding essential clinical information directly into the model's parameter space, K2K enables rapid retrieval from internal key-value memory without inference-time overhead. We further enhance retrieval quality through activation-guided probe construction and cross-attention reranking. Experimental results demonstrate that K2K achieves state-of-the-art performance across four benchmark healthcare outcome prediction datasets.

Ziyi Chen, Yasir Khan, Mengyuan Zhang, Cheng Peng, Mengxian Lyu, Yiyang Liu, Krishna Vaddiparti, Robert L Cook, Mattia Prosperi, Yonghui Wu

Main category: cs.CL

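TL;DR: This study develops a large language model (LLM)-based tool for automatically identifying HIV-related stigma in clinical notes. After manually annotating 1,332 sentences and comparing several models, it finds that GatorTron-large performs best (Micro F1 = 0.62) and that few-shot prompting substantially improves generative models.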

Details

Motivation: HIV-related stigma is a key psychosocial factor affecting the mental health, engagement in care, and treatment outcomes of people living with HIV, yet there is no off-the-shelf NLP tool for extracting and categorizing stigma content in clinical notes.

Method: Using clinical notes of PLWH from the University of Florida between 2012 and 2022, candidate sentences were selected with expert-defined keywords and clinical word embeddings. 1,332 sentences were manually annotated across four stigma subscales. Encoder models (GatorTron-large, BERT) were compared with generative LLMs (GPT-OSS-20B, LLaMA-8B, MedGemma-27B) under zero-shot and few-shot settings.

Result: GatorTron-large achieved the best overall performance (Micro F1 = 0.62). In the few-shot setting, GPT-OSS-20B and LLaMA-8B reached 0.57 and 0.59, respectively. Negative Self-Image was the easiest subscale to predict and Personalized Stigma the hardest. Zero-shot generative inference failed in up to 32% of cases.

Conclusion: The study builds the first practical NLP tool for identifying HIV stigma in clinical notes, confirms the advantage of domain-adapted encoder models, and shows how much few-shot prompting helps generative models on this task.

Abstract: Human immunodeficiency virus (HIV)-related stigma is a critical psychosocial determinant of health for people living with HIV (PLWH), influencing mental health, engagement in care, and treatment outcomes. Although stigma-related experiences are documented in clinical narratives, there is a lack of off-the-shelf tools to extract and categorize them. This study aims to develop a large language model (LLM)-based tool for identifying HIV stigma from clinical notes. We identified clinical notes from PLWH receiving care at the University of Florida (UF) Health between 2012 and 2022. Candidate sentences were identified using expert-curated stigma-related keywords and iteratively expanded via clinical word embeddings. A total of 1,332 sentences were manually annotated across four stigma subscales: Concern with Public Attitudes, Disclosure Concerns, Negative Self-Image, and Personalized Stigma. We compared GatorTron-large and BERT as encoder-based baselines, and GPT-OSS-20B, LLaMA-8B, and MedGemma-27B as generative LLMs, under zero-shot and few-shot prompting. GatorTron-large achieved the best overall performance (Micro F1 = 0.62). Few-shot prompting substantially improved generative model performance, with 5-shot GPT-OSS-20B and LLaMA-8B achieving Micro-F1 scores of 0.57 and 0.59, respectively. Performance varied by stigma subscale, with Negative Self-Image showing the highest predictability and Personalized Stigma remaining the most challenging. Zero-shot generative inference exhibited non-trivial failure rates (up to 32%). This study develops the first practical NLP tool for identifying HIV stigma in clinical notes.

[15] SepSeq: A Training-Free Framework for Long Numerical Sequence Processing in LLMs

Jie Sun, Yu Liu, Lu Han, Qiwen Deng, Xiang Shu, Yang Xiao, Xingyu Lu, Jun Zhou, Pengfei Liu, Lintao Ma, Jiancan Wu, Xiang Wang

Main category: cs.CL

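TL;DR: This paper proposes SepSeq, a framework that inserts separator tokens to counter the performance drop LLMs suffer on long numerical sequences because of attention dispersion in Softmax. It is training-free and plug-and-play; across 9 mainstream LLMs it improves relative accuracy by 35.6% on average and reduces inference token consumption by 16.4%.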

Details

Motivation: Transformer-based large language models (LLMs) support large context windows in theory, but their performance drops sharply on long numerical sequences. The authors attribute this to attention dispersion caused by the Softmax mechanism, which keeps the model from focusing on key information.

Method: Separate Sequence (SepSeq) is a training-free, plug-and-play framework that strategically inserts separator tokens into the input sequence. The analysis indicates that separators act as an "attention sink", localizing attention while preserving global context.

Result: In extensive experiments on 9 widely adopted LLMs, SepSeq yields an average relative accuracy improvement of 35.6% across domains and reduces total inference token consumption by 16.4% on average.

Conclusion: SepSeq is a lightweight, general, and efficient method that mitigates attention dispersion on long numerical sequences, improving both performance and computational efficiency.

Abstract: While transformer-based Large Language Models (LLMs) theoretically support massive context windows, they suffer from severe performance degradation when processing long numerical sequences. We attribute this failure to the attention dispersion in the Softmax mechanism, which prevents the model from concentrating attention. To overcome this, we propose Separate Sequence (SepSeq), a training-free, plug-and-play framework to mitigate dispersion by strategically inserting separator tokens. Mechanistically, we demonstrate that separator tokens act as an attention sink, recalibrating attention to focus on local segments while preserving global context. Extensive evaluations on 9 widely-adopted LLMs confirm the effectiveness of our approach: SepSeq yields an average relative accuracy improvement of 35.6% across diverse domains while reducing total inference token consumption by 16.4% on average.
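Because the method works purely on the input text, the basic preprocessing step is easy to picture. The sketch below inserts a separator after every fixed number of values. The separator string `|`, the chunk size of 4, and the function name are assumptions for illustration only; the abstract says separators are inserted "strategically" and does not state the actual placement rule or token.

```python
def insert_separators(values, chunk_size=4, separator="|"):
    """Render a numeric sequence as text with a separator between fixed-size chunks.

    chunk_size and separator are hypothetical choices, not taken from the paper.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    pieces = []
    for index, value in enumerate(values):
        # Start of every chunk except the first gets a separator before it.
        if index > 0 and index % chunk_size == 0:
            pieces.append(separator)
        pieces.append(str(value))
    return " ".join(pieces)


prompt_fragment = insert_separators(list(range(1, 11)), chunk_size=4)
# prompt_fragment == "1 2 3 4 | 5 6 7 8 | 9 10"
```

The resulting string would be placed into the prompt in place of the raw sequence. No model weights change, which is what makes this kind of approach training-free and plug-and-play.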

[16] Beyond Social Pressure: Benchmarking Epistemic Attack in Large Language Models

Steven Au, Sujit Noronha

Main category: cs.CL

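TL;DR: This paper introduces PPT-Bench, a benchmark for evaluating how vulnerable large language models are to "epistemic attack" under four types of philosophical pressure (epistemic destabilization, value nullification, authority inversion, identity dissolution). It reveals inconsistency patterns that differ from those found by conventional social-pressure tests, along with targeted mitigation strategies.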

Details

Motivation: Prior work on LLM sycophancy has focused on disagreement, flattery, and preference alignment, overlooking a broader set of epistemic failures. This paper systematically evaluates how vulnerable LLMs are when challenged at the level of knowledge, values, and identity.

Method: Build the diagnostic benchmark PPT-Bench on the Philosophical Pressure Taxonomy (PPT), covering four pressure types, and test five models at three layers (L0 baseline, L1 single-turn pressure, L2 multi-turn Socratic escalation). Several mitigation strategies are compared, including anchoring prompts, persona-stability prompts, and contrastive decoding.

Result: The four pressure types produce statistically separable inconsistency patterns. Mitigation effectiveness depends strongly on pressure type and model architecture: prompt engineering works better for API models, while Leading Query Contrastive Decoding is the most robust for open models.

Conclusion: Epistemic attack exposes deeper LLM weaknesses that standard social-pressure benchmarks do not capture; mitigations should be chosen according to pressure type and deployment setting.

Abstract: Large language models (LLMs) can shift their answers under pressure in ways that reflect accommodation rather than reasoning. Prior work on sycophancy has focused mainly on disagreement, flattery, and preference alignment, leaving a broader set of epistemic failures less explored. We introduce PPT-Bench, a diagnostic benchmark for evaluating epistemic attack, where prompts challenge the legitimacy of knowledge, values, or identity rather than simply opposing a previous answer. PPT-Bench is organized around the Philosophical Pressure Taxonomy (PPT), which defines four types of philosophical pressure: Epistemic Destabilization, Value Nullification, Authority Inversion, and Identity Dissolution. Each item is tested at three layers: a baseline prompt (L0), a single-turn pressure condition (L1), and a multi-turn Socratic escalation (L2). This allows us to measure epistemic inconsistency between L0 and L1, and conversational capitulation in L2. Across five models, these pressure types produce statistically separable inconsistency patterns, suggesting that epistemic attack exposes weaknesses not captured by standard social-pressure benchmarks. Mitigation results are strongly type- and model-dependent: prompt-level anchoring and persona-stability prompts perform best in API settings, while Leading Query Contrastive Decoding is the most reliable intervention for open models.

[17] An Empirical Analysis of Static Analysis Methods for Detection and Mitigation of Code Library Hallucinations

Clarissa Miranda-Pena, Andrew Reeson, Cécile Paris, Josiah Poon, Jonathan K. Kummerfeld

Main category: cs.CL

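TL;DR: This paper studies the potential and limits of static analysis tools for detecting and mitigating hallucinations in LLM-generated code, especially library-related ones. The tools detect 16-70% of all errors and 14-85% of library hallucinations, but face an inherent ceiling (48.5% to 77%), so they are cheap and useful yet cannot fully solve the problem.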

Details

Motivation: Large language models frequently hallucinate when generating code that calls libraries (for example, calling APIs that do not exist), so low-cost, scalable detection and mitigation methods are needed.

Method: Systematically evaluate how well several static analysis tools detect hallucinations on multiple NL-to-code benchmark datasets, and determine their theoretical detection ceiling through manual analysis.

Result: Static analysis detects 16-70% of all errors and 14-85% of library hallucinations; manual analysis puts its upper bound at 48.5% to 77%.

Conclusion: Static analysis is a cheap and partially effective way to mitigate hallucination, but, constrained by the nature of language modeling and code semantics, it cannot fully solve the problem of LLM code hallucination.

Abstract: Despite extensive research, Large Language Models continue to hallucinate when generating code, particularly when using libraries. On NL-to-code benchmarks that require library use, we find that LLMs generate code that uses non-existent library features in 8.1-40% of responses. One intuitive approach for detection and mitigation of hallucinations is static analysis. In this paper, we analyse the potential of static analysis tools, both in terms of what they can solve and what they cannot. We find that static analysis tools can detect 16-70% of all errors, and 14-85% of library hallucinations, with performance varying by LLM and dataset. Through manual analysis, we identify cases a static method could not plausibly catch, which gives an upper bound on their potential from 48.5% to 77%. Overall, we show that static analysis methods are a cheap method for addressing some forms of hallucination, and we quantify how far short of solving the problem they will always be.
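The kind of check that catches a "non-existent library feature" can be sketched with Python's standard `ast` and `importlib` modules. The generated code is only parsed, never executed, while the referenced library is imported so its attributes can be looked up. This is a deliberately narrow illustration and not one of the tools evaluated in the paper: the function name is hypothetical, and it handles only plain `import module` statements followed by `module.attribute` references.

```python
import ast
import importlib


def find_missing_attributes(source):
    """Return (module, attribute) pairs that the source uses but the module lacks.

    Only `import x` / `import x as y` followed by `x.attr` is handled;
    `from x import y`, nested attributes, and dynamic access are ignored.
    """
    tree = ast.parse(source)

    # Map each locally bound name to the top-level module it refers to.
    bound_modules = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.asname:
                    bound_modules[alias.asname] = alias.name
                else:
                    top_level = alias.name.split(".")[0]
                    bound_modules[top_level] = top_level

    missing = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Attribute) and isinstance(node.value, ast.Name):
            module_name = bound_modules.get(node.value.id)
            if module_name is None:
                continue  # not a reference to an imported module
            try:
                module = importlib.import_module(module_name)
            except ImportError:
                missing.append((module_name, None))  # the library itself is absent
                continue
            if not hasattr(module, node.attr):
                missing.append((module_name, node.attr))
    return missing


# Toy "generated" snippet; math.fast_sqrt is an invented, non-existent function.
generated = (
    "import math\n"
    "import json\n"
    "print(math.sqrt(2))\n"
    "print(math.fast_sqrt(2))\n"
    "print(json.dumps({}))\n"
)
problems = find_missing_attributes(generated)
```

Here `problems` contains the single pair `("math", "fast_sqrt")`. The example also hints at why such checks have a ceiling: a call that exists but is given the wrong arguments, or an attribute reached through a dynamically typed object, passes this check untouched.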

[18] Sensitivity-Positional Co-Localization in GQA Transformers

Manoj Chandrashekar Rao

Main category: cs.CL

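TL;DR: This paper asks whether, in Grouped Query Attention (GQA) Transformers, the layers most sensitive to task correctness coincide with the layers where positional encoding (RoPE) adaptation matters most. It finds strong anti-localization (late layers for task correctness, early layers for RoPE influence), yet applying both adaptation methods (LSLoRA and GARFA) to the task-sensitive layers gives the best performance gain.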

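Details

Motivation: Determine whether the layers of a GQA model that are most sensitive to task correctness are the same layers where positional encoding (RoPE) adaptation is most effective (the "co-localization hypothesis"), in order to guide more efficient, lower-overhead adaptation strategies.

Method: Propose LSLoRA, which selects task-sensitive layers with a correctness-differential hidden-state metric and restricts LoRA adaptation to them, and GARFA, which adds 8 learnable RoPE frequency scaling factors, one per KV head, to each targeted layer. Layer sensitivity analysis and cross-layer ablations are run on Llama 3.1 8B (32 layers, 4:1 query-to-KV head ratio).

Result: Task-sensitive layers concentrate in the late network (layers 23-31) and RoPE-influential layers in the early network (layers 0-9), with Spearman r_s = -0.735 (p = 1.66×10⁻⁶), confirming strong anti-localization. Even so, applying LSLoRA and GARFA together to the task-sensitive layers leads across six benchmarks (by 4-16 percentage points), reaching 67.1% on HumanEval+, close to Claude 3.5 Haiku (68.3%), at a total compute cost of only 100 US dollars.

Conclusion: The co-localization hypothesis does not hold: task sensitivity and RoPE influence are distributed in opposite parts of a GQA model. Nevertheless, focusing both structural adaptation (parameter fine-tuning) and positional encoding adaptation on the task-sensitive layers yields complementary gains, suggesting a new approach to efficient lightweight adaptation.

Abstract: We investigate a fundamental structural question in Grouped Query Attention (GQA) transformers: do the layers most sensitive to task correctness coincide with the layers where positional encoding adaptation has the greatest leverage? We term this the co-localization hypothesis and test it on Llama 3.1 8B, a 32-layer GQA model with a 4:1 query-to-key-value head ratio. We introduce LSLoRA, which restricts LoRA adaptation to layers identified via a novel correctness-differential hidden-state metric, and GARFA (GQA-Aware RoPE Frequency Adaptation), which attaches 8 learnable per-KV-head scalar multipliers to each targeted layer. Contrary to the co-localization hypothesis, we discover strong anti-localization: task-sensitive layers concentrate in the late network ($\ell \in \{23, \dots, 31\}$) while RoPE-influential layers dominate the early network ($\ell \in \{0, \dots, 9\}$), yielding Spearman $r_s = -0.735$ ($p = 1.66\times10^{-6}$). Despite this anti-localization, a 4-way cross-layer ablation shows that applying both interventions to the sensitivity-identified layers outperforms all alternative configurations by 4-16 percentage points across six diverse benchmarks (MMLU, GPQA, HumanEval+, MATH, MGSM, ARC), approaching Claude 3.5 Haiku on HumanEval+ (67.1% vs. 68.3%) at \$100 total compute cost.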
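The anti-localization claim rests on a Spearman rank correlation between two per-layer scores. As a reminder of what that statistic measures, the sketch below computes it with the textbook formula r_s = 1 - 6 Σd² / (n(n² - 1)), which is valid when there are no tied values. The five-layer scores are invented toy numbers, not data from the paper, and the function names are hypothetical.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation for tie-free data of equal length."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    squared_rank_diffs = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * squared_rank_diffs / (n * (n * n - 1))


# Toy per-layer scores, ordered from early to late layers (assumed values).
task_sensitivity = [0.1, 0.2, 0.4, 0.7, 0.9]  # grows toward late layers
rope_influence = [0.8, 0.9, 0.5, 0.2, 0.1]    # concentrated in early layers

r_s = spearman(task_sensitivity, rope_influence)
```

For these toy scores r_s comes out at about -0.9: layers that rank high on one measure rank low on the other. The paper's reported value of -0.735 over 32 layers expresses the same pattern, somewhat less cleanly.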

[19] TEMPER: Testing Emotional Perturbation in Quantitative Reasoning

Atahan Dokme,Benjamin Reichman,Larry Heck

Main category: cs.CL

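TL;DR: This paper studies how emotional phrasing affects the numerical reasoning of large language models. Emotional framing lowers accuracy by 2-10 percentage points even when the numerical content is unchanged, and rewriting the prompt in neutral language recovers most of the loss, which indicates that the problem stems from emotional style rather than corrupted content.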

Details Motivation: Real-world queries often carry emotion (frustration, urgency, or excitement), whereas LLMs are mostly trained and evaluated on quantitative reasoning tasks written in clean, neutral language, so it is worth testing whether emotional framing alone harms reasoning. Method: Builds a controlled emotion translation framework that rewrites math problems into emotional variants while strictly preserving all quantities and logical relations; uses it to create Temper-5400, a dataset of 5,400 semantically verified emotion-neutral pairs covering GSM8K, MultiArith, and ARC-Challenge; evaluates 18 models of different sizes. Result: Emotional framing lowers accuracy by 2-10 percentage points; neutralization recovers most of the lost performance; non-emotional paraphrases cause no such drop; the procedure also generalizes into a framework for controlled style transfer and robustness evaluation. Conclusion: Emotional style alone can interfere with the quantitative reasoning of LLMs, and the interference is reversible, which suggests that lightweight neutralization at inference time can improve robustness; the work also offers a scalable paradigm for style-controlled evaluation. Abstract: Large language models are trained and evaluated on quantitative reasoning tasks written in clean, emotionally neutral language. However, real-world queries are often wrapped in frustration, urgency or enthusiasm. Does emotional framing alone degrade reasoning when all numerical content is preserved? To investigate this, a controlled emotion translation framework is developed that rewrites problems into emotional variants while preserving all quantities and relationships. Using this framework, Temper-5400 (5,400 semantically verified emotion--neutral pairs) is constructed across GSM8K, MultiArith, and ARC-Challenge, and evaluated on eighteen models (1B to frontier scale). Two core results emerge: First, emotional framing reduces accuracy by 2-10 percentage points even though all numerical content is preserved. Second, neutralizing emotional variants recovers most of the lost performance, showing both that the degradation is tied to emotional style rather than content corruption and that neutralization can serve as a lightweight inference-time mitigation. Non-emotional paraphrases cause no such degradation, implicating emotional content rather than surface-level changes. Beyond emotion specifically, the benchmark construction procedure provides a general framework for controlled stylistic translation and robustness evaluation.

[20] GRASS: Gradient-based Adaptive Layer-wise Importance Sampling for Memory-efficient Large Language Model Fine-tuning

Kaiyuan Tian,Yu Tang,Gongqingjian Jiang,Baihui Liu,Yifu Gao,Xialin Su,Linbo Qiao,Dongsheng Li

Main category: cs.CL

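TL;DR: This paper proposes GRASS, a framework that combines gradient-based adaptive layer-wise importance sampling with layer-wise optimizer state offloading to reduce memory consumption while improving fine-tuning performance.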

Details Motivation: Existing low-rank adaptation methods limit model expressiveness, and layer-wise fine-tuning methods ignore how layer importance changes across tasks and training stages, which leads to suboptimal downstream performance. Method: GRASS uses mean gradient norms as a task-aware and training-stage-aware measure of layer importance and dynamically adjusts layer sampling probabilities through an adaptive training strategy; it also introduces a layer-wise optimizer state offloading mechanism that overlaps computation and communication to save further memory. Result: Across multiple models and benchmarks, GRASS improves average accuracy by up to 4.38 points and reduces memory usage by up to 19.97%. Conclusion: GRASS improves both the performance and the memory efficiency of memory-efficient fine-tuning while maintaining comparable training throughput, outperforming current state-of-the-art methods. Abstract: Full-parameter fine-tuning of large language models is constrained by substantial GPU memory requirements. Low-rank adaptation methods mitigate this challenge by updating only a subset of parameters. However, these approaches often limit model expressiveness and yield lower performance than full-parameter fine-tuning. Layer-wise fine-tuning methods have emerged as an alternative, enabling memory-efficient training through static layer importance sampling strategies. However, these methods overlook variations in layer importance across tasks and training stages, resulting in suboptimal performance on downstream tasks. To address these limitations, we propose GRASS, a gradient-based adaptive layer-wise importance sampling framework. GRASS utilizes mean gradient norms as a task-aware and training-stage-aware metric for estimating layer importance. Furthermore, GRASS adaptively adjusts layer sampling probabilities through an adaptive training strategy. We also introduce a layer-wise optimizer state offloading mechanism that overlaps computation and communication to further reduce memory usage while maintaining comparable training throughput. Extensive experiments across multiple models and benchmarks demonstrate that GRASS consistently outperforms state-of-the-art methods, achieving an average accuracy improvement of up to 4.38 points and reducing memory usage by up to 19.97%.
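As a rough illustration of gradient-based layer importance sampling, the sketch below turns per-layer mean gradient norms into sampling probabilities and draws a subset of layers to update. It is not the GRASS algorithm itself: the exponential moving average, the proportional normalization, and the function names are all assumptions made for this example.

```python
import random


def update_importance(importance, grad_norms, momentum=0.9):
    """Blend the newest mean gradient norms into a running importance score.

    The moving average is an assumption of this sketch.
    """
    return [momentum * old + (1 - momentum) * new
            for old, new in zip(importance, grad_norms)]


def sampling_probs(importance, floor=1e-8):
    """Normalize importance scores into probabilities (floor avoids all zeros)."""
    total = sum(v + floor for v in importance)
    return [(v + floor) / total for v in importance]


def sample_layers(probs, k, rng):
    """Draw k distinct layer indices, weighted by probs, without replacement."""
    remaining = list(range(len(probs)))
    weights = list(probs)
    chosen = []
    for _ in range(k):
        idx = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        chosen.append(remaining.pop(idx))
        weights.pop(idx)
    return sorted(chosen)


# Hypothetical mean gradient norms for a 4-layer model at two training steps.
importance = [0.0, 0.0, 0.0, 0.0]
for step_norms in ([0.2, 1.0, 0.4, 0.4], [0.1, 1.2, 0.5, 0.2]):
    importance = update_importance(importance, step_norms)

probs = sampling_probs(importance)
layers_to_train = sample_layers(probs, k=2, rng=random.Random(0))
```

Because the importance scores are refreshed during training, the sampling distribution can shift as the task or training stage changes, which is the property the summary contrasts with static sampling strategies.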

[21] AsyncTLS: Efficient Generative LLM Inference with Asynchronous Two-level Sparse Attention

Yuxuan Hu,Jianchao Tan,Jiaqi Zhang,Wen Zan,Pingwei Sun,Yifan Lu,Yerui Sun,Yuchen Xie,Xunliang Cai,Jing Zhang

Main category: cs.CL

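TL;DR: This paper proposes AsyncTLS, a hierarchical sparse attention mechanism that combines coarse-grained block filtering with fine-grained token selection and exploits temporal locality to offload the KV cache asynchronously, improving long-context inference efficiency while keeping accuracy comparable to full attention.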

Details Motivation: Long-context inference faces two challenges, quadratic attention complexity and large KV cache memory overhead, and existing sparse attention methods struggle to balance accuracy and efficiency. Method: Proposes AsyncTLS with two parts: (1) hierarchical sparse attention, which first filters coarsely at the block level and then selects at the token level; (2) an asynchronous offloading engine that overlaps KV cache transfers with computation and exploits temporal locality. Result: Evaluated on Qwen3 and GLM-4.7-Flash (GQA and MLA architectures) with 48k-96k contexts, accuracy is close to full attention, with 1.2x-10.0x operator speedups and 1.3x-4.7x end-to-end throughput improvements. Conclusion: AsyncTLS balances accuracy and efficiency in long-context inference and offers a practical system-level optimization for deploying large models. Abstract: Long-context inference in LLMs faces the dual challenges of quadratic attention complexity and prohibitive KV cache memory. While token-level sparse attention offers superior accuracy, its indexing overhead is costly; block-level methods improve efficiency but sacrifice precision. We propose AsyncTLS, a hierarchical sparse attention system that combines coarse-grained block filtering with fine-grained token selection to balance accuracy and efficiency, coupled with an asynchronous offloading engine that overlaps KV cache transfers with computation via temporal locality exploitation. Evaluated on Qwen3 and GLM-4.7-Flash across GQA and MLA architectures, AsyncTLS achieves accuracy comparable to full attention while delivering 1.2x - 10.0x operator speedups and 1.3x - 4.7x end-to-end throughput improvements on 48k - 96k contexts.
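The two-level idea (coarse block filtering followed by fine token selection) can be sketched in a few lines. This is an illustrative toy, not the AsyncTLS kernel: summarizing each block by its mean key vector, scoring with plain dot products, and the function name `two_level_select` are assumptions of this example, and the asynchronous offloading engine is not modeled.

```python
def dot(a, b):
    """Dot product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))


def two_level_select(query, keys, block_size, top_blocks, top_tokens):
    """Pick token indices in two stages: best blocks first, then best tokens."""
    # Stage 0: partition token positions into contiguous blocks.
    blocks = [list(range(start, min(start + block_size, len(keys))))
              for start in range(0, len(keys), block_size)]

    # Stage 1 (coarse): score each block by the query's dot product with the
    # block's mean key (an assumed summary), and keep the top blocks.
    def block_summary(indices):
        return [sum(keys[i][d] for i in indices) / len(indices)
                for d in range(len(query))]

    block_scores = [dot(query, block_summary(b)) for b in blocks]
    kept = sorted(range(len(blocks)), key=lambda b: block_scores[b],
                  reverse=True)[:top_blocks]

    # Stage 2 (fine): score individual tokens, but only inside kept blocks.
    candidates = [i for b in kept for i in blocks[b]]
    candidates.sort(key=lambda i: dot(query, keys[i]), reverse=True)
    return sorted(candidates[:top_tokens])


# Toy keys: 8 tokens of dimension 2, split into two blocks of 4.
keys = [[0.1, 0.0], [0.2, 0.0], [0.0, 1.0], [0.0, 1.0],
        [1.0, 0.0], [0.9, 0.0], [0.1, 0.0], [0.8, 0.0]]
selected = two_level_select([1.0, 0.0], keys, block_size=4,
                            top_blocks=1, top_tokens=2)
```

Token-level scoring only runs over the tokens of the surviving blocks, which is where the savings over scoring every token come from.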

[22] Tool Retrieval Bridge: Aligning Vague Instructions with Retriever Preferences via Bridge Model

Kunfeng Chen,Luyao Zhuang,Fei Liao,Juhua Liu,Jian Wang,Bo Du

Main category: cs.CL

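TL;DR: This paper proposes Tool Retrieval Bridge (TRB), which uses a bridge model to rewrite vague instructions into more specific ones and thereby improves tool retrieval under realistic vague instructions; it also builds a new benchmark, VGToolBench, for evaluation.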

Details Motivation: Existing tool retrieval methods rely on academic benchmarks that contain detailed API information, whereas real user instructions are often vague, which degrades performance in practice. Method: Builds a new benchmark, VGToolBench, to simulate vague instructions; proposes TRB, which uses a bridge model to rewrite vague instructions and narrow the gap between them and retriever preferences. Result: TRB improves performance substantially across multiple retrieval settings; for example, with BM25 the average NDCG score rises from 9.73 to 19.59, with a reported relative improvement of up to 111.51%. Conclusion: TRB is a simple yet effective method that mitigates the ambiguity of vague instructions and delivers consistent, substantial gains across all baseline retrievers. Abstract: Tool learning has emerged as a promising paradigm for large language models (LLMs) to address real-world challenges. Due to the extensive and irregularly updated number of tools, tool retrieval for selecting the desired tool subset is essential. However, current tool retrieval methods are usually based on academic benchmarks containing overly detailed instructions (e.g., specific API names and parameters), while real-world instructions are more vague. Such a discrepancy would hinder the tool retrieval in real-world applications. In this paper, we first construct a new benchmark, VGToolBench, to simulate human vague instructions. Based on this, we conduct a series of preliminary analyses and find that vague instructions indeed damage the performance of tool retrieval. To this end, we propose a simple-yet-effective Tool Retrieval Bridge (TRB) approach to boost the performance of tool retrieval for vague instructions. The principle of TRB is to introduce a bridge model to rewrite the vague instructions into more specific ones and alleviate the gap between vague instructions and retriever preferences. We conduct extensive experiments under multiple commonly used retrieval settings, and the results show that TRB effectively mitigates the ambiguity of vague instructions while delivering consistent and substantial improvements across all baseline retrievers. For example, with the help of TRB, BM25 achieves a relative improvement of up to 111.51%, i.e., increasing the average NDCG score from 9.73 to 19.59. The source code and models are publicly available at https://github.com/kfchenhn/TRB.
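Since the headline result is reported in NDCG, a minimal sketch of that metric may help when reading the numbers. It uses the common linear-gain form with a log2 position discount and, as a simplifying assumption, computes the ideal ordering from the retrieved list alone; the paper's exact evaluation script may differ.

```python
import math


def dcg(relevances, k):
    """Discounted cumulative gain of the first k results (linear gain)."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg(relevances, k):
    """DCG divided by the DCG of the ideal ordering of the same results."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical relevance labels for three retrieved tools, in ranked order:
# the two relevant tools come back at ranks 2 and 3 instead of 1 and 2.
vague_query_score = ndcg([0, 1, 1], k=3)
rewritten_query_score = ndcg([1, 1, 0], k=3)
```

A rewrite that moves the relevant tools to the top of the ranking raises NDCG toward 1 (or 100 on the percentage scale used in the summary).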

[23] Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers

Harsh Kohli,Srinivasan Parthasarathy,Huan Sun,Yuekun Yao

Main category: cs.CL

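TL;DR: This paper studies implicit reasoning and shows that recurrent-depth transformers improve systematic generalization and depth extrapolation; it also uncovers a three-stage grokking process and an "overthinking" limitation.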

Details Motivation: Transformer-based LLMs store large amounts of knowledge and rules but lack compositional generalization over their parametric knowledge in implicit multi-hop reasoning. Method: Studies recurrent-depth transformers, which strengthen implicit reasoning by iterating computation over the same transformer layers; defines two challenge tasks, systematic generalization and depth extrapolation; runs controlled experiments with models trained from scratch, plus mechanistic analysis. Result: Recurrent-depth transformers substantially improve both systematic generalization and depth extrapolation; systematic generalization emerges through a three-stage grokking process; depth extrapolation can be unlocked by increasing the number of recurrent iterations at inference time, but "overthinking" limits very deep reasoning. Conclusion: Recurrent-depth architectures are an effective route to compositional generalization in implicit reasoning, but the amount of recurrence has to be balanced to avoid degraded predictions. Abstract: We study implicit reasoning, i.e. the ability to combine knowledge or rules within a single forward pass. While transformer-based large language models store substantial factual knowledge and rules, they often fail to compose this knowledge for implicit multi-hop reasoning, suggesting a lack of compositional generalization over their parametric knowledge. To address this limitation, we study recurrent-depth transformers, which enable iterative computation over the same transformer layers. We investigate two compositional generalization challenges under the implicit reasoning scenario: systematic generalization, i.e. combining knowledge that is never used for compositions during training, and depth extrapolation, i.e. generalizing from limited reasoning depth (e.g. training on up to 5-hop) to deeper compositions (e.g. 10-hop). Through controlled studies with models trained from scratch, we show that while vanilla transformers struggle with both generalization challenges, recurrent-depth transformers can effectively make such generalization. For systematic generalization, we find that this ability emerges through a three-stage grokking process, transitioning from memorization to in-distribution generalization and finally to systematic generalization, supported by mechanistic analysis. For depth extrapolation, we show that generalization beyond training depth can be unlocked by scaling inference-time recurrence, with more iterations enabling deeper reasoning. We further study how training strategies affect extrapolation, providing guidance on training recurrent-depth transformers, and identify a key limitation, overthinking, where excessive recurrence degrades predictions and limits generalization to very deep compositions.

[24] Why Are We Lonely? Leveraging LLMs to Measure and Understand Loneliness in Caregivers and Non-caregivers

Michelle Damin Kim,Ellie S. Paek,Yufen Lin,Emily Mroz,Jane Chung,Jinho D. Choi

Main category: cs.CL

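TL;DR: This paper presents an LLM-based approach that uses GPT-4o, GPT-5-nano, and GPT-5 to build a high-quality Reddit dataset and, together with an expert-designed loneliness evaluation framework and a typology of causes, analyzes the substantial differences between caregivers and non-caregivers in how loneliness manifests and what causes it.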

Details Motivation: Measuring and comparing loneliness among caregivers and non-caregivers on social media requires a high-quality, diverse dataset and an interpretable, verifiable framework for evaluation and attribution. Method: Builds an expert-developed loneliness evaluation framework and a typology of causes; designs and manually validates a data processing pipeline; applies GPT-4o, GPT-5-nano, and GPT-5 to extract and annotate Reddit data; evaluates loneliness and categorizes causes, then extracts demographic features and compares the two populations. Result: The loneliness evaluation framework reaches average accuracies of 76.09% (caregivers) and 79.78% (non-caregivers); the cause categorization framework reaches micro-aggregate F1 scores of 0.825 and 0.80, respectively; the distribution of causes differs substantially between the groups, with caregivers' loneliness linked mainly to caregiving roles, identity recognition, and feelings of abandonment; Reddit proves viable for building a diverse caregiver loneliness dataset. Conclusion: The study establishes an LLM-based pipeline for building social media datasets and analyzing loneliness, and shows that it is effective both for producing high-quality data and for revealing population-level differences. Abstract: This paper presents an LLM-driven approach for constructing diverse social media datasets to measure and compare loneliness in the caregiver and non-caregiver populations. We introduce an expert-developed loneliness evaluation framework and an expert-informed typology for categorizing causes of loneliness for analyzing social media text. Using a human-validated data processing pipeline, we apply GPT-4o, GPT-5-nano, and GPT-5 to build a high-quality Reddit corpus and analyze loneliness across both populations. The loneliness evaluation framework achieved average accuracies of 76.09% and 79.78% for caregivers and non-caregivers, respectively. The cause categorization framework achieved micro-aggregate F1 scores of 0.825 and 0.80 for caregivers and non-caregivers, respectively. Across populations, we observe substantial differences in the distribution of types of causes of loneliness. Caregivers' loneliness was predominantly linked to caregiving roles, identity recognition, and feelings of abandonment, indicating distinct loneliness experiences between the two groups. Demographic extraction further demonstrates the viability of Reddit for building a diverse caregiver loneliness dataset. Overall, this work establishes an LLM-based pipeline for creating high quality social media datasets for studying loneliness and demonstrates its effectiveness in analyzing population-level differences in the manifestation of loneliness.

[25] MemReader: From Passive to Active Extraction for Long-Term Agent Memory

Jingyi Kang,Chunyu Li,Ding Chen,Bo Tang,Feiyu Xiong,Zhiyu Li

Main category: cs.CL

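TL;DR: This paper introduces the MemReader family, which uses active long-term memory extraction to address memory pollution and low-value writes in settings with noisy dialogue and cross-turn dependencies. MemReader-0.6B is a lightweight passive extractor; MemReader-4B is optimized with GRPO and can assess information value, defer writes, and retrieve history. The models reach state of the art on several benchmarks and have been integrated into MemOS for real-world use.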

Details Motivation: Existing memory extraction is a one-shot, passive transcription that copes poorly with noisy dialogue, missing references, and cross-turn dependencies, leading to memory pollution, low-value writes, and inconsistency. Method: Proposes the MemReader family: MemReader-0.6B, a compact passive extractor obtained by distillation that produces accurate, schema-consistent structured outputs; and MemReader-4B, an active extractor optimized with Group Relative Policy Optimization (GRPO) that, under a ReAct-style paradigm, explicitly evaluates information value, reference ambiguity, and completeness, and can choose to write, defer, retrieve, or discard. Result: Outperforms existing extraction-based baselines on LOCOMO, LongMemEval, and HaluMem; MemReader-4B reaches state of the art on knowledge updating, temporal reasoning, and hallucination reduction; the models are integrated into MemOS and deployed in real-world settings. Conclusion: Effective long-term agent memory depends less on extracting more information than on reasoning-driven, selective extraction that builds a low-noise, dynamically evolving memory. Abstract: Long-term memory is fundamental for personalized and autonomous agents, yet populating it remains a bottleneck. Existing systems treat memory extraction as a one-shot, passive transcription from context to structured entries, which struggles with noisy dialogue, missing references, and cross-turn dependencies, leading to memory pollution, low-value writes, and inconsistency. In this paper, we introduce the MemReader family for active long-term memory extraction in agent systems: MemReader-0.6B, a compact and cost-efficient passive extractor distilled for accurate and schema-consistent structured outputs, and MemReader-4B, an active extractor optimized with Group Relative Policy Optimization (GRPO) to make memory writing decisions. Under a ReAct-style paradigm, MemReader-4B explicitly evaluates information value, reference ambiguity, and completeness before acting, and can selectively write memories, defer incomplete inputs, retrieve historical context, or discard irrelevant chatter. Experiments on LOCOMO, LongMemEval, and HaluMem show that MemReader consistently outperforms existing extraction-based baselines. In particular, MemReader-4B achieves state-of-the-art performance on tasks involving knowledge updating, temporal reasoning, and hallucination reduction. These results suggest that effective agent memory requires not merely extracting more information, but performing reasoning-driven and selective memory extraction to build low-noise and dynamically evolving long-term memory. Furthermore, MemReader has been integrated into MemOS and is being deployed in real-world applications. To support future research and adoption, we release the models and provide public API access.

[26] Contextualising (Im)plausible Events Triggers Figurative Language

Annerose Eichel,Tonmoy Rakshit,Sabine Schulte im Walde

Main category: cs.CL

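TL;DR: This paper examines the relationship between (non-)literalness and plausibility in English subject-verb-object events. Using a systematic design of plausible and implausible event triples with abstract and concrete constituent categories, it compares human and LLM judgments: humans distinguish (non-)literal from implausible events in a nuanced, context-sensitive way, whereas LLMs contextualize only shallowly and tend to reinterpret implausible events as non-literal but plausible.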

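Details Motivation: Explore the relationship between (non-)literalness and event plausibility, and reveal how humans and LLMs differ in assessing plausibility. Method: Builds a systematic setup of plausible and implausible event triples combined with abstract and concrete constituent categories, then collects and analyzes human judgments, LLM-generated judgments, and example contexts. Result: Humans distinguish (non-)literal from implausible events with nuance and contextualize them deeply; LLMs show only shallow contextualization and a bias toward reading implausible events as non-literal but plausible. Conclusion: Current LLMs lack human-level semantic and contextual understanding when judging event plausibility, in particular when separating literalness from plausibility, which points to a limitation in their semantic modeling. Abstract: This work explores the connection between (non-)literalness and plausibility using the example of subject-verb-object events in English. We design a systematic setup of plausible and implausible event triples in combination with abstract and concrete constituent categories. Our analysis of human and LLM-generated judgments and example contexts reveals substantial differences between assessments of plausibility. While humans excel at nuanced detection and contextualization of (non-)literal vs. implausible events, LLM results reveal only shallow contextualization patterns with a bias to trade implausibility for non-literal, plausible interpretations.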

[27] Linear Representations of Hierarchical Concepts in Language Models

Masaki Sakata,Benjamin Heinzerling,Takumi Ito,Sho Yokoi,Kentaro Inui

Main category: cs.CL

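TL;DR: This paper studies how language models encode concept hierarchies. It finds that hierarchical information is represented in a highly interpretable linear form, concentrated in low-dimensional, domain-specific subspaces, and that these subspaces are highly similar across domains.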

Details Motivation: Determine how and to what extent hierarchical relations (for example, geographic containment) are encoded in the internal representations of language models, covering multi-token entities and cross-layer representations that earlier work left out. Method: Building on Linear Relational Concepts, trains linear transformations specific to each hierarchical depth and semantic domain, characterizes the representational differences tied to hierarchical relations by comparing these transformations, and evaluates in-domain generalization and cross-domain transfer. Result: Within a domain, hierarchical relations can be linearly recovered from model representations; the hierarchical information sits in a relatively low-dimensional, domain-specific subspace; hierarchy representation is highly similar across these subspaces. Conclusion: All models studied encode concept hierarchies as highly interpretable linear representations. Abstract: We investigate how and to what extent hierarchical relations (e.g., Japan $\subset$ Eastern Asia $\subset$ Asia) are encoded in the internal representations of language models. Building on Linear Relational Concepts, we train linear transformations specific to each hierarchical depth and semantic domain, and characterize representational differences associated with hierarchical relations by comparing these transformations. Going beyond prior work on the representational geometry of hierarchies in LMs, our analysis covers multi-token entities and cross-layer representations. Across multiple domains we learn such transformations and evaluate in-domain generalization to unseen data and cross-domain transfer. Experiments show that, within a domain, hierarchical relations can be linearly recovered from model representations. We then analyze how hierarchical information is encoded in representation space. We find that it is encoded in a relatively low-dimensional subspace and that this subspace tends to be domain-specific. Our main result is that hierarchy representation is highly similar across these domain-specific subspaces. Overall, we find that all models considered in our experiments encode concept hierarchies in the form of highly interpretable linear representations.

[28] Data Selection for Multi-turn Dialogue Instruction Tuning

Bo Li,Shikun Zhang,Wei Ye

Main category: cs.CL

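TL;DR: This paper proposes MDS (Multi-turn Dialogue Selection), a data selection framework that improves the quality of training data for instruction-tuned language models by scoring and filtering whole conversations in two stages: global coverage and local structure.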

Details Motivation: The large multi-turn dialogue corpora that instruction-tuned language models depend on are noisy and structurally inconsistent, with topic drift, repetitive chitchat, and mismatched answer formats. Method: Proposes the MDS framework with two stages: (1) a global coverage stage that selects representative, non-redundant dialogues bin by bin in the user-query trajectory space; (2) a local structural stage that assesses within-dialogue reliability and functional alignment through entity-grounded topic coherence, information progress, and query-answer form consistency. Result: MDS outperforms strong single-turn selectors, dialogue-level LLM scorers, and heuristic baselines on three multi-turn benchmarks and an in-domain Banking test set, achieves the best overall rank on both reference-free and reference-based metrics, and is more robust on long conversations under the same training budget. Conclusion: Dialogue-level data selection needs to account for both global representativeness and local structure; MDS offers an effective, scalable way to build high-quality multi-turn dialogue data. Abstract: Instruction-tuned language models increasingly rely on large multi-turn dialogue corpora, but these datasets are often noisy and structurally inconsistent, with topic drift, repetitive chitchat, and mismatched answer formats across turns. We address this from a data selection perspective and propose MDS (Multi-turn Dialogue Selection), a dialogue-level framework that scores whole conversations rather than isolated turns. MDS combines a global coverage stage that performs bin-wise selection in the user-query trajectory space to retain representative yet non-redundant dialogues, with a local structural stage that evaluates within-dialogue reliability through entity-grounded topic grounding and information progress, together with query-answer form consistency for functional alignment. MDS outperforms strong single-turn selectors, dialogue-level LLM scorers, and heuristic baselines on three multi-turn benchmarks and an in-domain Banking test set, achieving the best overall rank across reference-free and reference-based metrics, and is more robust on long conversations under the same training budget. Code and resources are included in the supplementary materials.

[29] TSUBASA: Improving Long-Horizon Personalization via Evolving Memory and Self-Learning with Context Distillation

Xinliang Frederick Zhang,Lu Wang

Main category: cs.CL

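TL;DR: This paper proposes TSUBASA, which improves how personalized LLMs write and read memory in long-horizon tasks through dynamic memory evolution and self-learning driven by context distillation. On the Qwen-3 model family it surpasses baselines such as Mem0 and achieves Pareto improvements in quality and efficiency.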

Details Motivation: Personalized LLMs fall short on long-horizon tasks such as tracking a long history of conversations or activities: conventional memory mechanisms fail to capture evolving behavior, RAG is constrained by a quality-efficiency tradeoff, and parametric adaptation suffers from a train-inference gap caused by scarce labeled data. Method: Proposes TSUBASA, a two-pronged framework: (1) dynamic memory evolution, which improves memory writing; (2) self-learning with a context distillation objective, which improves memory reading by letting the model internalize user experiences. Result: On several long-horizon benchmarks, TSUBASA built on Qwen-3 (4B to 32B) surpasses memory-augmented systems that rely mainly on memory writing, such as Mem0 and Memory-R1, and improves personalization fidelity while using fewer tokens. Conclusion: TSUBASA breaks the quality-efficiency tradeoff and achieves Pareto improvements, offering a scalable, high-fidelity approach to long-horizon modeling for PLLMs. Abstract: Personalized large language models (PLLMs) have garnered significant attention for their ability to align outputs with individuals' needs and preferences. However, they still struggle with long-horizon tasks, such as tracking a user's extensive history of conversations or activities. Existing memory mechanisms often fail to capture evolving behaviors, and RAG paradigms are trapped by a quality-efficiency tradeoff. Meanwhile, parametric adaptation is bottlenecked by train-inference gap due to the scarcity of labeled data. To enhance the long-horizon capabilities of PLLMs, we introduce TSUBASA, a two-pronged approach designed to improve memory writing via dynamic memory evolution, and memory reading via self-learning with a context distillation objective to internalize user experiences. Extensive evaluations on long-horizon benchmarks using the Qwen-3 model family (4B to 32B) validate the effectiveness of TSUBASA, surpassing competitive memory-augmented systems that rely primarily on memory writing, such as Mem0 and Memory-R1. Our analyses further confirm that TSUBASA breaks the quality-efficiency barrier to achieve Pareto improvements, delivering robust, high-fidelity personalization with a reduced token budget.

[30] HCRE: LLM-based Hierarchical Classification for Cross-Document Relation Extraction with a Prediction-then-Verification Strategy

Guoqi Ma,Liang Zhang,Hongyao Tu,Hao Fu,Hui Li,Yujie Lin,Longyue Wang,Weihua Luo,Jinsong Su

Main category: cs.CL

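TL;DR: This paper explores large language models (LLMs) for cross-document relation extraction (RE) and finds that applying them directly is limited by the classification difficulty of a large predefined relation set. It proposes HCRE, a hierarchical classification model that combines a hierarchical relation tree with a prediction-then-verification inference strategy and outperforms existing baselines.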

Details Motivation: Methods based on small language models (SLMs) are limited by their language understanding ability, yet preliminary experiments show that LLMs do not consistently surpass SLMs on cross-document RE, mainly because the many predefined relation classes make classification hard; a better way of adapting LLMs is needed. Method: Proposes HCRE: (1) build a hierarchical relation tree from the predefined relation set; (2) use an LLM to predict the relation level by level; (3) apply a prediction-then-verification strategy with multi-view verification at each level to limit error propagation. Result: HCRE outperforms existing SLM and LLM baselines on multiple cross-document RE benchmarks, supporting the value of hierarchical classification and verification. Conclusion: The potential of LLMs for cross-document RE has not been fully realized, and the key is to reduce the inference burden of fine-grained, many-class relations; HCRE does so through a structured relation representation and a robust inference strategy. Abstract: Cross-document relation extraction (RE) aims to identify relations between the head and tail entities located in different documents. Existing approaches typically adopt the paradigm of "Small Language Model (SLM) + Classifier". However, the limited language understanding ability of SLMs hinders further improvement of their performance. In this paper, we conduct a preliminary study to explore the performance of Large Language Models (LLMs) in cross-document RE. Despite their extensive parameters, our findings indicate that LLMs do not consistently surpass existing SLMs. Further analysis suggests that the underperformance is largely attributed to the challenges posed by the numerous predefined relations. To overcome this issue, we propose an LLM-based Hierarchical Classification model for cross-document RE (HCRE), which consists of two core components: 1) an LLM for relation prediction and 2) a hierarchical relation tree derived from the predefined relation set. This tree enables the LLM to perform hierarchical classification, where the target relation is inferred level by level. Since the number of child nodes is much smaller than the size of the entire predefined relation set, the hierarchical relation tree significantly reduces the number of relation options that LLM needs to consider during inference. However, hierarchical classification introduces the risk of error propagation across levels. To mitigate this, we propose a prediction-then-verification inference strategy that improves prediction reliability through multi-view verification at each level. Extensive experiments show that HCRE outperforms existing baselines, validating its effectiveness.
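The level-by-level inference with a verification step can be pictured as a simple tree walk. The sketch below is only a schematic: the relation tree, the scores, and the `rank_children` and `verify` callables are hypothetical stand-ins for LLM calls, and falling back to the top-ranked child when nothing passes verification is an assumption of this example rather than the paper's procedure.

```python
def classify(tree, root, rank_children, verify):
    """Walk the relation tree from the root, choosing one child per level.

    rank_children(node, children) returns the children ordered best first
    (the prediction step); verify(node, child) accepts or rejects a candidate
    (the verification step). Both are placeholders for model calls.
    """
    node = root
    path = [root]
    while tree.get(node):
        ranked = rank_children(node, tree[node])
        # Take the best-ranked child that passes verification; if none does,
        # fall back to the top prediction (an assumption of this sketch).
        chosen = next((c for c in ranked if verify(node, c)), ranked[0])
        path.append(chosen)
        node = chosen
    return path


# Hypothetical two-level relation tree and stubbed model behaviour.
tree = {
    "root": ["person_relation", "organization_relation"],
    "person_relation": ["spouse", "sibling"],
    "organization_relation": ["founded_by", "subsidiary"],
}
scores = {"person_relation": 0.6, "organization_relation": 0.4,
          "spouse": 0.5, "sibling": 0.5,
          "founded_by": 0.7, "subsidiary": 0.3}
rejected = {"person_relation"}  # pretend verification rejects this branch


def rank_children(node, children):
    return sorted(children, key=lambda c: scores[c], reverse=True)


def verify(node, child):
    return child not in rejected


path = classify(tree, "root", rank_children, verify)
```

At each level the model only compares the children of the current node, not the whole relation set, and the verification step gives a chance to correct a wrong top prediction before it propagates to deeper levels.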

[31] Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning

Shiwan Zhao,Zhihu Wang,Xuyang Zhao,Jiaming Zhou,Caiyue Xu,Chenfei Liu,Liting Zhang,Yuhang Jia,Yanzhe Zhang,Hualong Yu,Zichen Xu,Qicheng Li,Yong Qin

Main category: cs.CL

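TL;DR: This survey proposes a framework for understanding LLM post-training as structured intervention on model behavior. It divides methods by trajectory provenance into off-policy and on-policy learning and interprets them through three roles: effective support expansion, policy reshaping, and behavioral consolidation.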

Details Motivation: Post-training methods (SFT, preference optimization, RL, and others) are usually discussed in a fragmented way, grouped by objective or label, without a systematic account of the behavioral bottlenecks they address. Method: Proposes an analysis framework whose first axis is trajectory provenance (off-policy vs. on-policy) and whose second axis is the role of the behavioral intervention (support expansion, policy reshaping, behavioral consolidation), and reads the major post-training paradigms through it. Result: Gives a unified, cross-paradigm reading of SFT, preference learning, RL, distillation, and multi-stage pipelines, and shows how the framework helps diagnose behavioral bottlenecks and design coordinated stages. Conclusion: Progress in LLM post-training increasingly depends on coordinated multi-stage, multi-role system design rather than on optimizing any single objective. Abstract: Post-training has become central to turning pretrained large language models (LLMs) into aligned and deployable systems. Recent progress spans supervised fine-tuning (SFT), preference optimization, reinforcement learning (RL), process supervision, verifier-guided methods, distillation, and multi-stage pipelines. Yet these methods are often discussed in fragmented ways, organized by labels or objective families rather than by the behavioral bottlenecks they address. This survey argues that LLM post-training is best understood as structured intervention on model behavior. We organize the field first by trajectory provenance, which defines two primary learning regimes: off-policy learning on externally supplied trajectories, and on-policy learning on learner-generated rollouts. We then interpret methods through two recurring roles -- effective support expansion, which makes useful behaviors more reachable, and policy reshaping, which improves behavior within already reachable regions -- together with a complementary systems-level role, behavioral consolidation, which preserves, transfers, and amortizes behavior across stages and model transitions. This perspective yields a unified reading of major paradigms. SFT may serve either support expansion or policy reshaping, whereas preference-based methods are usually off-policy reshaping. On-policy RL often improves behavior on learner-generated states, though under stronger guidance it can also make hard-to-reach reasoning paths reachable. Distillation is often best understood as consolidation rather than only compression, and hybrid pipelines emerge as coordinated multi-stage compositions. Overall, the framework helps diagnose post-training bottlenecks and reason about stage composition, suggesting that progress in LLM post-training increasingly depends on coordinated system design rather than any single dominant objective.

[32] Rethinking Data Mixing from the Perspective of Large Language Models

Yuanjian Xu,Tianze Sun,Changwei Xu,XinLong Zhao,Jianing Hao,Ran Chen,Yang Liu,Ruijie Xu,Stephen Chen,Guang Zhang

Main category: cs.CL

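TL;DR: This paper proposes DoGraph, a framework that establishes a theoretical link between gradient dynamics and domain distributions and formulates data scheduling as a graph-constrained optimization problem, addressing open questions about data mixing in LLM training.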

Details Motivation: Existing data mixing strategies lack a clear understanding of what constitutes a domain, whether human and model perceptions of domains agree, and how domain weighting affects generalization, so empirical methods give unstable results. Method: Establishes formal connections between gradient dynamics and domain distributions and proposes DoGraph, a reweighting framework that formulates data scheduling as a graph-constrained optimization problem. Result: Extensive experiments on GPT-2 models of varying scales show that DoGraph consistently achieves competitive performance. Conclusion: The role of domains in LLM training dynamics can be formalized, and DoGraph provides both theoretical grounding and a practical method for data mixing. Abstract: Data mixing strategy is essential for large language model (LLM) training. Empirical evidence shows that inappropriate strategies can significantly reduce generalization. Although recent methods have improved empirical performance, several fundamental questions remain open: what constitutes a domain, whether human and model perceptions of domains are aligned, and how domain weighting influences generalization. We address these questions by establishing formal connections between gradient dynamics and domain distributions, offering a theoretical framework that clarifies the role of domains in training dynamics. Building on this analysis, we introduce DoGraph, a reweighting framework that formulates data scheduling as a graph-constrained optimization problem. Extensive experiments on GPT-2 models of varying scales demonstrate that DoGraph consistently achieves competitive performance.

[33] AtomEval: Atomic Evaluation of Adversarial Claims in Fact Verification

Hongyi Cen,Mingxin Wang,Yule Liu,Jingyi Zheng,Hanze Jia,Tan Tang,Yingcai Wu

Main category: cs.CL

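TL;DR: This paper proposes AtomEval, a framework that evaluates the truth-conditional consistency of adversarial claim rewrites through SROM atomic decomposition and Atomic Validity Scoring (AVS), addressing the weakness of conventional metrics that rely on surface similarity. Experiments on FEVER show more reliable evaluation signals and reveal that stronger LLMs do not necessarily generate more effective adversarial claims.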

Details Motivation: Standard metrics fail to capture truth-conditional consistency and often count semantically corrupted adversarial rewrites as successes, which makes the evaluation of fact-checking systems unreliable. Method: Proposes AtomEval, which decomposes claims into SROM (subject-relation-object-modifier) atoms and uses Atomic Validity Scoring (AVS) to detect factual corruption in adversarial rewrites. Result: On FEVER, across representative attack strategies and LLM generators, AtomEval gives more reliable evaluation signals than conventional metrics; adversarial claims from stronger LLMs are not more effective under validity-aware evaluation. Conclusion: AtomEval identifies factual corruption more accurately, exposes limitations of current adversarial evaluation practice, and underlines the need for validity-aware evaluation. Abstract: Adversarial claim rewriting is widely used to test fact-checking systems, but standard metrics fail to capture truth-conditional consistency and often label semantically corrupted rewrites as successful. We introduce AtomEval, a validity-aware evaluation framework that decomposes claims into subject-relation-object-modifier (SROM) atoms and scores adversarial rewrites with Atomic Validity Scoring (AVS), enabling detection of factual corruption beyond surface similarity. Experiments on the FEVER dataset across representative attack strategies and LLM generators show that AtomEval provides more reliable evaluation signals in our experiments. Using AtomEval, we further analyze LLM-based adversarial generators and observe that stronger models do not necessarily produce more effective adversarial claims under validity-aware evaluation, highlighting previously overlooked limitations in current adversarial evaluation practices.

[34] Kathleen: Oscillator-Based Byte-Level Text Classification Without Tokenization or Attention

George Fountzoulas

Main category: cs.CL

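TL;DR: Kathleen is a text classification architecture that operates directly on raw UTF-8 bytes with frequency-domain processing. It needs no tokenizer and no attention mechanism, has only 733K parameters, and outperforms a much larger tokenized model on IMDB and AG News.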

Details Motivation: Avoid the dependence of conventional NLP models on tokenizers, attention mechanisms, and large parameter counts, and explore a more efficient, lightweight, and scalable approach to byte-level text modeling. Method: Introduces three components, all based on frequency-domain processing: RecurrentOscillatorBanks (damped sinusoid convolutions with temporal memory), an FFT-Rotate Wavetable Encoder (a single vector that maps all 256 byte values), and PhaseHarmonics (a sinusoidal non-linearity with only 6 parameters). Result: Kathleen-Clean reaches 88.6% on IMDB, 92.3% on AG News, and 83.3% on SST-2, outperforming a tokenized model with 16x more parameters on IMDB and AG News; PhaseHarmonics contributes the largest accuracy gain (+2.6%) with a negligible parameter count. Conclusion: Frequency-domain modeling can deliver strong text classification with very few parameters, calls into question the need for complex cognitive architectures, and points to an efficient, scalable approach to byte-level NLP. Abstract: We present Kathleen, a text classification architecture that operates directly on raw UTF-8 bytes using frequency-domain processing -- requiring no tokenizer, no attention mechanism, and only 733K parameters. Kathleen introduces three novel components: (1) RecurrentOscillatorBanks -- damped sinusoid convolutions with temporal memory for O(L) sequence processing; (2) an FFT-Rotate Wavetable Encoder that maps all 256 byte values using a single learnable vector (256 floats), replacing conventional embedding tables (65K parameters) while improving accuracy; (3) PhaseHarmonics -- a sinusoidal non-linearity with just 6 learnable phase parameters that our ablation identifies as the single most impactful component (+2.6% accuracy, <0.001% of model parameters). Through comprehensive ablation of a 1.8M-parameter predecessor, we show that frequency-domain components systematically outperform complex cognitive architectures: removing a 560K-parameter bio-inspired framework costs only -0.2%, while removing the 6-parameter PhaseHarmonics costs -2.6%. The resulting Kathleen-Clean achieves 88.6% on IMDB, 92.3% on AG News, and 83.3% on SST-2 -- outperforming a tokenized counterpart with 16x more parameters on IMDB (+1.6%) and AG News (+2.1%). Kathleen processes sequences in O(L) time and memory, enabling byte-level operation at sequence lengths where O(L^2) Transformers exhaust GPU memory.
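To see why a damped sinusoid convolution can run in O(L) time, the sketch below computes the same filter two ways: with a one-state complex recurrence and with an explicit convolution. It illustrates the general signal-processing identity only, not the paper's RecurrentOscillatorBanks; the fixed `decay` and `omega` values (which would presumably be learned in a trained model) and the function names are assumptions of this example.

```python
import cmath
import math


def oscillator_recurrent(signal, decay, omega):
    """O(L) pass: one complex state rotates and decays at every step."""
    lam = cmath.exp(complex(-decay, omega))  # per-step decay and rotation
    state = 0j
    out = []
    for x in signal:
        state = lam * state + x
        out.append(state.imag)
    return out


def oscillator_convolution(signal, decay, omega):
    """O(L^2) reference: convolve with the kernel exp(-decay*k) * sin(omega*k)."""
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k in range(t + 1):
            acc += math.exp(-decay * k) * math.sin(omega * k) * signal[t - k]
        out.append(acc)
    return out


# Raw UTF-8 bytes scaled to [0, 1]; no tokenizer is involved.
signal = [b / 255.0 for b in "no tokenizer".encode("utf-8")]
fast = oscillator_recurrent(signal, decay=0.1, omega=0.7)
slow = oscillator_convolution(signal, decay=0.1, omega=0.7)
max_gap = max(abs(a - b) for a, b in zip(fast, slow))
```

The recurrence works because the state after t steps equals the sum of λ^k · x[t-k], and the imaginary part of λ^k is exp(-decay·k)·sin(omega·k), so a single pass reproduces the full convolution.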

[35] A Decomposition Perspective to Long-context Reasoning for LLMs

Yanling Xiao,Huaibing Xie,Guoliang Zhao,Shihan Dou,Shaolei Wang,Yiting Liu,Nantao Zheng,Cheng Zhang,Pluto Zhou,Zhisong Zhang,Lemao Liu

Main category: cs.CL

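TL;DR: This paper decomposes long-context reasoning into fundamental atomic skills and uses reinforcement learning on synthesized pseudo datasets to strengthen those skills, improving the long-context reasoning ability of large language models.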

Details Motivation: Current long-context reasoning research mostly takes a holistic view and overlooks the internal complexity of the task; this work aims to identify and systematically improve the basic skills that underpin long-context reasoning. Method: Decomposes long-context reasoning into a set of atomic skills, automatically synthesizes a pseudo dataset targeting each skill, and applies reinforcement learning on these datasets to sharpen the corresponding skills. Result: Improves over a strong baseline by an average of 7.7 points (from 46.3% to 54.0%) across Loogle, Loong, LongBench-v2, BrowscompLong, Ruler-qa2, and MRCR. Conclusion: Strengthening the underlying atomic skills improves overall long-context reasoning, supporting fine-grained skill decomposition with targeted training. Abstract: Long-context reasoning is essential for complex real-world applications, yet remains a significant challenge for Large Language Models (LLMs). Despite the rapid evolution in long-context reasoning, current research often overlooks the internal complexity of the long-context reasoning task itself. In this paper, we move beyond this holistic view and decompose long-context reasoning into a set of fundamental atomic skills, and we then automatically synthesize a suite of pseudo datasets, each explicitly targeting a specific atomic skill. Our empirical analysis confirms that proficiency in these atomic skills is strongly correlated with general long-text reasoning performance. Building on this insight, we employ reinforcement learning on these pseudo datasets to sharpen the model's atomic skills, in the hope of boosting its general long-context reasoning ability. Extensive experiments across multiple benchmarks demonstrate the effectiveness of our approach: it outperforms a strong baseline by an average margin of 7.7% (improving from 46.3% to 54.0%) across Loogle, Loong, LongBench-v2, BrowscompLong, Ruler-qa2, and MRCR.

[36] RAG Performance Prediction for Question Answering

Or Dado,David Carmel,Oren Kurland

Main category: cs.CL

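TL;DR: This paper studies how to predict the performance gain that RAG (retrieval-augmented generation) brings to question answering relative to not using it. It compares a range of pre-retrieval, post-retrieval, and post-generation predictors and proposes a novel supervised predictor that explicitly models the semantic relationships among the question, the retrieved passages, and the generated answer, which gives the best prediction quality.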

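Details Motivation: Predict whether RAG will improve performance on a question answering task, so that the decision to enable RAG can be guided by the expected benefit, saving compute or improving efficiency. Method: Evaluate several pre-retrieval and post-retrieval predictors originally designed for ad hoc retrieval, and introduce and test several post-generation predictors, one of which is a novel supervised predictor that explicitly models the semantic relationships among the question, the retrieved passages, and the generated answer. Result: The proposed supervised predictor performs best at predicting the RAG gain, clearly ahead of the other pre-retrieval and post-retrieval predictors. Conclusion: A supervised predictor that explicitly models the semantic relationships among the question, the retrieved content, and the generated answer is the most effective way to predict the RAG gain; post-generation prediction, especially when based on modeling semantic relationships, shows the most promise. Abstract: We address the task of predicting the gain of using RAG (retrieval augmented generation) for question answering with respect to not using it. We study the performance of a few pre-retrieval and post-retrieval predictors originally devised for ad hoc retrieval. We also study a few post-generation predictors, one of which is novel to this study and posts the best prediction quality. Our results show that the most effective prediction approach is a novel supervised predictor that explicitly models the semantic relationships among the question, retrieved passages, and the generated answer.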

[37] Guaranteeing Knowledge Integration with Joint Decoding for Retrieval-Augmented Generation

Zhengyi Zhao,Shubo Zhang,Zezhong Wang,Yuxi Zhang,Huimin Wang,Yutian Zhao,Yefeng Zheng,Binyang Li,Kam-Fai Wong,Xian Wu

Main category: cs.CL

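TL;DR: This paper proposes GuarantRAG, a framework that explicitly decouples reasoning from evidence integration. It produces an Inner-Answer (based on parametric knowledge) and a Refer-Answer (forced to rely on external evidence through a contrastive DPO objective), then fuses the strengths of both through joint decoding, substantially improving the factual accuracy of RAG and reducing hallucinations.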

Details Motivation: Existing RAG methods can retrieve relevant documents, but LLMs often fail to use them effectively because their internal parametric knowledge conflicts with the external evidence (the "integration bottleneck"); resolving this conflict implicitly works poorly. Method: The GuarantRAG framework (1) generates an Inner-Answer that relies only on parametric knowledge to capture the reasoning flow; (2) trains a Refer-Answer with a contrastive DPO objective that treats the Inner-Answer as the negative sample and the retrieved documents as the positive sample, suppressing hallucinations; and (3) introduces a token-level dynamic joint decoding mechanism that fuses the logical coherence of the Inner-Answer with the factual precision of the Refer-Answer. Result: On five QA benchmarks, accuracy improves by up to 12.1% and hallucinations fall by 16.3% compared with standard and dynamic RAG baselines. Conclusion: Explicitly decoupling and jointly optimizing reasoning and evidence integration is an effective way past the RAG integration bottleneck, and GuarantRAG offers a new paradigm for improving RAG faithfulness and reliability. Abstract: Retrieval-Augmented Generation (RAG) significantly enhances Large Language Models (LLMs) by providing access to external knowledge. However, current research primarily focuses on retrieval quality, often overlooking the critical "integration bottleneck": even when relevant documents are retrieved, LLMs frequently fail to utilize them effectively due to conflicts with their internal parametric knowledge. In this paper, we argue that implicitly resolving this conflict in a single generation pass is suboptimal. We introduce GuarantRAG, a framework that explicitly decouples reasoning from evidence integration. First, we generate an "Inner-Answer" based solely on parametric knowledge to capture the model's reasoning flow. Second, to guarantee faithful evidence extraction, we generate a "Refer-Answer" using a novel Contrastive DPO objective. This objective treats the parametric Inner-Answer as a negative constraint and the retrieved documents as positive ground truth, forcing the model to suppress internal hallucinations in favor of external evidence during this phase. Finally, rather than naive concatenation or using the DPO trained model directly, we propose a joint decoding mechanism that dynamically fuses the logical coherence of the Inner-Answer with the factual precision of the Refer-Answer at the token level. Experiments on five QA benchmarks demonstrate that GuarantRAG improves accuracy by up to 12.1% and reduces hallucinations by 16.3% compared to standard and dynamic RAG baselines.

[38] Efficient Provably Secure Linguistic Steganography via Range Coding

Ruiyi Yan,Yugo Murawaki

Main category: cs.CL

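TL;DR: This paper proposes an efficient, provably secure linguistic steganography method based on range coding with a rotation mechanism, which substantially improves embedding capacity and speed while preserving security.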

Details Motivation: Achieve provable security for language-model-based steganography while addressing the limited embedding capacity of earlier methods that attain perfect imperceptibility (zero KL divergence). Method: Directly apply a classic entropy coding method (range coding) and add a rotation mechanism to build an efficient, provably secure linguistic steganography method. Result: Experiments on several language models show embedding efficiency close to 100% and embedding speeds of up to 1554.66 bits/s (GPT-2), ahead of existing baselines. Conclusion: The proposed method substantially improves embedding capacity and efficiency while retaining provable security and high imperceptibility, offering a new direction for linguistic steganography. Abstract: Linguistic steganography involves embedding secret messages within seemingly innocuous texts to enable covert communication. Provable security, which is a long-standing goal and key motivation, has been extended to language-model-based steganography. Previous provably secure approaches have achieved perfect imperceptibility, measured by zero Kullback-Leibler (KL) divergence, but at the expense of embedding capacity. In this paper, we attempt to directly use a classic entropy coding method (range coding) to achieve secure steganography, and then propose an efficient and provably secure linguistic steganographic method with a rotation mechanism. Experiments across various language models show that our method achieves around 100% entropy utilization (embedding efficiency) for embedding capacity, outperforming the existing baseline methods. Moreover, it achieves high embedding speeds (up to 1554.66 bits/s on GPT-2). The code is available at github.com/ryehr/RRC_steganography.

[39] Dual-Pool Token-Budget Routing for Cost-Efficient and Reliable LLM Serving

Xunzhuo Liu,Bowei He,Xue Liu,Andy Luo,Haichen Zhang,Huamin Chen

Main category: cs.CL

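TL;DR: This paper proposes dual-pool token-budget routing, which partitions a homogeneous vLLM fleet into a high-throughput short-context pool and a high-capacity long-context pool and dispatches each request according to its estimated token budget. The approach substantially raises GPU utilization, lowers OOM and preemption rates, and yields large cost savings on real workloads.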

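Details Motivation: Production vLLM deployments provision each instance for the worst-case context length, which severely over-allocates the KV cache and under-uses concurrency; most requests are short yet are served under long-context configurations, wasting 4-8× throughput capacity and causing reliability problems such as OOM crashes, preemption, and request rejections. The root cause is a mismatch between configuration and traffic. Method: Dual-pool token-budget routing: (1) partition the fleet into a high-throughput short-context pool and a high-capacity long-context pool; (2) estimate each request's total token budget from a per-category bytes-to-token ratio learned online (as an EMA), giving lightweight routing that needs no tokenizer; (3) build an analytical model that predicts cost savings. Result: On real traces from Azure and LMSYS, serving Llama-3-70B uses 31-42% fewer GPU-hours ($2.86M saved per year), preemption rates fall by 5.4×, and P99 time to first token improves by 6%; a Qwen3-235B-A22B case study on MI300X projects $15.4M in annual savings; dispatch overhead is only O(1), and the method is compatible with existing optimizations such as PagedAttention. Conclusion: Dual-pool token-budget routing is a low-overhead, adaptive, composable dispatch mechanism that mitigates the configuration-traffic mismatch and substantially improves the resource efficiency and reliability of production vLLM deployments while remaining compatible with existing systems. Abstract: Production vLLM fleets typically provision each instance for the worst-case context length, leading to substantial KV-cache over-allocation and under-utilized concurrency. In practice, 80-95% of requests are short, yet are served under configurations optimized for long contexts, wasting 4-8× throughput capacity and triggering reliability issues such as OOM crashes, preemption, and request rejections. We identify a common root cause for these inefficiencies: configuration-traffic mismatch. We propose dual-pool token-budget routing, a lightweight dispatch mechanism that partitions a homogeneous fleet into two specialized pools: a high-throughput short-context pool and a high-capacity long-context pool. Each request is routed based on its estimated total token budget, computed using a per-category bytes-to-token ratio that is learned online via exponential moving average from usage.prompt_tokens feedback, eliminating the need for a tokenizer. We also develop a simple analytical model that predicts fleet-level cost savings from workload characteristics and measured throughput differences, enabling practitioners to estimate benefits prior to deployment. Evaluations on real-world traces from the Azure LLM Inference Dataset and LMSYS-Chat-1M, serving Llama-3-70B on A100 GPUs, show that our approach reduces GPU-hours by 31-42%, corresponding to $2.86M annual savings at fleet scale, while lowering preemption rates by 5.4× and improving P99 TTFT by 6%.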
A case study with Qwen3-235B-A22B on AMD MI300X at 10,000 req/s projects $15.4M in annual savings. The method incurs only O(1) dispatch overhead, adapts automatically to heterogeneous workloads, and composes seamlessly with existing optimizations such as PagedAttention, continuous batching, and prefill-decode disaggregation.
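The routing rule described in the abstract above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `TokenBudgetRouter`, the pool limit, the smoothing factor, the initial ratio, and the choice to add the requested output tokens to the estimated prompt tokens are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TokenBudgetRouter:
    """Route requests to a short- or long-context pool by estimated token budget."""

    short_pool_limit: int = 8192   # hypothetical token cap of the short-context pool
    alpha: float = 0.1             # hypothetical EMA smoothing factor
    default_ratio: float = 4.0     # hypothetical initial bytes-per-token guess
    ratios: dict = field(default_factory=dict)  # category -> learned bytes per token

    def estimate_budget(self, category: str, prompt: str, max_new_tokens: int) -> int:
        """Estimate prompt tokens from the byte length, then add the output allowance."""
        ratio = self.ratios.get(category, self.default_ratio)
        prompt_bytes = len(prompt.encode("utf-8"))
        return int(prompt_bytes / ratio) + max_new_tokens

    def route(self, category: str, prompt: str, max_new_tokens: int) -> str:
        """Return "short" when the estimated budget fits the short pool, else "long"."""
        budget = self.estimate_budget(category, prompt, max_new_tokens)
        return "short" if budget <= self.short_pool_limit else "long"

    def observe(self, category: str, prompt: str, prompt_tokens: int) -> None:
        """Update the per-category ratio from the token count reported after serving."""
        if prompt_tokens <= 0:
            return
        observed = len(prompt.encode("utf-8")) / prompt_tokens
        previous = self.ratios.get(category, self.default_ratio)
        self.ratios[category] = (1 - self.alpha) * previous + self.alpha * observed
```

Each routing decision is one dictionary lookup and one division, which matches the O(1) dispatch overhead the abstract reports; the tokenizer only runs on the serving instance, whose reported prompt token count is fed back through `observe`.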

[40] Quantum Vision Theory Applied to Audio Classification for Deepfake Speech Detection

Khalid Zaman,Melike Sah,Anuwat Chaiwongyenc,Cem Direkoglu

Main category: cs.CL

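TL;DR: This paper proposes Quantum Vision (QV) theory, which brings the idea of particle-wave duality from quantum physics into deep learning-based audio classification, specifically deepfake speech detection. A QV block converts speech features (such as STFT, Mel-spectrograms, and MFCC) into information waves that are then fed to CNN or ViT models, substantially improving detection performance.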

Details Motivation: Inspired by particle-wave duality in quantum physics, explore the idea that data can be represented as "information waves" in addition to its observable "collapsed" form, and test whether this idea works on speech spectrograms to improve deepfake speech detection. Method: Design a QV block that converts time-frequency features of speech signals (STFT, Mel-spectrograms, and MFCC) into information-wave representations, then build QV-CNN and QV-ViT models and train and evaluate them on the ASVspoof dataset. Result: QV-CNN and QV-ViT both outperform their baseline counterparts; QV-CNN with MFCC reaches 94.20% accuracy (EER = 9.04%), and QV-CNN with Mel-spectrograms reaches the highest accuracy, 94.57%. Conclusion: QV theory is an effective and promising quantum-inspired approach to audio deepfake detection and opens new directions for learning paradigms in audio perception tasks. Abstract: We propose Quantum Vision (QV) theory as a new perspective for deep learning-based audio classification, applied to deepfake speech detection. Inspired by particle-wave duality in quantum physics, QV theory is based on the idea that data can be represented not only in its observable, collapsed form, but also as information waves. In conventional deep learning, models are trained directly on these collapsed representations, such as images. In QV theory, inputs are first transformed into information waves using a QV block, and then fed into deep learning models for classification. QV-based models improve performance in image classification compared to their non-QV counterparts. What if QV theory is applied to speech spectrograms for audio classification tasks? This is the motivation and novelty of the proposed approach. In this work, Short-Time Fourier Transform (STFT), Mel-spectrograms, and Mel-Frequency Cepstral Coefficients (MFCC) of speech signals are converted into information waves using the proposed QV block and used to train QV-based Convolutional Neural Networks (QV-CNN) and QV-based Vision Transformers (QV-ViT). Extensive experiments are conducted on the ASVspoof dataset for deepfake speech classification. The results show that QV-CNN and QV-ViT consistently outperform standard CNN and ViT models, achieving higher classification accuracy and improved robustness in distinguishing genuine and spoofed speech.
Moreover, the QV-CNN model using MFCC features achieves the best overall performance on the ASVspoof dataset, with an accuracy of 94.20% and an EER of 9.04%, while the QV-CNN with Mel-spectrograms attains the highest accuracy of 94.57%. These findings demonstrate that QV theory is an effective and promising approach for audio deepfake detection and opens new directions for quantum-inspired learning in audio perception tasks.

[41] Initialisation Determines the Basin: Efficient Codebook Optimisation for Extreme LLM Quantization

Ian W. Kennedy,Nafise Sadat Moosavi

Main category: cs.CL

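TL;DR: This paper proposes OA-EM, an output-aware EM initialisation method that addresses the performance bottleneck of additive quantization (AQ) under extreme 2-bit compression, in particular the optimisation trap created by greedy sequential initialisation. It introduces the representational ratio ρ = N/KM to characterise the relationship between weight groups and codebook capacity, uses Hessian-weighted Mahalanobis distance to obtain a better initialisation, and substantially improves quality and efficiency after PV-tuning across several models and compression settings.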

Details Motivation: Additive quantization often fails catastrophically at 2-bit precision, and extensive search and finetuning do little to help; the authors find that the main bottleneck is a poor codebook initialisation strategy. Method: Propose OA-EM (Output-Aware Expectation-Maximization), which performs output-aware codebook initialisation using Hessian-weighted Mahalanobis distance, and introduce the representational ratio ρ = N/KM to analyse how initialisation affects the optimisation geometry. Result: On Llama 3.2 3B, Llama 3.1 8B, and Qwen 2.5 3B, OA-EM beats the baseline across compression rates and search budgets and gives clearly better results after PV-tuning; at 2 bpp in particular it avoids order-of-magnitude perplexity degradation, and it dominates the quality-compute frontier. Conclusion: Codebook initialisation is the decisive optimisation factor in additive quantization, and its impact grows with the representational ratio ρ; optimisation geometry dominates in highly compressed models, where a good initialisation can outweigh subsequent search and finetuning. Abstract: Additive quantization enables extreme LLM compression with O(1) lookup-table dequantization, making it attractive for edge deployment. Yet at 2-bit precision, it often fails catastrophically, even with extensive search and finetuning. We show that the dominant bottleneck is codebook initialisation. Greedy sequential initialisation frequently places the model in poor optimisation regions that subsequent beam search and PV-tuning struggle to overcome. We analyse this behaviour through the representational ratio ρ = N/KM, which characterises the relationship between weight groups and codebook capacity, and propose OA-EM, an output-aware EM initialisation method using Hessian-weighted Mahalanobis distance. Across compression rates, search budgets, and three architectures (Llama 3.2 3B, Llama 3.1 8B, Qwen 2.5 3B), OA-EM consistently produces better solutions after PV-tuning and dominates the quality-compute frontier. The severity of the bottleneck scales with ρ: moderate at 3 bpp but extreme at 2 bpp, where poor initialisation can degrade perplexity by orders of magnitude. More broadly, our results highlight the importance of optimisation geometry in compressed model spaces, where initialisation can dominate subsequent search and fine-tuning.
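To make the phrase "Hessian-weighted Mahalanobis distance" concrete, the sketch below shows an assignment step (the E-step of an EM-style clustering) in which each weight group is matched to the codeword that minimises (w - c)^T H (w - c). It is a simplified single-codebook illustration, not the OA-EM algorithm itself: the abstract does not define H or the update rules, so the function names, the use of one shared matrix `H`, and the toy values are all assumptions.

```python
def hessian_weighted_distance(w, c, H):
    """Quadratic form (w - c)^T H (w - c) for plain Python lists."""
    diff = [wi - ci for wi, ci in zip(w, c)]
    n = len(diff)
    return sum(diff[i] * H[i][j] * diff[j] for i in range(n) for j in range(n))


def assign_groups(groups, codebook, H):
    """Index of the nearest codeword for each weight group under the H-weighted metric."""
    return [
        min(range(len(codebook)),
            key=lambda k: hessian_weighted_distance(w, codebook[k], H))
        for w in groups
    ]


# Toy example: the second coordinate matters 100x more to the layer output.
H_output_aware = [[1.0, 0.0], [0.0, 100.0]]
H_identity = [[1.0, 0.0], [0.0, 1.0]]
groups = [[1.0, 0.0]]
codebook = [[0.0, 0.0], [1.0, 0.2]]

print(assign_groups(groups, codebook, H_identity))      # Euclidean choice
print(assign_groups(groups, codebook, H_output_aware))  # output-aware choice
```

With the identity matrix the metric reduces to squared Euclidean distance and the group picks the second codeword; with the weighted matrix the small error in the sensitive coordinate dominates and the first codeword wins. This is the sense in which an output-aware metric can change where an initialisation places the codebook.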

[42] LLM-Based Data Generation and Clinical Skills Evaluation for Low-Resource French OSCEs

Tian Huang,Tom Bourgeade,Irina Illina

Main category: cs.CL

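TL;DR: This paper proposes a new method that uses large language models (LLMs) to automatically generate and evaluate OSCE doctor-patient dialogues in the low-resource French setting. Through a pipeline of controlled synthesis and silver labeling, it reaches evaluation accuracy comparable to GPT-4o (about 90%), supporting locally deployed, privacy-preserving assessment systems for medical education.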

Details Motivation: OSCE training in France is limited by human and logistical constraints, so students lack opportunities for repeated practice and structured feedback; in addition, annotated real French OSCE transcripts are extremely scarce, which hinders reproducible research and reliable benchmarking. Method: Build a controlled pipeline that generates ideal and perturbed doctor-patient dialogues from scenario-specific evaluation criteria (simulating different student skill levels) and silver-labels the synthetic dialogues automatically through an LLM-assisted framework with adjustable strictness; then benchmark multiple open-source and proprietary LLMs on the synthetic data. Result: Mid-size LLMs (≤32B parameters) reach about 90% evaluation accuracy on the synthetic data, on par with GPT-4o, which supports the feasibility of locally deployed, privacy-safe automatic OSCE evaluation. Conclusion: LLMs can handle both dialogue generation and automatic evaluation in the low-resource French OSCE setting, offering medical education a scalable, customizable, privacy-compliant alternative for assessment. Abstract: Objective Structured Clinical Examinations (OSCEs) are the standard method for assessing medical students' clinical and communication skills through structured patient interviews. In France, however, the organization of training sessions is limited by human and logistical constraints, restricting students' access to repeated practice and structured feedback. Recent advances in Natural Language Processing (NLP) and Large Language Models (LLMs) now offer the opportunity to automatically evaluate such medical interviews, thereby alleviating the need for human examiners during training. Yet, real French OSCE annotated transcripts remain extremely scarce, limiting reproducible research and reliable benchmarking. To address these challenges, we investigate the use of LLMs for both generating and evaluating French OSCE dialogues in a low-resource context. We introduce a controlled pipeline that produces synthetic doctor-patient interview transcripts guided by scenario-specific evaluation criteria, combining ideal and perturbed performances to simulate varying student skill levels. The resulting dialogues are automatically silver-labeled through an LLM-assisted framework supporting adjustable evaluation strictness. Benchmarking multiple open-source and proprietary LLMs shows that mid-size models (≤32B parameters) achieve accuracies comparable to GPT-4o (~90%) on synthetic data, highlighting the feasibility of locally deployable, privacy-preserving evaluation systems for medical education.

[43] Graph Neural Networks for Misinformation Detection: Performance-Efficiency Trade-offs

Soveatin Kuntur,Maciej Krzywda,Anna Wróblewska,Marcin Paprzycki,Maria Ganzha,Szymon Łukasik,Amir H. Gandomi

Main category: cs.CL

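TL;DR: This paper compares graph neural networks (GNNs) with traditional machine learning methods on multilingual misinformation detection under controlled conditions. Lightweight GNNs (such as GraphSAGE and ChebNet) achieve substantially higher F1 scores than Logistic Regression, SVM, and MLP with comparable or lower inference times, showing that classic GNNs remain efficient and practical.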

Details Motivation: Existing misinformation detection models (such as large language models and hybrid architectures) are computationally expensive and hard to deploy, so lighter and more practical alternatives need to be evaluated. Method: On seven public multilingual datasets, with identical TF-IDF features for all models, fairly compare lightweight GNNs (GCN, GraphSAGE, GAT, ChebNet) with non-graph models (Logistic Regression, SVM, MLP), using F1 score and inference time as metrics. Result: GNNs outperform the non-graph baselines on all datasets: GraphSAGE reaches F1 scores of 96.8%, 91.9%, and 90.5% on Kaggle, WELFake, and COVID-19, well above MLP; ChebNet reaches 79.1% on FakeNewsNet versus 66.4% for MLP; inference efficiency is comparable or better. Conclusion: Classic lightweight GNNs combine high performance with high efficiency for misinformation detection, which challenges the current push toward ever more complex models and underlines their value for real deployment. Abstract: The rapid spread of online misinformation has led to increasingly complex detection models, including large language models and hybrid architectures. However, their computational cost and deployment limitations raise concerns about practical applicability. In this work, we benchmark graph neural networks (GNNs) against non-graph-based machine learning methods under controlled and comparable conditions. We evaluate lightweight GNN architectures (GCN, GraphSAGE, GAT, ChebNet) against Logistic Regression, Support Vector Machines, and Multilayer Perceptrons across seven public datasets in English, Indonesian, and Polish. All models use identical TF-IDF features to isolate the impact of relational structure. Performance is measured using F1 score, with inference time reported to assess efficiency. GNNs consistently outperform non-graph baselines across all datasets. For example, GraphSAGE achieves 96.8% F1 on Kaggle and 91.9% on WELFake, compared to 73.2% and 66.8% for MLP, respectively. On COVID-19, GraphSAGE reaches 90.5% F1 vs. 74.9%, while ChebNet attains 79.1% vs. 66.4% on FakeNewsNet. These gains are achieved with comparable or lower inference times. Overall, the results show that classic GNNs remain effective and efficient, challenging the need for increasingly complex architectures in misinformation detection.

[44] Clickbait detection: quick inference with maximum impact

Soveatin Kuntur,Panggih Kusuma Ningrum,Anna Wróblewska,Maria Ganzha,Marcin Paprzycki

Main category: cs.CL

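TL;DR: This paper proposes a lightweight hybrid approach to clickbait detection that combines OpenAI semantic embeddings with six compact heuristic features, reduces the embeddings with PCA, and evaluates several classifiers (XGBoost, GraphSAGE, GCN). The graph-based models deliver much faster inference while retaining good discrimination.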

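Details Motivation: Improve the efficiency and practicality of clickbait detection, balancing performance against computational cost, especially in resource-constrained settings. Method: Combine OpenAI semantic embeddings with six heuristic features capturing stylistic and informational cues; reduce the embeddings with PCA; classify with XGBoost, GraphSAGE, and GCN. Result: The graph neural network models (GraphSAGE, GCN) reach competitive F1 scores and high ROC-AUC values with substantially reduced inference time. Conclusion: The lightweight hybrid scheme, and the graph models in particular, greatly improves runtime efficiency while keeping detection reliable, making it suitable for real-time or edge deployment. Abstract: We propose a lightweight hybrid approach to clickbait detection that combines OpenAI semantic embeddings with six compact heuristic features capturing stylistic and informational cues. To improve efficiency, embeddings are reduced using PCA and evaluated with XGBoost, GraphSAGE, and GCN classifiers. While the simplified feature design yields slightly lower F1-scores, graph-based models achieve competitive performance with substantially reduced inference time. High ROC-AUC values further indicate strong discrimination capability, supporting reliable detection of clickbait headlines under varying decision thresholds.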

[45] Training Data Size Sensitivity in Unsupervised Rhyme Recognition

Petr Plecháč,Artjoms Šeļa,Silvie Cinková,Mirella De Sisto,Lara Nugues,Neža Kočnik,Antonina Martynenko,Ben Nagy,Luca Giovannini,Robert Kolár

Main category: cs.CL

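TL;DR: This paper studies the performance of the unsupervised rhyme recognition tool RhymeTagger across seven languages, examines how training data size and language differences affect accuracy, and compares the tool with human inter-annotator agreement and with large language models.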

Details Motivation: Judgments about rhyme are historically constructed, subjective, and complex across languages, which makes automatic rhyme recognition difficult, especially in multilingual settings. Method: Run unsupervised rhyme recognition experiments on poetry corpora in seven languages with the language-independent RhymeTagger tool; evaluate performance at different training sizes; compute inter-annotator agreement on a manually annotated subset and analyse how phonetic similarity and word distance contribute to disagreement; compare against three large language models using one-shot learning. Result: Given sufficient training data, RhymeTagger consistently exceeds human inter-annotator agreement, whereas large language models that lack explicit phonetic representation perform markedly worse on this task. Conclusion: Unsupervised methods based on repeating patterns (such as RhymeTagger) are more reliable for multilingual rhyme recognition, which underlines how important phonetic modeling is for rhyme tasks and exposes a clear limitation of current large language models. Abstract: Rhyme is deceptively intuitive: what is or is not a rhyme is constructed historically, scholars struggle with rhyme classification, and people disagree on whether two words are rhymed or not. This complicates automated rhyme recognition and evaluation, especially in multilingual contexts. This article investigates how much training data is needed for reliable unsupervised rhyme recognition using RhymeTagger, a language-independent tool that identifies rhymes based on repeating patterns in poetry corpora. We evaluate its performance across seven languages (Czech, German, English, French, Italian, Russian, and Slovene), examining how training size and language differences affect accuracy. To set a realistic performance benchmark, we assess inter-annotator agreement on a manually annotated subset of poems and analyze factors contributing to disagreement in expert annotations: phonetic similarity between rhyming words and their distance from each other in a poem. We also compare RhymeTagger to three large language models using a one-shot learning strategy. Our findings show that, once provided with sufficient training data, RhymeTagger consistently outperforms human agreement, while LLMs lacking phonetic representation significantly struggle with the task.

[46] Self-Debias: Self-correcting for Debiasing Large Language Models

Xuan Feng,Shuai Zhao,Luwei Xiao,Tianlong Gu,Bo An

Main category: cs.CL

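TL;DR: This paper proposes Self-Debias, a framework that treats debiasing as a resource redistribution problem and applies dynamic constraints at the level of reasoning trajectories, giving LLMs an intrinsic, self-correcting debiasing ability. With only 20k samples it efficiently improves fairness without harming reasoning ability.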

Details Motivation: Existing debiasing methods mostly rely on static constraints or external interventions and struggle to identify and interrupt the "Bias Propagation" that continues through Chain-of-Thought (CoT) reasoning. Method: The Self-Debias framework (1) models debiasing as a strategic redistribution of output probability mass; (2) designs a fine-grained trajectory-level objective with dynamic debiasing constraints that selectively revises biased suffixes while preserving valid prefixes; and (3) adds an online self-improvement mechanism based on consistency filtering that synthesizes supervision signals autonomously. Result: With only 20k annotated samples the framework activates efficient self-correction, achieving better debiasing across multiple benchmarks while preserving general reasoning performance, with no need for continuous external oversight. Conclusion: Self-Debias gives LLMs an intrinsic, progressive self-correction ability and offers a new paradigm for addressing bias propagation in CoT that balances effectiveness, efficiency, and robustness. Abstract: Although Large Language Models (LLMs) demonstrate remarkable reasoning capabilities, inherent social biases often cascade throughout the Chain-of-Thought (CoT) process, leading to continuous "Bias Propagation". Existing debiasing methods primarily focus on static constraints or external interventions, failing to identify and interrupt this propagation once triggered. To address this limitation, we introduce Self-Debias, a progressive framework designed to instill intrinsic self-correction capabilities. Specifically, we reformulate the debiasing process as a strategic resource redistribution problem, treating the model's output probability mass as a limited resource to be reallocated from biased heuristics to unbiased reasoning paths. Unlike standard preference optimization, which applies broad penalties, Self-Debias employs a fine-grained trajectory-level objective subject to dynamic debiasing constraints. This enables the model to selectively revise biased reasoning suffixes while preserving valid contextual prefixes. Furthermore, we integrate an online self-improvement mechanism utilizing consistency filtering to autonomously synthesize supervision signals. With merely 20k annotated samples, Self-Debias activates efficient self-correction, achieving superior debiasing performance while preserving general reasoning capabilities without continuous external oversight.

[47] HyperMem: Hypergraph Memory for Long-Term Conversations

Juwei Yue,Chuanrui Hu,Jiawei Sheng,Zuyi Zhou,Wenyuan Zhang,Tingwen Liu,Li Guo,Yafeng Deng

Main category: cs.CL

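TL;DR: This paper proposes HyperMem, a hypergraph-based hierarchical memory architecture that addresses the weak modeling of high-order associations in the long-term memory of conversational agents. Using a three-level topic-episode-fact structure with hyperedges, combined with a hybrid index and a coarse-to-fine retrieval strategy, it reaches a state-of-the-art 92.73% accuracy on the LoCoMo benchmark.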

Details Motivation: Existing approaches (such as RAG and graph-based memory) rely on pairwise relations and struggle to capture high-order joint dependencies among multiple elements, which fragments memory retrieval. Method: Propose HyperMem, a hypergraph-based hierarchical memory architecture that organizes memory into three levels (topics, episodes, and facts) and uses hyperedges to group related episodes and facts; design a hybrid lexical-semantic index and a coarse-to-fine retrieval strategy. Result: On the LoCoMo benchmark, HyperMem achieves 92.73% LLM-as-a-judge accuracy, ahead of existing methods. Conclusion: Modeling high-order associations with hypergraphs improves the coherence and retrieval accuracy of memory in long-term conversations, and HyperMem gives long-horizon conversational agents a more robust memory mechanism. Abstract: Long-term memory is essential for conversational agents to maintain coherence, track persistent tasks, and provide personalized interactions across extended dialogues. However, existing approaches such as Retrieval-Augmented Generation (RAG) and graph-based memory mostly rely on pairwise relations, which can hardly capture high-order associations, i.e., joint dependencies among multiple elements, causing fragmented retrieval. To this end, we propose HyperMem, a hypergraph-based hierarchical memory architecture that explicitly models such associations using hyperedges. Particularly, HyperMem structures memory into three levels: topics, episodes, and facts, and groups related episodes and their facts via hyperedges, unifying scattered content into coherent units. Leveraging this structure, we design a hybrid lexical-semantic index and a coarse-to-fine retrieval strategy, supporting accurate and efficient retrieval of high-order associations. Experiments on the LoCoMo benchmark show that HyperMem achieves state-of-the-art performance with 92.73% LLM-as-a-judge accuracy, demonstrating the effectiveness of HyperMem for long-term conversations.
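The three-level structure and coarse-to-fine retrieval described above can be pictured with a small data-structure sketch. It is a toy under stated assumptions rather than HyperMem itself: the class names, the word-overlap score (standing in for the hybrid lexical-semantic index, with the semantic half omitted), and the choice to follow only the single best topic and episode are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    summary: str
    facts: list  # fact strings attached to this episode


@dataclass
class TopicHyperedge:
    """One hyperedge: a topic that groups any number of related episodes."""
    topic: str
    episodes: list = field(default_factory=list)


def overlap(query: str, text: str) -> int:
    """Toy lexical score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve(query: str, memory: list, top_facts: int = 2) -> list:
    """Coarse-to-fine lookup: best topic, then best episode, then ranked facts."""
    if not memory:
        return []
    edge = max(memory, key=lambda e: overlap(query, e.topic))
    if not edge.episodes:
        return []
    episode = max(edge.episodes, key=lambda ep: overlap(query, ep.summary))
    ranked = sorted(episode.facts, key=lambda f: overlap(query, f), reverse=True)
    return ranked[:top_facts]


memory = [
    TopicHyperedge("tokyo travel plans", [
        Episode("booking a flight to tokyo",
                ["the flight to tokyo departs on friday", "the hotel is in shinjuku"]),
    ]),
    TopicHyperedge("work project deadline", [
        Episode("sprint review meeting", ["the demo is due next monday"]),
    ]),
]
print(retrieve("when does the tokyo flight leave", memory, top_facts=1))
```

Because a hyperedge can hold any number of episodes, one lookup returns a coherent unit of related content rather than isolated pairwise links, which is the property the abstract attributes to hyperedges.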

[48] Behavior-Aware Item Modeling via Dynamic Procedural Solution Representations for Knowledge Tracing

Jun Seo,Sangwon Ryu,Heejin Do,Hyounghun Kim,Gary Geunbae Lee

Main category: cs.CL

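TL;DR: This paper proposes BAIM, a behavior-aware item modeling framework that enriches item representations with dynamic procedural information based on Polya's four problem-solving stages and adaptively routes the stage-wise representations for different learners, improving knowledge tracing performance.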

Details Motivation: Existing knowledge tracing methods have advanced by using item representations aligned with knowledge components, but they overlook the procedural dynamics of problem solving. Method: BAIM uses a reasoning language model to decompose each item's solution into four stages (understand, plan, carry out, and look back, following Polya's framework), derives stage-level representations from the per-stage embedding trajectories, and introduces a context-conditioned mechanism within the KT backbone that adaptively routes these stage representations to reflect learner heterogeneity. Result: Experiments on XES3G5M and NIPS34 show that BAIM consistently outperforms strong pretraining-based baselines, with especially large gains when learners interact repeatedly. Conclusion: Integrating the dynamic procedural information of problem solving and modeling it adaptively per learner improves both the predictive accuracy and the personalization of knowledge tracing. Abstract: Knowledge Tracing (KT) aims to predict learners' future performance from past interactions. While recent KT approaches have improved via learning item representations aligned with Knowledge Components, they overlook the procedural dynamics of problem solving. We propose Behavior-Aware Item Modeling (BAIM), a framework that enriches item representations by integrating dynamic procedural solution information. BAIM leverages a reasoning language model to decompose each item's solution into four problem-solving stages (i.e., understand, plan, carry out, and look back), pedagogically grounded in Polya's framework. Specifically, it derives stage-level representations from per-stage embedding trajectories, capturing latent signals beyond surface features. To reflect learner heterogeneity, BAIM adaptively routes these stage-wise representations, introducing a context-conditioned mechanism within a KT backbone, allowing different procedural stages to be emphasized for different learners. Experiments on XES3G5M and NIPS34 show that BAIM consistently outperforms strong pretraining-based baselines, achieving particularly large gains under repeated learner interactions.

[49] Floating or Suggesting Ideas? A Large-Scale Contrastive Analysis of Metaphorical and Literal Verb-Object Constructions

Prisca Piccirilli,Alexander Fraser,Sabine Schulte im Walde

Main category: cs.CL

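TL;DR: This paper analyses the usage of 297 English verb-object pairs in about 2 million corpus sentences, using 2,293 cognitive and linguistic features, to examine how metaphorical and literal usages differ across pairs and within pairs. It finds no uniform distributional pattern; the differences depend mainly on the specific construction.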

Details Motivation: Prior work has mostly examined metaphor from a cognitive or psycholinguistic angle, and systematic large-scale comparisons of metaphorical and literal language in near-synonymous expressions are lacking. Method: Using a corpus of about 2 million sentences, run cross-pair and within-pair analyses of 297 verb-object pairs, extracting 2,293 cognitive and linguistic features with five NLP tools, covering affective, lexical, syntactic, and discourse-level properties. Result: The cross-pair analysis shows that literal contexts have higher lexical frequency, cohesion, and structural regularity, while metaphorical contexts show greater affective load, imageability, lexical diversity, and constructional specificity; the within-pair analysis reveals substantial heterogeneity inside most verb-object pairs, with non-uniform effects. Conclusion: There is no single, universal distributional pattern that separates metaphorical from literal usage; the distinction depends heavily on the specific verb-object construction, and combining large-scale data with diverse features allows a fine-grained characterization of metaphor in VO usage. Abstract: Metaphor pervades everyday language, allowing speakers to express abstract concepts via concrete domains. While prior work has studied metaphors cognitively and psycholinguistically, large-scale comparisons with literal language remain limited, especially for near-synonymous expressions. We analyze 297 English verb-object pairs (e.g., float idea vs. suggest idea) in ~2M corpus sentences, examining their contextual usage. Using five NLP tools, we extract 2,293 cognitive and linguistic features capturing affective, lexical, syntactic, and discourse-level properties. We address: (i) whether features differ between metaphorical and literal contexts (cross-pair analysis), and (ii) whether individual VO pairs diverge internally (within-pair analysis). Cross-pair results show literal contexts have higher lexical frequency, cohesion, and structural regularity, while metaphorical contexts show greater affective load, imageability, lexical diversity, and constructional specificity. Within-pair analyses reveal substantial heterogeneity, with most pairs showing non-uniform effects. These results suggest no single, consistent distributional pattern that distinguishes metaphorical from literal usage. Instead, differences are largely construction-specific. Overall, large-scale data combined with diverse features provides a fine-grained understanding of metaphor-literal contrasts in VO usage.

[50] When to Trust Tools? Adaptive Tool Trust Calibration For Tool-Integrated Math Reasoning

Ruotao Xu,Yixin Ji,Yu Luo,Jinpeng Li,Dong Li,Peifeng Li,Juntao Li,Min Zhang

Main category: cs.CL

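TL;DR: This paper proposes the Adaptive Tool Trust Calibration (ATTC) framework to address a flaw in tool-integrated reasoning (TIR) models, which blindly trust their own reasoning and ignore correct tool results (the "Tool Ignored" problem). ATTC dynamically adjusts how much the model trusts a tool result based on the confidence of the generated code block, improving performance across multiple models and datasets by 4.1% to 7.5%.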

Details Motivation: When their reasoning conflicts with a tool result, existing tool-integrated reasoning (TIR) models tend to trust their own reasoning and therefore ignore correct tool outputs ("Tool Ignored"); they lack a mechanism for judging how trustworthy a tool result is. Method: Propose the Adaptive Tool Trust Calibration (ATTC) framework, which uses the confidence score of the generated code block to decide dynamically whether to trust or ignore the tool's execution result, thereby calibrating the model's trust in the tool. Result: Experiments on several open-source TIR models and multiple datasets show that ATTC mitigates the "Tool Ignored" problem and improves performance by 4.1% to 7.5%. Conclusion: ATTC gives TIR models a generalizable trust calibration mechanism and improves their robustness and accuracy on tasks that require precise computation and knowledge access. Abstract: Large reasoning models (LRMs) have achieved strong performance enhancement through scaling test time computation, but due to the inherent limitations of the underlying language models, they still have shortcomings in tasks that require precise computation and extensive knowledge reserves. Tool-Integrated Reasoning (TIR) has emerged as a promising paradigm that incorporates tool call and execution within the reasoning trajectory. Although recent works have released some powerful open-source TIR models, our analysis reveals that these models still suffer from critical deficiencies. We find that when the reasoning of the model conflicts with the tool results, the model tends to believe in its own reasoning. There are also cases where the tool results are correct but are ignored by the model, resulting in incorrect answers, which we define as "Tool Ignored". This indicates that the model does not know when to trust or ignore the tool. To overcome these limitations, we introduce Adaptive Tool Trust Calibration (ATTC), a novel framework that guides the model to adaptively choose to trust or ignore the tool results based on the confidence score of generated code blocks. The experimental results from various open-source TIR models of different sizes and across multiple datasets demonstrate that ATTC effectively reduces the "Tool Ignored" issue, resulting in a performance increase of 4.1% to 7.5%.
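The trust decision described in the abstract can be illustrated with a short sketch. The abstract does not say how the confidence score of a code block is computed or what threshold is used, so both are assumptions here: confidence is taken as the geometric-mean token probability of the generated code, and `threshold=0.8` and the function names are hypothetical.

```python
import math


def code_block_confidence(token_logprobs):
    """Geometric-mean token probability of a generated code block (assumed measure)."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def choose_answer(model_answer, tool_answer, token_logprobs, threshold=0.8):
    """Trust the tool result when the code that produced it was generated confidently."""
    if tool_answer is None:
        return model_answer  # nothing to trust: the tool returned no result
    if code_block_confidence(token_logprobs) >= threshold:
        return tool_answer   # confident code: prefer the executed result
    return model_answer      # uncertain code: keep the model's own reasoning
```

For example, `choose_answer("41", "42", [-0.05, -0.05])` returns the tool's `"42"` because the code was generated with roughly 95% per-token probability, while log-probabilities of `-1.0` per token fall below the threshold and the model's own answer is kept.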

[51] Distributed Multi-Layer Editing for Rule-Level Knowledge in Large Language Models

Yating Wang,Wenting Zhao,Yaqi Zhao,Yongshun Gong,Yilong Yin,Haoliang Sun

Main category: cs.CL

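TL;DR: This paper studies the editing of rule-level knowledge in large language models and finds that rule knowledge is distributed across transformer layers according to its form (formula, description, instance). It therefore proposes Distributed Multi-Layer Editing (DMLE), which substantially improves rule-level editing performance.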

Details Motivation: Existing model editing methods mainly target fact-level knowledge and assume that an edit can be achieved through a localized intervention, but this assumption does not hold for rule-level knowledge, where a rule must stay consistent across several interdependent forms (such as formulas, natural language descriptions, and concrete instances). Method: Use fine-grained causal tracing to analyse how rule knowledge is distributed across transformer layers, and on that basis propose Distributed Multi-Layer Editing (DMLE), which applies a shared update in early layers to formulas and descriptions and a separate update in middle layers to instances. Result: Across GPT-J-6B, Qwen2.5-7B, Qwen2-7B, and LLaMA-3-8B, DMLE improves instance portability by 13.91 percentage points and rule understanding by 50.19 percentage points on average over the strongest baseline, while remaining competitive on standard editing metrics. Conclusion: Rule-level knowledge is form-specific and distributed across layers, so single-layer or contiguous-block interventions cannot guarantee consistency; DMLE addresses this with coordinated layer-wise updates and offers a new paradigm for rule-level knowledge editing. Abstract: Large language models store not only isolated facts but also rules that support reasoning across symbolic expressions, natural language explanations, and concrete instances. Yet most model editing methods are built for fact-level knowledge, assuming that a target edit can be achieved through a localized intervention. This assumption does not hold for rule-level knowledge, where a single rule must remain consistent across multiple interdependent forms. We investigate this problem through a mechanistic study of rule-level knowledge editing. To support this study, we extend the RuleEdit benchmark from 80 to 200 manually verified rules spanning mathematics and physics. Fine-grained causal tracing reveals a form-specific organization of rule knowledge in transformer layers: formulas and descriptions are concentrated in earlier layers, while instances are more associated with middle layers. These results suggest that rule knowledge is not uniformly localized, and therefore cannot be reliably edited by a single-layer or contiguous-block intervention. Based on this insight, we propose Distributed Multi-Layer Editing (DMLE), which applies a shared early-layer update to formulas and descriptions and a separate middle-layer update to instances. While remaining competitive on standard editing metrics, DMLE achieves substantially stronger rule-level editing performance. On average, it improves instance portability and rule understanding by 13.91 and 50.19 percentage points, respectively, over the strongest baseline across GPT-J-6B, Qwen2.5-7B, Qwen2-7B, and LLaMA-3-8B.
The code is available at https://github.com/Pepper66/DMLE.

[52] SeLaR: Selective Latent Reasoning in Large Language Models

Renyu Fu,Guibo Luo

Main category: cs.CL

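TL;DR: This paper proposes SeLaR (Selective Latent Reasoning), a lightweight, training-free framework that uses an entropy-gated mechanism to enable soft embeddings selectively (only at low-confidence reasoning steps) and adds an entropy-aware contrastive regularization to keep soft embeddings from collapsing, improving the stability and exploration ability of chain-of-thought reasoning in large models.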

Details Motivation: Existing latent reasoning methods are held back by two problems: global activation perturbs high-confidence steps, and soft embeddings readily collapse toward the highest-probability token, which limits exploration. Method: Propose an entropy-gated mechanism (soft embeddings are enabled only at low-confidence steps) and an entropy-aware contrastive regularization (which pushes soft embeddings away from the direction of the dominant token). Result: On five reasoning benchmarks, SeLaR consistently outperforms standard CoT and the best existing training-free methods. Conclusion: Selectively combining discrete and continuous representations, together with targeted regularization, improves reasoning stability and path diversity without additional training. Abstract: Chain-of-Thought (CoT) has become a cornerstone of reasoning in large language models, yet its effectiveness is constrained by the limited expressiveness of discrete token sampling. Recent latent reasoning approaches attempt to alleviate this limitation by replacing discrete tokens with soft embeddings (probability-weighted mixtures of token embeddings) or hidden states, but they commonly suffer from two issues: (1) global activation injects perturbations into high-confidence steps, impairing reasoning stability; and (2) soft embeddings quickly collapse toward the highest-probability token, limiting exploration of alternative trajectories. To address these challenges, we propose SeLaR (Selective Latent Reasoning), a lightweight and training-free framework. SeLaR introduces an entropy-gated mechanism that activates soft embeddings only at low-confidence steps, while preserving discrete decoding at high-confidence steps. Additionally, we propose an entropy-aware contrastive regularization that pushes soft embeddings away from the dominant (highest-probability) token's direction, encouraging sustained exploration of multiple latent reasoning paths. Experiments on five reasoning benchmarks demonstrate that SeLaR consistently outperforms standard CoT and state-of-the-art training-free methods.
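The entropy gate described in the abstract can be sketched as follows. This is a minimal illustration, not the SeLaR implementation: the function names and the threshold value are hypothetical, plain Python lists stand in for model tensors, and the entropy-aware contrastive regularization is left out because the abstract does not give its form.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def next_input_embedding(probs, embeddings, entropy_threshold):
    """Return (embedding, mode) for the next decoding step.

    High entropy means low confidence: feed a probability-weighted mixture of
    token embeddings (a soft embedding). Otherwise decode discretely and feed
    the embedding of the most probable token.
    """
    if entropy(probs) > entropy_threshold:
        dim = len(embeddings[0])
        soft = [sum(p * emb[d] for p, emb in zip(probs, embeddings))
                for d in range(dim)]
        return soft, "soft"
    best = max(range(len(probs)), key=probs.__getitem__)
    return list(embeddings[best]), "discrete"


# Toy vocabulary of two tokens with 2-dimensional embeddings.
embeddings = [[1.0, 0.0], [0.0, 1.0]]
print(next_input_embedding([0.5, 0.5], embeddings, entropy_threshold=0.5))
print(next_input_embedding([0.99, 0.01], embeddings, entropy_threshold=0.5))
```

The uniform distribution has entropy ln 2 (about 0.69), above the threshold, so the step uses the mixture; the peaked distribution stays on the discrete path, which is how the gate avoids perturbing high-confidence steps.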

[53] Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces

Jiawei Chen,Ruoxi Xu,Boxi Cao,Ruotong Pan,Yunfei Zhang,Yifei Hu,Yong Du,Tingting Gao,Yaojie Lu,Yingfei Sun,Xianpei Han,Le Sun,Xiangyu Wu,Hongyu Lin

Main category: cs.CL

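TL;DR: This paper introduces OmniBehavior, the first user simulation benchmark built entirely from real-world data. It shows that current LLMs exhibit structural biases (such as a Utopian bias and persona homogenization) when simulating complex, long-horizon, cross-scenario human behavior, and struggle to capture individual differences and long-tail behaviors.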

Details Motivation: Existing user simulation benchmarks are limited to isolated scenarios, narrow action spaces, or synthetic data, and cannot reflect the holistic and complex nature of real human behavior. Method: The authors build OmniBehavior, a benchmark driven entirely by real data that integrates long-horizon, cross-scenario, and heterogeneous behavior patterns. They combine empirical analysis with large-scale LLM evaluation to compare the structure of simulated and real behavior. Result: Datasets restricted to isolated scenarios suffer from "tunnel vision", whereas real decisions depend on long-term, cross-scenario causal chains. LLM performance plateaus as context grows, and LLMs show structural problems including convergence toward a positive average person, hyper-activity, persona homogenization, and a Utopian bias. Conclusion: Current LLM user simulation lacks fidelity; modeling individual differences and representing long-tail behaviors are key directions for future work. Abstract: The emergence of Large Language Models (LLMs) has illuminated the potential for a general-purpose user simulator. However, existing benchmarks remain constrained to isolated scenarios, narrow action spaces, or synthetic data, failing to capture the holistic nature of authentic human behavior. To bridge this gap, we introduce OmniBehavior, the first user simulation benchmark constructed entirely from real-world data, integrating long-horizon, cross-scenario, and heterogeneous behavioral patterns into a unified framework. Based on this benchmark, we first provide empirical evidence that previous datasets with isolated scenarios suffer from tunnel vision, whereas real-world decision-making relies on long-term, cross-scenario causal chains. Extensive evaluations of state-of-the-art LLMs reveal that current models struggle to accurately simulate these complex behaviors, with performance plateauing even as context windows expand. Crucially, a systematic comparison between simulated and authentic behaviors uncovers a fundamental structural bias: LLMs tend to converge toward a positive average person, exhibiting hyper-activity, persona homogenization, and a Utopian bias. This results in the loss of individual differences and long-tail behaviors, highlighting critical directions for future high-fidelity simulation research.

[54] A GAN and LLM-Driven Data Augmentation Framework for Dynamic Linguistic Pattern Modeling in Chinese Sarcasm Detection

Wenxian Wang,Xiaohu Luo,Junfeng Hao,Xiaoming Gu,Xingshu Chen,Zhu Wang,Haizhou Wang

Main category: cs.CL

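TL;DR: This paper proposes a data augmentation framework that combines a Generative Adversarial Network (GAN) with a Large Language Model (LLM) to model users' linguistic patterns and improve Chinese sarcasm detection. It also builds SinaSarc, a new dataset that includes user historical behavior. The proposed method exceeds existing state-of-the-art methods on F1.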

Details Motivation: Existing Chinese sarcasm detection methods are limited by small datasets and high construction costs, and most ignore how user-specific linguistic patterns shape sarcastic expression. Method: A GAN and LLM-driven data augmentation framework is proposed. A GAN is trained on raw multi-topic data from Sina Weibo, and GPT-3.5 is used to generate SinaSarc, a sarcastic comment dataset with target comments, context, and user historical behavior. The BERT architecture is then extended to fuse multi-dimensional information, including user historical behavior, to capture dynamic linguistic patterns. Result: F1-scores reach 0.9138 on the non-sarcastic category and 0.9151 on the sarcastic category, both above all existing SOTA methods. Conclusion: The study provides a new framework for dynamically modeling users' long-term linguistic patterns in Chinese sarcasm detection, contributing to both dataset construction and methodology. Abstract: Sarcasm is a rhetorical device that expresses criticism or emphasizes characteristics of certain individuals or situations through exaggeration, irony, or comparison. Existing methods for Chinese sarcasm detection are constrained by limited datasets and high construction costs, and they mainly focus on textual features, overlooking user-specific linguistic patterns that shape how opinions and emotions are expressed. This paper proposes a Generative Adversarial Network (GAN) and Large Language Model (LLM)-driven data augmentation framework to dynamically model users' linguistic patterns for enhanced Chinese sarcasm detection. First, we collect raw data from various topics on Sina Weibo. Then, we train a GAN on these data and apply a GPT-3.5 based data augmentation technique to synthesize an extended sarcastic comment dataset, named SinaSarc. This dataset contains target comments, contextual information, and user historical behavior. Finally, we extend the BERT architecture to incorporate multi-dimensional information, particularly user historical behavior, enabling the model to capture dynamic linguistic patterns and uncover implicit sarcastic cues in comments. Experimental results demonstrate the effectiveness of our proposed method. Specifically, our model achieves the highest F1-scores on both the non-sarcastic and sarcastic categories, with values of 0.9138 and 0.9151 respectively, which outperforms all existing state-of-the-art (SOTA) approaches.
This study presents a novel framework for dynamically modeling users' long-term linguistic patterns in Chinese sarcasm detection, contributing to both dataset construction and methodological advancement in this field.

[55] Synthetic Data for any Differentiable Target

Tristan Thrush,Sung Min Park,Herman Brunborg,Luke Bailey,Marcel Roed,Neil Band,Christopher Potts,Tatsunori Hashimoto

Main category: cs.CL

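TL;DR: This paper proposes Dataset Policy Gradient (DPG), a reinforcement learning method for precisely optimizing synthetic data generators so that the generated training data steers a target language model toward good performance on a chosen differentiable metric. The method obtains exact data attribution through higher-order gradients and uses it as the policy gradient reward; it is proven to approximate the true but intractable gradient for the data generator. Experiments show that, using only supervised fine-tuning on synthetic data, DPG can make model weights embed a QR code or a digit pattern, lower their norm, and can even make the generator rephrase inputs in another language or produce a specific UUID, revealing the strong potential of synthetic data for controlling models.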

Details Motivation: To explore the limits of controlling language models through synthetic training data, and to steer model behavior precisely without human annotation or direct reinforcement learning on the target model. Method: The authors propose Dataset Policy Gradient (DPG), an RL primitive that uses higher-order gradients for exact data attribution and treats the resulting scores as rewards for updating the synthetic data generator. They prove that DPG closely approximates the true but intractable gradient for the generator. Result: Using only SFT on DPG-generated data, the target model's LM head weights are made to embed a QR code, embed the pattern '67', and have lower ℓ² norm. The generator also learns to rephrase inputs in a new language and to produce a specific UUID, neither of which is stated in its input prompts. Conclusion: DPG is a powerful and flexible technique that shapes a language model's internal structure and behavior using only synthetic training examples, showing both the promise and the reach of synthetic-data-driven model control. Abstract: What are the limits of controlling language models via synthetic training data? We develop a reinforcement learning (RL) primitive, the Dataset Policy Gradient (DPG), which can precisely optimize synthetic data generators to produce a dataset of targeted examples. When used for supervised fine-tuning (SFT) of a target model, these examples cause the target model to do well on a differentiable metric of our choice. Our approach achieves this by taking exact data attribution via higher-order gradients and using those scores as policy gradient rewards. We prove that this procedure closely approximates the true, intractable gradient for the synthetic data generator. To illustrate the potential of DPG, we show that, using only SFT on generated examples, we can cause the target model's LM head weights to (1) embed a QR code, (2) embed the pattern $\texttt{67}$, and (3) have lower $\ell^2$ norm. We additionally show that we can cause the generator to (4) rephrase inputs in a new language and (5) produce a specific UUID, even though neither of these objectives is conveyed in the generator's input prompts. These findings suggest that DPG is a powerful and flexible technique for shaping model properties using only synthetic training examples.

[56] AfriVoices-KE: A Multilingual Speech Dataset for Kenyan Languages

Lilian Wanzare,Cynthia Amol,Ezekiel Maina,Nelson Odhiambo,Hope Kerubo,Leila Misula,Vivian Oloo,Rennish Mboya,Edwin Onkoba,Edward Ombui,Joseph Muguro,Ciira wa Maina,Andrew Kipkebut,Alfred Omondi Otom,Ian Ndung'u Kang'ethe,Angela Wambui Kanyi,Brian Gichana Omwenga

Main category: cs.CL

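TL;DR: AfriVoices-KE is a large multilingual speech dataset covering five Kenyan languages with about 3,000 hours of scripted and spontaneous speech, intended to reduce the underrepresentation of African languages in speech technology.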

Details Motivation: To address the severe underrepresentation of African languages in speech technology, support the development of inclusive speech systems, and advance the digital preservation of Kenya's linguistic heritage. Method: A dual data collection approach is used. Scripted speech draws on text corpora, translations, and generated sentences from 11 domains relevant to Kenya; spontaneous speech is elicited with textual and image prompts. A customized mobile application enables smartphone recording, and multi-layer quality assurance includes automated signal-to-noise ratio validation and human content review. Result: The resulting dataset, AfriVoices-KE, contains about 3,000 hours of speech (750 hours scripted and 2,250 hours spontaneous) from 4,777 native speakers across diverse regions and demographics. Conclusion: AfriVoices-KE provides a solid foundation for building ASR and TTS systems for low-resource African languages and serves as a practical example of language diversity preservation and community-based data collection. Abstract: AfriVoices-KE is a large-scale multilingual speech dataset comprising approximately 3,000 hours of audio across five Kenyan languages: Dholuo, Kikuyu, Kalenjin, Maasai, and Somali. The dataset includes 750 hours of scripted speech and 2,250 hours of spontaneous speech, collected from 4,777 native speakers across diverse regions and demographics. This work addresses the critical underrepresentation of African languages in speech technology by providing a high-quality, linguistically diverse resource. Data collection followed a dual methodology: scripted recordings drew from compiled text corpora, translations, and domain-specific generated sentences spanning eleven domains relevant to the Kenyan context, while unscripted speech was elicited through textual and image prompts to capture natural linguistic variation and dialectal nuances. A customized mobile application enabled contributors to record using smartphones. Quality assurance operated at multiple layers, encompassing automated signal-to-noise ratio validation prior to recording and human review for content accuracy. Though the project encountered challenges common to low-resource settings, including unreliable infrastructure, device compatibility issues, and community trust barriers, these were mitigated through local mobilizers, stakeholder partnerships, and adaptive training protocols. AfriVoices-KE provides a foundational resource for developing inclusive automatic speech recognition and text-to-speech systems, while advancing the digital preservation of Kenya's linguistic heritage.
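The automated signal-to-noise ratio validation mentioned above could look roughly like the sketch below. It assumes a short speech sample and an ambient-noise sample are available as lists of amplitudes; the function names and the 20 dB default threshold are hypothetical and are not taken from the AfriVoices-KE application.

```python
import math

def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels from mean sample power."""
    p_signal = sum(x * x for x in signal) / len(signal)
    p_noise = sum(x * x for x in noise) / len(noise)
    if p_noise == 0.0:
        return math.inf   # perfectly silent background
    if p_signal == 0.0:
        return -math.inf  # no signal at all
    return 10.0 * math.log10(p_signal / p_noise)

def passes_snr_check(signal, noise, min_snr_db=20.0):
    """Gate a recording session on a minimum SNR (threshold is an assumption)."""
    return snr_db(signal, noise) >= min_snr_db
```

A recording app might run such a check on a brief calibration clip and ask the contributor to move to a quieter place when it fails.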

[57] AI generates well-liked but templatic empathic responses

Emma Gueorguieva,Hongli Zhan,Jina Suh,Javier Hernandez,Tatiana Lau,Junyi Jessy Li,Desmond C. Ong

Main category: cs.CL

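TL;DR: This paper finds that large language models (LLMs) are rated as more empathic when giving emotional support because they apply, in a highly formulaic way, a fixed template built from a taxonomy of 10 empathic language tactics. The template matches 83%-90% of AI responses and covers 81%-92% of each matched response, whereas human responses are more diverse.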

Details Motivation: To explain why users rate LLM-generated emotional support responses as more empathic than human-written ones. Method: The authors build a taxonomy of 10 empathic language tactics and, across two studies (3,265 AI responses and 1,290 human responses in total), analyze the structural regularity of LLM and human responses and how well they match a template. Result: An empathy template with a high match rate (83%-90%) and high coverage (81%-92%) appears widely in the outputs of six LLMs, while human responses are markedly more diverse. Conclusion: The empathy advantage of LLMs comes from a reproducible, structured template of tactics rather than genuine understanding, which calls for caution about the limits and ethical implications of AI-generated empathy. Abstract: Recent research shows that greater numbers of people are turning to Large Language Models (LLMs) for emotional support, and that people rate LLM responses as more empathic than human-written responses. We suggest a reason for this success: LLMs have learned and consistently deploy a well-liked template for expressing empathy. We develop a taxonomy of 10 empathic language "tactics" that include validating someone's feelings and paraphrasing, and apply this taxonomy to characterize the language that people and LLMs produce when writing empathic responses. Across a set of 2 studies comparing a total of n = 3,265 AI-generated (by six models) and n = 1,290 human-written responses, we find that LLM responses are highly formulaic at a discourse functional level. We discovered a template (a structured sequence of tactics) that matches between 83-90% of LLM responses (and 60-83% in a held out sample), and when those are matched, covers 81-92% of the response. By contrast, human-written responses are more diverse. We end with a discussion of implications for the future of AI-generated empathy.
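One simple way to make "template match" and "coverage" concrete is sketched below. This is an assumed operationalization, not the authors' procedure: a response is a list of `(tactic, token_count)` segments, a match means the template's tactics appear in order as a subsequence, and coverage is the share of tokens in matched segments. The tactic label `"advice"` is purely illustrative.

```python
def match_template(segments, template):
    """Greedy in-order match of template tactics against response segments.

    segments: list of (tactic_label, token_count)
    template: ordered list of tactic labels
    Returns the indices of matched segments, or None if the template
    cannot be matched in order.
    """
    matched = []
    pos = 0
    for tactic in template:
        while pos < len(segments) and segments[pos][0] != tactic:
            pos += 1
        if pos == len(segments):
            return None
        matched.append(pos)
        pos += 1
    return matched

def coverage(segments, matched):
    """Fraction of response tokens that fall inside matched segments."""
    total = sum(n for _, n in segments)
    if total == 0:
        return 0.0
    return sum(segments[i][1] for i in matched) / total
```

Match rate over a corpus would then be the fraction of responses for which `match_template` does not return `None`.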

[58] What do Language Models Learn and When? The Implicit Curriculum Hypothesis

Emmy Liu,Kaiser Sun,Millicent Li,Isabelle Lee,Lindia Tjuatja,Jen-tse Huang,Graham Neubig

Main category: cs.CL

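TL;DR: This paper proposes the Implicit Curriculum Hypothesis: LLM pretraining follows a predictable, compositional order of skill acquisition. A suite of simple, composable tasks supports the hypothesis. The order in which skills emerge is highly consistent across models, composite tasks usually emerge after their component tasks, and the order can be predicted from internal model representations.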

Details Motivation: Scaling laws on validation loss describe overall improvement but give no fine-grained picture of which capabilities emerge during pretraining and in what order. Method: The authors propose the Implicit Curriculum Hypothesis and design a set of simple, composable tasks spanning retrieval, morphological transformations, coreference, logical reasoning, and mathematics. They track, across 4 model families (410M to 13B parameters), the point at which each task reaches a fixed accuracy threshold, analyze how similarity between task representation vectors relates to training trajectories, and use the task representation space to predict the trajectories of unseen composite tasks. Result: Emergence orderings are highly consistent across models (ρ = .81); composite tasks mostly emerge after their components; similar task representations go with similar training trajectories; and trajectories of new tasks can be predicted from the representation space (R² = .68 to .84). Conclusion: LLM pretraining is not chaotic. It has a compositional structure of skill emergence that is consistent across models and readable from internal representations, beyond what loss curves alone reveal. Abstract: Large language models (LLMs) can perform remarkably complex tasks, yet the fine-grained details of how these capabilities emerge during pretraining remain poorly understood. Scaling laws on validation loss tell us how much a model improves with additional compute, but not what skills it acquires in which order. To remedy this, we propose the Implicit Curriculum Hypothesis: pretraining follows a compositional and predictable curriculum across models and data mixtures. We test this by designing a suite of simple, composable tasks spanning retrieval, morphological transformations, coreference, logical reasoning, and mathematics. Using these tasks, we track emergence points across four model families spanning sizes from 410M to 13B parameters. We find that emergence orderings of when models reach fixed accuracy thresholds are strikingly consistent ($\rho = .81$ across 45 model pairs), and that composite tasks most often emerge after their component tasks. Furthermore, we find that this structure is encoded in model representations: tasks with similar function vector representations also tend to follow similar trajectories in training. By using the space of representations derived from our task set, we can effectively predict the training trajectories of simple held-out compositional tasks throughout the course of pretraining ($R^2 = .68$-$.84$ across models) without previously evaluating them.
Together, these results suggest that pretraining is more structured than loss curves reveal: skills emerge in a compositional order that is consistent across models and readable from their internals.
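The two measurements in this abstract, an emergence point per task and a rank correlation of emergence orderings between models, can be sketched as follows. The 0.8 threshold, the function names, and the checkpoint data in the usage note are assumptions for illustration; the paper's actual thresholds and evaluation code are not given here.

```python
import math

def emergence_step(steps, accuracies, threshold=0.8):
    """First training step at which accuracy reaches the threshold, else None."""
    for step, acc in zip(steps, accuracies):
        if acc >= threshold:
            return step
    return None

def ranks(values):
    """1-based ranks, with ties assigned their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman_rho(x, y):
    """Spearman rank correlation (Pearson correlation of the ranks).

    Assumes neither input is constant, otherwise the denominator is zero.
    """
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)
```

Given per-task emergence steps for two models, for example `[100, 200, 300, 400]` and `[150, 250, 350, 900]`, `spearman_rho` compares only the ordering of tasks, not the absolute step counts.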

[59] Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts

Jiayuan Ye,Vitaly Feldman,Kunal Talwar

Main category: cs.CL

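TL;DR: This paper formalizes fact memorization in large language models from an information-theoretic perspective. It shows that fact accuracy drops when the information in the training-data facts exceeds model capacity or when the fact frequency distribution is skewed. It proposes data selection based on training loss alone, which limits the number of facts and flattens their frequency distribution, and this markedly improves fact memorization in small models.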

Details Motivation: Large language models have limited ability to memorize factual knowledge in their parameters, which leads to hallucinations and poor performance on knowledge-intensive tasks. Method: Fact memorization is formalized from an information-theoretic perspective, the effect of the training data distribution on fact accuracy is analyzed, and data selection schemes based on training loss alone are proposed to limit the number of facts in the training data and flatten their frequency distribution. Result: On semi-synthetic datasets of high-entropy facts, the method raises fact accuracy to the capacity limit. When pretraining on an annotated Wikipedia corpus, a GPT2-Small model (110M parameters) memorizes 1.3X more entity facts than with standard training and matches a 10X larger model (1.3B parameters) pretrained on the full dataset. Conclusion: Fact memorization is bounded by model capacity and by the amount and distribution of factual information in the training data; suitable data selection substantially improves fact memorization in small models, approaching the performance of larger ones. Abstract: Large language models (LLMs) can struggle to memorize factual knowledge in their parameters, often leading to hallucinations and poor performance on knowledge-intensive tasks. In this paper, we formalize fact memorization from an information-theoretic perspective and study how training data distributions affect fact accuracy. We show that fact accuracy is suboptimal (below the capacity limit) whenever the amount of information contained in the training data facts exceeds model capacity. This is further exacerbated when the fact frequency distribution is skewed (e.g., a power law). We propose data selection schemes based on the training loss alone that aim to limit the number of facts in the training data and flatten their frequency distribution. On semi-synthetic datasets containing high-entropy facts, our selection method effectively boosts fact accuracy to the capacity limit. When pretraining language models from scratch on an annotated Wikipedia corpus, our selection method enables a GPT2-Small model (110M parameters) to memorize 1.3X more entity facts compared to standard training, matching the performance of a 10X larger model (1.3B parameters) pretrained on the full dataset.

[60] ClawBench: Can AI Agents Complete Everyday Online Tasks?

Yuxuan Zhang,Yubo Wang,Yipeng Zhu,Penghui Du,Junwen Miao,Xuan Lu,Wendong Xu,Yunzhuo Hao,Songcheng Cai,Xiaochen Wang,Huaisong Zhang,Xian Wu,Yi Lu,Minyi Lei,Kai Zou,Huifeng Yin,Ping Nie,Liang Chen,Dongfu Jiang,Wenhu Chen,Kelsey R. Allen

Main category: cs.CL

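TL;DR: This paper presents ClawBench, an evaluation framework for AI agents on real online platforms. It contains 153 everyday tasks across 144 live websites and 15 categories of life and work scenarios, emphasizing demanding capabilities such as information extraction, multi-step navigation across platforms, and heavy form filling. Experiments show that frontier models still complete few tasks (for example, Claude Sonnet 4.6 reaches only 33.3%), highlighting the difficulty of real-world web interaction.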

Details Motivation: Existing AI agent benchmarks mostly evaluate in offline sandboxes or on static pages and do not reflect real, dynamic, complex online environments. Everyday online tasks (such as shopping, booking appointments, and applying for jobs) are a realistic and still unsolved testbed for general-purpose assistants. Method: ClawBench comprises 153 real, simple, representative everyday tasks spread over 144 live websites and 15 categories. A lightweight interception layer allows agents to be evaluated safely in production environments without real-world side effects. Result: An evaluation of 7 frontier models (proprietary and open-source) shows low completion rates for all of them, with Claude Sonnet 4.6 highest at 33.3%, indicating a significant capability gap in real web interaction. Conclusion: ClawBench fills a gap in evaluating AI agents on real online environments. Its demanding tasks expose weaknesses in information integration, multi-step navigation, and write operations, and progress on the benchmark is a key step toward reliable general-purpose AI assistants. Abstract: AI agents may be able to automate your inbox, but can they automate other routine aspects of your life? Everyday online tasks offer a realistic yet unsolved testbed for evaluating the next generation of AI agents. To this end, we introduce ClawBench, an evaluation framework of 153 simple tasks that people need to accomplish regularly in their lives and work, spanning 144 live platforms across 15 categories, from completing purchases and booking appointments to submitting job applications. These tasks require demanding capabilities beyond existing benchmarks, such as obtaining relevant information from user-provided documents, navigating multi-step workflows across diverse platforms, and write-heavy operations like filling in many detailed forms correctly. Unlike existing benchmarks that evaluate agents in offline sandboxes with static pages, ClawBench operates on production websites, preserving the full complexity, dynamic nature, and challenges of real-world web interaction. A lightweight interception layer captures and blocks only the final submission request, ensuring safe evaluation without real-world side effects. Our evaluations of 7 frontier models show that both proprietary and open-source models can complete only a small portion of these tasks. For example, Claude Sonnet 4.6 achieves only 33.3%. Progress on ClawBench brings us closer to AI agents that can function as reliable general-purpose assistants.
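The idea of an interception layer that captures and blocks only the final submission request can be sketched in a few lines. Everything here is hypothetical (the `Request` and `InterceptionLayer` classes, the POST-plus-exact-URL rule, the example.com URLs); ClawBench's actual mechanism is not described in this abstract.

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    method: str
    url: str
    body: dict = field(default_factory=dict)

@dataclass
class InterceptionLayer:
    """Forward every request except the task's final submission."""
    final_submit_url: str
    captured: list = field(default_factory=list)

    def handle(self, request: Request) -> str:
        is_final = (
            request.method.upper() == "POST"
            and request.url == self.final_submit_url
        )
        if is_final:
            # Keep the payload for scoring, but never send it to the site.
            self.captured.append(request)
            return "blocked"
        return "forwarded"
```

Under this sketch, the captured payload can be compared against the expected form contents to score the task, while ordinary navigation and intermediate requests reach the live site unchanged.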

[61] Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models

Feng Luo,Yu-Neng Chuang,Guanchu Wang,Zicheng Xu,Xiaotian Han,Tianyi Zhang,Vladimir Braverman

Main category: cs.CL

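TL;DR: This paper proposes StableOPD, a framework that addresses truncation collapse and training instability in on-policy distillation (OPD) caused by abrupt growth in student rollout length. It combines a reference-based divergence constraint with rollout mixture distillation and improves math reasoning performance.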

Details Motivation: During on-policy distillation (OPD) training, rollout length can inflate abruptly, leading to truncation collapse, repetition saturation, and biased gradients, which cause training instability and a sharp drop in validation performance. Method: StableOPD combines a reference-based divergence constraint with rollout mixture distillation to suppress repetition-induced length inflation and stabilize OPD training. Result: Across multiple math reasoning datasets, StableOPD prevents truncation collapse, stabilizes training dynamics, and improves performance by 7.2% on average. Conclusion: OPD fails because of the interaction between student-induced data collection and the distillation objective; StableOPD mitigates this with its two mechanisms, improving robustness and generalization. Abstract: On-policy distillation (OPD) trains student models under their own induced distribution while leveraging supervision from stronger teachers. We identify a failure mode of OPD: as training progresses, on-policy rollouts can undergo abrupt length inflation, causing truncated trajectories to dominate the training data. This truncation collapse coincides with abrupt repetition saturation and induces biased gradient signals, leading to severe training instability and sharp degradation in validation performance. We attribute this problem to the interaction between student-induced data collection and the distillation objective, which implicitly favors long and repetitive rollouts. To address this issue, we propose StableOPD, a stabilized OPD framework that combines a reference-based divergence constraint with rollout mixture distillation. These together mitigate repetition-induced length inflation and further stabilize OPD training. Across multiple math reasoning datasets, our approach prevents truncation collapse, stabilizes training dynamics, and improves performance by 7.2% on average.
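The failure mode described above (length inflation, truncation, repetition) can be watched for with two simple diagnostics. This sketch only monitors the symptoms; it does not implement StableOPD's divergence constraint or rollout mixture distillation. The function names and the n-gram definition of repetition are assumptions.

```python
def truncation_rate(lengths, max_len):
    """Fraction of rollouts that hit the generation length limit."""
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n >= max_len) / len(lengths)

def repetition_ratio(tokens, n=3):
    """Share of n-grams that duplicate an earlier n-gram in the same rollout."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)
```

Logging both values per training step would make an abrupt jump in truncated or repetitive rollouts visible before validation performance degrades.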

cs.CV [Back]

[62] FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios

Xiangru Jian,Hao Xu,Wei Pang,Xinjian Zhao,Chengyu Tao,Qixin Zhang,Xikun Zhang,Chao Zhang,Guanzhi Deng,Alex Xue,Juan Du,Tianshu Yu,Garth Tarr,Linqi Song,Qiuzhuang Sun,Dacheng Tao

Main category: cs.CV

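TL;DR: This paper presents FORGE, an evaluation framework for multimodal large language models in manufacturing. Using a high-quality multimodal dataset (2D images, 3D point clouds, and fine-grained semantic annotations), it shows that the bottleneck for current MLLMs on manufacturing tasks is insufficient domain knowledge rather than visual grounding, and it validates fine-tuning a small model on this data.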

Details Motivation: Existing MLLM evaluations do not reflect the rigorous demands of real manufacturing environments and are limited by data scarcity and the lack of fine-grained domain semantic annotations. Method: The authors build a high-quality multimodal dataset combining real 2D images and 3D point clouds with fine-grained domain semantics (such as exact model numbers), evaluate 18 SOTA MLLMs on three manufacturing tasks, carry out a bottleneck analysis, and run supervised fine-tuning experiments to test the data's practical value. Result: Insufficient domain knowledge, not visual grounding, is the main bottleneck; fine-tuning a 3B-parameter model yields up to 90.8% relative improvement in accuracy on held-out manufacturing scenarios. Conclusion: Work on manufacturing MLLMs should focus on injecting and modeling domain expertise; FORGE provides both an evaluation benchmark and a usable training resource. Abstract: The manufacturing sector is increasingly adopting Multimodal Large Language Models (MLLMs) to transition from simple perception to autonomous execution, yet current evaluations fail to reflect the rigorous demands of real-world manufacturing environments. Progress is hindered by data scarcity and a lack of fine-grained domain semantics in existing datasets. To bridge this gap, we introduce FORGE. We first construct a high-quality multimodal dataset that combines real-world 2D images and 3D point clouds, annotated with fine-grained domain semantics (e.g., exact model numbers). We then evaluate 18 state-of-the-art MLLMs across three manufacturing tasks, namely workpiece verification, structural surface inspection, and assembly verification, revealing significant performance gaps. Counter to conventional understanding, the bottleneck analysis shows that visual grounding is not the primary limiting factor. Instead, insufficient domain-specific knowledge is the key bottleneck, setting a clear direction for future research. Beyond evaluation, we show that our structured annotations can serve as an actionable training resource: supervised fine-tuning of a compact 3B-parameter model on our data yields up to 90.8% relative improvement in accuracy on held-out manufacturing scenarios, providing preliminary evidence for a practical pathway toward domain-adapted manufacturing MLLMs. The code and datasets are available at https://ai4manufacturing.github.io/forge-web.

[63] Personalizing Text-to-Image Generation to Individual Taste

Anne-Sofie Maerten,Juliane Verwiebe,Shyamgopal Karthik,Ameya Prabhu,Johan Wagemans,Matthias Bethge

Main category: cs.CV

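TL;DR: This paper introduces the PAMELA framework and a new dataset for modeling personalized image aesthetic evaluation. It substantially improves the prediction of individual preferences and supports personalized text-to-image generation through prompt optimization.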

Details Motivation: Existing text-to-image models and reward models optimize only for "average" human preference and cannot capture the subjectivity of aesthetic judgment, so individual differences need to be modeled. Method: The authors build a personalized evaluation dataset with 70,000 ratings covering 5,000 AI-generated images, each rated by 15 users. They propose a personalized reward model trained jointly on these high-quality subjective annotations and existing aesthetic assessment subsets, and combine it with prompt optimization to steer generation toward individual preferences. Result: The personalized reward model predicts individual liking more accurately than most existing methods predict population-level preferences, and simple prompt optimization effectively improves the match between generated images and individual preferences. Conclusion: Data quality and personalized modeling are essential for handling aesthetic subjectivity in text-to-image generation; PAMELA provides a standardized benchmark and open resources for personalized T2I alignment and subjective visual quality assessment. Abstract: Modern text-to-image (T2I) models generate high-fidelity visuals but remain indifferent to individual user preferences. While existing reward models optimize for "average" human appeal, they fail to capture the inherent subjectivity of aesthetic judgment. In this work, we introduce a novel dataset and predictive framework, called PAMELA, designed to model personalized image evaluations. Our dataset comprises 70,000 ratings across 5,000 diverse images generated by state-of-the-art models (Flux 2 and Nano Banana). Each image is evaluated by 15 unique users, providing a rich distribution of subjective preferences across domains such as art, design, fashion, and cinematic photography. Leveraging this data, we propose a personalized reward model trained jointly on our high-quality annotations and existing aesthetic assessment subsets. We demonstrate that our model predicts individual liking with higher accuracy than the majority of current state-of-the-art methods predict population-level preferences. Using our personalized predictor, we demonstrate how simple prompt optimization methods can be used to steer generations towards individual user preferences. Our results highlight the importance of data quality and personalization to handle the subjectivity of user preferences. We release our dataset and model to facilitate standardized research in personalized T2I alignment and subjective visual quality assessment.

[64] GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents

Mingyu Ouyang,Siyuan Hu,Kevin Qinghong Lin,Hwee Tou Ng,Mike Zheng Shou

Main category: cs.CV

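TL;DR: This paper presents GameWorld, a benchmark for standardized and verifiable evaluation of multimodal large language models (MLLMs) as generalist game agents in browser environments. It covers 34 games and 170 tasks and shows that current models remain far from human-level play.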

Details Motivation: MLLM agents face high latency, sparse feedback, and irreversible mistakes in real-world interaction. Video games are an ideal testbed, but a unified, verifiable evaluation benchmark has been missing. Method: The authors build the GameWorld benchmark, define two agent interfaces (computer-use and semantic action parsing), provide state-verifiable metrics, and run multi-dimensional experiments (such as rerun robustness, real-time interaction, memory sensitivity, and action validity analysis). Result: Experiments on 18 model-interface pairs show that even the best agent is far from human-level play; the benchmark itself is robust and reproducible. Conclusion: GameWorld provides a standardized, verifiable, and reproducible evaluation framework for research on multimodal game agents and supports progress toward embodied generalist intelligence. Abstract: Towards an embodied generalist for real-world interaction, Multimodal Large Language Model (MLLM) agents still suffer from challenging latency, sparse feedback, and irreversible mistakes. Video games offer an ideal testbed with rich visual observations and closed-loop interaction, demanding fine-grained perception, long-horizon planning, and precise control. However, systematically evaluating these capabilities is currently hindered by heterogeneous action interfaces and heuristic verification. To this end, we introduce GameWorld, a benchmark designed for standardized and verifiable evaluation of MLLMs as generalist game agents in browser environments. Two game agent interfaces are studied: (i) computer-use agents that directly emit keyboard and mouse controls, and (ii) generalist multimodal agents that act in a semantic action space via deterministic Semantic Action Parsing. GameWorld contains 34 diverse games and 170 tasks, each paired with state-verifiable metrics for outcome-based evaluation. The results across 18 model-interface pairs suggest that even the best performing agent is far from achieving human capabilities on video games. Extensive experiments of repeated full-benchmark reruns demonstrate the robustness of the benchmark, while further studies on real-time interaction, context-memory sensitivity, and action validity expose more challenges ahead for game agents. Together, by offering a standardized, verifiable, and reproducible evaluation framework, GameWorld lays a robust foundation for advancing research on multimodal game agents and beyond. The project page is at https://gameworld-bench.github.io.

[65] HY-Embodied-0.5: Embodied Foundation Models for Real-World Agents

Tencent Robotics X,HY Vision Team,Xumin Yu,Zuyan Liu,Ziyi Wang,He Zhang,Yongming Rao,Fangfu Liu,Yani Zhang,Ruowen Zhao,Oran Wang,Yves Liang,Haitao Lin,Minghui Wang,Yubo Dong,Kevin Cheng,Bolin Ni,Rui Huang,Han Hu,Zhengyou Zhang,Linus,Shunyu Yao

Main category: cs.CV

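TL;DR: This paper introduces HY-Embodied-0.5, a family of foundation models designed for real-world embodied agents. With a MoT architecture and an iterative, self-evolving post-training paradigm, the models improve spatial and temporal visual perception and embodied reasoning, and achieve strong results on a range of benchmarks and on real robot control tasks.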

Details Motivation: To bridge the gap between general Vision-Language Models (VLMs) and the needs of real embodied agents by strengthening the core capabilities they require: spatial and temporal visual perception, and advanced embodied reasoning for prediction, interaction, and planning. Method: A Mixture-of-Transformers (MoT) architecture supports modality-specific computation and uses latent tokens to enhance perceptual representations; an iterative, self-evolving post-training paradigm improves reasoning; on-policy distillation transfers the large model's capabilities to the small one. Result: On 22 benchmarks covering visual perception, spatial reasoning, and embodied understanding, the MoT-2B model outperforms similarly sized SOTA models on 16 benchmarks, the 32B model is comparable to Gemini 3.0 Pro, and a downstream VLA model performs well in real physical robot experiments. Conclusion: The HY-Embodied-0.5 models improve the multi-dimensional capabilities embodied agents need, balancing edge deployment efficiency with complex reasoning, and the open-sourced code and models support the community. Abstract: We introduce HY-Embodied-0.5, a family of foundation models specifically designed for real-world embodied agents. To bridge the gap between general Vision-Language Models (VLMs) and the demands of embodied agents, our models are developed to enhance the core capabilities required by embodied intelligence: spatial and temporal visual perception, alongside advanced embodied reasoning for prediction, interaction, and planning. The HY-Embodied-0.5 suite comprises two primary variants: an efficient model with 2B activated parameters designed for edge deployment, and a powerful model with 32B activated parameters targeted for complex reasoning. To support the fine-grained visual perception essential for embodied tasks, we adopt a Mixture-of-Transformers (MoT) architecture to enable modality-specific computing. By incorporating latent tokens, this design effectively enhances the perceptual representation of the models. To improve reasoning capabilities, we introduce an iterative, self-evolving post-training paradigm. Furthermore, we employ on-policy distillation to transfer the advanced capabilities of the large model to the smaller variant, thereby maximizing the performance potential of the compact model. Extensive evaluations across 22 benchmarks, spanning visual perception, spatial reasoning, and embodied understanding, demonstrate the effectiveness of our approach. Our MoT-2B model outperforms similarly sized state-of-the-art models on 16 benchmarks, while the 32B variant achieves performance comparable to frontier models such as Gemini 3.0 Pro.
In downstream robot control experiments, we leverage our robust VLM foundation to train an effective Vision-Language-Action (VLA) model, achieving compelling results in real-world physical evaluations. Code and models are open-sourced at https://github.com/Tencent-Hunyuan/HY-Embodied.

[66] SMFD-UNet: Semantic Face Mask Is The Only Thing You Need To Deblur Faces

Abduz Zami

Main category: cs.CV

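TL;DR: This paper proposes SMFD-UNet, a lightweight face image deblurring framework that uses semantic face masks to guide deblurring without high-quality reference images. It improves PSNR and SSIM and outperforms existing methods on the CelebA dataset.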

Details Motivation: Traditional deblurring methods rely on generic image priors and struggle to capture the structure and identity-specific features of faces; most also need high-quality reference images, which limits practical use. Method: SMFD-UNet first uses a UNet to generate fine-grained semantic masks of facial components (such as eyes, nose, and mouth) from the blurry image, then fuses the masks with the blurry input through a multi-stage feature fusion strategy inside a lightweight UNet to reconstruct a sharp face. A randomized blurring pipeline simulating about 1.74 trillion degradation scenarios improves robustness, and the design integrates RDC modules, CBAM attention, efficient upsampling, and post-processing. Result: On CelebA, SMFD-UNet exceeds SOTA methods on PSNR and SSIM while keeping good naturalness scores (NIQE, LPIPS, FID). Conclusion: SMFD-UNet is an efficient, lightweight, and robust face deblurring method that balances performance and practicality and suits applications such as face recognition, forensic analysis, and medical imaging. Abstract: Facial image deblurring is an essential task in computer vision that restores high-quality images from blurry inputs, with applications including facial identification, forensic analysis, photographic improvement, and medical imaging diagnostics. Traditional deblurring techniques, often based on general image priors, find it difficult to capture the particular structural and identity-specific features of human faces. To address these difficulties, we present SMFD-UNet (Semantic Mask Fusion Deblurring UNet), a new lightweight framework that uses semantic face masks to guide the deblurring process, removing the need for high-quality reference photos. First, our dual-step method uses a UNet-based semantic mask generator to extract detailed facial component masks (e.g., eyes, nose, mouth) directly from blurry photos. Sharp, high-fidelity facial images are subsequently produced by integrating these masks with the blurry input using a multi-stage feature fusion technique within a computationally efficient UNet framework. To improve robustness, we created a randomized blurring pipeline that approximates real-world conditions by simulating around 1.74 trillion degradation scenarios. Evaluated on the CelebA dataset, SMFD-UNet shows better performance than state-of-the-art models, attaining higher Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) while preserving satisfactory naturalness measures, including NIQE, LPIPS, and FID.
SMFD-UNet builds on Residual Dense Convolution (RDC) blocks, a multi-stage feature fusion strategy, efficient upsampling, attention mechanisms such as CBAM, and post-processing. Its lightweight design supports scalability and efficiency, making SMFD-UNet a flexible option for facial image restoration research and practical applications.

[67] Training-free Spatially Grounded Geometric Shape Encoding (Technical Report)

Yuhang He

Main category: cs.CV

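TL;DR: This paper proposes XShapeEnc, a training-free, general-purpose encoding for 2D geometric shapes. Using orthogonal Zernike bases and a frequency-propagation operation, it produces a compact encoding of shape geometry and pose that is invertible, adaptive, and rich in high-frequency content.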

Details Motivation: Conventional positional encoding is designed for 1D sequences and does not extend directly to 2D spatial geometric shapes; an encoding must account for shape geometry, pose, and compatibility with neural networks. Method: A 2D shape is decomposed into normalized geometry within the unit disk and a pose vector; the pose is converted into a harmonic pose field within the unit disk; orthogonal Zernike bases encode geometry and pose independently or jointly, and frequency propagation adds high-frequency information. Result: XShapeEnc has five favorable properties, including invertibility, adaptivity, and frequency richness, and its theoretical validity, efficiency, discriminability, and applicability are verified on a variety of shape-aware tasks and on the self-curated XShapeCorpus. Conclusion: XShapeEnc is a foundational encoding tool for 2D spatial intelligence and may help move research beyond 1D sequential data. Abstract: Positional encoding has become the de facto standard for grounding deep neural networks on discrete point-wise positions, and it has achieved remarkable success in tasks where the input can be represented as a one-dimensional sequence. However, extending this concept to 2D spatial geometric shapes demands carefully designed encoding strategies that account not only for shape geometry and pose, but also for compatibility with neural network learning. In this work, we address these challenges by introducing a training-free, general-purpose encoding strategy, dubbed XShapeEnc, that encodes an arbitrary spatially grounded 2D geometric shape into a compact representation exhibiting five favorable properties, including invertibility, adaptivity, and frequency richness. Specifically, a 2D spatially grounded geometric shape is decomposed into its normalized geometry within the unit disk and its pose vector, where the pose is further transformed into a harmonic pose field that also lies within the unit disk. A set of orthogonal Zernike bases is constructed to encode shape geometry and pose either independently or jointly, followed by a frequency-propagation operation to introduce high-frequency content into the encoding. We demonstrate the theoretical validity, efficiency, discriminability, and applicability of XShapeEnc via extensive analysis and experiments across a wide range of shape-aware tasks and our self-curated XShapeCorpus. We envision XShapeEnc as a foundational tool for research that goes beyond one-dimensional sequential data toward frontier 2D spatial intelligence.
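For readers unfamiliar with the Zernike bases mentioned above, the sketch below evaluates the standard real-valued Zernike polynomials on the unit disk from their textbook definition. It shows only the basis functions; how XShapeEnc projects shapes onto them and performs frequency propagation is not covered, and the function names are assumptions.

```python
import math

def zernike_radial(n, m, rho):
    """Radial polynomial R_n^|m|(rho) for 0 <= rho <= 1.

    Zero unless |m| <= n and n - |m| is even.
    """
    m = abs(m)
    if m > n or (n - m) % 2:
        return 0.0
    total = 0.0
    for k in range((n - m) // 2 + 1):
        num = (-1) ** k * math.factorial(n - k)
        den = (
            math.factorial(k)
            * math.factorial((n + m) // 2 - k)
            * math.factorial((n - m) // 2 - k)
        )
        total += num / den * rho ** (n - 2 * k)
    return total

def zernike(n, m, rho, theta):
    """Real Zernike polynomial: cosine for m >= 0, sine for m < 0."""
    r = zernike_radial(n, m, rho)
    if m >= 0:
        return r * math.cos(m * theta)
    return r * math.sin(-m * theta)
```

For example, `zernike_radial(2, 0, rho)` reduces to `2 * rho**2 - 1`, the familiar defocus term, and every radial polynomial equals 1 at `rho = 1`.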

[68] On the Uphill Battle of Image frequency Analysis

Nader Bazyari,Hedieh Sajedi

Main category: cs.CV

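TL;DR: This paper formulates a special case of the Inverse Square Mean Shift algorithm for non-homogeneous data and analyzes images with the three-dimensional Fast Fourier Transform to find hidden patterns.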

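Details Motivation: To handle non-homogeneous data and uncover hidden patterns in images, the Inverse Square Mean Shift algorithm is extended. Method: A special case of the Inverse Square Mean Shift algorithm is constructed for non-homogeneous data, and the three-dimensional Fast Fourier Transform is applied to image analysis. Result: A variant of the algorithm for non-homogeneous data is obtained, and the potential of the 3D FFT for identifying hidden patterns in images is given a preliminary exploration. Conclusion: The work broadens the applicability of the Inverse Square Mean Shift algorithm and suggests that the 3D FFT can serve as a useful tool for analyzing hidden structure in images. Abstract: This work is a follow-up on the newly proposed clustering algorithm called The Inverse Square Mean Shift Algorithm. In this paper a special case of the algorithm for dealing with non-homogeneous data is formulated and the three-dimensional Fast Fourier Transform of images is investigated with the aim of finding hidden patterns.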

[69] Mathematical Analysis of Image Matching Techniques

Oleh Samoilenko

Main category: cs.CV

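TL;DR: This paper analyzes and experimentally evaluates two classical local feature matching algorithms, SIFT and ORB, on satellite imagery, focusing on how the number of keypoints affects matching quality as measured by the Inlier Ratio.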

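Details Motivation: Image matching is a fundamental computer vision problem with direct applications in robotics, remote sensing, and geospatial data analysis, yet the behavior of existing methods on satellite imagery has not been evaluated systematically. Method: SIFT and ORB are compared on a manually constructed dataset of GPS-annotated satellite images using a common pipeline (keypoint detection, descriptor extraction, descriptor matching, and geometric verification via RANSAC with homography estimation), and the relationship between the number of keypoints and the Inlier Ratio is analyzed. Result: The experiments show how the Inlier Ratio of each algorithm changes with the number of keypoints on satellite imagery, which gives a basis for parameter selection in practice. Conclusion: SIFT and ORB perform differently on satellite image matching; more keypoints are not always better, and matching accuracy has to be weighed against computational cost. Abstract: Image matching is a fundamental problem in Computer Vision with direct applications in robotics, remote sensing, and geospatial data analysis. We present an analytical and experimental evaluation of classical local feature-based image matching algorithms on satellite imagery, focusing on the Scale-Invariant Feature Transform (SIFT) and the Oriented FAST and Rotated BRIEF (ORB). Each method is evaluated through a common pipeline: keypoint detection, descriptor extraction, descriptor matching, and geometric verification via RANSAC with homography estimation. Matching quality is assessed using the Inlier Ratio - the fraction of correspondences consistent with the estimated homography. The study uses a manually constructed dataset of GPS-annotated satellite image tiles with intentional overlaps. We examine the impact of the number of extracted keypoints on the resulting Inlier Ratio.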
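
The Inlier Ratio defined in the abstract can be illustrated with a minimal sketch. It assumes a homography has already been estimated (for example by RANSAC) and counts the matches whose reprojection error is below a pixel threshold. The function names, the 3-pixel threshold, and the toy data are hypothetical and are not taken from the paper.

```python
def project(H, point):
    """Apply a 3x3 homography (nested lists) to a 2D point."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return u / w, v / w


def inlier_ratio(H, matches, threshold=3.0):
    """Fraction of matches whose reprojection error is below `threshold` pixels.

    `matches` is a list of ((x1, y1), (x2, y2)) point pairs.
    The threshold value is an assumption chosen for illustration.
    """
    if not matches:
        return 0.0
    inliers = 0
    for src, dst in matches:
        u, v = project(H, src)
        error = ((u - dst[0]) ** 2 + (v - dst[1]) ** 2) ** 0.5
        if error < threshold:
            inliers += 1
    return inliers / len(matches)


# Toy data: a pure translation by (10, 5) and four matches, one of them wrong.
H_shift = [[1.0, 0.0, 10.0], [0.0, 1.0, 5.0], [0.0, 0.0, 1.0]]
toy_matches = [
    ((0.0, 0.0), (10.0, 5.0)),
    ((1.0, 2.0), (11.0, 7.0)),
    ((3.0, 4.0), (13.5, 9.0)),   # 0.5 px off, still counted as an inlier
    ((5.0, 5.0), (40.0, 40.0)),  # gross mismatch
]
print(inlier_ratio(H_shift, toy_matches))  # 0.75
```

In a real pipeline the homography and the inlier mask would normally come from the same RANSAC call; the sketch separates the two steps only to make the definition explicit.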

[70] Event-Level Detection of Surgical Instrument Handovers in Videos with Interpretable Vision Models

Katerina Katsarou,George Zountsas,Karam Tomotaki-Dawoud,Alexander Ehrenhoefer,Paul Chojecki,David Przewozny,Igor Maximilian Sauer,Amira Mouakher,Sebastian Bosse

Main category: cs.CV

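TL;DR: This paper proposes a ViT-LSTM spatiotemporal vision framework for detecting surgical instrument handover events in surgical videos and classifying their direction. It uses multi-task learning and peak detection for event-level recognition and Layer-CAM for interpretability.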

Details Motivation: Automatic monitoring of surgical instrument handovers matters for procedural efficiency and patient safety, but it is difficult because of frequent occlusions, background clutter, and the temporal dynamics of the interaction. Method: A spatiotemporal model combines a Vision Transformer (spatial feature extraction) with a unidirectional LSTM (temporal aggregation); a unified multi-task formulation jointly predicts handover occurrence and direction; discrete handover events are identified by peak detection on the temporal confidence signal; Layer-CAM provides visual attribution. Result: On a kidney transplant dataset, the model reaches an F1 of 0.84 for handover detection and a mean F1 of 0.72 for direction classification, outperforming a single-task variant and a VideoMamba baseline on direction prediction while maintaining comparable detection performance. Conclusion: The framework models the spatiotemporal dynamics of instrument handovers while balancing accuracy and interpretability, offering a technical path toward intraoperative monitoring. Abstract: Reliable monitoring of surgical instrument exchanges is essential for maintaining procedural efficiency and patient safety in the operating room. Automatic detection of instrument handovers in intraoperative video remains challenging due to frequent occlusions, background clutter, and the temporally evolving nature of interaction events. We propose a spatiotemporal vision framework for event-level detection and direction classification of surgical instrument handovers in surgical videos. The model combines a Vision Transformer (ViT) backbone for spatial feature extraction with a unidirectional Long Short-Term Memory (LSTM) network for temporal aggregation. A unified multi-task formulation jointly predicts handover occurrence and interaction direction, enabling consistent modeling of transfer dynamics while avoiding error propagation typical of cascaded pipelines. Predicted confidence scores form a temporal signal over the video, from which discrete handover events are identified via peak detection. Experiments on a dataset of kidney transplant procedures demonstrate strong performance, achieving an F1-score of 0.84 for handover detection and a mean F1-score of 0.72 for direction classification, outperforming both a single-task variant and a VideoMamba-based baseline for direction prediction while maintaining comparable detection performance. To improve interpretability, we employ Layer-CAM attribution to visualize spatial regions driving model decisions, highlighting hand-instrument interaction cues.
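
The step from a per-frame confidence signal to discrete events can be sketched as a simple peak picker. The abstract does not specify the peak-detection rule, so the threshold, the minimum gap between events, and the greedy suppression below are assumptions made for illustration.

```python
def detect_events(scores, threshold=0.5, min_gap=3):
    """Return frame indices of local maxima that are at least `threshold`.

    Peaks closer than `min_gap` frames to a stronger, already accepted peak
    are dropped. Both parameters are illustrative assumptions.
    """
    candidates = []
    for i in range(1, len(scores) - 1):
        is_peak = scores[i] > scores[i - 1] and scores[i] >= scores[i + 1]
        if is_peak and scores[i] >= threshold:
            candidates.append(i)

    # Accept the strongest peaks first, then suppress close neighbours.
    kept = []
    for i in sorted(candidates, key=lambda k: scores[k], reverse=True):
        if all(abs(i - j) >= min_gap for j in kept):
            kept.append(i)
    return sorted(kept)


# Toy confidence signal over 12 frames.
confidence = [0.1, 0.2, 0.7, 0.9, 0.6, 0.8, 0.3, 0.1, 0.1, 0.6, 0.95, 0.4]
print(detect_events(confidence))  # [3, 10]
```

Frame 5 is a local maximum as well, but it lies only two frames from the stronger peak at frame 3, so it is merged into that event.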

[71] MSGL-Transformer: A Multi-Scale Global-Local Transformer for Rodent Social Behavior Recognition

Muhammad Imran Sharif,Doina Caragea

Main category: cs.CV

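TL;DR: This paper proposes MSGL-Transformer, which uses multi-scale global-local attention and a Behavior-Aware Modulation block to improve recognition of rodent social behavior from pose sequences, reaching state-of-the-art results on two public datasets.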

Details Motivation: Manual scoring of rodent behavior is time-consuming and error-prone, so an automatic, robust, and generalizable recognition method is needed. Method: MSGL-Transformer is a lightweight transformer encoder with parallel short-range, medium-range, and global attention branches; a Behavior-Aware Modulation (BAM) block recalibrates the temporal embeddings. Result: On RatSI the model reaches 75.4% mean accuracy (F1 = 0.745), outperforming TCN, LSTM, and others; on CalMS21 it reaches 87.1% accuracy (F1 = 0.8745), a 10.7% improvement over HSTWFormer; the same architecture generalizes across both datasets. Conclusion: MSGL-Transformer shows that multi-scale temporal modeling and behavior-aware feature modulation are effective for fine-grained animal behavior recognition, with good generalization and practical value. Abstract: Recognition of rodent behavior is important for understanding neural and behavioral mechanisms. Traditional manual scoring is time-consuming and prone to human error. We propose MSGL-Transformer, a Multi-Scale Global-Local Transformer for recognizing rodent social behaviors from pose-based temporal sequences. The model employs a lightweight transformer encoder with multi-scale attention to capture motion dynamics across different temporal scales. The architecture integrates parallel short-range, medium-range, and global attention branches to explicitly capture behavior dynamics at multiple temporal scales. We also introduce a Behavior-Aware Modulation (BAM) block, inspired by SE-Networks, which modulates temporal embeddings to emphasize behavior-relevant features prior to attention. We evaluate on two datasets: RatSI (5 behavior classes, 12D pose inputs) and CalMS21 (4 behavior classes, 28D pose inputs). On RatSI, MSGL-Transformer achieves 75.4% mean accuracy and F1-score of 0.745 across nine cross-validation splits, outperforming TCN, LSTM, and Bi-LSTM. On CalMS21, it achieves 87.1% accuracy and F1-score of 0.8745, a +10.7% improvement over HSTWFormer, and outperforms ST-GCN, MS-G3D, CTR-GCN, and STGAT. The same architecture generalizes across both datasets with only input dimensionality and number of classes adjusted.

[72] Bootstrapping Sign Language Annotations with Sign Language Models

Colin Lea,Vasileios Baltatzis,Connor Gillis,Raja Kushalnagar,Lorna Quandt,Leah Findlater

Main category: cs.CV

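TL;DR: This paper proposes a pseudo-annotation pipeline that, in the absence of high-quality annotated data, generates candidate time-stamped annotations for glosses, fingerspelled words, and sign classifiers in sign language video. It also releases a human-annotated gold-standard set and a large volume of pseudo-annotations.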

Details Motivation: AI-driven sign language interpretation is limited by the scarcity of high-quality annotated data; large datasets such as ASL STEM Wiki and FLEURS-ASL contain hundreds of hours of professional signing but are only partially annotated, and the cost of manual annotation leaves them underutilized. Method: A pseudo-annotation pipeline combines sparse outputs from the authors' fingerspelling recognizer and isolated sign recognizer (ISR) with a K-shot large language model (LLM) for contextual reasoning, producing a ranked set of annotations with time intervals for glosses, fingerspelling, and classifiers; simple yet effective baseline fingerspelling and ISR models are also established. Result: The fingerspelling recognizer reaches a 6.7% character error rate (CER) on FSBoard and the ISR model reaches 74% top-1 accuracy on ASL Citizen; a professional interpreter annotated nearly 500 ASL STEM Wiki videos as a gold standard; over 300 hours of pseudo-annotations are released. Conclusion: The pipeline can substantially ease the annotation bottleneck in sign language recognition, and the released human and pseudo-annotations should support data-driven research on sign language understanding. Abstract: AI-driven sign language interpretation is limited by a lack of high-quality annotated data. New datasets including ASL STEM Wiki and FLEURS-ASL contain professional interpreters and 100s of hours of data but remain only partially annotated and thus underutilized, in part due to the prohibitive costs of annotating at this scale. In this work, we develop a pseudo-annotation pipeline that takes signed video and English as input and outputs a ranked set of likely annotations, including time intervals, for glosses, fingerspelled words, and sign classifiers. Our pipeline uses sparse predictions from our fingerspelling recognizer and isolated sign recognizer (ISR), along with a K-Shot LLM approach, to estimate these annotations. In service of this pipeline, we establish simple yet effective baseline fingerspelling and ISR models, achieving state-of-the-art on FSBoard (6.7% CER) and on ASL Citizen datasets (74% top-1 accuracy). To validate and provide a gold-standard benchmark, a professional interpreter annotated nearly 500 videos from ASL STEM Wiki with sequence-level gloss labels containing glosses, classifiers, and fingerspelling signs. These human annotations and over 300 hours of pseudo-annotations are being released in supplemental material.
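
The character error rate (CER) quoted for fingerspelling is conventionally the Levenshtein edit distance between the reference and the hypothesis, divided by the reference length. The sketch below shows that standard definition; whether the paper applies any text normalization before scoring is not stated in the abstract, so none is assumed here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# One inserted character against a 7-character reference gives CER = 1/7.
print(round(cer("zernike", "zernicke"), 4))  # 0.1429
```

Because insertions count as errors, CER can exceed 1.0 when the hypothesis is much longer than the reference.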

[73] VSAS-BENCH: Real-Time Evaluation of Visual Streaming Assistant Models

Pavan Kumar Anasosalu Vasu,Cem Koc,Fartash Faghri,Chun-Liang Li,Bo Feng,Zhengfeng Lai,Meng Cao,Oncel Tuzel,Hadi Pouransari

Main category: cs.CV

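TL;DR: This paper proposes VSAS-Bench, the first benchmark and evaluation framework for visual streaming assistants (streaming VLMs). It introduces temporally dense annotations, synchronous and asynchronous evaluation protocols, and new metrics such as proactiveness and consistency, evaluates several classes of models on the accuracy-latency trade-off, and finds that conventional VLMs can be adapted into strong streaming models without additional training.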

Details Motivation: Existing VLM evaluations mainly target offline video understanding and overlook capabilities that matter in streaming settings, such as response timeliness and output consistency; there is no evaluation framework suited to real-time visual assistants. Method: VSAS-Bench is built with over 18,000 temporally dense annotations across multiple domains and task types; synchronous and asynchronous evaluation protocols are designed; new metrics such as proactiveness and consistency are defined; large-scale model comparisons analyze the effect of design factors such as memory buffer length, memory access policy, and input resolution. Result: Conventional VLMs (e.g., Qwen3-VL-4B), after simple adaptation, outperform the best streaming VLM on the benchmark (Dispider) by 3% under the asynchronous protocol; the experiments also characterize the accuracy-latency trade-off and yield several practical design insights. Conclusion: VSAS-Bench fills a gap in streaming VLM evaluation and pushes visual streaming assistants toward more realistic, robust, and measurable development; it also shows that streaming capability does not necessarily depend on dedicated architectures or training, which suggests new options for model deployment. Abstract: Streaming vision-language models (VLMs) continuously generate responses given an instruction prompt and an online stream of input frames. This is a core mechanism for real-time visual assistants. Existing VLM frameworks predominantly assess models in offline settings. In contrast, the performance of a streaming VLM depends on additional metrics beyond pure video understanding, including proactiveness, which reflects the timeliness of the model's responses, and consistency, which captures the robustness of its responses over time. To address this limitation, we propose VSAS-Bench, a new framework and benchmark for Visual Streaming Assistants. In contrast to prior benchmarks that primarily employ single-turn question answering on video inputs, VSAS-Bench features temporally dense annotations with over 18,000 annotations across diverse input domains and task types. We introduce standardized synchronous and asynchronous evaluation protocols, along with metrics that isolate and measure distinct capabilities of streaming VLMs. Using this framework, we conduct large-scale evaluations of recent video and streaming VLMs, analyzing the accuracy-latency trade-off under key design factors such as memory buffer length, memory access policy, and input resolution, yielding several practical insights. Finally, we show empirically that conventional VLMs can be adapted to streaming settings without additional training, and demonstrate that these adapted models outperform recent streaming VLMs. For example, Qwen3-VL-4B surpasses Dispider, the best streaming VLM on our benchmark, by 3% under the asynchronous protocol. The benchmark and code will be available at https://github.com/apple/ml-vsas-bench.

[74] Monocular Depth Estimation From the Perspective of Feature Restoration: A Diffusion Enhanced Depth Restoration Approach

Huibin Bai,Shuai Li,Hanxiao Zhai,Yanbo Gao,Chong Lv,Yibo Wang,Haipeng Ping,Wei Hua,Xingyu Gao

Main category: cs.CV

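TL;DR: This paper approaches monocular depth estimation from a feature restoration perspective. It introduces an Invertible Transform-enhanced Indirect Diffusion module (InvT-IndDiffusion) and an Auxiliary Viewpoint-based Low-level Feature Enhancement module (AV-LFE), which together improve prediction accuracy.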

Details Motivation: Most monocular depth estimation methods use an encoder-decoder structure, but the limitations of this architecture and the effect of features at different levels on accuracy have not been evaluated systematically; the authors find that the current framework still has substantial headroom if encoder features can be improved. Method: Depth estimation is cast as restoring pretrained encoder features, assuming an ideal feature that yields the ground-truth depth map; the InvT-IndDiffusion module is trained without direct feature supervision, using only indirect supervision from the sparse depth map, and an invertible transform-based decoder satisfying the bi-Lipschitz condition mitigates feature deviation across diffusion steps; a plug-and-play AV-LFE module enhances local details using an auxiliary viewpoint. Result: The method outperforms state-of-the-art methods on several datasets; on the KITTI benchmark, RMSE improves over the baseline by 4.09% and 37.77% under different training settings. Conclusion: The feature restoration perspective is an effective new direction for monocular depth estimation; the proposed InvT-IndDiffusion and AV-LFE modules are general and practical, and the code is open source. Abstract: Monocular Depth Estimation (MDE) is a fundamental computer vision task with important applications in 3D vision. The current mainstream MDE methods employ an encoder-decoder architecture with multi-level/scale feature processing. However, the limitations of the current architecture and the effects of different-level features on the prediction accuracy are not evaluated. In this paper, we first investigate the above problem and show that there is still substantial potential in the current framework if encoder features can be improved. Therefore, we propose to formulate the depth estimation problem from the feature restoration perspective, by treating pretrained encoder features as degraded features of an assumed ground truth feature that yields the ground truth depth map. Then an Invertible Transform-enhanced Indirect Diffusion (InvT-IndDiffusion) module is developed for feature restoration. Due to the absence of direct supervision on feature, only indirect supervision from the final sparse depth map is used. During the iterative procedure of diffusion, this results in feature deviations among steps. The proposed InvT-IndDiffusion solves this problem by using an invertible transform-based decoder under the bi-Lipschitz condition. Finally, a plug-and-play Auxiliary Viewpoint-based Low-level Feature Enhancement module (AV-LFE) is developed to enhance local details with auxiliary viewpoint when available. Experiments demonstrate that the proposed method achieves better performance than the state-of-the-art methods on various datasets. Specifically on the KITTI benchmark, compared with the baseline, the performance is improved by 4.09% and 37.77% under different training settings in terms of RMSE. Code is available at https://github.com/whitehb1/IID-RDepth.

[75] Adaptive Depth-converted-Scale Convolution for Self-supervised Monocular Depth Estimation

Yanbo Gao,Huibin Bai,Huasong Zhou,Xingyu Gao,Shuai Li,Xun Cai,Hui Yuan,Wei Hua,Tian Xie

Main category: cs.CV

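TL;DR: This paper proposes a monocular depth estimation framework enhanced by Depth-converted-Scale Convolution (DcSConv), which builds the prior relationship between object depth and object scale into the convolution receptive field to improve feature extraction, together with a DcS-F module that adaptively fuses the new and conventional features. As a plug-and-play module it improves existing CNN models, with up to an 11.6% improvement in SqRel on KITTI.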

Details Motivation: Existing self-supervised monocular depth estimation methods do not explicitly model how object size changes with depth; in monocular video the size of the same object changes continuously, which creates size-depth ambiguity. Method: Depth-converted-Scale Convolution (DcSConv) adapts the scale of the convolution filter to object depth; a Depth-converted-Scale aware Fusion (DcS-F) module adaptively fuses DcSConv features with conventional convolution features; the whole design is a plug-and-play module for existing CNN architectures. Result: The method achieves the best results on the KITTI benchmark, with up to an 11.6% reduction in SqRel; ablation studies validate both DcSConv and DcS-F. Conclusion: The scale of the convolution filter matters for monocular depth estimation at least as much as its local deformation; incorporating the depth-scale prior can markedly improve self-supervised monocular depth estimation. Abstract: Self-supervised monocular depth estimation (MDE) has received increasing interest in the last few years. The objects in the scene, including the object size and relationship among different objects, are the main clues to extract the scene structure. However, previous works lack the explicit handling of the changing sizes of the object due to the change of its depth. Especially in a monocular video, the size of the same object is continuously changed, resulting in size and depth ambiguity. To address this problem, we propose a Depth-converted-Scale Convolution (DcSConv) enhanced monocular depth estimation framework, by incorporating the prior relationship between the object depth and object scale to extract features from appropriate scales of the convolution receptive field. The proposed DcSConv focuses on the adaptive scale of the convolution filter instead of the local deformation of its shape. It establishes that the scale of the convolution filter matters no less (or even more in the evaluated task) than its local deformation. Moreover, a Depth-converted-Scale aware Fusion (DcS-F) is developed to adaptively fuse the DcSConv features and the conventional convolution features. Our DcSConv enhanced monocular depth estimation framework can be applied on top of existing CNN-based methods as a plug-and-play module to enhance the conventional convolution block. Extensive experiments with different baselines have been conducted on the KITTI benchmark and our method achieves the best results with an improvement up to 11.6% in terms of SqRel reduction. Ablation study also validates the effectiveness of each proposed module.

[76] Weight Group-wise Post-Training Quantization for Medical Foundation Model

Yineng Chen,Peng Huang,Aozhong Zhang,Hui Guo,Penghang Yin,Shu Hu,Shao Lin,Xin Li,Tzu-Jen Kao,Balakrishnan Prabhakaran,MingChing Chang,Xin Wang

Main category: cs.CV

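TL;DR: This paper proposes Permutation-COMQ, a backpropagation-free post-training quantization algorithm that improves low-bit quantization accuracy by reordering weights and using simple operations, achieving the best results at 2, 4, and 8 bits.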

Details Motivation: Foundation models perform well in medical image analysis, but their large network architectures and high computational complexity limit real-time inference on terminal medical devices. Method: The post-training quantization algorithm Permutation-COMQ replaces backpropagation with dot products and rounding operations, which avoids hyperparameter tuning; a weight-aware strategy reorders the weights within each layer to reduce the accuracy loss caused by channel-wise scaling while preserving channel structure. Result: The method achieves the best performance in 2-bit, 4-bit, and 8-bit quantization. Conclusion: Permutation-COMQ is an efficient, simple, and accurate low-bit quantization method suited to deployment on resource-constrained terminal medical devices. Abstract: Foundation models have achieved remarkable results in medical image analysis. However, their large network architecture and high computational complexity significantly impact inference speed, limiting their application on terminal medical devices. Quantization, a technique that compresses models into low-bit versions, is a solution to this challenge. In this paper, we propose a post-training quantization algorithm, Permutation-COMQ. It eliminates the need for backpropagation by using simple dot products and rounding operations, thereby removing hyperparameter tuning and simplifying the process. Additionally, we introduce a weight-aware strategy that reorders the weight within each layer to address the accuracy degradation induced by channel-wise scaling during quantization, while preserving channel structure. Experiments demonstrate that our method achieves the best results in 2-bit, 4-bit, and 8-bit quantization.
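
For context, the channel-wise scaling that the abstract mentions can be illustrated with plain symmetric round-to-nearest quantization, where each channel gets one scale. This is a generic baseline sketch, not Permutation-COMQ: it contains no weight reordering and no coordinate-wise refinement, and all names and values are hypothetical.

```python
def quantize_channel(weights, bits=4):
    """Symmetric round-to-nearest quantization of one channel.

    Returns (integer codes, scale). Codes lie in [-qmax, qmax] where
    qmax = 2**(bits - 1) - 1. Generic baseline, not the paper's algorithm.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_channel(codes, scale):
    """Map integer codes back to approximate real-valued weights."""
    return [c * scale for c in codes]


# Toy channel chosen so that the 4-bit scale is exactly 1.0.
channel = [7.0, -3.2, 1.1, 0.0]
codes, scale = quantize_channel(channel, bits=4)
print(codes, scale)  # [7, -3, 1, 0] 1.0

# Round-to-nearest keeps every reconstruction error within half a step.
restored = dequantize_channel(codes, scale)
print(max(abs(w - r) for w, r in zip(channel, restored)) <= scale / 2)  # True
```

With one scale per channel, a single large weight stretches the step size for every other weight in that channel, which is one way channel-wise scaling can degrade accuracy at low bit widths.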

[77] FireSenseNet: A Dual-Branch CNN with Cross-Attentive Feature Interaction for Next-Day Wildfire Spread Prediction

Jinzhen Han,JinByeong Lee,Hak Han,YeonJu Na,Jae-Joon Lee

Main category: cs.CV

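TL;DR: This paper proposes FireSenseNet, a dual-branch CNN whose CAFIM module explicitly models the spatial interaction between fuel and weather features. It reaches state-of-the-art performance on next-day wildfire spread prediction and exposes evaluation bias and the relative importance of key features.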

Details Motivation: Existing deep learning methods simply concatenate heterogeneous geospatial inputs and ignore the physical distinction between static (fuel/terrain) and dynamic (meteorological) variables, which leads to inaccurate modeling. Method: FireSenseNet is a dual-branch CNN with a Cross-Attentive Feature Interaction Module (CAFIM) that models the spatially varying interaction between fuel and weather modalities through learnable attention gates at multiple encoder scales; Monte Carlo Dropout provides pixel-level uncertainty quantification. Result: On the Google Next-Day Wildfire Spread benchmark the model reaches an F1 of 0.4176 and an AUC-PR of 0.3435, outperforming six other architectures including a SegFormer with 3.8× more parameters; ablation shows that CAFIM gives a 7.1% relative F1 gain; the previous-day fire mask dominates prediction, while wind speed acts as noise at the coarse temporal resolution; common evaluation shortcuts inflate F1 by over 44%. Conclusion: Explicitly separating static and dynamic geographic variables and modeling their interaction is key to better wildfire prediction; CAFIM strengthens cross-modal feature fusion; evaluation should avoid shortcuts that leak data, and uncertainty quantification is important for disaster response. Abstract: Accurate prediction of next-day wildfire spread is critical for disaster response and resource allocation. Existing deep learning approaches typically concatenate heterogeneous geospatial inputs into a single tensor, ignoring the fundamental physical distinction between static fuel/terrain properties and dynamic meteorological conditions. We propose FireSenseNet, a dual-branch convolutional neural network equipped with a novel Cross-Attentive Feature Interaction Module (CAFIM) that explicitly models the spatially varying interaction between fuel and weather modalities through learnable attention gates at multiple encoder scales. Through a systematic comparison of seven architectures -- spanning pure CNNs, Vision Transformers, and hybrid designs -- on the Google Next-Day Wildfire Spread benchmark, we demonstrate that FireSenseNet achieves an F1 of 0.4176 and AUC-PR of 0.3435, outperforming all alternatives including a SegFormer with 3.8× more parameters (F1 = 0.3502). Ablation studies confirm that CAFIM provides a 7.1% relative F1 gain over naive concatenation, and channel-wise feature importance analysis reveals that the previous-day fire mask dominates prediction while wind speed acts as noise at the dataset's coarse temporal resolution. We further incorporate Monte Carlo Dropout for pixel-level uncertainty quantification and present a critical analysis showing that common evaluation shortcuts inflate reported F1 scores by over 44%.

[78] Needle in a Haystack -- One-Class Representation Learning for Detecting Rare Malignant Cells in Computational Cytology

Swarnadip Chatterjee,Vladimir Basic,Arrigo Capitanio,Orcun Goksel,Joakim Lindblad

Main category: cs.CV

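TL;DR: This paper applies one-class representation learning (OCC) methods such as DSVDD and DROC to cell abnormality detection at extremely low malignant-cell fractions (≤1%). Training uses only negative samples and no instance-level annotation, and the approach reaches state-of-the-art performance on bone marrow and oral cancer cytology datasets, outperforming conventional MIL and some supervised methods.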

Details Motivation: Malignant cells in whole-slide images are morphologically diverse and extremely rare, which causes severe class imbalance and a shortage of annotations; conventional weakly supervised methods such as MIL struggle to generalize at the instance level when the witness rate is very low. Method: One-class representation learning (OCC) methods trained only on slide-negative patches are used; DSVDD and DROC are evaluated and compared with FS-SIL, WS-SIL, and ItS2CLR; the methods learn a compact representation of normality and detect deviations at test time. Result: On the TCIA bone marrow dataset and an in-house oral cancer cytology dataset, DSVDD achieves state-of-the-art instance-level abnormality ranking at ultra-low witness rates (≤1%) and in some cases even outperforms fully supervised learning; DROC, which uses distribution-augmented contrastive learning, is also competitive under extreme rarity. Conclusion: For detecting extremely rare malignant cells, one-class representation learning is a more robust and more interpretable choice than multiple instance learning. Abstract: In computational cytology, detecting malignancy on whole-slide images is difficult because malignant cells are morphologically diverse yet vanishingly rare amid a vast background of normal cells. Accurate detection of these extremely rare malignant cells remains challenging due to large class imbalance and limited annotations. Conventional weakly supervised approaches, such as multiple instance learning (MIL), often fail to generalize at the instance level, especially when the fraction of malignant cells (witness rate) is exceedingly low. In this study, we explore the use of one-class representation learning techniques for detecting malignant cells in low-witness-rate scenarios. These methods are trained exclusively on slide-negative patches, without requiring any instance-level supervision. Specifically, we evaluate two OCC approaches, DSVDD and DROC, and compare them with FS-SIL, WS-SIL, and the recent ItS2CLR method. The one-class methods learn compact representations of normality and detect deviations at test time. Experiments on a publicly available bone marrow cytomorphology dataset (TCIA) and an in-house oral cancer cytology dataset show that DSVDD achieves state-of-the-art performance in instance-level abnormality ranking, particularly in ultra-low witness-rate regimes ($\leq 1\%$) and, in some cases, even outperforming fully supervised learning, which is typically not a practical option in whole-slide cytology due to the infeasibility of exhaustive instance-level annotations. DROC is also competitive under extreme rarity, benefiting from distribution-augmented contrastive learning. These findings highlight one-class representation learning as a robust and interpretable superior choice to MIL for malignant cell detection under extreme rarity.
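
The idea of learning a compact representation of normality and flagging deviations can be sketched in the spirit of Deep SVDD scoring: fix a center from normal embeddings and rank test instances by squared distance to it. The sketch assumes the embeddings are already given; actual Deep SVDD also trains the encoder so that normal samples map close to the center, which is omitted here. All names and data are hypothetical.

```python
def center_of(embeddings):
    """Mean vector of the normal (slide-negative) embeddings."""
    n = len(embeddings)
    dim = len(embeddings[0])
    return [sum(e[d] for e in embeddings) / n for d in range(dim)]


def anomaly_score(embedding, center):
    """Squared Euclidean distance to the center; larger means more abnormal."""
    return sum((x - c) ** 2 for x, c in zip(embedding, center))


def rank_by_abnormality(test_embeddings, center):
    """Indices of test instances, most abnormal first."""
    scores = [anomaly_score(e, center) for e in test_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy 2D embeddings: four normal cells and three test cells.
normal_cells = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
center = center_of(normal_cells)
test_cells = [[1.0, 1.5], [6.0, 5.0], [0.0, 1.0]]
print(center)                                   # [1.0, 1.0]
print(rank_by_abnormality(test_cells, center))  # [1, 2, 0]
```

Only the ranking matters for instance-level evaluation, so no decision threshold is needed in this sketch.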

[79] Direct Segmentation without Logits Optimization for Training-Free Open-Vocabulary Semantic Segmentation

Jiahao Li,Yang Lu,Yachao Zhang,Fangyong Wang,Yuan Xie,Yanyun Qu

Main category: cs.CV

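TL;DR: This paper proposes a new open-vocabulary semantic segmentation method that needs neither iterative training nor model-specific attention modulation. It obtains semantic maps by deriving the distribution discrepancy analytically and reaches state-of-the-art performance on eight benchmark datasets.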

Details Motivation: Existing open-vocabulary semantic segmentation methods rely on time-consuming iterative training or model-specific attention modulation to minimize the distribution discrepancy between vision-language logits and the ground truth, which limits efficiency and generality. Method: The key hypothesis is that the distribution discrepancy between the logits and the ground truth encodes category semantics, being consistent across patches of the same category and inconsistent across categories; the analytic solution of this discrepancy is therefore used directly as the semantic segmentation map, skipping the usual logits optimization. Result: The method reaches state-of-the-art performance on eight benchmark datasets while avoiding iterative training and model-specific attention modulation. Conclusion: The distribution discrepancy is itself semantically discriminative and its analytic solution can be used directly for segmentation, giving OVSS a more efficient and general paradigm. Abstract: Open-vocabulary semantic segmentation (OVSS) aims to segment arbitrary category regions in images using open-vocabulary prompts, necessitating that existing methods possess pixel-level vision-language alignment capability. Typically, this capability involves computing the cosine similarity, i.e., logits, between visual and linguistic features, and minimizing the distribution discrepancy between the logits and the ground truth (GT) to generate optimal logits that are subsequently used to construct segmentation maps, yet it depends on time-consuming iterative training or model-specific attention modulation. In this work, we propose a more direct approach that eschews the logits-optimization process by directly deriving an analytic solution for the segmentation map. We posit a key hypothesis: the distribution discrepancy encodes semantic information; specifically, this discrepancy exhibits consistency across patches belonging to the same category but inconsistency across different categories. Based on this hypothesis, we directly utilize the analytic solution of this distribution discrepancy as the semantic maps. In other words, we reformulate the optimization of the distribution discrepancy as deriving its analytic solution, thereby eliminating time-consuming iterative training, freeing us from model-specific attention modulation, and achieving state-of-the-art performance on eight benchmark datasets.
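
The conventional starting point described in the abstract, cosine-similarity logits between patch features and text features followed by a per-patch argmax, can be sketched as below. This shows only that baseline step, not the paper's analytic solution; the feature vectors are made-up toy values.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def segment(patch_features, text_features):
    """Assign each patch the category whose text feature is most similar."""
    labels = []
    for patch in patch_features:
        logits = [cosine(patch, text) for text in text_features]
        labels.append(max(range(len(logits)), key=lambda k: logits[k]))
    return labels


# Two hypothetical category prompts and three image patches in a 2D space.
text_features = [[1.0, 0.0], [0.0, 1.0]]
patch_features = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.6]]
print(segment(patch_features, text_features))  # [0, 1, 0]
```

The third patch is nearly ambiguous between the two categories, which is the kind of case where raw logits are unreliable and further refinement is typically applied.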

[80] GEAR: GEometry-motion Alternating Refinement for Articulated Object Modeling with Gaussian Splatting

Jialin Li,Bin Fu,Ruiping Wang,Xilin Chen

Main category: cs.CV

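TL;DR: This paper proposes GEAR, a framework that jointly models geometry and motion in a Gaussian Splatting representation and refines them alternately to improve reconstruction accuracy and generalization for complex articulated objects.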

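Details Motivation: High-fidelity interactive digital assets are essential for embodied intelligence and robotic interaction, but existing articulated-object reconstruction methods suffer from unstable joint geometry-motion optimization and poor generalization. Method: GEAR is an EM-style alternating optimization framework that models geometry and motion as interdependent components within a Gaussian Splatting representation; part segmentation is treated as a latent variable and joint motion parameters as explicit variables, and the two are refined alternately; a 2D segmentation model provides multi-view part priors, and a weakly supervised constraint regularizes the latent variable. Result: On multiple benchmarks and the newly constructed GEAR-Multi dataset, GEAR reaches state-of-the-art results in geometric reconstruction and motion parameter estimation, particularly on complex articulated objects with multiple movable parts. Conclusion: By decoupling geometry and motion and optimizing them cooperatively, GEAR improves the stability, consistency, and generalization of articulated-object reconstruction, giving embodied intelligence a more reliable way to generate digital assets. Abstract: High-fidelity interactive digital assets are essential for embodied intelligence and robotic interaction, yet articulated objects remain challenging to reconstruct due to their complex structures and coupled geometry-motion relationships. Existing methods suffer from instability in geometry-motion joint optimization, while their generalization remains limited on complex multi-joint or out-of-distribution objects. To address these challenges, we propose GEAR, an EM-style alternating optimization framework that jointly models geometry and motion as interdependent components within a Gaussian Splatting representation. GEAR treats part segmentation as a latent variable and joint motion parameters as explicit variables, alternately refining them for improved convergence and geometric-motion consistency. To enhance part segmentation quality without sacrificing generalization, we leverage a vanilla 2D segmentation model to provide multi-view part priors, and employ a weakly supervised constraint to regularize the latent variable. Experiments on multiple benchmarks and our newly constructed dataset GEAR-Multi demonstrate that GEAR achieves state-of-the-art results in geometric reconstruction and motion parameters estimation, particularly on complex articulated objects with multiple movable parts.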

[81] Beyond Pedestrians: Caption-Guided CLIP Framework for High-Difficulty Video-based Person Re-Identification

Shogo Hamano,Shunya Wakasugi,Tatsuhito Sato,Sayaka Nakamura

Main category: cs.CV

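TL;DR: This paper proposes CG-CLIP, a framework that combines textual descriptions with learnable tokens. Through Caption-guided Memory Refinement (CMR) and Token-based Feature Extraction (TFE), it improves video-based person re-identification in high-difficulty scenarios such as sports and dance, and achieves state-of-the-art results on several datasets.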

Details Motivation: Existing video-based person re-identification methods perform poorly in high-difficulty scenarios such as sports and dance performances, where several people wear similar clothing and move dynamically; richer semantic information and efficient feature modeling are needed. Method: CG-CLIP has two core modules: (1) Caption-guided Memory Refinement (CMR), which uses captions generated by multi-modal large language models to refine identity-specific features; (2) Token-based Feature Extraction (TFE), which uses fixed-length learnable tokens with cross-attention to aggregate spatiotemporal features efficiently. Result: The method outperforms current state-of-the-art approaches on MARS, iLIDS-VID, SportsVReID, and DanceVReID, with especially large gains on the newly constructed high-difficulty SportsVReID and DanceVReID datasets. Conclusion: By combining explicit textual guidance with learnable-token modeling, CG-CLIP improves the robustness and discriminability of video-based person re-identification in complex dynamic scenes and shows the value of multi-modal semantics for fine-grained identity matching. Abstract: In recent years, video-based person Re-Identification (ReID) has gained attention for its ability to leverage spatiotemporal cues to match individuals across non-overlapping cameras. However, current methods struggle with high-difficulty scenarios, such as sports and dance performances, where multiple individuals wear similar clothing while performing dynamic movements. To overcome these challenges, we propose CG-CLIP, a novel caption-guided CLIP framework that leverages explicit textual descriptions and learnable tokens. Our method introduces two key components: Caption-guided Memory Refinement (CMR) and Token-based Feature Extraction (TFE). CMR utilizes captions generated by Multi-modal Large Language Models (MLLMs) to refine identity-specific features, capturing fine-grained details. TFE employs a cross-attention mechanism with fixed-length learnable tokens to efficiently aggregate spatiotemporal features, reducing computational overhead. We evaluate our approach on two standard datasets (MARS and iLIDS-VID) and two newly constructed high-difficulty datasets (SportsVReID and DanceVReID). Experimental results demonstrate that our method outperforms current state-of-the-art approaches, achieving significant improvements across all benchmarks.

[82] MSCT: Differential Cross-Modal Attention for Deepfake Detection

Fangda Wei,Miao Liu,Yingxue Wang,Jing Wang,Shenghui Zhao,Nan Li

Main category: cs.CV

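TL;DR: This paper proposes a multi-scale cross-modal transformer encoder (MSCT) for audio-visual deepfake detection. Multi-scale self-attention and differential cross-modal attention improve feature extraction and modal alignment, and the approach is validated on the FakeAVCeleb dataset.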

Details Motivation: Traditional multi-modal forgery detection methods suffer from insufficient feature extraction and modal alignment deviation. Method: The multi-scale cross-modal transformer encoder (MSCT) includes multi-scale self-attention to integrate the features of adjacent embeddings and differential cross-modal attention to fuse multi-modal features. Result: The method achieves competitive detection performance on the FakeAVCeleb dataset. Conclusion: The MSCT structure improves feature representation and cross-modal alignment in audio-visual deepfake detection. Abstract: Audio-visual deepfake detection typically employs a complementary multi-modal model to check the forgery traces in the video. These methods primarily extract forgery traces through audio-visual alignment, which results from the inconsistency between audio and video modalities. However, the traditional multi-modal forgery detection method has the problem of insufficient feature extraction and modal alignment deviation. To address this, we propose a multi-scale cross-modal transformer encoder (MSCT) for deepfake detection. Our approach includes a multi-scale self-attention to integrate the features of adjacent embeddings and a differential cross-modal attention to fuse multi-modal features. Our experiments demonstrate competitive performance on the FakeAVCeleb dataset, validating the effectiveness of the proposed structure.

[83] Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding

Xiangyue Liu,Zijian Zhang,Miles Yang,Zhao Zhong,Liefeng Bo,Ping Tan

Main category: cs.CV

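TL;DR: This paper proposes Symbiotic-MoE, a framework that adds no parameters and uses modality-aware expert disentanglement with a progressive training strategy to reduce catastrophic forgetting when large multimodal models are trained jointly for image generation and understanding, while improving cross-modal synergy.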

Details Motivation: Existing methods such as MoT reduce gradient conflicts between generation and understanding tasks but sever cross-modal synergy and fragment model capacity; standard MoE tuning also tends to cause routing collapse, where generative tasks dominate expert utilization and harm understanding. Method: Symbiotic-MoE has two parts: (1) Modality-Aware Expert Disentanglement, which partitions experts into task-specific groups and uses shared experts as a multimodal semantic bridge; (2) a Progressive Training Strategy with differential learning rates and early-stage gradient shielding, which protects pre-trained knowledge and gradually turns generative signals into useful feedback for understanding. Result: Symbiotic-MoE converges quickly on image generation while improving the model's inherent understanding, with notable gains on benchmarks such as MMLU and OCRBench. Conclusion: Symbiotic-MoE strengthens generation and understanding together without extra parameters, resolves task interference within a unified MoE architecture, and offers a new paradigm for unified pre-training of large multimodal models. Abstract: Empowering Large Multimodal Models (LMMs) with image generation often leads to catastrophic forgetting in understanding tasks due to severe gradient conflicts. While existing paradigms like Mixture-of-Transformers (MoT) mitigate this conflict through structural isolation, they fundamentally sever cross-modal synergy and suffer from capacity fragmentation. In this work, we present Symbiotic-MoE, a unified pre-training framework that resolves task interference within a native multimodal Mixture-of-Experts (MoE) Transformers architecture with zero-parameter overhead. We first identify that standard MoE tuning leads to routing collapse, where generative gradients dominate expert utilization. To address this, we introduce Modality-Aware Expert Disentanglement, which partitions experts into task-specific groups while utilizing shared experts as a multimodal semantic bridge. Crucially, this design allows shared experts to absorb fine-grained visual semantics from generative tasks to enrich textual representations. To optimize this, we propose a Progressive Training Strategy featuring differential learning rates and early-stage gradient shielding. This mechanism not only shields pre-trained knowledge from early volatility but eventually transforms generative signals into constructive feedback for understanding. Extensive experiments demonstrate that Symbiotic-MoE achieves rapid generative convergence while unlocking cross-modal synergy, boosting inherent understanding with remarkable gains on MMLU and OCRBench.

[84] DailyArt: Discovering Articulation from Single Static Images via Latent Dynamics

Hang Zhang,Qijian Tian,Jingyu Gong,Daoguo Dong,Xuhong Wang,Yuan Xie,Xin Tan

Main category: cs.CV

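TL;DR: DailyArt infers the kinematic structure of an articulated object from a single closed-state image. It first synthesizes a maximally opened state to expose motion cues, then estimates all joint parameters from the discrepancy between the observed and synthesized states, without multi-state observations, multiple views, or explicit part annotations.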

Details Motivation: Existing methods struggle to infer the kinematics of articulated objects from a single closed-state image because crucial motion cues are often occluded; most also depend on multi-state observations, explicit part priors, retrieval, or other auxiliary inputs, which limits generalization. Method: DailyArt formulates single-image joint estimation as a synthesis-mediated reasoning problem: it first synthesizes the maximally opened state under the same camera view to expose articulation cues, then jointly predicts all joint parameters from the discrepancy between the observed and synthesized images; it uses a set-prediction formulation that requires no templates, multi-view inputs, or part annotations, and it supports part-level novel state synthesis conditioned on the estimated joints. Result: The method performs strongly on articulated joint estimation and supports part-level novel state synthesis conditioned on joints. Conclusion: DailyArt estimates articulated structure robustly and generally from a single static image and extends to controllable part-level state generation, improving geometric and kinematic understanding of everyday articulated objects. Abstract: Articulated objects are essential for embodied AI and world models, yet inferring their kinematics from a single closed-state image remains challenging because crucial motion cues are often occluded. Existing methods either require multi-state observations or rely on explicit part priors, retrieval, or other auxiliary inputs that partially expose the structure to be inferred. In this work, we present DailyArt, which formulates articulated joint estimation from a single static image as a synthesis-mediated reasoning problem. Instead of directly regressing joints from a heavily occluded observation, DailyArt first synthesizes a maximally articulated opened state under the same camera view to expose articulation cues, and then estimates the full set of joint parameters from the discrepancy between the observed and synthesized states. Using a set-prediction formulation, DailyArt recovers all joints simultaneously without requiring object-specific templates, multi-view inputs, or explicit part annotations at test time. Taking estimated joints as conditions, the framework further supports part-level novel state synthesis as a downstream capability. Extensive experiments show that DailyArt achieves strong performance in articulated joint estimation and supports part-level novel state synthesis conditioned on joints. Project page is available at https://rangooo123.github.io/DaliyArt.github.io/.

[85] WUTDet: A 100K-Scale Ship Detection Dataset and Benchmarks with Dense Small Objects

Junxiong Liang,Mengwei Bao,Tianxiang Wang,Xinggang Wang,An-An Liu,Ryan Wen Liu

Main category: cs.CV

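TL;DR: This paper builds WUTDet, a large-scale ship detection dataset with more than 100,000 images and more than 380,000 annotated instances covering diverse, complex maritime scenes, and uses it to systematically evaluate the performance and generalization of three mainstream detector families: CNN, Transformer, and Mamba.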

Details Motivation: Existing public ship detection datasets are limited in scale, proportion of small objects, and scene diversity, so they cannot support systematic evaluation and generalization studies of detection algorithms in complex maritime environments. Method: Construct WUTDet, a large-scale ship detection dataset spanning multiple scenes and imaging conditions; systematically evaluate 20 baseline models (CNN/Transformer/Mamba) on WUTDet; build a cross-dataset test set, Ship-GEN, to assess generalization. Result: Transformers are best in detection accuracy (AP) and small-object detection (APs); CNNs have the highest inference efficiency and suit real-time applications; Mamba strikes a favorable balance between accuracy and efficiency; models trained on WUTDet generalize better on Ship-GEN. Conclusion: WUTDet provides effective data support for the research, evaluation, and generalization analysis of ship detection algorithms in complex maritime scenes, and it has been released publicly. Abstract: Ship detection for navigation is a fundamental perception task in intelligent waterway transportation systems. However, existing public ship detection datasets remain limited in terms of scale, the proportion of small-object instances, and scene diversity, which hinders the systematic evaluation and generalization study of detection algorithms in complex maritime environments. To this end, we construct WUTDet, a large-scale ship detection dataset. WUTDet contains 100,576 images and 381,378 annotated ship instances, covering diverse operational scenarios such as ports, anchorages, navigation, and berthing, as well as various imaging conditions including fog, glare, low-lightness, and rain, thereby exhibiting substantial diversity and challenge. Based on WUTDet, we systematically evaluate 20 baseline models from three mainstream detection architectures, namely CNN, Transformer, and Mamba. Experimental results show that the Transformer architecture achieves superior overall detection accuracy (AP) and small-object detection performance (APs), demonstrating stronger adaptability to complex maritime scenes; the CNN architecture maintains an advantage in inference efficiency, making it more suitable for real-time applications; and the Mamba architecture achieves a favorable balance between detection accuracy and computational efficiency. Furthermore, we construct a unified cross-dataset test set, Ship-GEN, to evaluate model generalization. Results on Ship-GEN show that models trained on WUTDet exhibit stronger generalization under different data distributions. These findings demonstrate that WUTDet provides effective data support for the research, evaluation, and generalization analysis of ship detection algorithms in complex maritime scenarios. The dataset is publicly available at: https://github.com/MAPGroup/WUTDet.
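
The WUTDet abstract reports small-object performance as APs without defining "small". A minimal sketch follows, assuming the widely used COCO convention (area below 32×32 pixels is small, below 96×96 is medium); the function names are illustrative and the thresholds are an assumption, not something the abstract states.

```python
# Sketch only: COCO-style size buckets, assumed (not confirmed) to match
# how APs is computed for WUTDet.

def coco_size_bucket(width: float, height: float) -> str:
    """Classify a bounding box by pixel area using COCO's thresholds."""
    area = width * height
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"


def small_object_ratio(boxes: list[tuple[float, float]]) -> float:
    """Fraction of (width, height) boxes that fall in the small bucket."""
    if not boxes:
        return 0.0
    small = sum(1 for w, h in boxes if coco_size_bucket(w, h) == "small")
    return small / len(boxes)
```

A ratio like this is one way to quantify the "proportion of small-object instances" that the abstract names as a weakness of earlier datasets.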

[86] Beyond Surface Artifacts: Capturing Shared Latent Forgery Knowledge Across Modalities

Jingtong Dou,Chuancheng Shi,Jian Wang,Fei Shen,Zhiyong Wang,Tat-Seng Chua

Main category: cs.CV

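TL;DR: This paper proposes a modality-agnostic forgery (MAF) detection framework that decouples modality-specific styles to extract latent forgery knowledge shared across modalities, addressing the poor generalization of existing deepfake detectors to unseen "dark modalities", and builds the DeepModal-Bench benchmark for evaluation.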

Details Motivation: Existing deepfake detection methods rely too heavily on superficial, modality-specific artifacts and overlook the latent forgery knowledge shared across modalities, so their performance drops sharply on unseen "dark modalities". Method: Propose a modality-agnostic forgery (MAF) detection framework that explicitly decouples modality-specific styles to extract forgery knowledge common across modalities; define two generalization dimensions, Weak MAF (transfer to semantically correlated modalities) and Strong MAF (robustness to completely isolated "dark modalities"); build DeepModal-Bench, a multimodal forgery detection benchmark. Result: Empirically supports the existence of universal forgery traces and achieves significant performance gains on unknown modalities, offering a technical pathway toward universal multimodal defense. Conclusion: The work shifts multimodal forensics from the conventional "feature fusion" paradigm to a "modality generalization" paradigm; the MAF framework improves generalization to unknown modalities beyond the bottleneck of existing methods. Abstract: As generative artificial intelligence evolves, deepfake attacks have escalated from single-modality manipulations to complex, multimodal threats. Existing forensic techniques face a severe generalization bottleneck: by relying excessively on superficial, modality-specific artifacts, they neglect the shared latent forgery knowledge hidden beneath variable physical appearances. Consequently, these models suffer catastrophic performance degradation when confronted with unseen "dark modalities." To break this limitation, this paper introduces a paradigm shift that redefines multimodal forensics from conventional "feature fusion" to "modality generalization." We propose the first modality-agnostic forgery (MAF) detection framework. By explicitly decoupling modality-specific styles, MAF precisely extracts the essential, cross-modal latent forgery knowledge. Furthermore, we define two progressive dimensions to quantify model generalization: transferability toward semantically correlated modalities (Weak MAF), and robustness against completely isolated signals of "dark modality" (Strong MAF). To rigorously assess these generalization limits, we introduce the DeepModal-Bench benchmark, which integrates diverse multimodal forgery detection algorithms and adapts state-of-the-art generalized learning methods. This study not only empirically proves the existence of universal forgery traces but also achieves significant performance breakthroughs on unknown modalities via the MAF framework, offering a pioneering technical pathway for universal multimodal defense.

[87] RemoteAgent: Bridging Vague Human Intents and Earth Observation with RL-based Agentic MLLMs

Liang Yao,Shengxiang Xu,Fan Liu,Chuanyi Zhang,Bishun Yao,Rui Min,Yongjun Li,Chaoqian Ouyang,Shimin Di,Min-Ling Zhang

Main category: cs.CV

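TL;DR: This paper proposes RemoteAgent, a framework that builds the VagueEO dataset and applies reinforcement fine-tuning so that a multimodal large language model (MLLM) can understand vague natural-language queries and allocate tasks intelligently: it handles image-level and sparse region-level tasks internally and calls external tools only for dense pixel-level prediction, yielding efficient and accurate intent recognition and execution on remote sensing tasks.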

Details Motivation: Earth Observation (EO) systems must respond to vague natural-language queries from domain experts, but the text output of existing MLLMs is poorly suited to high-precision spatial prediction, and existing agentic frameworks invoke tools indiscriminately, which is computationally inefficient and underuses the MLLM's native capabilities. Method: Propose the RemoteAgent agentic framework; construct VagueEO, a human-centric instruction dataset pairing vague queries with EO tasks; align an MLLM into a cognitive core through reinforcement fine-tuning; use the Model Context Protocol to decide, according to task granularity (image-level, sparse region-level, or dense pixel-level), whether to handle a task internally or call a specialized tool. Result: RemoteAgent shows robust intent recognition and competitive performance across diverse EO tasks, improving the efficiency and accuracy of mapping vague queries to multi-granularity visual analysis. Conclusion: By respecting the capability boundaries of MLLMs and combining human intent modeling with granularity-based task scheduling, RemoteAgent offers an efficient, scalable, and practical solution for EO AI systems driven by vague natural language. Abstract: Earth Observation (EO) systems are essentially designed to support domain experts who often express their requirements through vague natural language rather than precise, machine-friendly instructions. Depending on the specific application scenario, these vague queries can demand vastly different levels of visual precision. Consequently, a practical EO AI system must bridge the gap between ambiguous human queries and the appropriate multi-granularity visual analysis tasks, ranging from holistic image interpretation to fine-grained pixel-wise predictions. While Multi-modal Large Language Models (MLLMs) demonstrate strong semantic understanding, their text-based output format is inherently ill-suited for dense, precision-critical spatial predictions. Existing agentic frameworks address this limitation by delegating tasks to external tools, but indiscriminate tool invocation is computationally inefficient and underutilizes the MLLM's native capabilities. To this end, we propose RemoteAgent, an agentic framework that strategically respects the intrinsic capability boundaries of MLLMs. To empower this framework to understand real user intents, we construct VagueEO, a human-centric instruction dataset pairing EO tasks with simulated vague natural-language queries. By leveraging VagueEO for reinforcement fine-tuning, we align an MLLM into a robust cognitive core that directly resolves image- and sparse region-level tasks. Consequently, RemoteAgent processes suitable tasks internally while intelligently orchestrating specialized tools via the Model Context Protocol exclusively for dense predictions. Extensive experiments demonstrate that RemoteAgent achieves robust intent recognition capabilities while delivering highly competitive performance across diverse EO tasks.
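
The routing rule described in the RemoteAgent abstract (resolve image-level and sparse region-level tasks inside the MLLM, call tools only for dense prediction) can be sketched as below. The enum, the `route` function, and the returned handler labels are hypothetical names chosen for illustration; the real system decides granularity from the query and invokes tools through the Model Context Protocol, which this sketch does not model.

```python
# Sketch only: granularity-based dispatch with hypothetical names.
from enum import Enum


class Granularity(Enum):
    IMAGE = "image"                   # e.g. scene classification, captioning
    SPARSE_REGION = "sparse_region"   # e.g. a handful of boxes or points
    DENSE_PIXEL = "dense_pixel"       # e.g. segmentation masks


def route(granularity: Granularity) -> str:
    """Return which component should handle a task of this granularity."""
    if granularity is Granularity.DENSE_PIXEL:
        return "external_tool"   # dense output is ill-suited to text decoding
    return "internal_mllm"       # handled directly by the fine-tuned MLLM
```

Keeping the dispatch this narrow is the design point the abstract argues for: tool calls are the exception rather than the default.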

[88] ESOM: Efficiently Understanding Streaming Video Anomalies with Open-world Dynamic Definitions

Zihao Liu,Xiaoyu Wu,Wenna Li,Jianqin Wu,Linlin Yang

Main category: cs.CV

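TL;DR: This paper proposes ESOM, a training-free, efficient streaming model for open-world video anomaly detection that addresses the shortcomings of existing MLLM-based methods in efficiency, streaming processing, and support for dynamic anomaly definitions, and it introduces a new benchmark, OpenDef-Bench, for evaluation.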

Details Motivation: Existing open-world video anomaly detection methods based on multimodal large language models (MLLMs) are inefficient to deploy, are not adapted to streaming processing, and offer limited support for dynamic anomaly definitions. Method: Propose ESOM, which comprises a Definition Normalization module (to reduce hallucination), an Inter-frame-matched Intra-frame Token Merging module (to compress redundant visual tokens), a Hybrid Streaming Memory module (for efficient causal inference), and a Probabilistic Scoring module (to convert interval-level textual outputs into frame-level anomaly scores); also build the OpenDef-Bench benchmark. Result: ESOM runs in real time on a single GPU and reaches state-of-the-art performance on anomaly temporal localization, classification, and description generation. Conclusion: ESOM is an efficient, streaming, training-free solution for open-world video anomaly detection that improves both practical deployability and open-world generalization. Abstract: Open-world video anomaly detection (OWVAD) aims to detect and explain abnormal events under different anomaly definitions, which is important for applications such as intelligent surveillance and live-streaming content moderation. Recent MLLM-based methods have shown promising open-world generalization, but still suffer from three major limitations: inefficiency for practical deployment, lack of streaming processing adaptation, and limited support for dynamic anomaly definitions in both modeling and evaluation. To address these issues, this paper proposes ESOM, an efficient streaming OWVAD model that operates in a training-free manner. ESOM includes a Definition Normalization module to structure user prompts for reducing hallucination, an Inter-frame-matched Intra-frame Token Merging module to compress redundant visual tokens, a Hybrid Streaming Memory module for efficient causal inference, and a Probabilistic Scoring module that converts interval-level textual outputs into frame-level anomaly scores. In addition, this paper introduces OpenDef-Bench, a new benchmark with clean surveillance videos and diverse natural anomaly definitions for evaluating performance under varying conditions. Extensive experiments show that ESOM achieves real-time efficiency on a single GPU and state-of-the-art performance in anomaly temporal localization, classification, and description generation. The code and benchmark will be released at https://github.com/Kamino666/ESOM_OpenDef-Bench.
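
The abstract says the Probabilistic Scoring module converts interval-level outputs into frame-level anomaly scores but does not give the rule. One simple way such a conversion could work is sketched below, assuming each interval arrives as an inclusive `(start_frame, end_frame, probability)` triple and overlapping intervals are resolved by taking the maximum; both the input format and the max rule are assumptions, not ESOM's actual design.

```python
# Sketch only: spread interval-level probabilities onto frames.
# The (start, end, prob) format and the max-overlap rule are assumptions.

def frame_scores(num_frames: int,
                 intervals: list[tuple[int, int, float]]) -> list[float]:
    """Return one anomaly score per frame; uncovered frames score 0.0."""
    scores = [0.0] * num_frames
    for start, end, prob in intervals:
        first = max(start, 0)
        last = min(end, num_frames - 1)
        for t in range(first, last + 1):
            scores[t] = max(scores[t], prob)
    return scores
```

Frame-level scores of this kind are what temporal localization metrics typically consume, which is why some interval-to-frame step is needed at all.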

[89] Plug-and-Play Logit Fusion for Heterogeneous Pathology Foundation Models

Gexin Huang,Anqi Li,Yusheng Tan,Beidi Zhao,Gang Wang,Gaozu Hua,Xiaoxiao Li

Main category: cs.CV

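TL;DR: This paper proposes LogitProd, a lightweight model fusion strategy that requires no retraining: it adaptively weights the logit-level outputs of multiple independently trained pathology foundation models, improves average performance across 22 benchmark tasks by about 3%, and costs roughly 1/12 as much to train as feature-fusion methods.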

Details Motivation: The rapid growth in pathology foundation models makes model selection difficult: no single model is best on every downstream task, and adapting and validating each candidate in turn is too expensive. Method: Propose the LogitProd fusion strategy, which treats several already-trained foundation-model predictors as fixed experts and learns a sample-adaptive weighted product fusion over their slide-level logit outputs; it needs neither encoder retraining nor feature alignment, and a theoretical analysis shows that the optimal weighted product fusion performs no worse than the best single expert. Result: In a systematic evaluation on 22 tasks covering WSI-level classification, tile-level classification, gene mutation prediction, and discrete-time survival modeling, LogitProd ranks first on 20/22 tasks, improves average performance by about 3% over the strongest single model, and costs about 1/12 as much to train as feature-fusion methods. Conclusion: LogitProd is an efficient, plug-and-play multi-model fusion scheme that substantially reduces the compute cost of multi-expert ensembling and offers a practical route to deploying pathology foundation models. Abstract: Pathology foundation models (FMs) have become central to computational histopathology, offering strong transfer performance across a wide range of diagnostic and prognostic tasks. The rapid proliferation of pathology foundation models creates a model-selection bottleneck: no single model is uniformly best, yet exhaustively adapting and validating many candidates for each downstream endpoint is prohibitively expensive. We address this challenge with a lightweight and novel model fusion strategy, LogitProd, which treats independently trained FM-based predictors as fixed experts and learns sample-adaptive fusion weights over their slide-level outputs. The fusion operates purely on logits, requiring no encoder retraining and no feature-space alignment across heterogeneous backbones. We further provide a theoretical analysis showing that the optimal weighted product fusion is guaranteed to perform at least as well as the best individual expert under the training objective. We systematically evaluate LogitProd on 22 benchmarks spanning WSI-level classification, tile-level classification, gene mutation prediction, and discrete-time survival modeling. LogitProd ranks first on 20/22 tasks and improves the average performance across all tasks by ~3% over the strongest single expert. LogitProd enables practitioners to upgrade heterogeneous FM-based pipelines in a plug-and-play manner, achieving multi-expert gains with ~12× lower training cost than feature-fusion alternatives.
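
A weighted product of experts can be computed as a weighted sum of the experts' log-probabilities followed by renormalization. The sketch below shows that generic operation in plain Python; it is not the authors' implementation. In LogitProd the weights are sample-adaptive and learned, whereas here they are simply passed in, and the function names are illustrative.

```python
# Sketch only: generic weighted product-of-experts fusion on logits.
# p_fused(c) is proportional to prod_k p_k(c) ** w_k.
import math


def log_softmax(logits: list[float]) -> list[float]:
    """Numerically stable log-softmax."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def softmax(logits: list[float]) -> list[float]:
    return [math.exp(v) for v in log_softmax(logits)]


def weighted_product_fusion(expert_logits: list[list[float]],
                            weights: list[float]) -> list[float]:
    """Fuse per-expert class logits for one sample into class probabilities."""
    num_classes = len(expert_logits[0])
    fused = [0.0] * num_classes
    for logits, w in zip(expert_logits, weights):
        for c, log_p in enumerate(log_softmax(logits)):
            fused[c] += w * log_p   # product of powers becomes a weighted sum
    return softmax(fused)
```

With a one-hot weight vector the fusion reproduces a single expert exactly, which gives some intuition for the abstract's claim that the optimal weighted product is at least as good as the best individual expert under the training objective.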

[90] Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video

Chanhyuk Choi,Taesoo Kim,Donggyu Lee,Siyeol Jung,Taehwan Kim

Main category: cs.CV

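TL;DR: This paper proposes Cross-Modal Emotion Transfer (C-MET), which models emotion semantic vectors between the speech and visual feature spaces to generate facial expressions from speech, improving emotion accuracy by 14% and supporting synthesis of unseen extended emotions such as sarcasm.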

Details Motivation: Existing emotion-editing approaches are limited: label-based methods cannot cover a wide range of emotions; audio-based methods struggle to express the target emotion precisely because emotion and linguistic content are entangled in speech; image-based methods depend on high-quality frontal reference images, and reference data for extended emotions is hard to obtain. Method: Propose Cross-Modal Emotion Transfer (C-MET), which uses a large-scale pretrained audio encoder and a disentangled facial expression encoder to learn cross-modal (speech to visual) emotion semantic vectors that represent the difference between two emotional embeddings across modalities. Result: On the MEAD and CREMA-D datasets, emotion accuracy improves by 14% over the state of the art, and the method generates expressive talking face videos, including unseen extended emotions such as sarcasm. Conclusion: C-MET disentangles emotion from linguistic content in speech and, through cross-modal semantic alignment, enables more flexible, accurate, and generalizable emotion editing for talking faces, offering a new approach to realistic speech-driven facial animation. Abstract: Talking face generation has gained significant attention as a core application of generative models. To enhance the expressiveness and realism of synthesized videos, emotion editing in talking face video plays a crucial role. However, existing approaches often limit expressive flexibility and struggle to generate extended emotions. Label-based methods represent emotions with discrete categories, which fail to capture a wide range of emotions. Audio-based methods can leverage emotionally rich speech signals - and even benefit from expressive text-to-speech (TTS) synthesis - but they fail to express the target emotions because emotions and linguistic contents are entangled in emotional speeches. Image-based methods, on the other hand, rely on target reference images to guide emotion transfer, yet they require high-quality frontal views and face challenges in acquiring reference data for extended emotions (e.g., sarcasm). To address these limitations, we propose Cross-Modal Emotion Transfer (C-MET), a novel approach that generates facial expressions based on speeches by modeling emotion semantic vectors between speech and visual feature spaces. C-MET leverages a large-scale pretrained audio encoder and a disentangled facial expression encoder to learn emotion semantic vectors that represent the difference between two different emotional embeddings across modalities. Extensive experiments on the MEAD and CREMA-D datasets demonstrate that our method improves emotion accuracy by 14% over state-of-the-art methods, while generating expressive talking face videos - even for unseen extended emotions. Code, checkpoint, and demo are available at https://chanhyeok-choi.github.io/C-MET/

[91] Image-Guided Geometric Stylization of 3D Meshes

Changwoon Choi,Hyunsoo Lee,Clément Jambon,Yael Vinker,Young Min Kim

Main category: cs.CV

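TL;DR: This paper proposes GeoStyle, a geometric stylization framework that uses pretrained diffusion models to extract the style of an image and transfers it to a 3D mesh through a coarse-to-fine deformation pipeline, supporting large geometric deformations while preserving topology and semantic structure.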

Details Motivation: Existing generative models rarely support pronounced geometric deformations beyond the data distribution and have limited ability to express an image's style geometrically. Method: Propose a geometric stylization framework that extracts an abstract style representation of the image with pretrained diffusion models; design a coarse-to-fine mesh deformation pipeline; introduce an approximate VAE encoder that provides efficient and reliable gradients from mesh renderings. Result: Experiments show that the method produces stylized 3D meshes that reflect distinctive geometric features of the image, such as pose and silhouette, supporting artistic 3D creation. Conclusion: GeoStyle enables controllable, image-driven 3D geometric style transfer, extending the geometric expressiveness of generative models while preserving valid topology and part-level semantics. Abstract: Recent generative models can create visually plausible 3D representations of objects. However, the generation process often allows for implicit control signals, such as contextual descriptions, and rarely supports bold geometric distortions beyond existing data distributions. We propose a geometric stylization framework that deforms a 3D mesh, allowing it to express the style of an image. While style is inherently ambiguous, we utilize pre-trained diffusion models to extract an abstract representation of the provided image. Our coarse-to-fine stylization pipeline can drastically deform the input 3D model to express a diverse range of geometric variations while retaining the valid topology of the original mesh and part-level semantics. We also propose an approximate VAE encoder that provides efficient and reliable gradients from mesh renderings. Extensive experiments demonstrate that our method can create stylized 3D meshes that reflect unique geometric features of the pictured assets, such as expressive poses and silhouettes, thereby supporting the creation of distinctive artistic 3D creations. Project page: https://changwoonchoi.github.io/GeoStyle

[92] Latent Anomaly Knowledge Excavation: Unveiling Sparse Sensitive Neurons in Vision-Language Models

Shaotian Li,Shangze Li,Chuancheng Shi,Wenhua Wu,Yanqiu Wu,Xiaohan Yu,Fei Shen,Tat-Seng Chua

Main category: cs.CV

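TL;DR: This paper proposes LAKE, a training-free framework that excavates latent anomaly knowledge from pretrained vision-language models by identifying a sparse set of anomaly-sensitive neurons, enabling zero-shot anomaly detection with neuron-level interpretability.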

Details Motivation: Existing methods treat vision-language models as black-box feature extractors and rely on external adapters or memory banks to acquire anomaly knowledge; this paper questions that assumption, arguing that anomaly knowledge is already embedded in pretrained models but remains latent. Method: Propose the training-free LAKE framework, which uses a small number of normal samples to identify and activate the sparse anomaly-sensitive neurons in the model and builds a compact normality representation that combines visual structural deviations with cross-modal semantic activations. Result: State-of-the-art performance on industrial anomaly detection benchmarks, with intrinsic neuron-level interpretability. Conclusion: Anomaly detection should be reframed as the targeted activation of latent knowledge in pretrained models rather than the learning of downstream task knowledge. Abstract: Large-scale vision-language models (VLMs) exhibit remarkable zero-shot capabilities, yet the internal mechanisms driving their anomaly detection (AD) performance remain poorly understood. Current methods predominantly treat VLMs as black-box feature extractors, assuming that anomaly-specific knowledge must be acquired through external adapters or memory banks. In this paper, we challenge this assumption by arguing that anomaly knowledge is intrinsically embedded within pre-trained models but remains latent and under-activated. We hypothesize that this knowledge is concentrated within a sparse subset of anomaly-sensitive neurons. To validate this, we propose latent anomaly knowledge excavation (LAKE), a training-free framework that identifies and elicits these critical neuronal signals using only a minimal set of normal samples. By isolating these sensitive neurons, LAKE constructs a highly compact normality representation that integrates visual structural deviations with cross-modal semantic activations. Extensive experiments on industrial AD benchmarks demonstrate that LAKE achieves state-of-the-art performance while providing intrinsic, neuron-level interpretability. Ultimately, our work advocates for a paradigm shift: redefining anomaly detection as the targeted activation of latent pre-trained knowledge rather than the acquisition of a downstream task.

[93] HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models

Qihui Zhu,Tao Zhang,Yuchen Wang,Zijian Wen,Mengjie Zhang,Shuangwu Chen,Xiaobin Tan,Jian Yang,Yang Liu,Zhenhua Dong,Xianzhi Yu,Yinfei Pan

Main category: cs.CV

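TL;DR: HAWK is a training-free visual token pruning method that accounts for the differing importance of attention heads in visual tasks, using head importance weights and text-guided attention to retain key visual tokens, which substantially reduces MLLM inference cost while maintaining high accuracy.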

Details Motivation: Existing visual token pruning methods assume that all attention heads contribute equally to visual understanding, but in practice different heads capture different visual semantics and play different roles, which calls for finer-grained importance modeling. Method: Propose HAWK, which scores visual token importance using head importance weights and text-guided attention, giving training-free, plug-and-play visual token pruning. Result: State-of-the-art accuracy on multiple vision-language benchmarks; on Qwen2.5-VL, HAWK retains 96.0% of the original accuracy after pruning 80.2% of the visual tokens, reduces end-to-end latency to 74.4% of the original, and lowers GPU memory usage. Conclusion: By modeling the heterogeneity of attention heads, HAWK prunes visual tokens efficiently and balances accuracy, latency, and memory well across a variety of MLLMs. Abstract: In multimodal large language models (MLLMs), the surge of visual tokens significantly increases the inference time and computational overhead, making them impractical for real-time or resource-constrained applications. Visual token pruning is a promising strategy for reducing the cost of MLLM inference by removing redundant visual tokens. Existing research usually assumes that all attention heads contribute equally to the visual interpretation. However, our study reveals that different heads may capture distinct visual semantics and inherently play distinct roles in visual processing. In light of this observation, we propose HAWK, a head importance-aware visual token pruning method that perceives the varying importance of attention heads in visual tasks to maximize the retention of crucial tokens. By leveraging head importance weights and text-guided attention to assess visual token significance, HAWK effectively retains task-relevant visual tokens while removing redundant ones. The proposed HAWK is entirely training-free and can be seamlessly applied to various MLLMs. Extensive experiments on multiple mainstream vision-language benchmarks demonstrate that HAWK achieves state-of-the-art accuracy. When applied to Qwen2.5-VL, HAWK retains 96.0% of the original accuracy after pruning 80.2% of the visual tokens. Additionally, it reduces end-to-end latency to 74.4% of the original and further decreases GPU memory usage across the tested models. The code is available at https://github.com/peppery77/HAWK.git.
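
The idea of scoring visual tokens with head importance weights can be sketched as a weighted sum of per-head text-to-visual attention followed by top-k selection. This is an illustration under assumptions: how HAWK derives the head weights and the attention scores is not described in the abstract, and `prune_visual_tokens` and its arguments are hypothetical names.

```python
# Sketch only: head-weighted token scoring and top-k retention.
# attn[h][i] is assumed to be the text-guided attention that head h
# assigns to visual token i; head_weights[h] is that head's importance.

def prune_visual_tokens(attn: list[list[float]],
                        head_weights: list[float],
                        keep_ratio: float) -> list[int]:
    """Return indices of retained visual tokens, in original order."""
    num_tokens = len(attn[0])
    scores = [
        sum(w * head[i] for w, head in zip(head_weights, attn))
        for i in range(num_tokens)
    ]
    k = max(1, round(num_tokens * keep_ratio))
    top = sorted(range(num_tokens), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)   # keep positional order for the language model
```

With uniform head weights this reduces to the equal-contribution assumption that the abstract criticizes; non-uniform weights can change which tokens survive, as the following usage shows.

```python
attn = [[0.7, 0.3, 0.0, 0.0],   # head 0 attends to the first two tokens
        [0.0, 0.0, 0.4, 0.6]]   # head 1 attends to the last two tokens
print(prune_visual_tokens(attn, [0.5, 0.5], 0.5))  # uniform heads
print(prune_visual_tokens(attn, [0.1, 0.9], 0.5))  # head 1 dominates
```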

[94] AgriChain Visually Grounded Expert Verified Reasoning for Interpretable Agricultural Vision Language Models

Hazza Mahmood,Yongqiang Yu,Rao Anwer

Main category: cs.CV

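TL;DR: This paper introduces the AgriChain dataset (about 11,000 expert-curated plant leaf images with disease labels, confidence scores, and expert-verified chain-of-thought rationales) and fine-tunes Qwen2.5-VL-3B on it to obtain AgriChain-VL3B, which reaches 73.1% top-1 accuracy on disease diagnosis, outperforming several strong baselines and improving both accuracy and interpretability.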

Details Motivation: Existing vision-language models struggle to deliver both accurate and interpretable plant disease diagnosis in real-world agriculture. Method: Build the AgriChain dataset (with expert-verified chain-of-thought rationales) and fine-tune Qwen2.5-VL-3B on it to jointly predict diseases and generate visually grounded reasoning. Result: AgriChain-VL3B reaches 73.1% top-1 accuracy on a 1,000-image test set (macro F1 = 0.466, weighted F1 = 0.655); its explanations align closely with expert reasoning, and it outperforms baselines including Gemini 1.5 Flash, Gemini 2.5 Pro, and GPT-4o Mini. Conclusion: Expert-verified reasoning supervision significantly improves accuracy and interpretability, narrowing the gap between generic multimodal models and human expertise and advancing trustworthy, globally deployable agricultural AI. Abstract: Accurate and interpretable plant disease diagnosis remains a major challenge for vision-language models (VLMs) in real-world agriculture. We introduce AgriChain, a dataset of approximately 11,000 expert-curated leaf images spanning diverse crops and pathologies, each paired with (i) a disease label, (ii) a calibrated confidence score (High/Medium/Low), and (iii) an expert-verified chain-of-thought (CoT) rationale. Draft explanations were first generated by GPT-4o and then verified by a professional agricultural engineer using standardized descriptors (e.g., lesion color, margin, and distribution). We fine-tune Qwen2.5-VL-3B on AgriChain, resulting in a specialized model termed AgriChain-VL3B, to jointly predict diseases and generate visually grounded reasoning. On a 1,000-image test set, our CoT-supervised model achieves 73.1% top-1 accuracy (macro F1 = 0.466; weighted F1 = 0.655), outperforming strong baselines including Gemini 1.5 Flash, Gemini 2.5 Pro, and GPT-4o Mini. The generated explanations align closely with expert reasoning, consistently referencing key visual cues. These findings demonstrate that expert-verified reasoning supervision significantly enhances both accuracy and interpretability, bridging the gap between generic multimodal models and human expertise, and advancing trustworthy, globally deployable AI for sustainable agriculture. The dataset and code are publicly available at: https://github.com/hazzanabeel12-netizen/agrichain
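
The AgriChain abstract reports a macro F1 (0.466) well below the weighted F1 (0.655). A gap in that direction is what one would expect when rare classes are predicted poorly, since macro averaging weights every class equally while weighted averaging scales by class support. The sketch below computes both from scratch on toy labels; the labels are invented for illustration and are unrelated to the paper's data.

```python
# Sketch only: macro vs support-weighted F1 on toy, made-up labels.
from collections import Counter


def f1_per_class(y_true: list[str], y_pred: list[str]) -> dict[str, float]:
    """Per-class F1 = 2*TP / (2*TP + FP + FN); 0.0 when undefined."""
    result = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        result[c] = 2 * tp / denom if denom else 0.0
    return result


def macro_and_weighted_f1(y_true: list[str],
                          y_pred: list[str]) -> tuple[float, float]:
    f1 = f1_per_class(y_true, y_pred)
    support = Counter(y_true)
    macro = sum(f1.values()) / len(f1)
    weighted = sum(f1[c] * support[c] for c in f1) / len(y_true)
    return macro, weighted


# Toy example: the rare class "b" is never predicted.
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 10
macro, weighted = macro_and_weighted_f1(y_true, y_pred)
print(f"macro={macro:.3f} weighted={weighted:.3f}")
```

In the toy case the missed rare class halves the macro score but only modestly lowers the weighted one.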

[95] LPM 1.0: Video-based Character Performance Model

Ailing Zeng,Casper Yang,Chauncey Ge,Eddie Zhang,Garvey Xu,Gavin Lin,Gilbert Gu,Jeremy Pi,Leo Li,Mingyi Shi,Sheng Bi,Steven Tang,Thorn Hang,Tobey Guo,Vincent Li,Xin Tong,Yikang Li,Yuchen Sun,Yue,Zhao,Yuhan Lu,Yuwei Li,Zane Zhang,Zeshi Yang,Zi Ye

Main category: cs.CV

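TL;DR: This paper presents LPM 1.0 (Large Performance Model), a generative model for single-person full-duplex audio-visual conversation. By building a high-quality multimodal dataset, training a 17B-parameter Diffusion Transformer, and distilling it into a causal streaming model, it combines high expressiveness, real-time inference, and long-horizon identity stability, and it introduces LPM-Bench, the first benchmark for interactive character performance.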

Details Motivation: Existing video generation models struggle to achieve high expressiveness, real-time inference, and long-horizon identity stability at once, a tension the authors call the "performance trilemma"; conversation is the most demanding and most comprehensive character performance scenario, requiring speaking, listening, reacting, and emoting while maintaining identity over time. Method: Build a human-centric multimodal dataset through strict filtering, speaking-listening audio-video pairing, performance understanding, and identity-aware multi-reference extraction; train a 17B-parameter multimodally conditioned Diffusion Transformer (Base LPM); distill it into a low-latency, infinite-length causal streaming generator (Online LPM); support real-time listening and speaking video generation from a character image, audio input, and text prompts. Result: LPM 1.0 achieves state-of-the-art results on every dimension of LPM-Bench while maintaining real-time inference; it supports identity-stable, infinite-length interactive audio-visual generation for conversational agents, live-streaming virtual characters, and game NPCs. Conclusion: LPM 1.0 systematically addresses the performance trilemma, providing a scalable, controllable, real-time visual generation engine for interactive digital characters and establishing the first standardized benchmark for this setting. Abstract: Performance, the externalization of intent, emotion, and personality through visual, vocal, and temporal behavior, is what makes a character alive. Learning such performance from video is a promising alternative to traditional 3D pipelines. However, existing video models struggle to jointly achieve high expressiveness, real-time inference, and long-horizon identity stability, a tension we call the performance trilemma. Conversation is the most comprehensive performance scenario, as characters simultaneously speak, listen, react, and emote while maintaining identity over time. To address this, we present LPM 1.0 (Large Performance Model), focusing on single-person full-duplex audio-visual conversational performance. Concretely, we build a multimodal human-centric dataset through strict filtering, speaking-listening audio-video pairing, performance understanding, and identity-aware multi-reference extraction; train a 17B-parameter Diffusion Transformer (Base LPM) for highly controllable, identity-consistent performance through multimodal conditioning; and distill it into a causal streaming generator (Online LPM) for low-latency, infinite-length interaction. At inference, given a character image with identity-aware references, LPM 1.0 generates listening videos from user audio and speaking videos from synthesized audio, with text prompts for motion control, all at real-time speed with identity-stable, infinite-length generation. LPM 1.0 thus serves as a visual engine for conversational agents, live streaming characters, and game NPCs. To systematically evaluate this setting, we propose LPM-Bench, the first benchmark for interactive character performance. LPM 1.0 achieves state-of-the-art results across all evaluated dimensions while maintaining real-time inference.

[96] FlowGuard: Towards Lightweight In-Generation Safety Detection for Diffusion Models via Linear Latent Decoding

Jinghan Yang,Yihe Fan,Xudong Pan,Min Yang

Main category: cs.CV

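TL;DR: This paper proposes FlowGuard, a cross-model framework that detects NSFW content during diffusion model generation; using a linear approximation of latent decoding and curriculum learning, it identifies unsafe content early, efficiently, and robustly at intermediate denoising steps.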

Details Motivation: Existing NSFW detection methods (pre-generation, based on text prompts; post-generation, based on final images) suffer from the mismatch between prompt safety and image safety and cannot handle intermediate noisy images, so they offer no fine-grained safety intervention in the generation process itself. Method: Propose the FlowGuard framework: (1) a linear approximation of latent decoding to compensate for the lack of visual signal under heavy early-stage noise; (2) a curriculum learning strategy to stabilize training; (3) cross-model detection at intermediate denoising steps across multiple diffusion models. Result: On a cross-model benchmark covering nine diffusion backbones, FlowGuard improves F1 score by more than 30% over existing methods in both in-distribution and out-of-distribution settings, cuts peak GPU memory by more than 97%, and reduces projection time from 8.1 seconds to 0.2 seconds. Conclusion: FlowGuard achieves efficient and robust in-generation NSFW detection for diffusion models, balancing safety, computational efficiency, and generalization, and offers a new approach to safe, controllable generative AI. Abstract: Diffusion-based image generation models have advanced rapidly but pose a safety risk due to their potential to generate Not-Safe-For-Work (NSFW) content. Existing NSFW detection methods mainly operate either before or after image generation. Pre-generation methods rely on text prompts and struggle with the gap between prompt safety and image safety. Post-generation methods apply classifiers to final outputs, but they are poorly suited to intermediate noisy images. To address this, we introduce FlowGuard, a cross-model in-generation detection framework that inspects intermediate denoising steps. This is particularly challenging in latent diffusion, where early-stage noise obscures visual signals. FlowGuard employs a novel linear approximation for latent decoding and leverages a curriculum learning approach to stabilize training. By detecting unsafe content early, FlowGuard reduces unnecessary diffusion steps to cut computational costs. Our cross-model benchmark spanning nine diffusion-based backbones shows the effectiveness of FlowGuard for in-generation NSFW detection in both in-distribution and out-of-distribution settings, outperforming existing methods by over 30% in F1 score while delivering transformative efficiency gains, including slashing peak GPU memory demand by over 97% and projection time from 8.1 seconds to 0.2 seconds compared to standard VAE decoding.
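
To make "linear approximation for latent decoding" concrete, the sketch below projects each latent pixel to RGB with a single matrix and bias instead of running a full VAE decoder. The per-pixel form, the function name, and the weight shapes are assumptions for illustration; the abstract does not specify FlowGuard's actual approximation or how its coefficients are obtained.

```python
# Sketch only: a per-pixel linear map from latent channels to RGB.
# This is an assumed form of "linear latent decoding", not FlowGuard's
# actual projection; weights would need to be fitted in practice.

def linear_decode(latent: list[list[list[float]]],
                  weight: list[list[float]],
                  bias: list[float]) -> list[list[list[float]]]:
    """latent is [C][H][W]; weight is [3][C]; returns a [3][H][W] preview."""
    channels = len(latent)
    height = len(latent[0])
    width = len(latent[0][0])
    return [
        [
            [
                bias[o] + sum(weight[o][c] * latent[c][y][x]
                              for c in range(channels))
                for x in range(width)
            ]
            for y in range(height)
        ]
        for o in range(len(weight))
    ]
```

A map like this costs one small matrix multiply per latent pixel, which is consistent in spirit with the large memory and time savings over standard VAE decoding reported in the abstract.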

[97] ReconPhys: Reconstruct Appearance and Physical Attributes from Single Video

Boyuan Wang,Xiaofeng Wang,Yongkang Li,Zheng Zhu,Yifan Chang,Angen Ye,Guosheng Zhao,Chaojun Ni,Guan Huang,Yijie Ren,Yueqi Duan,Xingang Wang

Main category: cs.CV

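TL;DR: ReconPhys is the first feedforward framework that jointly learns physical attribute estimation and 3D Gaussian Splatting reconstruction from monocular video, without ground-truth labels, enabling fast and physically plausible reconstruction of non-rigid objects.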

Details Motivation: Existing per-scene optimization methods based on differentiable rendering depend on expensive tuning or manual annotation, which limits practicality and generalization. Method: Propose ReconPhys, a dual-branch framework trained with a self-supervised strategy that jointly estimates geometry, appearance, and physical attributes end to end from monocular video and reconstructs with 3D Gaussian Splatting. Result: On a synthetic dataset, future-frame prediction reaches 21.64 PSNR (versus 13.27 for state-of-the-art baselines), Chamfer Distance drops to 0.004 (from 0.349), and inference takes under 1 second (versus hours). Conclusion: ReconPhys delivers efficient, physically plausible reconstruction of non-rigid objects without ground-truth labels, improving practicality and speeding up the generation of simulation-ready assets. Abstract: Reconstructing non-rigid objects with physical plausibility remains a significant challenge. Existing approaches leverage differentiable rendering for per-scene optimization, recovering geometry and dynamics but requiring expensive tuning or manual annotation, which limits practicality and generalizability. To address this, we propose ReconPhys, the first feedforward framework that jointly learns physical attribute estimation and 3D Gaussian Splatting reconstruction from a single monocular video. Our method employs a dual-branch architecture trained via a self-supervised strategy, eliminating the need for ground-truth physics labels. Given a video sequence, ReconPhys simultaneously infers geometry, appearance, and physical attributes. Experiments on a large-scale synthetic dataset demonstrate superior performance: our method achieves 21.64 PSNR in future prediction compared to 13.27 by state-of-the-art optimization baselines, while reducing Chamfer Distance from 0.349 to 0.004. Crucially, ReconPhys enables fast inference (<1 second) versus hours required by existing methods, facilitating rapid generation of simulation-ready assets for robotics and graphics.
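
The two metrics quoted for ReconPhys, PSNR and Chamfer Distance, can be computed as in the sketch below. Conventions vary between papers (for example squared versus unsquared distances, or summing versus averaging the two directions), and the abstract does not say which variant it uses, so the choices here are assumptions.

```python
# Sketch only: textbook PSNR and one common Chamfer Distance variant
# (mean squared nearest-neighbor distance, summed over both directions).
import math


def psnr(pred: list[float], target: list[float], max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB over flattened pixel values."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return float("inf")
    return 10 * math.log10(max_val ** 2 / mse)


def chamfer_distance(points_a: list[tuple[float, ...]],
                     points_b: list[tuple[float, ...]]) -> float:
    """Symmetric Chamfer Distance between two point sets (brute force)."""
    def one_way(src, dst):
        total = 0.0
        for s in src:
            total += min(sum((si - di) ** 2 for si, di in zip(s, d))
                         for d in dst)
        return total / len(src)
    return one_way(points_a, points_b) + one_way(points_b, points_a)
```

Higher PSNR and lower Chamfer Distance are better, which matches the direction of the improvements reported above (13.27 to 21.64 and 0.349 to 0.004).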

[98] Reinforcement-Guided Synthetic Data Generation for Privacy-Sensitive Identity Recognition

Xuemei Jia,Jiawei Du,Hui Wei,Jun Chen,Joey Tianyi Zhou,Zheng Wang

Main category: cs.CV

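TL;DR: This paper proposes a reinforcement-guided synthetic data generation framework to improve generative models in data-constrained, privacy-sensitive settings such as identity recognition. It adapts a pretrained generator with a cold start, designs a multi-objective reward that optimizes semantic consistency, diversity, and expression richness, and adds a dynamic sample selection mechanism, improving generation fidelity and downstream classification accuracy with strong generalization in small-data regimes.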

Details Motivation: In privacy-sensitive scenarios, regulatory and copyright constraints make real data hard to obtain, and low-quality generative models cannot relieve the resulting data scarcity, creating a vicious cycle. Method: Propose a reinforcement-guided synthetic data generation framework: (1) cold-start adaptation of a pretrained generator for semantic alignment; (2) a multi-objective reward that jointly optimizes semantic consistency, coverage diversity, and expression richness; (3) a dynamic sample selection mechanism during downstream training that adaptively picks high-utility synthetic samples. Result: Experiments on multiple benchmark datasets show significant improvements in generated image fidelity and downstream classification accuracy, along with strong generalization to novel categories in small-data settings. Conclusion: The framework breaks the vicious cycle between data scarcity and poor generative models, offering a viable path to high-quality generative modeling in privacy-sensitive, data-constrained scenarios. Abstract: High-fidelity generative models are increasingly needed in privacy-sensitive scenarios, where access to data is severely restricted due to regulatory and copyright constraints. This scarcity hampers model development--ironically, in settings where generative models are most needed to compensate for the lack of data. This creates a self-reinforcing challenge: limited data leads to poor generative models, which in turn fail to mitigate data scarcity. To break this cycle, we propose a reinforcement-guided synthetic data generation framework that adapts general-domain generative priors to privacy-sensitive identity recognition tasks. We first perform a cold-start adaptation to align a pretrained generator with the target domain, establishing semantic relevance and initial fidelity. Building on this foundation, we introduce a multi-objective reward that jointly optimizes semantic consistency, coverage diversity, and expression richness, guiding the generator to produce both realistic and task-effective samples. During downstream training, a dynamic sample selection mechanism further prioritizes high-utility synthetic samples, enabling adaptive data scaling and improved domain alignment. Extensive experiments on benchmark datasets demonstrate that our framework significantly improves both generation fidelity and classification accuracy, while also exhibiting strong generalization to novel categories in small-data regimes.

[99] Sampling-Aware 3D Spatial Analysis in Multiplexed Imaging

Ido Harlev,Tamar Oukhanov,Raz Ben-Uri,Leeat Keren,Shai Bagon

Main category: cs.CV

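TL;DR: This paper studies how sparse serial sections can support stable and reliable 3D analysis in spatial proteomics. It proposes a geometry-aware reconstruction module that uses phenotype and proximity constraints together with cell-type-specific shape priors to recover single-cell 3D coordinates from sparse 2D sections, and it analyzes the trade-off between section spacing, coverage, and redundancy.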

Details Motivation: Highly multiplexed microscopy resolves tissue organization at single-cell resolution, but most analyses remain confined to 2D sections, while dense 3D data is costly and technically difficult to acquire; in practice researchers must trade off between 2D and sparse 3D sampling, and both a systematic assessment of how sampling geometry matters and a reliable reconstruction method have been lacking. Method: Use controlled simulations to analyze how sampling geometry affects the stability of spatial statistics; propose a geometry-aware reconstruction module that combines phenotype consistency, spatial proximity constraints, and cell-type-specific shape priors to infer single-cell 3D centroids from sparse serial sections; quantify the trade-off between section spacing, coverage, and redundancy. Result: Planar sampling estimates global cell-type abundance reliably, but local statistics (such as cell clustering and interactions) show high variance, especially for rare or spatially localized populations; in real multiplexed datasets, interaction metrics and neighborhood relationships fluctuate substantially across sections; the reconstruction method is validated on an IMC dataset with dense axial sampling and enables structure-level 3D analyses on a CODEX dataset that are unreliable in 2D. Conclusion: The paper provides diagnostic tools and practical guidance on when 2D sampling suffices and when sparse 3D reconstruction is warranted; the method improves the reliability and biological interpretability of 3D spatial analysis under a limited imaging budget. Abstract: Highly multiplexed microscopy enables rich spatial characterization of tissues at single-cell resolution, yet most analyses rely on two-dimensional sections despite inherently three-dimensional tissue organization. Acquiring dense volumetric data in spatial proteomics remains costly and technically challenging, leaving practitioners to choose between 2D sections or 3D serial sections under limited imaging budgets. In this work, we study how sampling geometry impacts the stability of commonly used spatial statistics, and we introduce a geometry-aware reconstruction module that enables sparse yet consistent 3D analysis from serial sections. Using controlled simulations, we show that planar sampling reliably recovers global cell-type abundance but exhibits high variance for local statistics such as cell clustering and cell-cell interactions, particularly for rare or spatially localized populations. We observe consistent behavior in real multiplexed datasets, where interaction metrics and neighborhood relationships fluctuate substantially across individual sections. To support sparse 3D analysis in practice, we present a reconstruction approach that links cell projections across adjacent sections using phenotype and proximity constraints and recovers single-cell 3D centroids using cell-type-specific shape priors. We further analyze the trade-off between section spacing, coverage, and redundancy, identifying acquisition regimes that maximize reconstruction utility under fixed imaging budgets. We validate the reconstruction module on a public imaging mass cytometry dataset with dense axial sampling and demonstrate its downstream utility on an in-house CODEX dataset by enabling structure-level 3D analyses that are unreliable in 2D. Together, our results provide diagnostic tools and practical guidance for deciding when 2D sampling suffices and when sparse 3D reconstruction is warranted.

[100] AnomalyAgent: Agentic Industrial Anomaly Synthesis via Tool-Augmented Reinforcement Learning

Jiaming Su,Tengchao Yang,Ruikang Zhang,Zhengan Yan,Haoyu Sun,Linfeng Zhang

Main category: cs.CV

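TL;DR: This paper proposes AnomalyAgent, an industrial anomaly synthesis agent with self-reflection, knowledge retrieval, and iterative refinement capabilities. Using multi-tool collaboration and a two-stage training framework, it significantly surpasses existing zero-shot SOTA methods on the MVTec-AD dataset.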

Details Motivation: Address the data shortage in industrial anomaly detection caused by the scarcity of real anomaly samples, and overcome the limited semantic realism and complex-reasoning ability of existing single-step generation methods. Method: Builds AnomalyAgent with five tools: prompt generation, image generation, quality evaluation, knowledge retrieval, and mask generation; trains in two stages (supervised fine-tuning followed by reinforcement learning) on structured trajectories constructed from real anomaly images, driven by a three-part reward covering task, reflection, and behavior. Result: On MVTec-AD, achieves IS/IC-L of 2.10/0.33, 57.0% classification accuracy with ResNet34, and 99.3%/74.2% image/pixel-level AP with a UNet, surpassing all zero-shot SOTA methods. Conclusion: By introducing an agentic paradigm with closed-loop optimization, AnomalyAgent improves the realism, diversity, and controllability of industrial anomaly generation, offering a new paradigm for anomaly detection in data-scarce settings. Abstract: Industrial anomaly generation is a crucial method for alleviating the data scarcity problem in anomaly detection tasks. Most existing anomaly synthesis methods rely on single-step generation mechanisms, lacking complex reasoning and iterative optimization capabilities, making it difficult to generate anomaly samples with high semantic realism. We propose AnomalyAgent, an anomaly synthesis agent with self-reflection, knowledge retrieval, and iterative refinement capabilities, aiming to generate realistic and diverse anomalies. Specifically, AnomalyAgent is equipped with five tools: Prompt Generation (PG), Image Generation (IG), Quality Evaluation (QE), Knowledge Retrieval (KR), and Mask Generation (MG), enabling closed-loop optimization. To improve decision-making and self-reflection, we construct structured trajectories from real anomaly images and design a two-stage training framework: supervised fine-tuning followed by reinforcement learning. This process is driven by a three-part reward mechanism: (1) task rewards to supervise the quality and location rationality of generated anomalies; (2) reflection rewards to train the model's ability to improve anomaly synthesis prompts; (3) behavioral rewards to ensure adherence to the trajectory. On the MVTec-AD dataset, AnomalyAgent achieves IS/IC-L of 2.10/0.33 for anomaly generation, 57.0% classification accuracy using ResNet34, and 99.3%/74.2% AP at the image/pixel level using a simple UNet, surpassing all zero-shot SOTA methods. The code and data will be made publicly available.

[101] PanoSAM2: Lightweight Distortion- and Memory-aware Adaptions of SAM2 for 360 Video Object Segmentation

Dingwen Xiao,Weiming Zhang,Shiqi Wen,Lin Wang

Main category: cs.CV

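TL;DR: This paper proposes PanoSAM2, a lightweight SAM2-based framework for 360 video object segmentation. With a Pano-Aware Decoder, a Distortion-Guided Mask Loss, and a Long-Short Memory Module, it addresses spherical projection distortion, left-right semantic inconsistency, and sparse object information in memory, and it significantly outperforms SAM2 on the 360VOTS and PanoVOS datasets.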

Details Motivation: 360 video object segmentation (360VOS) suffers from a scarcity of high-quality annotated data, and directly applying SAM2, despite its strong video segmentation capability, works poorly because it ignores challenges specific to 360 video: projection distortion, semantic inconsistency across the left-right boundary, and sparse object mask information in memory. Method: The PanoSAM2 framework has three components: 1) a Pano-Aware Decoder with seam-consistent receptive fields and iterative distortion refinement to keep the 0/360 degree boundary continuous; 2) a Distortion-Guided Mask Loss that weights the per-pixel loss by distortion magnitude, emphasizing stretched regions and boundaries; 3) a Long-Short Memory Module that uses a compact long-term object pointer to re-instantiate and align short-term memories, improving temporal coherence. Result: On 360VOTS and PanoVOS, PanoSAM2 improves over SAM2 by 5.6 and 6.7 points respectively, validating the method. Conclusion: With lightweight, targeted adaptations, PanoSAM2 overcomes the key challenges of 360 video segmentation while retaining SAM2's user-friendly prompting, giving VR/AR and embodied AI applications a more reliable segmentation foundation. Abstract: 360 video object segmentation (360VOS) aims to predict temporally consistent masks in 360 videos, offering full-scene coverage that benefits applications such as VR/AR and embodied AI. Learning a 360VOS model is nontrivial due to the lack of high-quality labeled datasets. Recently, Segment Anything Models (SAMs), especially SAM2 -- with its memory module design -- have shown strong, promptable VOS capability. However, directly using SAM2 for 360VOS yields implausible results, as 360 videos suffer from projection distortion, semantic inconsistency between the left and right sides, and sparse object mask information in SAM2's memory. To this end, we propose PanoSAM2, a novel 360VOS framework based on our lightweight distortion- and memory-aware adaptation strategies of SAM2 to achieve reliable 360VOS while retaining SAM2's user-friendly prompting design. Concretely, to tackle the projection distortion and semantic inconsistency issues, we propose a Pano-Aware Decoder with seam-consistent receptive fields and iterative distortion refinement to maintain continuity across the 0/360 degree boundary. Meanwhile, a Distortion-Guided Mask Loss is introduced to weight pixels by distortion magnitude, stressing stretched regions and boundaries. To address the object sparsity issue, we propose a Long-Short Memory Module to maintain a compact long-term object pointer to re-instantiate and align short-term memories, thereby enhancing temporal coherence.
Extensive experiments show that PanoSAM2 yields substantial gains over SAM2: +5.6 on 360VOTS and +6.7 on PanoVOS, showing the effectiveness of our method.
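The abstract says the Distortion-Guided Mask Loss weights pixels by distortion magnitude but does not give a formula. The following is only a minimal sketch of the idea, assuming an equirectangular frame in which a row at latitude φ is stretched horizontally by 1/cos φ. The function names, the clamp `max_stretch`, and the use of binary cross-entropy are illustrative assumptions, not the authors' loss.

```python
import math

def row_stretch(row: int, height: int, max_stretch: float = 10.0) -> float:
    """Horizontal stretch 1/cos(latitude) of an equirectangular row, clamped."""
    # Latitude of the row centre, running from +pi/2 (top) to -pi/2 (bottom).
    lat = math.pi / 2 - (row + 0.5) * math.pi / height
    return min(1.0 / max(math.cos(lat), 1e-6), max_stretch)

def distortion_weighted_bce(pred, target, max_stretch: float = 10.0) -> float:
    """Binary cross-entropy with each pixel weighted by its row's stretch factor.

    pred and target are lists of rows; pred holds probabilities, target holds 0/1.
    """
    height = len(pred)
    total, weight_sum = 0.0, 0.0
    for r in range(height):
        w = row_stretch(r, height, max_stretch)
        for p, t in zip(pred[r], target[r]):
            p = min(max(p, 1e-7), 1.0 - 1e-7)  # avoid log(0)
            total += -w * (t * math.log(p) + (1 - t) * math.log(1.0 - p))
            weight_sum += w
    return total / weight_sum
```

Under this weighting, the same mistake costs more near the poles, where the projection stretches content most, than near the equator.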

[102] ParkSense: Where Should a Delivery Driver Park? Leveraging Idle AV Compute and Vision-Language Models

Die Hu,Henan Li

Main category: cs.CV

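TL;DR: This paper proposes ParkSense, a framework that uses an autonomous vehicle's idle compute during low-risk states (such as waiting at red lights or in congestion) to run a quantized 7B vision-language model (VLM) on pre-cached satellite and street-view imagery. The model identifies merchant entrances and legal parking zones to address the time-consuming parking problem in food delivery.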

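Details Motivation: Finding parking consumes a large share of food delivery time, yet existing systems cannot select precise parking spots relative to merchant entrances. Method: Proposes the ParkSense framework, which runs a quantized 7B VLM on idle compute during low-risk AV states to analyze pre-cached satellite and street-view imagery and identify merchant entrances and legal parking zones; formalizes the Delivery-Aware Precision Parking (DAPP) problem. Result: A quantized 7B VLM completes inference in 4-8 seconds on HW4-class hardware; estimated annual income gains for U.S. delivery drivers are 3,000-8,000 USD per driver; five open research directions are identified at the intersection of autonomous driving, computer vision, and last-mile logistics. Conclusion: ParkSense offers a feasible technical path to improving food delivery efficiency, highlights the potential of cross-domain co-optimization, and points to key directions for future research. Abstract: Finding parking consumes a disproportionate share of food delivery time, yet no system addresses precise parking-spot selection relative to merchant entrances. We propose ParkSense, a framework that repurposes idle compute during low-risk AV states -- queuing at red lights, traffic congestion, parking-lot crawl -- to run a Vision-Language Model (VLM) on pre-cached satellite and street view imagery, identifying entrances and legal parking zones. We formalize the Delivery-Aware Precision Parking (DAPP) problem, show that a quantized 7B VLM completes inference in 4-8 seconds on HW4-class hardware, and estimate annual per-driver income gains of 3,000-8,000 USD in the U.S. Five open research directions are identified at this unexplored intersection of autonomous driving, computer vision, and last-mile logistics.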

[103] Mitigating Entangled Steering in Large Vision-Language Models for Hallucination Reduction

Yuanhong Zhang,Zhaoyang Wang,Xin Zhang,Weizhan Zhang,Joey Tianyi Zhou

Main category: cs.CV

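TL;DR: This paper proposes MESA, a framework that mitigates hallucinations in large vision-language models (LVLMs) through selective latent-space intervention while preserving the model's original generation behavior.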

Details Motivation: Existing methods for suppressing LVLM hallucinations alter generation behavior (for example, shorter outputs and shifted token distributions) because the intervention signal is entangled with the model's intrinsic generation mechanism. Method: Proposes MESA, a plug-and-play latent intervention framework that intervenes in a controlled, selective way only on hallucination-relevant responses and preserves the original token distribution. Result: Across diverse generative and discriminative benchmarks, MESA consistently reduces hallucinations while preserving generation behavior better than prior methods, and it outperforms them across multiple LVLM families. Conclusion: MESA balances hallucination mitigation against fidelity of generation behavior, offering a new approach to trustworthy LVLM generation. Abstract: Large Vision-Language Models (LVLMs) have achieved remarkable success across cross-modal tasks but remain hindered by hallucinations, producing textual outputs inconsistent with visual content. Existing methods mitigate hallucinations but often alter generation behavior, resulting in shorter outputs and shifted token distributions, especially in latent space steering approaches. We identify that this issue stems from entangled steering signals, where suppressing hallucinations inadvertently disrupts the model's intrinsic generation behavior. To address this, we propose MESA, an effective plug-and-play framework that performs controlled and selective latent intervention for hallucination mitigation. Specifically, MESA targets hallucination-relevant responses while preserving the model's original token distribution, enabling effective hallucination reduction without compromising generation behavior. Extensive experiments across diverse generative and discriminative benchmarks demonstrate that MESA consistently reduces hallucinations while better preserving generation behavior, outperforming prior methods across multiple LVLM families.

[104] Tarot-SAM3: Training-free SAM3 for Any Referring Expression Segmentation

Weiming Zhang,Dingwen Xiao,Songyue Guo,Guangyu Xiang,Shiqi Wen,Minwei Zhao,Lei Chen,Lin Wang

Main category: cs.CV

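TL;DR: This paper proposes Tarot-SAM3, a training-free framework with two phases, an Expression Reasoning Interpreter (ERI) and Mask Self-Refining (MSR), that improves SAM3's robustness and generalization on referring expression segmentation (RES) for arbitrary expressions, covering explicit and implicit expressions as well as open-world scenarios.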

Details Motivation: Existing referring expression segmentation (RES) methods depend on large annotated datasets and struggle to handle both explicit and implicit expressions; SAM3 performs well on promptable segmentation but handles long or implicit expressions poorly, and coupling it directly to an MLLM makes results overly dependent on the MLLM's reasoning without refining the segmentation outputs. Method: Tarot-SAM3 is training-free. In the first phase, the Expression Reasoning Interpreter (ERI) uses reasoning-assisted generation of multiple prompt types and evaluation-aware rephrasing to turn arbitrary referring expressions into robust heterogeneous prompts for SAM3. In the second phase, Mask Self-Refining (MSR) uses DINOv3 feature relationships to select among and self-refine the masks produced by ERI, correcting over- and under-segmentation. Result: Strong performance on explicit and implicit RES benchmarks and in open-world scenarios; ablations validate both the ERI and MSR phases. Conclusion: Without any training, Tarot-SAM3 significantly improves SAM3's segmentation robustness and generalization on complex referring expressions, offering a new approach to vision-language alignment. Abstract: Referring Expression Segmentation (RES) aims to segment image regions described by natural-language expressions, serving as a bridge between vision and language understanding. Existing RES methods, however, rely heavily on large annotated datasets and are limited to either explicit or implicit expressions, hindering their ability to generalize to any referring expression. Recently, the Segment Anything Model 3 (SAM3) has shown impressive robustness in Promptable Concept Segmentation. Nonetheless, applying it to RES remains challenging: (1) SAM3 struggles with longer or implicit expressions; (2) naive coupling of SAM3 with a multimodal large language model (MLLM) makes the final results overly dependent on the MLLM's reasoning capability, without enabling refinement of SAM3's segmentation outputs. To this end, we present Tarot-SAM3, a novel training-free framework that can accurately segment from any referring expression. Specifically, Tarot-SAM3 consists of two key phases. First, the Expression Reasoning Interpreter (ERI) phase introduces reasoning-assisted prompt options to support structured expression parsing and evaluation-aware rephrasing. This transforms arbitrary queries into robust heterogeneous prompts for generating reliable masks with SAM3. Second, the Mask Self-Refining (MSR) phase selects the best mask across prompt types and performs self-refinement by leveraging rich feature relationships from DINOv3 to compare discriminative regions among ERI outputs.
It then infers region affiliation to the target, thereby correcting over- and under-segmentation. Extensive experiments demonstrate that Tarot-SAM3 achieves strong performance on both explicit and implicit RES benchmarks, as well as open-world scenarios. Ablation studies further validate the effectiveness of each phase.

[105] Stitch4D: Sparse Multi-Location 4D Urban Reconstruction via Spatio-Temporal Interpolation

Hina Kogure,Kei Katsumata,Taiki Miyanishi,Komei Sugiura

Main category: cs.CV

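TL;DR: This paper proposes Stitch4D, a framework for 4D reconstruction of urban environments from sparse multi-location views. It synthesizes intermediate bridge views and jointly optimizes real and synthesized observations in a unified coordinate frame to enforce inter-location consistency, significantly improving geometric completeness and dynamic continuity.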

Details Motivation: Existing 4D reconstruction methods rely on densely overlapping views, whereas real urban capture typically uses sparse, non-overlapping multi-location cameras, which leads to failed reconstruction of intermediate regions and temporal artifacts; this setting remains underexplored. Method: Stitch4D has two parts: (i) synthesizing intermediate bridge views to densify spatial constraints and coverage; (ii) jointly optimizing real and synthesized observations in a unified coordinate frame under explicit inter-location consistency constraints. Result: On the authors' CARLA-based benchmark U-S4D, Stitch4D outperforms representative 4D reconstruction baselines, with more complete geometry, smoother dynamics, and higher visual quality. Conclusion: Recovering intermediate spatial coverage is essential for stable 4D reconstruction in sparse urban environments, and Stitch4D provides an effective, unified solution. Abstract: Dynamic urban environments are often captured by cameras placed at spatially separated locations with little or no view overlap. However, most existing 4D reconstruction methods assume densely overlapping views. When applied to such sparse observations, these methods fail to reconstruct intermediate regions and often introduce temporal artifacts. To address this practical yet underexplored sparse multi-location setting, we propose Stitch4D, a unified 4D reconstruction framework that explicitly compensates for missing spatial coverage in sparse observations. Stitch4D (i) synthesizes intermediate bridge views to densify spatial constraints and improve spatial coverage, and (ii) jointly optimizes real and synthesized observations within a unified coordinate frame under explicit inter-location consistency constraints. By restoring intermediate coverage before optimization, Stitch4D prevents geometric collapse and reconstructs coherent geometry and smooth scene dynamics even in sparsely observed environments. To evaluate this setting, we introduce Urban Sparse 4D (U-S4D), a CARLA-based benchmark designed to assess spatiotemporal alignment under sparse multi-location configurations. Experimental results on U-S4D show that Stitch4D surpasses representative 4D reconstruction baselines and achieves superior visual quality. These results indicate that recovering intermediate spatial coverage is essential for stable 4D reconstruction in sparse urban environments.

[106] Generative 3D Gaussian Splatting for Arbitrary-Resolution Atmospheric Downscaling and Forecasting

Tao Hana,Zhibin Wen,Zhenghao Chen,Fenghua Lin,Junyu Gao,Song Guo,Lei Bai

Main category: cs.CV

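TL;DR: This paper proposes GSSA-ViT, a 3D Gaussian splatting-based scale-aware vision transformer for arbitrary-resolution numerical weather prediction and flexible downscaling. According to the authors, it is the first approach to combine generative 3D Gaussian modeling with scale-aware attention, and it improves the efficiency and accuracy of multi-scale atmospheric field prediction.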

Details Motivation: AI-based numerical weather prediction is fast, but high-resolution output is limited by poor multi-scale adaptability and inefficient data representations. Method: Treats latitude-longitude grid points as centers of 3D Gaussians and introduces a generative 3D Gaussian prediction scheme to estimate covariance, attributes, and opacity; designs a scale-aware attention module to model cross-scale dependencies and support continuous resolution adaptation. Result: On ERA5, accurately forecasts 87 atmospheric variables at arbitrary resolutions; on ERA5 and CMIP6, downscaling performance exceeds existing methods. Conclusion: GSSA-ViT offers an efficient, scalable approach to high-resolution, multi-scale atmospheric prediction and downscaling. Abstract: While AI-based numerical weather prediction (NWP) enables rapid forecasting, generating high-resolution outputs remains computationally demanding due to limited multi-scale adaptability and inefficient data representations. We propose the 3D Gaussian splatting-based scale-aware vision transformer (GSSA-ViT), a novel framework for arbitrary-resolution forecasting and flexible downscaling of high-dimensional atmospheric fields. Specifically, latitude-longitude grid points are treated as centers of 3D Gaussians. A generative 3D Gaussian prediction scheme is introduced to estimate key parameters, including covariance, attributes, and opacity, for unseen samples, improving generalization and mitigating overfitting. In addition, a scale-aware attention module is designed to capture cross-scale dependencies, enabling the model to effectively integrate information across varying downscaling ratios and support continuous resolution adaptation. To our knowledge, this is the first NWP approach that combines generative 3D Gaussian modeling with scale-aware attention for unified multi-scale prediction. Experiments on ERA5 show that the proposed method accurately forecasts 87 atmospheric variables at arbitrary resolutions, while evaluations on ERA5 and CMIP6 demonstrate its superior performance in downscaling tasks. The proposed framework provides an efficient and scalable solution for high-resolution, multi-scale atmospheric prediction and downscaling. Code is available at: https://github.com/binbin2xs/weather-GS.
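To illustrate why a Gaussian representation lends itself to arbitrary output resolution, here is a toy sketch, not the authors' model. It assumes isotropic 2D Gaussians with a shared, hand-picked `sigma` centered on coarse grid points and evaluates a normalized weighted average at any query location. The paper predicts covariance, attributes, and opacity with a network; none of that is reproduced here, and longitude wrap-around is ignored.

```python
import math

def splat_field(centers, values, sigma, query):
    """Normalised Gaussian-weighted average of values at a (lat, lon) query point."""
    num, den = 0.0, 0.0
    for (lat, lon), v in zip(centers, values):
        d2 = (lat - query[0]) ** 2 + (lon - query[1]) ** 2
        w = math.exp(-d2 / (2.0 * sigma ** 2))  # isotropic Gaussian weight
        num += w * v
        den += w
    return num / den

def render_grid(centers, values, sigma, lats, lons):
    """Evaluate the field on any output grid, independent of the input grid."""
    return [[splat_field(centers, values, sigma, (la, lo)) for lo in lons]
            for la in lats]
```

Because the field is defined continuously, `render_grid` can be called with any list of output latitudes and longitudes, which is the property that makes a single representation usable at several downscaling ratios.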

[107] Shortcut Learning in Glomerular AI: Adversarial Penalties Hurt, Entropy Helps

Mohammad Daouk,Jan Ulrich Becker,Neeraja Kambham,Anthony Chang,Hien Nguyen,Chandra Mohan

Main category: cs.CV

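TL;DR: This paper studies distribution shift and shortcut learning caused by stain variation in renal pathology AI. It proposes an unsupervised entropy regularization method that needs no stain or site labels and validates its effectiveness and robustness for lupus nephritis glomerular lesion classification on a multi-center, multi-stain dataset.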

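Details Motivation: Stain variability is a pervasive source of distribution shift in renal pathology AI and may lead models to use stain as a predictive shortcut, hurting generalization. The paper asks whether lupus nephritis glomerular lesion classifiers rely on stain shortcuts and proposes a mitigation strategy that needs no stain or center labels. Method: Curates a multi-center, multi-stain dataset of 9,674 glomerular patches from three centers and four stains (PAS, H&E, Jones, Trichrome); uses Bayesian CNN and ViT backbones with Monte Carlo dropout; evaluates three settings: (1) stain-only classification; (2) a dual-head model jointly predicting lesion and stain with a supervised stain loss; (3) label-free stain regularization by maximizing the entropy of the stain head's predictions. Result: (1) Stain identity is trivially learnable, confirming it as a strong candidate shortcut; (2) the supervised stain loss modulates stain performance but leaves lesion metrics essentially unchanged, indicating no measurable stain-driven shortcut learning on this dataset; (3) entropy regularization holds stain predictions near chance without degrading lesion accuracy or calibration. Conclusion: A carefully curated multi-stain dataset can itself be robust to stain shortcuts, and a Bayesian dual-head architecture with label-free entropy regularization is a simple, deployment-friendly safeguard against potential stain-related drift in glomerular AI. Abstract: Stain variability is a pervasive source of distribution shift and potential shortcut learning in renal pathology AI. We ask whether lupus nephritis glomerular lesion classifiers exploit stain as a shortcut, and how to mitigate such bias without stain or site labels. We curate a multi-center, multi-stain dataset of 9,674 glomerular patches (224×224) from 365 WSIs across three centers and four stains (PAS, H&E, Jones, Trichrome), labeled as proliferative vs. non-proliferative. We evaluate Bayesian CNN and ViT backbones with Monte Carlo dropout in three settings: (1) stain-only classification; (2) a dual-head model jointly predicting lesion and stain with supervised stain loss; and (3) a dual-head model with label-free stain regularization via entropy maximization on the stain head. In (1), stain identity is trivially learnable, confirming a strong candidate shortcut. In (2), varying the strength and sign of stain supervision strongly modulates stain performance but leaves lesion metrics essentially unchanged, indicating no measurable stain-driven shortcut learning on this multi-stain, multi-center dataset, while overly adversarial stain penalties inflate predictive uncertainty. In (3), entropy-based regularization holds stain predictions near chance without degrading lesion accuracy or calibration.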
Overall, a carefully curated multi-stain dataset can be inherently robust to stain shortcuts, and a Bayesian dual-head architecture with label-free entropy regularization offers a simple, deployment-friendly safeguard against potential stain-related drift in glomerular AI.
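Setting (3) can be pictured with a small sketch: the lesion head is trained with cross-entropy, and the entropy of the stain head's predicted distribution is rewarded so that stain predictions drift toward chance, without ever using a stain label. The weight `lam`, the function names, and the way the two terms are combined are assumptions made for illustration; the abstract does not specify them.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def cross_entropy(logits, label):
    return -math.log(softmax(logits)[label])

def dual_head_loss(lesion_logits, lesion_label, stain_logits, lam=0.1):
    """Lesion cross-entropy minus lam times stain-head entropy (no stain label)."""
    return cross_entropy(lesion_logits, lesion_label) - lam * entropy(softmax(stain_logits))
```

With four stains, the entropy term is largest (log 4) when the stain head is uniform, so minimizing this loss pushes stain predictions toward chance. In a real model the term would be back-propagated into shared features; how the authors route those gradients is not stated in the abstract.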

[108] ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks

Jiayang Xu,Fan Zhuo,Majun Zhang,Changhao Pan,Zehan Wang,Siyu Chen,Xiaoda Yang,Tao Jin,Zhou Zhao

Main category: cs.CV

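TL;DR: ImVideoEdit is an efficient video editing framework that learns high-quality video editing from image pairs alone. By freezing the 3D attention modules and introducing Predict-Update Spatial Difference Attention and Text-Guided Dynamic Semantic Gating, it maintains strong temporal consistency and editing fidelity at low computational cost.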

Details Motivation: Existing video editing models depend on expensive paired video data and are hard to scale, yet most video editing tasks are essentially a decoupled spatiotemporal process in which the pretrained model's temporal dynamics can be kept while only spatial content is precisely modified. Method: The ImVideoEdit framework freezes the pretrained 3D attention modules and treats images as single-frame videos to decouple 2D spatial learning; a Predict-Update Spatial Difference Attention module progressively extracts and injects spatial differences; Text-Guided Dynamic Semantic Gating enables adaptive, text-driven edits without explicit masks. Result: Trained on only 13K image pairs for 5 epochs with very low computational overhead, it achieves editing fidelity and temporal consistency comparable to larger models trained on extensive video datasets. Conclusion: Shows that video editing capability can be learned efficiently from image pairs alone, offering a new way to cut the training cost of video editing models and make them more practical. Abstract: Current video editing models often rely on expensive paired video data, which limits their practical scalability. In essence, most video editing tasks can be formulated as a decoupled spatiotemporal process, where the temporal dynamics of the pretrained model are preserved while spatial content is selectively and precisely modified. Based on this insight, we propose ImVideoEdit, an efficient framework that learns video editing capabilities entirely from image pairs. By freezing the pre-trained 3D attention modules and treating images as single-frame videos, we decouple the 2D spatial learning process to help preserve the original temporal dynamics. The core of our approach is a Predict-Update Spatial Difference Attention module that progressively extracts and injects spatial differences. Rather than relying on rigid external masks, we incorporate a Text-Guided Dynamic Semantic Gating mechanism for adaptive and implicit text-driven modifications. Despite training on only 13K image pairs for 5 epochs with exceptionally low computational overhead, ImVideoEdit achieves editing fidelity and temporal consistency comparable to larger models trained on extensive video datasets.

[109] TOOLCAD: Exploring Tool-Using Large Language Models in Text-to-CAD Generation with Reinforcement Learning

Yifei Gong,Xing Wu,Wenda Liu,Kang Tu

Main category: cs.CV

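TL;DR: This paper proposes ToolCAD, a framework that uses large language models (LLMs) as tool-using agents for text-to-CAD modeling. Combining an interactive CAD modeling environment with online curriculum reinforcement learning, it enables open-source LLMs to perform comparably to proprietary models on CAD modeling.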

Details Motivation: No prior work has examined how tool-using LLMs can best interact with CAD engines, which hinders the construction of text-to-CAD modeling systems. Method: Proposes the ToolCAD framework, builds an interactive CAD modeling gym to generate reasoning and tool-interaction trajectories, and designs an end-to-end post-training strategy that uses online curriculum reinforcement learning to guide the LLM toward high-quality CAD Modeling Chain of Thought (CAD-CoT). Result: ToolCAD fills the gap in training open-source LLMs as CAD tool-using agents, enabling them to perform comparably to proprietary models. Conclusion: ToolCAD offers a feasible path toward more accessible and robust autonomous text-to-CAD modeling systems. Abstract: Computer-Aided Design (CAD) is an expert-level task that relies on long-horizon reasoning and coherent modeling actions. Large Language Models (LLMs) have shown remarkable advancements in enabling language agents to tackle real-world tasks. Notably, there has been no investigation into how tool-using LLMs optimally interact with CAD engines, hindering the emergence of LLM-based agentic text-to-CAD modeling systems. We propose ToolCAD, a novel agentic CAD framework deploying LLMs as tool-using agents for text-to-CAD generation. Furthermore, we introduce an interactive CAD modeling gym to roll out reasoning and tool-augmented interaction trajectories with the CAD engine, incorporating hybrid feedback and human supervision. Meanwhile, an end-to-end post-training strategy is presented to enable the LLM agent to elicit refined CAD Modeling Chain of Thought (CAD-CoT) and evolve into a proficient CAD tool-using agent via online curriculum reinforcement learning. Our findings demonstrate that ToolCAD fills the gap in adopting and training open-source LLMs for CAD tool-using agents, enabling them to perform comparably to proprietary models and paving the way for more accessible and robust autonomous text-to-CAD modeling systems.

[110] DSCA: Dynamic Subspace Concept Alignment for Lifelong VLM Editing

Gyanendra Das,Sai Satyam Jena

Main category: cs.CV

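TL;DR: This paper proposes Dynamic Subspace Concept Alignment (DSCA), which decomposes a vision-language model's representation space into orthogonal semantic subspaces to structurally isolate concepts. This enables precise, non-interfering continual knowledge editing and significantly improves the stability and knowledge retention of long-term editing.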

Details Motivation: Existing knowledge editing methods for vision-language models (VLMs) operate in a shared representation space, causing concept entanglement and edit interference, and they struggle with catastrophic forgetting and cross-modal misalignment under continual editing. Method: Dynamic Subspace Concept Alignment (DSCA) builds orthogonal semantic subspaces through incremental clustering and PCA on joint vision-language representations; edits are made only within the corresponding subspace, guided by a multi-term loss for task fidelity, edit locality, and cross-modal alignment. Result: With the base model frozen: 98% single-edit success, still above 95% after 1000 sequential edits, hallucination reduced by 3-5%, the best backward transfer (BWT) scores, and state-of-the-art stability and knowledge retention across multiple datasets and benchmarks. Conclusion: Structural isolation of knowledge, rather than control through optimization, is key to robust continual VLM editing; DSCA turns concept isolation from a training objective into an architectural property, offering a new paradigm for lifelong model editing. Abstract: Model editing aims to update knowledge to add new concepts and change relevant information without retraining. Lifelong editing is a challenging task, prone to disrupting previously learned concepts, especially for Vision Language Models (VLMs), because sequential edits can lead to degraded reasoning and cross-modal misalignment. Existing VLM knowledge editing methods based on gated adapters, activation edits, and parameter merging techniques address catastrophic forgetting seen in full fine-tuning; however, they still operate in the shared representation space of the VLM, where concepts are entangled, so edits interfere with other non-relevant concepts. We hypothesize that this instability persists because current methods algorithmically control edits via optimization rather than structurally separating knowledge. We introduce Dynamic Subspace Concept Alignment (DSCA), which by design mitigates this limitation by decomposing the representation space into a set of orthogonal semantic subspaces and proposing edits only in those transformed spaces. These subspaces are obtained through incremental clustering and PCA on joint vision-language representations. This process structurally isolates concepts, enabling precise, non-interfering edits by turning isolation from a soft training objective into an architectural property. The surgical edits are guided by a multi-term loss function for maintaining task fidelity, edit locality, and cross-modal alignment.
With the base model frozen, our method achieves 98 percent single-edit success, remains over 95 percent after 1000 sequential edits, lowers hallucination by 3 to 5 percent, and achieves the best backward transfer (BWT) scores on continual instruction tuning benchmarks. Extensive experiments demonstrate DSCA's state-of-the-art stability and knowledge retention in continual lifelong editing across various datasets and benchmarks.
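The structural-isolation idea can be illustrated with plain linear algebra: if each concept owns a subspace orthogonal to the others, projecting an edit onto one subspace leaves every other subspace's coordinates untouched. The sketch below assumes orthonormal basis vectors are already given; the function names are hypothetical, and the incremental clustering, PCA step, and loss from the paper are not reproduced.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_onto_subspace(delta, basis):
    """Project delta onto span(basis); basis vectors are assumed orthonormal."""
    out = [0.0] * len(delta)
    for b in basis:
        c = dot(delta, b)  # coordinate of delta along this basis vector
        out = [o + c * bi for o, bi in zip(out, b)]
    return out

def apply_edit(h, delta, basis):
    """Add only the in-subspace component of delta to representation h."""
    return [hi + di for hi, di in zip(h, project_onto_subspace(delta, basis))]
```

For example, with a concept subspace spanned by the first two axes of a 4-dimensional space, an edit `delta = [0.5, -0.5, 9, 9]` changes only the first two coordinates of `h`; the large components along the other axes are discarded rather than leaking into unrelated concepts.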

[111] Lighting-grounded Video Generation with Renderer-based Agent Reasoning

Ziqi Cai,Taoyu Yang,Zheng Chang,Si Li,Han Jiang,Shuchen Weng,Boxin Shi

Main category: cs.CV

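TL;DR: LiVER is a diffusion-based framework for scene-controllable video generation. It conditions on explicit 3D scene properties (such as layout, lighting, and camera trajectory) to achieve precise, disentangled video control, and it comes with a large-scale annotated dataset and an automated scene agent.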

Details Motivation: Existing video diffusion models model key scene factors such as layout, lighting, and camera trajectory only weakly or in entangled form, which falls short of the explicit scene control needed in filmmaking and virtual production. Method: The LiVER framework: 1) builds a unified 3D scene representation and renders multiple control signals from it; 2) uses a lightweight conditioning module and a progressive training strategy to integrate the 3D control signals into a foundational video diffusion model; 3) provides a scene agent that automatically translates high-level user instructions into 3D control signals. Result: Achieves state-of-the-art photorealism and temporal consistency, supports image-to-video and video-to-video generation with a fully editable underlying 3D scene, and enables precise, disentangled control over scene factors. Conclusion: LiVER sets a new standard for controllable video generation and improves the practicality and controllability of diffusion models in professional creative settings. Abstract: Diffusion models have achieved remarkable progress in video generation, but their controllability remains a major limitation. Key scene factors such as layout, lighting, and camera trajectory are often entangled or only weakly modeled, restricting their applicability in domains like filmmaking and virtual production where explicit scene control is essential. We present LiVER, a diffusion-based framework for scene-controllable video generation. To achieve this, we introduce a novel framework that conditions video synthesis on explicit 3D scene properties, supported by a new large-scale dataset with dense annotations of object layout, lighting, and camera parameters. Our method disentangles these properties by rendering control signals from a unified 3D representation. We propose a lightweight conditioning module and a progressive training strategy to integrate these signals into a foundational video diffusion model, ensuring stable convergence and high fidelity. Our framework enables a wide range of applications, including image-to-video and video-to-video synthesis where the underlying 3D scene is fully editable. To further enhance usability, we develop a scene agent that automatically translates high-level user instructions into the required 3D control signals. Experiments show that LiVER achieves state-of-the-art photorealism and temporal consistency while enabling precise, disentangled control over scene factors, setting a new standard for controllable video generation.

[112] Object-Centric Stereo Ranging for Autonomous Driving: From Dense Disparity to Census-Based Template Matching

Qihao Huang

Main category: cs.CV

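TL;DR: This paper presents a stereo ranging system that combines dense stereo matching, object-centric Census template matching, and monocular geometric priors. Its focus is a GPU-accelerated, object-centric sparse stereo matching algorithm and an online calibration refinement framework, which together enable fast, robust, real-time long-range vehicle ranging.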

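Details Motivation: Traditional dense stereo matching methods (such as BM/SGM) suffer from high computational cost, sensitivity to radiometric differences between the stereo cameras, and poor long-range accuracy when used for long-range vehicle detection in autonomous driving, so a more efficient and robust depth estimation approach is needed. Method: Builds a unified detection-ranging-tracking pipeline integrating three complementary depth estimation approaches: 1) dense BM/SGM disparity; 2) object-centric Census-based template matching (with a far-close divide-and-conquer strategy, forward-backward verification, occlusion-aware sampling, and robust multi-block aggregation); 3) monocular geometric priors. It also includes an online calibration refinement framework (auto-rectification offset search, radar-stereo voting-based disparity correction, and object-level radar-stereo association). Result: The system ranges robustly in difficult driving conditions such as nighttime, rain, and varying illumination, and reaches real-time performance through an asynchronous GPU pipeline. Conclusion: The multi-method ranging system improves the accuracy, robustness, and real-time performance of long-range vehicle depth estimation; its object-centric sparse matching and online calibration offer a practical solution for deployment. Abstract: Accurate depth estimation is critical for autonomous driving perception systems, particularly for long-range vehicle detection on highways. Traditional dense stereo matching methods such as Block Matching (BM) and Semi-Global Matching (SGM) produce per-pixel disparity maps but suffer from high computational cost, sensitivity to radiometric differences between stereo cameras, and poor accuracy at long range where disparity values are small. In this report, we present a comprehensive stereo ranging system that integrates three complementary depth estimation approaches: dense BM/SGM disparity, object-centric Census-based template matching, and monocular geometric priors, within a unified detection-ranging-tracking pipeline. Our key contribution is a novel object-centric Census-based template matching algorithm that performs GPU-accelerated sparse stereo matching directly within detected bounding boxes, employing a far-close divide-and-conquer strategy, forward-backward verification, occlusion-aware sampling, and robust multi-block aggregation. We further describe an online calibration refinement framework that combines auto-rectification offset search, radar-stereo voting-based disparity correction, and object-level radar-stereo association for continuous extrinsic drift compensation. The complete system achieves real-time performance through asynchronous GPU pipeline design and delivers robust ranging across diverse driving conditions including nighttime, rain, and varying illumination.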
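As background for the Census-based matching mentioned above, here is a minimal CPU sketch of the standard building blocks: a 3x3 Census descriptor, Hamming-distance matching along a rectified row, and the pinhole stereo relation Z = f·B/d. It is not the report's GPU algorithm; the window size, function names, and search strategy are illustrative assumptions, and none of the verification or aggregation steps are included.

```python
def census_3x3(img, r, c):
    """3x3 Census descriptor: one bit per neighbour, set if neighbour < centre."""
    centre = img[r][c]
    bits = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            bits = (bits << 1) | (1 if img[r + dr][c + dc] < centre else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two Census descriptors."""
    return bin(a ^ b).count("1")

def best_disparity(left, right, r, c, max_disp):
    """Disparity with the lowest Census cost for left pixel (r, c) on the same row."""
    ref = census_3x3(left, r, c)
    costs = []
    for d in range(max_disp + 1):
        if c - d < 1:  # the 3x3 window must stay inside the right image
            break
        costs.append((hamming(ref, census_3x3(right, r, c - d)), d))
    return min(costs)[1]

def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Pinhole stereo depth Z = f * B / d, in metres."""
    return focal_px * baseline_m / disparity_px
```

Because Census only encodes intensity ordering, the descriptor is unchanged by any monotonic gain or offset difference between the two cameras, which is why it is a common answer to the radiometric sensitivity noted in the abstract. The depth relation also shows why long range is hard: depth varies as 1/d, so at small disparities a one-pixel error shifts the estimate by a large amount.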

[113] DP-DeGauss: Dynamic Probabilistic Gaussian Decomposition for Egocentric 4D Scene Reconstruction

Tingxi Chen,Zhengxue Cheng,Houqiang Zhong,Su Wang,Rong Xie,Li Song

Main category: cs.CV

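TL;DR: This paper proposes DP-DeGauss, a framework for dynamic 4D scene reconstruction from egocentric video. It uses probabilistic Gaussian decomposition to explicitly disentangle background, hands, and objects, significantly improving reconstruction quality and editability.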

Details Motivation: Existing methods struggle with the complex ego-motion, occlusions, and hand-object interactions in egocentric video, and they assume fixed viewpoints or merge all dynamics into a single foreground, so they cannot disentangle the scene at a fine level. Method: DP-DeGauss, a dynamic probabilistic Gaussian decomposition framework, initializes a unified 3D Gaussian set from COLMAP priors, attaches a learnable category probability to each Gaussian, and dynamically routes the Gaussians into deformation branches for background, hands, or objects, using category-specific masks together with brightness and motion-flow control. Result: Outperforms baselines by +1.70dB PSNR on average, with SSIM and LPIPS gains; the authors report the first explicit, fine-grained disentanglement of background, hand, and object components. Conclusion: DP-DeGauss offers a new approach to egocentric 4D reconstruction that supports intuitive scene understanding and editing, with applications in AR/VR and embodied AI. Abstract: Egocentric video is crucial for next-generation 4D scene reconstruction, with applications in AR/VR and embodied AI. However, reconstructing dynamic first-person scenes is challenging due to complex ego-motion, occlusions, and hand-object interactions. Existing decomposition methods are ill-suited, assuming fixed viewpoints or merging dynamics into a single foreground. To address these limitations, we introduce DP-DeGauss, a dynamic probabilistic Gaussian decomposition framework for egocentric 4D reconstruction. Our method initializes a unified 3D Gaussian set from COLMAP priors, augments each with a learnable category probability, and dynamically routes them into specialized deformation branches for background, hands, or object modeling. We employ category-specific masks for better disentanglement and introduce brightness and motion-flow control to improve static rendering and dynamic reconstruction. Extensive experiments show that DP-DeGauss outperforms baselines by +1.70dB in PSNR on average with SSIM and LPIPS gains. More importantly, our framework achieves the first and state-of-the-art disentanglement of background, hand, and object components, enabling explicit, fine-grained separation, paving the way for more intuitive ego scene understanding and editing.

[114] SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations

Yunnan Wang,Kecheng Zheng,Jianyuan Wang,Minghao Chen,David Novotny,Christian Rupprecht,Yinghao Xu,Xing Zhu,Wenjun Zeng,Xin Jin,Yujun Shen

Main category: cs.CV

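TL;DR: This paper introduces SceneScribe-1M, a large-scale multi-modal dataset of one million in-the-wild videos, each annotated with text descriptions, camera parameters, depth maps, and 3D point tracks, intended to support both 3D perception and video generation tasks.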

Details Motivation: Existing datasets advance either 3D understanding or video generation, and no large-scale unified resource supports both. Method: Builds the SceneScribe-1M dataset of one million in-the-wild videos, each annotated with text descriptions, camera parameters, dense depth maps, and consistent 3D point tracks; establishes benchmarks on it covering perception tasks (such as monocular depth estimation and scene reconstruction) and generation tasks (such as text-to-video synthesis). Result: Benchmarks are established on multiple downstream tasks, including monocular depth estimation, scene reconstruction, dynamic point tracking, and text-to-video synthesis with or without camera control, demonstrating the dataset's broad applicability. Conclusion: SceneScribe-1M bridges the data gap between 3D perception and video synthesis; as an open-source benchmark, it is expected to foster models that combine dynamic 3D understanding with controllable video generation. Abstract: The convergence of 3D geometric perception and video synthesis has created an unprecedented demand for large-scale video data that is rich in both semantic and spatio-temporal information. While existing datasets have advanced either 3D understanding or video generation, a significant gap remains in providing a unified resource that supports both domains at scale. To bridge this chasm, we introduce SceneScribe-1M, a new large-scale, multi-modal video dataset. It comprises one million in-the-wild videos, each meticulously annotated with detailed textual descriptions, precise camera parameters, dense depth maps, and consistent 3D point tracks. We demonstrate the versatility and value of SceneScribe-1M by establishing benchmarks across a wide array of downstream tasks, including monocular depth estimation, scene reconstruction, and dynamic point tracking, as well as generative tasks such as text-to-video synthesis, with or without camera control. By open-sourcing SceneScribe-1M, we aim to provide a comprehensive benchmark and a catalyst for research, fostering the development of models that can both perceive the dynamic 3D world and generate controllable, realistic video content.

[115] MotionScape: A Large-Scale Real-World Highly Dynamic UAV Video Dataset for World Models

Zile Guo,Zhan Chen,Enze Zhu,Kan Wei,Yongkang Zou,Xiaoxuan Liu,Lei Wang

Main category: cs.CV

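TL;DR: This paper presents MotionScape, described as the first large-scale real-world UAV-view video dataset for world models. It contains over 30 hours of 4K video (4.5M frames) paired with 6-DoF camera trajectories and fine-grained language descriptions, and it significantly improves world models' ability to model highly dynamic 3D motion.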

Details Motivation: Existing world models struggle to maintain spatiotemporal physical consistency under the highly dynamic 6-DoF camera trajectories of unmanned aerial vehicles (UAVs), mainly because of distribution bias in training data: mainstream datasets exhibit restricted 2.5D motion (such as ground-constrained driving or relatively smooth human egocentric video) and lack realistic, highly dynamic UAV motion priors. Method: Builds the MotionScape dataset by collecting large-scale real-world UAV video and processing it with an automated multi-stage pipeline that combines CLIP-based relevance filtering, temporal segmentation, robust visual SLAM for 6-DoF trajectory recovery, and LLM-driven semantic annotation, so that video samples are aligned both semantically and geometrically. Result: Experiments show that adding MotionScape and its semantically and geometrically aligned annotations improves existing world models' ability to model complex 3D dynamics and handle large viewpoint changes, which benefits decision-making and planning for UAV agents in complex environments. Conclusion: MotionScape fills the data gap for world modeling from highly dynamic UAV views, providing a foundational resource for physically consistent prediction and reasoning by embodied agents, especially UAVs, in open environments. Abstract: Recent advances in world models have demonstrated strong capabilities in simulating physical reality, making them an increasingly important foundation for embodied intelligence. For UAV agents in particular, accurate prediction of complex 3D dynamics is essential for autonomous navigation and robust decision-making in unconstrained environments. However, under the highly dynamic camera trajectories typical of UAV views, existing world models often struggle to maintain spatiotemporal physical consistency. A key reason lies in the distribution bias of current training data: most existing datasets exhibit restricted 2.5D motion patterns, such as ground-constrained autonomous driving scenes or relatively smooth human-centric egocentric videos, and therefore lack realistic high-dynamic 6-DoF UAV motion priors. To address this gap, we present MotionScape, a large-scale real-world UAV-view video dataset with highly dynamic motion for world modeling. MotionScape contains over 30 hours of 4K UAV-view videos, totaling more than 4.5M frames. This novel dataset features semantically and geometrically aligned training samples, where diverse real-world UAV videos are tightly coupled with accurate 6-DoF camera trajectories and fine-grained natural language descriptions. To build the dataset, we develop an automated multi-stage processing pipeline that integrates CLIP-based relevance filtering, temporal segmentation, robust visual SLAM for trajectory recovery, and large-language-model-driven semantic annotation.
Extensive experiments show that incorporating such semantically and geometrically aligned annotations effectively improves the ability of existing world models to simulate complex 3D dynamics and handle large viewpoint shifts, thereby benefiting decision-making and planning for UAV agents in complex environments. The dataset is publicly available at https://github.com/Thelegendzz/MotionScape

[116] SAT: Selective Aggregation Transformer for Image Super-Resolution

Dinh Phu Tran,Thao Do,Saad Wazir,Seongah Kim,Seon Kwon Kim,Daeyoung Kim

Main category: cs.CV

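TL;DR: This paper proposes the Selective Aggregation Transformer (SAT), which selectively aggregates key-value matrices with a density-driven token aggregation algorithm. It sharply reduces computation (97% fewer tokens) while enlarging the receptive field and preserving query resolution and reconstruction accuracy.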

Details Motivation: Conventional Transformers for image super-resolution are limited by the quadratic computational complexity of self-attention. Window attention improves efficiency but restricts the receptive field, so efficiency and global context modeling need to be balanced. Method: Propose the Selective Aggregation Transformer (SAT) with a Density-driven Token Aggregation algorithm that selectively aggregates the key-value matrices (keeping the query matrix at full resolution) and represents each cluster, defined by density and isolation, with a single aggregation token. Result: SAT outperforms the state-of-the-art method PFT by up to 0.22 dB while reducing FLOPs by up to 27%. Conclusion: SAT enlarges the model's receptive field and preserves high-fidelity reconstruction while significantly reducing computational complexity, offering a new paradigm for efficient global modeling. Abstract: Transformer-based approaches have revolutionized image super-resolution by modeling long-range dependencies. However, the quadratic computational complexity of vanilla self-attention mechanisms poses significant challenges, often leading to compromises between efficiency and global context exploitation. Recent window-based attention methods mitigate this by localizing computations, but they often yield restricted receptive fields. To mitigate these limitations, we propose Selective Aggregation Transformer (SAT). This novel transformer efficiently captures long-range dependencies, leading to an enlarged model receptive field by selectively aggregating key-value matrices (reducing the number of tokens by 97%) via our Density-driven Token Aggregation algorithm while maintaining the full resolution of the query matrix. This design significantly reduces computational costs, resulting in lower complexity and enabling scalable global interactions without compromising reconstruction fidelity. SAT identifies and represents each cluster with a single aggregation token, utilizing density and isolation metrics to ensure that critical high-frequency details are preserved. Experimental results demonstrate that SAT outperforms the state-of-the-art method PFT by up to 0.22dB, while the total number of FLOPs can be reduced by up to 27%.
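
To make the asymmetry between queries and keys/values concrete, the following is a minimal NumPy sketch of attention in which every query stays at full resolution while keys and values are averaged per cluster. The round-robin cluster assignment and the function name `aggregated_attention` are illustrative assumptions; the paper's Density-driven Token Aggregation, which relies on density and isolation metrics, is not reproduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def aggregated_attention(q, k, v, assignments, num_clusters):
    """Attention with full-resolution queries and cluster-averaged keys/values.

    q, k, v: arrays of shape (n_tokens, dim).
    assignments: integer cluster id per token (a hypothetical stand-in for
    the paper's density-driven clustering).
    """
    dim = q.shape[-1]
    # One aggregated key and value per cluster (mean of its members).
    k_agg = np.stack([k[assignments == c].mean(axis=0) for c in range(num_clusters)])
    v_agg = np.stack([v[assignments == c].mean(axis=0) for c in range(num_clusters)])
    scores = q @ k_agg.T / np.sqrt(dim)   # shape (n_tokens, num_clusters)
    return softmax(scores) @ v_agg        # shape (n_tokens, dim)

rng = np.random.default_rng(0)
n_tokens, dim, num_clusters = 1024, 32, 32
q, k, v = (rng.normal(size=(n_tokens, dim)) for _ in range(3))
assignments = np.arange(n_tokens) % num_clusters  # round-robin: no empty cluster
out = aggregated_attention(q, k, v, assignments, num_clusters)
kv_reduction = 1 - num_clusters / n_tokens  # fraction of key/value tokens removed
```

With 1024 tokens and 32 clusters, the score matrix is 1024 x 32 rather than 1024 x 1024, a key/value reduction of about 97%. These sizes were picked only to mirror the reduction quoted in the abstract.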

[117] Few-Shot Incremental 3D Object Detection in Dynamic Indoor Environments

Yun Zhu,Jianjun Qian,Jian Yang,Jin Xie,Na Zhao

Main category: cs.CV

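TL;DR: This paper proposes FI3Det, the first framework for few-shot incremental 3D object detection. It uses vision-language models (VLMs) to learn knowledge of unseen categories from a few novel-class samples and clearly outperforms baseline methods on ScanNet V2 and SUN RGB-D.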

Details Motivation: Existing incremental 3D detection methods depend on large amounts of annotated data for novel classes and cannot learn quickly from a few samples, as dynamic indoor environments require. Method: Propose a VLM-guided unknown object learning module that mines unknown objects and extracts 2D semantic features and class-agnostic 3D bounding boxes; design a weighting mechanism to mitigate noise; and introduce a gated multimodal prototype imprinting module that fuses aligned 2D semantic and 3D geometric features to produce classification scores. Result: FI3Det achieves consistent and significant improvements on ScanNet V2 and SUN RGB-D under both batch and sequential evaluation settings. Conclusion: FI3Det is the first to realize incremental 3D object detection under few-shot conditions, confirms that VLMs can transfer semantic priors to 3D perception, and gives embodied intelligence a more practical incremental perception capability. Abstract: Incremental 3D object perception is a critical step toward embodied intelligence in dynamic indoor environments. However, existing incremental 3D detection methods rely on extensive annotations of novel classes for satisfactory performance. To address this limitation, we propose FI3Det, a Few-shot Incremental 3D Detection framework that enables efficient 3D perception with only a few novel samples by leveraging vision-language models (VLMs) to learn knowledge of unseen categories. FI3Det introduces a VLM-guided unknown object learning module in the base stage to enhance perception of unseen categories. Specifically, it employs VLMs to mine unknown objects and extract comprehensive representations, including 2D semantic features and class-agnostic 3D bounding boxes. To mitigate noise in these representations, a weighting mechanism is further designed to re-weight the contributions of point- and box-level features based on their spatial locations and feature consistency within each box. Moreover, FI3Det proposes a gated multimodal prototype imprinting module, where category prototypes are constructed from aligned 2D semantic and 3D geometric features to compute classification scores, which are then fused via a multimodal gating mechanism for novel object detection. As the first framework for few-shot incremental 3D object detection, we establish both batch and sequential evaluation settings on two datasets, ScanNet V2 and SUN RGB-D, where FI3Det achieves strong and consistent improvements over baseline methods. Code is available at https://github.com/zyrant/FI3Det.

[118] SearchAD: Large-Scale Rare Image Retrieval Dataset for Autonomous Driving

Felix Embacher,Jonas Uhrig,Marius Cordts,Markus Enzweiler

Main category: cs.CV

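TL;DR: SearchAD is a large-scale rare image retrieval dataset for autonomous driving, containing 423k frames and 513k high-quality annotated bounding boxes across 90 rare categories. It supports text-to-image and image-to-image retrieval as well as few-shot learning, and targets the needle-in-a-haystack problem in long-tail scenarios.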

Details Motivation: Efficiently retrieving rare, safety-critical driving scenarios from large-scale datasets is a key challenge in building robust autonomous driving systems. Existing benchmarks mostly focus on instance-level retrieval, and there is no unified benchmark for semantic image retrieval, long-tail perception, and retrieval-driven data curation. Method: Build the SearchAD dataset by combining 11 existing datasets and manually annotating more than 513k bounding boxes over 90 rare categories; define semantic image retrieval tasks (text-to-image and image-to-image) with a well-defined train/validation/test split; and evaluate cross-modal retrieval models in zero-shot and fine-tuned settings. Result: Text-based methods outperform purely image-based ones thanks to stronger semantic grounding; models that directly align spatial visual features with language perform best in the zero-shot setting; and the fine-tuning baseline improves performance significantly, although overall retrieval capability remains unsatisfactory. Conclusion: SearchAD is the first large-scale autonomous driving image retrieval benchmark that supports retrieval-driven data curation and long-tail perception research, advancing the application and standardized evaluation of multimodal semantic retrieval in AD. Abstract: Retrieving rare and safety-critical driving scenarios from large-scale datasets is essential for building robust autonomous driving (AD) systems. As dataset sizes continue to grow, the key challenge shifts from collecting more data to efficiently identifying the most relevant samples. We introduce SearchAD, a large-scale rare image retrieval dataset for AD containing over 423k frames drawn from 11 established datasets. SearchAD provides high-quality manual annotations of more than 513k bounding boxes covering 90 rare categories. It specifically targets the needle-in-a-haystack problem of locating extremely rare classes, with some appearing fewer than 50 times across the entire dataset. Unlike existing benchmarks, which focus on instance-level retrieval, SearchAD emphasizes semantic image retrieval with a well-defined data split, enabling text-to-image and image-to-image retrieval, few-shot learning, and fine-tuning of multi-modal retrieval models. Comprehensive evaluations show that text-based methods outperform image-based ones due to stronger inherent semantic grounding. While models directly aligning spatial visual features with language achieve the best zero-shot results, and our fine-tuning baseline significantly improves performance, absolute retrieval capabilities remain unsatisfactory. With a held-out test set on a public benchmark server, SearchAD establishes the first large-scale dataset for retrieval-driven data curation and long-tail perception research in AD: https://iis-esslingen.github.io/searchad/

[119] Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding

Xuezhen Tu,Jingyu Wu,Fangyu Kang,Qingpeng Nong,Kaijin Zhang,Chaoyue Niu,Fan Wu

Main category: cs.CV

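TL;DR: This paper proposes Bridge-STG, a framework that decouples temporal and spatial localization and adds semantic bridging and query-guided mechanisms to address two challenges in spatio-temporal video grounding: entangled spatio-temporal alignment and dual-domain visual token redundancy. It achieves state-of-the-art performance on multiple benchmarks.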

Details Motivation: Existing MLLMs face two core challenges in spatio-temporal video grounding: entangled spatio-temporal alignment (caused by coupling heterogeneous sub-tasks in the same autoregressive output space) and dual-domain visual token redundancy (targets are sparse in both time and space, so most visual tokens are irrelevant). Method: Propose the end-to-end Bridge-STG framework: 1) Spatio-Temporal Semantic Bridging (STSB) with Explicit Temporal Alignment (ETA) distills the temporal reasoning context into enriched bridging queries that form a robust semantic interface; 2) the Query-Guided Spatial Localization (QGSL) module uses these queries to drive a dedicated spatial decoder and, together with multi-layer interactive queries and positive/negative frame sampling, removes dual-domain redundancy. Result: Average m_vIoU on VidSTG improves from 26.4 to 34.3; the method surpasses existing MLLM-based approaches on multiple benchmarks and shows cross-task transfer under unified multi-task training. Conclusion: Through decoupling and semantic bridging, Bridge-STG effectively mitigates entangled spatio-temporal alignment and visual redundancy, confirming that modular, cooperative design is effective and generalizes well for fine-grained video understanding. Abstract: Spatio-Temporal Video Grounding requires jointly localizing target objects across both temporal and spatial dimensions based on natural language queries, posing fundamental challenges for existing Multimodal Large Language Models (MLLMs). We identify two core challenges: entangled spatio-temporal alignment, arising from coupling two heterogeneous sub-tasks within the same autoregressive output space, and dual-domain visual token redundancy, where target objects exhibit simultaneous temporal and spatial sparsity, rendering the overwhelming majority of visual tokens irrelevant to the grounding query. To address these, we propose Bridge-STG, an end-to-end framework that decouples temporal and spatial localization while maintaining semantic coherence. While decoupling is the natural solution to this entanglement, it risks creating a semantic gap between the temporal MLLM and the spatial decoder. Bridge-STG resolves this through two pivotal designs: the Spatio-Temporal Semantic Bridging (STSB) mechanism with Explicit Temporal Alignment (ETA) distills the MLLM's temporal reasoning context into enriched bridging queries as a robust semantic interface; and the Query-Guided Spatial Localization (QGSL) module leverages these queries to drive a purpose-built spatial decoder with multi-layer interactive queries and positive/negative frame sampling, jointly eliminating dual-domain visual token redundancy.
Extensive experiments across multiple benchmarks demonstrate that Bridge-STG achieves state-of-the-art performance among MLLM-based methods. Bridge-STG improves average m_vIoU from 26.4 to 34.3 on VidSTG and demonstrates strong cross-task transfer across various fine-grained video understanding tasks under a unified multi-task training regime.

[120] Component-Adaptive and Lesion-Level Supervision for Improved Small Structure Segmentation in Brain MRI

Minh Sao Khue Luu,Evgeniy N. Pavlovskiy,Bair N. Tuchinov

Main category: cs.CV

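TL;DR: This paper proposes CATMIL, a unified objective that combines a Tversky loss with connected-component-adaptive weighting and lesion-level supervision based on multiple instance learning to improve nnU-Net on small lesion segmentation. Under severe class imbalance it notably raises small lesion recall and lowers false positive volume.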

Details Motivation: Small lesions in medical images are hard to segment and the classes are extremely imbalanced; existing methods fall short on small lesion recall and false positive control. Method: Propose the CATMIL objective, which combines 1) a Component-Adaptive Tversky loss (reweighting voxel contributions by connected component) and 2) a Multiple Instance Learning loss (adding lesion-instance-level supervision), optimized jointly with the standard nnU-Net loss. Evaluation uses 5-fold cross-validation on the MSLesSeg dataset within the nnU-Net framework. Result: Compared with standard losses, the Dice score improves to 0.7834, boundary error decreases, small lesion recall rises substantially, false negatives fall, and false positive volume is the lowest among compared methods. Conclusion: Integrating component-level and lesion-level supervision in a unified objective is an effective and practical way to improve small lesion segmentation in highly imbalanced settings. Abstract: We propose a unified objective function, termed CATMIL, that augments the base segmentation loss with two auxiliary supervision terms operating at different levels. The first term, Component-Adaptive Tversky, reweights voxel contributions based on connected components to balance the influence of lesions of different sizes. The second term, based on Multiple Instance Learning, introduces lesion-level supervision by encouraging the detection of each lesion instance. These terms are combined with the standard nnU-Net loss to jointly optimize voxel-level segmentation accuracy and lesion-level detection. We evaluate the proposed objective on the MSLesSeg dataset using a consistent nnU-Net framework and 5-fold cross-validation. The results show that CATMIL achieves the most balanced performance across segmentation accuracy, lesion detection, and error control. It improves Dice score (0.7834) and reduces boundary error compared to standard losses. More importantly, it substantially increases small lesion recall and reduces false negatives, while maintaining the lowest false positive volume among compared methods. These findings demonstrate that integrating component-level and lesion-level supervision within a unified objective provides an effective and practical approach for improving small lesion segmentation in highly imbalanced settings. All code and pretrained models are available at https://github.com/luumsk/SmallLesionMRI.
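
For readers unfamiliar with the Tversky loss that CATMIL builds on, here is a minimal NumPy sketch of the standard (unweighted) form, 1 - TP / (TP + alpha * FP + beta * FN). The function name, the `eps` smoothing term, and the toy masks are assumptions made for illustration; the paper's connected-component reweighting and its multiple-instance term are not shown.

```python
import numpy as np

def tversky_loss(pred, target, alpha=0.5, beta=0.5, eps=1e-7):
    """Standard Tversky loss on (soft) binary masks.

    alpha weights false positives, beta weights false negatives;
    alpha = beta = 0.5 reduces to the Dice loss.
    """
    pred = np.asarray(pred, dtype=float).ravel()
    target = np.asarray(target, dtype=float).ravel()
    tp = np.sum(pred * target)
    fp = np.sum(pred * (1.0 - target))
    fn = np.sum((1.0 - pred) * target)
    index = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return 1.0 - index

# Toy example: two lesion voxels in the target, one of them missed.
target = np.array([1, 0, 1, 0])
pred = np.array([1, 0, 0, 0])
dice_like = tversky_loss(pred, target)                          # alpha = beta = 0.5
recall_weighted = tversky_loss(pred, target, alpha=0.3, beta=0.7)
```

Raising `beta` above `alpha` makes the missed voxel cost more (about 0.41 versus 0.33 here), which is the usual reason for preferring a Tversky loss over Dice when recall on small structures matters.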

[121] Rotation Equivariant Convolutions in Deformable Registration of Brain MRI

Arghavan Rezvani,Kun Han,Anthony T. Wu,Pooya Khosravi,Xiaohui Xie

Main category: cs.CV

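TL;DR: This paper integrates rotation-equivariant convolutions into deformable brain MRI registration networks. Replacing the encoder in three baseline architectures yields higher registration accuracy, fewer parameters, greater robustness to rotated inputs, and better performance with limited training data.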

Details Motivation: CNNs lack rotation equivariance and therefore cannot exploit the rotational symmetries inherent in anatomical structures (especially in brain MRI), which limits registration performance. Method: Integrate rotation-equivariant convolutions into deformable brain MRI registration networks by replacing standard encoders with equivariant ones, and evaluate three baseline architectures on multiple public brain MRI datasets. Result: Equivariant encoders outperform the baselines in registration accuracy, parameter count, robustness to rotation, and performance with limited training data. Conclusion: Incorporating geometric priors such as rotation equivariance is a key step toward more robust, accurate, and efficient registration models. Abstract: Image registration is a fundamental task that aligns anatomical structures between images. While CNNs perform well, they lack rotation equivariance - a rotated input does not produce a correspondingly rotated output. This hinders performance by failing to exploit the rotational symmetries inherent in anatomical structures, particularly in brain MRI. In this work, we integrate rotation-equivariant convolutions into deformable brain MRI registration networks. We evaluate this approach by replacing standard encoders with equivariant ones in three baseline architectures, testing on multiple public brain MRI datasets. Our experiments demonstrate that equivariant encoders have three key advantages: 1) They achieve higher registration accuracy while reducing network parameters, confirming the benefit of this anatomical inductive bias. 2) They outperform baselines on rotated input pairs, demonstrating robustness to orientation variations common in clinical practice. 3) They show improved performance with less training data, indicating greater sample efficiency. Our results demonstrate that incorporating geometric priors is a critical step toward building more robust, accurate, and efficient registration models.
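
The property at stake, that rotating the input should rotate the output in the same way, can be checked numerically. The sketch below is a deliberately simplified 2D, 90-degree illustration and not the paper's 3D equivariant layers: it compares a filter that is symmetric under 90-degree rotation with a directional one. The helper names and the two kernels are assumptions chosen for the example.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    # Plain "valid" cross-correlation, as used by standard CNN layers.
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * image[i:i + oh, j:j + ow]
    return out

def is_rot90_equivariant(kernel, image):
    # Equivariance: filtering the rotated image equals rotating the filtered image.
    rotate_then_filter = correlate2d_valid(np.rot90(image), kernel)
    filter_then_rotate = np.rot90(correlate2d_valid(image, kernel))
    return bool(np.allclose(rotate_then_filter, filter_then_rotate))

rng = np.random.default_rng(0)
image = rng.normal(size=(16, 16))
# Laplacian-like kernel: unchanged by a 90-degree rotation.
symmetric_kernel = np.array([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]])
# Sobel-like horizontal gradient: changes under rotation.
directional_kernel = np.array([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
symmetric_ok = is_rot90_equivariant(symmetric_kernel, image)
directional_ok = is_rot90_equivariant(directional_kernel, image)
```

Only the rotation-symmetric kernel passes the check. An ordinary learned filter is generally directional, which is the gap that rotation-equivariant convolutions are designed to close by construction.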

[122] Beyond Mamba: Enhancing State-space Models with Deformable Dilated Convolutions for Multi-scale Traffic Object Detection

Jun Li,Yingying Shi,Zhixuan Ruan,Nan Guo,Jianhua Xu

Main category: cs.CV

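TL;DR: This paper proposes MDDCNet, which combines deformable dilated convolutions with Mamba to strengthen local detail modeling and multi-scale feature fusion, significantly improving small object detection accuracy in complex traffic scenes.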

Details Motivation: Existing Mamba-based methods struggle to capture the rich local details of small objects, and state-space models lack hierarchical feature representation and cross-scale interaction, which limits traffic object detection in cluttered backgrounds. Method: Propose MDDCNet: 1) a hybrid backbone that combines Multi-Scale Deformable Dilated Convolution (MSDDC) blocks with Mamba blocks for hierarchical feature modeling from local to global; 2) a Channel-Enhanced Feed-Forward Network (CE-FFN) to improve channel interaction; 3) a Mamba-based Attention-Aggregating Feature Pyramid Network (A²FPN) to strengthen multi-scale feature fusion and interaction. Result: On multiple public benchmarks and real-world traffic datasets, MDDCNet clearly outperforms a range of advanced detectors, confirming its effectiveness and robustness for small object detection in complex scenes. Conclusion: By introducing deformable dilated convolutions, channel enhancement, and a Mamba-driven multi-scale fusion structure, MDDCNet addresses the difficulty of detecting multi-scale objects, especially small ones, in traffic scenes and offers a new direction for object detection based on state-space models. Abstract: In a real-world traffic scenario, varying-scale objects are usually distributed in a cluttered background, which poses great challenges to accurate detection. Although current Mamba-based methods can efficiently model long-range dependencies, they still struggle to capture small objects with abundant local details, which hinders joint modeling of local structures and global semantics. Moreover, state-space models exhibit limited hierarchical feature representation and weak cross-scale interaction due to flat sequential modeling and insufficient spatial inductive biases, leading to sub-optimal performance in complex scenes. To address these issues, we propose a Mamba with Deformable Dilated Convolutions Network (MDDCNet) for accurate traffic object detection in this study. In MDDCNet, a well-designed hybrid backbone with successive Multi-Scale Deformable Dilated Convolution (MSDDC) blocks and Mamba blocks enables hierarchical feature representation from local details to global semantics. Meanwhile, a Channel-Enhanced Feed-Forward Network (CE-FFN) is further devised to overcome the limited channel interaction capability of conventional feed-forward networks, whilst a Mamba-based Attention-Aggregating Feature Pyramid Network (A^2FPN) is constructed to achieve enhanced multi-scale feature fusion and interaction. Extensive experimental results on public benchmark and real-world datasets demonstrate the superiority of our method over various advanced detectors. The code is available at https://github.com/Bettermea/MDDCNet.

[123] LINE: LLM-based Iterative Neuron Explanations for Vision Models

Vladimir Zaigrajew,Michał Piechota,Gaspar Sekula,Przemysław Biecek

Main category: cs.CV

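TL;DR: This paper proposes LINE, a training-free, iterative, open-vocabulary concept labeling method for explaining the concepts encoded by individual neurons in vision models. Working in a black-box setting, it combines a large language model with a text-to-image generator and uses activation history to guide the iterative proposal and refinement of concepts. It significantly improves concept labeling performance and supports polysemanticity evaluation and visual explanations.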

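Details Motivation: Existing neuron concept labeling methods are restricted to predefined concept vocabularies or produce overly specific descriptions, so they fail to capture higher-order, global concepts, which hampers understanding of deep network decision-making and assurance of AI safety. Method: Propose LINE, a training-free iterative framework that, under strictly black-box conditions, builds a closed loop from a large language model and a text-to-image generator and iteratively proposes and refines concept descriptions based on the neuron's activation history. Result: LINE reaches state-of-the-art performance across multiple model architectures, with AUC gains of up to 0.18 on ImageNet and 0.05 on Places365; on average, 29% of the concepts it discovers are new concepts not covered by large predefined vocabularies; it also provides the complete generation history, enabling polysemanticity evaluation and visual explanations that rival gradient-dependent activation maximization methods. Conclusion: LINE is an efficient, flexible, and interpretable open-vocabulary neuron concept labeling method that offers a new route to understanding the internals of deep vision models and improving AI interpretability and safety. Abstract: Interpreting the concepts encoded by individual neurons in deep neural networks is a crucial step towards understanding their complex decision-making processes and ensuring AI safety. Despite recent progress in neuron labeling, existing methods often limit the search space to predefined concept vocabularies or produce overly specific descriptions that fail to capture higher-order, global concepts. We introduce LINE, a novel, training-free iterative approach tailored for open-vocabulary concept labeling in vision models. Operating in a strictly black-box setting, LINE leverages a large language model and a text-to-image generator to iteratively propose and refine concepts in a closed loop, guided by activation history. We demonstrate that LINE achieves state-of-the-art performance across multiple model architectures, yielding AUC improvements of up to 0.18 on ImageNet and 0.05 on Places365, while discovering, on average, 29% of new concepts missed by massive predefined vocabularies. Beyond identifying the top concept, LINE provides a complete generation history, which enables polysemanticity evaluation and produces supporting visual explanations that rival gradient-dependent activation maximization methods.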

[124] 3DrawAgent: Teaching LLM to Draw in 3D with Early Contrastive Experience

Hongcan Xiao,Xinyue Xiao,Yilin Wang,Yue Zhang,Yonggang Qi

Main category: cs.CV

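TL;DR: This paper proposes 3DrawAgent, a training-free, language-driven framework for 3D sketch generation in which a large language model (LLM) draws 3D Bezier curves segment by segment under geometric feedback and improves itself without parameter updates through a relative experience optimization strategy.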

Details Motivation: Generating 3D sketches from natural language remains challenging; existing methods mostly rely on supervised learning or parameter updates, and a training-free generation framework with spatial understanding is missing. Method: Propose the 3DrawAgent framework, which uses an LLM to parse text and generate 3D Bezier curves, and introduce a relative experience optimization strategy that builds better/worse sketch pairs from CLIP-based perceptual rewards and fine-grained LLM assessment, adapting the GRPO paradigm to perform black-box reinforcement of 3D awareness without updating parameters. Result: Experiments show that the method generates complex, coherent 3D Bezier sketches from diverse text prompts, exhibits emergent geometric reasoning, and generalizes to novel shapes. Conclusion: 3DrawAgent establishes a new training-free paradigm for 3D sketch intelligence and offers a feasible route to improving models' spatial understanding and generation. Abstract: Sketching in 3D space enables expressive reasoning about shape, structure, and spatial relationships, yet generating 3D sketches through natural language remains a major challenge. In this work, we introduce 3DrawAgent, a training-free, language-driven framework for 3D sketch generation that leverages large language models (LLMs) to sequentially draw 3D Bezier curves under geometric feedback. Unlike prior 2D sketch agents, our method introduces a relative experience optimization strategy that adapts the recently proposed Group Relative Policy Optimization (GRPO) paradigm. Instead of relying on explicit ground-truth supervision, we construct pairwise comparisons among generated sketches, with each pair consisting of a relatively better and a worse result based on CLIP-based perceptual rewards and LLM-based fine-grained qualitative assessment. These experiences are then used to iteratively refine the prior knowledge of 3D drawing, enabling black-box reinforcement of the model's 3D awareness. This design allows our model to self-improve its spatial understanding and drawing quality without parameter updates. Experiments show that 3DrawAgent can generate complex and coherent 3D Bezier sketches from diverse textual prompts, exhibit emergent geometric reasoning, and generalize to novel shapes, establishing a new paradigm for advancing the field of training-free 3D sketch intelligence.
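
The drawing primitive here is a 3D Bezier curve. As a point of reference, the sketch below evaluates a cubic Bezier curve from four 3D control points with NumPy. The abstract does not state the curve degree, so the cubic form, the function name `cubic_bezier_3d`, and the control points are assumptions made for illustration.

```python
import numpy as np

def cubic_bezier_3d(control_points, t):
    """Evaluate a cubic Bezier curve at parameters t.

    control_points: array-like of shape (4, 3) holding P0..P3.
    t: scalar or sequence of parameters in [0, 1].
    Returns an array of shape (len(t), 3).
    """
    p0, p1, p2, p3 = np.asarray(control_points, dtype=float)
    t = np.atleast_1d(np.asarray(t, dtype=float))[:, None]
    u = 1.0 - t
    # Bernstein form: (1-t)^3 P0 + 3(1-t)^2 t P1 + 3(1-t) t^2 P2 + t^3 P3
    return u**3 * p0 + 3 * u**2 * t * p1 + 3 * u * t**2 * p2 + t**3 * p3

# Hypothetical control points; the curve starts at P0 and ends at P3.
control_points = [(0, 0, 0), (0, 1, 0), (1, 1, 0), (1, 0, 1)]
samples = cubic_bezier_3d(control_points, [0.0, 0.5, 1.0])
```

Because a stroke is fully described by a handful of control points, a language model only needs to emit a short list of coordinates per curve, which is presumably what makes this representation convenient for text-driven drawing.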

[125] Adapting Foundation Models for Annotation-Efficient Adnexal Mass Segmentation in Cine Images

Francesca Fati,Alberto Rota,Adriana V. Gregory,Anna Catozzo,Maria C. Giuliano,Mrinal Dhar,Luigi De Vitis,Annie T. Packard,Francesco Multinu,Elena De Momi,Carrie L. Langstraat,Timothy L. Kline

Main category: cs.CV

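TL;DR: This paper proposes a label-efficient framework for adnexal mass segmentation in ultrasound, built on a DINOv3 vision transformer with a DPT-style decoder. With limited annotated data it outperforms fully supervised models such as U-Net and remains robust when data are scarce.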

Details Motivation: Ultrasound evaluation of adnexal masses is subjective and varies considerably between observers; conventional fully supervised segmentation models need large amounts of pixel-level annotation and cope poorly with domain shift in medical imaging. Method: Use a pretrained DINOv3 backbone with a Dense Prediction Transformer (DPT)-style decoder that hierarchically reassembles multi-scale features, combining global semantics with fine spatial detail. Result: On 7,777 clinical ultrasound frames the method reaches a Dice score of 0.945 with better boundary accuracy (95th-percentile Hausdorff distance reduced by 11.4%); when trained on only 25% of the data it still clearly outperforms fully supervised baselines. Conclusion: Large-scale self-supervised pretrained foundation models can effectively ease the scarcity of annotated data in medical image segmentation and provide an efficient, practical solution for data-constrained clinical settings. Abstract: Adnexal mass evaluation via ultrasound is a challenging clinical task, often hindered by subjective interpretation and significant inter-observer variability. While automated segmentation is a foundational step for quantitative risk assessment, traditional fully supervised convolutional architectures frequently require large amounts of pixel-level annotations and struggle with domain shifts common in medical imaging. In this work, we propose a label-efficient segmentation framework that leverages the robust semantic priors of a pretrained DINOv3 foundational vision transformer backbone. By integrating this backbone with a Dense Prediction Transformer (DPT)-style decoder, our model hierarchically reassembles multi-scale features to combine global semantic representations with fine-grained spatial details. Evaluated on a clinical dataset of 7,777 annotated frames from 112 patients, our method achieves state-of-the-art performance compared to established fully supervised baselines, including U-Net, U-Net++, DeepLabV3, and MAnet. Specifically, we obtain a Dice score of 0.945 and improved boundary adherence, reducing the 95th-percentile Hausdorff Distance by 11.4% relative to the strongest convolutional baseline. Furthermore, we conduct an extensive efficiency analysis demonstrating that our DINOv3-based approach retains significantly higher performance under data starvation regimes, maintaining strong results even when trained on only 25% of the data.
These results suggest that leveraging large-scale self-supervised foundations provides a promising and data-efficient solution for medical image segmentation in data-constrained clinical environments. Project Repository: https://github.com/FrancescaFati/MESA

[126] Guiding a Diffusion Model by Swapping Its Tokens

Weijia Zhang,Yuehao Liu,Shanyan Guan,Wu Ran,Yanhao Ge,Wei Li,Chao Ma

Main category: cs.CV

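TL;DR: This paper proposes Self-Swap Guidance (SSG), which swaps the most semantically dissimilar token latents along the spatial or channel dimension to provide CFG-like guidance without conditional input. It applies to both conditional and unconditional diffusion generation and improves image fidelity, prompt alignment, and robustness.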

Details Motivation: Classifier-Free Guidance (CFG) is widely used to improve diffusion model image quality, but it depends on text conditions and cannot be used for unconditional generation. This work aims to extend CFG to the unconditional setting and to make the perturbation more finely controllable and more robust. Method: Propose Self-Swap Guidance (SSG), which during sampling swaps pairs of the most semantically dissimilar token latents (along the spatial or channel dimension) to build a perturbed prediction and uses the direction between it and the clean prediction to steer sampling; the perturbation is local, selective, and recomposable, unlike global or weakly constrained perturbation methods. Result: Experiments on MS-COCO 2014/2017 and ImageNet show that SSG surpasses existing condition-free guidance methods in image fidelity and prompt alignment, with fewer side effects and greater robustness across a wider range of perturbation strengths. Conclusion: SSG extends the idea of CFG to a unified framework covering conditional and unconditional generation, works as a plug-in that can be inserted into any diffusion model, and significantly improves generation quality and applicability. Abstract: Classifier-Free Guidance (CFG) is a widely used inference-time technique to boost the image quality of diffusion models. Yet, its reliance on text conditions prevents its use in unconditional generation. We propose a simple method to enable CFG-like guidance for both conditional and unconditional generation. The key idea is to generate a perturbed prediction via simple token swap operations, and use the direction between it and the clean prediction to steer sampling towards higher-fidelity distributions. In practice, we swap pairs of most semantically dissimilar token latents in either spatial or channel dimensions. Unlike existing methods that apply perturbation in a global or less constrained manner, our approach selectively exchanges and recomposes token latents, allowing finer control over perturbation and its influence on generated samples. Experiments on MS-COCO 2014, MS-COCO 2017, and ImageNet datasets demonstrate that the proposed Self-Swap Guidance (SSG), when applied to popular diffusion models, outperforms previous condition-free methods in image fidelity and prompt alignment under different set-ups. Its fine-grained perturbation granularity also improves robustness, reducing side-effects across a wider range of perturbation strengths. Overall, SSG extends CFG to a broader scope of applications including both conditional and unconditional generation, and can be readily inserted into any diffusion model as a plug-in to gain immediate improvements.
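
The two ingredients described above, swapping the most dissimilar tokens and extrapolating away from the perturbed prediction, can be illustrated on toy arrays. This is a rough sketch under stated assumptions: cosine similarity as the dissimilarity measure, a single swapped pair, and the CFG-style update `clean + scale * (clean - perturbed)` are all choices made for the example, and the predictions are plain vectors rather than outputs of a diffusion model.

```python
import numpy as np

def swap_most_dissimilar_pair(tokens):
    """Swap the two rows of `tokens` with the lowest cosine similarity."""
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    similarity = normed @ normed.T
    i, j = np.unravel_index(np.argmin(similarity), similarity.shape)
    swapped = tokens.copy()
    swapped[[i, j]] = swapped[[j, i]]  # exchange the two token latents
    return swapped, (int(i), int(j))

def guide(clean_pred, perturbed_pred, scale):
    # Push the clean prediction further away from the perturbed one.
    return clean_pred + scale * (clean_pred - perturbed_pred)

# Three toy token latents; tokens 0 and 2 point in opposite directions.
tokens = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
swapped, pair = swap_most_dissimilar_pair(tokens)

# Toy stand-ins for the model's clean and perturbed predictions.
clean_pred = np.array([1.0, 2.0])
perturbed_pred = np.array([0.5, 2.5])
guided = guide(clean_pred, perturbed_pred, scale=2.0)
```

In an actual sampler the perturbed prediction would come from running the model with swapped token latents at each step; with `scale=0` the update leaves the clean prediction unchanged.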

[127] ABMAMBA: Multimodal Large Language Model with Aligned Hierarchical Bidirectional Scan for Efficient Video Captioning

Daichi Yashima,Shuhei Kurita,Yusuke Oda,Shuntaro Suzuki,Seitaro Otsuki,Komei Sugiura

Main category: cs.CV

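TL;DR: This paper proposes ABMamba, a fully open multimodal large language model based on state space models. A hierarchical bidirectional scan module gives it linear computational complexity for video understanding, delivering high throughput and competitive performance on video captioning.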

Details Motivation: Existing Transformer-based video understanding models cannot process long video sequences efficiently because attention scales quadratically, so a more efficient architecture is needed. Method: Propose Aligned Hierarchical Bidirectional Scan Mamba (ABMamba), which uses a deep state space model as the language backbone in place of self-attention and adds a hierarchical bidirectional scan module that processes video at multiple temporal resolutions. Result: On standard video captioning benchmarks such as VATEX and MSR-VTT, ABMamba performs on par with typical multimodal large language models while achieving roughly three times the throughput. Conclusion: ABMamba shows that state space models are effective in open multimodal large language models and offers a new paradigm for efficient, scalable video understanding. Abstract: In this study, we focus on video captioning by fully open multimodal large language models (MLLMs). The comprehension of visual sequences is challenging because of their intricate temporal dependencies and substantial sequence length. The core attention mechanisms of existing Transformer-based approaches scale quadratically with the sequence length, making them computationally prohibitive. To address these limitations, we propose Aligned Hierarchical Bidirectional Scan Mamba (ABMamba), a fully open MLLM with linear computational complexity that enables the scalable processing of video sequences. ABMamba extends Deep State Space Models as its language backbone, replacing the costly quadratic attention mechanisms, and employs a novel Aligned Hierarchical Bidirectional Scan module that processes videos across multiple temporal resolutions. On standard video captioning benchmarks such as VATEX and MSR-VTT, ABMamba demonstrates competitive performance compared to typical MLLMs while achieving approximately three times higher throughput.

[128] EEG2Vision: A Multimodal EEG-Based Framework for 2D Visual Reconstruction in Cognitive Neuroscience

Emanuele Balloni,Emanuele Frontoni,Chiara Matti,Marina Paolanti,Roberto Pierdicca,Emiliano Santarnecchi

Main category: cs.CV

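TL;DR: This paper proposes EEG2Vision, a framework that reconstructs high-quality images from low-density EEG signals using diffusion models and a prompt-guided post-processing mechanism, significantly improving visual reconstruction quality and semantic consistency at low channel counts.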

Details Motivation: Non-invasive electroencephalography (EEG) has low spatial resolution and high noise, which makes high-quality reconstruction of visual stimuli difficult, especially with realistic low-density electrode configurations. Method: Propose EEG2Vision, a modular end-to-end EEG-to-image framework: an EEG-conditioned diffusion model produces an initial reconstruction, then a multimodal large language model extracts a semantic description that drives an image-to-image diffusion model to improve geometric and perceptual coherence. Performance is evaluated systematically at 128, 64, 32, and 24 channels. Result: Semantic decoding accuracy drops markedly as channels are removed (e.g., 50-way Top-1 accuracy falls from 89% to 38%), whereas reconstruction quality degrades only slightly (FID rises from 76.77 to 80.51); the post-reconstruction boosting consistently improves perceptual metrics, with Inception Score (IS) gains of up to 9.71% in low-channel settings; a user study confirms a clear perceptual preference for the boosted reconstructions. Conclusion: EEG2Vision makes real-time brain-to-image conversion on low-density EEG devices considerably more feasible and may help move this technology out of the laboratory and into practical use. Abstract: Reconstructing visual stimuli from non-invasive electroencephalography (EEG) remains challenging due to its low spatial resolution and high noise, particularly under realistic low-density electrode configurations. To address this, we present EEG2Vision, a modular, end-to-end EEG-to-image framework that systematically evaluates reconstruction performance across different EEG resolutions (128, 64, 32, and 24 channels) and enhances visual quality through a prompt-guided post-reconstruction boosting mechanism. Starting from EEG-conditioned diffusion reconstruction, the boosting stage uses a multimodal large language model to extract semantic descriptions and leverages image-to-image diffusion to refine geometry and perceptual coherence while preserving EEG-grounded structure. Our experiments show that semantic decoding accuracy degrades significantly with channel reduction (e.g., 50-way Top-1 Acc from 89% to 38%), while reconstruction quality decreases only slightly (e.g., FID from 76.77 to 80.51). The proposed boosting consistently improves perceptual metrics across all configurations, achieving up to 9.71% IS gains in low-channel settings. A user study confirms the clear perceptual preference for boosted reconstructions. The proposed approach significantly boosts the feasibility of real-time brain-to-image applications using low-resolution EEG devices, potentially unlocking this type of application outside laboratory settings.

[129] Small Vision-Language Models are Smart Compressors for Long Video Understanding

Junjie Fei,Jun Chen,Zechun Liu,Yunyang Xiong,Chong Zhou,Wei Wen,Junlin Han,Mingchen Zhuge,Saksham Suri,Qi Qian,Shuming Liu,Lemeng Wu,Raghuraman Krishnamoorthi,Vikas Chandra,Mohamed Elhoseiny,Chenchen Zhu

Main category: cs.CV

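TL;DR: Tempo is an efficient query-aware framework for long video understanding. It uses a small vision-language model (SVLM) for early cross-modal distillation and compresses video dynamically with training-free Adaptive Token Allocation (ATA), preserving key semantics and the global storyline under a strict token budget.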

Details Motivation: Existing methods are limited by context length on hour-long videos: dense visual streams exhaust the token budget and worsen the lost-in-the-middle phenomenon, while heuristic sampling strategies blindly discard key frames or waste bandwidth on irrelevant background. Method: Propose the Tempo framework: 1) a small vision-language model (SVLM) acts as a local temporal compressor, casting token reduction as early cross-modal distillation; 2) training-free, O(1) Adaptive Token Allocation (ATA) uses the SVLM's zero-shot relevance prior and semantic front-loading to assign dense bandwidth to query-critical segments while compressing redundancy into minimal temporal anchors. Result: On the extreme-long benchmark LVBench (4101 s), the 6B architecture scores 52.3 with an 8K visual token budget, outperforming GPT-4o and Gemini 1.5 Pro, and reaches 53.7 when scaled to 2048 frames; token usage stays well below the theoretical limit, indicating that intent-driven efficiency beats blindly enlarging the context window. Conclusion: Genuine long video understanding depends on efficient compression driven by query intent rather than on simply adding context capacity; Tempo shows that high-quality long video understanding is feasible under a tight token budget. Abstract: Adapting Multimodal Large Language Models (MLLMs) for hour-long videos is bottlenecked by context limits. Dense visual streams saturate token budgets and exacerbate the lost-in-the-middle phenomenon. Existing heuristics, like sparse sampling or uniform pooling, blindly sacrifice fidelity by discarding decisive moments and wasting bandwidth on irrelevant backgrounds. We propose Tempo, an efficient query-aware framework compressing long videos for downstream understanding. Tempo leverages a Small Vision-Language Model (SVLM) as a local temporal compressor, casting token reduction as an early cross-modal distillation process to generate compact, intent-aligned representations in a single forward pass. To enforce strict budgets without breaking causality, we introduce Adaptive Token Allocation (ATA). Exploiting the SVLM's zero-shot relevance prior and semantic front-loading, ATA acts as a training-free O(1) dynamic router. It allocates dense bandwidth to query-critical segments while compressing redundancies into minimal temporal anchors to maintain the global storyline. Extensive experiments show our 6B architecture achieves state-of-the-art performance with aggressive dynamic compression (0.5-16 tokens/frame). On the extreme-long LVBench (4101s), Tempo scores 52.3 under a strict 8K visual budget, outperforming GPT-4o and Gemini 1.5 Pro. Scaling to 2048 frames reaches 53.7.
Crucially, Tempo compresses hour-long videos substantially below theoretical limits, proving true long-form video understanding relies on intent-driven efficiency rather than greedily padded context windows.
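
To give a feel for budgeted, relevance-driven allocation, the sketch below spreads a fixed token budget over frames: every frame first receives a small floor (a minimal anchor), and the rest is shared in proportion to a relevance score, capped per frame. This is only an illustrative toy and not the paper's ATA router; the function name, the proportional rule, and the relevance values are assumptions, with the 0.5 and 16 tokens-per-frame bounds taken from the range quoted in the abstract.

```python
import numpy as np

def allocate_tokens(relevance, budget, floor=0.5, cap=16.0):
    """Split `budget` tokens across frames by relevance.

    Each frame gets at least `floor` tokens and at most `cap`; the
    remainder after the floor is shared proportionally to relevance.
    Capping can only lower a frame's share, so the total never exceeds
    the budget.
    """
    relevance = np.asarray(relevance, dtype=float)
    n_frames = relevance.size
    base = np.full(n_frames, floor)
    remaining = budget - base.sum()
    if remaining < 0:
        raise ValueError("budget is too small to give every frame the floor")
    total = relevance.sum()
    share = remaining * relevance / total if total > 0 else np.zeros(n_frames)
    return np.minimum(base + share, cap)

# Hypothetical query-relevance scores for five frames; frame 2 matters most.
relevance = np.array([0.05, 0.05, 0.60, 0.25, 0.05])
allocation = allocate_tokens(relevance, budget=20.0)
```

Here the most relevant frame receives 11 of the 20 tokens while the background frames keep just over one token each, which mirrors the idea of dense bandwidth for query-critical segments and minimal anchors elsewhere.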

[130] Brain3D: EEG-to-3D Decoding of Visual Representations via Multimodal Reasoning

Emanuele Balloni,Emanuele Frontoni,Chiara Matti,Marina Paolanti,Roberto Pierdicca,Emiliano Santarnecchi

Main category: cs.CV

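TL;DR: This paper proposes Brain3D, a staged multimodal framework (EEG → image → 3D-aware description → diffusion generation → single-image-to-3D) that reconstructs three-dimensional (3D) models from electroencephalography (EEG) signals. It avoids the difficulty of direct EEG-to-3D mapping and achieves strong semantic alignment and geometric fidelity.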

Details Motivation: Existing work focuses mainly on reconstructing 2D images from EEG, and 3D reconstruction remains largely unexplored, which limits the potential of neural decoding for geometric understanding and practical applications. Method: Propose the Brain3D architecture: first decode EEG into images; then use a multimodal large language model to extract structured 3D-aware descriptions; next generate images with a diffusion model conditioned on those descriptions; finally convert the generated images into coherent 3D meshes with a single-image-to-3D model. Result: The method achieves 85.4% Top-1 EEG decoding accuracy on a 10-way classification task and a CLIPScore of 0.648, and the reconstructions closely match the original stimuli both semantically and geometrically. Conclusion: Brain3D confirms the feasibility of EEG-based multimodal 3D reconstruction and opens a new route to 3D content generation in brain-computer interfaces and neuroscience. Abstract: Decoding visual information from electroencephalography (EEG) has recently achieved promising results, primarily focusing on reconstructing two-dimensional (2D) images from brain activity. However, the reconstruction of three-dimensional (3D) representations remains largely unexplored. This limits the geometric understanding and reduces the applicability of neural decoding in different contexts. To address this gap, we propose Brain3D, a multimodal architecture for EEG-to-3D reconstruction based on EEG-to-image decoding. It progressively transforms neural representations into the 3D domain using geometry-aware generative reasoning. Our pipeline first produces visually grounded images from EEG signals, then employs a multimodal large language model to extract structured 3D-aware descriptions, which guide a diffusion-based generation stage whose outputs are finally converted into coherent 3D meshes via a single-image-to-3D model. By decomposing the problem into structured stages, the proposed approach avoids direct EEG-to-3D mappings and enables scalable brain-driven 3D generation. We conduct a comprehensive evaluation comparing the reconstructed 3D outputs against the original visual stimuli, assessing both semantic alignment and geometric fidelity. Experimental results demonstrate strong performance of the proposed architecture, achieving up to 85.4% 10-way Top-1 EEG decoding accuracy and 0.648 CLIPScore, supporting the feasibility of multimodal EEG-driven 3D reconstruction.

[131] Can Vision Language Models Judge Action Quality? An Empirical Evaluation

Miguel Monte e Freitas,Rui Henriques,Ricardo Rei,Pedro Henrique Martins

Main category: cs.CV

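TL;DR: This paper systematically evaluates how vision language models (VLMs) actually perform on action quality assessment (AQA). Current state-of-the-art models score only marginally above random chance and show fundamental limitations such as prediction bias and sensitivity to linguistic framing, suggesting that VLMs are not yet capable of fine-grained action quality judgment.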

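Details Motivation: Although VLMs hold considerable promise for AQA, their actual performance in this domain has not been systematically characterised, and a comprehensive evaluation is needed to establish capability boundaries and failure modes. Method: Several frontier VLMs (e.g. Gemini 3.1 Pro, Qwen3-VL, InternVL3.5) are evaluated across activity domains (fitness, figure skating, diving), task settings, visual representations, and prompting strategies; prediction distributions are analysed to identify systematic biases, and a contrastive task reformulation is tried as a mitigation. Result: All models perform only marginally better than random guessing on AQA; strategies such as skeleton information, grounding instructions, reasoning structures, and in-context learning yield only isolated gains, and none is consistently effective; two systematic biases emerge: a tendency to predict 'correct execution' regardless of visual evidence, and over-sensitivity to linguistic framing; the contrastive reformulation does not significantly improve performance. Conclusion: VLMs currently have a fundamental difficulty with fine-grained action quality assessment that goes well beyond surface biases, and key failure modes must be addressed before reliable real-world deployment; the work provides a rigorous baseline and clear directions for future VLM-based AQA research. Abstract: Action Quality Assessment (AQA) has broad applications in physical therapy, sports coaching, and competitive judging. Although Vision Language Models (VLMs) hold considerable promise for AQA, their actual performance in this domain remains largely uncharacterised. We present a comprehensive evaluation of state-of-the-art VLMs across activity domains (e.g. fitness, figure skating, diving), tasks, representations, and prompting strategies. Baseline results reveal that Gemini 3.1 Pro, Qwen3-VL and InternVL3.5 models perform only marginally above random chance, and although strategies such as incorporation of skeleton information, grounding instructions, reasoning structures and in-context learning lead to isolated gains, none is consistently effective. Analysis of prediction distributions uncovers two systematic biases: a tendency to predict correct execution regardless of visual evidence, and a sensitivity to superficial linguistic framing. Reformulating tasks contrastively to mitigate these biases yields minimal improvement, suggesting that the models' limitations go beyond these biases, pointing to a fundamental difficulty with fine-grained movement quality assessment. Our findings establish a rigorous baseline for future VLM-based AQA research and provide an actionable outline for failure modes requiring mitigation prior to reliable real-world deployment.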

[132] AtlasOCR: Building the First Open-Source Darija OCR Model with Vision Language Models

Imane Momayiz,Soufiane Ait Elaouad,Abdeljalil Elmajjodi,Haitame Bouanane

Main category: cs.CV

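TL;DR: This paper presents AtlasOCR, the first open-source OCR model dedicated to Moroccan Arabic (Darija), obtained by fine-tuning a 3B-parameter vision language model (VLM); it reaches state-of-the-art performance on the self-built AtlasOCRBench dataset and the KITAB-Bench benchmark.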

Details Motivation: Moroccan Arabic (Darija) is rich in visual content but lacks dedicated OCR tools, and existing models support it poorly. Method: A Darija-specific dataset is built (combining synthetic generation with OCRSmith and real-world data), Qwen2.5-VL 3B is fine-tuned parameter-efficiently with QLoRA and Unsloth, and ablation studies are run on key hyperparameters. Result: State-of-the-art performance on both the newly built AtlasOCRBench and the established KITAB-Bench, challenging larger models, with strong robustness and generalization on both Darija and standard Arabic OCR tasks. Conclusion: AtlasOCR shows that lightweight, efficient fine-tuning of a large VLM can adapt it to OCR for a low-resource dialect, providing a reusable technical path and an open-source tool for OCR in under-resourced Arabic varieties. Abstract: Darija, the Moroccan Arabic dialect, is rich in visual content yet lacks specialized Optical Character Recognition (OCR) tools. This paper introduces AtlasOCR, the first open-source Darija OCR model built by fine-tuning a 3B parameter Vision Language Model (VLM). We detail our comprehensive approach, from curating a unique Darija-specific dataset leveraging both synthetic generation with our OCRSmith library and carefully sourced real-world data, to implementing efficient fine-tuning strategies. We utilize QLoRA and Unsloth for parameter-efficient training of Qwen2.5-VL 3B and present comprehensive ablation studies optimizing key hyperparameters. Our evaluation on the newly curated AtlasOCRBench and the established KITAB-Bench demonstrates state-of-the-art performance, challenging larger models and highlighting AtlasOCR's robustness and generalization capabilities for both Darija and standard Arabic OCR tasks.

[133] Entropy-Gradient Grounding: Training-Free Evidence Retrieval in Vision-Language Models

Marcel Gröpl,Jaewoo Jung,Seungryong Kim,Marc Pollefeys,Sunghwan Hong

Main category: cs.CV

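TL;DR: This paper proposes a training-free visual grounding method based on the model's intrinsic uncertainty: entropy gradients produce a relevance map, which is combined with multi-region extraction and iterative zoom-and-reground, significantly improving the reasoning and interpretability of vision-language models in detail-sensitive and high-resolution settings.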

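Details Motivation: Pretrained vision-language models perform poorly on tasks that depend on tiny visual details or on clues spread across multiple regions (e.g. document understanding and compositional queries), so a more precise and interpretable grounding mechanism is needed. Method: A training-free, model-intrinsic grounding method: the entropy of the model's next-token distribution serves as an uncertainty-based supervision signal and is backpropagated to the visual token embeddings to obtain an entropy-gradient relevance map; multiple coherent regions are then extracted and ranked to support multi-evidence queries, and an iterative zoom-and-reground procedure with a spatial-entropy stopping rule is added. Result: Consistent improvements on seven benchmarks across four VLM architectures, with the largest gains in detail-critical and high-resolution settings, along with more interpretable evidence localizations. Conclusion: Uncertainty-driven test-time evidence retrieval effectively improves a VLM's use of fine-grained visual information without auxiliary detectors or attention-map heuristics, and it generalizes well and is interpretable. Abstract: Despite rapid progress, pretrained vision-language models still struggle when answers depend on tiny visual details or on combining clues spread across multiple regions, as in documents and compositional queries. We address this by framing grounding as test-time evidence retrieval: given a query, the model should actively identify where to look next to resolve ambiguity. To this end, we propose a training-free, model-intrinsic grounding method that uses uncertainty as supervision. Specifically, we compute the entropy of the model's next-token distribution and backpropagate it to the visual token embeddings to obtain an entropy-gradient relevance map, without auxiliary detectors or attention-map heuristics. We then extract and rank multiple coherent regions to support multi-evidence queries, and introduce an iterative zoom-and-reground procedure with a spatial-entropy stopping rule to avoid over-refinement. Experiments on seven benchmarks across four VLM architectures demonstrate consistent improvements over existing methods, with the largest gains on detail-critical and high-resolution settings, while also producing more interpretable evidence localizations.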
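The entropy-gradient relevance map described in this abstract can be illustrated with a small sketch. Everything below is an assumption made for illustration: `toy_next_token_logits` is a hypothetical stand-in for a VLM, and the gradient is approximated with central finite differences instead of backpropagation, which a real implementation would obtain from an autodiff framework. The per-token relevance is taken as the norm of the entropy gradient with respect to that token's embedding.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy (in nats) of a next-token distribution.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def toy_next_token_logits(visual_tokens, readout):
    # Hypothetical stand-in for a VLM (NOT the paper's model): each vocabulary
    # logit sums dot(readout[k], tanh(token)) over all visual tokens.
    return [
        sum(sum(w * math.tanh(x) for w, x in zip(row, tok)) for tok in visual_tokens)
        for row in readout
    ]

def entropy_gradient_relevance(visual_tokens, readout, eps=1e-5):
    # Relevance of token i = L2 norm of d(entropy)/d(embedding_i),
    # approximated here by central finite differences.
    def output_entropy(tokens):
        return entropy(softmax(toy_next_token_logits(tokens, readout)))

    relevance = []
    for i, tok in enumerate(visual_tokens):
        squared = 0.0
        for j in range(len(tok)):
            plus = [list(t) for t in visual_tokens]
            minus = [list(t) for t in visual_tokens]
            plus[i][j] += eps
            minus[i][j] -= eps
            grad = (output_entropy(plus) - output_entropy(minus)) / (2 * eps)
            squared += grad * grad
        relevance.append(math.sqrt(squared))
    return relevance
```

In this toy setup, a token whose embedding saturates the `tanh` contributes almost nothing to the entropy gradient, so it receives a relevance close to zero, while an unsaturated token that still influences the output distribution receives a higher score.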

[134] Tensor-Augmented Convolutional Neural Networks: Enhancing Expressivity with Generic Tensor Kernels

Chia-Wei Hsing,Wei-Lin Tu

Main category: cs.CV

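TL;DR: This paper proposes a physically guided shallow tensor-augmented convolutional neural network (TACNN) that replaces conventional convolution kernels with generic tensors to increase representational capacity and capture high-order feature correlations; on Fashion-MNIST it reaches 93.7% accuracy with only two layers, comparable to VGG-16 and GoogLeNet.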

Details Motivation: Conventional CNNs rely on deep architectures to capture complex correlations, which makes them computationally expensive and hard to interpret; tensors naturally encode quantum superposition states and are more expressive, so they can support simpler and more efficient models. Method: The tensor-augmented CNN (TACNN) generalizes convolution kernels to high-order tensors, so that the convolution output of each layer becomes a multilinear form capable of modeling high-order feature interactions. Result: On Fashion-MNIST, a TACNN with only two convolution layers reaches 93.7% test accuracy, surpassing VGG-16 (93.5%) and matching GoogLeNet (93.7%). Conclusion: With its physics-inspired tensor design, TACNN substantially increases expressivity while keeping a shallow structure, offering a new path toward efficient and interpretable deep learning models. Abstract: Convolutional Neural Networks (CNNs) excel at extracting local features hierarchically, but their performance in capturing complex correlations hinges heavily on deep architectures, which are usually computationally demanding and difficult to interpret. To address these issues, we propose a physically-guided shallow model: tensor-augmented CNN (TACNN), which replaces conventional convolution kernels with generic tensors to enhance representational capacity. This choice is motivated by the fact that an order-$N$ tensor naturally encodes an arbitrary quantum superposition state in the Hilbert space of dimension $d^N$, where $d$ is the local physical dimension, thus offering substantially richer expressivity. Furthermore, in our design the convolution output of each layer becomes a multilinear form capable of capturing high-order feature correlations, thereby equipping a shallow multilayer architecture with an expressive power competitive to that of deep CNNs. On the Fashion-MNIST benchmark, TACNN demonstrates clear advantages over conventional CNNs, achieving remarkable accuracies with only a few layers. In particular, a TACNN with only two convolution layers attains a test accuracy of 93.7%, surpassing or matching considerably deeper models such as VGG-16 (93.5%) and GoogLeNet (93.7%). These findings highlight TACNN as a promising framework that strengthens model expressivity while preserving architectural simplicity, paving the way towards more interpretable and efficient deep learning models.

[135] What They Saw, Not Just Where They Looked: Semantic Scanpath Similarity via VLMs and NLP metric

Mohamed Amine Kerkouri,Marouane Tliba,Bin Wang,Aladine Chetouani,Ulas Bagci,Alessandro Bruno

Main category: cs.CV

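TL;DR: This paper proposes a semantic scanpath similarity framework based on vision-language models (VLMs): fixations are converted into textual descriptions and compared semantically, compensating for the fact that conventional spatial and temporal alignment methods ignore semantic equivalence. Experiments show that the method captures content-agreement information that is orthogonal to geometric alignment.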

Details Motivation: Existing scanpath similarity metrics focus mainly on spatial and temporal alignment and neglect semantic equivalence between attended image regions. Method: A VLM encodes each fixation under controlled visual context (patch-based and marker-based) and produces a concise textual description; the descriptions are aggregated into scanpath-level representations; semantic similarity is then computed with embedding-based and lexical NLP metrics and compared against classical spatial measures such as MultiMatch and DTW. Result: Semantic similarity captures variance that is partially independent of geometric alignment and can still reflect high content agreement when spatial divergence is large; the context-encoding strategy affects description fidelity and metric stability. Conclusion: Multimodal foundation models can support interpretable, content-aware extensions of classical scanpath analysis, providing a complementary dimension for gaze research in the ETRA community. Abstract: Scanpath similarity metrics are central to eye-movement research, yet existing methods predominantly evaluate spatial and temporal alignment while neglecting semantic equivalence between attended image regions. We present a semantic scanpath similarity framework that integrates vision-language models (VLMs) into eye-tracking analysis. Each fixation is encoded under controlled visual context (patch-based and marker-based strategies) and transformed into concise textual descriptions, which are aggregated into scanpath-level representations. Semantic similarity is then computed using embedding-based and lexical NLP metrics and compared against established spatial measures, including MultiMatch and DTW. Experiments on free-viewing eye-tracking data demonstrate that semantic similarity captures partially independent variance from geometric alignment, revealing cases of high content agreement despite spatial divergence. We further analyze the impact of contextual encoding on description fidelity and metric stability. Our findings suggest that multimodal foundation models enable interpretable, content-aware extensions of classical scanpath analysis, providing a complementary dimension for gaze research within the ETRA community.

[136] DinoRADE: Full Spectral Radar-Camera Fusion with Vision Foundation Model Features for Multi-class Object Detection in Adverse Weather

Christof Leitgeb,Thomas Puchleitner,Max Peter Ronecker,Daniel Watzenig

Main category: cs.CV

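TL;DR: This paper proposes DinoRADE, a radar-centered detection framework that fuses FMCW radar tensors with features from the DINOv3 vision foundation model through deformable cross-modal attention; on the K-Radar dataset it significantly improves detection of small objects such as vulnerable road users (VRUs) in adverse weather, outperforming existing radar-camera methods by 12.1%.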

Details Motivation: Existing FMCW radar approaches detect well in adverse weather but struggle to resolve fine-grained spatial details, especially for small objects and vulnerable road users (VRUs); systematic studies of VRU detection on adverse-weather datasets such as K-Radar are also lacking. Method: DinoRADE takes dense radar tensors as input, extracts vision features with DINOv3, and uses deformable cross-attention to aggregate vision features around transformed reference points in the camera perspective, giving a radar-led, vision-enhanced multimodal fusion detector. Result: A comprehensive evaluation on K-Radar in all weather conditions, among the first to report detection performance separately for five object classes; compared with existing single-class detection approaches and recent radar-camera methods, mAP improves by 12.1%. Conclusion: DinoRADE supports the effectiveness of a radar-centered paradigm combined with an advanced vision foundation model and deformable attention, offering a practical approach to robust, fine-grained autonomous-driving perception in adverse weather. Abstract: Reliable and weather-robust perception systems are essential for safe autonomous driving and typically employ multi-modal sensor configurations to achieve comprehensive environmental awareness. While recent automotive FMCW Radar-based approaches achieved remarkable performance on detection tasks in adverse weather conditions, they exhibited limitations in resolving fine-grained spatial details particularly critical for detecting smaller and vulnerable road users (VRUs). Furthermore, existing research has not adequately addressed VRU detection in adverse weather datasets such as K-Radar. We present DinoRADE, a Radar-centered detection pipeline that processes dense Radar tensors and aggregates vision features around transformed reference points in the camera perspective via deformable cross-attention. Vision features are provided by a DINOv3 Vision Foundation Model. We present a comprehensive performance evaluation on the K-Radar dataset in all weather conditions and are among the first to report detection performance individually for five object classes. Additionally, we compare our method with existing single-class detection approaches and outperform recent Radar-camera approaches by 12.1%. The code is available under https://github.com/chr-is-tof/RADE-Net.

[137] OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks

Wenbo Hu,Xin Chen,Yan Gao-Tian,Yihe Deng,Nanyun Peng,Kai-Wei Chang

Main category: cs.CV

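TL;DR: This paper proposes G²RPO, a new reinforcement learning training objective that uses non-linear distributional matching to make the advantage distribution converge to a standard normal distribution, addressing the large differences in reward topology across visual tasks and the difficulty of balancing perception with reasoning in multimodal large models; combined with response length shaping and entropy shaping, it yields OpenVLThinkerV2, a high-performing open-source generalist multimodal model.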

Details Motivation: Extending Group Relative Policy Optimization (GRPO) to open-source multimodal generalist models is constrained mainly by the extreme variance in reward topologies across visual tasks and by the difficulty of balancing fine-grained perception with multi-step reasoning. Method: Gaussian GRPO (G²RPO) replaces linear scaling with non-linear distributional matching, forcing each task's advantage distribution to converge to the standard normal distribution N(0,1); two task-level shaping mechanisms are added: response length shaping (dynamically adjusting output length to suit either complex reasoning or direct visual responses) and entropy shaping (bounding exploration entropy to prevent collapse or explosion). Result: The resulting OpenVLThinkerV2 model outperforms strong open-source and leading proprietary frontier models across 18 diverse benchmarks, indicating clear gains in stability and generalization. Conclusion: G²RPO and its shaping mechanisms address cross-task gradient imbalance and the perception-reasoning imbalance in multimodal RL training, offering a more robust and scalable RL optimization recipe for open-source multimodal generalist models. Abstract: Group Relative Policy Optimization (GRPO) has emerged as the de facto Reinforcement Learning (RL) objective driving recent advancements in Multimodal Large Language Models. However, extending this success to open-source multimodal generalist models remains heavily constrained by two primary challenges: the extreme variance in reward topologies across diverse visual tasks, and the inherent difficulty of balancing fine-grained perception with multi-step reasoning capabilities. To address these issues, we introduce Gaussian GRPO (G$^2$RPO), a novel RL training objective that replaces standard linear scaling with non-linear distributional matching. By mathematically forcing the advantage distribution of any given task to strictly converge to a standard normal distribution, $\mathcal{N}(0,1)$, G$^2$RPO theoretically ensures inter-task gradient equity, mitigates vulnerabilities to heavy-tail outliers, and offers symmetric updates for positive and negative rewards. Leveraging the enhanced training stability provided by G$^2$RPO, we introduce two task-level shaping mechanisms to seamlessly balance perception and reasoning. First, response length shaping dynamically elicits extended reasoning chains for complex queries while enforcing direct outputs to bolster visual grounding. Second, entropy shaping tightly bounds the model's exploration zone, effectively preventing both entropy collapse and entropy explosion. Integrating these methodologies, we present OpenVLThinkerV2, a highly robust, general-purpose multimodal model. Extensive evaluations across 18 diverse benchmarks demonstrate its superior performance over strong open-source and leading proprietary frontier models.
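To make the contrast between linear scaling and non-linear distributional matching concrete, here is a minimal sketch. The abstract does not give the exact mapping used by G$^2$RPO, so the rank-based inverse normal transform below is an assumption chosen for illustration, and the function names `grpo_advantages` and `gaussianized_advantages` are hypothetical.

```python
from statistics import NormalDist, mean, pstdev

def grpo_advantages(rewards):
    # Standard linear scaling: z-score of each reward within its group.
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

def gaussianized_advantages(rewards):
    # Assumed illustration of non-linear distributional matching:
    # map each reward's (tie-averaged) rank to a standard normal quantile.
    n = len(rewards)
    order = sorted(range(n), key=lambda i: rewards[i])
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and rewards[order[end + 1]] == rewards[order[start]]:
            end += 1
        avg_rank = (start + end) / 2 + 1  # 1-based average rank of the tie group
        for k in range(start, end + 1):
            ranks[order[k]] = avg_rank
        start = end + 1
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [std_normal.inv_cdf((r - 0.5) / n) for r in ranks]
```

With a heavy-tailed group such as `[0.0, 0.1, 0.2, 100.0]`, the z-scores are dominated by the outlier, whereas the rank-based version depends only on ordering and yields values that are symmetric around zero. That is one way to read the abstract's claims about outlier robustness and symmetric updates, under the stated assumption.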

[138] AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding

Handong Li,Zikang Liu,Longteng Guo,Tongtian Yue,Yepeng Tang,Xinxin Zhu,Chuanyang Zheng,Ziming Wang,Zhibin Wang,Jun Song,Cheng Yu,Bo Zheng,Jing Liu

Main category: cs.CV

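TL;DR: This paper proposes AdaSpark, an adaptive sparsity framework that adaptively selects video cubes and key tokens, greatly reducing computation while preserving fine-grained perception and long-range temporal modeling.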

Details Motivation: Existing Video-LLMs are computationally expensive on long videos, and efficiency methods often sacrifice fine-grained perception or restrict long-range temporal modeling. Method: AdaSpark partitions videos into 3D spatio-temporal cubes and uses two co-designed components, Adaptive Cube-Selective Attention (AdaS-Attn) and Adaptive Token-Selective FFN (AdaS-FFN), together with an entropy-based Top-p mechanism that allocates compute dynamically. Result: On hour-scale video benchmarks, AdaSpark reduces FLOPs by up to 57% while maintaining performance comparable to dense models and preserving fine-grained, long-range dependency modeling. Conclusion: AdaSpark offers an efficient and flexible approach to long-video processing that does not sacrifice modeling capacity. Abstract: Processing long-form videos with Video Large Language Models (Video-LLMs) is computationally prohibitive. Current efficiency methods often compromise fine-grained perception through irreversible information disposal or inhibit long-range temporal modeling via rigid, predefined sparse patterns. This paper introduces AdaSpark, an adaptive sparsity framework designed to address these limitations. AdaSpark first partitions video inputs into 3D spatio-temporal cubes. It then employs two co-designed, context-aware components: (1) Adaptive Cube-Selective Attention (AdaS-Attn), which adaptively selects a subset of relevant video cubes to attend to for each query token, and (2) Adaptive Token-Selective FFN (AdaS-FFN), which selectively processes only the most salient tokens within each cube. An entropy-based (Top-p) selection mechanism adaptively allocates computational resources based on input complexity. Experiments demonstrate that AdaSpark significantly reduces computational load by up to 57% FLOPs while maintaining comparable performance to dense models and preserving fine-grained, long-range dependencies, as validated on challenging hour-scale video benchmarks.
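A Top-p rule of the kind mentioned in this abstract can be sketched as follows. This is a generic nucleus-style selection under assumed inputs, not AdaSpark's actual implementation: `scores` stands for hypothetical per-cube relevance scores for one query token, and the function name `top_p_select` is illustrative.

```python
import math

def top_p_select(scores, p=0.9):
    # Softmax-normalise the scores, then keep the smallest set of items whose
    # cumulative probability reaches p. Peaked (low-entropy) distributions keep
    # few items; flat (high-entropy) distributions keep many.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    selected, cumulative = [], 0.0
    for i in order:
        selected.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return sorted(selected)
```

For example, `top_p_select([10.0, 0.0, 0.0, 0.0])` keeps only the first cube, while four equal scores keep all four, which is the sense in which the compute budget adapts to input complexity.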

[139] AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation

Ziwei Zhou,Zeyuan Lai,Rui Wang,Yifan Yang,Zhen Xing,Yuqing Yang,Qi Dai,Lili Qiu,Chong Luo

Main category: cs.CV

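TL;DR: This paper presents AVGen-Bench, a task-driven benchmark for text-to-audio-video (T2AV) generation, together with a multi-granular evaluation framework; it reveals significant weaknesses of current T2AV models in semantic reliability (e.g. text rendering, speech coherence, physical reasoning, musical pitch control).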

Details Motivation: Existing T2AV evaluation is fragmented: it mostly assesses audio or video in isolation or relies on coarse embedding similarity, and fails to capture the fine-grained joint correctness that realistic prompts require. Method: The AVGen-Bench benchmark is built with high-quality prompts across 11 real-world categories, and a multi-granular evaluation framework combines lightweight specialist models with multimodal large language models (MLLMs), covering everything from perceptual quality to fine-grained semantic controllability. Result: Current strong T2AV models deliver good audio-visual aesthetics but show clear weaknesses in semantic reliability: failed text rendering, incoherent speech, physical reasoning errors, and a universal breakdown in musical pitch control. Conclusion: AVGen-Bench provides an evaluation standard closer to real-world needs for T2AV generation and exposes the key gap between aesthetics and semantic controllability, pointing future work toward joint semantic consistency. Abstract: Text-to-Audio-Video (T2AV) generation is rapidly becoming a core interface for media creation, yet its evaluation remains fragmented. Existing benchmarks largely assess audio and video in isolation or rely on coarse embedding similarity, failing to capture the fine-grained joint correctness required by realistic prompts. We introduce AVGen-Bench, a task-driven benchmark for T2AV generation featuring high-quality prompts across 11 real-world categories. To support comprehensive assessment, we propose a multi-granular evaluation framework that combines lightweight specialist models with Multimodal Large Language Models (MLLMs), enabling evaluation from perceptual quality to fine-grained semantic controllability. Our evaluation reveals a pronounced gap between strong audio-visual aesthetics and weak semantic reliability, including persistent failures in text rendering, speech coherence, physical reasoning, and a universal breakdown in musical pitch control. Code and benchmark resources are available at http://aka.ms/avgenbench.

[140] DiffVC: A Non-autoregressive Framework Based on Diffusion Model for Video Captioning

Junbo Wang,Liangyu Fu,Yuke Li,Yining Zhu,Ya Jing,Xuecheng Wu,Jiangbin Zheng

Main category: cs.CV

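TL;DR: This paper proposes DiffVC, a non-autoregressive video captioning framework based on a diffusion model: parallel decoding speeds up generation and reduces cumulative error, and a discriminative conditional diffusion model strengthens multimodal interaction, improving caption quality while keeping generation efficient.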

Details Motivation: Existing autoregressive video captioning methods are slow and accumulate errors, while non-autoregressive methods produce lower-quality captions because they model multimodal interaction insufficiently. Method: A non-autoregressive framework based on a discriminative conditional diffusion model: the video is first encoded into a visual representation; during training, Gaussian noise is added to the representation of the ground-truth text and a discriminative denoiser conditioned on the visual representation recovers the textual representation; this is then fed to a non-autoregressive language model to generate the caption; at inference, noise is sampled directly from a Gaussian distribution. Result: On MSVD, MSR-VTT, and VATEX the method outperforms previous non-autoregressive methods, with gains of up to 9.9 CIDEr and 2.6 B@4, and matches autoregressive methods while generating faster. Conclusion: DiffVC balances generation efficiency and quality, supporting the effectiveness and potential of diffusion models for non-autoregressive video captioning. Abstract: Current video captioning methods usually use an encoder-decoder structure to generate text autoregressively. However, autoregressive methods have inherent limitations such as slow generation speed and large cumulative error. Furthermore, the few non-autoregressive counterparts suffer from deficiencies in generation quality due to the lack of sufficient multimodal interaction modeling. Therefore, we propose a non-autoregressive framework based on Diffusion model for Video Captioning (DiffVC) to address these issues. Its parallel decoding can effectively solve the problems of generation speed and cumulative error. At the same time, our proposed discriminative conditional Diffusion Model can generate higher-quality textual descriptions. Specifically, we first encode the video into a visual representation. During training, Gaussian noise is added to the textual representation of the ground-truth caption. Then, a new textual representation is generated via the discriminative denoiser with the visual representation as a conditional constraint. Finally, we input the new textual representation into a non-autoregressive language model to generate captions. During inference, we directly sample noise from the Gaussian distribution for generation. Experiments on MSVD, MSR-VTT, and VATEX show that our method can outperform previous non-autoregressive methods and achieve comparable performance to autoregressive methods, e.g., it achieved a maximum improvement of 9.9 on the CIDEr and improvement of 2.6 on the B@4, while having faster generation speed. The source code will be available soon.

[141] Seeing but Not Thinking: Routing Distraction in Multimodal Mixture-of-Experts

Haolei Xu,Haiwen Hong,Hongxing Li,Rui Zhou,Yang Zhang,Longtao Huang,Hui Xue,Yongliang Shen,Weiming Lu,Yueting Zhuang

Main category: cs.CV

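TL;DR: This paper finds that multimodal Mixture-of-Experts (MoE) models exhibit a 'Seeing but Not Thinking' phenomenon: they perceive image content accurately yet perform worse on visual reasoning than on the same tasks in pure text. It proposes the 'Routing Distraction' hypothesis, under which image inputs keep the routing mechanism from activating reasoning-related domain experts, and designs a routing-guided intervention that significantly improves complex visual reasoning across several models and benchmarks.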

Details Motivation: To explain and address the anomaly that multimodal MoE models perform markedly worse on visual reasoning tasks than on pure-text tasks (Seeing but Not Thinking). Method: Systematic analysis verifies that cross-modal semantic sharing exists; it reveals layer-wise separation between visual experts and domain experts and routing divergence induced by image inputs; the Routing Distraction hypothesis is proposed; and a routing-guided intervention is designed to strengthen domain expert activation. Result: Experiments on three multimodal MoE models across six benchmarks show gains of up to 3.17% on complex visual reasoning tasks; domain expert identification is found to locate general cognitive functions, supporting transfer across tasks. Conclusion: The root cause of 'Seeing but Not Thinking' is that image inputs disturb the MoE expert routing mechanism, so the domain experts responsible for reasoning are not fully engaged; explicitly guiding the routing mitigates the problem, and domain experts generalize at the level of cognitive function. Abstract: Multimodal Mixture-of-Experts (MoE) models have achieved remarkable performance on vision-language tasks. However, we identify a puzzling phenomenon termed Seeing but Not Thinking: models accurately perceive image content yet fail in subsequent reasoning, while correctly solving identical problems presented as pure text. Through systematic analysis, we first verify that cross-modal semantic sharing exists in MoE architectures, ruling out semantic alignment failure as the sole explanation. We then reveal that visual experts and domain experts exhibit layer-wise separation, with image inputs inducing significant routing divergence from text inputs in middle layers where domain experts concentrate. Based on these findings, we propose the Routing Distraction hypothesis: when processing visual inputs, the routing mechanism fails to adequately activate task-relevant reasoning experts. To validate this hypothesis, we design a routing-guided intervention method that enhances domain expert activation. Experiments on three multimodal MoE models across six benchmarks demonstrate consistent improvements, with gains of up to 3.17% on complex visual reasoning tasks. Our analysis further reveals that domain expert identification locates cognitive functions rather than sample-specific solutions, enabling effective transfer across tasks with different information structures.
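One way to quantify the "routing divergence" between text and image inputs at a given layer is to compare the two expert-usage distributions. The abstract does not say which measure the authors use, so the Jensen-Shannon divergence below is an assumption, and the inputs are hypothetical per-layer expert selection frequencies.

```python
import math

def js_divergence(p, q):
    # Jensen-Shannon divergence (in nats) between two expert-usage
    # distributions over the same set of experts. Symmetric, and bounded
    # above by log(2).
    mix = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)
```

Computed per layer, a value near zero would indicate that text and image versions of a problem are routed to the same experts, while a value approaching `log(2)` would indicate almost disjoint expert usage, which is the pattern the abstract reports for middle layers.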

[142] Coordinate-Based Dual-Constrained Autoregressive Motion Generation

Kang Ding,Hongsong Wang,Jie Gui,Liang Wang

Main category: cs.CV

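TL;DR: This paper proposes CDAMD, a coordinate-based dual-constrained autoregressive motion generation framework that combines the strengths of autoregressive and diffusion models to address error amplification and mode collapse, reaching state-of-the-art performance on text-to-motion generation and editing.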

Details Motivation: Diffusion models suffer from error amplification during noise prediction, autoregressive models exhibit mode collapse due to motion discretization, and work on coordinate-based motion synthesis is limited. Method: The Coordinate-based Dual-constrained Autoregressive Motion Generation (CDAMD) framework takes motion coordinates as input and follows the autoregressive paradigm; diffusion-inspired multi-layer perceptrons improve motion fidelity; a Dual-Constrained Causal Mask treats motion tokens as priors and concatenates them with textual encodings to guide generation. Result: On newly established benchmarks for text-to-motion generation and motion editing, CDAMD reaches state-of-the-art performance in both motion fidelity and semantic consistency. Conclusion: By combining an autoregressive structure with diffusion ideas and a dual-constraint mechanism, CDAMD achieves high-fidelity, semantically faithful text-driven motion generation and advances coordinate-level motion modeling. Abstract: Text-to-motion generation has attracted increasing attention in the research community recently, with potential applications in animation, virtual reality, robotics, and human-computer interaction. Diffusion and autoregressive models are two popular and parallel research directions for text-to-motion generation. However, diffusion models often suffer from error amplification during noise prediction, while autoregressive models exhibit mode collapse due to motion discretization. To address these limitations, we propose a flexible, high-fidelity, and semantically faithful text-to-motion framework, named Coordinate-based Dual-constrained Autoregressive Motion Generation (CDAMD). With motion coordinates as input, CDAMD follows the autoregressive paradigm and leverages diffusion-inspired multi-layer perceptrons to enhance the fidelity of predicted motions. Furthermore, a Dual-Constrained Causal Mask is introduced to guide autoregressive generation, where motion tokens act as priors and are concatenated with textual encodings. Since there is limited work on coordinate-based motion synthesis, we establish new benchmarks for both text-to-motion generation and motion editing. Experimental results demonstrate that our approach achieves state-of-the-art performance in terms of both fidelity and semantic consistency on these benchmarks.

[143] EPIR: An Efficient Patch Tokenization, Integration and Representation Framework for Micro-expression Recognition

Junbo Wang,Liangyu Fu,Yuke Li,Yining Zhu,Xuecheng Wu,Kun Hu

Main category: cs.CV

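TL;DR: This paper proposes EPIR, an efficient micro-expression recognition framework that uses dual norm shifted patch tokenization (DNSPT), token integration, and a discriminative token extractor to keep recognition performance high while substantially reducing computational complexity.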

Details Motivation: Existing Transformer-based micro-expression recognition methods have high computational complexity and, because the available datasets are small, struggle to learn effective micro-expression representations. Method: The EPIR framework comprises: 1) a dual norm shifted patch tokenization (DNSPT) module that models spatial relationships between pixels in the face region; 2) a token integration module that reduces the number of tokens without information loss; 3) a discriminative token extractor (with an improved attention mechanism and a dynamic token selection module, DTSM) that captures more discriminative micro-expression features. Result: Significant gains over state-of-the-art methods on four datasets (CASME II, SAMM, SMIC, and CAS(ME)$^3$), e.g. a 9.6% UF1 improvement on CAS(ME)$^3$ and a 4.58% UAR improvement on SMIC. Conclusion: EPIR reduces computational cost while keeping recognition accuracy high, offering a new option for micro-expression recognition in resource-constrained settings. Abstract: Micro-expression recognition can obtain the real emotion of the individual at the current moment. Although deep learning-based methods, especially Transformer-based methods, have achieved impressive results, these methods have high computational complexity due to the large number of tokens in the multi-head self-attention. In addition, the existing micro-expression datasets are small-scale, which makes it difficult for Transformer-based models to learn effective micro-expression representations. Therefore, we propose a novel Efficient Patch tokenization, Integration and Representation framework (EPIR), which can balance high recognition performance and low computational complexity. Specifically, we first propose a dual norm shifted tokenization (DNSPT) module to learn the spatial relationship between neighboring pixels in the face region, which is implemented by a refined spatial transformation and dual norm projection. Then, we propose a token integration module to integrate partial tokens among multiple cascaded Transformer blocks, thereby reducing the number of tokens without information loss. Furthermore, we design a discriminative token extractor, which first improves the attention in the Transformer block to reduce the unnecessary focus of the attention calculation on self-tokens, and uses the dynamic token selection module (DTSM) to select key tokens, thereby capturing more discriminative micro-expression representations. We conduct extensive experiments on four popular public datasets (i.e., CASME II, SAMM, SMIC, and CAS(ME)$^3$). The experimental results show that our method achieves significant performance gains over the state-of-the-art methods, such as 9.6% improvement on the CAS(ME)$^3$ dataset in terms of UF1 and 4.58% improvement on the SMIC dataset in terms of UAR metric.

[144] OV-Stitcher: A Global Context-Aware Framework for Training-Free Open-Vocabulary Semantic Segmentation

Seungjae Moon,Seunghyun Oh,Youngmin Ro

Main category: cs.CV

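TL;DR: This paper proposes OV-Stitcher, a training-free open-vocabulary semantic segmentation framework that stitches sub-image features directly in the final encoder block to enable global attention, improving context aggregation and segmentation consistency.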

Details Motivation: Existing training-free open-vocabulary semantic segmentation methods are limited by the input resolution of pretrained encoders and rely on a sliding-window strategy, which fragments features and prevents global context modeling. Method: In the final encoder block of a pretrained vision-language model, OV-Stitcher stitches the sub-image features extracted by the sliding window at the feature level and reconstructs their attention representations, restoring global attention. Result: A notable mIoU improvement across eight benchmarks (from 48.7 to 50.7), showing stronger spatial consistency and semantic alignment while remaining training-free. Conclusion: OV-Stitcher offers a scalable and effective solution for training-free open-vocabulary segmentation, easing the trade-off between resolution and global modeling. Abstract: Training-free open-vocabulary semantic segmentation (TF-OVSS) has recently attracted attention for its ability to perform dense prediction by leveraging the pretrained knowledge of large vision and vision-language models, without requiring additional training. However, due to the limited input resolution of these pretrained encoders, existing TF-OVSS methods commonly adopt a sliding-window strategy that processes cropped sub-images independently. While effective for managing high-resolution inputs, this approach prevents global attention over the full image, leading to fragmented feature representations and limited contextual reasoning. We propose OV-Stitcher, a training-free framework that addresses this limitation by stitching fragmented sub-image features directly within the final encoder block. By reconstructing attention representations from fragmented sub-image features, OV-Stitcher enables global attention within the final encoder block, producing coherent context aggregation and spatially consistent, semantically aligned segmentation maps. Extensive evaluations across eight benchmarks demonstrate that OV-Stitcher establishes a scalable and effective solution for open-vocabulary segmentation, achieving a notable improvement in mean Intersection over Union (mIoU) from 48.7 to 50.7 compared with prior training-free baselines.

[145] Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator

Luozheng Qin,Jia Gong,Qian Qiao,Tianjiao Li,Li Xu,Haoyu Pan,Chao Qu,Zhiyu Tan,Hao Li

Main category: cs.CV

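TL;DR: This paper proposes Uni-ViGU, a framework that builds on a video generation model and unifies video generation and understanding through unified continuous/discrete flow matching, a modality-driven MoE structure, and a bidirectional training mechanism, achieving competitive performance on both kinds of task.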

Details Motivation: Visual generation, especially for video, is far more computationally expensive than understanding, so conventional understanding-centric multimodal large models are hard to extend efficiently; a generation-centric unified architecture is therefore needed. Method: Uni-ViGU comprises: 1) a unified flow method (continuous flow matching for video, discrete flow matching for text); 2) a modality-driven MoE-augmented Transformer that adds lightweight layers for text generation while preserving video generation priors; 3) a bidirectional training mechanism with Knowledge Recall (reconstructing input prompts) and Capability Refinement (fine-tuning on detailed captions). Result: Competitive performance on both video generation and video understanding, supporting the effectiveness and scalability of the generation-centric paradigm. Conclusion: Starting from a generative model and using unified modeling with bidirectional knowledge transfer yields an efficient and scalable unified multimodal framework; generation-centric design is a viable path toward unified multimodal intelligence. Abstract: Unified multimodal models integrating visual understanding and generation face a fundamental challenge: visual generation incurs substantially higher computational costs than understanding, particularly for video. This imbalance motivates us to invert the conventional paradigm: rather than extending understanding-centric MLLMs to support generation, we propose Uni-ViGU, a framework that unifies video generation and understanding by extending a video generator as the foundation. We introduce a unified flow method that performs continuous flow matching for video and discrete flow matching for text within a single process, enabling coherent multimodal generation. We further propose a modality-driven MoE-based framework that augments Transformer blocks with lightweight layers for text generation while preserving generative priors. To repurpose generation knowledge for understanding, we design a bidirectional training mechanism with two stages: Knowledge Recall reconstructs input prompts to leverage learned text-video correspondences, while Capability Refinement fine-tunes on detailed captions to establish discriminative shared representations. Experiments demonstrate that Uni-ViGU achieves competitive performance on both video generation and understanding, validating generation-centric architectures as a scalable path toward unified multimodal intelligence. Project Page and Code: https://fr0zencrane.github.io/uni-vigu-page/.

[146] PolySLGen: Online Multimodal Speaking-Listening Reaction Generation in Polyadic Interaction

Zhi-Yi Lin,Thomas Markhorst,Jouh Yeong Chew,Xucong Zhang

Main category: cs.CV

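TL;DR: This paper proposes PolySLGen, a framework that generates multimodal reactions (speech, body motion, speaking state) for a target individual in multi-participant (polyadic) settings; a pose fusion module and a social cue encoder model group interaction, significantly improving the contextual appropriateness, temporal coherence, and realism of the reactions.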

Details Motivation: Existing methods are limited to single-modality or speaking-only responses in dyadic interactions and overlook nonverbal cues and the complex dynamics of multi-party interaction, making them ill-suited to realistic social scenarios. Method: PolySLGen is an online framework with a pose fusion module and a social cue encoder that jointly aggregate the group's motion and social signals to generate the target participant's speech, body motion, and speaking state score. Result: Experiments show that PolySLGen outperforms several adapted and state-of-the-art baselines in motion quality, motion-speech alignment, speaking state prediction, and human-perceived realism. Conclusion: PolySLGen models multi-party multimodal interaction effectively and offers a new approach for embodied AI to generate human-like reactions in natural group interactions. Abstract: Human-like multimodal reaction generation is essential for natural group interactions between humans and embodied AI. However, existing approaches are limited to single-modality or speaking-only responses in dyadic interactions, making them unsuitable for realistic social scenarios. Many also overlook nonverbal cues and complex dynamics of polyadic interactions, both critical for engagement and conversational coherence. In this work, we present PolySLGen, an online framework for Polyadic multimodal Speaking and Listening reaction Generation. Given past conversation and motion from all participants, PolySLGen generates a future speaking or listening reaction for a target participant, including speech, body motion, and speaking state score. To model group interactions effectively, we propose a pose fusion module and a social cue encoder that jointly aggregate motion and social signals from the group. Extensive experiments, along with quantitative and qualitative evaluations, show that PolySLGen produces contextually appropriate and temporally coherent multi-modal reactions, outperforming several adapted and state-of-the-art baselines in motion quality, motion-speech alignment, speaking state prediction, and human-perceived realism.

[147] Bag of Bags: Adaptive Visual Vocabularies for Genizah Join Image Retrieval

Sharva Gogawale,Gal Grudka,Daria Vasyutinsky-Shapira,Omer Ventura,Berat Kurar-Barakat,Nachum Dershowitz

Main category: cs.CV

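TL;DR: This paper proposes Bag of Bags (BoB), an image-level representation for matching and retrieving manuscript fragments; using local visual vocabularies and set-to-set distances, it outperforms the classical Bag of Words approach on Cairo Genizah data.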

Details Motivation: Addresses the retrieval of manuscript fragments that belong to the same original manuscript: given an image of one fragment, retrieve the other fragments from the same source. Method: Proposes the Bag of Bags (BoB) representation: a sparse convolutional autoencoder is trained on binarized fragment patches, connected components on each page are encoded, the embeddings are clustered per image with k-means, and images are compared by set-to-set distances between their local vocabularies; a mass-weighted BoB-OT variant is also introduced together with an approximation guarantee; a two-stage strategy (BoW shortlist followed by BoB-OT reranking) balances accuracy and efficiency. Result: On Cairo Genizah data, the best BoB variant (Chamfer) reaches Hit@1 = 0.78 and MRR = 0.84, versus 0.74 and 0.80 for the strongest BoW baseline, a 6.1% relative improvement in top-1 accuracy; BoB-OT comes with a formal approximation guarantee; the two-stage pipeline balances effectiveness and scalability. Conclusion: BoB clearly outperforms classical BoW on manuscript fragment retrieval, particularly in modeling fine-grained local structure and in set matching, and the combined strategy makes practical deployment plausible. Abstract: A join is a set of manuscript fragments identified as originally emanating from the same manuscript. We study manuscript join retrieval: Given a query image of a fragment, retrieve other fragments originating from the same physical manuscript. We propose Bag of Bags (BoB), an image-level representation that replaces the global-level visual codebook of classical Bag of Words (BoW) with a fragment-specific vocabulary of local visual words. Our pipeline trains a sparse convolutional autoencoder on binarized fragment patches, encodes connected components from each page, clusters the resulting embeddings with per-image $k$-means, and compares images using set-to-set distances between their local vocabularies. Evaluated on fragments from the Cairo Genizah, the best BoB variant (viz. Chamfer) achieves Hit@1 of 0.78 and MRR of 0.84, compared to 0.74 and 0.80, respectively, for the strongest BoW baseline (BoW-RawPatches-$\chi^2$), a 6.1% relative improvement in top-1 accuracy. We furthermore study a mass-weighted BoB-OT variant that incorporates cluster population into prototype matching and present a formal approximation guarantee bounding its deviation from full component-level optimal transport. A two-stage pipeline using a BoW shortlist followed by BoB-OT reranking provides a practical compromise between retrieval strength and computational cost, supporting applicability to larger manuscript collections.
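
The set-to-set comparison described above can be illustrated with a small sketch. It assumes a symmetric Chamfer distance that averages nearest-neighbor Euclidean distances in both directions; the abstract does not spell out the exact normalization, and the function names and toy 2-D centroids below are hypothetical.

```python
import math

def chamfer_distance(vocab_a, vocab_b):
    """Symmetric Chamfer distance between two sets of prototype vectors.

    Assumption: mean nearest-neighbor Euclidean distance, averaged over
    both directions. The paper may normalize differently.
    """
    def nearest(point, vocab):
        return min(math.dist(point, other) for other in vocab)

    a_to_b = sum(nearest(p, vocab_b) for p in vocab_a) / len(vocab_a)
    b_to_a = sum(nearest(q, vocab_a) for q in vocab_b) / len(vocab_b)
    return 0.5 * (a_to_b + b_to_a)

def rank_by_chamfer(query_vocab, gallery):
    """Return gallery ids sorted from most to least similar to the query."""
    return sorted(gallery, key=lambda gid: chamfer_distance(query_vocab, gallery[gid]))

# Toy per-image vocabularies (hypothetical k-means centroids in 2-D).
query = [(0.0, 0.0), (1.0, 0.0)]
gallery = {
    "frag_A": [(0.1, 0.0), (0.9, 0.0)],   # close to the query vocabulary
    "frag_B": [(5.0, 5.0), (6.0, 5.0)],   # far from the query vocabulary
}
ranking = rank_by_chamfer(query, gallery)
```

In this toy setup `frag_A` ranks first because its centroids sit next to the query's; in a real pipeline each vocabulary would come from per-image clustering of learned embeddings.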

[148] Face-D²CL: Multi-Domain Synergistic Representation with Dual Continual Learning for Facial DeepFake Detection

Yushuo Zhang,Yu Cheng,Yongkang Hu,Jiuan Zhou,Jiawei Chen,Yuan Xie,Zhaoxia Yin

Main category: cs.CV

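TL;DR: This paper proposes Face-D²CL, a framework that uses multi-domain synergistic representation and a dual continual learning mechanism (EWC + OGC) to mitigate insufficient feature representation and catastrophic forgetting without replaying historical data, markedly improving the stability and adaptability of DeepFake detectors in continual learning settings.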

Details Motivation: Facial forgery techniques are advancing rapidly and pose a serious threat to public trust and information security; existing continual learning methods for facial DeepFake detection face two bottlenecks, insufficient feature representation and catastrophic forgetting. Method: Proposes the Face-D²CL framework: 1) multi-domain synergistic representation, which fuses spatial-domain and frequency-domain features to capture forgery traces comprehensively; 2) a dual continual learning mechanism, which combines Elastic Weight Consolidation (EWC) that distinguishes parameter importance for real versus fake samples with an Orthogonal Gradient Constraint (OGC) that keeps updates to task-specific adapters from interfering with earlier knowledge. Result: Surpasses current SOTA methods in both stability and plasticity: a 60.7% relative reduction in average detection error rate, and a 7.9% improvement in average detection AUC on unseen forgery domains. Conclusion: Face-D²CL achieves a dynamic balance between resistance to forgetting and adaptability to new forgery paradigms, offering an efficient and practical approach to continual DeepFake detection without data replay. Abstract: The rapid advancement of facial forgery techniques poses severe threats to public trust and information security, making facial DeepFake detection a critical research priority. Continual learning provides an effective approach to adapt facial DeepFake detection models to evolving forgery patterns. However, existing methods face two key bottlenecks in real-world continual learning scenarios: insufficient feature representation and catastrophic forgetting. To address these issues, we propose Face-D²CL, a framework for facial DeepFake detection. It leverages multi-domain synergistic representation to fuse spatial and frequency-domain features for the comprehensive capture of diverse forgery traces, and employs a dual continual learning mechanism that combines Elastic Weight Consolidation (EWC), which distinguishes parameter importance for real versus fake samples, and Orthogonal Gradient Constraint (OGC), which ensures updates to task-specific adapters do not interfere with previously learned knowledge. This synergy enables the model to achieve a dynamic balance between robust anti-forgetting capabilities and agile adaptability to emerging facial forgery paradigms, all without relying on historical data replay. Extensive experiments demonstrate that our method surpasses current SOTA approaches in both stability and plasticity, achieving a 60.7% relative reduction in average detection error rate. On unseen forgery domains, it further improves the average detection AUC by 7.9% compared to the current SOTA method.
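
To make the two continual-learning ingredients concrete, here is a minimal sketch of a standard EWC quadratic penalty and a projection that removes gradient components along protected directions. It is not the authors' implementation: the real-versus-fake importance split is not reproduced, the protected basis is assumed to be orthonormal, and all names and values are hypothetical.

```python
def ewc_penalty(params, anchor_params, fisher, lam=1.0):
    """Standard EWC term: (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2."""
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher)
    )

def project_orthogonal(grad, basis):
    """Remove from grad its components along each vector of an orthonormal basis.

    The result is orthogonal to every protected direction, so a step along it
    leaves those directions unchanged to first order.
    """
    out = list(grad)
    for u in basis:
        coeff = sum(g * ui for g, ui in zip(out, u))
        out = [g - coeff * ui for g, ui in zip(out, u)]
    return out

# Toy values (hypothetical): two parameters, one protected direction.
penalty = ewc_penalty([1.0, 2.0], [0.0, 2.0], fisher=[4.0, 1.0], lam=0.5)
safe_grad = project_orthogonal([3.0, 4.0], basis=[[1.0, 0.0]])
```

The penalty discourages drift in parameters with high Fisher importance, while the projection keeps an adapter update out of directions that earlier tasks rely on.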

[149] T-Gated Adapter: A Lightweight Temporal Adapter for Vision-Language Medical Segmentation

Pranjal Khadka

Main category: cs.CV

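TL;DR: This paper proposes a temporal adapter that injects adjacent-slice context into a vision-language model (VLM), substantially improving 3D medical image segmentation, especially in few-shot, zero-shot, and cross-modality settings.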

Details Motivation: Conventional fully supervised 3D segmentation depends on large amounts of expensive voxel-level annotation; applying existing VLMs directly to 2D slices yields poor anatomical continuity and noisy segmentations. Method: Designs a temporal adapter comprising: 1) a temporal Transformer that models adjacent slices within a fixed window at the token level; 2) a spatial context block that refines within-slice representations; 3) an adaptive gating mechanism that fuses temporal and single-slice features. Result: On FLARE22 (30 labeled volumes) the method reaches a mean Dice of 0.704 (+0.206 over the baseline); zero-shot transfer to BTCV and AMOS22 improves by +0.210 and +0.230, respectively; in the cross-modality setting (trained on CT, evaluated on AMOS22 MRI) Dice reaches 0.366, surpassing a DynUNet trained only on CT (0.224). Conclusion: The temporal adapter closes the context gap of VLMs in 3D medical image segmentation and improves generalization, robustness, and cross-modality transfer, offering a new paradigm for medical segmentation with low annotation cost and high adaptability. Abstract: Medical image segmentation traditionally relies on fully supervised 3D architectures that demand a large amount of dense, voxel-level annotations from clinical experts, which is a prohibitively expensive process. Vision Language Models (VLMs) offer a powerful alternative by leveraging broad visual semantic representations learned from billions of images. However, when applied independently to 2D slices of a 3D scan, these models often produce noisy and anatomically implausible segmentations that violate the inherent continuity of anatomical structures. We propose a temporal adapter that addresses this by injecting adjacent-slice context directly into the model's visual token representations. The adapter comprises a temporal transformer attending across a fixed context window at the token level, a spatial context block refining within-slice representations, and an adaptive gate balancing temporal and single-slice features. Training on 30 labeled volumes from the FLARE22 dataset, our method achieves a mean Dice of 0.704 across 13 abdominal organs with a gain of +0.206 over the baseline VLM trained with no temporal context. Zero-shot evaluation on BTCV and AMOS22 datasets yields consistent improvements of +0.210 and +0.230, with the average cross-domain performance drop reducing from 38.0% to 24.9%. Furthermore, in a cross-modality evaluation on AMOS22 MRI with neither model receiving any MRI supervision, our method achieves a mean Dice of 0.366, outperforming a fully supervised 3D baseline (DynUNet, 0.224) trained exclusively on CT, suggesting that CLIP's visual semantic representations generalize more gracefully across imaging modalities than convolutional features.
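
The adaptive gate mentioned above can be pictured with a small sketch. It assumes a single scalar sigmoid gate per token computed from the concatenated single-slice and temporal features; the abstract does not give the gate's actual form, so the function name, weights, and shapes here are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(single_slice, temporal, gate_weights, gate_bias=0.0):
    """Blend single-slice and temporal token features with a learned gate.

    Assumption: one scalar gate g = sigmoid(w . [single; temporal] + b),
    and fused = g * temporal + (1 - g) * single.
    """
    concat = list(single_slice) + list(temporal)
    gate = sigmoid(sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias)
    fused = [gate * t + (1.0 - gate) * s for s, t in zip(single_slice, temporal)]
    return fused, gate

# With zero gate weights and zero bias the gate is 0.5, a plain average.
fused, gate = gated_fusion([0.0, 2.0], [2.0, 4.0], gate_weights=[0.0] * 4)
```

A gate near 0 falls back to the single-slice features, which is one plausible way such a design could avoid degrading the frozen backbone's per-slice behavior when neighboring slices are uninformative.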

[150] OceanMAE: A Foundation Model for Ocean Remote Sensing

Viola-Joanna Stamer,Panagiotis Agrafiotis,Behnood Rasti,Begüm Demir

Main category: cs.CV

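TL;DR: This paper proposes OceanMAE, a masked autoencoder for ocean remote sensing that is pre-trained in a self-supervised manner on multispectral Sentinel-2 imagery combined with physically meaningful ocean descriptors, improving downstream marine segmentation and bathymetry estimation.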

Details Motivation: Ocean remote sensing is held back by scarce labeled data and by the weak transferability of mainstream models pre-trained on land-dominated imagery, so an ocean-specific, physically informed self-supervised pre-training method is needed. Method: Proposes OceanMAE, which extends the standard MAE framework with masked-reconstruction pre-training on multispectral Sentinel-2 data plus physical ocean descriptors (such as chlorophyll concentration and suspended matter) as auxiliary features; downstream, a modified UNet handles marine pollutant segmentation and bathymetry regression. Result: Clearly improves marine pollutant and debris segmentation on MADOS and MARIDA; bathymetry estimation on MagicBathyNet is competitive and task-dependent; an ablation shows that adding ocean descriptors improves segmentation quality. Conclusion: Physically informed, domain-aligned self-supervised pre-training effectively improves ocean remote sensing tasks, and OceanMAE offers a better-suited foundation model paradigm for ocean RS. Abstract: Accurate ocean mapping is essential for applications such as bathymetry estimation, seabed characterization, marine litter detection, and ecosystem monitoring. However, ocean remote sensing (RS) remains constrained by limited labeled data and by the reduced transferability of models pre-trained mainly on land-dominated Earth observation imagery. In this paper, we propose OceanMAE, an ocean-specific masked autoencoder that extends standard MAE pre-training by integrating multispectral Sentinel-2 observations with physically meaningful ocean descriptors during self-supervised learning. By incorporating these auxiliary ocean features, OceanMAE is designed to learn more informative and ocean-aware latent representations from large-scale unlabeled data. To transfer these representations to downstream applications, we further employ a modified UNet-based framework for marine segmentation and bathymetry estimation. Pre-trained on the Hydro dataset, OceanMAE is evaluated on MADOS and MARIDA for marine pollutant and debris segmentation, and on MagicBathyNet for bathymetry regression. The experiments show that OceanMAE yields the strongest gains on marine segmentation, while bathymetry benefits are competitive and task-dependent. In addition, an ablation against a standard MAE on MARIDA indicates that incorporating auxiliary ocean descriptors during pre-training improves downstream segmentation quality. These findings highlight the value of physically informed and domain-aligned self-supervised pre-training for ocean RS. Code and weights are publicly available at https://git.tu-berlin.de/joanna.stamer/SSLORS2.

[151] On the Global Photometric Alignment for Low-Level Vision

Mingjia Li,Tianle Du,Hainuo Wang,Qiming Hu,Xiaojie Guo

Main category: cs.CV

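TL;DR: This paper proposes Photometric Alignment Loss (PAL), which discounts irrelevant photometric differences through closed-form affine color alignment while preserving restoration-relevant supervision, addressing the optimization pathology caused by photometric inconsistency in paired training data for low-level vision tasks.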

Details Motivation: Supervised low-level vision models rely on pixel-wise losses, but paired training sets exhibit per-pair photometric inconsistency (differences in brightness, color, or white balance) that interferes with optimizing content restoration. Method: Using a least-squares decomposition, the paper proves that the photometric and structural components of the prediction-target residual are orthogonal and shows that the photometric component dominates the gradient energy; on this basis it proposes the PAL loss, which removes photometric interference from the supervision via a closed-form affine color alignment that needs only covariance statistics and a tiny matrix inversion. Result: Across 6 low-level vision tasks, 16 datasets, and 16 network architectures, PAL consistently improves metrics and generalization. Conclusion: PAL is a lightweight, general, and effective supervision loss that substantially mitigates the optimization problems caused by photometric inconsistency and improves the training stability and results of low-level vision models. Abstract: Supervised low-level vision models rely on pixel-wise losses against paired references, yet paired training sets exhibit per-pair photometric inconsistency; that is, different image pairs demand different global brightness, color, or white-balance mappings. This inconsistency enters through task-intrinsic photometric transfer (e.g., low-light enhancement) or unintended acquisition shifts (e.g., de-raining), and in either case causes an optimization pathology. Standard reconstruction losses allocate disproportionate gradient budget to conflicting per-pair photometric targets, crowding out content restoration. In this paper, we investigate this issue and prove that, under least-squares decomposition, the photometric and structural components of the prediction-target residual are orthogonal, and that the spatially dense photometric component dominates the gradient energy. Motivated by this analysis, we propose Photometric Alignment Loss (PAL). This flexible supervision objective discounts nuisance photometric discrepancy via closed-form affine color alignment while preserving restoration-relevant supervision, requiring only covariance statistics and tiny matrix inversion with negligible overhead. Across 6 tasks, 16 datasets, and 16 architectures, PAL consistently improves metrics and generalization. The implementation is in the appendix.
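
A closed-form affine color alignment of the kind the abstract describes can be sketched with NumPy. This is an illustration under stated assumptions, not the paper's implementation (which the authors place in their appendix): it fits a 3x3 matrix and a bias that map predicted RGB pixels onto the target in the least-squares sense, using only means and covariances, and then measures an L1 residual after alignment. The direction of alignment (prediction to target), the small ridge term `eps`, and all function names are assumptions.

```python
import numpy as np

def affine_color_align(pred, target, eps=1e-6):
    """Least-squares affine map from pred colors to target colors.

    pred, target: (N, 3) arrays of RGB pixels.
    Closed form: A = Cov(target, pred) @ inv(Cov(pred, pred) + eps * I),
                 b = mean(target) - A @ mean(pred).
    """
    mu_p = pred.mean(axis=0)
    mu_t = target.mean(axis=0)
    pc = pred - mu_p
    tc = target - mu_t
    cov_pp = pc.T @ pc / len(pred)    # 3x3 covariance of the prediction
    cov_tp = tc.T @ pc / len(pred)    # 3x3 cross-covariance target/prediction
    A = cov_tp @ np.linalg.inv(cov_pp + eps * np.eye(3))
    b = mu_t - A @ mu_p
    return pred @ A.T + b, A, b

def aligned_l1(pred, target):
    """L1 residual measured after discounting the global affine color shift."""
    aligned, _, _ = affine_color_align(pred, target)
    return float(np.abs(aligned - target).mean())

# Toy pair: the target is the prediction under a pure gain/bias color shift.
rng = np.random.default_rng(0)
pred = rng.random((500, 3))
gain = np.diag([1.2, 0.8, 1.1])
target = pred @ gain.T + np.array([0.05, -0.02, 0.0])

plain = float(np.abs(pred - target).mean())   # penalizes the color shift
discounted = aligned_l1(pred, target)         # close to zero after alignment
```

In the toy pair the plain L1 loss is entirely photometric, and it nearly vanishes once the affine shift is discounted, which is the behavior the abstract attributes to removing nuisance photometric discrepancy.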

[152] MedVR: Annotation-Free Medical Visual Reasoning via Agentic Reinforcement Learning

Zheng Jiang,Heng Guo,Chengyu Fang,Changchen Xiao,Xinyang Hu,Lifeng Sun,Minfeng Xu

Main category: cs.CV

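TL;DR: This paper proposes MedVR, an annotation-free reinforcement learning framework that strengthens the visual reasoning of medical vision-language models (VLMs); with Entropy-guided Visual Regrounding (EVR) and Consensus-based Credit Assignment (CCA), it reaches SOTA performance on multiple medical visual question answering benchmarks.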

Details Motivation: Existing medical VLMs are constrained by a text-only reasoning paradigm and struggle to use visual evidence, which leads to weak fine-grained visual analysis and visual hallucination and undermines clinical safety and reliability. Method: Proposes the MedVR reinforcement learning framework with two core mechanisms: Entropy-guided Visual Regrounding (EVR), which uses model uncertainty to direct visual exploration, and Consensus-based Credit Assignment (CCA), which distills pseudo-supervision from agreement among rollouts; no human annotation of intermediate steps is required. Result: Significantly outperforms existing methods on multiple public medical VQA benchmarks, reaching SOTA performance while improving robustness and reasoning transparency. Conclusion: MedVR achieves annotation-free, vision-grounded reasoning and opens a path toward safe, trustworthy deployment of medical AI in clinical settings. Abstract: Medical Vision-Language Models (VLMs) hold immense promise for complex clinical tasks, but their reasoning capabilities are often constrained by text-only paradigms that fail to ground inferences in visual evidence. This limitation not only curtails performance on tasks requiring fine-grained visual analysis but also introduces risks of visual hallucination in safety-critical applications. Thus, we introduce MedVR, a novel reinforcement learning framework that enables annotation-free visual reasoning for medical VLMs. Its core innovation lies in two synergistic mechanisms: Entropy-guided Visual Regrounding (EVR) uses model uncertainty to direct exploration, while Consensus-based Credit Assignment (CCA) distills pseudo-supervision from rollout agreement. Without any human annotations for intermediate steps, MedVR achieves state-of-the-art performance on diverse public medical VQA benchmarks, significantly outperforming existing models. By learning to reason directly with visual evidence, MedVR promotes the robustness and transparency essential for accelerating the clinical deployment of medical AI.
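
The two signals named above, uncertainty and rollout agreement, can be illustrated generically. The sketch below computes Shannon entropy of a predictive distribution and derives majority-vote pseudo-rewards from a group of sampled answers. How MedVR actually uses entropy to trigger regrounding or assigns credit is not specified in the abstract, so the function names and the 0/1 reward are hypothetical.

```python
import math
from collections import Counter

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def consensus_rewards(answers):
    """Majority-vote pseudo-label over rollouts, with a 0/1 reward per rollout.

    Assumption: a rollout is rewarded when it agrees with the most common
    answer; ties resolve to the answer seen first.
    """
    majority, _ = Counter(answers).most_common(1)[0]
    return majority, [1.0 if a == majority else 0.0 for a in answers]

# Toy rollouts for one question (hypothetical answers).
pseudo_label, rewards = consensus_rewards(["B", "A", "B", "B"])
uncertain = entropy([0.5, 0.5])    # maximal for two outcomes
confident = entropy([1.0, 0.0])    # zero when the model is certain
```

A high-entropy step is a natural candidate for taking another look at the image, and agreement among rollouts gives a training signal without any human-labeled intermediate steps.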

[153] OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering

Yiduo Jia,Muzhi Zhu,Hao Zhong,Mingyu Liu,Yuling Xi,Hao Chen,Bin Qin,Yongjie Yang,Zhenbo Luo,Chunhua Shen

Main category: cs.CV

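TL;DR: OmniJigsaw is a generic self-supervised reinforcement learning post-training framework for omni-modal (video-audio) models; it reconstructs the chronological order of shuffled audio-visual clips to promote cross-modal understanding and collaborative reasoning.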

Details Motivation: To extend the reinforcement learning post-training paradigm to omni-modal models, improving video-audio understanding and collaborative reasoning together, and to address the insufficient cross-modal fusion that modality shortcuts cause in existing proxy tasks. Method: Proposes the OmniJigsaw framework, built on a temporal reordering proxy task with three strategies: Joint Modality Integration, Sample-level Modality Selection, and Clip-level Modality Masking; a two-stage coarse-to-fine data filtering pipeline adapts it to massive unannotated omni-modal data. Result: Substantially improves video, audio, and collaborative reasoning performance on 15 benchmarks; identifies and verifies a "bi-modal shortcut phenomenon" and shows that clip-level modality masking outperforms sample-level selection. Conclusion: OmniJigsaw is a scalable, efficient paradigm for omni-modal self-supervised learning that strengthens audio-visual cross-modal integration and collaborative reasoning. Abstract: To extend the reinforcement learning post-training paradigm to omni-modal models for concurrently bolstering video-audio understanding and collaborative reasoning, we propose OmniJigsaw, a generic self-supervised framework built upon a temporal reordering proxy task. Centered on the chronological reconstruction of shuffled audio-visual clips, this paradigm strategically orchestrates visual and auditory signals to compel cross-modal integration through three distinct strategies: Joint Modality Integration, Sample-level Modality Selection, and Clip-level Modality Masking. Recognizing that the efficacy of such proxy tasks is fundamentally tied to puzzle quality, we design a two-stage coarse-to-fine data filtering pipeline, which facilitates the efficient adaptation of OmniJigsaw to massive unannotated omni-modal data. Our analysis reveals a "bi-modal shortcut phenomenon" in joint modality integration and demonstrates that fine-grained clip-level modality masking mitigates this issue while outperforming sample-level modality selection. Extensive evaluations on 15 benchmarks show substantial gains in video, audio, and collaborative reasoning, validating OmniJigsaw as a scalable paradigm for self-supervised omni-modal learning.
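
The temporal reordering proxy task can be sketched in a few lines. The example below shuffles clip indices, keeps one randomly chosen modality per clip as a stand-in for clip-level modality masking, and scores a predicted ordering by the fraction of positions that match the chronological solution. The reward definition and all names are assumptions for illustration; the abstract does not specify how OmniJigsaw scores an answer.

```python
import random

def make_puzzle(num_clips, rng):
    """Return a shuffled list: order[i] is the original index of the clip at slot i."""
    order = list(range(num_clips))
    rng.shuffle(order)
    return order

def mask_clip_modalities(order, rng):
    """Keep a single randomly chosen modality for each clip (clip-level masking)."""
    return [(clip, rng.choice(("video", "audio"))) for clip in order]

def solution(order):
    """Slots listed in chronological order of the clips they hold."""
    return sorted(range(len(order)), key=lambda pos: order[pos])

def reorder_reward(predicted, order):
    """Fraction of positions where the predicted slot sequence matches the truth."""
    truth = solution(order)
    hits = sum(p == t for p, t in zip(predicted, truth))
    return hits / len(truth)

rng = random.Random(0)
order = make_puzzle(4, rng)
puzzle = mask_clip_modalities(order, rng)
perfect = reorder_reward(solution(order), order)
```

Because each clip exposes only one modality, a solver cannot order the sequence from video alone or audio alone, which is the intuition behind using masking to discourage single-modality shortcuts.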

[154] SciFigDetect: A Benchmark for AI-Generated Scientific Figure Detection

You Hu,Chenzhuo Zhao,Changfa Mo,Haotian Liu,Xiaobai Li

Main category: cs.CV

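TL;DR: This paper presents SciFigDetect, the first benchmark dataset for detecting AI-generated scientific figures; built with an agent-based data pipeline, it covers multiple figure types and generation sources, and an evaluation of existing detectors under zero-shot, cross-generator, and degraded-image settings shows clear weaknesses in zero-shot transfer, generalization, and robustness.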

Details Motivation: Modern multimodal generative models can produce scientific figures at near-publishable quality, but existing AI-generated image detection methods target open-domain natural images and lack the ability to detect structured, text-dense, semantically rigorous scientific figures, so a dedicated benchmark is needed. Method: Designs and implements an agent-based data pipeline that automatically retrieves licensed papers, performs multimodal understanding of text and figures, builds structured prompts, synthesizes candidate figures, and produces high-quality real-synthetic pairs through a review-driven iterative filtering loop; the resulting benchmark covers multiple categories and generation sources, and mainstream detectors are systematically evaluated under zero-shot, cross-generator, and degraded-image settings. Result: Existing detectors degrade sharply under zero-shot transfer, depend heavily on specific generators (overfitting), and are very fragile under common post-processing corruptions such as compression and noise, revealing a large gap between current AIGI detection capabilities and the distribution of high-quality scientific figures. Conclusion: SciFigDetect is the first benchmark for detecting AI-generated scientific figures; it exposes the limitations of existing methods and provides a foundation and clear direction for developing more robust, generalizable scientific-figure forensics. Abstract: Modern multimodal generators can now produce scientific figures at near-publishable quality, creating a new challenge for visual forensics and research integrity. Unlike conventional AI-generated natural images, scientific figures are structured, text-dense, and tightly aligned with scholarly semantics, making them a distinct and difficult detection target. However, existing AI-generated image detection benchmarks and methods are almost entirely developed for open-domain imagery, leaving this setting largely unexplored. We present the first benchmark for AI-generated scientific figure detection. To construct it, we develop an agent-based data pipeline that retrieves licensed source papers, performs multimodal understanding of paper text and figures, builds structured prompts, synthesizes candidate figures, and filters them through a review-driven refinement loop. The resulting benchmark covers multiple figure categories, multiple generation sources, and aligned real-synthetic pairs. We benchmark representative detectors under zero-shot, cross-generator, and degraded-image settings. Results show that current methods fail dramatically in zero-shot transfer, exhibit strong generator-specific overfitting, and remain fragile under common post-processing corruptions. These findings reveal a substantial gap between existing AIGI detection capabilities and the emerging distribution of high-quality scientific figures. We hope this benchmark can serve as a foundation for future research on robust and generalizable scientific-figure forensics. The dataset is available at https://github.com/Joyce-yoyo/SciFigDetect.

[155] Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment

Blessing Agyei Kyem,Joshua Kofi Asamoah,Anthony Dontoh,Armstrong Aboah

Main category: cs.CV

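TL;DR: This paper presents the PaveInstruct dataset and the PaveGPT model; domain-specific instruction tuning substantially improves vision-language model performance on pavement condition assessment and yields a unified assessment tool that conforms to engineering standards.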

Details Motivation: General-purpose vision-language models perform poorly in specialized engineering domains such as pavement inspection and struggle to meet requirements for precise terminology, structured reasoning, and engineering standards. Method: Builds PaveInstruct, a dataset of 278,889 image-instruction-response pairs that unifies nine heterogeneous pavement datasets, and trains the domain foundation model PaveGPT on it; performance is compared on perception, understanding, and reasoning tasks. Result: Instruction tuning improves spatial grounding, reasoning, and generation tasks by more than 20% and produces outputs compliant with ASTM D6433; transportation agencies can replace multiple specialized systems with a single conversational tool. Conclusion: Domain instruction tuning can equip vision-language models for specialized infrastructure assessment and offers a general route to AI systems for bridges, railways, buildings, and similar domains. Abstract: General-purpose vision-language models demonstrate strong performance in everyday domains but struggle with specialized technical fields requiring precise terminology, structured reasoning, and adherence to engineering standards. This work addresses whether domain-specific instruction tuning can enable comprehensive pavement condition assessment through vision-language models. PaveInstruct, a dataset containing 278,889 image-instruction-response pairs spanning 32 task types, was created by unifying annotations from nine heterogeneous pavement datasets. PaveGPT, a pavement foundation model trained on this dataset, was evaluated against state-of-the-art vision-language models across perception, understanding, and reasoning tasks. Instruction tuning transformed model capabilities, achieving improvements exceeding 20% in spatial grounding, reasoning, and generation tasks while producing ASTM D6433-compliant outputs. These results enable transportation agencies to deploy unified conversational assessment tools that replace multiple specialized systems, simplifying workflows and reducing technical expertise requirements. The approach establishes a pathway for developing instruction-driven AI systems across infrastructure domains including bridge inspection, railway maintenance, and building condition assessment.

[156] EditCaption: Human-Aligned Instruction Synthesis for Image Editing via Supervised Fine-Tuning and Direct Preference Optimization

Xiangyuan Wang,Honghao Cai,Yunhao Bai,Tianze Zhou,Haohua Chen,Yao Hu,Xu Tang,Yibo Chen,Wei Zhu

Main category: cs.CV

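TL;DR: This paper proposes EditCaption, a two-stage post-training method that lets vision-language models (VLMs) automatically generate high-quality image editing instructions, substantially lowering the instruction error rate and improving downstream editing performance.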

Details Motivation: High-quality training triplets (source-target image pairs plus precise editing instructions) are scarce, and editing instructions generated by existing VLMs show three systematic errors (orientation confusion, viewpoint ambiguity, and coarse attribute description), leaving over 47% of instructions unusable. Method: Proposes the two-stage EditCaption pipeline: stage 1 builds a 100K-sample supervised fine-tuning (SFT) dataset that combines GLM automatic annotation, EditScore filtering, and human refinement; stage 2 collects 10K human preference pairs and applies direct preference optimization (DPO) to correct the three error types. Result: On Eval-400, ByteMorph-Bench, and HQ-Edit, fine-tuned Qwen3-VL models outperform mainstream open-source baselines; the 235B model reaches 4.712 on Eval-400 (slightly above Gemini-3-Pro), and human evaluation shows the critical error rate falling from 47.75% to 23% and correctness rising from 41.75% to 66%. Conclusion: EditCaption offers a practical, scalable, human-aligned route to instruction synthesis for image editing and eases the bottleneck in high-quality training data. Abstract: High-quality training triplets (source-target image pairs with precise editing instructions) are a critical bottleneck for scaling instruction-guided image editing models. Vision-language models (VLMs) are widely used for automated instruction synthesis, but we identify three systematic failure modes in image-pair settings: orientation inconsistency (e.g., left/right confusion), viewpoint ambiguity, and insufficient fine-grained attribute description. Human evaluation shows that over 47% of instructions from strong baseline VLMs contain critical errors unusable for downstream training. We propose EditCaption, a scalable two-stage post-training pipeline for VLM-based instruction synthesis. Stage 1 builds a 100K supervised fine-tuning (SFT) dataset by combining GLM automatic annotation, EditScore-based filtering, and human refinement for spatial, directional, and attribute-level accuracy. Stage 2 collects 10K human preference pairs targeting the three failure modes and applies direct preference optimization (DPO) for alignment beyond SFT alone. On Eval-400, ByteMorph-Bench, and HQ-Edit, fine-tuned Qwen3-VL models outperform open-source baselines; the 235B model reaches 4.712 on Eval-400 (vs. Gemini-3-Pro 4.706, GPT-4.1 4.220, Kimi-K2.5 4.111) and 4.588 on ByteMorph-Bench (vs. Gemini-3-Pro 4.522, GPT-4.1 3.412). Human evaluation shows critical errors falling from 47.75% to 23% and correctness rising from 41.75% to 66%. The work offers a practical path to scalable, human-aligned instruction synthesis for image editing data.
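
Stage 2 relies on direct preference optimization. As a reminder of what that objective looks like, the sketch below evaluates the standard per-pair DPO loss from sequence log-probabilities under the policy and a frozen reference model. The log-probability values and `beta` are made-up illustrations, not settings reported for EditCaption.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    margin = (log pi(chosen) - log pi_ref(chosen))
           - (log pi(rejected) - log pi_ref(rejected))
    All inputs are sequence log-probabilities.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(x)) == log(1 + exp(-x)), written with log1p for stability.
    return math.log1p(math.exp(-beta * margin))

# Policy identical to the reference: the loss equals log(2).
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)

# Policy raises the preferred instruction and lowers the rejected one.
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
```

The loss falls as the policy assigns relatively more probability to the human-preferred instruction than the reference does, which is how preference pairs aimed at a specific failure mode can steer the model away from it.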

[157] Generalization Under Scrutiny: Cross-Domain Detection Progresses, Pitfalls, and Persistent Challenges

Saniya M. Deshmukh,Kailash A. Hambarde,Hugo Proença

Main category: cs.CV

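TL;DR: This survey of cross-domain object detection (CDOD) systematically reviews the field's challenges, method taxonomy, domain shift propagation mechanisms, datasets, and evaluation standards, and points out future research directions.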

Details Motivation: Object detectors trained on a source domain degrade markedly when transferred to unseen target domains, and existing research is fragmented, lacking a unified view of the core challenges of domain shift and of how well adaptation strategies work. Method: Presents a multi-stage problem formulation, builds a conceptual taxonomy based on adaptation paradigms, modeling assumptions, and detection pipeline components, and analyzes how domain shift propagates through the detection stages. Result: Establishes a unified analytical framework for CDOD, systematically summarizes mainstream methods, common datasets, and evaluation protocols, and explains why detection is fundamentally harder to adapt than classification. Conclusion: The survey provides a theoretical basis and practical guidance for understanding and advancing cross-domain object detection and supports building more robust detection systems. Abstract: Object detection models trained on a source domain often exhibit significant performance degradation when deployed in unseen target domains, due to variations in sensing conditions, environments, and data distributions. Hence, regardless of the recent breakthrough advances in deep learning-based detection technology, cross-domain object detection (CDOD) remains a critical research area. Moreover, the existing literature remains fragmented, lacking a unified perspective on the structural challenges underlying domain shift and the effectiveness of adaptation strategies. This survey provides a comprehensive and systematic analysis of CDOD. We start from a problem formulation that highlights the multi-stage nature of object detection under domain shift. Then, we organize the existing methods through a conceptual taxonomy that categorizes approaches based on adaptation paradigms, modeling assumptions, and pipeline components. Furthermore, we analyze how domain shift propagates across detection stages and discuss why adaptation in object detection is inherently more complex than in classification. In addition, we review commonly used datasets, evaluation protocols, and benchmarking practices. Finally, we identify the key challenges and outline promising future research directions. Overall, this survey aims to provide a unified framework for understanding CDOD and to guide the development of more robust detection systems.

[158] $\oslash$ Source Models Leak What They Shouldn't $\nrightarrow$: Unlearning Zero-Shot Transfer in Domain Adaptation Through Adversarial Optimization

Arnav Devalapally,Poornima Jain,Kartik Srinivas,Vineeth N. Balasubramanian

Main category: cs.CV

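TL;DR: This paper proposes SCADA-UL, a new machine unlearning setting for source-free domain adaptation (SFDA) that targets forgetting of source-exclusive classes, and designs an unlearning method that combines adversarial sample generation, a rescaled labeling strategy, and adversarial optimization to protect source-domain privacy while maintaining target-domain performance.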

Details Motivation: Existing source-free domain adaptation (SFDA) models inadvertently leak knowledge of source-exclusive classes into the target domain, creating a privacy risk, while conventional machine unlearning methods do not account for distribution shift and offer no targeted solution. Method: Proposes the SCADA-UL unlearning setting; designs a joint unlearning mechanism based on adversarially generated forget-class samples, a rescaled labeling strategy, and adversarial optimization; extends the study to two variants, a continual version and one where the classes to forget are unknown. Result: On multiple benchmark datasets the method clearly outperforms baselines at forgetting source-exclusive classes, reaches retraining-level unlearning, and preserves target-domain classification performance. Conclusion: SCADA-UL provides the first systematic machine unlearning framework for privacy protection in SFDA and demonstrates that controllable, efficient unlearning under distribution shift is feasible and effective. Abstract: The increasing adaptation of vision models across domains, such as satellite imagery and medical scans, has raised an emerging privacy risk: models may inadvertently retain and leak sensitive source-domain-specific information in the target domain. This creates a compelling use case for machine unlearning to protect the privacy of sensitive source-domain data. Among adaptation techniques, source-free domain adaptation (SFDA) creates an urgent need for machine unlearning (MU): the source data itself is protected, yet the source model exposed during adaptation encodes its influence. Our experiments reveal that existing SFDA methods exhibit strong zero-shot performance on source-exclusive classes in the target domain, indicating they inadvertently leak knowledge of these classes into the target domain, even when they are not represented in the target data. We identify and address this risk by proposing an MU setting called SCADA-UL: Unlearning Source-exclusive ClAsses in Domain Adaptation. Existing MU methods do not address this setting as they are not designed to handle data distribution shifts. We propose a new unlearning method, where an adversarially generated forget class sample is unlearned by the model during the domain adaptation process using a novel rescaled labeling strategy and adversarial optimization. We also extend our study to two variants: a continual version of this problem setting and one where the specific source classes to be forgotten may be unknown. Alongside theoretical interpretations, our comprehensive empirical results show that our method consistently outperforms baselines in the proposed setting while achieving retraining-level unlearning performance on benchmark datasets. Our code is available at https://github.com/D-Arnav/SCADA

[159] DBMF: A Dual-Branch Multimodal Framework for Out-of-Distribution Detection

Jiangbei Yue,Sharib Ali

Main category: cs.CV

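TL;DR: This paper proposes a dual-branch multimodal framework that combines a text-image branch with a vision branch to improve out-of-distribution (OOD) detection by deep learning models on endoscopic images, clearly outperforming existing methods.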

Details Motivation: Existing OOD detection methods usually rely on a single visual modality or on image-text matching alone, fail to exploit multimodal information fully, and struggle with the complex, dynamic shifts in data distribution found in clinical environments. Method: Proposes a dual-branch multimodal framework with a text-image branch and a vision-only branch; an OOD score is computed for each branch ($S_t$ and $S_v$), and the two are fused into a final OOD score $S$ that is compared with a threshold. Result: Experiments on public endoscopic image datasets show that the framework is robust across multiple backbones and improves OOD detection over the current best methods by up to 24.84%. Conclusion: The dual-branch multimodal framework makes fuller use of multimodal information and substantially improves the ability of DL models to recognize OOD samples reliably in clinical settings. Abstract: The complex and dynamic real-world clinical environment demands reliable deep learning (DL) systems. Out-of-distribution (OOD) detection plays a critical role in enhancing the reliability and generalizability of DL models when encountering data that deviate from the training distribution, such as unseen disease cases. However, existing OOD detection methods typically rely either on a single visual modality or solely on image-text matching, failing to fully leverage multimodal information. To overcome the challenge, we propose a novel dual-branch multimodal framework by introducing a text-image branch and a vision branch. Our framework fully exploits multimodal representations to identify OOD samples through these two complementary branches. After training, we compute scores from the text-image branch ($S_t$) and vision branch ($S_v$), and integrate them to obtain the final OOD score $S$ that is compared with a threshold for OOD detection. Comprehensive experiments on publicly available endoscopic image datasets demonstrate that our proposed framework is robust across diverse backbones and improves state-of-the-art performance in OOD detection by up to 24.84%.
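
The final decision step, combining $S_t$ and $S_v$ into $S$ and thresholding, can be illustrated as follows. The abstract does not state the fusion rule or the score orientation, so this sketch assumes a convex combination with a hypothetical weight `alpha` and treats higher scores as more likely OOD.

```python
def fuse_scores(s_t, s_v, alpha=0.5):
    """Combine text-image and vision branch scores (assumed convex combination)."""
    return alpha * s_t + (1.0 - alpha) * s_v

def is_ood(s_t, s_v, threshold, alpha=0.5):
    """Flag a sample as OOD when the fused score exceeds the threshold.

    Assumption: larger scores mean 'more out-of-distribution'.
    """
    return fuse_scores(s_t, s_v, alpha) > threshold

# Toy scores (hypothetical): one clearly in-distribution, one clearly OOD.
in_dist = is_ood(s_t=0.1, s_v=0.2, threshold=0.5)
out_dist = is_ood(s_t=0.9, s_v=0.8, threshold=0.5)
```

In practice the threshold would be chosen on held-out in-distribution data, for example to fix a target true-positive rate.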

[160] Orion-Lite: Distilling LLM Reasoning into Efficient Vision-Only Driving Models

Jing Gu,Niccolò Cavagnero,Gijs Dubbelman

Main category: cs.CV

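TL;DR: This paper proposes Orion-Lite, a lightweight vision-only model for autonomous driving; using latent feature distillation and ground-truth trajectory supervision, it distills the knowledge of ORION, an LLM-augmented VLA teacher, into a compact student that surpasses the teacher in complex, interactive closed-loop scenarios and reaches a Driving Score of 80.6 on the Bench2Drive benchmark.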

Details Motivation: LLM-augmented VLA models perform well but have large parameter counts and struggle to meet the low-latency, energy-efficiency demands of autonomous driving; prior knowledge distillation work is mostly limited to simple scenarios and open-loop evaluation and lacks validation in complex, interactive closed-loop settings. Method: Combines latent feature distillation with ground-truth trajectory supervision to transfer the knowledge of the large VLA teacher ORION to the lightweight vision-only student Orion-Lite. Result: Orion-Lite achieves a Driving Score of 80.6 on Bench2Drive, setting a new SOTA and surpassing its large teacher ORION. Conclusion: Vision-only architectures still hold significant untapped potential for high-performance reactive planning; high-quality distillation can sharply cut compute cost while even improving performance on complex closed-loop driving tasks. Abstract: Leveraging the general world knowledge of Large Language Models (LLMs) holds significant promise for improving the ability of autonomous driving systems to handle rare and complex scenarios. While integrating LLMs into Vision-Language-Action (VLA) models has yielded state-of-the-art performance, their massive parameter counts pose severe challenges for latency-sensitive and energy-efficient deployment. Distilling LLM knowledge into a compact driving model offers a compelling solution to retain these reasoning capabilities while maintaining a manageable computational footprint. Although previous works have demonstrated the efficacy of distillation, these efforts have primarily focused on relatively simple scenarios and open-loop evaluations. Therefore, in this work, we investigate LLM distillation in more complex, interactive scenarios under closed-loop evaluation. We demonstrate that through a combination of latent feature distillation and ground-truth trajectory supervision, an efficient vision-only student model, Orion-Lite, can even surpass the performance of its massive VLA teacher, ORION. It sets a new state of the art on the rigorous Bench2Drive benchmark, with a Driving Score of 80.6. Ultimately, this reveals that vision-only architectures still possess significant, untapped potential for high-performance reactive planning.

[161] Preventing Overfitting in Deep Image Prior for Hyperspectral Image Denoising

Panagiotis Gkotsis,Athanasios A. Rontogiannis

Main category: cs.CV

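TL;DR: This paper proposes a DIP method for hyperspectral image denoising that combines a robust data fidelity term with explicit sensitivity regularization, mitigating overfitting and improving performance.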

Details Motivation: DIP methods for hyperspectral image denoising are prone to overfitting, which degrades performance and requires early stopping, so an improvement is needed. Method: Jointly trains with a Smooth $\ell_1$ data term, divergence-based regularization, and input optimization. Result: On real hyperspectral images corrupted by Gaussian, sparse, and stripe noise, the method effectively prevents overfitting and denoises better than existing DIP methods. Conclusion: Combining robust data fidelity with sensitivity regularization substantially improves the generalization and robustness of DIP for HSI denoising. Abstract: Deep image prior (DIP) is an unsupervised deep learning framework that has been successfully applied to a variety of inverse imaging problems. However, DIP-based methods are inherently prone to overfitting, which leads to performance degradation and necessitates early stopping. In this paper, we propose a method to mitigate overfitting in DIP-based hyperspectral image (HSI) denoising by jointly combining robust data fidelity and explicit sensitivity regularization. The proposed approach employs a Smooth $\ell_1$ data term together with a divergence-based regularization and input optimization during training. Experimental results on real HSIs corrupted by Gaussian, sparse, and stripe noise demonstrate that the proposed method effectively prevents overfitting and achieves superior denoising performance compared to state-of-the-art DIP-based HSI denoising methods.
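
For readers unfamiliar with the data term, the sketch below shows the common Smooth $\ell_1$ definition (quadratic for small residuals, linear for large ones). The transition point `beta` is an assumed parameter, and the paper's exact parameterization may differ; the divergence-based regularizer and input optimization are not reproduced here.

```python
def smooth_l1(residual, beta=1.0):
    """Smooth L1 of one residual: 0.5 * r^2 / beta if |r| < beta, else |r| - 0.5 * beta."""
    a = abs(residual)
    if a < beta:
        return 0.5 * a * a / beta
    return a - 0.5 * beta

def smooth_l1_loss(pred, target, beta=1.0):
    """Mean Smooth L1 over paired values (for example, flattened pixel intensities)."""
    return sum(smooth_l1(p - t, beta) for p, t in zip(pred, target)) / len(pred)

# A small residual is penalized quadratically, a large one only linearly.
small = smooth_l1(0.5)
large = smooth_l1(-2.0)
loss = smooth_l1_loss([0.5, 3.0], [0.0, 1.0])
```

The linear tail limits the influence of outliers such as sparse or stripe noise compared with a squared-error term, which is the usual motivation for a robust fidelity term of this kind.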

[162] Revisiting Radar Perception With Spectral Point Clouds

Hamza Alsharif,Jing Gu,Pavol Jancura,Satish Ravindran,Gijs Dubbelman

Main category: cs.CV

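TL;DR: This paper proposes the spectral point cloud paradigm, which treats point clouds as sparse, compressed representations of the radar spectrum and enriches them with spectral information so that, at certain densities, they match or even exceed dense range-Doppler (RD) spectra, laying groundwork for unified radar perception and radar foundation models.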

Details Motivation: Dense range-Doppler (RD) spectra are widely assumed to outperform sparse point clouds, but they vary with sensor and configuration, which hinders cross-platform transfer; point clouds are a general 3D representation whose expressiveness for radar perception needs improving. Method: Proposes the spectral point cloud paradigm, treating point clouds as sparse, compressed representations of the radar spectrum; designs an experimental framework that compares spectral point cloud models at different densities against a dense RD benchmark; explores two basic spectral enrichment methods that inject target-relevant spectral information (such as Doppler and signal-to-noise ratio) into the point clouds. Result: At certain point cloud densities, spectral point cloud models without enrichment match the dense RD benchmark; with spectral enrichment, point cloud models can clearly surpass it. Conclusion: Point clouds need not be at a disadvantage in radar perception; spectral point clouds are a robust, unified, and scalable input representation that could become the core data format for future radar foundation models. Abstract: Radar perception models are trained with different inputs, from range-Doppler spectra to sparse point clouds. Dense spectra are assumed to outperform sparse point clouds, yet they can vary considerably across sensors and configurations, which hinders transfer. In this paper, we provide alternatives for incorporating spectral information into radar point clouds and show that point clouds need not underperform compared to spectra. We introduce the spectral point cloud paradigm, where point clouds are treated as sparse, compressed representations of the radar spectra, and argue that, when enriched with spectral information, they serve as strong candidates for a unified input representation that is more robust against sensor-specific differences. We develop an experimental framework that compares spectral point cloud (PC) models at varying densities against a dense range-Doppler (RD) benchmark, and report the density levels where the PC configurations meet the performance of the RD benchmark. Furthermore, we experiment with two basic spectral enrichment approaches that inject additional target-relevant information into the point clouds. Contrary to the common belief that the dense RD approach is superior, we show that point clouds can do just as well, and can surpass the RD benchmark when enrichment is applied. Spectral point clouds can therefore serve as strong candidates for unified radar perception, paving the way for future radar foundation models.
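
One way to picture a point cloud as a sparse, compressed view of a spectrum is to keep only the strongest cells of a range-Doppler map and carry their spectral values along as point features. The sketch below does exactly that with a top-k selection. It is a toy illustration, not the paper's extraction or enrichment procedure, and the map values and field names are hypothetical.

```python
def spectrum_to_points(rd_map, k):
    """Keep the k strongest cells of a range-Doppler map as feature-carrying points.

    rd_map: list of rows, rd_map[range_bin][doppler_bin] = power.
    The parameter k controls point cloud density.
    """
    cells = [
        (power, r, d)
        for r, row in enumerate(rd_map)
        for d, power in enumerate(row)
    ]
    cells.sort(reverse=True)   # strongest cells first
    return [{"range_bin": r, "doppler_bin": d, "power": p} for p, r, d in cells[:k]]

# Toy 2 x 3 range-Doppler map (hypothetical power values).
rd_map = [
    [0.1, 0.9, 0.2],
    [0.3, 0.05, 0.8],
]
points = spectrum_to_points(rd_map, k=2)
```

Varying `k` mimics comparing point cloud models at different densities, and attaching extra per-point fields (here only `power`) mirrors the idea of enriching points with target-relevant spectral information.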

[163] CAMotion: A High-Quality Benchmark for Camouflaged Moving Object Detection in the Wild

Siyuan Yao,Hao Sun,Ruiqi Yu,Xiwei Jiang,Wenqi Ren,Xiaochun Cao

Main category: cs.CV

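TL;DR: This paper builds CAMotion, a high-quality benchmark dataset for video camouflaged object detection (VCOD) that addresses the small scale and limited diversity of existing VCOD datasets and supports in-depth analysis and model evaluation of camouflaged moving objects in complex scenes.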

Details Motivation: Existing video camouflaged object detection (VCOD) datasets are small and limited in diversity, which makes it hard to adequately train and comprehensively evaluate data-driven deep learning methods. Method: Builds the CAMotion benchmark, which contains video sequences from a variety of real in-the-wild scenes with challenging attributes such as uncertain edges, occlusion, motion blur, and shape complexity, and provides detailed sequence annotations and multi-perspective statistical analysis; also systematically evaluates current SOTA models on the dataset. Result: Releases CAMotion, described as the first large-scale, multi-attribute VCOD benchmark for dynamic camouflaged objects in the wild, with detailed annotations, statistical distributions, and SOTA evaluation results that reveal the main challenges of the VCOD task. Conclusion: CAMotion fills the gap of a high-quality benchmark for VCOD and is expected to advance camouflaged object detection research, especially in the video domain. Abstract: Discovering camouflaged objects is a challenging task in computer vision due to the high similarity between camouflaged objects and their surroundings. While the problem of camouflaged object detection over sequential video frames has received increasing attention, the scale and diversity of existing video camouflaged object detection (VCOD) datasets are greatly limited, which hinders the deeper analysis and broader evaluation of recent deep learning-based algorithms with data-hungry training strategies. To break this bottleneck, in this paper, we construct CAMotion, a high-quality benchmark that covers a wide range of species for camouflaged moving object detection in the wild. CAMotion comprises various sequences with multiple challenging attributes such as uncertain edge, occlusion, motion blur, and shape complexity. The sequence annotation details and statistical distribution are presented from various perspectives, allowing CAMotion to provide in-depth analyses on the camouflaged object's motion characteristics in different challenging scenarios. Additionally, we conduct a comprehensive evaluation of existing SOTA models on CAMotion, and discuss the major challenges in the VCOD task. The benchmark is available at https://www.camotion.focuslab.net.cn. We hope that our CAMotion can lead to further advancements in the research community.

[164] GroundingAnomaly: Spatially-Grounded Diffusion for Few-Shot Anomaly Synthesis

Yishen Liu,Hongcang Chen,Pengcheng Zhao,Yunfan Bao,Yuxi Tian,Jieming Zhang,Hao Chen,Zheng Zhi,Yongchun Liu,Ying Li,Dongpu Cao

Main category: cs.CV

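TL;DR: This paper proposes GroundingAnomaly, a framework that uses a spatial conditioning module and a gated self-attention module to generate high-quality, precisely localized industrial anomaly images in the few-shot setting, significantly improving anomaly detection, segmentation, and instance-level detection.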

Details Motivation: Industrial visual anomaly detection is limited by the scarcity of real anomalous samples, and existing anomaly synthesis methods suffer from poor blending or inaccurate masks. Method: Proposes GroundingAnomaly, comprising a Spatial Conditioning Module that uses per-pixel semantic maps for precise localization control, and a Gated Self-Attention Module that injects conditioning tokens into a frozen U-Net through gated attention, balancing pretrained priors with stable few-shot adaptation. Result: On MVTec AD and VisA, the generated anomalies are of high quality, and the method reaches SOTA performance on downstream tasks including anomaly detection, segmentation, and instance-level detection. Conclusion: GroundingAnomaly effectively addresses spatial control and model stability in few-shot anomaly synthesis and offers a more reliable data augmentation scheme for industrial quality inspection. Abstract: The performance of visual anomaly inspection in industrial quality control is often constrained by the scarcity of real anomalous samples. Consequently, anomaly synthesis techniques have been developed to enlarge training sets and enhance downstream inspection. However, existing methods either suffer from poor integration caused by inpainting or fail to provide accurate masks. To address these limitations, we propose GroundingAnomaly, a novel few-shot anomaly image generation framework. Our framework introduces a Spatial Conditioning Module that leverages per-pixel semantic maps to enable precise spatial control over the synthesized anomalies. Furthermore, a Gated Self-Attention Module is designed to inject conditioning tokens into a frozen U-Net via gated attention layers. This carefully preserves pretrained priors while ensuring stable few-shot adaptation. Extensive evaluations on the MVTec AD and VisA datasets demonstrate that GroundingAnomaly generates high-quality anomalies and achieves state-of-the-art performance across multiple downstream tasks, including anomaly detection, segmentation, and instance-level detection.
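The sketch below illustrates one common way a gated attention layer can preserve pretrained behaviour: a residual branch scaled by `tanh(gamma)`, so that a gate initialised at zero leaves the frozen features untouched. Whether GroundingAnomaly uses exactly this gate is an assumption. The abstract only says conditioning tokens are injected "via gated attention layers". Learned projections are omitted, and the names `attend` and `gated_self_attention` are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, tokens):
    """Single-head dot-product attention of one query over a list of tokens."""
    scale = math.sqrt(len(query))
    weights = softmax(
        [sum(q * t for q, t in zip(query, tok)) / scale for tok in tokens]
    )
    return [
        sum(w * tok[i] for w, tok in zip(weights, tokens))
        for i in range(len(query))
    ]


def gated_self_attention(visual, cond, gamma):
    """Add tanh(gamma)-scaled attention over visual + conditioning tokens.

    Assumption: a tanh gate on a residual branch; no learned projections.
    """
    gate = math.tanh(gamma)
    context = visual + cond  # visual tokens attend to themselves and to cond
    return [
        [v_i + gate * a_i for v_i, a_i in zip(v, attend(v, context))]
        for v in visual
    ]


visual = [[1.0, 0.0], [0.0, 1.0]]   # toy frozen U-Net features
cond = [[0.5, 0.5]]                 # toy conditioning token
unchanged = gated_self_attention(visual, cond, gamma=0.0)
shifted = gated_self_attention(visual, cond, gamma=1.0)
```

With `gamma=0.0` the output equals the input features, which is why a zero-initialised gate is a stable starting point when only a few examples are available for adaptation.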

[165] Weakly-Supervised Lung Nodule Segmentation via Training-Free Guidance of 3D Rectified Flow

Richard Petersen,Fredrik Kahl,Jennifer Alvén

Main category: cs.CV

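TL;DR: This paper proposes a weakly supervised lung nodule segmentation method that needs no dense annotations: it combines a pretrained 3D rectified flow generative model with a predictor and achieves high-quality segmentation using only image-level labels.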

Details Motivation: Dense annotations such as segmentation masks are costly and time-consuming for 3D medical images, which require expert voxel-level labeling; existing attribution-based weakly supervised methods struggle to accurately capture small structures such as lung nodules. Method: Proposes a plug-and-play weakly supervised segmentation method that applies training-free guidance to a pretrained 3D rectified flow model, fine-tuning only the predictor with image-level labels and leaving the generative model untouched. Result: Experiments on LUNA16 show that the method clearly outperforms baselines, consistently detects lung nodules of varying sizes and shapes, and improves segmentation quality. Conclusion: Generative foundation models can serve as effective tools for weakly supervised 3D medical image segmentation, and the method offers a new way to reduce annotation dependence. Abstract: Dense annotations, such as segmentation masks, are expensive and time-consuming to obtain, especially for 3D medical images where expert voxel-wise labeling is required. Weakly supervised approaches aim to address this limitation, but often rely on attribution-based methods that struggle to accurately capture small structures such as lung nodules. In this paper, we propose a weakly-supervised segmentation method for lung nodules by combining pretrained state-of-the-art rectified flow and predictor models in a plug-and-play manner. Our approach uses training-free guidance of a 3D rectified flow model, requiring only fine-tuning of the predictor using image-level labels and no retraining of the generative model. The proposed method produces improved-quality segmentations for two separate predictors, consistently detecting lung nodules of varying sizes and shapes. Experiments on LUNA16 demonstrate improvements over baseline methods, highlighting the potential of generative foundation models as tools for weakly supervised 3D medical image segmentation.

[166] Fundus-R1: Training a Fundus-Reading MLLM with Knowledge-Aware Reasoning on Public Data

Yuchuan Deng,Qijie Wei,Kaiheng Qian,Jiazhen Liu,Zijie Xin,Bangxiang Lan,Jingyu Liu,Jianfeng Dong,Xirong Li

Main category: cs.CV

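TL;DR: This paper proposes Fundus-R1, a reasoning-capable multimodal large language model for fundus image understanding trained purely on public data (over 94% of which carries only image-level labels); it generates knowledge-aware reasoning traces with RAG and adds a process reward to RLVR to improve the self-consistency of reasoning, clearly outperforming baselines on multiple benchmarks.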

Details Motivation: Existing fundus image understanding models rely on large amounts of private, high-quality data paired with clinical reports; because these data are not public, research is hard to reproduce and has a high barrier to entry, so high-performing models built on public data are needed. Method: 1) Proposes a RAG-based method that automatically composes image-specific, knowledge-aware reasoning traces, linking visual findings identified by a generic MLLM to image-level labels through ophthalmic knowledge; 2) enhances reinforcement learning with verifiable rewards (RLVR) with a process reward that encourages self-consistency of the reasoning trace. Result: On three fundus-reading benchmarks (FunBench, Omni-Fundus, and GMAI-Fundus), Fundus-R1 clearly outperforms the generic multimodal model Qwen2.5-VL and a post-trained version that does not use the generated reasoning traces. Conclusion: Shows that a high-performing, interpretable fundus-reading MLLM can be trained using only large-scale public, weakly annotated (image-level label) data, opening a more accessible path for research in this area. Abstract: Fundus imaging such as CFP, OCT and UWF is crucial for the early detection of retinal anomalies and diseases. Fundus image understanding, due to its knowledge-intensive nature, poses a challenging vision-language task. An emerging approach to addressing the task is to post-train a generic multimodal large language model (MLLM), either by supervised finetuning (SFT) or by reinforcement learning with verifiable rewards (RLVR), on a considerable amount of in-house samples paired with high-quality clinical reports. However, these valuable samples are not publicly accessible, which not only hinders reproducibility but also practically limits research to a few players. To overcome the barrier, we make a novel attempt to train a reasoning-enhanced fundus-reading MLLM, which we term Fundus-R1, using exclusively public datasets, wherein over 94% of the data are annotated with only image-level labels. Our technical contributions are two-fold. First, we propose a RAG-based method for composing image-specific, knowledge-aware reasoning traces. Such auto-generated traces link visual findings identified by a generic MLLM to the image labels in terms of ophthalmic knowledge. Second, we enhance RLVR with a process reward that encourages self-consistency of the generated reasoning trace in each rollout. Extensive experiments on three fundus-reading benchmarks, i.e., FunBench, Omni-Fundus and GMAI-Fundus, show that Fundus-R1 clearly outperforms multiple baselines, including its generic counterpart (Qwen2.5-VL) and a stronger edition post-trained without using the generated traces. 
This work paves the way for training powerful fundus-reading MLLMs with publicly available data.

[167] Lost in the Hype: Revealing and Dissecting the Performance Degradation of Medical Multimodal Large Language Models in Image Classification

Xun Zhu,Fanbin Mo,Xi Chen,Kaili Zheng,Shaoshuai Yang,Yiming Shi,Jian Gao,Miao Li,Ji Wu

Main category: cs.CV

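TL;DR: Using feature probing, this paper systematically analyzes the root causes of the performance degradation of 14 open-source medical multimodal large language models (MLLMs) on image classification, reveals four typical failure modes, and proposes quantitative evaluation scores.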

Details Motivation: Although medical MLLMs have clear advantages in pre-training data and parameter count, they consistently lag behind traditional deep learning models on the basic task of medical image classification; this contradiction calls for a closer look at where the degradation originates. Method: Runs large-scale experiments with 14 open-source medical MLLMs on three representative medical image classification datasets, using module-by-module, layer-by-layer visual feature probing to track the flow of visual information and visualize how classification signals are distorted, diluted, or overridden at each stage of the model. Result: Presents what the authors describe as the first systematic identification of four failure modes behind the degradation: limited quality of visual representations, fidelity loss in connector projection, comprehension deficit in LLM reasoning, and misalignment of semantic mapping; also proposes quantifiable scores for the health of feature evolution, supporting objective comparison across models and datasets. Conclusion: Current medical MLLMs remain far from clinical usability; their bottleneck stems from a deep mismatch between architecture and task requirements, and progress is needed jointly in feature modeling, cross-modal alignment, and clinical semantic understanding. Abstract: The rise of multimodal large language models (MLLMs) has sparked an unprecedented wave of applications in the field of medical imaging analysis. However, as one of the earliest and most fundamental tasks integrated into this paradigm, medical image classification reveals a sobering reality: state-of-the-art medical MLLMs consistently underperform compared to traditional deep learning models, despite their overwhelming advantages in pre-training data and model parameters. This paradox prompts a critical rethinking: where exactly does the performance degradation originate? In this paper, we conduct extensive experiments on 14 open-source medical MLLMs across three representative image classification datasets. Moving beyond superficial performance benchmarking, we employ feature probing to track the information flow of visual features module-by-module and layer-by-layer throughout the entire MLLM pipeline, enabling explicit visualization of where and how classification signals are distorted, diluted, or overridden. As the first attempt to dissect classification performance degradation in medical MLLMs, our findings reveal four failure modes: 1) quality limitation in visual representation, 2) fidelity loss in connector projection, 3) comprehension deficit in LLM reasoning, and 4) misalignment of semantic mapping. Meanwhile, we introduce quantitative scores that characterize the healthiness of feature evolution, enabling principled comparisons across diverse MLLMs and datasets. 
Furthermore, we provide insightful discussions centered on the critical barriers that prevent current medical MLLMs from fulfilling their promised clinical potential. We hope that our work provokes rethinking within the community, highlighting that the road from high expectations to clinically deployable MLLMs remains long and winding.

[168] InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding

Ashutosh Kumar,Rajat Saini,Jingjing Pan,Mustafa Erdogan,Mingfang Zhang,Betty Le Dem,Norimasa Kobori,Quan Kong

Main category: cs.CV

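TL;DR: This paper proposes InstAP, a framework that jointly optimizes global vision-text alignment and fine-grained instance-level contrastive alignment to address the weakness of existing vision-language pre-training in instance-level reasoning; it also builds InstVL, a large-scale dataset with dual-granularity annotations (2 million images and 50,000 videos), and shows gains on instance retrieval and zero-shot video understanding along with accurate localization.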

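Details Motivation: Existing vision-language pre-training paradigms excel at global scene understanding but are limited in instance-level reasoning because they use only global supervision. Method: Proposes InstAP, an instance-aware pre-training framework that jointly optimizes global vision-text alignment and fine-grained instance-level contrastive alignment (grounding textual mentions to specific spatial-temporal regions); builds the dual-granularity InstVL dataset (holistic scene captions plus dense, grounded instance descriptions). Result: On the InstVL benchmark, InstAP substantially outperforms existing VLP models on instance-level retrieval and retains an advantage over a strong baseline trained on the same data; it also improves global understanding, with competitive zero-shot performance on video benchmarks such as MSR-VTT and DiDeMo; visualizations show that it accurately localizes the instances mentioned in text. Conclusion: Instance-aware pre-training not only strengthens fine-grained reasoning but also benefits global understanding, supporting the value of instance-level supervision for vision-language models. Abstract: Current vision-language pre-training (VLP) paradigms excel at global scene understanding but struggle with instance-level reasoning due to global-only supervision. We introduce InstAP, an Instance-Aware Pre-training framework that jointly optimizes global vision-text alignment and fine-grained, instance-level contrastive alignment by grounding textual mentions to specific spatial-temporal regions. To support this, we present InstVL, a large-scale dataset (2 million images, 50,000 videos) with dual-granularity annotations: holistic scene captions and dense, grounded instance descriptions. On the InstVL benchmark, InstAP substantially outperforms existing VLP models on instance-level retrieval, and also surpasses a strong VLP baseline trained on the exact same data corpus, isolating the benefit of our instance-aware objective. Moreover, instance-centric pre-training improves global understanding: InstAP achieves competitive zero-shot performance on multiple video benchmarks, including MSR-VTT and DiDeMo. Qualitative visualizations further show that InstAP localizes textual mentions to the correct instances, while global-only models exhibit more diffuse, scene-level attention.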
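As a rough illustration of jointly optimizing a global and an instance-level contrastive term, the sketch below sums two InfoNCE losses: one over scene/caption pairs and one over region/mention pairs. The InfoNCE form, the temperature of 0.07, the weighting scheme, and all function names are assumptions for illustration; the abstract does not specify the exact objective.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(anchors, candidates, temperature=0.07):
    """Mean cross-entropy where anchors[i] should match candidates[i]."""
    loss = 0.0
    for i, a in enumerate(anchors):
        logits = [dot(a, c) / temperature for c in candidates]
        m = max(logits)
        # Numerically stable log-sum-exp over all candidates.
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_sum - logits[i]
    return loss / len(anchors)


def joint_loss(scene_vis, scene_txt, inst_vis, inst_txt, instance_weight=1.0):
    """Global term plus a weighted instance-level term (weight is hypothetical)."""
    global_term = info_nce(scene_vis, scene_txt)
    instance_term = info_nce(inst_vis, inst_txt)
    return global_term + instance_weight * instance_term


# Toy unit-norm embeddings: matched pairs vs. deliberately swapped pairs.
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
good = joint_loss(aligned, aligned, aligned, aligned)
bad = joint_loss(aligned, aligned, aligned, swapped)
```

In this toy case the global term is identical in both calls, so the gap between `good` and `bad` comes entirely from the instance-level term, which is the kind of signal global-only supervision cannot provide.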

[169] PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models

Ruizhi Zhang,Ye Huang,Yuangang Pan,Chuanfu Shen,Zhilin Liu,Ting Xie,Wen Li,Lixin Duan

Main category: cs.CV

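TL;DR: This paper presents PokeGym, a visually driven, long-horizon 3D embodied-agent benchmark built on Pokemon Legends: Z-A that aims to overcome four shortcomings of existing VLM evaluations (interactivity, 3D depth perception, state leakage, and scalability); with strict isolation of visual input and automated evaluation, it shows that the core bottleneck of current VLMs is physical deadlock recovery, finds a metacognitive split between "unaware" and "aware" deadlocks, and calls for explicit spatial intuition in VLMs.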

Details Motivation: Existing VLM benchmarks have four shortcomings (passive perception, 2D simplification, state leakage, and costly, unscalable human evaluation), which makes it hard to assess the real capabilities of VLMs in complex 3D embodied environments. Method: Builds the PokeGym benchmark: 30 long-horizon (30-220 steps) visually driven tasks in Pokemon Legends: Z-A covering navigation, interaction, and mixed scenarios; three instruction granularities (Visual-Guided, Step-Guided, Goal-Only); and code-level isolation, where the agent receives only raw RGB frames and the evaluator independently verifies outcomes via memory scanning. Result: Physical deadlock recovery, rather than high-level planning, is the main bottleneck of current VLMs, and deadlocks correlate strongly and negatively with task success; the study further reveals a metacognitive divergence in which weaker models mostly fall into Unaware Deadlocks, while stronger models often exhibit Aware Deadlocks (recognizing entrapment but failing to escape). Conclusion: Current VLMs lack explicit spatial modeling and physical intuition and need spatial reasoning mechanisms in their architectures to make embodied interaction more robust; PokeGym offers a scalable, automated, 3D evaluation paradigm for future embodied VLM research. Abstract: While Vision-Language Models (VLMs) have achieved remarkable progress in static visual understanding, their deployment in complex 3D embodied environments remains severely limited. Existing benchmarks suffer from four critical deficiencies: (1) passive perception tasks circumvent interactive dynamics; (2) simplified 2D environments fail to assess depth perception; (3) privileged state leakage bypasses genuine visual processing; and (4) human evaluation is prohibitively expensive and unscalable. We introduce PokeGym, a visually-driven long-horizon benchmark instantiated within Pokemon Legends: Z-A, a visually complex 3D open-world Role-Playing Game. PokeGym enforces strict code-level isolation: agents operate solely on raw RGB observations while an independent evaluator verifies success via memory scanning, ensuring pure vision-based decision-making and automated, scalable assessment. The benchmark comprises 30 tasks (30-220 steps) spanning navigation, interaction, and mixed scenarios, with three instruction granularities (Visual-Guided, Step-Guided, Goal-Only) to systematically deconstruct visual grounding, semantic reasoning, and autonomous exploration capabilities. Our evaluation reveals a key limitation of current VLMs: physical deadlock recovery, rather than high-level planning, constitutes the primary bottleneck, with deadlocks showing a strong negative correlation with task success. 
Furthermore, we uncover a metacognitive divergence: weaker models predominantly suffer from Unaware Deadlocks (oblivious to entrapment), whereas advanced models exhibit Aware Deadlocks (recognizing entrapment yet failing to recover). These findings highlight the need to integrate explicit spatial intuition into VLM architectures. The code and benchmark will be available on GitHub.
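The isolation principle described above can be sketched with a toy environment: the agent function is handed pixels only, while a separate evaluator reads the privileged game state to decide success. Everything here (`ToyGame`, `Evaluator`, `pixel_agent`, the one-row "frame") is a hypothetical stand-in and is not taken from the PokeGym code, which had not been released at the time of the abstract.

```python
class ToyGame:
    """Stand-in for the game process; holds state the agent must never see."""

    def __init__(self, goal=3):
        self.position = 0  # privileged state (what a memory scan would read)
        self.goal = goal

    def render(self):
        # A one-row "RGB frame": a white pixel marks the player's position.
        return [
            (255, 255, 255) if x == self.position else (0, 0, 0)
            for x in range(self.goal + 1)
        ]

    def step(self, action):
        if action == "right":
            self.position = min(self.position + 1, self.goal)


class Evaluator:
    """Checks success from game state, independently of the agent."""

    def __init__(self, game):
        self._game = game

    def succeeded(self):
        return self._game.position == self._game.goal


def pixel_agent(frame):
    """Policy that receives only pixels; a real agent would query a VLM here."""
    return "right"


def run_episode(max_steps=10):
    game = ToyGame()
    evaluator = Evaluator(game)
    steps = 0
    for _ in range(max_steps):
        if evaluator.succeeded():
            break
        # The agent is passed the rendered frame, never the game object.
        game.step(pixel_agent(game.render()))
        steps += 1
    return evaluator.succeeded(), steps
```

Because `pixel_agent` never receives `game`, any success it achieves must come from the frame, and because `Evaluator` reads state directly, no human judgment is needed to score an episode.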

[170] MegaStyle: Constructing Diverse and Scalable Style Dataset via Consistent Text-to-Image Style Mapping

Junyao Gao,Sibo Liu,Jiaxing Li,Yanan Sun,Yuanpeng Tu,Fei Shen,Weidong Zhang,Cairong Zhao,Jun Zhang

Main category: cs.CV

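TL;DR: This paper presents MegaStyle, a scalable data curation pipeline for building high-quality style datasets, and trains a style encoder and a style transfer model on the resulting dataset.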

Details Motivation: Style transfer is currently limited by the lack of high-quality datasets that are consistent within a style and diverse across styles, so a new data curation method is needed to improve style representation learning and transfer. Method: Exploits the consistent text-to-image style mapping of large generative models to build a prompt gallery of 170K style prompts and 400K content prompts, and generates the large-scale style dataset MegaStyle-1.4M through content-style combinations; on top of this, trains MegaStyle-Encoder with style-supervised contrastive learning and trains the FLUX-based style transfer model MegaStyle-FLUX. Result: Experiments show that MegaStyle-1.4M has advantages in intra-style consistency, inter-style diversity, and image quality; MegaStyle-Encoder provides a reliable style similarity measure, and MegaStyle-FLUX achieves generalizable style transfer. Conclusion: MegaStyle provides a paradigm for building high-quality style datasets along with accompanying models, making a significant contribution to style transfer research. Abstract: In this paper, we introduce MegaStyle, a novel and scalable data curation pipeline that constructs an intra-style consistent, inter-style diverse and high-quality style dataset. We achieve this by leveraging the consistent text-to-image style mapping capability of current large generative models, which can generate images in the same style from a given style description. Building on this foundation, we curate a diverse and balanced prompt gallery with 170K style prompts and 400K content prompts, and generate a large-scale style dataset MegaStyle-1.4M via content-style prompt combinations. With MegaStyle-1.4M, we propose style-supervised contrastive learning to fine-tune a style encoder MegaStyle-Encoder for extracting expressive, style-specific representations, and we also train a FLUX-based style transfer model MegaStyle-FLUX. Extensive experiments demonstrate the importance of maintaining intra-style consistency, inter-style diversity and high quality for a style dataset, as well as the effectiveness of the proposed MegaStyle-1.4M. Moreover, when trained on MegaStyle-1.4M, MegaStyle-Encoder and MegaStyle-FLUX provide reliable style similarity measurement and generalizable style transfer, making a significant contribution to the style transfer community. More results are available at our project website https://jeoyal.github.io/MegaStyle/.

[171] SurfelSplat: Learning Efficient and Generalizable Gaussian Surfel Representations for Sparse-View Surface Reconstruction

Chensheng Dai,Shengjun Zhang,Min Chen,Yueqi Duan

Main category: cs.CV

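TL;DR: This paper proposes SurfelSplat, a feed-forward framework that generates pixel-aligned Gaussian surfel representations from sparse-view images for fast, generalizable multi-view surface reconstruction; a cross-view feature aggregation module based on the Nyquist sampling theorem addresses the difficulty of recovering high-frequency geometric attributes, giving results comparable to SOTA on the DTU dataset with a 100x inference speedup.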

Details Motivation: Existing optimization-based 3DGS surface reconstruction methods depend on dense input views and are slow, so they cannot meet real-time and generalization requirements. Method: Proposes the feed-forward SurfelSplat framework, which adapts the geometric form of Gaussian surfels with low-pass filters guided by the spatial sampling rate, models inter-view correlations through cross-view projection and a feature fusion network, and regresses accurate geometric attributes. Result: On the DTU benchmark, reaches reconstruction accuracy comparable to SOTA optimization-based methods, with about 1 second of inference per scene (roughly a 100x speedup) and no per-scene training. Conclusion: SurfelSplat balances reconstruction quality, speed, and generalization, offering a new paradigm for efficient sparse-view 3D surface reconstruction. Abstract: 3D Gaussian Splatting (3DGS) has demonstrated impressive performance in 3D scene reconstruction. Beyond novel view synthesis, it shows great potential for multi-view surface reconstruction. Existing methods employ optimization-based reconstruction pipelines that achieve precise and complete surface extractions. However, these approaches typically require dense input views and high time consumption for per-scene optimization. To address these limitations, we propose SurfelSplat, a feed-forward framework that generates efficient and generalizable pixel-aligned Gaussian surfel representations from sparse-view images. We observe that conventional feed-forward structures struggle to recover accurate geometric attributes of Gaussian surfels because the spatial frequency of pixel-aligned primitives exceeds Nyquist sampling rates. Therefore, we propose a cross-view feature aggregation module based on the Nyquist sampling theorem. Specifically, we first adapt the geometric forms of Gaussian surfels with spatial sampling rate-guided low-pass filters. We then project the filtered surfels across all input views to obtain cross-view feature correlations. By processing these correlations through a specially designed feature fusion network, we can finally regress Gaussian surfels with precise geometry. Extensive experiments on DTU reconstruction benchmarks demonstrate that our model achieves results comparable to state-of-the-art methods, and predicts Gaussian surfels within 1 second, offering a 100x speedup without costly per-scene training.

[172] Phantasia: Context-Adaptive Backdoors in Vision Language Models

Nam Duong Tran,Phi Le Nguyen

Main category: cs.CV

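TL;DR: This paper shows that the stealthiness of existing backdoor attacks on vision-language models (VLMs) has been overestimated, and proposes Phantasia, a new context-adaptive backdoor attack that generates semantically consistent, hard-to-detect malicious responses.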

Details Motivation: Existing VLM backdoor attacks mostly rely on fixed, easily identifiable poisoning patterns, and their stealthiness has been overestimated; truly stealthy, semantically adaptive attacks are lacking. Method: 1) Adapts defense techniques originally designed for other domains (vision-only and text-only models) to evaluate and expose the weaknesses of existing attacks; 2) proposes Phantasia, a context-adaptive backdoor attack that dynamically aligns with input semantics to generate coherent yet malicious outputs, avoiding static poisoning patterns. Result: Phantasia achieves SOTA attack success rates across multiple VLM architectures while maintaining benign-task performance under various defense settings, markedly improving stealth and generalization. Conclusion: The practical threat of VLM backdoor attacks has been underestimated; Phantasia provides a more realistic and challenging benchmark for evaluating and improving VLM robustness. Abstract: Recent advances in Vision-Language Models (VLMs) have greatly enhanced the integration of visual perception and linguistic reasoning, driving rapid progress in multimodal understanding. Despite these achievements, the security of VLMs, particularly their vulnerability to backdoor attacks, remains significantly underexplored. Existing backdoor attacks on VLMs are still in an early stage of development, with most current methods relying on generating poisoned responses that contain fixed, easily identifiable patterns. In this work, we make two key contributions. First, we demonstrate for the first time that the stealthiness of existing VLM backdoor attacks has been substantially overestimated. By adapting defense techniques originally designed for other domains (e.g., vision-only and text-only models), we show that several state-of-the-art attacks can be detected with surprising ease. Second, to address this gap, we introduce Phantasia, a context-adaptive backdoor attack that dynamically aligns its poisoned outputs with the semantics of each input. Instead of producing static poisoned patterns, Phantasia encourages models to generate contextually coherent yet malicious responses that remain plausible, thereby significantly improving stealth and adaptability. Extensive experiments across diverse VLM architectures reveal that Phantasia achieves state-of-the-art attack success rates while maintaining benign performance under various defensive settings.

[173] SyncBreaker: Stage-Aware Multimodal Adversarial Attacks on Audio-Driven Talking Head Generation

Wenli Zhang,Xianglong Shi,Sirui Zhao,Xinqi Chen,Guo Cheng,Yifan Xu,Tong Xu,Yong Liao

Main category: cs.CV

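TL;DR: This paper proposes SyncBreaker, a stage-aware multimodal protection framework that jointly perturbs image and audio inputs to suppress lip synchronization and facial dynamics in diffusion-based talking-head generation, while preserving the perceptual quality of the inputs and remaining robust to purification.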

Details Motivation: Existing protection methods are limited to a single modality (image only or audio only) and cannot effectively suppress speech-driven facial dynamics, leaving a risk of misuse for fraud and misinformation. Method: Proposes the SyncBreaker framework: 1) the image stream uses nullifying supervision with Multi-Interval Sampling (MIS), aggregating denoising guidance across diffusion stages; 2) the audio stream uses Cross-Attention Fooling (CAF), suppressing interval-specific audio-conditioned cross-attention responses; the two streams are optimized independently and combined at inference. Result: In a white-box proactive protection setting, SyncBreaker clearly outperforms strong single-modality baselines, degrading lip synchronization and facial dynamics while preserving input perceptual quality and remaining robust to purification. Conclusion: SyncBreaker shows that stage-aware, coordinated multimodal perturbation is an effective new paradigm for defending against misuse of audio-driven talking-head generation. Abstract: Diffusion-based audio-driven talking-head generation enables realistic portrait animation, but also introduces risks of misuse, such as fraud and misinformation. Existing protection methods are largely limited to a single modality, and neither image-only nor audio-only attacks can effectively suppress speech-driven facial dynamics. To address this gap, we propose SyncBreaker, a stage-aware multimodal protection framework that jointly perturbs portrait and audio inputs under modality-specific perceptual constraints. Our key contributions are twofold. First, for the image stream, we introduce nullifying supervision with Multi-Interval Sampling (MIS) across diffusion stages to steer the generation toward the static reference portrait by aggregating guidance from multiple denoising intervals. Second, for the audio stream, we propose Cross-Attention Fooling (CAF), which suppresses interval-specific audio-conditioned cross-attention responses. Both streams are optimized independently and combined at inference time to enable flexible deployment. We evaluate SyncBreaker in a white-box proactive protection setting. Extensive experiments demonstrate that SyncBreaker more effectively degrades lip synchronization and facial dynamics than strong single-modality baselines, while preserving input perceptual quality and remaining robust under purification. Code: https://github.com/kitty384/SyncBreaker.

[174] BLaDA: Bridging Language to Functional Dexterous Actions within 3DGS Fields

Fan Yang,Wenrui Chen,Guorun Yan,Ruize Liao,Wanjun Jia,Dongsheng Luo,Kailun Yang,Zhiyong Li,Yaonan Wang

Main category: cs.CV

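TL;DR: This paper proposes BLaDA, a framework that achieves zero-shot, interpretable functional dexterous grasping through language parsing, triangular functional point localization, and 3D keypoint grasp matrix transformation, significantly improving functional region localization accuracy and manipulation success rates.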

Details Motivation: Existing modular hierarchical methods rely on predefined affordance labels and lack tight coupling between semantics and pose, so they struggle to support functional dexterous manipulation under open-vocabulary instructions. Method: Proposes the BLaDA framework: 1) a Knowledge-guided Language Parsing (KLP) module parses natural language into a sextuple of manipulation constraints; 2) a Triangular Functional Point Localization (TriLocation) module, built on a 3D Gaussian Splatting representation, localizes functional regions under triangular geometric constraints; 3) a 3D Keypoint Grasp Matrix Transformation Execution (KGT3D+) module decodes the semantic-geometric constraints into wrist poses and finger-level commands. Result: On several complex benchmarks, BLaDA significantly outperforms existing methods in both functional region localization accuracy and functional manipulation success rate across categories and tasks. Conclusion: BLaDA achieves zero-shot functional dexterous manipulation driven by open-vocabulary instructions, with tight semantic-pose coupling and physical interpretability, improving controllability and generalization. Abstract: In unstructured environments, functional dexterous grasping calls for the tight integration of semantic understanding, precise 3D functional localization, and physically interpretable execution. Modular hierarchical methods are more controllable and interpretable than end-to-end VLA approaches, but existing ones still rely on predefined affordance labels and lack the tight semantic-pose coupling needed for functional dexterous manipulation. To address this, we propose BLaDA (Bridging Language to Dexterous Actions in 3DGS fields), an interpretable zero-shot framework that grounds open-vocabulary instructions as perceptual and control constraints for functional dexterous manipulation. BLaDA establishes an interpretable reasoning chain by first parsing natural language into a structured sextuple of manipulation constraints via a Knowledge-guided Language Parsing (KLP) module. To achieve pose-consistent spatial reasoning, we introduce the Triangular Functional Point Localization (TriLocation) module, which utilizes 3D Gaussian Splatting as a continuous scene representation and identifies functional regions under triangular geometric constraints. Finally, the 3D Keypoint Grasp Matrix Transformation Execution (KGT3D+) module decodes these semantic-geometric constraints into physically plausible wrist poses and finger-level commands. Extensive experiments on complex benchmarks demonstrate that BLaDA significantly outperforms existing methods in both affordance grounding precision and the success rate of functional manipulation across diverse categories and tasks. 
Code will be publicly available at https://github.com/PopeyePxx/BLaDA.

[175] HST-HGN: Heterogeneous Spatial-Temporal Hypergraph Networks with Bidirectional State Space Models for Global Fatigue Assessment

Changdao Chen

Main category: cs.CV

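TL;DR: This paper proposes HST-HGN, which combines a hierarchical hypergraph network with a bidirectional state space model (Bi-Mamba) to model high-order synergistic facial deformations and long-range temporal evolution in driver fatigue videos at low computational cost, enabling real-time, accurate fatigue detection.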

Details Motivation: Existing methods struggle to model the long-range temporal dependencies of subtle facial expressions in untrimmed videos under computational constraints: heavy models are too costly, while lightweight graph models lack high-order synergy and global temporal modeling. Method: Proposes the heterogeneous spatial-temporal hypergraph network HST-HGN: spatially, a hierarchical hypergraph fuses pose-disentangled geometric topologies with multi-modal texture patches; temporally, a linear-complexity bidirectional Bi-Mamba module performs explicit temporal-evolution filtering. Result: Achieves SOTA performance on multiple fatigue detection benchmarks and balances discriminative power with computational efficiency, making it suitable for real-time in-cabin edge deployment. Conclusion: Through heterogeneous hypergraphs and bidirectional state space modeling, HST-HGN addresses high-order synergy and long-range temporal modeling in lightweight settings, offering a new paradigm for real-time driver fatigue monitoring. Abstract: It remains challenging to assess driver fatigue from untrimmed videos under constrained computational budgets, due to the difficulty of modeling long-range temporal dependencies in subtle facial expressions. Some existing approaches rely on computationally heavy architectures, whereas others employ traditional lightweight pairwise graph networks, despite their limited capacity to model high-order synergies and global temporal context. Therefore, we propose HST-HGN, a novel Heterogeneous Spatial-Temporal Hypergraph Network driven by Bidirectional State Space Models. Spatially, we introduce a hierarchical hypergraph network to fuse pose-disentangled geometric topologies with multi-modal texture patches dynamically. This formulation encapsulates high-order synergistic facial deformations, effectively overcoming the limitations of conventional methods. In temporal terms, a Bi-Mamba module with linear complexity is applied to perform bidirectional sequence modeling. This explicit temporal-evolution filtering enables the network to distinguish highly ambiguous transient actions, such as yawning versus speaking, while encompassing their complete physiological lifecycles. Extensive evaluations across diverse fatigue benchmarks demonstrate that HST-HGN achieves state-of-the-art performance. In particular, our method strikes a balance between discriminative power and computational efficiency, making it well-suited for real-time in-cabin edge deployment.
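The linear-complexity bidirectional modeling mentioned above can be illustrated with a heavily simplified recurrence: one causal scan forward in time, one backward, and a merge of the two. Real Mamba blocks use input-dependent (selective) parameters and learned projections. The fixed scalar `decay` and the additive merge below are assumptions chosen only to show why each frame ends up with context from both its past and its future in O(T) time.

```python
def linear_scan(xs, decay):
    """Causal recurrence h_t = decay * h_{t-1} + x_t, computed in O(T)."""
    h, out = 0.0, []
    for x in xs:
        h = decay * h + x
        out.append(h)
    return out


def bidirectional_scan(xs, decay=0.5):
    """Merge a forward and a backward scan (additive merge is an assumption)."""
    forward = linear_scan(xs, decay)
    backward = linear_scan(xs[::-1], decay)[::-1]
    # Subtract x_t once because both directions include the current frame.
    return [f + b - x for f, b, x in zip(forward, backward, xs)]


# A single transient event at t = 2 (toy signal, not real fatigue data).
signal = [0.0, 0.0, 1.0, 0.0, 0.0]
context = bidirectional_scan(signal)
```

A forward-only scan would leave the frames before the event at zero, whereas the bidirectional version spreads the event's influence symmetrically, which is the property that lets a model see the full lifecycle of an action such as a yawn.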

[176] CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning

Rui Gan,Junyi Ma,Pei Li,Xingyou Yang,Kai Chen,Sikai Chen,Bin Ran

Main category: cs.CV

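TL;DR: This paper presents CrashSight, a large vision-language benchmark from the roadside perspective for evaluating how well models understand road crash scenes, with a focus on safety-critical capabilities such as temporal and causal reasoning; experiments show that existing vision-language models fall short on these tasks.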

Details Motivation: Vision-language models (VLMs) have not been adequately evaluated on the roadside perspective and safety-critical traffic scenarios (such as crash understanding) in autonomous driving, because mainstream benchmarks focus on the ego-vehicle view and cannot support the infrastructure-side perception required for cooperative autonomous driving. Method: Builds the CrashSight benchmark: 250 crash videos from real roadside cameras, annotated with 13K multiple-choice question-answer pairs under a two-tier taxonomy, where Tier 1 evaluates visual grounding (scene context, involved parties) and Tier 2 evaluates higher-level reasoning (crash mechanics, causal attribution, temporal progression, post-crash outcomes); systematically evaluates 8 SOTA VLMs and analyzes failure cases. Result: Current SOTA VLMs perform well on scene description but fall clearly short on safety-critical capabilities such as temporal and causal reasoning; typical failure modes are identified and directions for improvement discussed. Conclusion: CrashSight fills the gap in evaluating VLM safety reasoning from the roadside perspective and provides a standardized evaluation framework for infrastructure-assisted perception in cooperative autonomous driving. Abstract: Cooperative autonomous driving requires traffic scene understanding from both vehicle and infrastructure perspectives. While vision-language models (VLMs) show strong general reasoning capabilities, their performance in safety-critical traffic scenarios remains insufficiently evaluated due to the ego-vehicle focus of existing benchmarks. To bridge this gap, we present CrashSight, a large-scale vision-language benchmark for roadway crash understanding using real-world roadside camera data. The dataset comprises 250 crash videos, annotated with 13K multiple-choice question-answer pairs organized under a two-tier taxonomy. Tier 1 evaluates the visual grounding of scene context and involved parties, while Tier 2 probes higher-level reasoning, including crash mechanics, causal attribution, temporal progression, and post-crash outcomes. We benchmark 8 state-of-the-art VLMs and show that, despite strong scene description capabilities, current models struggle with temporal and causal reasoning in safety-critical scenarios. We provide a detailed analysis of failure scenarios and discuss directions for improving VLM crash understanding. The benchmark provides a standardized evaluation framework for infrastructure-assisted perception in cooperative autonomous driving. The CrashSight benchmark, including the full dataset and code, is accessible at https://mcgrche.github.io/crashsight.
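Scoring a multiple-choice benchmark with a two-tier taxonomy usually comes down to per-tier accuracy. The sketch below shows one way to compute it. The record layout `(tier, predicted_choice, correct_choice)` and the function name are hypothetical and not taken from the CrashSight release, and the records are made-up toy data rather than reported results.

```python
from collections import defaultdict


def accuracy_by_tier(records):
    """records: iterable of (tier, predicted_choice, correct_choice)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for tier, predicted, correct in records:
        totals[tier] += 1
        hits[tier] += int(predicted == correct)
    return {tier: hits[tier] / totals[tier] for tier in totals}


# Toy predictions only; these are not numbers from the paper.
records = [
    ("tier1", "A", "A"),
    ("tier1", "C", "C"),
    ("tier2", "B", "D"),
    ("tier2", "A", "A"),
]
scores = accuracy_by_tier(records)
```

Reporting the tiers separately is what allows a gap between scene description (Tier 1) and temporal or causal reasoning (Tier 2) to show up instead of being averaged away.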

[177] OVS-DINO: Open-Vocabulary Segmentation via Structure-Aligned SAM-DINO with Language Guidance

Haoxi Zeng,Qiankun Liu,Yi Bin,Haiyue Zhang,Yujuan Ding,Guoqing Wang,Deqiang Ouyang,Heng Tao Shen

Main category: cs.CV

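TL;DR: This paper proposes OVS-DINO, a framework that aligns DINO with the structural priors of SAM to restore boundary awareness in DINO's deep features, significantly improving open-vocabulary segmentation, especially in complex scenes.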

Details Motivation: CLIP-based methods generalize well semantically but lack spatial detail; methods built on vision foundation models such as DINO still lack precise edge perception. Method: Finds that boundary information in DINO progressively attenuates in deeper features, and proposes OVS-DINO, which introduces a Structure-Aware Encoder (SAE) and a Structure-Modulated Decoder (SMD) that use SAM's structural priors to activate DINO's boundary features, supervised with SAM-generated pseudo-masks. Result: Reaches SOTA on multiple weakly supervised OVS benchmarks, raising the average score by 2.1 percentage points (44.8% → 46.9%) and the Cityscapes score by 6.3 percentage points (36.6% → 42.9%). Conclusion: Structural alignment can effectively restore DINO's latent boundary sensitivity, supporting the value of fusing structural priors from multiple models for open-vocabulary segmentation. Abstract: Open-Vocabulary Segmentation (OVS) aims to segment image regions beyond predefined category sets by leveraging semantic descriptions. While CLIP-based approaches excel in semantic generalization, they frequently lack the fine-grained spatial awareness required for dense prediction. Recent efforts have incorporated Vision Foundation Models (VFMs) like DINO to alleviate these limitations. However, these methods still struggle with the precise edge perception necessary for high-fidelity segmentation. In this paper, we analyze internal representations of DINO and discover that its inherent boundary awareness is not absent but rather undergoes progressive attenuation as features transition into deeper transformer blocks. To address this, we propose OVS-DINO, a novel framework that revitalizes latent edge-sensitivity of DINO through structural alignment with the Segment Anything Model (SAM). Specifically, we introduce a Structure-Aware Encoder (SAE) and a Structure-Modulated Decoder (SMD) to effectively activate boundary features of DINO using SAM's structural priors, complemented by a supervision strategy utilizing SAM-generated pseudo-masks. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple weakly-supervised OVS benchmarks, improving the average score by 2.1% (from 44.8% to 46.9%). Notably, our approach significantly enhances segmentation accuracy in complex, cluttered scenarios, with a gain of 6.3% on Cityscapes (from 36.6% to 42.9%).

[178] LAMP: Lift Image-Editing as General 3D Priors for Open-world Manipulation

Jingjing Wang,Zhengdong Hong,Chong Bao,Yuke Zhu,Junhan Sun,Guofeng Zhang

Main category: cs.CV

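TL;DR: This paper proposes LAMP, which uses image editing as a 3D prior to extract continuous, geometry-aware 3D transformations between objects, improving generalization in open-world robotic manipulation.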

Details Motivation: Existing learning-based methods (such as reinforcement learning, imitation learning, and vision-language-action models) generalize poorly to novel tasks and unseen environments, while large language models and vision-language models have limited 3D awareness and struggle to support fine-grained manipulation. Method: The LAMP framework lifts the 2D spatial cues implicit in image editing into continuous, geometry-aware 3D transformations that serve as a general representation for open-world manipulation. Result: Experiments show that LAMP outputs accurate 3D transformations and achieves strong zero-shot generalization on open-world manipulation tasks. Conclusion: By introducing image-editing-driven 3D priors, LAMP effectively improves a robot's ability to generalize to new tasks and new environments in the open world. Abstract: Human-like generalization in the open world remains a fundamental challenge for robotic manipulation. Existing learning-based methods, including reinforcement learning, imitation learning, and vision-language-action models (VLAs), often struggle with novel tasks and unseen environments. Another promising direction is to explore generalizable representations that capture fine-grained spatial and geometric relations for open-world manipulation. While large language models (LLMs) and vision-language models (VLMs) provide strong semantic reasoning based on language or annotated 2D representations, their limited 3D awareness restricts their applicability to fine-grained manipulation. To address this, we propose LAMP, which lifts image-editing as 3D priors to extract inter-object 3D transformations as continuous, geometry-aware representations. Our key insight is that image-editing inherently encodes rich 2D spatial cues, and lifting these implicit cues into 3D transformations provides fine-grained and accurate guidance for open-world manipulation. Extensive experiments demonstrate that LAMP delivers precise 3D transformations and achieves strong zero-shot generalization in open-world manipulation. Project page: https://zju3dv.github.io/LAMP/.

[179] Faithful GRPO: Improving Visual Spatial Reasoning in Multimodal Language Models via Constrained Policy Optimization

Sai Srinivas Kancheti,Aditya Kanade,Rohit Sinha,Vineeth N Balasubramanian,Tanuja Ganu

Main category: cs.CV

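TL;DR: This paper proposes Faithful GRPO (FGRPO), which uses Lagrangian dual ascent to impose logical-consistency and visual-grounding constraints during reinforcement learning, significantly improving both the quality of the chain-of-thought (CoT) generated by multimodal reasoning models and final answer accuracy.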

Details Motivation: Existing reinforcement-learning-based multimodal reasoning models (such as ViGoRL-Spatial, TreeVGR, and models trained with standard GRPO) improve answer accuracy, but their chains of thought are often inconsistent with the answer and poorly grounded in the image evidence, which undermines the trustworthiness of the reasoning. Method: Faithful GRPO (FGRPO) works within the Group Relative Policy Optimization (GRPO) framework. It models logical consistency (does the CoT entail the answer?) and visual grounding (does each step accurately describe the objects, attributes, and spatial relationships in the image?) as batch-level constraints, adjusts the constraint weights dynamically through Lagrangian dual ascent, and folds them into the advantage computation. Result: On Qwen2.5-VL-7B/3B models across seven spatial reasoning datasets, FGRPO reduces the CoT inconsistency rate from 24.5% to 1.7%, improves visual grounding scores by +13%, and also achieves higher final answer accuracy than standard GRPO. Conclusion: Enforcing logical consistency and visual grounding in the chain of thought improves reasoning quality and also benefits task performance; FGRPO offers an effective path toward trustworthy, interpretable multimodal reasoning models. Abstract: Multimodal reasoning models (MRMs) trained with reinforcement learning with verifiable rewards (RLVR) show improved accuracy on visual reasoning benchmarks. However, we observe that accuracy gains often come at the cost of reasoning quality: generated Chain-of-Thought (CoT) traces are frequently inconsistent with the final answer and poorly grounded in the visual evidence. We systematically study this phenomenon across seven challenging real-world spatial reasoning benchmarks and find that it affects contemporary MRMs such as ViGoRL-Spatial and TreeVGR, as well as our own models trained with standard Group Relative Policy Optimization (GRPO). We characterize CoT reasoning quality along two complementary axes: "logical consistency" (does the CoT entail the final answer?) and "visual grounding" (does each reasoning step accurately describe objects, attributes, and spatial relationships in the image?). To address this, we propose Faithful GRPO (FGRPO), a variant of GRPO that enforces consistency and grounding as constraints via Lagrangian dual ascent. FGRPO incorporates batch-level consistency and grounding constraints into the advantage computation within a group, adaptively adjusting the relative importance of constraints during optimization. We evaluate FGRPO on Qwen2.5-VL-7B and 3B backbones across seven spatial datasets. Our results show that FGRPO substantially improves reasoning quality, reducing the inconsistency rate from 24.5% to 1.7% and improving visual grounding scores by +13%. It also improves final answer accuracy over simple GRPO, demonstrating that faithful reasoning enables better answers.
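
The constrained update described in this abstract can be illustrated with a minimal sketch. The abstract does not give FGRPO's exact formulas, so everything below is an assumption made for illustration: the function names `dual_ascent_step` and `group_advantages`, the constraint form "batch mean score must reach a threshold", and the additive way multipliers enter the reward are all hypothetical.

```python
from statistics import mean, pstdev

def dual_ascent_step(lambdas, batch_scores, thresholds, lr=0.1):
    """One projected dual-ascent step (hypothetical form).

    Each constraint asks that the batch mean of a score (e.g. consistency)
    reach a threshold. A multiplier grows while its constraint is violated,
    shrinks when it is satisfied, and is clipped at zero.
    """
    return [
        max(0.0, lam + lr * (thr - mean(scores)))
        for lam, scores, thr in zip(lambdas, batch_scores, thresholds)
    ]

def group_advantages(task_rewards, constraint_scores, lambdas, eps=1e-8):
    """Group-relative advantages on a multiplier-weighted reward (assumed form)."""
    combined = [
        r + sum(lam * scores[i] for lam, scores in zip(lambdas, constraint_scores))
        for i, r in enumerate(task_rewards)
    ]
    mu, sigma = mean(combined), pstdev(combined)
    # Standard GRPO-style normalization within the sampled group.
    return [(c - mu) / (sigma + eps) for c in combined]

# Toy group of four rollouts: per-rollout consistency and grounding scores.
consistency = [1.0, 0.0, 0.0, 1.0]
grounding = [1.0, 1.0, 1.0, 1.0]
lambdas = dual_ascent_step([0.0, 0.0], [consistency, grounding], [0.9, 0.9])
advantages = group_advantages([1.0, 1.0, 0.0, 0.0], [consistency, grounding], lambdas)
```

In this toy group the consistency constraint is violated, so its multiplier rises, and a rollout that is both correct and consistent receives a larger advantage than one that is only correct.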

[180] Novel View Synthesis as Video Completion

Qi Wu,Khiem Vuong,Minsik Jeon,Srinivasa Narasimhan,Deva Ramanan

Main category: cs.CV

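TL;DR: This paper proposes FrameCrafter, which formulates sparse novel view synthesis (NVS) as a low frame-rate video completion task. Architectural changes to a video diffusion model (such as per-frame latent encodings and removal of temporal positional embeddings) make it invariant to the order of the input views, so the multi-view knowledge implicit in video models can be used efficiently, yielding competitive results on sparse NVS.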

Details Motivation: Existing methods built on single-image diffusion models lack multi-view knowledge, whereas video diffusion models inherently contain multi-view information and are easier to adapt to NVS. Method: Sparse NVS is formulated as a low frame-rate video completion problem, and the FrameCrafter architecture uses per-frame latent encodings and removes temporal positional embeddings, among other changes, so that an originally order-sensitive video model becomes invariant to permutations of the input views. Result: With minimal supervision, video models can "forget" about time and reach competitive performance on multiple sparse novel view synthesis benchmarks. Conclusion: The implicit multi-view priors in video diffusion models can be exploited effectively and adapted to sparse NVS without complex training, supporting their use as general 3D-aware priors. Abstract: We tackle the problem of sparse novel view synthesis (NVS) using video diffusion models; given $K$ ($\approx 5$) multi-view images of a scene and their camera poses, we predict the view from a target camera pose. Many prior approaches leverage generative image priors encoded via diffusion models. However, models trained on single images lack multi-view knowledge. We instead argue that video models already contain implicit multi-view knowledge and so should be easier to adapt for NVS. Our key insight is to formulate sparse NVS as a low frame-rate video completion task. However, one challenge is that sparse NVS is defined over an unordered set of inputs, often too sparse to admit a meaningful order, so the models should be $\textit{invariant}$ to permutations of that input set. To this end, we present FrameCrafter, which adapts video models (naturally trained with coherent frame orderings) to permutation-invariant NVS through several architectural modifications, including per-frame latent encodings and removal of temporal positional embeddings. Our results suggest that video models can be easily trained to "forget" about time with minimal supervision, producing competitive performance on sparse-view NVS benchmarks. Project page: https://frame-crafter.github.io/
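
Why removing temporal positional embeddings helps with unordered inputs can be seen in a toy example. The sketch below is not FrameCrafter's architecture; it is a hypothetical single-head self-attention with identity projections (the function `self_attention` is illustrative only), showing that without positional embeddings, shuffling the input frames merely shuffles the outputs in the same way.

```python
import math

def self_attention(tokens):
    """Single-head dot-product self-attention with identity projections
    and no positional embeddings (toy illustration)."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * value[d] for w, value in zip(weights, tokens)) / total
            for d in range(dim)
        ])
    return outputs

# Three toy "frame" embeddings and a shuffled copy of them.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perm = [2, 0, 1]
shuffled = [frames[i] for i in perm]

out = self_attention(frames)
out_shuffled = self_attention(shuffled)
expected = [out[i] for i in perm]  # the same outputs, reordered like the inputs
```

Each frame's output depends only on the set of frames, not on where it sits in the sequence, which is the property an order-free set of input views requires.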

[181] Quantifying Explanation Consistency: The C-Score Metric for CAM-Based Explainability in Medical Image Classification

Kabilan Elangovan,Daniel Ting

Main category: cs.CV

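TL;DR: This paper proposes C-Score, an annotation-free, confidence-weighted consistency metric that measures whether class activation mapping (CAM) methods apply a consistent spatial reasoning strategy across different patients with the same pathology, and reveals several forms of dissociation between classification performance (AUC) and explanation consistency.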

Details Motivation: Existing CAM evaluation methods focus only on localization accuracy (for example, agreement with radiologist annotations) and ignore whether the model applies a consistent spatial reasoning strategy to cases of the same class, that is, explanation consistency. Method: C-Score quantifies intra-class explanation reproducibility via intensity-emphasised pairwise soft IoU across correctly classified samples. On the Kermany chest X-ray dataset, six CAM methods and three CNN architectures are evaluated systematically over 30 training epochs, covering transfer learning and fine-tuning phases. Result: Three mechanisms of AUC-consistency dissociation are identified: threshold-mediated gold list collapse, technique-specific attribution collapse at peak AUC, and class-level consistency loss masked by global aggregation. C-Score can warn of model instability one checkpoint before AUC collapses (for example, the deterioration of ScoreCAM on ResNet50V2). Conclusion: C-Score is an effective, annotation-free, clinically practical approach to consistency evaluation that gives a stability warning earlier than the performance drop and supports architecture selection and clinical deployment decisions based on explanation quality rather than predictive ranking alone. Abstract: Class Activation Mapping (CAM) methods are widely used to generate visual explanations for deep learning classifiers in medical imaging. However, existing evaluation frameworks assess whether explanations are correct, measured by localisation fidelity against radiologist annotations, rather than whether they are consistent: whether the model applies the same spatial reasoning strategy across different patients with the same pathology. We propose the C-Score (Consistency Score), a confidence-weighted, annotation-free metric that quantifies intra-class explanation reproducibility via intensity-emphasised pairwise soft IoU across correctly classified instances. We evaluate six CAM techniques (GradCAM, GradCAM++, LayerCAM, EigenCAM, ScoreCAM, and MS GradCAM++) across three CNN architectures (DenseNet201, InceptionV3, ResNet50V2) over thirty training epochs on the Kermany chest X-ray dataset, covering transfer learning and fine-tuning phases. We identify three distinct mechanisms of AUC-consistency dissociation, invisible to standard classification metrics: threshold-mediated gold list collapse, technique-specific attribution collapse at peak AUC, and class-level consistency masking in global aggregation. C-Score provides an early warning signal of impending model instability: ScoreCAM deterioration on ResNet50V2 is detectable one full checkpoint before catastrophic AUC collapse. It also yields architecture-specific clinical deployment recommendations grounded in explanation quality rather than predictive ranking alone.
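
A pairwise soft-IoU consistency score of the kind named in this abstract might look roughly like the sketch below. The abstract does not define the metric precisely, so the specifics here are assumptions: the exponent `gamma` standing in for "intensity emphasis", the product of two confidences as the pair weight, and the function names `soft_iou` and `consistency_score` are all hypothetical.

```python
from itertools import combinations

def soft_iou(map_a, map_b, gamma=2.0, eps=1e-8):
    """Soft IoU between two flattened heatmaps with values in [0, 1].

    Raising intensities to `gamma` emphasizes strongly activated pixels
    (an assumed reading of "intensity-emphasised").
    """
    a = [x ** gamma for x in map_a]
    b = [x ** gamma for x in map_b]
    intersection = sum(min(x, y) for x, y in zip(a, b))
    union = sum(max(x, y) for x, y in zip(a, b))
    return intersection / (union + eps)

def consistency_score(maps, confidences, gamma=2.0):
    """Confidence-weighted mean of pairwise soft IoU within one class."""
    weighted_sum = 0.0
    weight_total = 0.0
    for i, j in combinations(range(len(maps)), 2):
        weight = confidences[i] * confidences[j]
        weighted_sum += weight * soft_iou(maps[i], maps[j], gamma)
        weight_total += weight
    return weighted_sum / weight_total if weight_total else 0.0

# Identical heatmaps are fully consistent; disjoint heatmaps are not.
same = consistency_score([[0.0, 0.5, 1.0], [0.0, 0.5, 1.0]], [0.9, 0.8])
disjoint = consistency_score([[1.0, 0.0], [0.0, 1.0]], [0.9, 0.8])
```

Only correctly classified instances of one class would be passed in, so the score reflects how reproducible the explanation is across patients rather than how accurate the classifier is.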

[182] Phantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics

Ying Shen,Jerry Xiong,Tianjiao Yu,Ismini Lourentzou

Main category: cs.CV

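TL;DR: This paper proposes Phantom, a model that jointly models visual content and latent physical dynamics during video generation, improving both the physical consistency and the visual realism of generated videos.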

Details Motivation: Existing video generation models achieve high visual fidelity but neither understand nor obey real-world physical laws, which leads to unrealistic motion; scaling data and model size alone does not solve this. Method: Phantom (Physics-Infused Video Generation) integrates inference of latent physical properties directly into the video generation pipeline. It introduces a physics-aware video representation as an abstract yet informative physical embedding, allowing physical dynamics and video frames to be predicted jointly without explicit physical equations. Result: On standard video generation and physics-aware benchmarks, Phantom outperforms existing methods in consistency with physical dynamics while keeping competitive perceptual quality. Conclusion: Building physical reasoning into the generation process improves the physical plausibility of videos without sacrificing visual realism, and offers a new paradigm for generative models with physical common sense. Abstract: Recent advances in generative video modeling, driven by large-scale datasets and powerful architectures, have yielded remarkable visual realism. However, emerging evidence suggests that simply scaling data and model size does not endow these systems with an understanding of the underlying physical laws that govern real-world dynamics. Existing approaches often fail to capture or enforce such physical consistency, resulting in unrealistic motion and dynamics. In this work, we investigate whether integrating the inference of latent physical properties directly into the video generation process can equip models with the ability to produce physically plausible videos. To this end, we propose Phantom, a Physics-Infused Video Generation model that jointly models the visual content and latent physical dynamics. Conditioned on observed video frames and inferred physical states, Phantom jointly predicts latent physical dynamics and generates future video frames. Phantom leverages a physics-aware video representation that serves as an abstract yet informative embedding of the underlying physics, facilitating the joint prediction of physical dynamics alongside video content without requiring an explicit specification of a complex set of physical dynamics and properties. By integrating the inference of physics-aware video representation directly into the video generation process, Phantom produces video sequences that are both visually realistic and physically consistent. Quantitative and qualitative results on both standard video generation and physics-aware benchmarks demonstrate that Phantom not only outperforms existing methods in terms of adherence to physical dynamics but also delivers competitive perceptual fidelity.

[183] Visually-grounded Humanoid Agents

Hang Ye,Xiaoxuan Ma,Fan Lu,Wayne Wu,Kwan-Yee Lin,Yizhou Wang

Main category: cs.CV

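TL;DR: This paper proposes Visually-grounded Humanoid Agents, a two-layer architecture that lets digital humans carry out goal-directed behavior autonomously and naturally in novel 3D scenes, using only visual observations and specified goals.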

Details Motivation: Most existing digital human systems are passively driven, relying on state information or scripted control, and scale poorly to new environments; this work aims for active, embodied digital human behavior based only on vision and goals. Method: A World Layer reconstructs semantically rich 3D Gaussian scenes and supports animatable Gaussian human avatars, and an Agent Layer gives the avatars first-person RGB-D perception, performs embodied planning with spatial awareness and iterative reasoning, and generates full-body actions to execute the plan. Result: Experiments on a self-built benchmark show the method outperforms ablated models and state-of-the-art planning methods in task success rate and collision rate. Conclusion: This work enables active digital humans that can be deployed at scale, advances human-centric embodied AI, and will open-source its data, code, and models. Abstract: Digital human generation has been studied for decades and supports a wide range of real-world applications. However, most existing systems are passively animated, relying on privileged state or scripted control, which limits scalability to novel environments. We instead ask: how can digital humans actively behave using only visual observations and specified goals in novel scenes? Achieving this would enable populating any 3D environment with digital humans at scale that exhibit spontaneous, natural, goal-directed behaviors. To this end, we introduce Visually-grounded Humanoid Agents, a coupled two-layer (world-agent) paradigm that replicates humans at multiple levels: they look, perceive, reason, and behave like real people in real-world 3D scenes. The World Layer reconstructs semantically rich 3D Gaussian scenes from real-world videos via an occlusion-aware pipeline and accommodates animatable Gaussian-based human avatars. The Agent Layer transforms these avatars into autonomous humanoid agents, equipping them with first-person RGB-D perception and enabling them to perform accurate, embodied planning with spatial awareness and iterative reasoning, which is then executed at the low level as full-body actions to drive their behaviors in the scene. We further introduce a benchmark to evaluate humanoid-scene interaction in diverse reconstructed environments. Experiments show our agents achieve robust autonomous behavior, yielding higher task success rates and fewer collisions than ablations and state-of-the-art planning methods. This work enables active digital human population and advances human-centric embodied AI. Data, code, and models will be open-sourced.

[184] When Fine-Tuning Changes the Evidence: Architecture-Dependent Semantic Drift in Chest X-Ray Explanations

Kabilan Elangovan,Daniel Ting

Main category: cs.CV

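TL;DR: This paper studies the stability of a model's attribution structure during transfer learning and fine-tuning in medical image classification, introduces the concept of "semantic drift", and quantifies changes in the spatial localization and structural consistency of attribution maps across architectures on a five-class chest X-ray task.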

Details Motivation: In multi-class medical image classification, transfer learning followed by fine-tuning improves accuracy, but stable classification performance does not mean that the visual evidence the model relies on (its attribution structure) is stable; potential shifts in the model's reasoning need attention. Method: On a five-class chest X-ray task, DenseNet201, ResNet50V2, and InceptionV3 are trained with a two-stage protocol (transfer learning, then full fine-tuning), and reference-free metrics (such as IoU) quantify changes in the spatial localization and structural consistency of LayerCAM and GradCAM++ attribution maps. Result: Coarse anatomical localization remains stable, but fine-grained attribution reorganization is pronounced and architecture-dependent; stability rankings under different attribution methods (LayerCAM vs. GradCAM++) can reverse at the same predictive performance. Conclusion: Explanation stability results from the interaction of network architecture, optimization phase, and attribution objective, so model trustworthiness cannot be judged from classification accuracy alone; semantic drift should be a key dimension of explainability evaluation in medical AI. Abstract: Transfer learning followed by fine-tuning is widely adopted in medical image classification due to consistent gains in diagnostic performance. However, in multi-class settings with overlapping visual features, improvements in accuracy do not guarantee stability of the visual evidence used to support predictions. We define semantic drift as systematic changes in the attribution structure supporting a model's predictions between transfer learning and full fine-tuning, reflecting potential shifts in underlying visual reasoning despite stable classification performance. Using a five-class chest X-ray task, we evaluate DenseNet201, ResNet50V2, and InceptionV3 under a two-stage training protocol and quantify drift with reference-free metrics capturing spatial localization and structural consistency of attribution maps. Across architectures, coarse anatomical localization remains stable, while overlap IoU reveals pronounced architecture-dependent reorganization of evidential structure. Beyond single-method analysis, stability rankings can reverse across LayerCAM and GradCAM++ under converged predictive performance, establishing explanation stability as an interaction between architecture, optimization phase, and attribution objective.

[185] MolmoWeb: Open Visual Web Agent and Open Data for the Open Web

Tanmay Gupta,Piper Wolters,Zixian Ma,Peter Sushko,Rock Yuren Pang,Diego Llanes,Yue Yang,Taira Anderson,Boyuan Zheng,Zhongzheng Ren,Harsh Trivedi,Taylor Blanton,Caleb Ouellette,Winson Han,Ali Farhadi,Ranjay Krishna

Main category: cs.CV

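TL;DR: This paper introduces the MolmoWebMix dataset and MolmoWeb, an open-source multimodal web agent model, to advance open and reproducible web agent research. The model predicts actions from webpage screenshots and instructions alone, reaches state-of-the-art performance on several benchmarks, and all resources will be open-sourced.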

Details Motivation: Today's high-performing web agents mostly depend on closed-source models, which limits scientific transparency, reproducibility, and community collaboration; the authors argue for building fully open web agent systems. Method: The authors build the large-scale open dataset MolmoWebMix (over 100K synthetic trajectories, 30K+ human demonstrations, and GUI perception data) and train MolmoWeb, an instruction-conditioned vision-language action policy (4B/8B parameters) that takes only a webpage screenshot and a task instruction and directly predicts browser actions, with no HTML or accessibility tree. Result: MolmoWeb outperforms open models of similar scale (such as Fara-7B) on WebVoyager, Online-Mind2Web, DeepShop, and other benchmarks, and MolmoWeb-8B outperforms closed SoM agents built on GPT-4o; with test-time parallel rollouts and best-of-N selection, pass@4 improves markedly (94.7% on WebVoyager). Conclusion: The MolmoWeb family is the first high-performing, fully open multimodal web agent solution, released with models, data, code, and a unified evaluation harness, laying a foundation for open web agent research. Abstract: Web agents, autonomous systems that navigate and execute tasks on the web on behalf of users, have the potential to transform how people interact with the digital world. However, the most capable web agents today rely on proprietary models with undisclosed training data and recipes, limiting scientific understanding, reproducibility, and community-driven progress. We believe agents for the open web should be built in the open. To this end, we introduce (1) MolmoWebMix, a large and diverse mixture of browser task demonstrations and web-GUI perception data and (2) MolmoWeb, a family of fully open multimodal web agents. Specifically, MolmoWebMix combines over 100K synthetic task trajectories from multiple complementary generation pipelines with 30K+ human demonstrations, atomic web-skill trajectories, and GUI perception data, including referring expression grounding and screenshot question answering. MolmoWeb agents operate as instruction-conditioned visual-language action policies: given a task instruction and a webpage screenshot, they predict the next browser action, requiring no access to HTML, accessibility trees, or specialized APIs. Available in 4B and 8B sizes, MolmoWeb agents achieve state-of-the-art results on browser-use benchmarks like WebVoyager, Online-Mind2Web, and DeepShop, outperforming similar-scale open-weight-only models such as Fara-7B, UI-Tars-1.5-7B, and Holo1-7B. MolmoWeb-8B also surpasses set-of-marks (SoM) agents built on much larger closed frontier models like GPT-4o. We further demonstrate consistent gains through test-time scaling via parallel rollouts with best-of-N selection, achieving 94.7% and 60.5% pass@4 (compared to 78.2% and 35.3% pass@1) on WebVoyager and Online-Mind2Web respectively. We will release model checkpoints, training data, code, and a unified evaluation harness to enable reproducibility and accelerate open research on web agents.

[186] UniversalVTG: A Universal and Lightweight Foundation Model for Video Temporal Grounding

Joungbin An,Agrim Jain,Kristen Grauman

Main category: cs.CV

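TL;DR: This paper proposes UniversalVTG, a lightweight, unified video temporal grounding model pretrained across datasets. With an offline Query Unifier and an efficient grounding head, it reaches state of the art on multiple benchmarks while being more than 100 times smaller than MLLM-based methods.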

Details Motivation: Existing video temporal grounding (VTG) methods rely on dataset-specific models that generalize poorly, while methods based on large multimodal language models (MLLMs) are computationally expensive, have limited video context, and struggle with long videos. Method: UniversalVTG uses large-scale cross-dataset pretraining; an offline Query Unifier maps heterogeneous query formats into a shared declarative space to reduce linguistic mismatch and negative transfer; and an efficient grounding head supports long videos. Result: On GoalStep-StepGrounding, Ego4D-NLQ, TACoS, Charades-STA, ActivityNet-Captions, and other benchmarks, a single UniversalVTG checkpoint reaches state of the art; it is more than 100 times smaller than MLLM-based methods while matching or exceeding their accuracy. Conclusion: UniversalVTG demonstrates that lightweight modeling with unified supervision is effective, offering a practical VTG alternative that balances performance, efficiency, and generalization. Abstract: Video temporal grounding (VTG) is typically tackled with dataset-specific models that transfer poorly across domains and query styles. Recent efforts to overcome this limitation have adapted large multimodal language models (MLLMs) to VTG, but their high compute cost and limited video context still hinder long-video grounding. We instead scale unified supervision while keeping the model lightweight. We present UniversalVTG, a single VTG model trained with large-scale cross-dataset pretraining. An offline Query Unifier canonicalizes heterogeneous query formats into a shared declarative space, reducing linguistic mismatch and preventing the negative transfer observed under naïve joint training. Combined with an efficient grounding head, UniversalVTG scales to long, untrimmed videos. Across diverse benchmarks (GoalStep-StepGrounding, Ego4D-NLQ, TACoS, Charades-STA, and ActivityNet-Captions), one UniversalVTG checkpoint achieves state-of-the-art performance versus dedicated VTG models. Moreover, despite being $>100\times$ smaller than recent MLLM-based approaches, UniversalVTG matches or exceeds their accuracy on multiple benchmarks, offering a practical alternative to parameter-heavy MLLMs.

[187] FIT: A Large-Scale Dataset for Fit-Aware Virtual Try-On

Johanna Karras,Yuanhao Wang,Yingwei Li,Ira Kemelmacher-Shlizerman

Main category: cs.CV

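TL;DR: This paper presents FIT, the first virtual try-on dataset that covers "ill-fit" cases, with 1.13M image triplets and precise size annotations. It uses physics simulation and re-texturing to generate photorealistic, geometry-preserving synthetic images, and builds the first size-aware virtual try-on baseline model.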

Details Motivation: Existing virtual try-on methods overlook how garment size matches body size (for example, a large shirt on a small person), a key part of the experience, mainly because no dataset provides precise size annotations, especially for "ill-fit" cases. Method: The FIT dataset is built in three steps: (1) 3D garments are generated programmatically with GarmentCode and draped via physics simulation to capture realistic wear; (2) a novel re-texturing framework turns synthetic renderings into photorealistic images while strictly preserving geometry; (3) person identity preservation is added to re-texturing to generate paired images of the same person in different garments for supervised training. Finally, a size-aware VTO baseline is trained on FIT. Result: The work delivers the first large-scale virtual try-on dataset with precise body and garment measurements (1.13M triplets), geometry-preserving photorealistic synthetic image generation, and the first virtual try-on baseline that reflects actual garment fit, setting a new state of the art. Conclusion: This paper is the first to systematically address the missing modeling of garment fit in virtual try-on; the FIT dataset and accompanying methods provide a new benchmark and technical path for size-aware virtual try-on. Abstract: Given a person and a garment image, virtual try-on (VTO) aims to synthesize a realistic image of the person wearing the garment, while preserving their original pose and identity. Although recent VTO methods excel at visualizing garment appearance, they largely overlook a crucial aspect of the try-on experience: the accuracy of garment fit, for example, depicting how an extra-large shirt looks on an extra-small person. A key obstacle is the absence of datasets that provide precise garment and body size information, particularly for "ill-fit" cases, where garments are significantly too large or too small. Consequently, current VTO methods default to generating well-fitted results regardless of the garment or person size. In this paper, we take the first steps towards solving this open problem. We introduce FIT (Fit-Inclusive Try-on), a large-scale VTO dataset comprising over 1.13M try-on image triplets accompanied by precise body and garment measurements. We overcome the challenges of data collection via a scalable synthetic strategy: (1) We programmatically generate 3D garments using GarmentCode and drape them via physics simulation to capture realistic garment fit. (2) We employ a novel re-texturing framework to transform synthetic renderings into photorealistic images while strictly preserving geometry. (3) We introduce person identity preservation into our re-texturing model to generate paired person images (same person, different garments) for supervised training. Finally, we leverage our FIT dataset to train a baseline fit-aware virtual try-on model. Our data and results set the new state-of-the-art for fit-aware virtual try-on, as well as offer a robust benchmark for future research. We will make all data and code publicly available on our project page: https://johannakarras.github.io/FIT.

[188] Self-Improving 4D Perception via Self-Distillation

Nan Huang,Pengcheng Yu,Weijia Zeng,James M. Rehg,Angjoo Kanazawa,Haiwen Feng,Qianqian Wang

Main category: cs.CV

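TL;DR: SelfEvo is a self-improving framework that needs no labeled data; it performs self-distillation using spatiotemporal context asymmetry and significantly improves multi-view 4D reconstruction models on dynamic scenes.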

Details Motivation: Existing large-scale multi-view reconstruction models depend heavily on expensive and scarce 3D/4D ground-truth annotations, which makes them hard to scale, especially for dynamic scenes. Method: The SelfEvo framework uses a self-distillation mechanism based on spatiotemporal context asymmetry, combined with systematic choices of loss design, form of asymmetry, and training strategy, to keep improving pretrained models on unlabeled videos. Result: Across eight cross-domain benchmarks it consistently improves several baseline models (such as VGGT and $\pi^3$), with relative gains of up to 36.5% in video depth estimation and 20.1% in camera pose estimation. Conclusion: SelfEvo shows that unsupervised self-improvement is effective for 4D perception and offers a scalable, annotation-free paradigm for large-scale dynamic scene reconstruction. Abstract: Large-scale multi-view reconstruction models have made remarkable progress, but most existing approaches still rely on fully supervised training with ground-truth 3D/4D annotations. Such annotations are expensive and particularly scarce for dynamic scenes, limiting scalability. We propose SelfEvo, a self-improving framework that continually improves pretrained multi-view reconstruction models using unlabeled videos. SelfEvo introduces a self-distillation scheme using spatiotemporal context asymmetry, enabling self-improvement for learning-based 4D perception without external annotations. We systematically study design choices that make self-improvement effective, including loss signals, forms of asymmetry, and other training strategies. Across eight benchmarks spanning diverse datasets and domains, SelfEvo consistently improves pretrained baselines and generalizes across base models (e.g., VGGT and $\pi^3$), with significant gains on dynamic scenes. Overall, SelfEvo achieves up to 36.5% relative improvement in video depth estimation and 20.1% in camera estimation, without using any labeled data. Project Page: https://self-evo.github.io/.

[189] RewardFlow: Generate Images by Optimizing What You Reward

Onkar Susladkar,Dong-Hwan Jang,Tushar Prakash,Adheesh Juvekar,Vedant Shah,Ayush Barik,Nabeel Bashir,Muntasir Wahed,Ritish Shrirao,Ismini Lourentzou

Main category: cs.CV

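TL;DR: RewardFlow is an inference-time guidance framework that requires no model inversion. It steers pretrained diffusion and flow-matching models through multi-reward Langevin dynamics, combining differentiable rewards for semantic alignment, perceptual fidelity, localized grounding, object consistency, and human preference, and adds a differentiable VQA reward. A prompt-aware adaptive policy dynamically adjusts reward weights and step sizes, reaching state-of-the-art performance on image editing and compositional generation tasks.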

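Details Motivation: Existing image editing and generation methods often depend on model inversion or struggle to coordinate multiple reward objectives (such as semantics, perception, grounding, consistency, and preference), and they lack fine-grained semantic supervision and a dynamic optimization mechanism. Method: The RewardFlow framework (1) applies multi-reward Langevin dynamics at inference time; (2) unifies several differentiable rewards and adds a new differentiable VQA reward; and (3) uses a prompt-aware adaptive policy that extracts semantic primitives from the instruction, infers edit intent, and dynamically adjusts each reward weight and the sampling step size. Result: On multiple image editing and compositional generation benchmarks, RewardFlow reaches state-of-the-art edit fidelity and compositional alignment. Conclusion: RewardFlow shows that inversion-free, coordinated multi-reward optimization is feasible and efficient, providing a more flexible, robust, and semantically fine-grained inference-time control paradigm for controllable image generation. Abstract: We introduce RewardFlow, an inversion-free framework that steers pretrained diffusion and flow-matching models at inference time through multi-reward Langevin dynamics. RewardFlow unifies complementary differentiable rewards for semantic alignment, perceptual fidelity, localized grounding, object consistency, and human preference, and further introduces a differentiable VQA-based reward that provides fine-grained semantic supervision through language-vision reasoning. To coordinate these heterogeneous objectives, we design a prompt-aware adaptive policy that extracts semantic primitives from the instruction, infers edit intent, and dynamically modulates reward weights and step sizes throughout sampling. Across several image editing and compositional generation benchmarks, RewardFlow delivers state-of-the-art edit fidelity and compositional alignment.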
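
The generic update behind multi-reward Langevin dynamics can be sketched in a few lines. This is the textbook Langevin step applied to a weighted sum of reward gradients, not RewardFlow's actual implementation: the function `langevin_step`, the toy quadratic rewards, and the fixed weights and step size are assumptions for illustration (the abstract says the real weights and step sizes are modulated adaptively, and the real rewards are learned models acting on diffusion or flow-matching states).

```python
import math
import random

def langevin_step(x, reward_grads, weights, step, rng, noise_scale=1.0):
    """x <- x + step * sum_k w_k * grad r_k(x) + noise_scale * sqrt(2 * step) * xi."""
    grads = [grad(x) for grad in reward_grads]
    total = [
        sum(w * g[i] for w, g in zip(weights, grads))
        for i in range(len(x))
    ]
    return [
        xi + step * gi + noise_scale * math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
        for xi, gi in zip(x, total)
    ]

def quadratic_reward_grad(target):
    """Gradient of the toy reward r(x) = -||x - target||^2 / 2."""
    return lambda x: [t - xi for xi, t in zip(x, target)]

# Two toy rewards pulling toward different targets, with equal weights.
rewards = [quadratic_reward_grad([1.0, 1.0]), quadratic_reward_grad([3.0, 3.0])]
rng = random.Random(0)
x = [0.0, 0.0]
for _ in range(200):
    # noise_scale=0.0 turns the step into plain gradient ascent for a clear demo.
    x = langevin_step(x, rewards, [1.0, 1.0], step=0.1, rng=rng, noise_scale=0.0)
```

With the noise switched off, the iterate settles at the weighted compromise between the two targets, which shows how the weights trade one reward against another; a non-zero `noise_scale` restores the stochastic exploration of a true Langevin sampler.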

[190] ParseBench: A Document Parsing Benchmark for AI Agents

Boyang Zhang,Sebastián G. Acosta,Preston Carlson,Sacha Bron,Pierre-Loïc Doulcet,Simon Suo

Main category: cs.CV

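TL;DR: This paper introduces ParseBench, a new document parsing benchmark designed for the needs of AI agents. It emphasizes semantic correctness across five dimensions (tables, charts, content faithfulness, semantic formatting, and visual grounding), evaluates 14 methods on about 2,000 pages of real enterprise documents, and reveals that current systems have fragmented capabilities and significant gaps.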

Details Motivation: Existing document parsing benchmarks do not meet the semantic correctness that AI agents require (accurate table structure, chart data, semantic formatting, and visual grounding); they rely on narrow document distributions and text-similarity metrics and miss agent-critical errors. Method: The ParseBench benchmark contains about 2,000 human-verified pages of enterprise documents (insurance, finance, and government), organized around five capability dimensions (tables, charts, content faithfulness, semantic formatting, and visual grounding); 14 methods are evaluated, including multimodal models, specialized parsers, and LlamaParse. Result: Current capabilities are highly fragmented: no method performs consistently well on all five dimensions. LlamaParse Agentic leads with the highest overall score, but significant capability gaps remain. Conclusion: ParseBench fills the gap in agent-oriented document parsing evaluation, exposes the core challenges of semantic parsing and where to improve, and pushes toward more reliable, task-driven document understanding. Abstract: AI agents are changing the requirements for document parsing. What matters is semantic correctness: parsed output must preserve the structure and meaning needed for autonomous decisions, including correct table structure, precise chart data, semantically meaningful formatting, and visual grounding. Existing benchmarks do not fully capture this setting for enterprise automation, relying on narrow document distributions and text-similarity metrics that miss agent-critical failures. We introduce ParseBench, a benchmark of approximately 2,000 human-verified pages from enterprise documents spanning insurance, finance, and government, organized around five capability dimensions: tables, charts, content faithfulness, semantic formatting, and visual grounding. Across 14 methods spanning vision-language models, specialized document parsers, and LlamaParse, the benchmark reveals a fragmented capability landscape: no method is consistently strong across all five dimensions. LlamaParse Agentic achieves the highest overall score at \agenticoverall\%, and the benchmark highlights the remaining capability gaps across current systems. Dataset and evaluation code are available on HuggingFace (https://huggingface.co/datasets/llamaindex/ParseBench) and GitHub (https://github.com/run-llama/ParseBench).

[191] Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction

Tao Xie,Peishan Yang,Yudong Jin,Yingfeng Cai,Wei Yin,Weiqiang Ren,Qian Zhang,Wei Hua,Sida Peng,Xiaoyang Guo,Xiaowei Zhou

Main category: cs.CV

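TL;DR: This paper proposes a neural global context representation, realized by lightweight sub-networks that adapt quickly at test time through self-supervision, to improve the accuracy and consistency of large-scale 3D scene reconstruction from long video sequences.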

Details Motivation: Existing feed-forward reconstruction models are limited by memory capacity and cannot model global context, so they struggle to keep reconstructions accurate and consistent over long video sequences; inspired by how humans use global scene understanding to support local perception, an efficient global context representation is needed. Method: A neural global context representation efficiently compresses and retains long-range scene information; it is realized by a set of lightweight neural sub-networks that adapt quickly at test time through self-supervised objectives. Result: On large-scale benchmarks such as KITTI Odometry and Oxford Spires, the method achieves leading pose accuracy and state-of-the-art 3D reconstruction accuracy while remaining efficient. Conclusion: The method improves reconstruction consistency and accuracy on long sequences and confirms the key role of global context modeling in large-scale 3D reconstruction. Abstract: This paper addresses the task of large-scale 3D scene reconstruction from long video sequences. Recent feed-forward reconstruction models have shown promising results by directly regressing 3D geometry from RGB images without explicit 3D priors or geometric constraints. However, these methods often struggle to maintain reconstruction accuracy and consistency over long sequences due to limited memory capacity and the inability to effectively capture global contextual cues. In contrast, humans can naturally exploit the global understanding of the scene to inform local perception. Motivated by this, we propose a novel neural global context representation that efficiently compresses and retains long-range scene information, enabling the model to leverage extensive contextual cues for enhanced reconstruction accuracy and consistency. The context representation is realized through a set of lightweight neural sub-networks that are rapidly adapted during test time via self-supervised objectives, which substantially increases memory capacity without incurring significant computational overhead. The experiments on multiple large-scale benchmarks, including the KITTI Odometry and Oxford Spires datasets, demonstrate the effectiveness of our approach in handling ultra-large scenes, achieving leading pose accuracy and state-of-the-art 3D reconstruction accuracy while maintaining efficiency. Code is available at https://zju3dv.github.io/scal3r.

[192] E-3DPSM: A State Machine for Event-Based Egocentric 3D Human Pose Estimation

Mayur Deshmukh,Hiroyasu Akada,Helge Rhodin,Christian Theobalt,Vladislav Golyanik

Main category: cs.CV

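TL;DR: This paper proposes E-3DPSM, a continuous pose state machine tailored to event streams, for monocular egocentric 3D human pose estimation on head-mounted devices, significantly improving accuracy and temporal stability.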

Details Motivation: Existing methods are not well adapted to the asynchronous, continuous nature of event camera data, which leads to low 3D estimation accuracy and sensitivity to self-occlusion and temporal jitter, falling short of the needs of applications such as VR/AR. Method: E-3DPSM, an event-driven continuous pose state machine, aligns continuous human motion with fine-grained event dynamics; it evolves latent states and predicts continuous changes in the 3D joint positions associated with events, then fuses them with direct 3D pose predictions for stable, drift-free reconstruction. Result: State of the art on two benchmarks, improving accuracy by up to 19% (MPJPE) and temporal stability by up to 2.7x, while running in real time at 80 Hz. Conclusion: E-3DPSM overcomes the mismatch in event-stream modeling and provides a more robust and efficient paradigm for event-camera-based egocentric 3D human pose estimation. Abstract: Event cameras offer multiple advantages in monocular egocentric 3D human pose estimation from head-mounted devices, such as millisecond temporal resolution, high dynamic range, and negligible motion blur. Existing methods effectively leverage these properties, but suffer from low 3D estimation accuracy, insufficient in many applications (e.g., immersive VR/AR). This is due to the design not being fully tailored towards event streams (e.g., their asynchronous and continuous nature), leading to high sensitivity to self-occlusions and temporal jitter in the estimates. This paper rethinks the setting and introduces E-3DPSM, an event-driven continuous pose state machine for event-based egocentric 3D human pose estimation. E-3DPSM aligns continuous human motion with fine-grained event dynamics; it evolves latent states and predicts continuous changes in 3D joint positions associated with observed events, which are fused with direct 3D human pose predictions, leading to stable and drift-free final 3D pose reconstructions. E-3DPSM runs in real-time at 80 Hz on a single workstation and sets a new state of the art in experiments on two benchmarks, improving accuracy by up to 19% (MPJPE) and temporal stability by up to 2.7x. See our project page for the source code and trained models.

[193] Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

Shilin Yan,Jintao Tong,Hongwei Xue,Xiaojun Tang,Yangyang Wang,Kunyu Shi,Guannan Zhang,Ruixuan Li,Yixiong Zou

Main category: cs.CV

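TL;DR: This paper proposes the HDPO framework, which decouples accuracy and efficiency into two optimization channels to address tool overuse in multimodal agents, sharply reducing the number of tool calls while improving reasoning accuracy.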

Details Motivation: 现有多模态智能体存在元认知缺陷,难以判断何时应依赖内部知识、何时需调用外部工具,导致盲目调用工具、延迟高、噪声大;而现有基于标量奖励的强化学习方法无法兼顾准确率与工具使用效率。 Method: 提出HDPO框架,摒弃奖励标量化,构建两个正交优化通道:准确性通道(最大化任务正确率)和效率通道(仅在准确轨迹内通过条件优势估计强制执行经济性),从而形成认知课程学习机制。 Result: 所提出的模型Metis在多项评估中将工具调用次数降低数量级,同时提升推理准确率。 Conclusion: HDPO通过条件化效率优化而非标量奖励权衡,有效缓解工具滥用问题,为多模态智能体的高效自主决策提供了新范式。 Abstract: The advent of agentic multimodal models has empowered systems to actively interact with external environments. However, current agents suffer from a profound meta-cognitive deficit: they struggle to arbitrate between leveraging internal knowledge and querying external utilities. Consequently, they frequently fall prey to blind tool invocation, resorting to reflexive tool execution even when queries are resolvable from the raw visual context. This pathological behavior precipitates severe latency bottlenecks and injects extraneous noise that derails sound reasoning. Existing reinforcement learning protocols attempt to mitigate this via a scalarized reward that penalizes tool usage. Yet, this coupled formulation creates an irreconcilable optimization dilemma: an aggressive penalty suppresses essential tool use, whereas a mild penalty is entirely subsumed by the variance of the accuracy reward during advantage normalization, rendering it impotent against tool overuse. To transcend this bottleneck, we propose HDPO, a framework that reframes tool efficiency from a competing scalar objective to a strictly conditional one. By eschewing reward scalarization, HDPO maintains two orthogonal optimization channels: an accuracy channel that maximizes task correctness, and an efficiency channel that enforces execution economy exclusively within accurate trajectories via conditional advantage estimation. This decoupled architecture naturally induces a cognitive curriculum-compelling the agent to first master task resolution before refining its self-reliance. 
Extensive evaluations demonstrate that our resulting model, Metis, reduces tool invocations by orders of magnitude while simultaneously elevating reasoning accuracy.
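
The idea of an efficiency channel that acts only within accurate trajectories can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the group-wise mean/std normalization, and the use of the negative tool-call count as the efficiency signal are all assumptions made for illustration, and the abstract does not say how the two channels are combined in the final update.

```python
from statistics import mean, pstdev


def normalize(values, eps=1e-6):
    """Group-wise normalization: subtract the mean, divide by the std (assumed scheme)."""
    m = mean(values)
    s = pstdev(values)
    return [(v - m) / (s + eps) for v in values]


def decoupled_advantages(correct, tool_calls):
    """Return (accuracy_advantages, efficiency_advantages) for one group of rollouts.

    correct:    list of bools, whether each sampled trajectory solved the task
    tool_calls: list of ints, number of tool invocations per trajectory
    """
    # Accuracy channel: normalized over every trajectory in the group.
    acc_adv = normalize([1.0 if c else 0.0 for c in correct])

    # Efficiency channel: computed only among correct trajectories, so an
    # incorrect trajectory is never rewarded for skipping tools.
    eff_adv = [0.0] * len(correct)
    correct_idx = [i for i, c in enumerate(correct) if c]
    if len(correct_idx) >= 2:
        # Hypothetical efficiency signal: fewer tool calls is better.
        scores = normalize([-float(tool_calls[i]) for i in correct_idx])
        for i, a in zip(correct_idx, scores):
            eff_adv[i] = a
    return acc_adv, eff_adv


acc, eff = decoupled_advantages([True, True, False, True], [0, 4, 9, 2])
print(acc)
print(eff)
```

In this toy group, the incorrect trajectory receives a zero efficiency advantage regardless of how many tools it called, while the correct trajectories are ranked against each other by tool usage. Because the efficiency signal is normalized separately, it is not drowned out by the variance of the accuracy reward, which is the failure mode the abstract attributes to scalarized rewards.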

[194] When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models

Zhengyang Sun,Yu Chen,Xin Zhou,Xiaofan Li,Xiwu Chen,Dingkang Liang,Xiang Bai

Main category: cs.CV

TL;DR: This paper proposes NUMINA, a training-free framework that improves the accuracy of object counts in text-to-video diffusion models.

Details Motivation: Text-to-video diffusion models struggle to generate the correct number of objects specified in a prompt. Method: NUMINA is a training-free identify-then-guide framework. It first identifies inconsistencies between the prompt and the layout by selecting discriminative self-attention and cross-attention heads to derive a countable latent layout; it then refines this layout conservatively and modulates cross-attention to guide regeneration. Result: On the newly introduced CountBench benchmark, NUMINA improves counting accuracy by up to 7.4% on Wan2.1-1.3B, and by 4.9% and 5.5% on the 5B and 14B models, respectively, while also improving CLIP alignment and maintaining temporal consistency. Conclusion: Structural guidance effectively complements seed search and prompt enhancement, offering a practical path toward count-accurate text-to-video diffusion. Abstract: Text-to-video diffusion models have enabled open-ended video synthesis, but often struggle with generating the correct number of objects specified in a prompt. We introduce NUMINA, a training-free identify-then-guide framework for improved numerical alignment. NUMINA identifies prompt-layout inconsistencies by selecting discriminative self- and cross-attention heads to derive a countable latent layout. It then refines this layout conservatively and modulates cross-attention to guide regeneration. On the introduced CountBench, NUMINA improves counting accuracy by up to 7.4% on Wan2.1-1.3B, and by 4.9% and 5.5% on 5B and 14B models, respectively. Furthermore, CLIP alignment is improved while maintaining temporal consistency. These results demonstrate that structural guidance complements seed search and prompt enhancement, offering a practical path toward count-accurate text-to-video diffusion. The code is available at https://github.com/H-EmbodVis/NUMINA.

[195] GaussiAnimate: Reconstruct and Rig Animatable Categories with Level of Dynamics

Jiaxin Wang,Dongxin Lyu,Zeyu Cai,Zhiyang Dou,Cheng Lin,Anpei Chen,Yuliang Xiu

Main category: cs.CV

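TL;DR: This paper proposes Skelebones, a Scaffold-Skin rigging system built in three steps: free-form bones (Bones), a Mean Curvature Skeleton (Skeleton), and non-parametric partwise motion matching (Binding). It compresses the dynamics of 4D shapes into a compact, controllable representation and clearly outperforms baselines such as LBS and BoB in reanimation while maintaining reconstruction fidelity.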

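Details Motivation: Free-form bones capture non-rigid deformations well but lack a controllable kinematic structure, while traditional skeleton methods struggle to be category-agnostic, motion-adaptive, and topology-correct at the same time. A new rigging system that is both expressive and controllable is therefore needed. Method: The Skelebones system (1) compresses temporally consistent deformable Gaussians into free-form bones that approximate non-rigid surface deformations; (2) extracts a Mean Curvature Skeleton from canonical Gaussians and refines it temporally so that it is category-agnostic, motion-adaptive, and topology-correct; and (3) binds the skeleton and bones via non-parametric partwise motion matching (PartMM), synthesizing new bone motions by matching, retrieving, and blending existing ones. Result: On synthetic and real-world datasets, PSNR improves by 17.3% over LBS and 21.7% over BoB. In the low-data regime (~1000 frames), PartMM improves RMSE by 48.4% over robust LBS and outperforms GRU- and MLP-based methods by more than 20%. The system supports high-fidelity reconstruction and modeling of complex non-rigid dynamics. Conclusion: Skelebones compresses the Level of Dynamics of 4D shapes into a compact, controllable, and expressive skelebone representation, offering a new paradigm for Gaussian-based character rigging. Abstract: Free-form bones that conform closely to the surface can effectively capture non-rigid deformations but lack the kinematic structure necessary for intuitive control. Thus, we propose a Scaffold-Skin Rigging System, termed "Skelebones", with three key steps: (1) Bones: compress temporally-consistent deformable Gaussians into free-form bones, approximating non-rigid surface deformations; (2) Skeleton: extract a Mean Curvature Skeleton from canonical Gaussians and refine it temporally, ensuring a category-agnostic, motion-adaptive, and topology-correct kinematic structure; (3) Binding: bind the skeleton and bones via non-parametric partwise motion matching (PartMM), synthesizing novel bone motions by matching, retrieving, and blending existing ones. Collectively, these three steps enable us to compress the Level of Dynamics of 4D shapes into compact skelebones that are both controllable and expressive. We validate our approach on both synthetic and real-world datasets, achieving significant improvements in reanimation performance across unseen poses, with 17.3% PSNR gains over Linear Blend Skinning (LBS) and 21.7% over Bag-of-Bones (BoB), while maintaining excellent reconstruction fidelity, particularly for characters exhibiting complex non-rigid surface dynamics. 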
Our Partwise Motion Matching algorithm demonstrates strong generalization to both Gaussian and mesh representations, especially in the low-data regime (~1000 frames), achieving a 48.4% RMSE improvement over robust LBS and outperforming GRU- and MLP-based learning methods by more than 20%. Code will be made publicly available for research purposes at cookmaker.cn/gaussianimate.
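
The "match, retrieve, and blend" pattern behind non-parametric motion matching can be sketched generically as a nearest-neighbor lookup followed by a weighted average. The code below is an illustration only, not the PartMM algorithm: the feature and motion encodings, the Euclidean distance, the value of `k`, and the inverse-distance weights are all assumptions. In a partwise setting, a function like this would presumably be applied to each body part with its own database.

```python
import math


def match_and_blend(query, database, k=2, eps=1e-8):
    """Blend the motions of the k database entries closest to `query`.

    query:    list of floats, a (hypothetical) skeleton-part pose feature
    database: list of (feature, motion) pairs with list-of-float entries
    """
    # Match: score every stored example by Euclidean distance to the query.
    scored = sorted(
        ((math.dist(query, feat), motion) for feat, motion in database),
        key=lambda pair: pair[0],
    )
    # Retrieve: keep the k nearest examples.
    top = scored[:k]
    # Blend: inverse-distance weights, so closer examples contribute more.
    weights = [1.0 / (d + eps) for d, _ in top]
    total = sum(weights)
    dim = len(top[0][1])
    return [
        sum(w * motion[j] for w, (_, motion) in zip(weights, top)) / total
        for j in range(dim)
    ]


db = [([0.0], [0.0, 0.0]), ([1.0], [1.0, 2.0]), ([10.0], [5.0, 5.0])]
print(match_and_blend([0.5], db, k=2))
```

Because nothing is trained, a scheme of this kind only needs stored examples, which is one plausible reason a non-parametric approach can hold up with little data.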

[196] ETCH-X: Robustify Expressive Body Fitting to Clothed Humans with Composable Datasets

Xiaoben Li,Jingyi Wu,Zeyu Cai,Yu Siyuan,Boqian Li,Yuliang Xiu

Main category: cs.CV

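TL;DR: This paper proposes ETCH-X, an improved human body fitting method. It uses a tightness-aware fitting paradigm to remove clothing interference ("undress"), adopts SMPL-X for greater local expressiveness, and replaces explicit sparse markers with implicit dense correspondences for more robust, fine-grained fitting. Its modular design supports disentangled training on multiple data sources and substantially improves fitting across diverse clothing, poses, and incomplete inputs.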

Details Motivation: Existing methods struggle to combine local expressiveness (for example, hands and faces) with global robustness to real-world challenges such as clothing dynamics, pose variation, and noisy or partial point clouds, and no all-in-one solution exists. Method: ETCH is upgraded to ETCH-X by 1) introducing a tightness-aware fitting paradigm that performs "undress" to suppress clothing interference; 2) adopting the SMPL-X model for greater expressiveness; 3) replacing explicit sparse markers, which are sensitive to partial data, with implicit dense correspondences ("dense fit"); and 4) disentangling "undress" and "dense fit" into modules that can be trained separately and scalably on multiple data sources, including CLOTH3D, AMASS, and InterHand2.6M. Result: ETCH-X substantially outperforms the original ETCH on several benchmarks. On seen data, MPJPE-All improves by 33.0% on 4D-Dress and V2V-Hands by 35.8% on CAPE; on unseen data (BEDLAM2.0), MPJPE-All improves by 80.8% and V2V-All by 80.5%. Conclusion: ETCH-X achieves high-quality body fitting that is both robust and expressive under complex clothing, varied poses, and incomplete inputs, and its modular, multi-source training paradigm offers a new direction for general human body modeling. Abstract: Human body fitting, which aligns parametric body models such as SMPL to raw 3D point clouds of clothed humans, serves as a crucial first step for downstream tasks like animation and texturing. An effective fitting method should be both locally expressive, capturing fine details such as hands and facial features, and globally robust to handle real-world challenges, including clothing dynamics, pose variations, and noisy or partial inputs. Existing approaches typically excel in only one aspect, lacking an all-in-one solution. We upgrade ETCH to ETCH-X, which leverages a tightness-aware fitting paradigm to filter out clothing dynamics ("undress"), extends expressiveness with SMPL-X, and replaces explicit sparse markers (which are highly sensitive to partial data) with implicit dense correspondences ("dense fit") for more robust and fine-grained body fitting. Our disentangled "undress" and "dense fit" modular stages enable separate and scalable training on composable data sources, including diverse simulated garments (CLOTH3D), large-scale full-body motions (AMASS), and fine-grained hand gestures (InterHand2.6M), improving outfit generalization and pose robustness of both bodies and hands. 
Our approach achieves robust and expressive fitting across diverse clothing, poses, and levels of input completeness, delivering a substantial performance improvement over ETCH on both: 1) seen data, such as 4D-Dress (MPJPE-All, 33.0%) and CAPE (V2V-Hands, 35.8%), and 2) unseen data, such as BEDLAM2.0 (MPJPE-All, 80.8%; V2V-All, 80.5%). Code and models will be released at https://xiaobenli00.github.io/ETCH-X/.
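
For readers unfamiliar with the metrics, MPJPE (mean per-joint position error) and V2V (vertex-to-vertex error) are both commonly computed as the average Euclidean distance between corresponding predicted and ground-truth 3D points, over joints and mesh vertices respectively. The sketch below illustrates that definition; the function names are hypothetical, and reading the reported percentages as relative error reductions over ETCH is an assumption, since the abstract does not define them.

```python
import math


def mean_point_error(pred, gt):
    """Average Euclidean distance between corresponding 3D points.

    With joints as input this matches the usual MPJPE definition;
    with mesh vertices it matches the usual V2V definition.
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and the same length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


def relative_improvement(baseline_err, new_err):
    """Percentage reduction in error relative to a baseline (assumed reading)."""
    return 100.0 * (baseline_err - new_err) / baseline_err


pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
print(mean_point_error(pred, gt))       # toy input, not real data
print(relative_improvement(10.0, 2.0))  # toy numbers, not from the paper
```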