cs.CL
[1] Contextual Earnings-22: A Speech Recognition Benchmark with Custom Vocabulary in the Wild
Berkin Durmus,Chen Cen,Eduardo Pacheco,Arda Okan,Atila Orhon
Main category: cs.CL
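TL;DR: This paper introduces the Contextual Earnings-22 dataset to address the gap between speech-to-text accuracy plateauing on academic benchmarks and the headroom that remains in industrial settings. It brings in realistic, context-specific custom vocabulary and establishes a standardized evaluation benchmark.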
Details
Motivation: Academic benchmarks are dominated by common general vocabulary that is easy to recognize, whereas in real high-stakes settings rare, context-dependent custom vocabulary has a larger effect on transcript usability. No standardized benchmark for contextual speech recognition exists.
Method: Builds Contextual Earnings-22, an open dataset based on Earnings-22 with realistic custom vocabulary contexts, and sets six strong baselines covering the two dominant contextual approaches: keyword prompting and keyword boosting.
Result: Both approaches reach comparable and significantly improved recognition accuracy once scaled to large-scale systems.
Conclusion: Contextual speech recognition has real room for improvement. Contextual Earnings-22 provides the first standardized, real-world-oriented benchmark for this direction and helps move the research toward practical use.
Abstract: The accuracy frontier of speech-to-text systems has plateaued on academic benchmarks. In contrast, industrial benchmarks and adoption in high-stakes domains suggest otherwise. We hypothesize that the primary difference between the two is contextual conditioning: Academic benchmarks are dominated by frequently encountered general vocabulary that is relatively easy to recognize compared with rare and context-defined custom vocabulary that has disproportionate impact on the usability of speech transcripts. Despite progress on contextual speech-to-text, there is no standardized benchmark. We introduce Contextual Earnings-22, an open dataset built upon Earnings-22, with realistic custom vocabulary contexts to foster research and reveal latent progress. We set six strong baselines for two dominant approaches: keyword prompting and keyword boosting. Experiments show both reach comparable and significantly improved accuracy when scaled from proof-of-concept to large-scale systems.
[2] Hybrid CNN-Transformer Architecture for Arabic Speech Emotion Recognition
Youcef Soufiane Gheffari,Oussama Mustapha Benouddane,Samiya Silarbi
Main category: cs.CL
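TL;DR: This paper presents an Arabic speech emotion recognition (SER) system based on a hybrid CNN-Transformer architecture, reaching 97.8% accuracy and a macro F1-score of 0.98 on the EYASE dataset.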
Details
Motivation: Research on Arabic speech emotion recognition is scarce, mainly because annotated datasets are limited.
Method: A hybrid CNN-Transformer architecture: the CNN extracts discriminative spectral features from Mel-spectrograms, and the Transformer encoder models long-range temporal dependencies in speech.
Result: 97.8% accuracy and a macro F1-score of 0.98 on the EYASE (Egyptian Arabic speech emotion) corpus.
Conclusion: Combining CNNs with attention is effective for Arabic SER, and Transformer-based methods show promise for low-resource languages.
Abstract: Recognizing emotions from speech using machine learning has become an active research area due to its importance in building human-centered applications. However, while many studies have been conducted in English, German, and other European and Asian languages, research in Arabic remains scarce because of the limited availability of annotated datasets. In this paper, we present an Arabic Speech Emotion Recognition (SER) system based on a hybrid CNN-Transformer architecture. The model leverages convolutional layers to extract discriminative spectral features from Mel-spectrogram inputs and Transformer encoders to capture long-range temporal dependencies in speech. Experiments were conducted on the EYASE (Egyptian Arabic speech emotion) corpus, and the proposed model achieved 97.8% accuracy and a macro F1-score of 0.98. These results demonstrate the effectiveness of combining convolutional feature extraction with attention-based modeling for Arabic SER and highlight the potential of Transformer-based approaches in low-resource languages.
[3] Cross-Tokenizer LLM Distillation through a Byte-Level Interface
Avyav Kumar Singh,Yen-Chen Wu,Alexandru Cioba,Alberto Bernacchia,Davide Buffelli
Main category: cs.CL
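TL;DR: This paper proposes Byte-Level Distillation (BLD), a simple and effective method that addresses cross-tokenizer distillation (CTD) by aligning the teacher's and student's output distributions at the byte level. It performs well on multiple benchmarks.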
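To make the byte-level idea concrete, here is a minimal Python sketch of one way a token-level next-token distribution could be projected onto bytes: each token's probability is added to the bucket of its first UTF-8 byte. This is an illustrative assumption, not the paper's conversion procedure; the function name `first_byte_distribution` and the toy vocabulary are hypothetical.

```python
from collections import defaultdict


def first_byte_distribution(token_probs):
    """Marginalize a next-token distribution onto each token's first UTF-8 byte.

    token_probs maps token strings to probabilities (hypothetical input format).
    Returns a dict mapping byte values (0-255) to probabilities.
    """
    byte_probs = defaultdict(float)
    for token, prob in token_probs.items():
        encoded = token.encode("utf-8")
        if not encoded:
            continue  # an empty token contributes no byte
        byte_probs[encoded[0]] += prob
    total = sum(byte_probs.values())
    if total == 0:
        raise ValueError("distribution has no probability mass on non-empty tokens")
    # Renormalize so the byte-level distribution sums to 1.
    return {byte: prob / total for byte, prob in byte_probs.items()}


# Toy teacher distribution over three tokens (made-up numbers).
teacher = {"the": 0.5, "then": 0.2, "a": 0.3}
dist = first_byte_distribution(teacher)
```

Because two tokenizers that split text differently still agree on the underlying bytes, a teacher and a student can be compared in this shared space even when their vocabularies do not match.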
Details
Motivation: Existing cross-tokenizer distillation (CTD) methods rely on heuristic strategies to align mismatched vocabularies, which adds complexity and yields limited gains. A simpler, more general solution is needed.
Method: Byte-Level Distillation (BLD) converts the teacher's output distribution to byte-level probabilities, attaches a lightweight byte-level decoder head to the student, and distills through this shared byte interface.
Result: Across distillation tasks with models from 1B to 8B parameters, BLD matches or surpasses more sophisticated CTD methods, supporting the byte level as a natural common interface for cross-tokenizer knowledge transfer.
Conclusion: The byte level is a viable and effective unified interface for CTD, but CTD remains an open problem: consistent gains across all tasks and benchmarks are still elusive.
Abstract: Cross-tokenizer distillation (CTD), the transfer of knowledge from a teacher to a student language model when the two use different tokenizers, remains a largely unsolved problem. Existing approaches rely on heuristic strategies to align mismatched vocabularies, introducing considerable complexity. In this paper, we propose a simple but effective baseline called Byte-Level Distillation (BLD) which enables CTD by operating at a common interface across tokenizers: the byte level. In more detail, we convert the teacher's output distribution to byte-level probabilities, attach a lightweight byte-level decoder head to the student, and distill through this shared byte-level interface. Despite its simplicity, BLD performs competitively with, and on several benchmarks surpasses, significantly more sophisticated CTD methods, across a range of distillation tasks with models from 1B to 8B parameters. Our results suggest that the byte level is a natural common ground for cross-tokenizer knowledge transfer, while also highlighting that consistent improvements across all tasks and benchmarks remain elusive, underscoring that CTD is still an open problem.
[4] Lexical Tone is Hard to Quantize: Probing Discrete Speech Units in Mandarin and Yorùbá
Opeyemi Osakuade,Simon King
Main category: cs.CL
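TL;DR: This paper finds that discrete speech units (DSUs) encode suprasegmental information such as lexical tone less reliably than segmental structure, even though the self-supervised learning (SSL) latent representations themselves do encode tone. The authors propose a staged residual clustering approach that improves tone encoding.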
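The staged residual clustering idea can be sketched in plain Python: run K-means once, subtract each frame's assigned centroid, then run K-means again on the residuals. This is a toy illustration under stated assumptions, not the authors' implementation: the 2-D "frames", the cluster counts, and the farthest-point initialisation (chosen only to keep the example deterministic) are all hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Tiny K-means with deterministic farthest-point initialisation."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


def two_stage_units(points, k_coarse, k_fine):
    """Cluster once, then cluster what the first stage failed to explain."""
    centroids, coarse = kmeans(points, k_coarse)
    residuals = [
        [x - c for x, c in zip(p, centroids[lab])] for p, lab in zip(points, coarse)
    ]
    _, fine = kmeans(residuals, k_fine)
    return coarse, fine


# Toy frames: the first axis varies a lot (segment-like), the second only a little (tone-like).
frames = [[0.0, 1.0], [0.0, -1.0], [10.0, 1.0], [10.0, -1.0]]
coarse, fine = two_stage_units(frames, k_coarse=2, k_fine=2)
```

In this toy setup the first stage spends its clusters on the large-variance axis, and the small-variance contrast only becomes separable once the first-stage centroids are subtracted, which mirrors the motivation given in the abstract.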
Details
Motivation: DSUs are widely used in speech tasks (e.g., TTS and multimodal dialogue), but they encode suprasegmental features such as tone and intonation poorly, and existing quantization methods are not optimized for them.
Method: On the tone languages Mandarin and Yorùbá, the paper analyzes how well tone is preserved in SSL latent representations and in DSUs produced by several quantization methods (including K-means), and it proposes a two-stage K-means residual quantization strategy.
Result: SSL latents do encode tone, but conventional DSU quantization strongly favors segmental structure, so tone information is lost. Two-stage residual quantization encodes tone markedly more reliably.
Conclusion: Current DSU quantization strategies are ill-suited to suprasegmental modeling, which calls for new tone-aware or intonation-aware representation learning methods. Two-stage residual quantization is a promising direction.
Abstract: Discrete speech units (DSUs) are derived by quantising representations from models trained using self-supervised learning (SSL). They are a popular representation for a wide variety of spoken language tasks, including those where prosody matters. DSUs are especially convenient for tasks where text and speech are jointly modelled, such as text-to-speech and multimodal dialogue systems. But we have found that DSUs encode suprasegmental information less reliably than segmental structure, which we demonstrate in this work using lexical tone, though this limitation likely extends to other suprasegmental features such as prosody. Our investigations using the tone languages Mandarin and Yorùbá show that the SSL latent representations themselves do encode tone, yet DSUs obtained using quantisation tend to prioritise phonetic structure, which makes lexical tone less reliably encoded. This remains true for a variety of quantisation methods, not only the most common, K-means. We conclude that current DSU quantisation strategies have limitations for suprasegmental features, which suggests a need for new, tone-aware (or prosody-aware) techniques in speech representation learning. We point towards a potential form of the solution by performing K-means clustering once to encode phonetic information, then again on the residual representation, which better encodes lexical tone.
[5] Enabling Intrinsic Reasoning over Dense Geospatial Embeddings with DFR-Gemma
Xuechen Zhang,Aviv Slobodkin,Joydeep Paul,Mandar Sharma,Samet Oymak,Shravya Shetty,Gautam Prasad
Main category: cs.CL
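TL;DR: This paper proposes DFR-Gemma, a framework that lets large language models reason directly over geospatial embeddings without intermediate text conversion, improving efficiency and accuracy.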
Details
Motivation: Existing ways of integrating geospatial foundation model embeddings with large language models (LLMs) suffer from redundancy, token inefficiency, and numerical inaccuracy.
Method: Direct Feature Reasoning-Gemma (DFR-Gemma) aligns high-dimensional geospatial embeddings with the LLM's latent space via a lightweight projector and feeds them in as semantic tokens alongside natural language instructions.
Result: On a multi-task geospatial benchmark, DFR-Gemma performs accurate zero-shot reasoning and is significantly more efficient than text-based baselines.
Conclusion: Treating embeddings as primary data inputs offers a more direct, efficient, and scalable paradigm for multimodal geospatial intelligence.
Abstract: Representation learning for geospatial and spatio-temporal data plays a critical role in enabling general-purpose geospatial intelligence. Recent geospatial foundation models, such as the Population Dynamics Foundation Model (PDFM), encode complex population and mobility dynamics into compact embeddings. However, their integration with Large Language Models (LLMs) remains limited. Existing approaches to LLM integration treat these embeddings as retrieval indices or convert them into textual descriptions for reasoning, introducing redundancy, token inefficiency, and numerical inaccuracies. We propose Direct Feature Reasoning-Gemma (DFR-Gemma), a novel framework that enables LLMs to reason directly over dense geospatial embeddings. DFR aligns high-dimensional embeddings with the latent space of an LLM via a lightweight projector, allowing embeddings to be injected as semantic tokens alongside natural language instructions. This design eliminates the need for intermediate textual representations and enables intrinsic reasoning over spatial features. To evaluate this paradigm, we introduce a multi-task geospatial benchmark that pairs embeddings with diverse question-answer tasks, including feature querying, comparison, and semantic description. Experimental results show that DFR allows LLMs to decode latent spatial patterns and perform accurate zero-shot reasoning across tasks, while significantly improving efficiency compared to text-based baselines. Our results demonstrate that treating embeddings as primary data inputs provides a more direct, efficient, and scalable approach to multimodal geospatial intelligence.
[6] Decompose, Look, and Reason: Reinforced Latent Reasoning for VLMs
Mengdan Zhu,Senhao Cheng,Liang Zhao
Main category: cs.CL
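TL;DR: This paper proposes "Decompose, Look, and Reason" (DLR), a reinforced latent reasoning framework that addresses the visual information loss caused by textual chain-of-thought in vision-language models during complex visual reasoning. DLR dynamically decomposes queries, extracts premise-conditioned continuous visual latents, and derives answers through grounded rationales. It also introduces a Spherical Gaussian Latent Policy to improve exploration in the latent space. Experiments show that DLR outperforms strong baselines on multiple vision-centric benchmarks and offers better stepwise interpretability.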
Details
Motivation: Vision-language models lose visual information in complex visual reasoning, especially in textual chain-of-thought (CoT). Existing methods either add tool-call overhead or rely on localized patch embeddings that cannot support multi-step semantic reasoning.
Method: DLR (1) dynamically decomposes the query into textual premises, (2) extracts premise-conditioned continuous visual latents, and (3) derives the answer through grounded rationales. It is trained with a three-stage pipeline and a Spherical Gaussian Latent Policy that improves exploration in the latent space.
Result: On multiple vision-centric benchmarks, DLR consistently outperforms strong baselines, including text-only, interleaved multimodal CoT, and other latent reasoning methods, while providing better stepwise interpretability.
Conclusion: By jointly modeling premise-guided visual latents and a reinforced reasoning process, DLR mitigates visual information loss and offers an efficient, interpretable paradigm for complex visual reasoning.
Abstract: Vision-Language Models often struggle with complex visual reasoning due to the visual information loss in textual CoT. Existing methods either add the cost of tool calls or rely on localized patch-based embeddings that are insufficient to extract semantics in multi-step reasoning. We propose "Decompose, Look, and Reason" (DLR), a reinforced latent reasoning framework that dynamically decomposes queries into textual premises, extracts premise-conditioned continuous visual latents, and deduces answers through grounded rationales. We introduce a three-stage training pipeline and propose a novel Spherical Gaussian Latent Policy to enable effective exploration in the latent space. Extensive experiments on vision-centric benchmarks show that DLR consistently outperforms strong baselines, including text-only, interleaved multimodal CoT, and latent reasoning methods, while providing superior stepwise interpretability.
[7] EMSDialog: Synthetic Multi-person Emergency Medical Service Dialogue Generation from Electronic Patient Care Reports via Multi-LLM Agents
Xueren Ge,Sahil Murtaza,Anthony Cortez,Homa Alemzadeh
Main category: cs.CL
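TL;DR: This paper proposes an ePCR-grounded, topic-flow-based multi-agent generation pipeline, builds the EMSDialog dataset with it, and shows that the dataset improves diagnosis prediction for EMS conversations.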
Details
Motivation: Most existing medical dialogue corpora are dyadic and lack the multi-role workflow and fine-grained annotations needed to track evolving evidence and decide when to commit to a diagnosis in streaming clinical conversations.
Method: An ePCR-grounded, topic-flow-based multi-agent generation pipeline iteratively plans, generates, and self-refines dialogues, with rule-based factual and topic-flow checks, to produce high-quality multi-speaker EMS conversations.
Result: The pipeline yields EMSDialog, a dataset of 4,414 synthetic EMS conversations annotated with 43 diagnoses, speaker roles, and turn-level topics. Human and LLM evaluations confirm its quality and realism, and augmented training improves the accuracy, timeliness, and stability of diagnosis prediction.
Conclusion: EMSDialog fills the gap in multi-role clinical dialogue data. The generation pipeline is scalable and provides a reliable data foundation for streaming diagnosis modeling.
Abstract: Conversational diagnosis prediction requires models to track evolving evidence in streaming clinical conversations and decide when to commit to a diagnosis. Existing medical dialogue corpora are largely dyadic or lack the multi-party workflow and annotations needed for this setting. We introduce an ePCR-grounded, topic-flow-based multi-agent generation pipeline that iteratively plans, generates, and self-refines dialogues with rule-based factual and topic flow checks. The pipeline yields EMSDialog, a dataset of 4,414 synthetic multi-speaker EMS conversations based on a real-world ePCR dataset, annotated with 43 diagnoses, speaker roles, and turn-level topics. Human and LLM evaluations confirm high quality and realism of EMSDialog using both utterance- and conversation-level metrics. Results show that EMSDialog-augmented training improves accuracy, timeliness, and stability of EMS conversational diagnosis prediction.
[8] TR-EduVSum: A Turkish-Focused Dataset and Consensus Framework for Educational Video Summarization
Figen Eğin,Aytuğ Onan
Main category: cs.CL
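TL;DR: This paper proposes AutoMUP, which automatically generates gold-standard summaries of Turkish educational videos from multiple human summaries. It produces graded summaries through embedding clustering and consensus modeling, and on the TR-EduVSum dataset its summaries show high semantic overlap with those of strong LLMs.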
Details
Motivation: There is no high-quality, reproducible gold standard for automatic summarization of Turkish educational videos, and existing pyramid-based evaluation requires human involvement and scales poorly.
Method: Builds the TR-EduVSum dataset (82 Turkish "Data Structures and Algorithms" course videos with 3,281 human summaries) and proposes AutoMUP: extract meaning units, cluster them by embedding, statistically model inter-participant agreement, and generate graded summaries by consensus weight. The gold summary is the highest-consensus configuration.
Result: AutoMUP summaries show high semantic overlap with summaries from strong LLMs such as Flash 2.5 and GPT-5.1. Ablations show that consensus weight and clustering play a decisive role in summary quality. The method can be extended to other Turkic languages at low cost.
Conclusion: AutoMUP makes the construction of gold-standard summaries for Turkish educational videos fully automatic and reproducible, offering a new paradigm for summary evaluation in low-resource languages.
Abstract: This study presents a framework for generating the gold-standard summary fully automatically and reproducibly based on multiple human summaries of Turkish educational videos. Within the scope of the study, a new dataset called TR-EduVSum was created, encompassing 82 Turkish course videos in the field of "Data Structures and Algorithms" and containing a total of 3281 independent human summaries. Inspired by existing pyramid-based evaluation approaches, the AutoMUP (Automatic Meaning Unit Pyramid) method is proposed, which extracts consensus-based content from multiple human summaries. AutoMUP clusters the meaning units extracted from human summaries using embedding, statistically models inter-participant agreement, and generates graded summaries based on consensus weight. In this framework, the gold summary corresponds to the highest-consensus AutoMUP configuration, constructed from the most frequently supported meaning units across human summaries. Experimental results show that AutoMUP summaries exhibit high semantic overlap with robust LLM (Large Language Model) summaries such as Flash 2.5 and GPT-5.1. Furthermore, ablation studies clearly demonstrate the decisive role of consensus weight and clustering in determining summary quality. The proposed approach can be generalized to other Turkic languages at low cost.
[9] Reasoning-Based Refinement of Unsupervised Text Clusters with LLMs
Tunazzina Islam
Main category: cs.CL
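TL;DR: This paper proposes a framework that refines unsupervised clustering results through large language model (LLM) reasoning. Its three stages (coherence verification, redundancy adjudication, and label grounding) improve cluster coherence, interpretability, and human alignment without labeled data.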
Details
Motivation: Unsupervised semantic clustering often yields incoherent, redundant, or poorly grounded clusters that are hard to validate without labeled data.
Method: A three-stage LLM reasoning framework: (i) coherence verification, which checks whether cluster summaries are supported by their member texts; (ii) redundancy adjudication, which merges or rejects candidate clusters based on semantic overlap; and (iii) label grounding, which assigns interpretable cluster labels in a fully unsupervised manner. The design decouples representation learning from structural validation.
Result: On real-world data from two social media platforms, the framework consistently improves cluster coherence and human-aligned label quality over classical topic models and recent representation-based baselines. Human evaluation shows strong agreement with LLM-generated labels, and robustness analyses assess cross-platform stability.
Conclusion: LLMs can serve as a general mechanism for validating and refining semantic structure, enabling unsupervised, reliable, and interpretable analysis of large text collections.
Abstract: Unsupervised methods are widely used to induce latent semantic structure from large text collections, yet their outputs often contain incoherent, redundant, or poorly grounded clusters that are difficult to validate without labeled data. We propose a reasoning-based refinement framework that leverages large language models (LLMs) not as embedding generators, but as semantic judges that validate and restructure the outputs of arbitrary unsupervised clustering algorithms. Our framework introduces three reasoning stages: (i) coherence verification, where LLMs assess whether cluster summaries are supported by their member texts; (ii) redundancy adjudication, where candidate clusters are merged or rejected based on semantic overlap; and (iii) label grounding, where clusters are assigned interpretable labels in a fully unsupervised manner. This design decouples representation learning from structural validation and mitigates common failure modes of embedding-only approaches. We evaluate the framework on real-world social media corpora from two platforms with distinct interaction models, demonstrating consistent improvements in cluster coherence and human-aligned labeling quality over classical topic models and recent representation-based baselines. Human evaluation shows strong agreement with LLM-generated labels, despite the absence of gold-standard annotations. We further conduct robustness analyses under matched temporal and volume conditions to assess cross-platform stability. Beyond empirical gains, our results suggest that LLM-based reasoning can serve as a general mechanism for validating and refining unsupervised semantic structure, enabling more reliable and interpretable analyses of large text collections without supervision.
[10] CAMO: A Class-Aware Minority-Optimized Ensemble for Robust Language Model Evaluation on Imbalanced Data
Mohamed Ehab,Ali Hamdi,Khaled Shaban
Main category: cs.CL
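TL;DR: This paper proposes CAMO (Class-Aware Minority-Optimized), an ensemble method designed for class-imbalanced classification. It uses a hierarchical procedure to dynamically strengthen minority-class predictions and raises macro F1 on imbalanced benchmark datasets.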
Details
Motivation: Real-world classification is often severely limited by class imbalance. Traditional ensembles favor majority classes, which hurts minority-class performance and overall F1-score.
Method: CAMO uses a hierarchical procedure that combines vote distributions, confidence calibration, and inter-model uncertainty to dynamically raise the weight of minority classes and strengthen their predictions.
Result: On two highly imbalanced domain-specific datasets, DIAR-AI/Emotion and BEA 2025, under zero-shot and fine-tuned settings with eight language models (three large and five small), CAMO consistently achieves the highest strict macro F1-score, ahead of seven baseline ensemble algorithms.
Conclusion: CAMO is a reliable, domain-agnostic ensemble framework for imbalanced classification. Its benefit compounds with model adaptation, which indicates that the best ensemble strategy depends on the properties of the underlying models.
Abstract: Real-world categorization is severely hampered by class imbalance because traditional ensembles favor majority classes, which lowers minority performance and overall F1-score. We provide a unique ensemble technique for imbalanced problems called CAMO (Class-Aware Minority-Optimized). Through a hierarchical procedure that incorporates vote distributions, confidence calibration, and inter-model uncertainty, CAMO dynamically boosts underrepresented classes while preserving and amplifying minority forecasts. We verify CAMO on two highly unbalanced, domain-specific benchmarks: the DIAR-AI/Emotion dataset and the ternary BEA 2025 dataset. We benchmark against seven proven ensemble algorithms using eight different language models (three LLMs and five SLMs) under zero-shot and fine-tuned settings. With refined models, CAMO consistently earns the greatest strict macro F1-score, setting a new benchmark. Its benefit works in concert with model adaptation, showing that the best ensemble choice depends on model properties. This proves that CAMO is a reliable, domain-neutral framework for unbalanced categorization.
[11] ADAG: Automatically Describing Attribution Graphs
Aryaman Arora,Zhengxuan Wu,Jacob Steinhardt,Sarah Schwettmann
Main category: cs.CL
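TL;DR: This paper proposes ADAG, an end-to-end automated pipeline for circuit-tracing analysis. Using attribution profiles, a novel clustering algorithm, and an LLM explainer-simulator setup, it automatically identifies the functional roles of a language model's internal features and explains them in natural language. It is validated on known tasks and on a harmful-advice jailbreak in Llama 3.1.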
Details
Motivation: Existing circuit-tracing work relies on manual interpretation of each feature's role and lacks automated, scalable explanation methods, which limits the efficiency and reliability of interpretability research.
Method: Introduces attribution profiles (which quantify a feature's input and output gradient effects), a novel feature clustering algorithm, and an LLM explainer-simulator setup that generates and scores natural-language explanations of the functional role of feature groups.
Result: Recovers interpretable circuits on circuit-tracing tasks previously analyzed by humans, and finds steerable feature clusters responsible for a harmful-advice jailbreak in Llama 3.1 8B Instruct.
Conclusion: ADAG makes circuit tracing for language models fully automated and interpretable, improving the objectivity and practicality of interpretability research and providing a new tool for safety analysis and model steering.
Abstract: In language model interpretability research, circuit tracing aims to identify which internal features causally contributed to a particular output and how they affected each other, with the goal of explaining the computations underlying some behaviour. However, all prior circuit tracing work has relied on ad-hoc human interpretation of the role that each feature in the circuit plays, via manual inspection of data artifacts such as the dataset examples the component activates on. We introduce ADAG, an end-to-end pipeline for describing these attribution graphs which is fully automated. To achieve this, we introduce attribution profiles which quantify the functional role of a feature via its input and output gradient effects. We then introduce a novel clustering algorithm for grouping features, and an LLM explainer-simulator setup which generates and scores natural-language explanations of the functional role of these feature groups. We run our system on known human-analysed circuit-tracing tasks and recover interpretable circuits, and further show ADAG can find steerable clusters which are responsible for a harmful advice jailbreak in Llama 3.1 8B Instruct.
[12] DIVERSED: Relaxed Speculative Decoding via Dynamic Ensemble Verification
Ziyi Wang,Siva Rajesh Kasa,Ankith M S,Santhosh Kumar Kasa,Jiaru Zou,Sumit Negi,Ruqi Zhang,Nan Jiang,Qifan Song
Main category: cs.CL
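TL;DR: This paper proposes DIVERSED, a framework that dynamically relaxes the verification step of speculative decoding, raising the acceptance rate and inference speed while preserving generation quality.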
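A rough Python sketch of why relaxing verification raises the acceptance rate: if the verifier checks drafted tokens against a blend of the draft and target distributions rather than the target alone, the acceptance ratio moves toward 1. Two things here are assumptions rather than details from the paper: the blend weight is a fixed constant (DIVERSED learns a task- and context-dependent weight), and the acceptance rule `min(1, v(x) / q(x))` is borrowed from standard speculative sampling with the blended distribution `v` substituted for the target. All names are hypothetical.

```python
def blended_verifier(draft, target, weight):
    """Mix draft and target distributions: v = weight * draft + (1 - weight) * target."""
    tokens = set(draft) | set(target)
    return {
        tok: weight * draft.get(tok, 0.0) + (1.0 - weight) * target.get(tok, 0.0)
        for tok in tokens
    }


def acceptance_prob(token, draft, verifier):
    """Probability of accepting a drafted token under the rule min(1, v(x) / q(x))."""
    q = draft.get(token, 0.0)
    if q <= 0.0:
        raise ValueError("token has zero draft probability and cannot have been drafted")
    return min(1.0, verifier.get(token, 0.0) / q)


# Toy distributions over two tokens (made-up numbers).
draft = {"cat": 0.6, "dog": 0.4}
target = {"cat": 0.3, "dog": 0.7}

# weight = 0 recovers strict verification against the target model alone.
strict = acceptance_prob("cat", draft, blended_verifier(draft, target, weight=0.0))
relaxed = acceptance_prob("cat", draft, blended_verifier(draft, target, weight=0.5))
```

In this toy case the acceptance probability for "cat" rises from 0.5 to 0.75. The price is that accepted tokens no longer follow the target distribution exactly, which is the quality-versus-speed trade-off the paper sets out to manage.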
Details
Motivation: The strict verification step in standard speculative decoding limits the acceptance rate and therefore the achievable speedup.
Method: Proposes an ensemble-based dynamic verifier that adaptively blends the draft and target model probability distributions according to task and context.
Result: Compared with standard speculative decoding, DIVERSED substantially improves inference efficiency while maintaining generation quality.
Conclusion: Relaxed verification can speed up large language model inference without sacrificing quality, offering a new direction for efficient inference.
Abstract: Speculative decoding is an effective technique for accelerating large language model inference by drafting multiple tokens in parallel. In practice, its speedup is often bottlenecked by a rigid verification step that strictly enforces the accepted token distribution to exactly match the target model. This constraint leads to the rejection of many plausible tokens, lowering the acceptance rate and limiting overall time speedup. To overcome this limitation, we propose Dynamic Verification Relaxed Speculative Decoding (DIVERSED), a relaxed verification framework that improves time efficiency while preserving generation quality. DIVERSED learns an ensemble-based verifier that blends the draft and target model distributions with a task-dependent and context-dependent weight. We provide theoretical justification for our approach and demonstrate empirically that DIVERSED achieves substantially higher inference efficiency compared to standard speculative decoding methods. Code is available at: https://github.com/comeusr/diversed.
[13] Efficient and Effective Internal Memory Retrieval for LLM-Based Healthcare Prediction
Mingchen Li,Jiatan Huang,Zonghai Yao,Hong Yu
Main category: cs.CL
TL;DR: K2K is a novel framework that replaces external retrieval with internal, key-based knowledge access to improve LLM reliability in healthcare settings.
Details
Motivation: Large language models (LLMs) have potential in healthcare but suffer from hallucinations and lack of granular medical context. Standard Retrieval Augmented Generation (RAG) pipelines are computationally intensive and cause high latency, making them impractical for time-sensitive care.
Method: The paper introduces Keys to Knowledge (K2K), a framework that encodes essential clinical information directly into the model's parameter space for rapid internal key-value memory retrieval. It enhances retrieval quality through activation-guided probe construction and cross-attention reranking.
Result: K2K achieves state-of-the-art performance across four benchmark healthcare outcome prediction datasets.
Conclusion: K2K effectively improves LLM reliability in healthcare by enabling fast, accurate internal knowledge access without inference-time overhead.
Abstract: Large language models (LLMs) hold significant promise for healthcare, yet their reliability in high-stakes clinical settings is often compromised by hallucinations and a lack of granular medical context. While Retrieval Augmented Generation (RAG) can mitigate these issues, standard supervised pipelines require computationally intensive searches over massive external knowledge bases, leading to high latency that is impractical for time-sensitive care. To address this, we introduce Keys to Knowledge (K2K), a novel framework that replaces external retrieval with internal, key-based knowledge access. By encoding essential clinical information directly into the model's parameter space, K2K enables rapid retrieval from internal key-value memory without inference-time overhead. We further enhance retrieval quality through activation-guided probe construction and cross-attention reranking. Experimental results demonstrate that K2K achieves state-of-the-art performance across four benchmark healthcare outcome prediction datasets.
[14] Detecting HIV-Related Stigma in Clinical Narratives Using Large Language Models
Ziyi Chen,Yasir Khan,Mengyuan Zhang,Cheng Peng,Mengxian Lyu,Yiyang Liu,Krishna Vaddiparti,Robert L Cook,Mattia Prosperi,Yonghui Wu
Main category: cs.CL
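TL;DR: This study develops a large language model (LLM)-based tool that automatically identifies HIV-related stigma in clinical notes. Using 1,332 manually annotated sentences to compare several models, it finds that GatorTron-large performs best (Micro F1 = 0.62) and that few-shot prompting substantially improves generative models.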
Details
Motivation: HIV-related stigma is a key psychosocial factor affecting the mental health, treatment adherence, and outcomes of people living with HIV (PLWH), yet no off-the-shelf NLP tool exists to extract and classify stigma content directly from clinical notes.
Method: From clinical notes of PLWH at University of Florida Health (2012-2022), candidate sentences were selected with expert-defined keywords and expanded via clinical word embeddings. A total of 1,332 sentences were manually annotated across four stigma subscales. Encoder models (GatorTron-large, BERT) were compared with generative LLMs (GPT-OSS-20B, LLaMA-8B, MedGemma-27B) under zero-shot and few-shot settings.
Result: GatorTron-large was best overall (Micro F1 = 0.62). With few-shot prompting, GPT-OSS-20B and LLaMA-8B reached 0.57 and 0.59, respectively. Negative Self-Image was the easiest subscale to predict and Personalized Stigma the hardest. Zero-shot generative inference failed in up to 32% of cases.
Conclusion: This study builds the first practical NLP tool for identifying HIV stigma in clinical notes, confirms the effectiveness of domain-adapted encoder models, and shows the importance of few-shot prompting for generative models.
Abstract: Human immunodeficiency virus (HIV)-related stigma is a critical psychosocial determinant of health for people living with HIV (PLWH), influencing mental health, engagement in care, and treatment outcomes. Although stigma-related experiences are documented in clinical narratives, there is a lack of off-the-shelf tools to extract and categorize them. This study aims to develop a large language model (LLM)-based tool for identifying HIV stigma from clinical notes. We identified clinical notes from PLWH receiving care at the University of Florida (UF) Health between 2012 and 2022. Candidate sentences were identified using expert-curated stigma-related keywords and iteratively expanded via clinical word embeddings. A total of 1,332 sentences were manually annotated across four stigma subscales: Concern with Public Attitudes, Disclosure Concerns, Negative Self-Image, and Personalized Stigma. We compared GatorTron-large and BERT as encoder-based baselines, and GPT-OSS-20B, LLaMA-8B, and MedGemma-27B as generative LLMs, under zero-shot and few-shot prompting. GatorTron-large achieved the best overall performance (Micro F1 = 0.62). Few-shot prompting substantially improved generative model performance, with 5-shot GPT-OSS-20B and LLaMA-8B achieving Micro-F1 scores of 0.57 and 0.59, respectively. Performance varied by stigma subscale, with Negative Self-Image showing the highest predictability and Personalized Stigma remaining the most challenging. Zero-shot generative inference exhibited non-trivial failure rates (up to 32%). This study develops the first practical NLP tool for identifying HIV stigma in clinical notes.
[15] SepSeq: A Training-Free Framework for Long Numerical Sequence Processing in LLMs
Jie Sun,Yu Liu,Lu Han,Qiwen Deng,Xiang Shu,Yang Xiao,Xingyu Lu,Jun Zhou,Pengfei Liu,Lintao Ma,Jiancan Wu,Xiang Wang
Main category: cs.CL
TL;DR: SepSeq is a training-free, plug-and-play framework that inserts separator tokens to mitigate attention dispersion in LLMs when processing long numerical sequences, improving accuracy and reducing token consumption.
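The separator-insertion step can be illustrated with a short Python sketch that splits a numerical sequence into fixed-length segments and places a separator between them before the text is handed to a model. The separator string, the segment length, and the function name `insert_separators` are illustrative assumptions; the listing does not say which tokens or segment sizes SepSeq actually uses.

```python
def insert_separators(values, segment_len, sep="|"):
    """Render a numeric sequence as text with a separator between fixed-size segments."""
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    # Split the sequence into consecutive chunks of at most segment_len items.
    chunks = [values[i:i + segment_len] for i in range(0, len(values), segment_len)]
    rendered = [", ".join(str(v) for v in chunk) for chunk in chunks]
    return f" {sep} ".join(rendered)


prompt_fragment = insert_separators([1, 2, 3, 4, 5, 6, 7], segment_len=3)
# prompt_fragment == "1, 2, 3 | 4, 5, 6 | 7"
```

Because this is plain preprocessing of the prompt, it needs no retraining and can sit in front of any model, which fits the plug-and-play framing above.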
Details
Motivation: Transformer-based LLMs suffer from severe performance degradation on long numerical sequences due to attention dispersion caused by the Softmax mechanism.
Method: Proposes Separate Sequence (SepSeq), a training-free, plug-and-play framework that strategically inserts separator tokens to act as attention sinks, recalibrating attention to focus on local segments while preserving global context.
Result: Extensive evaluations on 9 LLMs show SepSeq achieves an average relative accuracy improvement of 35.6% across domains and reduces total inference token consumption by 16.4% on average.
Conclusion: Separator tokens effectively mitigate attention dispersion in LLMs for long numerical sequences, offering a simple, efficient, and broadly applicable solution without requiring retraining.
Abstract: While transformer-based Large Language Models (LLMs) theoretically support massive context windows, they suffer from severe performance degradation when processing long numerical sequences. We attribute this failure to the attention dispersion in the Softmax mechanism, which prevents the model from concentrating attention. To overcome this, we propose Separate Sequence (SepSeq), a training-free, plug-and-play framework to mitigate dispersion by strategically inserting separator tokens. Mechanistically, we demonstrate that separator tokens act as an attention sink, recalibrating attention to focus on local segments while preserving global context. Extensive evaluations on 9 widely-adopted LLMs confirm the effectiveness of our approach: SepSeq yields an average relative accuracy improvement of 35.6% across diverse domains while reducing total inference token consumption by 16.4% on average.
[16] Beyond Social Pressure: Benchmarking Epistemic Attack in Large Language Models
Steven Au,Sujit Noronha
Main category: cs.CL
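TL;DR: This paper introduces PPT-Bench, a benchmark for probing the epistemic vulnerability of large language models under philosophical pressure (for example, challenges to the legitimacy of knowledge or the nullification of values). It reveals epistemic failure modes that existing social-pressure evaluations do not cover.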
Details
Motivation: Prior work on LLM sycophancy has focused on disagreement, flattery, and preference alignment, overlooking broader epistemic failures. The authors aim to systematically evaluate how stable and consistent models remain when prompts challenge the legitimacy of knowledge, values, or identity.
Method: Builds the PPT-Bench diagnostic benchmark around the Philosophical Pressure Taxonomy (PPT), which covers four pressure types: Epistemic Destabilization, Value Nullification, Authority Inversion, and Identity Dissolution. Each item is tested at three layers: baseline (L0), single-turn pressure (L1), and multi-turn Socratic escalation (L2). Inconsistency and conversational capitulation are measured across five models, and several mitigation strategies are compared.
Result: The four pressure types produce statistically separable inconsistency patterns, indicating that epistemic attack exposes weaknesses not captured by standard social-pressure benchmarks. Mitigation effectiveness depends strongly on pressure type and model: prompt-level anchoring and persona-stability prompts work best for API models, while Leading Query Contrastive Decoding is the most reliable for open models.
Conclusion: PPT-Bench offers a new way to understand LLM vulnerability along deeper epistemic dimensions, and it shows the need to look beyond preference alignment to robustness on foundational questions such as the legitimacy of knowledge.
Abstract: Large language models (LLMs) can shift their answers under pressure in ways that reflect accommodation rather than reasoning. Prior work on sycophancy has focused mainly on disagreement, flattery, and preference alignment, leaving a broader set of epistemic failures less explored. We introduce PPT-Bench, a diagnostic benchmark for evaluating epistemic attack, where prompts challenge the legitimacy of knowledge, values, or identity rather than simply opposing a previous answer. PPT-Bench is organized around the Philosophical Pressure Taxonomy (PPT), which defines four types of philosophical pressure: Epistemic Destabilization, Value Nullification, Authority Inversion, and Identity Dissolution. Each item is tested at three layers: a baseline prompt (L0), a single-turn pressure condition (L1), and a multi-turn Socratic escalation (L2). This allows us to measure epistemic inconsistency between L0 and L1, and conversational capitulation in L2. Across five models, these pressure types produce statistically separable inconsistency patterns, suggesting that epistemic attack exposes weaknesses not captured by standard social-pressure benchmarks. Mitigation results are strongly type- and model-dependent: prompt-level anchoring and persona-stability prompts perform best in API settings, while Leading Query Contrastive Decoding is the most reliable intervention for open models.
[17] An Empirical Analysis of Static Analysis Methods for Detection and Mitigation of Code Library Hallucinations
Clarissa Miranda-Pena,Andrew Reeson,Cécile Paris,Josiah Poon,Jonathan K. Kummerfeld
Main category: cs.CL
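TL;DR: This paper studies the potential and limits of static analysis tools for detecting and mitigating hallucinations in LLM-generated code, especially library-related ones. The tools detect 16-70% of all errors and 14-85% of library hallucinations, but they have an inherent ceiling of 48.5-77%, so they are cheap and useful yet cannot fully solve the problem.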
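As a small illustration of the kind of check involved, the Python sketch below parses generated code with the standard `ast` module, without executing it, and flags imports of modules that cannot be found and `module.attribute` references that the installed module does not provide. It is a simplified stand-in, not one of the tools evaluated in the paper: it covers only plain `import x` statements and direct attribute access, and it imports the real library to inspect it, which a production checker might avoid by reading type stubs. The function name `find_library_hallucinations` is hypothetical.

```python
import ast
import importlib
import importlib.util


def find_library_hallucinations(source):
    """Return messages for unknown modules and unknown module attributes in source."""
    tree = ast.parse(source)  # parse only; the generated code is never run
    aliases = {}  # local name -> importable module name
    problems = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                top = alias.name.split(".")[0]
                if importlib.util.find_spec(top) is None:
                    problems.append(f"unknown module: {alias.name}")
                elif alias.asname:
                    aliases[alias.asname] = alias.name
                else:
                    aliases[top] = top
    for node in ast.walk(tree):
        is_module_attr = (
            isinstance(node, ast.Attribute)
            and isinstance(node.value, ast.Name)
            and node.value.id in aliases
        )
        if is_module_attr:
            module_name = aliases[node.value.id]
            module = importlib.import_module(module_name)
            if not hasattr(module, node.attr):
                problems.append(f"unknown attribute: {module_name}.{node.attr}")
    return problems


generated = (
    "import math\n"
    "import totally_made_up_lib\n"
    "print(math.sqrt(4), math.cube_root(8))\n"
)
issues = find_library_hallucinations(generated)
```

A check like this cannot tell whether an existing function is called with the right meaning or arguments, which is one reason such methods have the kind of ceiling the paper quantifies.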
Details
Motivation: Large language models frequently hallucinate when generating code that calls libraries (for example, using library features that do not exist), so effective, low-cost detection and mitigation methods are needed.
Method: Systematically evaluates several static analysis tools on multiple NL-to-code benchmarks for their ability to detect hallucinations, especially library hallucinations, in LLM-generated code, and uses manual analysis to establish an upper bound on what they could detect.
Result: Static analysis detects 16-70% of all errors and 14-85% of library hallucinations. Manual analysis puts the upper bound on its potential at 48.5-77%.
Conclusion: Static analysis is a cheap and partly effective way to mitigate hallucination, but by its nature it can never fully solve hallucination in LLM code generation.
Abstract: Despite extensive research, Large Language Models continue to hallucinate when generating code, particularly when using libraries. On NL-to-code benchmarks that require library use, we find that LLMs generate code that uses non-existent library features in 8.1-40% of responses. One intuitive approach for detection and mitigation of hallucinations is static analysis. In this paper, we analyse the potential of static analysis tools, both in terms of what they can solve and what they cannot. We find that static analysis tools can detect 16-70% of all errors, and 14-85% of library hallucinations, with performance varying by LLM and dataset. Through manual analysis, we identify cases a static method could not plausibly catch, which gives an upper bound on their potential from 48.5% to 77%. Overall, we show that static analysis methods are a cheap method for addressing some forms of hallucination, and we quantify how far short of solving the problem they will always be.
[18] Sensitivity-Positional Co-Localization in GQA Transformers
Manoj Chandrashekar Rao
Main category: cs.CL
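TL;DR: This paper asks whether task-sensitive layers and the layers where positional encoding (RoPE) adaptation matters most coincide in Grouped Query Attention (GQA) transformers. It finds strong anti-localization (late layers are task-sensitive, early layers are RoPE-sensitive), yet applying both adaptation methods (LSLoRA and GARFA) to the task-sensitive layers still improves performance substantially.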
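To make the GQA head layout concrete, here is a minimal sketch of per-KV-head frequency scaling in the spirit of GARFA. The entry only says that each targeted layer gets 8 learnable per-KV-head scalar multipliers; how they enter the RoPE computation is an assumption here (every frequency of a head is multiplied by the scalar of its KV group), as are the RoPE base, the pairing convention in `apply_rope`, and all function names.

```python
import math

# Head layout of a 4:1 GQA model with 32 query heads and 8 KV heads.
NUM_Q_HEADS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128
ROPE_BASE = 500000.0  # assumed RoPE base; not stated in the entry


def kv_head_for_query_head(q_head: int) -> int:
    """Query heads are grouped in fours; each group shares one KV head."""
    return q_head // (NUM_Q_HEADS // NUM_KV_HEADS)


def rope_angles(position: int, scale: float) -> list[float]:
    """Standard RoPE angles, with every frequency multiplied by `scale`."""
    return [
        position * scale * ROPE_BASE ** (-2.0 * i / HEAD_DIM)
        for i in range(HEAD_DIM // 2)
    ]


def apply_rope(vec: list[float], angles: list[float]) -> list[float]:
    """Rotate consecutive (even, odd) pairs of `vec` by the given angles."""
    out = []
    for i, theta in enumerate(angles):
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out


# One layer's 8 scalars; 1.0 reproduces unmodified RoPE. Values are made up.
kv_scales = [1.0] * NUM_KV_HEADS
kv_scales[2] = 1.05

q_head = 9
scale = kv_scales[kv_head_for_query_head(q_head)]
rotated = apply_rope([1.0] * HEAD_DIM, rope_angles(position=7, scale=scale))
print(kv_head_for_query_head(q_head), scale, len(rotated))
```

Because the scalar is indexed by KV head, the four query heads in a group see the same adjusted frequencies, which keeps query and key rotations consistent within the group.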
Details
Motivation: To test whether the layers most sensitive to task correctness in GQA models coincide with the layers where RoPE adaptation has the greatest influence (the "co-localization hypothesis"), in order to guide more efficient parameter-efficient fine-tuning.
Method: Proposes LSLoRA (identifies task-sensitive layers with a correctness-differential hidden-state metric and restricts LoRA adaptation to them) and GARFA (attaches 8 learnable per-KV-head RoPE frequency scaling factors to each targeted layer), and runs layer-sensitivity analysis and cross-layer ablations on Llama 3.1 8B (32 layers, 4:1 Q:KV head ratio).
Result: Task-sensitive layers concentrate late in the network (layers 23-31) and RoPE-influential layers early (layers 0-9), with Spearman r_s = -0.735 (p < 0.001), confirming strong anti-localization. Even so, applying LSLoRA and GARFA together to the task-sensitive layers outperforms all other configurations on six benchmarks (by 4-16 percentage points), reaching 67.1% on HumanEval+, close to Claude 3.5 Haiku (68.3%), at a total compute cost of only 100 US dollars.
Conclusion: Task sensitivity and RoPE influence are spatially separated in GQA models, but jointly adapting attention and positional encoding in the task-sensitive layers remains the best strategy. This challenges the co-localization intuition and supports a design paradigm of "functional decoupling, joint optimization".
Abstract: We investigate a fundamental structural question in Grouped Query Attention (GQA) transformers: do the layers most sensitive to task correctness coincide with the layers where positional encoding adaptation has the greatest leverage? We term this the co-localization hypothesis and test it on Llama 3.1 8B, a 32-layer GQA model with a 4:1 query-to-key-value head ratio. We introduce LSLoRA, which restricts LoRA adaptation to layers identified via a novel correctness-differential hidden-state metric, and GARFA (GQA-Aware RoPE Frequency Adaptation), which attaches 8 learnable per-KV-head scalar multipliers to each targeted layer. Contrary to the co-localization hypothesis, we discover strong anti-localization: task-sensitive layers concentrate in the late network ($\ell\in\{23\text{-}31\}$) while RoPE-influential layers dominate the early network ($\ell\in\{0\text{-}9\}$), yielding Spearman $r_s = -0.735$ ($p = 1.66\times10^{-6}$). Despite this anti-localization, a 4-way cross-layer ablation shows that applying both interventions to the sensitivity-identified layers outperforms all alternative configurations by 4-16 percentage points across six diverse benchmarks (MMLU, GPQA, HumanEval+, MATH, MGSM, ARC), approaching Claude 3.5 Haiku on HumanEval+ (67.1% vs. 68.3%) at \$100 total compute cost.
[19] TEMPER: Testing Emotional Perturbation in Quantitative Reasoning
Atahan Dokme,Benjamin Reichman,Larry Heck
Main category: cs.CL
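TL;DR: This paper studies whether emotional wording affects large language models on quantitative reasoning tasks. Emotional framing lowers accuracy by 2-10 percentage points, but neutralizing the text recovers most of the loss, indicating that the problem stems from emotional style rather than corrupted content.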
Details
Motivation: Real-world queries often carry emotion (frustration, urgency, or enthusiasm), whereas current large language models are trained and evaluated mainly on neutral, clean language. It is therefore worth asking whether emotional framing alone, with numerical content unchanged, harms reasoning.
Method: Builds a controlled emotion translation framework that rewrites math reasoning problems into several emotional variants while keeping all quantities and logical relationships unchanged, and uses it to create the Temper-5400 benchmark (5,400 semantically verified emotion-neutral pairs), evaluated on 18 models of different scales. The effects of non-emotional paraphrasing and of emotion neutralization are also compared.
Result: (1) Emotional framing reduces accuracy by 2-10 percentage points; (2) neutralizing the emotional variants largely restores performance; (3) non-emotional surface paraphrases cause no degradation, confirming that emotional content rather than a change of form drives the drop.
Conclusion: Emotional expression by itself interferes with LLM quantitative reasoning, and the effect is reversible and robust. Future models need greater robustness to emotional style, and the work also provides a general framework for controlled style transfer and robustness evaluation.
Abstract: Large language models are trained and evaluated on quantitative reasoning tasks written in clean, emotionally neutral language. However, real-world queries are often wrapped in frustration, urgency or enthusiasm. Does emotional framing alone degrade reasoning when all numerical content is preserved? To investigate this, a controlled emotion translation framework is developed that rewrites problems into emotional variants while preserving all quantities and relationships. Using this framework, Temper-5400 (5,400 semantically verified emotion-neutral pairs) is constructed across GSM8K, MultiArith, and ARC-Challenge, and evaluated on eighteen models (1B to frontier scale). Two core results emerge: First, emotional framing reduces accuracy by 2-10 percentage points even though all numerical content is preserved. Second, neutralizing emotional variants recovers most of the lost performance, showing both that the degradation is tied to emotional style rather than content corruption and that neutralization can serve as a lightweight inference-time mitigation. Non-emotional paraphrases cause no such degradation, implicating emotional content rather than surface-level changes. Beyond emotion specifically, the benchmark construction procedure provides a general framework for controlled stylistic translation and robustness evaluation.
[20] GRASS: Gradient-based Adaptive Layer-wise Importance Sampling for Memory-efficient Large Language Model Fine-tuning
Kaiyuan Tian,Yu Tang,Gongqingjian Jiang,Baihui Liu,Yifu Gao,Xialin Su,Linbo Qiao,Dongsheng Li
Main category: cs.CL
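TL;DR: This paper proposes GRASS, a framework that uses gradient-based adaptive layer-wise importance sampling together with layer-wise optimizer state offloading to reduce memory consumption while improving fine-tuning performance.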
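The layer-sampling idea summarized above can be sketched in a few lines. This is a simplified illustration, not GRASS itself: it assumes sampling probabilities are directly proportional to each layer's mean gradient norm, and the function names, the toy gradient norms, and the without-replacement sampling loop are all hypothetical choices.

```python
import random


def layer_sampling_probs(grad_norms: dict[int, list[float]]) -> dict[int, float]:
    """Turn per-layer gradient-norm histories into sampling probabilities.

    Assumption: probability is proportional to the mean recorded norm.
    """
    means = {layer: sum(v) / len(v) for layer, v in grad_norms.items()}
    total = sum(means.values())
    return {layer: m / total for layer, m in means.items()}


def sample_layers(probs: dict[int, float], k: int, rng: random.Random) -> list[int]:
    """Draw k distinct layers, favouring those with higher probability."""
    remaining = dict(probs)
    chosen = []
    for _ in range(k):
        layers = list(remaining)
        weights = [remaining[layer] for layer in layers]
        pick = rng.choices(layers, weights=weights, k=1)[0]
        chosen.append(pick)
        del remaining[pick]  # sample without replacement
    return chosen


# Toy gradient norms recorded over two recent steps for a 3-layer model.
grad_norms = {0: [0.2, 0.4], 1: [1.0, 1.4], 2: [0.5, 0.5]}
probs = layer_sampling_probs(grad_norms)
active_layers = sample_layers(probs, k=2, rng=random.Random(0))
print(probs, active_layers)
```

Recomputing `probs` from fresh gradient norms every so often is what would make the schedule depend on the task and the training stage; only the sampled layers would then be unfrozen for the next steps.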
Details
Motivation: Existing low-rank adaptation methods limit model expressiveness, and layer-wise fine-tuning methods ignore how layer importance varies across tasks and training stages, leading to poor downstream performance.
Method: GRASS uses mean gradient norms as a task-aware and training-stage-aware measure of layer importance and dynamically adjusts layer sampling probabilities through an adaptive training strategy; it also introduces a layer-wise optimizer state offloading mechanism that overlaps computation and communication.
Result: Across multiple models and benchmarks, GRASS improves average accuracy by up to 4.38 points and reduces memory usage by up to 19.97%.
Conclusion: GRASS markedly improves performance and lowers memory overhead while maintaining training throughput, outperforming state-of-the-art methods.
Abstract: Full-parameter fine-tuning of large language models is constrained by substantial GPU memory requirements. Low-rank adaptation methods mitigate this challenge by updating only a subset of parameters. However, these approaches often limit model expressiveness and yield lower performance than full-parameter fine-tuning. Layer-wise fine-tuning methods have emerged as an alternative, enabling memory-efficient training through static layer importance sampling strategies. However, these methods overlook variations in layer importance across tasks and training stages, resulting in suboptimal performance on downstream tasks. To address these limitations, we propose GRASS, a gradient-based adaptive layer-wise importance sampling framework. GRASS utilizes mean gradient norms as a task-aware and training-stage-aware metric for estimating layer importance. Furthermore, GRASS adaptively adjusts layer sampling probabilities through an adaptive training strategy. We also introduce a layer-wise optimizer state offloading mechanism that overlaps computation and communication to further reduce memory usage while maintaining comparable training throughput. Extensive experiments across multiple models and benchmarks demonstrate that GRASS consistently outperforms state-of-the-art methods, achieving an average accuracy improvement of up to 4.38 points and reducing memory usage by up to 19.97%.
[21] AsyncTLS: Efficient Generative LLM Inference with Asynchronous Two-level Sparse Attention
Yuxuan Hu,Jianchao Tan,Jiaqi Zhang,Wen Zan,Pingwei Sun,Yifan Lu,Yerui Sun,Yuchen Xie,Xunliang Cai,Jing Zhang
Main category: cs.CL
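TL;DR: This paper proposes AsyncTLS, a hierarchical sparse attention system that combines coarse-grained block filtering with fine-grained token selection and uses an asynchronous offloading engine to overlap KV cache transfers with computation, significantly improving both the efficiency and the accuracy of long-context inference.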
Details
Motivation: Long-context inference faces the twin challenges of quadratic attention complexity and large KV cache memory overhead; existing sparse attention methods struggle to achieve both accuracy and efficiency.
Method: Proposes AsyncTLS: (1) hierarchical sparse attention, with coarse block-level filtering followed by fine token-level selection; (2) an asynchronous offloading engine that exploits temporal locality to overlap KV cache transfers with computation.
Result: On Qwen3 and GLM-4.7-Flash (GQA/MLA architectures) with 48k-96k contexts, accuracy is close to full attention, operator speedups are 1.2x-10.0x, and end-to-end throughput improves by 1.3x-4.7x.
Conclusion: AsyncTLS markedly improves long-context inference efficiency while maintaining high accuracy, offering a practical option for real-world deployment of large models.
Abstract: Long-context inference in LLMs faces the dual challenges of quadratic attention complexity and prohibitive KV cache memory. While token-level sparse attention offers superior accuracy, its indexing overhead is costly; block-level methods improve efficiency but sacrifice precision. We propose AsyncTLS, a hierarchical sparse attention system that combines coarse-grained block filtering with fine-grained token selection to balance accuracy and efficiency, coupled with an asynchronous offloading engine that overlaps KV cache transfers with computation via temporal locality exploitation. Evaluated on Qwen3 and GLM-4.7-Flash across GQA and MLA architectures, AsyncTLS achieves accuracy comparable to full attention while delivering 1.2x - 10.0x operator speedups and 1.3x - 4.7x end-to-end throughput improvements on 48k - 96k contexts.
[22] Tool Retrieval Bridge: Aligning Vague Instructions with Retriever Preferences via Bridge Model
Kunfeng Chen,Luyao Zhuang,Fei Liao,Juhua Liu,Jian Wang,Bo Du
Main category: cs.CL
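TL;DR: This paper proposes Tool Retrieval Bridge (TRB), a method for improving tool retrieval for large language models when user instructions are vague. It builds a new benchmark, VGToolBench, to confirm the problem, and uses a bridge model to rewrite vague instructions to match retriever preferences, substantially improving several baseline retrievers.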
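Retrieval quality in this entry is reported as NDCG. For readers unfamiliar with the metric, here is a minimal sketch of NDCG@k under the assumption of binary relevance (a tool is either in the gold set or not); the paper's exact evaluation setup may differ, and the tool names and rankings below are invented for illustration. The function returns a value in [0, 1].

```python
import math


def dcg(relevances: list[float]) -> float:
    """Discounted cumulative gain: hits lower in the ranking count for less."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )


def ndcg_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """NDCG@k with binary relevance."""
    gains = [1.0 if tool in relevant_ids else 0.0 for tool in ranked_ids[:k]]
    ideal_dcg = dcg([1.0] * min(k, len(relevant_ids)))
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical rankings for one query whose single gold tool is "weather_api".
gold = {"weather_api"}
ranking_vague = ["calendar_api", "maps_api", "weather_api"]
ranking_rewritten = ["weather_api", "maps_api", "calendar_api"]

print(ndcg_at_k(ranking_vague, gold, k=3))
print(ndcg_at_k(ranking_rewritten, gold, k=3))
```

Moving the gold tool from rank 3 to rank 1 raises the score, which is the kind of effect a rewritten, more specific instruction is meant to produce.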
Details
Motivation: Existing tool retrieval methods rely on academic benchmarks whose instructions contain detailed API information, whereas real-world user instructions are often vague, which degrades performance; retrieval methods suited to vague instructions are needed.
Method: Builds the VGToolBench benchmark to simulate vague instructions, and proposes TRB, which introduces a bridge model that rewrites vague instructions into more specific ones, narrowing the gap between instructions and retriever preferences.
Result: TRB brings consistent and substantial gains across retrieval settings; for example, the average NDCG score of BM25 rises from 9.73 to 19.59 (reported as a relative improvement of 111.51%).
Conclusion: TRB is a simple yet effective method that mitigates the ambiguity of vague instructions and makes tool retrieval considerably more practical in real-world settings.
Abstract: Tool learning has emerged as a promising paradigm for large language models (LLMs) to address real-world challenges. Because the pool of tools is large and irregularly updated, tool retrieval for selecting the desired tool subset is essential. However, current tool retrieval methods are usually based on academic benchmarks containing overly detailed instructions (e.g., specific API names and parameters), while real-world instructions are more vague. Such a discrepancy hinders tool retrieval in real-world applications. In this paper, we first construct a new benchmark, VGToolBench, to simulate vague human instructions. Based on this, we conduct a series of preliminary analyses and find that vague instructions indeed damage the performance of tool retrieval. To this end, we propose a simple-yet-effective Tool Retrieval Bridge (TRB) approach to boost the performance of tool retrieval for vague instructions. The principle of TRB is to introduce a bridge model to rewrite the vague instructions into more specific ones and alleviate the gap between vague instructions and retriever preferences. We conduct extensive experiments under multiple commonly used retrieval settings, and the results show that TRB effectively mitigates the ambiguity of vague instructions while delivering consistent and substantial improvements across all baseline retrievers. For example, with the help of TRB, BM25 achieves a relative improvement of up to 111.51%, i.e., increasing the average NDCG score from 9.73 to 19.59. The source code and models are publicly available at https://github.com/kfchenhn/TRB.
[23] Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers
Harsh Kohli,Srinivasan Parthasarathy,Huan Sun,Yuekun Yao
Main category: cs.CL
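TL;DR: This paper studies implicit reasoning and examines recurrent-depth transformers as a way to strengthen compositional generalization, with substantial gains on systematic generalization and depth extrapolation; it also identifies a three-stage grokking process and an "overthinking" problem.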
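The core structural idea, one shared block applied repeatedly, can be shown with a deliberately symbolic toy. This is an analogy rather than a transformer: the "block" is a one-hop lookup in a small invented fact table, so each iteration resolves one more hop of a chain. All names and facts are hypothetical.

```python
def recurrent_depth_forward(state, shared_block, num_iterations):
    """Apply the same block `num_iterations` times (weights are shared)."""
    for _ in range(num_iterations):
        state = shared_block(state)
    return state


# Invented one-hop facts: each entity points to the next one in a chain.
FACTS = {"A": "B", "B": "C", "C": "D", "D": "E"}


def one_hop(entity: str) -> str:
    """Stand-in for the shared block: resolve a single hop, if one exists."""
    return FACTS.get(entity, entity)


# The same block answers a 2-hop and a 4-hop query; only the loop count changes.
print(recurrent_depth_forward("A", one_hop, num_iterations=2))
print(recurrent_depth_forward("A", one_hop, num_iterations=4))
```

In this toy, running more iterations than the chain needs is harmless because the lookup is idempotent at the end of the chain; the entry's "overthinking" finding is precisely that learned recurrent blocks do not have this guarantee.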
Details
Motivation: Transformer LLMs store large amounts of knowledge and rules, but in implicit multi-hop reasoning they lack compositional generalization over their parametric knowledge and struggle to combine knowledge within a single forward pass.
Method: Studies recurrent-depth transformers, which iterate over the same transformer layers; trains models from scratch and runs experiments on two controlled tasks, systematic generalization and depth extrapolation, combined with mechanistic analysis and a study of training strategies.
Result: Recurrent-depth transformers substantially improve systematic generalization and depth extrapolation. Systematic generalization emerges through a three-stage grokking process; depth extrapolation can be achieved by increasing the number of recurrent iterations at inference time, but is limited by "overthinking".
Conclusion: Recurrent-depth architectures are an effective route to better compositional generalization in implicit reasoning. The ability emerges in stages, and inference-time compute depth involves a trade-off, offering new ideas and practical guidance for building more reliable reasoning models.
Abstract: We study implicit reasoning, i.e. the ability to combine knowledge or rules within a single forward pass. While transformer-based large language models store substantial factual knowledge and rules, they often fail to compose this knowledge for implicit multi-hop reasoning, suggesting a lack of compositional generalization over their parametric knowledge. To address this limitation, we study recurrent-depth transformers, which enable iterative computation over the same transformer layers. We investigate two compositional generalization challenges under the implicit reasoning scenario: systematic generalization, i.e. combining knowledge that is never used for compositions during training, and depth extrapolation, i.e. generalizing from limited reasoning depth (e.g. training on up to 5-hop) to deeper compositions (e.g. 10-hop). Through controlled studies with models trained from scratch, we show that while vanilla transformers struggle with both generalization challenges, recurrent-depth transformers can achieve both kinds of generalization. For systematic generalization, we find that this ability emerges through a three-stage grokking process, transitioning from memorization to in-distribution generalization and finally to systematic generalization, supported by mechanistic analysis. For depth extrapolation, we show that generalization beyond training depth can be unlocked by scaling inference-time recurrence, with more iterations enabling deeper reasoning.
We further study how training strategies affect extrapolation, providing guidance on training recurrent-depth transformers, and identify a key limitation, overthinking, where excessive recurrence degrades predictions and limits generalization to very deep compositions.
[24] Why Are We Lonely? Leveraging LLMs to Measure and Understand Loneliness in Caregivers and Non-caregivers
Michelle Damin Kim,Ellie S. Paek,Yufen Lin,Emily Mroz,Jane Chung,Jinho D. Choi
Main category: cs.CL
TL;DR: This paper introduces an LLM-driven pipeline using GPT-4o and GPT-5 variants to build and analyze a Reddit-based dataset for measuring loneliness in caregivers vs. non-caregivers, supported by expert-developed evaluation and cause-typology frameworks, revealing distinct loneliness patterns between groups.
Details
Motivation: To measure and compare loneliness experiences between caregiver and non-caregiver populations using scalable, high-quality social media data, addressing the lack of diverse, population-specific loneliness datasets.
Method: Developed an expert-informed loneliness evaluation framework and cause typology; applied GPT-4o, GPT-5-nano, and GPT-5 within a human-validated pipeline to construct and annotate a Reddit corpus; performed demographic extraction and comparative analysis.
Result: Loneliness evaluation achieved 76.09% (caregivers) and 79.78% (non-caregivers) accuracy; cause categorization achieved micro-F1 of 0.825 (caregivers) and 0.80 (non-caregivers); caregivers showed distinct causes (caregiving roles, identity recognition, abandonment) compared with non-caregivers.
Conclusion: The LLM-based pipeline enables robust, scalable construction and analysis of social media datasets for loneliness research, revealing meaningful population-level differences and supporting future targeted interventions.
Abstract: This paper presents an LLM-driven approach for constructing diverse social media datasets to measure and compare loneliness in the caregiver and non-caregiver populations. We introduce an expert-developed loneliness evaluation framework and an expert-informed typology for categorizing causes of loneliness for analyzing social media text. Using a human-validated data processing pipeline, we apply GPT-4o, GPT-5-nano, and GPT-5 to build a high-quality Reddit corpus and analyze loneliness across both populations. The loneliness evaluation framework achieved average accuracies of 76.09% and 79.78% for caregivers and non-caregivers, respectively. The cause categorization framework achieved micro-aggregate F1 scores of 0.825 and 0.80 for caregivers and non-caregivers, respectively. Across populations, we observe substantial differences in the distribution of types of causes of loneliness.
Caregivers' loneliness was predominantly linked to caregiving roles, identity recognition, and feelings of abandonment, indicating distinct loneliness experiences between the two groups. Demographic extraction further demonstrates the viability of Reddit for building a diverse caregiver loneliness dataset. Overall, this work establishes an LLM-based pipeline for creating high-quality social media datasets for studying loneliness and demonstrates its effectiveness in analyzing population-level differences in the manifestation of loneliness.
[25] MemReader: From Passive to Active Extraction for Long-Term Agent Memory
Jingyi Kang,Chunyu Li,Ding Chen,Bo Tang,Feiyu Xiong,Zhiyu Li
Main category: cs.CL
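TL;DR: This paper proposes the MemReader family of models, which uses active long-term memory extraction to address memory pollution, low-value writes, and inconsistency in existing systems. MemReader-0.6B is a lightweight passive extractor, while MemReader-4B is optimized with GRPO and supports reasoning-based selective memory writing. The models reach state-of-the-art results on several benchmarks and have been integrated into MemOS and put into real-world use.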
Details
Motivation: Existing long-term memory extraction is passive, one-shot transcription that copes poorly with noisy dialogue, missing references, and cross-turn dependencies, leading to memory pollution, low-value writes, and inconsistency.
Method: Proposes the MemReader family. MemReader-0.6B is a compact, distilled passive extractor that ensures accuracy and schema consistency. MemReader-4B is an active extractor optimized with Group Relative Policy Optimization (GRPO); under a ReAct-style paradigm it explicitly assesses information value, reference ambiguity, and completeness, and can choose to write, defer, retrieve, or discard.
Result: Outperforms existing extraction-based methods on LOCOMO, LongMemEval, and HaluMem; MemReader-4B reaches state-of-the-art performance on knowledge updating, temporal reasoning, and hallucination reduction.
Conclusion: The key to effective agent memory is not extracting more information but reasoning-driven, selective extraction that builds low-noise, dynamically evolving long-term memory. MemReader has been integrated into MemOS and deployed in real-world scenarios, and the models and API access have been released.
Abstract: Long-term memory is fundamental for personalized and autonomous agents, yet populating it remains a bottleneck. Existing systems treat memory extraction as a one-shot, passive transcription from context to structured entries, which struggles with noisy dialogue, missing references, and cross-turn dependencies, leading to memory pollution, low-value writes, and inconsistency. In this paper, we introduce the MemReader family for active long-term memory extraction in agent systems: MemReader-0.6B, a compact and cost-efficient passive extractor distilled for accurate and schema-consistent structured outputs, and MemReader-4B, an active extractor optimized with Group Relative Policy Optimization (GRPO) to make memory writing decisions. Under a ReAct-style paradigm, MemReader-4B explicitly evaluates information value, reference ambiguity, and completeness before acting, and can selectively write memories, defer incomplete inputs, retrieve historical context, or discard irrelevant chatter. Experiments on LOCOMO, LongMemEval, and HaluMem show that MemReader consistently outperforms existing extraction-based baselines. In particular, MemReader-4B achieves state-of-the-art performance on tasks involving knowledge updating, temporal reasoning, and hallucination reduction. These results suggest that effective agent memory requires not merely extracting more information, but performing reasoning-driven and selective memory extraction to build low-noise and dynamically evolving long-term memory.
Furthermore, MemReader has been integrated into MemOS and is being deployed in real-world applications. To support future research and adoption, we release the models and provide public API access.
[26] Contextualising (Im)plausible Events Triggers Figurative Language
Annerose Eichel,Tonmoy Rakshit,Sabine Schulte im Walde
Main category: cs.CL
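TL;DR: This paper examines the relationship between (non-)literalness and plausibility in English subject-verb-object events. Using systematically designed plausible and implausible event triples with abstract and concrete constituent categories, it compares how humans and large language models (LLMs) judge and contextualize them.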
Details
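Motivation: To investigate the relationship between (non-)literalness and event plausibility, and to understand how humans and LLMs differ in semantic judgment.
Method: Constructs event triples that are plausible or implausible and have abstract or concrete constituents, then collects and analyzes plausibility judgments and example contexts from humans and LLMs.
Result: Humans distinguish non-literal events from implausible ones in a nuanced way and make effective use of context, whereas LLMs show only shallow contextualization and tend to reinterpret implausible events as seemingly plausible, non-literal ones.
Conclusion: LLMs are systematically biased when (non-)literalness and plausibility interact, and their judgments lack the deep integration of semantics and context that humans show.
Abstract: This work explores the connection between (non-)literalness and plausibility using the example of subject-verb-object events in English. We design a systematic setup of plausible and implausible event triples in combination with abstract and concrete constituent categories. Our analysis of human and LLM-generated judgments and example contexts reveals substantial differences between assessments of plausibility. While humans excel at nuanced detection and contextualization of (non-)literal vs. implausible events, LLM results reveal only shallow contextualization patterns with a bias to trade implausibility for non-literal, plausible interpretations.
[27] Linear Representations of Hierarchical Concepts in Language Models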
Masaki Sakata,Benjamin Heinzerling,Takumi Ito,Sho Yokoi,Kentaro Inui
Main category: cs.CL
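TL;DR: This paper studies how language models encode hierarchical relations (e.g., Japan ⊂ Eastern Asia ⊂ Asia) in their internal representations. Building on Linear Relational Concepts, it trains linear transformations specific to each hierarchical depth and semantic domain, and finds that hierarchical information is encoded in a highly interpretable linear form within low-dimensional, domain-specific subspaces that are highly similar to one another.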
Details
Motivation: To determine how and to what extent hierarchical relations (geographic, categorical, and so on) are encoded in language model representations, and to address gaps in prior work regarding multi-token entities and cross-layer representations.
Method: Building on Linear Relational Concepts, trains a dedicated linear transformation for each hierarchical depth and semantic domain and compares these transformations to characterize hierarchy-related representational differences; covers multi-token entities and cross-layer representations; evaluates in-domain generalization and cross-domain transfer.
Result: Within a single semantic domain, hierarchical relations can be linearly recovered from model representations; hierarchical information is concentrated in a relatively low-dimensional, domain-specific subspace; and these subspaces are structurally highly similar across domains.
Conclusion: All language models tested encode concept hierarchies in a highly interpretable linear form, suggesting that hierarchical knowledge is a robust, strongly structured feature of model representations.
Abstract: We investigate how and to what extent hierarchical relations (e.g., Japan $\subset$ Eastern Asia $\subset$ Asia) are encoded in the internal representations of language models. Building on Linear Relational Concepts, we train linear transformations specific to each hierarchical depth and semantic domain, and characterize representational differences associated with hierarchical relations by comparing these transformations. Going beyond prior work on the representational geometry of hierarchies in LMs, our analysis covers multi-token entities and cross-layer representations. Across multiple domains we learn such transformations and evaluate in-domain generalization to unseen data and cross-domain transfer. Experiments show that, within a domain, hierarchical relations can be linearly recovered from model representations. We then analyze how hierarchical information is encoded in representation space. We find that it is encoded in a relatively low-dimensional subspace and that this subspace tends to be domain-specific. Our main result is that hierarchy representation is highly similar across these domain-specific subspaces. Overall, we find that all models considered in our experiments encode concept hierarchies in the form of highly interpretable linear representations.
[28] Data Selection for Multi-turn Dialogue Instruction Tuning
Bo Li,Shikun Zhang,Wei Ye
Main category: cs.CL
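TL;DR: This paper proposes MDS (Multi-turn Dialogue Selection), a framework that scores and filters whole multi-turn dialogues in two stages, global coverage and local structure, to improve the quality of data used for instruction tuning of language models.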
Details
Motivation: The multi-turn dialogue datasets used for instruction tuning are noisy and structurally inconsistent, with topic drift, repetitive chitchat, and mismatched answer formats, so quality needs to be improved through data selection.
Method: MDS is a dialogue-level data selection framework. The first stage, global coverage, performs bin-wise selection in the user-query trajectory space to keep representative, non-redundant dialogues. The second stage, local structure, evaluates entity-grounded topic consistency, information progress, and functional alignment of query-answer form.
Result: On three multi-turn benchmarks and an in-domain Banking test set, MDS outperforms strong single-turn selectors, dialogue-level LLM scorers, and heuristic baselines, achieves the best overall rank on both reference-free and reference-based metrics, and is more robust on long conversations under the same training budget.
Conclusion: Modeling coverage and structure jointly at the dialogue level improves multi-turn data quality more effectively than turn-by-turn processing, yielding more reliable and consistent training data for instruction tuning.
Abstract: Instruction-tuned language models increasingly rely on large multi-turn dialogue corpora, but these datasets are often noisy and structurally inconsistent, with topic drift, repetitive chitchat, and mismatched answer formats across turns. We address this from a data selection perspective and propose MDS (Multi-turn Dialogue Selection), a dialogue-level framework that scores whole conversations rather than isolated turns. MDS combines a global coverage stage that performs bin-wise selection in the user-query trajectory space to retain representative yet non-redundant dialogues, with a local structural stage that evaluates within-dialogue reliability through entity-grounded topic grounding and information progress, together with query-answer form consistency for functional alignment. MDS outperforms strong single-turn selectors, dialogue-level LLM scorers, and heuristic baselines on three multi-turn benchmarks and an in-domain Banking test set, achieving the best overall rank across reference-free and reference-based metrics, and is more robust on long conversations under the same training budget. Code and resources are included in the supplementary materials.
[29] TSUBASA: Improving Long-Horizon Personalization via Evolving Memory and Self-Learning with Context Distillation
Xinliang Frederick Zhang,Lu Wang
Main category: cs.CL
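TL;DR: This paper proposes TSUBASA, which combines dynamic memory evolution with self-learning based on context distillation to improve how personalized large language models write and read memory in long-horizon tasks. On the Qwen-3 model family it clearly outperforms existing memory-augmented systems.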
Details
Motivation: Existing personalized LLMs perform poorly on long-horizon tasks such as long-running conversations or activity tracking: conventional memory mechanisms fail to capture evolving user behavior, RAG faces a quality-efficiency tradeoff, and parametric adaptation suffers from a train-inference gap caused by scarce labeled data.
Method: Proposes TSUBASA, a two-pronged framework: (1) dynamic memory evolution to improve memory writing; (2) self-learning with a context distillation objective to improve memory reading, so that the model internalizes user experiences.
Result: Long-horizon benchmark evaluations on Qwen-3 models (4B to 32B) show that TSUBASA surpasses competitive systems that rely mainly on memory writing, such as Mem0 and Memory-R1, and achieves Pareto improvements in quality and efficiency, reducing token consumption while keeping high-fidelity personalization.
Conclusion: TSUBASA breaks the quality-efficiency bottleneck in long-horizon personalization and offers a more robust, efficient, and scalable memory-augmentation paradigm for PLLMs.
Abstract: Personalized large language models (PLLMs) have garnered significant attention for their ability to align outputs with individuals' needs and preferences. However, they still struggle with long-horizon tasks, such as tracking a user's extensive history of conversations or activities. Existing memory mechanisms often fail to capture evolving behaviors, and RAG paradigms are trapped by a quality-efficiency tradeoff. Meanwhile, parametric adaptation is bottlenecked by a train-inference gap due to the scarcity of labeled data. To enhance the long-horizon capabilities of PLLMs, we introduce TSUBASA, a two-pronged approach designed to improve memory writing via dynamic memory evolution, and memory reading via self-learning with a context distillation objective to internalize user experiences. Extensive evaluations on long-horizon benchmarks using the Qwen-3 model family (4B to 32B) validate the effectiveness of TSUBASA, surpassing competitive memory-augmented systems that rely primarily on memory writing, such as Mem0 and Memory-R1. Our analyses further confirm that TSUBASA breaks the quality-efficiency barrier to achieve Pareto improvements, delivering robust, high-fidelity personalization with a reduced token budget.
[30] HCRE: LLM-based Hierarchical Classification for Cross-Document Relation Extraction with a Prediction-then-Verification Strategy
Guoqi Ma,Liang Zhang,Hongyao Tu,Hao Fu,Hui Li,Yujie Lin,Longyue Wang,Weihua Luo,Jinsong Su
Main category: cs.CL
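TL;DR: This paper explores the use of large language models (LLMs) for cross-document relation extraction (RE) and finds that applying LLMs directly gives limited gains. It therefore proposes HCRE, a hierarchical classification model that combines a hierarchical relation tree with a prediction-then-verification inference strategy and improves performance substantially.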
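The control flow of level-by-level classification with a verification step can be sketched without any model. In the example below the LLM predictor and the verifiers are replaced by plain Python stand-ins, the relation tree and scores are invented, and "multi-view verification" is approximated by a majority vote over several verifier functions; none of these details come from the paper.

```python
def classify_hierarchically(tree, predict, verifiers, root="ROOT"):
    """Walk the relation tree from the root, verifying each step.

    tree      : dict mapping a node to its list of child nodes
    predict   : callable(node, options) -> options ranked best-first
    verifiers : callables(candidate) -> bool; a majority must agree
    """
    node = root
    while tree.get(node):
        ranked = predict(node, tree[node])
        for candidate in ranked:
            votes = sum(1 for verify in verifiers if verify(candidate))
            if votes * 2 > len(verifiers):
                node = candidate
                break
        else:
            # No candidate passed verification: fall back to the top guess.
            node = ranked[0]
    return node


# Invented two-level relation tree.
TREE = {
    "ROOT": ["person", "organization"],
    "person": ["place_of_birth", "spouse"],
    "organization": ["headquarters", "founder"],
}

# Stand-in for the LLM predictor: rank children by a fixed, made-up score.
SCORES = {"person": 0.6, "organization": 0.4, "spouse": 0.7,
          "place_of_birth": 0.3, "headquarters": 0.5, "founder": 0.5}


def toy_predict(node, options):
    return sorted(options, key=lambda o: SCORES[o], reverse=True)


# Stand-ins for verification views: two of the three reject "spouse".
toy_verifiers = [
    lambda c: c != "spouse",
    lambda c: c != "spouse",
    lambda c: True,
]

print(classify_hierarchically(TREE, toy_predict, toy_verifiers))
```

At each level the predictor only chooses among a node's children, which is the point made in the entry: the option list is far shorter than the full relation set, and verification at each level limits how far an early mistake can propagate.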
Details
Motivation: The prevailing "small language model (SLM) + classifier" paradigm is limited by the SLM's language understanding ability, yet preliminary experiments show that LLMs do not consistently surpass SLMs on cross-document RE; the bottleneck is the difficulty of classifying over a large number of predefined relations.
Method: Proposes HCRE: (1) build a hierarchical relation tree from the predefined relation set; (2) use an LLM to predict the relation level by level; (3) apply a prediction-then-verification strategy, using multi-view verification at each level to limit error propagation.
Result: HCRE significantly outperforms existing baselines on multiple datasets, validating the hierarchical classification and verification strategy.
Conclusion: For cross-document RE, LLMs need to be combined with structured priors (such as a hierarchical relation tree) and robust inference mechanisms (such as prediction-then-verification) rather than simply replacing SLMs; HCRE suggests a way to apply LLMs to fine-grained classification tasks.
Abstract: Cross-document relation extraction (RE) aims to identify relations between the head and tail entities located in different documents. Existing approaches typically adopt the paradigm of "Small Language Model (SLM) + Classifier". However, the limited language understanding ability of SLMs hinders further improvement of their performance. In this paper, we conduct a preliminary study to explore the performance of Large Language Models (LLMs) in cross-document RE. Despite their extensive parameters, our findings indicate that LLMs do not consistently surpass existing SLMs. Further analysis suggests that the underperformance is largely attributed to the challenges posed by the numerous predefined relations. To overcome this issue, we propose an LLM-based Hierarchical Classification model for cross-document RE (HCRE), which consists of two core components: 1) an LLM for relation prediction and 2) a hierarchical relation tree derived from the predefined relation set. This tree enables the LLM to perform hierarchical classification, where the target relation is inferred level by level. Since the number of child nodes is much smaller than the size of the entire predefined relation set, the hierarchical relation tree significantly reduces the number of relation options that the LLM needs to consider during inference. However, hierarchical classification introduces the risk of error propagation across levels. To mitigate this, we propose a prediction-then-verification inference strategy that improves prediction reliability through multi-view verification at each level.
Extensive experiments show that HCRE outperforms existing baselines, validating its effectiveness.
[31] Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning
Shiwan Zhao,Zhihu Wang,Xuyang Zhao,Jiaming Zhou,Caiyue Xu,Chenfei Liu,Liting Zhang,Yuhang Jia,Yanzhe Zhang,Hualong Yu,Zichen Xu,Qicheng Li,Yong Qin
Main category: cs.CL
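TL;DR: This paper presents a unified framework for understanding LLM post-training. It treats post-training as structured intervention on model behavior, divides methods by trajectory provenance into off-policy and on-policy learning, and interprets them through three roles: effective support expansion, policy reshaping, and behavioral consolidation.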
Details
Motivation: Post-training methods (SFT, preference optimization, RL, and others) are usually discussed in a fragmented way, by objective or label, without a systematic account of the behavioral bottlenecks they address.
Method: Proposes a two-dimensional analytical framework based on trajectory provenance (off-policy / on-policy) and behavioral intervention role (support expansion, policy reshaping, behavioral consolidation), and uses it to give a unified reading of the main post-training paradigms.
Result: Clarifies the role each method plays in behavioral intervention (for example, SFT can serve both support expansion and policy reshaping, preference learning is mostly off-policy reshaping, and distillation is essentially behavioral consolidation), and notes that progress in post-training increasingly depends on coordinated multi-stage system design.
Conclusion: LLM post-training should be viewed as a process of structured behavioral intervention. The framework helps diagnose bottlenecks and guide stage composition, and emphasizes that system-level coordinated design matters more than optimizing any single objective.
Abstract: Post-training has become central to turning pretrained large language models (LLMs) into aligned and deployable systems. Recent progress spans supervised fine-tuning (SFT), preference optimization, reinforcement learning (RL), process supervision, verifier-guided methods, distillation, and multi-stage pipelines. Yet these methods are often discussed in fragmented ways, organized by labels or objective families rather than by the behavioral bottlenecks they address. This survey argues that LLM post-training is best understood as structured intervention on model behavior. We organize the field first by trajectory provenance, which defines two primary learning regimes: off-policy learning on externally supplied trajectories, and on-policy learning on learner-generated rollouts. We then interpret methods through two recurring roles -- effective support expansion, which makes useful behaviors more reachable, and policy reshaping, which improves behavior within already reachable regions -- together with a complementary systems-level role, behavioral consolidation, which preserves, transfers, and amortizes behavior across stages and model transitions. This perspective yields a unified reading of major paradigms. SFT may serve either support expansion or policy reshaping, whereas preference-based methods are usually off-policy reshaping. On-policy RL often improves behavior on learner-generated states, though under stronger guidance it can also make hard-to-reach reasoning paths reachable. Distillation is often best understood as consolidation rather than only compression, and hybrid pipelines emerge as coordinated multi-stage compositions.
Overall, the framework helps diagnose post-training bottlenecks and reason about stage composition, suggesting that progress in LLM post-training increasingly depends on coordinated system design rather than any single dominant objective.
[32] Rethinking Data Mixing from the Perspective of Large Language Models
Yuanjian Xu,Tianze Sun,Changwei Xu,XinLong Zhao,Jianing Hao,Ran Chen,Yang Liu,Ruijie Xu,Stephen Chen,Guang Zhang
Main category: cs.CL
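TL;DR: This paper proposes DoGraph, a framework that establishes a theoretical link between gradient dynamics and domain distributions and formulates data scheduling as a graph-constrained optimization problem, addressing basic questions in LLM training: how domains are defined, whether human and model perceptions of domains align, and how domain weighting affects generalization.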
Details
Motivation: Data mixing strategy is critical to LLM training, but there is little theoretical understanding of basic questions such as what defines a "domain", whether humans and models perceive domains consistently, and how domain weighting influences generalization.
Method: Establishes formal connections between gradient dynamics and domain distributions to build a theoretical framework, then proposes DoGraph, a reweighting framework that formulates data scheduling as a graph-constrained optimization problem.
Result: Extensive experiments on GPT-2 models of varying scales show that DoGraph consistently achieves competitive performance.
Conclusion: Domain-aware data reweighting via theory-driven graph optimization can improve generalization in LLM training and offers a quantifiable way to analyze the alignment between human and model perceptions of domains.
Abstract: Data mixing strategy is essential for large language model (LLM) training. Empirical evidence shows that inappropriate strategies can significantly reduce generalization. Although recent methods have improved empirical performance, several fundamental questions remain open: what constitutes a domain, whether human and model perceptions of domains are aligned, and how domain weighting influences generalization. We address these questions by establishing formal connections between gradient dynamics and domain distributions, offering a theoretical framework that clarifies the role of domains in training dynamics. Building on this analysis, we introduce DoGraph, a reweighting framework that formulates data scheduling as a graph-constrained optimization problem. Extensive experiments on GPT-2 models of varying scales demonstrate that DoGraph consistently achieves competitive performance.
[33] AtomEval: Atomic Evaluation of Adversarial Claims in Fact Verification
Hongyi Cen,Mingxin Wang,Yule Liu,Jingyi Zheng,Hanze Jia,Tan Tang,Yingcai Wu
Main category: cs.CL
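TL;DR: This paper proposes AtomEval, a framework that evaluates the truth-conditional consistency of adversarial claim rewrites using SROM atomic decomposition and Atomic Validity Scoring (AVS), overcoming limitations of conventional metrics and exposing overlooked problems in current adversarial evaluation practice.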
Details
Motivation: Standard evaluation metrics cannot capture truth-conditional consistency and often count semantically corrupted rewrites as successful, so a more effective evaluation framework is needed.
Method: Proposes AtomEval, which decomposes claims into SROM (subject-relation-object-modifier) atoms and uses Atomic Validity Scoring (AVS) to detect factual corruption.
Result: Experiments on FEVER show that AtomEval gives more reliable evaluation signals than conventional metrics, and that stronger LLM generators do not necessarily produce more effective adversarial claims.
Conclusion: AtomEval makes adversarial fact-checking evaluation more sensitive to truthfulness, shows that assumptions about the effectiveness of current LLM-based adversarial generation do not always hold, and encourages more rigorous evaluation practice.
Abstract: Adversarial claim rewriting is widely used to test fact-checking systems, but standard metrics fail to capture truth-conditional consistency and often label semantically corrupted rewrites as successful. We introduce AtomEval, a validity-aware evaluation framework that decomposes claims into subject-relation-object-modifier (SROM) atoms and scores adversarial rewrites with Atomic Validity Scoring (AVS), enabling detection of factual corruption beyond surface similarity. Experiments on the FEVER dataset across representative attack strategies and LLM generators show that AtomEval provides more reliable evaluation signals. Using AtomEval, we further analyze LLM-based adversarial generators and observe that stronger models do not necessarily produce more effective adversarial claims under validity-aware evaluation, highlighting previously overlooked limitations in current adversarial evaluation practices.
[34] Kathleen: Oscillator-Based Byte-Level Text Classification Without Tokenization or Attention
George Fountzoulas
Main category: cs.CL
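TL;DR: Kathleen is a text classification architecture that operates directly on raw UTF-8 bytes with frequency-domain processing. It needs no tokenizer and no attention mechanism, has only 733K parameters, and outperforms larger tokenized models on several benchmark datasets.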
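To make the idea of linear-cost processing on raw bytes concrete, here is a minimal sketch, not the paper's implementation. It assumes a single damped-sinusoid filter with impulse response h[n] = exp(-decay·n)·cos(omega·n), evaluated with a one-pole complex recurrence so that the cost grows as O(L) in the sequence length. The function names, the byte scaling by 255, and the parameter values are hypothetical.

```python
import cmath
import math


def text_to_signal(text):
    """Map text to a sequence of floats in [0, 1], one per UTF-8 byte (no tokenizer)."""
    return [byte / 255.0 for byte in text.encode("utf-8")]


def damped_sinusoid_filter(signal, decay, omega):
    """Convolve with h[n] = exp(-decay*n) * cos(omega*n) in O(L) time.

    The complex state obeys state[t] = pole * state[t-1] + x[t], so its real
    part equals the causal convolution of the real input with h.
    """
    pole = cmath.exp(complex(-decay, omega))
    state = 0j
    output = []
    for sample in signal:
        state = pole * state + sample
        output.append(state.real)
    return output


# Hypothetical usage: one oscillator applied to a short byte sequence.
signal = text_to_signal("héllo")
filtered = damped_sinusoid_filter(signal, decay=0.1, omega=0.5)
```

The recurrence keeps only one complex number of state per oscillator, which is why memory does not grow with sequence length in this sketch.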
Details
Motivation: Avoid the reliance of conventional NLP models on tokenization, large parameter counts, and high computational complexity (such as the O(L^2) cost of Transformers), and explore lighter, more efficient, and more scalable byte-level modeling. Method: Introduces three new components: (1) RecurrentOscillatorBanks (damped sinusoid convolutions with temporal memory, giving O(L) sequence processing); (2) an FFT-Rotate Wavetable Encoder (maps the 256 byte values with a single learnable vector, replacing the embedding table); (3) PhaseHarmonics (a sinusoidal nonlinearity with only 6 parameters). The overall design is a lightweight, purely frequency-domain architecture with no attention and no tokenizer. Result: Kathleen-Clean reaches 88.6% on IMDB, 92.3% on AG News, and 83.3% on SST-2, outperforming a tokenized baseline with 16x more parameters on IMDB and AG News. PhaseHarmonics contributes +2.6% accuracy with only 6 parameters. Inference runs in O(L) time and memory, which supports very long byte sequences. Conclusion: Frequency-domain components can outperform complex cognitive architectures at a very small parameter count, supporting byte-level, attention-free, low-complexity designs for text classification. Abstract: We present Kathleen, a text classification architecture that operates directly on raw UTF-8 bytes using frequency-domain processing, requiring no tokenizer, no attention mechanism, and only 733K parameters. Kathleen introduces three novel components: (1) RecurrentOscillatorBanks, damped sinusoid convolutions with temporal memory for O(L) sequence processing; (2) an FFT-Rotate Wavetable Encoder that maps all 256 byte values using a single learnable vector (256 floats), replacing conventional embedding tables (65K parameters) while improving accuracy; (3) PhaseHarmonics, a sinusoidal non-linearity with just 6 learnable phase parameters that our ablation identifies as the single most impactful component (+2.6% accuracy, <0.001% of model parameters). Through comprehensive ablation of a 1.8M-parameter predecessor, we show that frequency-domain components systematically outperform complex cognitive architectures: removing a 560K-parameter bio-inspired framework costs only -0.2%, while removing the 6-parameter PhaseHarmonics costs -2.6%. The resulting Kathleen-Clean achieves 88.6% on IMDB, 92.3% on AG News, and 83.3% on SST-2, outperforming a tokenized counterpart with 16x more parameters on IMDB (+1.6%) and AG News (+2.1%). Kathleen processes sequences in O(L) time and memory, enabling byte-level operation at sequence lengths where O(L^2) Transformers exhaust GPU memory.
[35] A Decomposition Perspective to Long-context Reasoning for LLMs
Yanling Xiao,Huaibing Xie,Guoliang Zhao,Shihan Dou,Shaolei Wang,Yiting Liu,Nantao Zheng,Cheng Zhang,Pluto Zhou,Zhisong Zhang,Lemao Liu
Main category: cs.CL
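TL;DR: This paper decomposes long-context reasoning into basic atomic skills and applies reinforcement learning on synthesized pseudo datasets to strengthen those skills, improving the long-context reasoning ability of large language models.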
Details
Motivation: Current long-context reasoning research overlooks the internal complexity of the task and lacks fine-grained modeling of the underlying abilities. Method: Decomposes long-context reasoning into a set of atomic skills, automatically synthesizes a targeted pseudo dataset for each skill, and trains the model with reinforcement learning on these datasets. Result: Across several long-text benchmarks (Loogle, Loong, LongBench-v2, and others), the average score rises by 7.7 percentage points, from 46.3% to 54.0%. Conclusion: Improving atomic skills effectively strengthens overall long-context reasoning, supporting fine-grained skill decomposition with targeted optimization. Abstract: Long-context reasoning is essential for complex real-world applications, yet remains a significant challenge for Large Language Models (LLMs). Despite the rapid evolution in long-context reasoning, current research often overlooks the internal complexity of the long-context reasoning task itself. In this paper, we move beyond this holistic view and decompose long-context reasoning into a set of fundamental atomic skills, and we then automatically synthesize a suite of pseudo datasets, each explicitly targeting a specific atomic skill. Our empirical analysis confirms that proficiency in these atomic skills is strongly correlated with general long-text reasoning performance. Building on this insight, we employ reinforcement learning on these pseudo datasets to sharpen the model's atomic skills, in the hope of boosting its general long-context reasoning ability. Extensive experiments across multiple benchmarks demonstrate the effectiveness of our approach: it outperforms a strong baseline by an average margin of 7.7% (improving from 46.3% to 54.0%) across Loogle, Loong, LongBench-v2, BrowscompLong, Ruler-qa2, and MRCR.
[36] Rag Performance Prediction for Question Answering
Or Dado,David Carmel,Oren Kurland
Main category: cs.CL
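TL;DR: This paper studies how to predict the performance gain of using RAG (retrieval-augmented generation) for question answering relative to not using it. It compares pre-retrieval, post-retrieval, and post-generation predictors and proposes a new supervised predictor that explicitly models the semantic relationships among the question, the retrieved passages, and the generated answer; this predictor gives the best prediction quality.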
Details
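Motivation: Predict whether RAG will improve question answering performance, so that RAG can be enabled selectively to save compute or improve efficiency. Method: Evaluates several pre-retrieval and post-retrieval predictors that originate from ad hoc retrieval, and designs and tests several post-generation predictors, including a new supervised predictor that explicitly models the semantic relationships among the question, the retrieved passages, and the generated answer. Result: The new supervised predictor gives the best prediction of RAG gain, outperforming the pre-retrieval and post-retrieval predictors. Conclusion: Supervised prediction that explicitly models the semantic relationships among question, retrieved content, and generated answer is the most effective way to predict RAG gain; post-generation predictors show more potential than traditional retrieval-stage predictors. Abstract: We address the task of predicting the gain of using RAG (retrieval augmented generation) for question answering with respect to not using it. We study the performance of a few pre-retrieval and post-retrieval predictors originally devised for ad hoc retrieval. We also study a few post-generation predictors, one of which is novel to this study and posts the best prediction quality. Our results show that the most effective prediction approach is a novel supervised predictor that explicitly models the semantic relationships among the question, retrieved passages, and the generated answer.
[37] Guaranteeing Knowledge Integration with Joint Decoding for Retrieval-Augmented Generation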
Zhengyi Zhao,Shubo Zhang,Zezhong Wang,Yuxi Zhang,Huimin Wang,Yutian Zhao,Yefeng Zheng,Binyang Li,Kam-Fai Wong,Xian Wu
Main category: cs.CL
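TL;DR: This paper proposes GuarantRAG, a framework that explicitly decouples reasoning from evidence integration. It produces an Inner-Answer (based on parametric knowledge) and a Refer-Answer (forced to rely on external evidence through a contrastive DPO objective), then fuses the strengths of both with joint decoding, improving accuracy and reducing hallucinations on several QA benchmarks.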
Details
Motivation: Existing RAG methods can retrieve relevant documents, but LLMs often fail to use them because internal parametric knowledge conflicts with the external evidence, an "integration bottleneck"; resolving this conflict implicitly in a single generation pass works poorly. Method: Proposes GuarantRAG: (1) generate an Inner-Answer that relies only on parametric knowledge to capture the reasoning flow; (2) train a Refer-Answer with a contrastive DPO objective that treats the Inner-Answer as the negative sample and the retrieved documents as the positive sample, suppressing hallucinations; (3) apply a token-level dynamic joint decoding mechanism that fuses the logical coherence of the Inner-Answer with the factual precision of the Refer-Answer. Result: On five QA benchmarks, accuracy improves by up to 12.1% and hallucinations drop by 16.3% compared with standard and dynamic RAG baselines. Conclusion: Explicitly decoupling reasoning from evidence integration, and optimizing the two jointly through contrastive learning and joint decoding, relieves the integration bottleneck in RAG more effectively and improves factual consistency and generation quality. Abstract: Retrieval-Augmented Generation (RAG) significantly enhances Large Language Models (LLMs) by providing access to external knowledge. However, current research primarily focuses on retrieval quality, often overlooking the critical "integration bottleneck": even when relevant documents are retrieved, LLMs frequently fail to utilize them effectively due to conflicts with their internal parametric knowledge. In this paper, we argue that implicitly resolving this conflict in a single generation pass is suboptimal. We introduce GuarantRAG, a framework that explicitly decouples reasoning from evidence integration. First, we generate an "Inner-Answer" based solely on parametric knowledge to capture the model's reasoning flow. Second, to guarantee faithful evidence extraction, we generate a "Refer-Answer" using a novel Contrastive DPO objective. This objective treats the parametric Inner-Answer as a negative constraint and the retrieved documents as positive ground truth, forcing the model to suppress internal hallucinations in favor of external evidence during this phase. Finally, rather than naive concatenation or using the DPO trained model directly, we propose a joint decoding mechanism that dynamically fuses the logical coherence of the Inner-Answer with the factual precision of the Refer-Answer at the token level. Experiments on five QA benchmarks demonstrate that GuarantRAG improves accuracy by up to 12.1% and reduces hallucinations by 16.3% compared to standard and dynamic RAG baselines.
[38] Efficient Provably Secure Linguistic Steganography via Range Coding
Ruiyi Yan,Yugo Murawaki
Main category: cs.CL
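TL;DR: This paper proposes an efficient, provably secure linguistic steganography method based on range coding with a rotation mechanism, which substantially raises embedding capacity and speed while remaining highly covert.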
Details
Motivation: Achieve provable security in linguistic steganography, in particular the long-standing challenge of combining perfect imperceptibility (zero KL divergence) with high embedding capacity. Method: Directly applies a classic entropy coding method (range coding) and adds a rotation mechanism to build an efficient and provably secure linguistic steganography scheme. Result: Experiments on several language models show embedding efficiency close to 100% and embedding speeds of up to 1554.66 bits/s (on GPT-2), outperforming existing baselines. Conclusion: The method substantially improves the embedding capacity and efficiency of linguistic steganography while preserving provable security and perfect imperceptibility, opening a path toward practical use. Abstract: Linguistic steganography involves embedding secret messages within seemingly innocuous texts to enable covert communication. Provable security, which is a long-standing goal and key motivation, has been extended to language-model-based steganography. Previous provably secure approaches have achieved perfect imperceptibility, measured by zero Kullback-Leibler (KL) divergence, but at the expense of embedding capacity. In this paper, we attempt to directly use a classic entropy coding method (range coding) to achieve secure steganography, and then propose an efficient and provably secure linguistic steganographic method with a rotation mechanism. Experiments across various language models show that our method achieves around 100% entropy utilization (embedding efficiency) for embedding capacity, outperforming the existing baseline methods. Moreover, it achieves high embedding speeds (up to 1554.66 bits/s on GPT-2). The code is available at github.com/ryehr/RRC_steganography.
[39] Dual-Pool Token-Budget Routing for Cost-Efficient and Reliable LLM Serving
Xunzhuo Liu,Bowei He,Xue Liu,Andy Luo,Haichen Zhang,Huamin Chen
Main category: cs.CL
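TL;DR: This paper proposes dual-pool token-budget routing. It partitions a homogeneous GPU fleet into a high-throughput short-context pool and a high-capacity long-context pool, and routes each request by a token budget estimated from a bytes-to-token ratio learned online, substantially improving the efficiency of vLLM serving.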
Details
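Motivation: Production vLLM deployments usually provision every instance for the worst case (long contexts), which causes heavy KV-cache over-allocation and low concurrency utilization. The 80-95% of requests that are short are therefore served inefficiently, wasting 4-8× throughput capacity and causing reliability problems such as OOM crashes, preemption, and request rejections. The root cause is a mismatch between configuration and actual traffic. Method: Proposes dual-pool token-budget routing: the fleet is partitioned into two specialized pools, and each request is routed by its estimated total token budget. The estimate uses a per-category bytes-to-token ratio that is learned online from prompt_tokens feedback via an exponential moving average, so no tokenizer is needed. A simple analytical model predicts the cost savings for different workloads. Result: On real Azure and LMSYS traces (Llama-3-70B on A100), GPU-hours fall by 31-42%, corresponding to $2.86M in annual savings; preemption rates fall by 5.4× and P99 TTFT improves by 6%. A case study with Qwen3-235B-A22B on MI300X projects $15.4M in annual savings. Dispatch overhead is only O(1), and the method adapts automatically to heterogeneous traffic and is compatible with existing optimizations such as PagedAttention. Conclusion: Dual-pool token-budget routing is a lightweight, adaptive, plug-and-play dispatch strategy that relieves the configuration-traffic mismatch and substantially improves the resource efficiency and economics of production vLLM deployments without sacrificing reliability. Abstract: Production vLLM fleets typically provision each instance for the worst-case context length, leading to substantial KV-cache over-allocation and under-utilized concurrency. In practice, 80-95% of requests are short, yet are served under configurations optimized for long contexts, wasting 4-8× throughput capacity and triggering reliability issues such as OOM crashes, preemption, and request rejections. We identify a common root cause for these inefficiencies: configuration-traffic mismatch. We propose dual-pool token-budget routing, a lightweight dispatch mechanism that partitions a homogeneous fleet into two specialized pools: a high-throughput short-context pool and a high-capacity long-context pool. Each request is routed based on its estimated total token budget, computed using a per-category bytes-to-token ratio that is learned online via exponential moving average from usage.prompt_tokens feedback, eliminating the need for a tokenizer. We also develop a simple analytical model that predicts fleet-level cost savings from workload characteristics and measured throughput differences, enabling practitioners to estimate benefits prior to deployment.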
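The routing rule in the abstract above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' code: the class name, the pool threshold `short_pool_limit`, the smoothing factor `alpha`, and the initial ratio `default_ratio` are all hypothetical, and the budget is assumed to be estimated prompt tokens plus the requested maximum output tokens.

```python
from dataclasses import dataclass, field


@dataclass
class TokenBudgetRouter:
    """Route requests to a short- or long-context pool from byte counts alone."""

    short_pool_limit: int = 8192   # hypothetical: largest budget the short pool accepts
    alpha: float = 0.1             # hypothetical: EMA smoothing factor
    default_ratio: float = 4.0     # hypothetical: initial bytes-per-token guess
    ratios: dict = field(default_factory=dict)  # per-category bytes-per-token EMA

    def estimate_budget(self, category: str, prompt: str, max_output_tokens: int) -> int:
        """Estimate prompt tokens from UTF-8 byte length, then add the output budget."""
        ratio = self.ratios.get(category, self.default_ratio)
        prompt_bytes = len(prompt.encode("utf-8"))
        return int(prompt_bytes / ratio) + max_output_tokens

    def route(self, category: str, prompt: str, max_output_tokens: int) -> str:
        """Return the name of the pool that should serve this request."""
        budget = self.estimate_budget(category, prompt, max_output_tokens)
        return "short" if budget <= self.short_pool_limit else "long"

    def observe(self, category: str, prompt: str, prompt_tokens: int) -> None:
        """Update the category's ratio from the prompt token count reported after serving."""
        if prompt_tokens <= 0:
            return
        observed = len(prompt.encode("utf-8")) / prompt_tokens
        previous = self.ratios.get(category, self.default_ratio)
        self.ratios[category] = (1 - self.alpha) * previous + self.alpha * observed
```

Each routing decision is a dictionary lookup and a division, which is consistent with the constant dispatch overhead the abstract reports; the feedback call would be made with the `usage.prompt_tokens` value returned by the serving backend.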
Evaluations on real-world traces from the Azure LLM Inference Dataset and LMSYS-Chat-1M, serving Llama-3-70B on A100 GPUs, show that our approach reduces GPU-hours by 31-42%, corresponding to $2.86M annual savings at fleet scale, while lowering preemption rates by 5.4× and improving P99 TTFT by 6%. A case study with Qwen3-235B-A22B on AMD MI300X at 10,000 req/s projects $15.4M in annual savings. The method incurs only O(1) dispatch overhead, adapts automatically to heterogeneous workloads, and composes seamlessly with existing optimizations such as PagedAttention, continuous batching, and prefill-decode disaggregation.
[40] Quantum Vision Theory Applied to Audio Classification for Deepfake Speech Detection
Khalid Zaman,Melike Sah,Anuwat Chaiwongyenc,Cem Direkoglu
Main category: cs.CL
TL;DR: 本文提出量子视觉(QV)理论,将量子物理中的波粒二象性思想引入深度学习音频分类,特别是深度伪造语音检测;通过QV块将语音特征(如STFT、梅尔谱图、MFCC)转化为信息波,再输入CNN或ViT模型,显著提升了在ASVSpoof数据集上的检测性能。
Details
Motivation: Extend QV theory, already shown to be effective in image classification, to the audio domain and examine its suitability and advantages for deepfake speech detection. Method: Designs a QV block that converts time-frequency features of speech (STFT, Mel-spectrograms, MFCC) into information waves, builds QV-CNN and QV-ViT models, and trains and evaluates them on the ASVspoof dataset. Result: QV-CNN and QV-ViT both outperform their baseline counterparts; QV-CNN with MFCC reaches 94.20% accuracy (EER 9.04%), and QV-CNN with Mel-spectrograms reaches the highest accuracy, 94.57%. Conclusion: QV theory is an effective and promising quantum-inspired approach that offers a new direction for audio perception tasks, especially deepfake speech detection. Abstract: We propose Quantum Vision (QV) theory as a new perspective for deep learning-based audio classification, applied to deepfake speech detection. Inspired by particle-wave duality in quantum physics, QV theory is based on the idea that data can be represented not only in its observable, collapsed form, but also as information waves. In conventional deep learning, models are trained directly on these collapsed representations, such as images. In QV theory, inputs are first transformed into information waves using a QV block, and then fed into deep learning models for classification. QV-based models improve performance in image classification compared to their non-QV counterparts. What if QV theory is applied to speech spectrograms for audio classification tasks? This is the motivation and novelty of the proposed approach. In this work, Short-Time Fourier Transform (STFT), Mel-spectrograms, and Mel-Frequency Cepstral Coefficients (MFCC) of speech signals are converted into information waves using the proposed QV block and used to train QV-based Convolutional Neural Networks (QV-CNN) and QV-based Vision Transformers (QV-ViT). Extensive experiments are conducted on the ASVspoof dataset for deepfake speech classification. The results show that QV-CNN and QV-ViT consistently outperform standard CNN and ViT models, achieving higher classification accuracy and improved robustness in distinguishing genuine and spoofed speech. Moreover, the QV-CNN model using MFCC features achieves the best overall performance on the ASVspoof dataset, with an accuracy of 94.20% and an EER of 9.04%, while the QV-CNN with Mel-spectrograms attains the highest accuracy of 94.57%.
These findings demonstrate that QV theory is an effective and promising approach for audio deepfake detection and opens new directions for quantum-inspired learning in audio perception tasks.
[41] Initialisation Determines the Basin: Efficient Codebook Optimisation for Extreme LLM Quantization
Ian W. Kennedy,Nafise Sadat Moosavi
Main category: cs.CL
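TL;DR: This paper proposes OA-EM, an output-aware EM initialisation method that addresses the performance bottleneck of additive quantization (AQ) at extremely low 2-bit precision. The core problem is that conventional greedy sequential initialisation traps optimisation in poor regions. OA-EM initialises with a Hessian-weighted Mahalanobis distance, substantially improves quality after PV-tuning across several models and compression rates, and shows that the representational ratio ρ = N/KM governs sensitivity to initialisation.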
Details
Motivation: Additive quantization (AQ) often fails catastrophically at 2-bit precision, and extensive search and fine-tuning do little to help; the authors identify codebook initialisation as the underlying bottleneck. Method: Proposes OA-EM (Output-Aware Expectation-Maximization), which initialises codebooks in an output-aware way within an EM framework using a Hessian-weighted Mahalanobis distance; introduces the representational ratio ρ = N/KM to characterise the relationship between the number of weight groups and codebook capacity, and uses it to analyse sensitivity to initialisation. Result: On Llama 3.2 3B, Llama 3.1 8B, and Qwen 2.5 3B, OA-EM outperforms the baseline across compression rates and search budgets. At 2 bpp in particular, it avoids order-of-magnitude perplexity degradation, improves quality after PV-tuning, and dominates the quality-compute frontier. Conclusion: Codebook initialisation is the key bottleneck for extremely low-bit AQ, and its severity scales with the representational ratio ρ. A good geometry-aware initialisation such as OA-EM can dominate the effect of later search and fine-tuning, which highlights the central role of optimisation geometry in compressed model spaces. Abstract: Additive quantization enables extreme LLM compression with O(1) lookup-table dequantization, making it attractive for edge deployment. Yet at 2-bit precision, it often fails catastrophically, even with extensive search and finetuning. We show that the dominant bottleneck is codebook initialisation. Greedy sequential initialisation frequently places the model in poor optimisation regions that subsequent beam search and PV-tuning struggle to overcome. We analyse this behaviour through the representational ratio ρ = N/KM, which characterises the relationship between weight groups and codebook capacity, and propose OA-EM, an output-aware EM initialisation method using Hessian-weighted Mahalanobis distance. Across compression rates, search budgets, and three architectures (Llama 3.2 3B, Llama 3.1 8B, Qwen 2.5 3B), OA-EM consistently produces better solutions after PV-tuning and dominates the quality-compute frontier. The severity of the bottleneck scales with ρ: moderate at 3 bpp but extreme at 2 bpp, where poor initialisation can degrade perplexity by orders of magnitude. More broadly, our results highlight the importance of optimisation geometry in compressed model spaces, where initialisation can dominate subsequent search and fine-tuning.
[42] LLM-Based Data Generation and Clinical Skills Evaluation for Low-Resource French OSCEs
Tian Huang,Tom Bourgeade,Irina Illina
Main category: cs.CL
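TL;DR: This paper proposes a method that uses large language models (LLMs) to automatically generate and evaluate OSCE doctor-patient dialogues in a low-resource French setting. With controlled synthesis and silver labeling, it reaches evaluation accuracy comparable to GPT-4o (about 90%) and supports locally deployable, privacy-preserving assessment systems for medical education.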
Details
Motivation: OSCE training in France is limited by staffing and logistics, so students lack opportunities for repeated practice and structured feedback; in addition, annotated real French OSCE transcripts are extremely scarce, which limits reproducible research and reliable benchmarking. Method: Builds a controlled generation pipeline that produces ideal and perturbed (simulating different skill levels) French doctor-patient dialogues guided by scenario-specific evaluation criteria; applies an LLM-assisted framework to silver-label the synthetic dialogues automatically with adjustable strictness; and benchmarks multiple open-source and proprietary LLMs. Result: Mid-size models (≤32B parameters) reach about 90% evaluation accuracy on the synthetic data, comparable to GPT-4o. Conclusion: In the low-resource French OSCE setting, LLMs can efficiently generate high-quality silver-labeled dialogue data and support locally deployable, privacy-preserving automated evaluation systems. Abstract: Objective Structured Clinical Examinations (OSCEs) are the standard method for assessing medical students' clinical and communication skills through structured patient interviews. In France, however, the organization of training sessions is limited by human and logistical constraints, restricting students' access to repeated practice and structured feedback. Recent advances in Natural Language Processing (NLP) and Large Language Models (LLMs) now offer the opportunity to automatically evaluate such medical interviews, thereby alleviating the need for human examiners during training. Yet, real French OSCE annotated transcripts remain extremely scarce, limiting reproducible research and reliable benchmarking. To address these challenges, we investigate the use of LLMs for both generating and evaluating French OSCE dialogues in a low-resource context. We introduce a controlled pipeline that produces synthetic doctor-patient interview transcripts guided by scenario-specific evaluation criteria, combining ideal and perturbed performances to simulate varying student skill levels. The resulting dialogues are automatically silver-labeled through an LLM-assisted framework supporting adjustable evaluation strictness. Benchmarking multiple open-source and proprietary LLMs shows that mid-size models (≤32B parameters) achieve accuracies comparable to GPT-4o (~90%) on synthetic data, highlighting the feasibility of locally deployable, privacy-preserving evaluation systems for medical education.
[43] Graph Neural Networks for Misinformation Detection: Performance-Efficiency Trade-offs
Soveatin Kuntur,Maciej Krzywda,Anna Wróblewska,Marcin Paprzycki,Maria Ganzha,Szymon Łukasik,Amir H. Gandomi
Main category: cs.CL
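TL;DR: This paper compares graph neural networks (GNNs) with traditional machine learning methods on multilingual misinformation detection under controlled conditions. Lightweight GNNs (such as GraphSAGE and ChebNet) achieve clearly higher F1 scores than Logistic Regression, SVM, and MLP with comparable or shorter inference times, showing that classic GNNs remain efficient and practical.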
Details
Motivation: Current misinformation detection models (such as large language models and hybrid architectures) are computationally expensive and hard to deploy, so lighter and more practical alternatives need to be evaluated. Method: On seven public datasets in English, Indonesian, and Polish, all models use the same TF-IDF features; lightweight GNNs (GCN, GraphSAGE, GAT, ChebNet) are compared with non-graph models (Logistic Regression, SVM, MLP) using F1 score and inference time. Result: GNNs outperform the non-graph baselines on all datasets. GraphSAGE reaches 96.8% F1 on Kaggle and 91.9% on WELFake, against 73.2% and 66.8% for MLP. On COVID-19, GraphSAGE leads by 15.6 points (90.5% vs. 74.9%), and on FakeNewsNet, ChebNet leads by 12.7 points (79.1% vs. 66.4%), with comparable or lower inference times. Conclusion: Classic lightweight GNNs combine strong performance with high efficiency for misinformation detection, which challenges the trend toward ever more complex models and offers a better option for real deployments. Abstract: The rapid spread of online misinformation has led to increasingly complex detection models, including large language models and hybrid architectures. However, their computational cost and deployment limitations raise concerns about practical applicability. In this work, we benchmark graph neural networks (GNNs) against non-graph-based machine learning methods under controlled and comparable conditions. We evaluate lightweight GNN architectures (GCN, GraphSAGE, GAT, ChebNet) against Logistic Regression, Support Vector Machines, and Multilayer Perceptrons across seven public datasets in English, Indonesian, and Polish. All models use identical TF-IDF features to isolate the impact of relational structure. Performance is measured using F1 score, with inference time reported to assess efficiency. GNNs consistently outperform non-graph baselines across all datasets. For example, GraphSAGE achieves 96.8% F1 on Kaggle and 91.9% on WELFake, compared to 73.2% and 66.8% for MLP, respectively. On COVID-19, GraphSAGE reaches 90.5% F1 vs. 74.9%, while ChebNet attains 79.1% vs. 66.4% on FakeNewsNet. These gains are achieved with comparable or lower inference times. Overall, the results show that classic GNNs remain effective and efficient, challenging the need for increasingly complex architectures in misinformation detection.
[44] Clickbait detection: quick inference with maximum impact
Soveatin Kuntur,Panggih Kusuma Ningrum,Anna Wróblewska,Maria Ganzha,Marcin Paprzycki
Main category: cs.CL
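TL;DR: This paper proposes a lightweight hybrid approach to clickbait detection that combines OpenAI semantic embeddings with six compact heuristic features, reduces the embeddings with PCA, and evaluates XGBoost, GraphSAGE, and GCN classifiers. The graph neural network models keep competitive performance while greatly reducing inference time.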
Details
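Motivation: Improve the efficiency and practicality of clickbait detection, balancing performance and speed, especially in resource-constrained settings. Method: Combines OpenAI semantic embeddings with six heuristic features that capture stylistic and informational cues; reduces the embeddings with PCA; models with XGBoost, GraphSAGE, and GCN; evaluates F1-score and ROC-AUC. Result: The graph neural networks (GraphSAGE, GCN) reach F1 scores close to the traditional model (XGBoost) with substantially lower inference time, and high ROC-AUC values indicate strong discrimination capability. Conclusion: The lightweight hybrid approach, and the graph models in particular, balances efficiency and performance well for clickbait detection and is suitable for practical deployment. Abstract: We propose a lightweight hybrid approach to clickbait detection that combines OpenAI semantic embeddings with six compact heuristic features capturing stylistic and informational cues. To improve efficiency, embeddings are reduced using PCA and evaluated with XGBoost, GraphSAGE, and GCN classifiers. While the simplified feature design yields slightly lower F1-scores, graph-based models achieve competitive performance with substantially reduced inference time. High ROC-AUC values further indicate strong discrimination capability, supporting reliable detection of clickbait headlines under varying decision thresholds.
[45] Training Data Size Sensitivity in Unsupervised Rhyme Recognition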
Petr Plecháč,Artjoms Šeļa,Silvie Cinková,Mirella De Sisto,Lara Nugues,Neža Kočnik,Antonina Martynenko,Ben Nagy,Luca Giovannini,Robert Kolár
Main category: cs.CL
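TL;DR: This paper studies the performance of RhymeTagger, an unsupervised rhyme recognition tool, across seven languages. It examines how training data size and language differences affect accuracy and compares the tool with inter-annotator agreement and with large language models.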
Details
Motivation: What counts as a rhyme is historically constructed and subjective, which makes automated recognition difficult, especially across languages. Method: Runs unsupervised rhyme recognition experiments with the language-independent tool RhymeTagger on poetry corpora in seven languages; evaluates performance at different training sizes; measures agreement among expert annotators; analyzes factors behind annotation disagreement (such as phonetic similarity and the distance between words); and compares against three large language models with a one-shot strategy. Result: Given sufficient training data, RhymeTagger consistently outperforms human agreement, while LLMs that lack phonetic representation perform poorly on the task. Conclusion: In the unsupervised setting, RhymeTagger, which relies on repeating patterns, is a reliable and cross-lingually effective rhyme recognition method; phonetic information is essential for rhyme recognition, and text-only large models struggle with the task. Abstract: Rhyme is deceptively intuitive: what is or is not a rhyme is constructed historically, scholars struggle with rhyme classification, and people disagree on whether two words are rhymed or not. This complicates automated rhyme recognition and evaluation, especially in multilingual contexts. This article investigates how much training data is needed for reliable unsupervised rhyme recognition using RhymeTagger, a language-independent tool that identifies rhymes based on repeating patterns in poetry corpora. We evaluate its performance across seven languages (Czech, German, English, French, Italian, Russian, and Slovene), examining how training size and language differences affect accuracy. To set a realistic performance benchmark, we assess inter-annotator agreement on a manually annotated subset of poems and analyze factors contributing to disagreement in expert annotations: phonetic similarity between rhyming words and their distance from each other in a poem. We also compare RhymeTagger to three large language models using a one-shot learning strategy. Our findings show that, once provided with sufficient training data, RhymeTagger consistently outperforms human agreement, while LLMs lacking phonetic representation significantly struggle with the task.
[46] Self-Debias: Self-correcting for Debiasing Large Language Models
Xuan Feng,Shuai Zhao,Luwei Xiao,Tianlong Gu,Bo An
Main category: cs.CL
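TL;DR: This paper proposes Self-Debias, a framework that models debiasing as a redistribution of probability mass and applies dynamic constraints at the level of reasoning trajectories, so that an LLM can identify and interrupt bias propagation in its chain of thought on its own. Combined with an online self-improvement mechanism based on consistency filtering, it achieves efficient, self-sustaining debiasing with a small amount of annotated data.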
Details
Motivation: Existing debiasing methods mostly rely on static constraints or external interventions and cannot identify and interrupt ongoing "Bias Propagation" in real time during Chain-of-Thought (CoT) reasoning. Method: Proposes Self-Debias: (1) models debiasing as a dynamic reallocation of output probability mass from biased heuristics to unbiased reasoning paths; (2) designs a fine-grained trajectory-level objective that revises only the biased reasoning suffix and keeps the valid prefix; (3) adds an online self-improvement mechanism based on consistency filtering that synthesizes supervision signals automatically. Result: With only 20k annotated samples, the method achieves efficient self-correction and outperforms baselines on several bias benchmarks while preserving the model's general reasoning ability. Conclusion: Self-Debias gives LLMs an intrinsic self-correction ability, removes the dependence on external intervention, and offers a new paradigm for controllable and sustainable debiasing of reasoning. Abstract: Although Large Language Models (LLMs) demonstrate remarkable reasoning capabilities, inherent social biases often cascade throughout the Chain-of-Thought (CoT) process, leading to continuous "Bias Propagation". Existing debiasing methods primarily focus on static constraints or external interventions, failing to identify and interrupt this propagation once triggered. To address this limitation, we introduce Self-Debias, a progressive framework designed to instill intrinsic self-correction capabilities. Specifically, we reformulate the debiasing process as a strategic resource redistribution problem, treating the model's output probability mass as a limited resource to be reallocated from biased heuristics to unbiased reasoning paths. Unlike standard preference optimization which applies broad penalties, Self-Debias employs a fine-grained trajectory-level objective subject to dynamic debiasing constraints. This enables the model to selectively revise biased reasoning suffixes while preserving valid contextual prefixes. Furthermore, we integrate an online self-improvement mechanism utilizing consistency filtering to autonomously synthesize supervision signals. With merely 20k annotated samples, Self-Debias activates efficient self-correction, achieving superior debiasing performance while preserving general reasoning capabilities without continuous external oversight.
[47] HyperMem: Hypergraph Memory for Long-Term Conversations
Juwei Yue,Chuanrui Hu,Jiawei Sheng,Zuyi Zhou,Wenyuan Zhang,Tingwen Liu,Li Guo,Yafeng Deng
Main category: cs.CL
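TL;DR: This paper proposes HyperMem, a hypergraph-based hierarchical memory architecture that addresses the difficulty existing conversational agents have in capturing high-order associations in long-term memory. It builds a three-level topic-episode-fact structure with hyperedges and combines a hybrid index with a coarse-to-fine retrieval strategy, reaching a state-of-the-art 92.73% accuracy on the LoCoMo benchmark.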
Details
Motivation: Existing methods (such as RAG and graph-based memory) rely on pairwise relations and struggle to model high-order joint dependencies among multiple elements, which fragments memory retrieval and hurts coherence and personalization in long conversations. Method: Proposes HyperMem: (1) a three-level hierarchical memory structure (topics, episodes, facts); (2) hyperedges that explicitly model high-order associations across episodes and facts; (3) a hybrid lexical-semantic index with a coarse-to-fine retrieval strategy. Result: On the LoCoMo benchmark, LLM-as-a-judge accuracy reaches 92.73%, outperforming existing methods and supporting the value of modeling high-order associations for long-term conversational memory. Conclusion: Hypergraph structure captures the joint relations among multiple memory elements more naturally, and HyperMem offers a new paradigm for building conversational agents with deep contextual understanding. Abstract: Long-term memory is essential for conversational agents to maintain coherence, track persistent tasks, and provide personalized interactions across extended dialogues. However, existing approaches such as Retrieval-Augmented Generation (RAG) and graph-based memory mostly rely on pairwise relations, which can hardly capture high-order associations, i.e., joint dependencies among multiple elements, causing fragmented retrieval. To this end, we propose HyperMem, a hypergraph-based hierarchical memory architecture that explicitly models such associations using hyperedges. Particularly, HyperMem structures memory into three levels: topics, episodes, and facts, and groups related episodes and their facts via hyperedges, unifying scattered content into coherent units. Leveraging this structure, we design a hybrid lexical-semantic index and a coarse-to-fine retrieval strategy, supporting accurate and efficient retrieval of high-order associations. Experiments on the LoCoMo benchmark show that HyperMem achieves state-of-the-art performance with 92.73% LLM-as-a-judge accuracy, demonstrating the effectiveness of HyperMem for long-term conversations.
[48] Behavior-Aware Item Modeling via Dynamic Procedural Solution Representations for Knowledge Tracing
Jun Seo,Sangwon Ryu,Heejin Do,Hyounghun Kim,Gary Geunbae Lee
Main category: cs.CL
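TL;DR: This paper proposes BAIM, a behavior-aware item modeling framework that enriches item representations in knowledge tracing with dynamic process information from Polya's four problem-solving stages (understand, plan, carry out, look back) and adapts to different learners through a context-adaptive mechanism. It clearly improves prediction performance, especially under repeated interactions.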
Details
Motivation: Existing knowledge tracing methods have advanced by using item representations aligned with Knowledge Components, but they ignore the procedural dynamics of problem solving and so capture learners' actual cognitive behavior poorly. Method: Proposes BAIM: (1) a reasoning language model decomposes each item's solution into Polya's four stages; (2) stage-level representations are derived from the embedding trajectory of each stage; (3) a context-conditioned adaptive routing mechanism dynamically fuses the stage representations into the KT backbone. Result: On the XES3G5M and NIPS34 datasets, BAIM consistently outperforms strong pretraining-based baselines, with particularly large gains under repeated interactions. Conclusion: Incorporating the dynamic behavioral information of the problem-solving process and adapting to learner heterogeneity improves the modeling capacity and prediction accuracy of knowledge tracing. Abstract: Knowledge Tracing (KT) aims to predict learners' future performance from past interactions. While recent KT approaches have improved via learning item representations aligned with Knowledge Components, they overlook the procedural dynamics of problem solving. We propose Behavior-Aware Item Modeling (BAIM), a framework that enriches item representations by integrating dynamic procedural solution information. BAIM leverages a reasoning language model to decompose each item's solution into four problem-solving stages (i.e., understand, plan, carry out, and look back), pedagogically grounded in Polya's framework. Specifically, it derives stage-level representations from per-stage embedding trajectories, capturing latent signals beyond surface features. To reflect learner heterogeneity, BAIM adaptively routes these stage-wise representations, introducing a context-conditioned mechanism within a KT backbone, allowing different procedural stages to be emphasized for different learners. Experiments on XES3G5M and NIPS34 show that BAIM consistently outperforms strong pretraining-based baselines, achieving particularly large gains under repeated learner interactions.
[49] Floating or Suggesting Ideas? A Large-Scale Contrastive Analysis of Metaphorical and Literal Verb-Object Constructions
Prisca Piccirilli,Alexander Fraser,Sabine Schulte im Walde
Main category: cs.CL
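TL;DR: This paper analyzes 297 English verb-object pairs in about 2 million corpus sentences with 2,293 cognitive and linguistic features, comparing metaphorical and literal usage across pairs and within pairs. It finds no uniform distributional pattern separating the two; the differences depend mainly on the specific construction.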
Details
Motivation: Prior work has studied metaphor mostly from cognitive and psycholinguistic perspectives, and large-scale systematic comparisons of metaphorical and literal language for near-synonymous expressions are lacking. Method: Using 297 English verb-object pairs and about 2 million corpus sentences, extracts 2,293 features covering affective, lexical, syntactic, and discourse-level properties with five NLP tools, and runs two kinds of comparison: cross-pair and within-pair. Result: The cross-pair analysis shows that literal contexts have higher lexical frequency, cohesion, and structural regularity, while metaphorical contexts show greater affective load, imageability, lexical diversity, and constructional specificity. The within-pair analysis shows that differences are inconsistent inside most verb-object pairs, with substantial heterogeneity. Conclusion: No single, universal distributional pattern separates metaphorical from literal usage; the distinction is essentially construction-dependent. Large-scale data combined with diverse features gives a fine-grained picture of metaphor use in VO constructions. Abstract: Metaphor pervades everyday language, allowing speakers to express abstract concepts via concrete domains. While prior work has studied metaphors cognitively and psycholinguistically, large-scale comparisons with literal language remain limited, especially for near-synonymous expressions. We analyze 297 English verb-object pairs (e.g., float idea vs. suggest idea) in ~2M corpus sentences, examining their contextual usage. Using five NLP tools, we extract 2,293 cognitive and linguistic features capturing affective, lexical, syntactic, and discourse-level properties. We address: (i) whether features differ between metaphorical and literal contexts (cross-pair analysis), and (ii) whether individual VO pairs diverge internally (within-pair analysis). Cross-pair results show literal contexts have higher lexical frequency, cohesion, and structural regularity, while metaphorical contexts show greater affective load, imageability, lexical diversity, and constructional specificity. Within-pair analyses reveal substantial heterogeneity, with most pairs showing non-uniform effects. These results suggest no single, consistent distributional pattern that distinguishes metaphorical from literal usage. Instead, differences are largely construction-specific. Overall, large-scale data combined with diverse features provides a fine-grained understanding of metaphor-literal contrasts in VO usage.
[50] When to Trust Tools? Adaptive Tool Trust Calibration For Tool-Integrated Math Reasoning
Ruotao Xu,Yixin Ji,Yu Luo,Jinpeng Li,Dong Li,Peifeng Li,Juntao Li,Min Zhang
Main category: cs.CL
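TL;DR: This paper proposes Adaptive Tool Trust Calibration (ATTC), a framework that addresses the "Tool Ignored" problem in tool-integrated reasoning (TIR) models, where the model disregards correct tool results. ATTC decides dynamically whether to trust a tool result based on the confidence score of the generated code block, improving performance by 4.1% to 7.5% across several open-source TIR models and datasets.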
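The trust decision summarized above can be sketched as follows. This is a hedged illustration, not the ATTC implementation: the digest does not say how the confidence score is computed, so the sketch assumes it is the geometric mean of the token probabilities of the generated code block, and the function names and the `threshold` value are hypothetical.

```python
import math


def code_block_confidence(token_logprobs):
    """Assumed confidence: geometric-mean token probability of the code block."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def choose_answer(model_answer, tool_answer, token_logprobs, threshold=0.8):
    """Pick between the model's own answer and the tool result.

    When the two agree there is nothing to arbitrate. When they conflict,
    the tool result is trusted only if the code that produced it was
    generated with confidence at or above the (hypothetical) threshold.
    """
    if model_answer == tool_answer:
        return tool_answer
    confidence = code_block_confidence(token_logprobs)
    return tool_answer if confidence >= threshold else model_answer
```

For instance, with per-token log-probabilities near zero the code block counts as confident and a conflicting tool result overrides the model's answer, while strongly negative log-probabilities leave the model's own answer in place.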
Details
Motivation: When model reasoning conflicts with tool results, existing tool-integrated reasoning (TIR) models tend to trust their own reasoning and ignore correct tool output (the "Tool Ignored" problem); they lack a way to judge dynamically how far to trust tool results. Method: Proposes Adaptive Tool Trust Calibration (ATTC), which uses the confidence score of the generated code block to guide the model in adaptively choosing whether to trust or ignore the tool execution result. Result: Experiments on open-source TIR models of several sizes and on multiple datasets show that ATTC substantially reduces the "Tool Ignored" problem and improves performance by 4.1% to 7.5%. Conclusion: ATTC provides a scalable trust calibration mechanism for the TIR paradigm, improves how sensibly models use tool results, and is an effective way to make large reasoning models more reliable. Abstract: Large reasoning models (LRMs) have achieved strong performance gains by scaling test-time computation, but due to the inherent limitations of the underlying language models, they still have shortcomings in tasks that require precise computation and extensive knowledge reserves. Tool-Integrated Reasoning (TIR) has emerged as a promising paradigm that incorporates tool call and execution within the reasoning trajectory. Although recent works have released some powerful open-source TIR models, our analysis reveals that these models still suffer from critical deficiencies. We find that when the reasoning of the model conflicts with the tool results, the model tends to believe in its own reasoning. In some cases the tool results are correct but are ignored by the model, resulting in incorrect answers; we call this issue "Tool Ignored". This indicates that the model does not know when to trust or ignore the tool. To overcome these limitations, we introduce Adaptive Tool Trust Calibration (ATTC), a novel framework that guides the model to adaptively choose to trust or ignore the tool results based on the confidence score of generated code blocks. The experimental results from various open-source TIR models of different sizes and across multiple datasets demonstrate that ATTC effectively reduces the "Tool Ignored" issue, resulting in a performance increase of 4.1% to 7.5%.
[51] Distributed Multi-Layer Editing for Rule-Level Knowledge in Large Language Models
Yating Wang,Wenting Zhao,Yaqi Zhao,Yongshun Gong,Yilong Yin,Haoliang Sun
Main category: cs.CL
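TL;DR: This paper studies the editing of rule-level knowledge in large language models. It finds that rule knowledge is distributed across transformer layers by form (formula, description, instance) and therefore proposes Distributed Multi-Layer Editing (DMLE), which substantially improves rule-level editing.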
Details
Motivation: Existing model editing methods mainly target fact-level knowledge and assume an edit can be achieved through a localized intervention. Rule-level knowledge must stay consistent across several forms of expression (symbolic expressions, natural language explanations, concrete instances), so this assumption does not hold.
Method: Extends the RuleEdit benchmark from 80 to 200 manually verified rules, uses fine-grained causal tracing to analyze how rule knowledge is distributed across transformer layers, and on that basis proposes Distributed Multi-Layer Editing (DMLE): a shared update to formulas and descriptions in early layers and a separate update to instances in middle layers.
Result: Across GPT-J-6B, Qwen2.5-7B, Qwen2-7B, and LLaMA-3-8B, DMLE improves instance portability and rule understanding by 13.91 and 50.19 percentage points on average, respectively, over the strongest baseline.
Conclusion: Rule-level knowledge has a form-dependent, layered organization and calls for coordinated cross-layer editing; DMLE validates this mechanism and offers a new paradigm for rule knowledge editing.
Abstract: Large language models store not only isolated facts but also rules that support reasoning across symbolic expressions, natural language explanations, and concrete instances. Yet most model editing methods are built for fact-level knowledge, assuming that a target edit can be achieved through a localized intervention. This assumption does not hold for rule-level knowledge, where a single rule must remain consistent across multiple interdependent forms. We investigate this problem through a mechanistic study of rule-level knowledge editing. To support this study, we extend the RuleEdit benchmark from 80 to 200 manually verified rules spanning mathematics and physics. Fine-grained causal tracing reveals a form-specific organization of rule knowledge in transformer layers: formulas and descriptions are concentrated in earlier layers, while instances are more associated with middle layers. These results suggest that rule knowledge is not uniformly localized, and therefore cannot be reliably edited by a single-layer or contiguous-block intervention. Based on this insight, we propose Distributed Multi-Layer Editing (DMLE), which applies a shared early-layer update to formulas and descriptions and a separate middle-layer update to instances. While remaining competitive on standard editing metrics, DMLE achieves substantially stronger rule-level editing performance. On average, it improves instance portability and rule understanding by 13.91 and 50.19 percentage points, respectively, over the strongest baseline across GPT-J-6B, Qwen2.5-7B, Qwen2-7B, and LLaMA-3-8B. The code is available at https://github.com/Pepper66/DMLE.
[52] SeLaR: Selective Latent Reasoning in Large Language Models
Renyu Fu,Guibo Luo
Main category: cs.CL
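TL;DR: This paper proposes SeLaR (Selective Latent Reasoning), a lightweight, training-free framework. An entropy-gated mechanism enables soft embeddings only at low-confidence reasoning steps, and an entropy-aware contrastive regularization prevents soft-embedding collapse, improving the stability and exploration ability of chain-of-thought reasoning in large models.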
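The entropy gate summarized above can be sketched as follows. This is an illustrative toy under stated assumptions, not the authors' code: the threshold value, the function names, and in particular the subtraction used to push the soft embedding away from the dominant token (scaled by a hypothetical `alpha`) are simplifications chosen here; the abstract does not specify the exact form of the regularization.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def next_input_embedding(probs, embeddings, entropy_threshold=1.0, alpha=0.0):
    """Choose the embedding fed to the next decoding step.

    probs: next-token probabilities; embeddings: one vector per token.
    entropy_threshold and alpha are hypothetical hyperparameters.
    """
    top = max(range(len(probs)), key=probs.__getitem__)
    if entropy(probs) < entropy_threshold:
        # High-confidence step: keep ordinary discrete decoding.
        return list(embeddings[top])
    dim = len(embeddings[0])
    # Low-confidence step: probability-weighted mixture of token embeddings.
    soft = [sum(p * e[d] for p, e in zip(probs, embeddings)) for d in range(dim)]
    # Assumed stand-in for the contrastive term: move away from the top token.
    return [s - alpha * t for s, t in zip(soft, embeddings[top])]
```

With `alpha=0.0` the function reduces to plain entropy-gated soft embeddings; a positive `alpha` only illustrates the direction of the push described in the abstract.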
Details
Motivation: Existing latent reasoning methods suffer from global activation and soft-embedding collapse, which harm reasoning stability and path diversity, while conventional discrete CoT is limited by the expressiveness of token sampling.
Method: Proposes an entropy-gated mechanism (soft embeddings are enabled only at low-confidence steps) and an entropy-aware contrastive regularization (pushing soft embeddings away from the dominant token's direction); neither requires additional training.
Result: On five reasoning benchmarks, SeLaR consistently outperforms standard CoT and existing state-of-the-art training-free methods.
Conclusion: Introducing latent-space reasoning selectively and controllably balances stability and exploration, offering a new paradigm for training-free reasoning enhancement.
Abstract: Chain-of-Thought (CoT) has become a cornerstone of reasoning in large language models, yet its effectiveness is constrained by the limited expressiveness of discrete token sampling. Recent latent reasoning approaches attempt to alleviate this limitation by replacing discrete tokens with soft embeddings (probability-weighted mixtures of token embeddings) or hidden states, but they commonly suffer from two issues: (1) global activation injects perturbations into high-confidence steps, impairing reasoning stability; and (2) soft embeddings quickly collapse toward the highest-probability token, limiting exploration of alternative trajectories. To address these challenges, we propose SeLaR (Selective Latent Reasoning), a lightweight and training-free framework. SeLaR introduces an entropy-gated mechanism that activates soft embeddings only at low-confidence steps, while preserving discrete decoding at high-confidence steps. Additionally, we propose an entropy-aware contrastive regularization that pushes soft embeddings away from the dominant (highest-probability) token's direction, encouraging sustained exploration of multiple latent reasoning paths. Experiments on five reasoning benchmarks demonstrate that SeLaR consistently outperforms standard CoT and state-of-the-art training-free methods.
[53] Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces
Jiawei Chen,Ruoxi Xu,Boxi Cao,Ruotong Pan,Yunfei Zhang,Yifei Hu,Yong Du,Tingting Gao,Yaojie Lu,Yingfei Sun,Xianpei Han,Le Sun,Xiangyu Wu,Hongyu Lin
Main category: cs.CL
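TL;DR: This paper presents OmniBehavior, the first user simulation benchmark built entirely from real-world data. It reveals structural biases in how existing LLMs simulate complex, long-horizon, cross-scenario human behavior (such as a Utopian bias and persona homogenization) and identifies key challenges for high-fidelity user simulation.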
Details
Motivation: Existing user simulation benchmarks are limited to isolated scenarios, narrow action spaces, or synthetic data and cannot reflect the holistic nature of real human behavior; a more realistic benchmark is needed to advance high-fidelity user simulation.
Method: Builds the OmniBehavior benchmark entirely from real-world data, covering long-horizon, cross-scenario, heterogeneous behavior patterns; through empirical analysis and large-scale LLM evaluation, compares simulated with real behavior to identify structural biases.
Result: Current LLMs model long-range causal chains poorly, and performance plateaus as context grows. The study systematically exposes a "Utopian bias": hyper-activity, persona homogenization, and convergence toward a positive average person, which erases individual differences and long-tail behaviors.
Conclusion: OmniBehavior provides a new benchmark and diagnostic tool for user simulation; overcoming structural bias and modeling individual differences and long-tail behaviors are core directions for future high-fidelity simulation.
Abstract: The emergence of Large Language Models (LLMs) has illuminated the potential for a general-purpose user simulator. However, existing benchmarks remain constrained to isolated scenarios, narrow action spaces, or synthetic data, failing to capture the holistic nature of authentic human behavior. To bridge this gap, we introduce OmniBehavior, the first user simulation benchmark constructed entirely from real-world data, integrating long-horizon, cross-scenario, and heterogeneous behavioral patterns into a unified framework. Based on this benchmark, we first provide empirical evidence that previous datasets with isolated scenarios suffer from tunnel vision, whereas real-world decision-making relies on long-term, cross-scenario causal chains. Extensive evaluations of state-of-the-art LLMs reveal that current models struggle to accurately simulate these complex behaviors, with performance plateauing even as context windows expand. Crucially, a systematic comparison between simulated and authentic behaviors uncovers a fundamental structural bias: LLMs tend to converge toward a positive average person, exhibiting hyper-activity, persona homogenization, and a Utopian bias. This results in the loss of individual differences and long-tail behaviors, highlighting critical directions for future high-fidelity simulation research.
[54] A GAN and LLM-Driven Data Augmentation Framework for Dynamic Linguistic Pattern Modeling in Chinese Sarcasm Detection
Wenxian Wang,Xiaohu Luo,Junfeng Hao,Xiaoming Gu,Xingshu Chen,Zhu Wang,Haizhou Wang
Main category: cs.CL
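TL;DR: This paper proposes a data augmentation framework that combines a generative adversarial network (GAN) with a large language model (LLM) to model users' linguistic patterns and improve Chinese sarcasm detection, and builds SinaSarc, a new dataset that includes user historical behavior. The proposed extended BERT model achieves the best F1-scores on both the non-sarcastic and sarcastic categories (0.9138 and 0.9151, respectively).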
Details
Motivation: Existing Chinese sarcasm detection methods are constrained by small datasets and high construction costs, and most overlook the effect of user-specific linguistic patterns on how sarcasm is expressed.
Method: Proposes a data augmentation framework driven jointly by a GAN and an LLM: raw multi-topic data is collected from Sina Weibo, a GAN is trained on it, and GPT-3.5 is used for data synthesis, yielding the SinaSarc dataset of target comments, context, and user historical behavior. The BERT architecture is then extended to explicitly incorporate multi-dimensional information such as user historical behavior.
Result: The model reaches F1-scores of 0.9138 and 0.9151 on the non-sarcastic and sarcastic categories, respectively, outperforming all existing SOTA methods.
Conclusion: The work offers a new paradigm for dynamically modeling users' long-term linguistic patterns in Chinese sarcasm detection, contributing to both dataset construction and method design.
Abstract: Sarcasm is a rhetorical device that expresses criticism or emphasizes characteristics of certain individuals or situations through exaggeration, irony, or comparison. Existing methods for Chinese sarcasm detection are constrained by limited datasets and high construction costs, and they mainly focus on textual features, overlooking user-specific linguistic patterns that shape how opinions and emotions are expressed. This paper proposes a Generative Adversarial Network (GAN) and Large Language Model (LLM)-driven data augmentation framework to dynamically model users' linguistic patterns for enhanced Chinese sarcasm detection. First, we collect raw data from various topics on Sina Weibo. Then, we train a GAN on these data and apply a GPT-3.5 based data augmentation technique to synthesize an extended sarcastic comment dataset, named SinaSarc. This dataset contains target comments, contextual information, and user historical behavior. Finally, we extend the BERT architecture to incorporate multi-dimensional information, particularly user historical behavior, enabling the model to capture dynamic linguistic patterns and uncover implicit sarcastic cues in comments. Experimental results demonstrate the effectiveness of our proposed method. Specifically, our model achieves the highest F1-scores on both the non-sarcastic and sarcastic categories, with values of 0.9138 and 0.9151 respectively, which outperforms all existing state-of-the-art (SOTA) approaches. This study presents a novel framework for dynamically modeling users' long-term linguistic patterns in Chinese sarcasm detection, contributing to both dataset construction and methodological advancement in this field.
[55] Synthetic Data for any Differentiable Target
Tristan Thrush,Sung Min Park,Herman Brunborg,Luke Bailey,Marcel Roed,Neil Band,Christopher Potts,Tatsunori Hashimoto
Main category: cs.CL
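TL;DR: This paper proposes Dataset Policy Gradient (DPG), a reinforcement learning method that uses higher-order gradients to precisely optimize a synthetic data generator, so that fine-tuning a language model on the synthetic data gives targeted control over the model's parameters or behavior.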
Details
Motivation: Explores the limits of controlling language models through synthetic training data alone, addressing the difficulty conventional methods have in steering model behavior precisely.
Method: Proposes the Dataset Policy Gradient (DPG) RL primitive, which computes exact data attribution via higher-order gradients and uses the scores as policy gradient rewards; proves that the procedure closely approximates the true, intractable gradient.
Result: Using only SFT, the target model's LM head weights are made to embed a QR code, embed the pattern $\texttt{67}$, and have lower $\ell^2$ norm; the generator is also made to rephrase inputs in a new language and to produce a specific UUID, even though the prompts do not convey these objectives.
Conclusion: DPG is a powerful and flexible technique that can shape a language model's internal parameters and external behavior using only synthetic training examples.
Abstract: What are the limits of controlling language models via synthetic training data? We develop a reinforcement learning (RL) primitive, the Dataset Policy Gradient (DPG), which can precisely optimize synthetic data generators to produce a dataset of targeted examples. When used for supervised fine-tuning (SFT) of a target model, these examples cause the target model to do well on a differentiable metric of our choice. Our approach achieves this by taking exact data attribution via higher-order gradients and using those scores as policy gradient rewards. We prove that this procedure closely approximates the true, intractable gradient for the synthetic data generator. To illustrate the potential of DPG, we show that, using only SFT on generated examples, we can cause the target model's LM head weights to (1) embed a QR code, (2) embed the pattern $\texttt{67}$, and (3) have lower $\ell^2$ norm. We additionally show that we can cause the generator to (4) rephrase inputs in a new language and (5) produce a specific UUID, even though neither of these objectives is conveyed in the generator's input prompts. These findings suggest that DPG is a powerful and flexible technique for shaping model properties using only synthetic training examples.
[56] AfriVoices-KE: A Multilingual Speech Dataset for Kenyan Languages
Lilian Wanzare,Cynthia Amol,zekiel Maina,Nelson Odhiambo,Hope Kerubo,Leila Misula,Vivian Oloo,Rennish Mboya,Edwin Onkoba,Edward Ombui,Joseph Muguro,Ciira wa Maina,Andrew Kipkebut,Alfred Omondi Otom,Ian Ndung'u Kang'ethe,Angela Wambui Kanyi,Brian Gichana Omwenga
Main category: cs.CL
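TL;DR: AfriVoices-KE is a large-scale multilingual speech dataset of approximately 3,000 hours covering five Kenyan languages, intended to address the underrepresentation of African languages in speech technology.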
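The abstract for this entry mentions automated signal-to-noise ratio validation prior to recording. A rough sketch of such a check is given below, assuming an ambient-noise sample is captured separately from a speech sample; the function names and the 20 dB threshold are hypothetical and are not taken from the AfriVoices-KE application.

```python
import math


def mean_power(samples):
    """Average power of a sequence of audio samples."""
    return sum(s * s for s in samples) / len(samples)


def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels: 10 * log10(P_signal / P_noise)."""
    signal_power = mean_power(signal)
    noise_power = mean_power(noise)
    if noise_power == 0.0:
        return math.inf
    if signal_power == 0.0:
        return -math.inf
    return 10.0 * math.log10(signal_power / noise_power)


def passes_snr_check(signal, noise, min_snr_db=20.0):
    """min_snr_db is an assumed threshold, not a value from the paper."""
    return snr_db(signal, noise) >= min_snr_db
```

A recording environment whose ambient noise is too loud relative to the speaker would fail the check, prompting the contributor to move somewhere quieter before recording.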
Details
Motivation: Addresses the severe underrepresentation of African languages in speech technology, supporting the development of inclusive speech systems and the digital preservation of Kenya's linguistic heritage.
Method: Uses a dual data collection methodology: scripted speech draws on text corpora, translations, and generated sentences spanning eleven domains relevant to Kenya; spontaneous speech is elicited with textual and image prompts. A customized mobile application enables smartphone recording, and multi-layer quality control combines automated signal-to-noise ratio checks with human content review.
Result: Produces AfriVoices-KE, a high-quality speech dataset with 750 hours of scripted speech and 2,250 hours of spontaneous speech from 4,777 native speakers across diverse regions and demographics.
Conclusion: AfriVoices-KE provides a foundational resource for ASR and TTS development in low-resource African languages, while supporting the preservation of linguistic diversity and the spread of community-participatory data collection.
Abstract: AfriVoices-KE is a large-scale multilingual speech dataset comprising approximately 3,000 hours of audio across five Kenyan languages: Dholuo, Kikuyu, Kalenjin, Maasai, and Somali. The dataset includes 750 hours of scripted speech and 2,250 hours of spontaneous speech, collected from 4,777 native speakers across diverse regions and demographics. This work addresses the critical underrepresentation of African languages in speech technology by providing a high-quality, linguistically diverse resource. Data collection followed a dual methodology: scripted recordings drew from compiled text corpora, translations, and domain-specific generated sentences spanning eleven domains relevant to the Kenyan context, while unscripted speech was elicited through textual and image prompts to capture natural linguistic variation and dialectal nuances. A customized mobile application enabled contributors to record using smartphones. Quality assurance operated at multiple layers, encompassing automated signal-to-noise ratio validation prior to recording and human review for content accuracy. Though the project encountered challenges common to low-resource settings, including unreliable infrastructure, device compatibility issues, and community trust barriers, these were mitigated through local mobilizers, stakeholder partnerships, and adaptive training protocols. AfriVoices-KE provides a foundational resource for developing inclusive automatic speech recognition and text-to-speech systems, while advancing the digital preservation of Kenya's linguistic heritage.
[57] AI generates well-liked but templatic empathic responses
Emma Gueorguieva,Hongli Zhan,Jina Suh,Javier Hernandez,Tatiana Lau,Junyi Jessy Li,Desmond C. Ong
Main category: cs.CL
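TL;DR: This paper argues that large language models (LLMs) are rated as more empathic when providing emotional support because they consistently use a structured template for expressing empathy. The study builds a taxonomy of 10 empathic language tactics and, across two studies (3,265 AI responses and 1,290 human responses), finds that LLM responses are highly formulaic at the discourse-functional level: a common template of tactic sequences covers 81--92% of a matched response, whereas human responses are more diverse.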
Details
Motivation: Explains why users rate LLM-generated emotional support responses as more empathic than human responses.
Method: Builds a taxonomy of 10 empathic language tactics (such as validating feelings and paraphrasing) and, in two studies, performs template matching at the discourse-functional level on 3,265 responses generated by six LLMs and 1,290 human-written responses.
Result: Finds a structured template for expressing empathy with high coverage (it matches 83--90% of LLM responses and covers 81--92% of each matched response); human responses are markedly more diverse.
Conclusion: The empathy advantage of LLMs stems from reliably reproducing a fixed discourse template rather than from genuine understanding; the finding has implications for the design and ethics of AI emotional support and for human-AI collaboration.
Abstract: Recent research shows that greater numbers of people are turning to Large Language Models (LLMs) for emotional support, and that people rate LLM responses as more empathic than human-written responses. We suggest a reason for this success: LLMs have learned and consistently deploy a well-liked template for expressing empathy. We develop a taxonomy of 10 empathic language "tactics" that include validating someone's feelings and paraphrasing, and apply this taxonomy to characterize the language that people and LLMs produce when writing empathic responses. Across a set of 2 studies comparing a total of n = 3,265 AI-generated (by six models) and n = 1,290 human-written responses, we find that LLM responses are highly formulaic at a discourse functional level. We discovered a template -- a structured sequence of tactics -- that matches between 83--90% of LLM responses (and 60--83% in a held out sample), and when those are matched, covers 81--92% of the response. By contrast, human-written responses are more diverse. We end with a discussion of implications for the future of AI-generated empathy.
[58] What do Language Models Learn and When? The Implicit Curriculum Hypothesis
Emmy Liu,Kaiser Sun,Millicent Li,Isabelle Lee,Lindia Tjuatja,Jen-tse Huang,Graham Neubig
Main category: cs.CL
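TL;DR: This paper proposes the Implicit Curriculum Hypothesis, which holds that skills emerge during LLM pretraining in a predictable, compositional order. The hypothesis is tested with a suite of simple, composable tasks; the order in which skills emerge is highly consistent across models, and composite tasks usually emerge after their component tasks.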
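The consistency of emergence orderings described above is reported as a rank correlation ($\rho$). A small sketch of that kind of analysis follows, assuming an emergence point is the first checkpoint at which a task's accuracy reaches a fixed threshold and that orderings are compared with Spearman's rank correlation; the 0.8 threshold and all names are illustrative assumptions, not the authors' evaluation code.

```python
def emergence_step(steps, accuracies, threshold=0.8):
    """First training step at which accuracy reaches the (assumed) threshold."""
    for step, acc in zip(steps, accuracies):
        if acc >= threshold:
            return step
    return None  # the task never emerged within the tracked checkpoints


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman_rho(x, y):
    """Pearson correlation of the ranks; assumes neither input is constant."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

Given per-task emergence steps for two models, `spearman_rho(model_a_steps, model_b_steps)` close to 1 indicates that both models acquire the tasks in nearly the same order.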
Details
Motivation: Scaling laws on validation loss describe how model performance improves with compute, but they do not reveal which capabilities a model acquires in which order during pretraining, so the dynamics of skill acquisition need to be studied directly.
Method: Proposes the Implicit Curriculum Hypothesis; designs simple, composable tasks spanning retrieval, morphological transformations, coreference, logical reasoning, and mathematics; tracks the emergence point at which each task reaches a fixed accuracy threshold across four model families ranging from 410M to 13B parameters; analyzes function vector representations of tasks and the similarity of their training trajectories; uses the task representation space to predict training trajectories of held-out composite tasks.
Result: Emergence orderings agree strongly across models ($\rho = .81$); composite tasks mostly emerge after their component tasks; similarity of task representations correlates positively with similarity of training trajectories; training trajectories of new tasks can be predicted from the task representation space ($R^2 = .68$-$.84$).
Conclusion: LLM pretraining is not random; it follows a structured, compositional curriculum of skill acquisition that is consistent across models and readable from internal representations.
Abstract: Large language models (LLMs) can perform remarkably complex tasks, yet the fine-grained details of how these capabilities emerge during pretraining remain poorly understood. Scaling laws on validation loss tell us how much a model improves with additional compute, but not what skills it acquires in which order. To remedy this, we propose the Implicit Curriculum Hypothesis: pretraining follows a compositional and predictable curriculum across models and data mixtures. We test this by designing a suite of simple, composable tasks spanning retrieval, morphological transformations, coreference, logical reasoning, and mathematics. Using these tasks, we track emergence points across four model families spanning sizes from 410M-13B parameters. We find that emergence orderings of when models reach fixed accuracy thresholds are strikingly consistent ($\rho = .81$ across 45 model pairs), and that composite tasks most often emerge after their component tasks. Furthermore, we find that this structure is encoded in model representations: tasks with similar function vector representations also tend to follow similar trajectories in training. By using the space of representations derived from our task set, we can effectively predict the training trajectories of simple held-out compositional tasks throughout the course of pretraining ($R^2 = .68$-$.84$ across models) without previously evaluating them. Together, these results suggest that pretraining is more structured than loss curves reveal: skills emerge in a compositional order that is consistent across models and readable from their internals.
[59] Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts
Jiayuan Ye,Vitaly Feldman,Kunal Talwar
Main category: cs.CL
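TL;DR: This paper formalizes fact memorization in large language models from an information-theoretic perspective and shows that fact accuracy degrades when the information in the training data's facts exceeds model capacity or when the fact frequency distribution is skewed. It proposes data selection based on training loss alone that limits the number of facts and flattens their frequency distribution, which substantially improves fact memorization in small models.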
Details
Motivation: Large language models have limited ability to memorize factual knowledge in their parameters, which leads to hallucinations and poor performance on knowledge-intensive tasks.
Method: Formalizes fact memorization from an information-theoretic perspective, analyzes how the training data distribution affects fact accuracy, and proposes data selection schemes based on training loss alone that limit the number of facts in the training data and flatten their frequency distribution.
Result: On semi-synthetic datasets of high-entropy facts, the method raises fact accuracy to the capacity limit. When pretraining on an annotated Wikipedia corpus, a GPT2-Small model (110M parameters) memorizes 1.3X more entity facts than with standard training and matches a 10X larger model (1.3B parameters) pretrained on the full dataset.
Conclusion: The total amount and distribution of factual information in the training data strongly affect fact memorization; controlling the number of facts and their frequency distribution lets small models memorize facts far more efficiently, approaching the theoretical capacity limit.
Abstract: Large language models (LLMs) can struggle to memorize factual knowledge in their parameters, often leading to hallucinations and poor performance on knowledge-intensive tasks. In this paper, we formalize fact memorization from an information-theoretic perspective and study how training data distributions affect fact accuracy. We show that fact accuracy is suboptimal (below the capacity limit) whenever the amount of information contained in the training data facts exceeds model capacity. This is further exacerbated when the fact frequency distribution is skewed (e.g. a power law). We propose data selection schemes based on the training loss alone that aim to limit the number of facts in the training data and flatten their frequency distribution. On semi-synthetic datasets containing high-entropy facts, our selection method effectively boosts fact accuracy to the capacity limit. When pretraining language models from scratch on an annotated Wikipedia corpus, our selection method enables a GPT2-Small model (110M parameters) to memorize 1.3X more entity facts compared to standard training, matching the performance of a 10X larger model (1.3B parameters) pretrained on the full dataset.
[60] ClawBench: Can AI Agents Complete Everyday Online Tasks?
Yuxuan Zhang,Yubo Wang,Yipeng Zhu,Penghui Du,Junwen Miao,Xuan Lu,Wendong Xu,Yunzhuo Hao,Songcheng Cai,Xiaochen Wang,Huaisong Zhang,Xian Wu,Yi Lu,Minyi Lei,Kai Zou,Huifeng Yin,Ping Nie,Liang Chen,Dongfu Jiang,Wenhu Chen,Kelsey R. Allen
Main category: cs.CL
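TL;DR: This paper presents ClawBench, an evaluation framework for AI agents on live online platforms. It contains 153 everyday tasks covering 144 live websites and 15 categories of life and work scenarios, and stresses demanding capabilities such as information extraction, multi-step navigation across platforms, and heavy form filling. Experiments show that current frontier models still complete few tasks (for example, Claude Sonnet 4.6 reaches only 33.3%), underscoring the importance of evaluating real-world web interaction.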
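The abstract for this entry describes a lightweight interception layer that captures and blocks only the final submission request. The decision logic of such a layer might look like the sketch below; the `Request`, `SubmissionRule`, and `Interceptor` types and the prefix-matching rule are hypothetical constructs for illustration and do not reflect ClawBench's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    method: str
    url: str


@dataclass(frozen=True)
class SubmissionRule:
    """Hypothetical per-task description of the final, side-effecting request."""
    method: str
    url_prefix: str


class Interceptor:
    def __init__(self, rule):
        self.rule = rule
        self.captured = []  # blocked requests, kept for later scoring

    def handle(self, request):
        """Return True if the request may proceed, False if it is blocked."""
        is_final = (
            request.method.upper() == self.rule.method.upper()
            and request.url.startswith(self.rule.url_prefix)
        )
        if is_final:
            self.captured.append(request)
            return False
        return True
```

All other traffic passes through untouched, so the agent interacts with the live site as usual, while the captured request can be inspected to judge whether the task would have been completed.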
Details
Motivation: Existing AI agent benchmarks mostly run in offline sandboxes or on static pages and do not reflect the challenges of real, dynamic, complex online interaction; a realistic benchmark is needed to evaluate how well agents handle everyday online tasks.
Method: Builds ClawBench with 153 real-world tasks covering 144 production websites in 15 categories; uses a lightweight interception layer to capture and block the final submission request so that evaluation has no real-world side effects; evaluates agents end to end on live web pages.
Result: Evaluation of 7 frontier models (proprietary and open-source) shows low overall completion rates; for example, Claude Sonnet 4.6 reaches only 33.3%, confirming the difficulty of the tasks and the limits of current models.
Conclusion: ClawBench fills a gap in evaluating AI agents in real online environments, and its demanding tasks provide a key benchmark and clear directions for improving general-purpose AI assistants.
Abstract: AI agents may be able to automate your inbox, but can they automate other routine aspects of your life? Everyday online tasks offer a realistic yet unsolved testbed for evaluating the next generation of AI agents. To this end, we introduce ClawBench, an evaluation framework of 153 simple tasks that people need to accomplish regularly in their lives and work, spanning 144 live platforms across 15 categories, from completing purchases and booking appointments to submitting job applications. These tasks require demanding capabilities beyond existing benchmarks, such as obtaining relevant information from user-provided documents, navigating multi-step workflows across diverse platforms, and write-heavy operations like filling in many detailed forms correctly. Unlike existing benchmarks that evaluate agents in offline sandboxes with static pages, ClawBench operates on production websites, preserving the full complexity, dynamic nature, and challenges of real-world web interaction. A lightweight interception layer captures and blocks only the final submission request, ensuring safe evaluation without real-world side effects. Our evaluations of 7 frontier models show that both proprietary and open-source models can complete only a small portion of these tasks. For example, Claude Sonnet 4.6 achieves only 33.3%. Progress on ClawBench brings us closer to AI agents that can function as reliable general-purpose assistants.
[61] Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models
Feng Luo,Yu-Neng Chuang,Guanchu Wang,Zicheng Xu,Xiaotian Han,Tianyi Zhang,Vladimir Braverman
Main category: cs.CL
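TL;DR: This paper proposes StableOPD, a framework that combines a reference-based divergence constraint with rollout mixture distillation to address repetition-induced length inflation and truncation collapse in on-policy distillation, substantially improving training stability and math reasoning performance.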
Details
Motivation: On-policy distillation (OPD) is prone to abrupt length inflation and truncation collapse during training, which bring repetition saturation, biased gradients, and sharp performance drops; the root cause is an unfavorable coupling between the student's self-sampling and the distillation objective.
Method: Proposes StableOPD, which introduces a reference-based divergence constraint to curb repetition-induced length inflation and adopts rollout mixture distillation; together these mitigate truncation collapse and stabilize training dynamics.
Result: Across multiple math reasoning datasets, StableOPD prevents truncation collapse, stabilizes training, and improves performance by 7.2% on average.
Conclusion: By decoupling student sampling from the preferences of the distillation objective, StableOPD offers a more robust and scalable training paradigm for on-policy knowledge distillation.
Abstract: On-policy distillation (OPD) trains student models under their own induced distribution while leveraging supervision from stronger teachers. We identify a failure mode of OPD: as training progresses, on-policy rollouts can undergo abrupt length inflation, causing truncated trajectories to dominate the training data. This truncation collapse coincides with abrupt repetition saturation and induces biased gradient signals, leading to severe training instability and sharp degradation in validation performance. We attribute this problem to the interaction between student-induced data collection and the distillation objective, which implicitly favors long and repetitive rollouts. To address this issue, we propose StableOPD, a stabilized OPD framework that combines a reference-based divergence constraint with rollout mixture distillation. These together mitigate repetition-induced length inflation and further stabilize OPD training. Across multiple math reasoning datasets, our approach prevents truncation collapse, stabilizes training dynamics, and improves performance by 7.2% on average.
cs.CV [Back]
[62] FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios
Xiangru Jian,Hao Xu,Wei Pang,Xinjian Zhao,Chengyu Tao,Qixin Zhang,Xikun Zhang,Chao Zhang,Guanzhi Deng,Alex Xue,Juan Du,Tianshu Yu,Garth Tarr,Linqi Song,Qiuzhuang Sun,Dacheng Tao
Main category: cs.CV
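TL;DR: This paper presents FORGE, an evaluation framework for multimodal large language models in manufacturing. Using a high-quality multimodal dataset (2D images, 3D point clouds, and fine-grained semantic annotations), it shows that the bottleneck for current MLLMs on manufacturing tasks is insufficient domain knowledge rather than visual grounding, and it validates fine-tuning a small model on the dataset.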
Details
Motivation: Existing evaluations do not reflect the rigorous demands of real manufacturing environments, and data scarcity and the lack of fine-grained domain semantics hinder the deployment of MLLMs in manufacturing.
Method: Builds a high-quality multimodal dataset combining 2D images and 3D point clouds, annotated with fine-grained domain semantics (e.g., exact model numbers); systematically evaluates 18 state-of-the-art MLLMs on three manufacturing tasks; performs a bottleneck analysis; uses the structured annotations for supervised fine-tuning of a 3B-parameter model.
Result: Visual grounding is not the main bottleneck; missing domain knowledge is the key limitation. After fine-tuning, accuracy on held-out manufacturing scenarios improves by up to 90.8% in relative terms.
Conclusion: Injecting domain knowledge is the key path to better MLLM performance in manufacturing; FORGE provides an evaluation benchmark and a practical training resource for domain-adapted MLLMs.
Abstract: The manufacturing sector is increasingly adopting Multimodal Large Language Models (MLLMs) to transition from simple perception to autonomous execution, yet current evaluations fail to reflect the rigorous demands of real-world manufacturing environments. Progress is hindered by data scarcity and a lack of fine-grained domain semantics in existing datasets. To bridge this gap, we introduce FORGE. We first construct a high-quality multimodal dataset that combines real-world 2D images and 3D point clouds, annotated with fine-grained domain semantics (e.g., exact model numbers). We then evaluate 18 state-of-the-art MLLMs across three manufacturing tasks, namely workpiece verification, structural surface inspection, and assembly verification, revealing significant performance gaps. Counter to conventional understanding, the bottleneck analysis shows that visual grounding is not the primary limiting factor. Instead, insufficient domain-specific knowledge is the key bottleneck, setting a clear direction for future research. Beyond evaluation, we show that our structured annotations can serve as an actionable training resource: supervised fine-tuning of a compact 3B-parameter model on our data yields up to 90.8% relative improvement in accuracy on held-out manufacturing scenarios, providing preliminary evidence for a practical pathway toward domain-adapted manufacturing MLLMs. The code and datasets are available at https://ai4manufacturing.github.io/forge-web.
[63] Personalizing Text-to-Image Generation to Individual Taste
Anne-Sofie Maerten,Juliane Verwiebe,Shyamgopal Karthik,Ameya Prabhu,Johan Wagemans,Matthias Bethge
Main category: cs.CV
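TL;DR: This paper introduces the PAMELA framework and a new dataset for modeling personalized image aesthetic preferences, substantially improving the accuracy of individual liking prediction and enabling personalized text-to-image generation through prompt optimization.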
Details
Motivation: Existing text-to-image models and reward models optimize only for average population-level appeal and ignore the subjectivity of aesthetic judgment and differences between individual users.
Method: Builds a personalized evaluation dataset of 70,000 ratings over 5,000 images (generated by Flux 2 and Nano Banana); designs a personalized reward model trained jointly on these high-quality annotations and existing aesthetic assessment subsets; combines it with prompt optimization for personalized generation.
Result: The model predicts individual liking more accurately than most existing methods predict population-level preferences; simple prompt optimization effectively steers generation toward individual preferences; the results highlight the importance of data quality and personalized modeling.
Conclusion: Personalized reward modeling is a key path to better user alignment of T2I models; high-quality subjective evaluation data and a generalizable personalization framework lay the groundwork for future research, and the authors release the dataset and model to support standardized work in this direction.
Abstract: Modern text-to-image (T2I) models generate high-fidelity visuals but remain indifferent to individual user preferences. While existing reward models optimize for "average" human appeal, they fail to capture the inherent subjectivity of aesthetic judgment. In this work, we introduce a novel dataset and predictive framework, called PAMELA, designed to model personalized image evaluations. Our dataset comprises 70,000 ratings across 5,000 diverse images generated by state-of-the-art models (Flux 2 and Nano Banana). Each image is evaluated by 15 unique users, providing a rich distribution of subjective preferences across domains such as art, design, fashion, and cinematic photography. Leveraging this data, we propose a personalized reward model trained jointly on our high-quality annotations and existing aesthetic assessment subsets. We demonstrate that our model predicts individual liking with higher accuracy than the majority of current state-of-the-art methods predict population-level preferences. Using our personalized predictor, we demonstrate how simple prompt optimization methods can be used to steer generations towards individual user preferences. Our results highlight the importance of data quality and personalization to handle the subjectivity of user preferences. We release our dataset and model to facilitate standardized research in personalized T2I alignment and subjective visual quality assessment.
[64] GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents
Mingyu Ouyang,Siyuan Hu,Kevin Qinghong Lin,Hwee Tou Ng,Mike Zheng Shou
Main category: cs.CV
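TL;DR: This paper presents GameWorld, a benchmark for standardized and verifiable evaluation of multimodal large language models (MLLMs) as generalist game agents in browser environments. It covers 34 games and 170 tasks and introduces two interfaces: computer-use (keyboard and mouse control) and semantic action parsing. Experiments show that the best current agents remain far from human-level performance.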
Details
Motivation: MLLM agents in real-world interaction face challenges such as high latency, sparse feedback, and irreversible mistakes; video games are an ideal testbed, but a unified, verifiable evaluation standard is lacking because of heterogeneous action interfaces and heuristic verification.
Method: Builds the GameWorld benchmark with 34 diverse games and 170 tasks and studies two agent interfaces: (i) computer-use agents that directly emit keyboard and mouse controls, and (ii) generalist multimodal agents that act in a semantic action space via deterministic Semantic Action Parsing. Every task is paired with state-verifiable metrics for outcome-based evaluation.
Result: Evaluation across 18 model-interface pairs shows that even the best agent falls far short of human play; repeated full-benchmark reruns confirm the benchmark's robustness; analyses of real-time interaction, context-memory sensitivity, and action validity expose further challenges for game agents.
Conclusion: GameWorld offers a standardized, verifiable, and reproducible evaluation framework and lays a foundation for research on multimodal game agents and on embodied intelligence more broadly.
Abstract: Towards an embodied generalist for real-world interaction, Multimodal Large Language Model (MLLM) agents still suffer from challenging latency, sparse feedback, and irreversible mistakes. Video games offer an ideal testbed with rich visual observations and closed-loop interaction, demanding fine-grained perception, long-horizon planning, and precise control. However, systematically evaluating these capabilities is currently hindered by heterogeneous action interfaces and heuristic verification. To this end, we introduce GameWorld, a benchmark designed for standardized and verifiable evaluation of MLLMs as generalist game agents in browser environments. Two game agent interfaces are studied: (i) computer-use agents that directly emit keyboard and mouse controls, and (ii) generalist multimodal agents that act in a semantic action space via deterministic Semantic Action Parsing. GameWorld contains 34 diverse games and 170 tasks, each paired with state-verifiable metrics for outcome-based evaluation. The results across 18 model-interface pairs suggest that even the best performing agent is far from achieving human capabilities on video games. Extensive experiments of repeated full-benchmark reruns demonstrate the robustness of the benchmark, while further studies on real-time interaction, context-memory sensitivity, and action validity expose more challenges ahead for game agents. Together, by offering a standardized, verifiable, and reproducible evaluation framework, GameWorld lays a robust foundation for advancing research on multimodal game agents and beyond. The project page is at https://gameworld-bench.github.io.
[65] HY-Embodied-0.5: Embodied Foundation Models for Real-World Agents
Tencent Robotics X,HY Vision Team,Xumin Yu,Zuyan Liu,Ziyi Wang,He Zhang,Yongming Rao,Fangfu Liu,Yani Zhang,Ruowen Zhao,Oran Wang,Yves Liang,Haitao Lin,Minghui Wang,Yubo Dong,Kevin Cheng,Bolin Ni,Rui Huang,Han Hu,Zhengyou Zhang,Linus,Shunyu Yao
Main category: cs.CV
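TL;DR: This paper introduces the HY-Embodied-0.5 family of foundation models for real-world embodied agents. A Mixture-of-Transformers (MoT) architecture strengthens fine-grained visual perception, an iterative self-evolving post-training paradigm improves reasoning, and on-policy distillation transfers the large model's capabilities to the small one. The models perform strongly on multiple benchmarks and in real-world robot control.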
Details
Motivation: Bridges the gap between general vision-language models (VLMs) and the practical needs of embodied agents by improving core capabilities: spatial and temporal visual perception and embodied reasoning (prediction, interaction, planning).
Method: Adopts a Mixture-of-Transformers (MoT) architecture with latent tokens to strengthen perceptual representations; designs an iterative, self-evolving post-training paradigm to improve reasoning; uses on-policy distillation to transfer the capabilities of the 32B model to the 2B model.
Result: Evaluated on 22 benchmarks spanning visual perception, spatial reasoning, and embodied understanding: MoT-2B outperforms similarly sized state-of-the-art models on 16 benchmarks, the 32B variant is comparable to Gemini 3.0 Pro, and a downstream VLA model achieves strong results in real-world robot control experiments.
Conclusion: The HY-Embodied-0.5 models improve the key capabilities embodied intelligence requires while balancing edge-deployment efficiency with complex reasoning; code and models are open-sourced.
Abstract: We introduce HY-Embodied-0.5, a family of foundation models specifically designed for real-world embodied agents. To bridge the gap between general Vision-Language Models (VLMs) and the demands of embodied agents, our models are developed to enhance the core capabilities required by embodied intelligence: spatial and temporal visual perception, alongside advanced embodied reasoning for prediction, interaction, and planning. The HY-Embodied-0.5 suite comprises two primary variants: an efficient model with 2B activated parameters designed for edge deployment, and a powerful model with 32B activated parameters targeted for complex reasoning. To support the fine-grained visual perception essential for embodied tasks, we adopt a Mixture-of-Transformers (MoT) architecture to enable modality-specific computing. By incorporating latent tokens, this design effectively enhances the perceptual representation of the models. To improve reasoning capabilities, we introduce an iterative, self-evolving post-training paradigm. Furthermore, we employ on-policy distillation to transfer the advanced capabilities of the large model to the smaller variant, thereby maximizing the performance potential of the compact model. Extensive evaluations across 22 benchmarks, spanning visual perception, spatial reasoning, and embodied understanding, demonstrate the effectiveness of our approach. Our MoT-2B model outperforms similarly sized state-of-the-art models on 16 benchmarks, while the 32B variant achieves performance comparable to frontier models such as Gemini 3.0 Pro. In downstream robot control experiments, we leverage our robust VLM foundation to train an effective Vision-Language-Action (VLA) model, achieving compelling results in real-world physical evaluations. Code and models are open-sourced at https://github.com/Tencent-Hunyuan/HY-Embodied.
[66] SMFD-UNet: Semantic Face Mask Is The Only Thing You Need To Deblur Faces
Abduz Zami
Main category: cs.CV
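TL;DR: This paper proposes SMFD-UNet, a lightweight face image deblurring framework that uses semantic face masks to guide deblurring without requiring high-quality reference images, and that outperforms existing methods on the CelebA dataset.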
Details
Motivation: Traditional deblurring methods rely on generic image priors and struggle to capture the structural and identity-specific features of faces, which limits their effectiveness; a dedicated method is needed that exploits facial semantic information without depending on sharp reference images.
Method: SMFD-UNet first uses a UNet to generate fine-grained semantic facial component masks (e.g., eyes, nose, mouth) from the blurry image; it then fuses the masks with the blurry input through a multi-stage feature fusion strategy and reconstructs a sharp face within a lightweight UNet architecture; a randomized blurring pipeline simulating about 1.74 trillion degradation scenarios improves robustness; the design integrates RDC modules, CBAM attention, efficient upsampling, and post-processing.
Result: On CelebA, PSNR and SSIM exceed those of state-of-the-art methods, while naturalness metrics such as NIQE, LPIPS, and FID remain satisfactory, supporting both reconstruction quality and visual realism.
Conclusion: SMFD-UNet is an efficient, lightweight, and robust face deblurring method that balances performance and practicality and can advance facial image restoration research and real-world applications.
Abstract: For applications including facial identification, forensic analysis, photographic improvement, and medical imaging diagnostics, facial image deblurring is an essential computer vision task that restores high-quality images from blurry inputs. Often based on general image priors, traditional deblurring techniques find it difficult to capture the particular structural and identity-specific features of human faces. To address these difficulties, we present SMFD-UNet (Semantic Mask Fusion Deblurring UNet), a new lightweight framework that uses semantic face masks to drive the deblurring process, thereby removing the need for high-quality reference photos. First, our dual-step method uses a UNet-based semantic mask generator to extract detailed facial component masks (e.g., eyes, nose, mouth) directly from blurry photos. Sharp, high-fidelity facial images are subsequently produced by integrating these masks with the blurry input using a multi-stage feature fusion technique within a computationally efficient UNet framework. To improve robustness, we created a randomized blurring pipeline that approximates real-world conditions by simulating around 1.74 trillion degradation scenarios. Evaluated on the CelebA dataset, SMFD-UNet shows better performance than state-of-the-art models, attaining higher Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) while preserving satisfactory naturalness measures, including NIQE, LPIPS, and FID.
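Since the abstract reports PSNR as a headline metric, a minimal sketch of how it is usually computed may help. This is the standard definition, PSNR = 10 log10(MAX^2 / MSE), not the authors' evaluation code; the function name `psnr`, the list-of-rows image format, and the toy pixel values are assumptions made for illustration.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB between two same-sized grayscale images.

    Images are given as lists of rows of pixel intensities (an assumed format).
    """
    if len(reference) != len(restored) or any(
        len(r) != len(s) for r, s in zip(reference, restored)
    ):
        raise ValueError("images must have the same shape")
    n_pixels = sum(len(row) for row in reference)
    # Mean squared error over all pixels.
    mse = sum(
        (a - b) ** 2
        for r, s in zip(reference, restored)
        for a, b in zip(r, s)
    ) / n_pixels
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 2x2 example: a sharp image and a slightly off restoration.
sharp = [[0, 255], [255, 0]]
deblurred = [[0, 250], [250, 0]]
print(f"PSNR: {psnr(sharp, deblurred):.2f} dB")
```

Higher values mean the restored image is closer to the reference. SSIM and the naturalness metrics named above need considerably more machinery and are not sketched here.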
Powered by Residual Dense Convolution (RDC) blocks, a multi-stage feature fusion strategy, efficient upsampling, attention mechanisms such as CBAM, and post-processing, the lightweight design ensures scalability and efficiency, making SMFD-UNet a flexible solution for advancing facial image restoration research and practical applications.
[67] Training-free Spatially Grounded Geometric Shape Encoding (Technical Report)
Yuhang He
Main category: cs.CV
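TL;DR: This paper proposes XShapeEnc, a training-free, general-purpose encoding method for 2D geometric shapes that uses orthogonal Zernike bases and a harmonic pose field to produce a compact representation that is invertible, adaptive, and rich in high-frequency content, suitable for a range of shape-aware tasks.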
Details
Motivation: Positional encoding has succeeded on 1D sequence tasks but does not extend directly to 2D spatial geometric shapes; an encoding must account for shape geometry, pose, and compatibility with neural network learning.
Method: Decomposes a 2D shape into its normalized geometry within the unit disk and a pose vector; converts the pose into a harmonic pose field within the unit disk; encodes geometry and pose independently or jointly with orthogonal Zernike bases, then applies a frequency-propagation operation to add high-frequency content.
Result: XShapeEnc exhibits five favorable properties, including invertibility, adaptivity, and frequency richness; its theoretical validity, computational efficiency, discriminability, and task applicability are validated through extensive analysis and experiments.
Conclusion: XShapeEnc is a foundational tool that supports the move from 1D sequential data toward frontier 2D spatial intelligence.
Abstract: Positional encoding has become the de facto standard for grounding deep neural networks on discrete point-wise positions, and it has achieved remarkable success in tasks where the input can be represented as a one-dimensional sequence. However, extending this concept to 2D spatial geometric shapes demands carefully designed encoding strategies that account not only for shape geometry and pose, but also for compatibility with neural network learning. In this work, we address these challenges by introducing a training-free, general-purpose encoding strategy, dubbed XShapeEnc, that encodes an arbitrary spatially grounded 2D geometric shape into a compact representation exhibiting five favorable properties, including invertibility, adaptivity, and frequency richness. Specifically, a 2D spatially grounded geometric shape is decomposed into its normalized geometry within the unit disk and its pose vector, where the pose is further transformed into a harmonic pose field that also lies within the unit disk. A set of orthogonal Zernike bases is constructed to encode shape geometry and pose either independently or jointly, followed by a frequency-propagation operation to introduce high-frequency content into the encoding. We demonstrate the theoretical validity, efficiency, discriminability, and applicability of XShapeEnc via extensive analysis and experiments across a wide range of shape-aware tasks and our self-curated XShapeCorpus. We envision XShapeEnc as a foundational tool for research that goes beyond one-dimensional sequential data toward frontier 2D spatial intelligence.
[68] On the Uphill Battle of Image frequency Analysis
Nader Bazyari,Hedieh Sajedi
Main category: cs.CV
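TL;DR: A follow-up to the Inverse Square Mean Shift clustering algorithm: the paper formulates a special case of the algorithm for non-homogeneous data and analyzes images with the three-dimensional Fast Fourier Transform to look for hidden patterns.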
Details
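Motivation: Address the clustering of non-homogeneous data and explore hidden patterns in images.
Method: Formulates a special case of the Inverse Square Mean Shift Algorithm for non-homogeneous data and applies the three-dimensional Fast Fourier Transform (3D FFT) to images.
Result: No specific experimental results or performance metrics are stated.
Conclusion: Offers new ideas and a methodological framework for clustering non-homogeneous data and discovering hidden patterns in images.
Abstract: This work is a follow-up on the newly proposed clustering algorithm called The Inverse Square Mean Shift Algorithm. In this paper, a special case of the algorithm for dealing with non-homogeneous data is formulated, and the three-dimensional Fast Fourier Transform of images is investigated with the aim of finding hidden patterns.
[69] Mathematical Analysis of Image Matching Techniques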
Oleh Samoilenko
Main category: cs.CV
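TL;DR: This paper analyzes and experimentally evaluates two classical local feature matching algorithms, SIFT and ORB, on satellite imagery, focusing on how the number of keypoints affects the inlier ratio of the matches.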
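The Inlier Ratio that this entry uses as its quality measure can be sketched in a few lines. The sketch assumes the homography has already been estimated (for example by RANSAC) and only scores the matches against it; the function names, the 3-pixel reprojection threshold, and the toy matches are hypothetical and are not taken from the paper.

```python
import math


def project(H, point):
    """Apply a 3x3 homography H (list of rows) to a 2D point."""
    x, y = point
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (
        (H[0][0] * x + H[0][1] * y + H[0][2]) / w,
        (H[1][0] * x + H[1][1] * y + H[1][2]) / w,
    )


def inlier_ratio(H, matches, threshold=3.0):
    """Fraction of (src, dst) matches whose reprojection error is within threshold pixels."""
    if not matches:
        return 0.0
    inliers = 0
    for src, dst in matches:
        px, py = project(H, src)
        if math.hypot(px - dst[0], py - dst[1]) <= threshold:
            inliers += 1
    return inliers / len(matches)


# Toy case: a pure translation by (10, 5) and four putative matches,
# the last of which is a gross outlier.
H = [[1.0, 0.0, 10.0], [0.0, 1.0, 5.0], [0.0, 0.0, 1.0]]
matches = [
    ((0.0, 0.0), (10.0, 5.0)),
    ((1.0, 1.0), (11.0, 6.0)),
    ((2.0, 3.0), (12.5, 8.0)),
    ((4.0, 4.0), (40.0, 40.0)),
]
print(inlier_ratio(H, matches))  # 0.75
```

In a full pipeline the matches would come from SIFT or ORB descriptors, and the threshold would normally match the one used during RANSAC.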
Details
Motivation: Image matching is a fundamental computer vision problem with direct applications in robotics, remote sensing, and geospatial data analysis; however, the performance of existing methods on satellite imagery lacks systematic evaluation.
Method: Compares SIFT and ORB on a dataset of GPS-annotated, overlapping satellite image tiles using a common pipeline: keypoint detection, descriptor extraction, descriptor matching, and geometric verification via RANSAC with homography estimation.
Result: Matching quality is measured by the Inlier Ratio, and the effect of the number of extracted keypoints on this metric is analyzed.
Conclusion: The study provides empirical evidence for selecting and tuning local feature algorithms in satellite image matching.
Abstract: Image matching is a fundamental problem in Computer Vision with direct applications in robotics, remote sensing, and geospatial data analysis. We present an analytical and experimental evaluation of classical local feature-based image matching algorithms on satellite imagery, focusing on the Scale-Invariant Feature Transform (SIFT) and the Oriented FAST and Rotated BRIEF (ORB). Each method is evaluated through a common pipeline: keypoint detection, descriptor extraction, descriptor matching, and geometric verification via RANSAC with homography estimation. Matching quality is assessed using the Inlier Ratio - the fraction of correspondences consistent with the estimated homography. The study uses a manually constructed dataset of GPS-annotated satellite image tiles with intentional overlaps. We examine the impact of the number of extracted keypoints on the resulting Inlier Ratio.
[70] Event-Level Detection of Surgical Instrument Handovers in Videos with Interpretable Vision Models
Katerina Katsarou,George Zountsas,Karam Tomotaki-Dawoud,Alexander Ehrenhoefer,Paul Chojecki,David Przewozny,Igor Maximilian Sauer,Amira Mouakher,Sebastian Bosse
Main category: cs.CV
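TL;DR: This paper proposes a ViT-LSTM spatiotemporal vision framework for detecting instrument handover events in surgical videos and classifying their direction, combining multi-task learning with peak detection for event-level recognition and using Layer-CAM to improve interpretability.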
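The step from per-frame confidence scores to discrete handover events can be illustrated with a simple peak detector. This is a sketch under stated assumptions, not the paper's procedure: the threshold, the minimum gap between events, and the function name `detect_events` are all hypothetical.

```python
def detect_events(scores, threshold=0.5, min_gap=3):
    """Return indices of local maxima in scores that reach threshold.

    Peaks closer than min_gap frames to the previous event are merged,
    keeping whichever of the two has the higher score.
    """
    events = []
    for t, s in enumerate(scores):
        if s < threshold:
            continue
        left = scores[t - 1] if t > 0 else float("-inf")
        right = scores[t + 1] if t + 1 < len(scores) else float("-inf")
        # Strictly above the left neighbour so a plateau yields a single peak.
        if s > left and s >= right:
            if events and t - events[-1] < min_gap:
                if s > scores[events[-1]]:
                    events[-1] = t
            else:
                events.append(t)
    return events


# Toy confidence signal with two clear peaks.
confidence = [0.1, 0.2, 0.9, 0.3, 0.1, 0.6, 0.7, 0.65, 0.2, 0.4]
print(detect_events(confidence))  # [2, 6]
```

Working at the event level rather than per frame means one handover that spans many frames is counted once, which is what an event-level F1-score measures.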
Details
Motivation: Automatic detection of surgical instrument handovers matters for procedural efficiency and patient safety, but it is challenging because of frequent occlusions, background clutter, and the temporally evolving nature of interaction events.
Method: A spatiotemporal model combines a Vision Transformer (ViT) with a unidirectional LSTM; a unified multi-task formulation jointly predicts whether a handover occurs and its direction; discrete events are extracted by peak detection over the confidence time series; Layer-CAM provides spatial attribution visualizations.
Result: On a kidney transplant dataset, the model reaches an F1 of 0.84 for handover detection and a mean F1 of 0.72 for direction classification, outperforming a single-task variant and a VideoMamba baseline on direction prediction while keeping detection performance comparable.
Conclusion: The framework models the spatiotemporal dynamics of instrument handovers, combines accuracy with interpretability, and offers a practical approach to intraoperative monitoring.
Abstract: Reliable monitoring of surgical instrument exchanges is essential for maintaining procedural efficiency and patient safety in the operating room. Automatic detection of instrument handovers in intraoperative video remains challenging due to frequent occlusions, background clutter, and the temporally evolving nature of interaction events. We propose a spatiotemporal vision framework for event-level detection and direction classification of surgical instrument handovers in surgical videos. The model combines a Vision Transformer (ViT) backbone for spatial feature extraction with a unidirectional Long Short-Term Memory (LSTM) network for temporal aggregation. A unified multi-task formulation jointly predicts handover occurrence and interaction direction, enabling consistent modeling of transfer dynamics while avoiding error propagation typical of cascaded pipelines. Predicted confidence scores form a temporal signal over the video, from which discrete handover events are identified via peak detection. Experiments on a dataset of kidney transplant procedures demonstrate strong performance, achieving an F1-score of 0.84 for handover detection and a mean F1-score of 0.72 for direction classification, outperforming both a single-task variant and a VideoMamba-based baseline for direction prediction while maintaining comparable detection performance. To improve interpretability, we employ Layer-CAM attribution to visualize spatial regions driving model decisions, highlighting hand-instrument interaction cues.
[71] MSGL-Transformer: A Multi-Scale Global-Local Transformer for Rodent Social Behavior Recognition
Muhammad Imran Sharif,Doina Caragea
Main category: cs.CV
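TL;DR: This paper proposes MSGL-Transformer, which uses multi-scale global-local attention and a behavior-aware modulation block to improve rodent social behavior recognition from pose sequences, achieving state-of-the-art results on two public datasets.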
Details
Motivation: Manual annotation of rodent behavior is time-consuming and error-prone, so an automatic, robust, and generalizable recognition method is needed.
Method: MSGL-Transformer is a lightweight transformer encoder with parallel short-range, medium-range, and global attention branches, plus a Behavior-Aware Modulation (BAM) block inspired by SE-Networks that modulates temporal embeddings to emphasize behavior-relevant features.
Result: On RatSI, 75.4% mean accuracy (F1 = 0.745), outperforming TCN, LSTM, and Bi-LSTM; on CalMS21, 87.1% accuracy (F1 = 0.8745), a 10.7% improvement over HSTWFormer, also outperforming graph-based models such as ST-GCN and MS-G3D; the same architecture transfers across datasets with only the input dimensionality and number of classes changed.
Conclusion: MSGL-Transformer shows that multi-scale temporal modeling and behavior-aware feature modulation are effective for pose-based behavior recognition, with strong generalization and practical value.
Abstract: Recognition of rodent behavior is important for understanding neural and behavioral mechanisms. Traditional manual scoring is time-consuming and prone to human error. We propose MSGL-Transformer, a Multi-Scale Global-Local Transformer for recognizing rodent social behaviors from pose-based temporal sequences. The model employs a lightweight transformer encoder with multi-scale attention to capture motion dynamics across different temporal scales. The architecture integrates parallel short-range, medium-range, and global attention branches to explicitly capture behavior dynamics at multiple temporal scales. We also introduce a Behavior-Aware Modulation (BAM) block, inspired by SE-Networks, which modulates temporal embeddings to emphasize behavior-relevant features prior to attention. We evaluate on two datasets: RatSI (5 behavior classes, 12D pose inputs) and CalMS21 (4 behavior classes, 28D pose inputs). On RatSI, MSGL-Transformer achieves 75.4% mean accuracy and F1-score of 0.745 across nine cross-validation splits, outperforming TCN, LSTM, and Bi-LSTM. On CalMS21, it achieves 87.1% accuracy and F1-score of 0.8745, a +10.7% improvement over HSTWFormer, and outperforms ST-GCN, MS-G3D, CTR-GCN, and STGAT. The same architecture generalizes across both datasets with only input dimensionality and number of classes adjusted.
[72] Bootstrapping Sign Language Annotations with Sign Language Models
Colin Lea,Vasileios Baltatzis,Connor Gillis,Raja Kushalnagar,Lorna Quandt,Leah Findlater
Main category: cs.CV
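TL;DR: This paper proposes a pseudo-annotation pipeline that, when high-quality annotated data is scarce, generates candidate annotations with time intervals for glosses, fingerspelled words, and sign classifiers in sign language videos, and releases a human-annotated gold-standard benchmark together with a large set of pseudo-annotations.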
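This entry reports fingerspelling accuracy as a character error rate (CER). A minimal sketch of the usual definition follows: the Levenshtein edit distance between the reference and the hypothesis, divided by the reference length. The function names and example strings are illustrative assumptions, and the paper's exact scoring protocol (for instance, any text normalization) is not specified here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of r
                curr[j - 1] + 1,     # insertion of h
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate of hypothesis hyp against a non-empty reference ref."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


print(cer("HELLO", "HALLO"))  # 0.2
```

Because insertions count as errors, CER can exceed 1.0 when the hypothesis is much longer than the reference.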
Details
Motivation: AI-driven sign language interpretation is limited by a lack of high-quality annotated data; recent large datasets such as ASL STEM Wiki and FLEURS-ASL are only partially annotated, and the high cost of manual annotation leaves them underutilized.
Method: Builds a pseudo-annotation pipeline that takes signed video and the corresponding English text as input and combines sparse predictions from the authors' fingerspelling recognizer and isolated sign recognizer (ISR) with a K-shot LLM approach to output ranked candidate annotations, with time intervals, for glosses, fingerspelled words, and classifiers; also establishes simple yet effective fingerspelling and ISR baseline models.
Result: Fingerspelling recognition reaches a 6.7% character error rate (CER) on FSBoard, and ISR reaches 74% top-1 accuracy on ASL Citizen; a professional interpreter annotated nearly 500 ASL STEM Wiki videos with sequence-level labels (glosses, classifiers, fingerspelling), and this gold standard is released along with over 300 hours of pseudo-annotations.
Conclusion: The pipeline can ease the annotation bottleneck in sign language recognition, and the released benchmark and pseudo-annotations should support follow-up research; the baseline models reach state-of-the-art performance.
Abstract: AI-driven sign language interpretation is limited by a lack of high-quality annotated data. New datasets including ASL STEM Wiki and FLEURS-ASL contain professional interpreters and hundreds of hours of data but remain only partially annotated and thus underutilized, in part due to the prohibitive costs of annotating at this scale. In this work, we develop a pseudo-annotation pipeline that takes signed video and English as input and outputs a ranked set of likely annotations, including time intervals, for glosses, fingerspelled words, and sign classifiers. Our pipeline uses sparse predictions from our fingerspelling recognizer and isolated sign recognizer (ISR), along with a K-Shot LLM approach, to estimate these annotations. In service of this pipeline, we establish simple yet effective baseline fingerspelling and ISR models, achieving state-of-the-art on FSBoard (6.7% CER) and on ASL Citizen datasets (74% top-1 accuracy). To validate and provide a gold-standard benchmark, a professional interpreter annotated nearly 500 videos from ASL STEM Wiki with sequence-level gloss labels containing glosses, classifiers, and fingerspelling signs. These human annotations and over 300 hours of pseudo-annotations are being released in supplemental material.
[73] VSAS-BENCH: Real-Time Evaluation of Visual Streaming Assistant Models
Pavan Kumar Anasosalu Vasu,Cem Koc,Fartash Faghri,Chun-Liang Li,Bo Feng,Zhengfeng Lai,Meng Cao,Oncel Tuzel,Hadi Pouransari
Main category: cs.CV
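TL;DR: This paper proposes VSAS-Bench, a new benchmark and framework for visual streaming assistants (streaming VLMs) that emphasizes streaming-specific metrics such as proactiveness and consistency, and shows through large-scale experiments that conventional VLMs with simple adaptation can outperform dedicated streaming models.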
Details
Motivation: Existing VLM evaluation is mostly offline, single-turn question answering and overlooks key streaming capabilities such as the timeliness of responses (proactiveness) and their consistency over time; there is no evaluation framework suited to real-time visual assistants.
Method: Builds VSAS-Bench with over 18,000 temporally dense annotations across diverse domains and task types; designs synchronous and asynchronous evaluation protocols; defines metrics that isolate distinct streaming capabilities; systematically evaluates how memory buffer length, memory access policy, and input resolution affect the accuracy-latency trade-off.
Result: Conventional VLMs (e.g., Qwen3-VL-4B), adapted without additional training, match or exceed dedicated streaming VLMs (e.g., Dispider) under the asynchronous protocol, with a 3% gain; the study also characterizes how the key design factors affect performance.
Conclusion: Evaluating streaming VLMs requires a new paradigm, and VSAS-Bench fills this gap; the results indicate that streaming capability does not necessarily depend on dedicated architectures or training, which suggests a route to efficient deployment.
Abstract: Streaming vision-language models (VLMs) continuously generate responses given an instruction prompt and an online stream of input frames. This is a core mechanism for real-time visual assistants. Existing VLM frameworks predominantly assess models in offline settings. In contrast, the performance of a streaming VLM depends on additional metrics beyond pure video understanding, including proactiveness, which reflects the timeliness of the model's responses, and consistency, which captures the robustness of its responses over time. To address this limitation, we propose VSAS-Bench, a new framework and benchmark for Visual Streaming Assistants. In contrast to prior benchmarks that primarily employ single-turn question answering on video inputs, VSAS-Bench features temporally dense annotations with over 18,000 annotations across diverse input domains and task types. We introduce standardized synchronous and asynchronous evaluation protocols, along with metrics that isolate and measure distinct capabilities of streaming VLMs. Using this framework, we conduct large-scale evaluations of recent video and streaming VLMs, analyzing the accuracy-latency trade-off under key design factors such as memory buffer length, memory access policy, and input resolution, yielding several practical insights. Finally, we show empirically that conventional VLMs can be adapted to streaming settings without additional training, and demonstrate that these adapted models outperform recent streaming VLMs.
For example, Qwen3-VL-4B surpasses Dispider, the best streaming VLM on our benchmark, by 3% under the asynchronous protocol. The benchmark and code will be available at https://github.com/apple/ml-vsas-bench.
[74] Monocular Depth Estimation From the Perspective of Feature Restoration: A Diffusion Enhanced Depth Restoration Approach
Huibin Bai,Shuai Li,Hanxiao Zhai,Yanbo Gao,Chong Lv,Yibo Wang,Haipeng Ping,Wei Hua,Xingyu Gao
Main category: cs.CV
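TL;DR: This paper approaches monocular depth estimation from a feature restoration perspective, designing an Invertible Transform-enhanced Indirect Diffusion module (InvT-IndDiffusion) and an Auxiliary Viewpoint-based Low-level Feature Enhancement module (AV-LFE) that together yield substantial performance gains.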
Details
Motivation: Existing monocular depth estimation methods mostly use an encoder-decoder architecture, but the limitations of that architecture and the effect of features at different levels on prediction accuracy have not been systematically evaluated; the authors find that the current framework still has substantial headroom if encoder feature quality can be improved.
Method: Treats pretrained encoder features as degraded versions of an assumed ground-truth feature; proposes the InvT-IndDiffusion module, which restores features iteratively without direct feature supervision, using only indirect supervision from the sparse depth map, with an invertible transform-based decoder satisfying a bi-Lipschitz condition; introduces the plug-and-play AV-LFE module, which uses auxiliary viewpoint information to enhance local details.
Result: Outperforms state-of-the-art methods on multiple datasets; on the KITTI benchmark, RMSE improves over the baseline by 4.09% and 37.77% under different training settings.
Conclusion: Reformulating monocular depth estimation as feature restoration is effective; the proposed InvT-IndDiffusion and AV-LFE modules are general and plug-and-play, and they markedly improve depth estimation accuracy.
Abstract: Monocular Depth Estimation (MDE) is a fundamental computer vision task with important applications in 3D vision. The current mainstream MDE methods employ an encoder-decoder architecture with multi-level/scale feature processing. However, the limitations of the current architecture and the effects of different-level features on the prediction accuracy are not evaluated. In this paper, we first investigate the above problem and show that there is still substantial potential in the current framework if encoder features can be improved. Therefore, we propose to formulate the depth estimation problem from the feature restoration perspective, by treating pretrained encoder features as degraded features of an assumed ground truth feature that yields the ground truth depth map. Then an Invertible Transform-enhanced Indirect Diffusion (InvT-IndDiffusion) module is developed for feature restoration. Due to the absence of direct supervision on feature, only indirect supervision from the final sparse depth map is used. During the iterative procedure of diffusion, this results in feature deviations among steps. The proposed InvT-IndDiffusion solves this problem by using an invertible transform-based decoder under the bi-Lipschitz condition. Finally, a plug-and-play Auxiliary Viewpoint-based Low-level Feature Enhancement module (AV-LFE) is developed to enhance local details with auxiliary viewpoint when available. Experiments demonstrate that the proposed method achieves better performance than the state-of-the-art methods on various datasets.
Specifically on the KITTI benchmark, compared with the baseline, the performance is improved by 4.09% and 37.77% under different training settings in terms of RMSE. Code is available at https://github.com/whitehb1/IID-RDepth.
[75] Adaptive Depth-converted-Scale Convolution for Self-supervised Monocular Depth Estimation
Yanbo Gao,Huibin Bai,Huasong Zhou,Xingyu Gao,Shuai Li,Xun Cai,Hui Yuan,Wei Hua,Tian Xie
Main category: cs.CV
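TL;DR: This paper proposes a monocular depth estimation framework enhanced with Depth-converted-Scale Convolution (DcSConv), which models the relationship between object depth and scale and adapts the convolution filter scale to reduce the depth-scale ambiguity caused by changing object sizes in monocular video; a DcS-F module fuses the new and conventional features. On KITTI it reduces SqRel by up to 11.6% and can be plugged into existing CNN-based methods.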
Details
Motivation: Existing self-supervised monocular depth estimation methods do not explicitly model how an object's apparent scale changes with depth; in monocular video in particular, the scale of the same object changes continuously, creating ambiguity between scale and depth.
Method: Proposes Depth-converted-Scale Convolution (DcSConv), which incorporates a depth prior into the choice of convolution filter scale, emphasizing scale adaptation rather than shape deformation; designs a Depth-converted-Scale aware Fusion (DcS-F) module that adaptively fuses DcSConv features with conventional convolution features; the whole is a plug-and-play module for existing CNN-based depth estimation frameworks.
Result: Achieves the best results on the KITTI benchmark, reducing SqRel by up to 11.6%; ablation studies validate the contribution of both DcSConv and DcS-F.
Conclusion: The scale of the convolution filter matters for monocular depth estimation no less than its local deformation, and possibly more in the evaluated task; depth-guided scale adaptation improves the model's grasp of scene structure, and the method is general and plug-and-play.
Abstract: Self-supervised monocular depth estimation (MDE) has received increasing interest in the last few years. The objects in the scene, including the object size and relationship among different objects, are the main clues to extract the scene structure. However, previous works lack the explicit handling of the changing sizes of the object due to the change of its depth. Especially in a monocular video, the size of the same object is continuously changed, resulting in size and depth ambiguity. To address this problem, we propose a Depth-converted-Scale Convolution (DcSConv) enhanced monocular depth estimation framework, by incorporating the prior relationship between the object depth and object scale to extract features from appropriate scales of the convolution receptive field. The proposed DcSConv focuses on the adaptive scale of the convolution filter instead of the local deformation of its shape. It establishes that the scale of the convolution filter matters no less (or even more in the evaluated task) than its local deformation. Moreover, a Depth-converted-Scale aware Fusion (DcS-F) is developed to adaptively fuse the DcSConv features and the conventional convolution features. Our DcSConv enhanced monocular depth estimation framework can be applied on top of existing CNN based methods as a plug-and-play module to enhance the conventional convolution block.
Extensive experiments with different baselines have been conducted on the KITTI benchmark and our method achieves the best results with an improvement up to 11.6% in terms of SqRel reduction. Ablation study also validates the effectiveness of each proposed module.
[76] Weight Group-wise Post-Training Quantization for Medical Foundation Model
Yineng Chen,Peng Huang,Aozhong Zhang,Hui Guo,Penghang Yin,Shu Hu,Shao Lin,Xin Li,Tzu-Jen Kao,Balakrishnan Prabhakaran,MingChing Chang,Xin Wang
Main category: cs.CV
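TL;DR: This paper proposes Permutation-COMQ, a backpropagation-free post-training quantization algorithm that improves low-bit (2/4/8-bit) quantization accuracy through weight reordering and simple operations, targeting terminal medical devices.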
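For readers unfamiliar with low-bit weight quantization, the sketch below shows plain round-to-nearest uniform quantization of one weight channel with a per-channel scale. It is a generic baseline for illustration only and is not the Permutation-COMQ algorithm, whose update rule and reordering strategy are not detailed in this entry; the function names and sample weights are hypothetical.

```python
def quantize_channel(weights, bits):
    """Symmetric uniform quantization of one channel to signed integer codes.

    Returns (codes, scale) such that weights[i] is approximated by codes[i] * scale.
    """
    qmax = 2 ** (bits - 1) - 1          # e.g. 3 for 3-bit, 7 for 4-bit
    peak = max(abs(w) for w in weights)
    if peak == 0:
        return [0] * len(weights), 1.0
    scale = peak / qmax                 # per-channel scale factor
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


channel = [0.9, -0.4, 0.2, 0.0]
codes, scale = quantize_channel(channel, bits=3)
restored = dequantize(codes, scale)
print(codes)  # [3, -1, 1, 0]
print(max(abs(w - r) for w, r in zip(channel, restored)))
```

The rounding error per weight is at most half the scale, so a channel with one large outlier weight gets a coarse grid for all its other weights. That sensitivity to channel-wise scaling is the kind of degradation the entry says weight reordering is meant to address.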
Details
Motivation: Foundation models perform well in medical image analysis, but their large parameter counts and high computational complexity limit real-time inference on terminal medical devices, so a lightweight solution is needed.
Method: Proposes the Permutation-COMQ post-training quantization algorithm: (1) it uses only dot products and rounding, avoiding backpropagation and hyperparameter tuning; (2) it introduces a weight-aware strategy that reorders weights within each layer to mitigate the accuracy loss caused by channel-wise scaling while preserving channel structure.
Result: Achieves the best results in 2-bit, 4-bit, and 8-bit quantization settings.
Conclusion: Permutation-COMQ is an efficient, training-free, structure-preserving quantization method that improves low-bit quantization accuracy and suits resource-constrained medical terminal deployment.
Abstract: Foundation models have achieved remarkable results in medical image analysis. However, their large network architectures and high computational complexity significantly impact inference speed, limiting their application on terminal medical devices. Quantization, a technique that compresses models into low-bit versions, is a solution to this challenge. In this paper, we propose a post-training quantization algorithm, Permutation-COMQ. It eliminates the need for backpropagation by using simple dot products and rounding operations, thereby removing hyperparameter tuning and simplifying the process. Additionally, we introduce a weight-aware strategy that reorders the weights within each layer to address the accuracy degradation induced by channel-wise scaling during quantization, while preserving channel structure. Experiments demonstrate that our method achieves the best results in 2-bit, 4-bit, and 8-bit quantization.
[77] FireSenseNet: A Dual-Branch CNN with Cross-Attentive Feature Interaction for Next-Day Wildfire Spread Prediction
Jinzhen Han,JinByeong Lee,Hak Han,YeonJu Na,Jae-Joon Lee
Main category: cs.CV
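TL;DR: This paper proposes FireSenseNet, a dual-branch CNN whose CAFIM module explicitly models the spatial interaction between fuel and weather features; it outperforms existing methods on next-day wildfire spread prediction and exposes both an evaluation bias and the relative importance of key features.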
Details
Motivation: Existing deep learning methods simply concatenate heterogeneous geospatial inputs, ignoring the physical distinction between static fuel/terrain properties and dynamic meteorological conditions, which leads to inaccurate modeling.
Method: Proposes FireSenseNet, a dual-branch CNN with a Cross-Attentive Feature Interaction Module (CAFIM) that models the spatially varying interaction between fuel and weather modalities through learnable attention gates at multiple encoder scales; adds Monte Carlo Dropout for pixel-level uncertainty quantification.
Result: On the Google Next-Day Wildfire Spread benchmark, F1 reaches 0.4176 and AUC-PR 0.3435, surpassing a SegFormer with 3.8× more parameters; ablation shows CAFIM gives a 7.1% relative F1 gain; the previous-day fire mask dominates prediction, while wind speed behaves as noise at the dataset's coarse temporal resolution; common evaluation shortcuts inflate F1 by over 44%.
Conclusion: Explicitly modeling the physical interaction between geospatial modalities matters for wildfire prediction; CAFIM improves performance; evaluation should avoid shortcut bias; feature importance analysis and uncertainty quantification make the model more trustworthy and interpretable.
Abstract: Accurate prediction of next-day wildfire spread is critical for disaster response and resource allocation. Existing deep learning approaches typically concatenate heterogeneous geospatial inputs into a single tensor, ignoring the fundamental physical distinction between static fuel/terrain properties and dynamic meteorological conditions. We propose FireSenseNet, a dual-branch convolutional neural network equipped with a novel Cross-Attentive Feature Interaction Module (CAFIM) that explicitly models the spatially varying interaction between fuel and weather modalities through learnable attention gates at multiple encoder scales. Through a systematic comparison of seven architectures -- spanning pure CNNs, Vision Transformers, and hybrid designs -- on the Google Next-Day Wildfire Spread benchmark, we demonstrate that FireSenseNet achieves an F1 of 0.4176 and AUC-PR of 0.3435, outperforming all alternatives including a SegFormer with 3.8× more parameters (F1 = 0.3502). Ablation studies confirm that CAFIM provides a 7.1% relative F1 gain over naive concatenation, and channel-wise feature importance analysis reveals that the previous-day fire mask dominates prediction while wind speed acts as noise at the dataset's coarse temporal resolution.
We further incorporate Monte Carlo Dropout for pixel-level uncertainty quantification and present a critical analysis showing that common evaluation shortcuts inflate reported F1 scores by over 44%.
[78] Needle in a Haystack -- One-Class Representation Learning for Detecting Rare Malignant Cells in Computational Cytology
Swarnadip Chatterjee,Vladimir Basic,Arrigo Capitanio,Orcun Goksel,Joakim Lindblad
Main category: cs.CV
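TL;DR: This paper uses one-class representation learning (OCC) methods such as DSVDD and DROC to detect rare malignant cells at very low witness rates (≤ 1%), training only on negative slides without instance-level annotations; on bone marrow and oral cancer cytology datasets it reaches state-of-the-art performance, surpassing conventional MIL and some supervised methods.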
Details
Motivation: Malignant cells in whole-slide images are morphologically diverse and extremely rare, causing severe class imbalance and a shortage of annotations; conventional weakly supervised methods such as MIL generalize poorly at very low witness rates.
Method: Uses one-class representation learning methods (DSVDD, DROC) trained only on patches from negative slides to learn a compact representation of normality and detect deviations at test time; compares them with FS-SIL, WS-SIL, and ItS2CLR.
Result: DSVDD achieves state-of-the-art instance-level abnormality ranking at witness rates ≤ 1%, in some cases even outperforming fully supervised learning; DROC, which benefits from distribution-augmented contrastive learning, is also competitive under extreme rarity.
Conclusion: One-class representation learning is a more robust and interpretable choice than MIL for detecting extremely rare malignant cells.
Abstract: In computational cytology, detecting malignancy on whole-slide images is difficult because malignant cells are morphologically diverse yet vanishingly rare amid a vast background of normal cells. Accurate detection of these extremely rare malignant cells remains challenging due to large class imbalance and limited annotations. Conventional weakly supervised approaches, such as multiple instance learning (MIL), often fail to generalize at the instance level, especially when the fraction of malignant cells (witness rate) is exceedingly low. In this study, we explore the use of one-class representation learning techniques for detecting malignant cells in low-witness-rate scenarios. These methods are trained exclusively on slide-negative patches, without requiring any instance-level supervision. Specifically, we evaluate two OCC approaches, DSVDD and DROC, and compare them with FS-SIL, WS-SIL, and the recent ItS2CLR method. The one-class methods learn compact representations of normality and detect deviations at test time. Experiments on a publicly available bone marrow cytomorphology dataset (TCIA) and an in-house oral cancer cytology dataset show that DSVDD achieves state-of-the-art performance in instance-level abnormality ranking, particularly in ultra-low witness-rate regimes (≤ 1%), in some cases even outperforming fully supervised learning, which is typically not a practical option in whole-slide cytology due to the infeasibility of exhaustive instance-level annotations. DROC is also competitive under extreme rarity, benefiting from distribution-augmented contrastive learning.
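The idea of learning normality and flagging deviations at test time can be sketched with a distance-to-center score, which is the scoring rule used by Deep SVDD. The sketch is heavily simplified: it assumes embeddings are already available, whereas Deep SVDD also trains the embedding network so that normal samples cluster around the center. Function names and toy vectors are hypothetical.

```python
def fit_center(normal_embeddings):
    """Center of normality: the mean of embeddings from normal (negative) patches."""
    n = len(normal_embeddings)
    dim = len(normal_embeddings[0])
    return [sum(e[d] for e in normal_embeddings) / n for d in range(dim)]


def anomaly_score(embedding, center):
    """Squared Euclidean distance to the center; larger means more abnormal."""
    return sum((x - c) ** 2 for x, c in zip(embedding, center))


def rank_by_abnormality(embeddings, center):
    """Indices of embeddings sorted from most to least abnormal."""
    return sorted(
        range(len(embeddings)),
        key=lambda i: anomaly_score(embeddings[i], center),
        reverse=True,
    )


# Toy 2-D embeddings: four normal training patches and three test patches.
normal = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
test_patches = [[1.0, 1.0], [5.0, 5.0], [1.5, 1.0]]
center = fit_center(normal)
print(rank_by_abnormality(test_patches, center))  # [1, 2, 0]
```

Ranking rather than thresholding matches the instance-level abnormality ranking evaluated here: a cytologist would review the highest-scoring cells first.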
These findings highlight one-class representation learning as a robust and interpretable choice, superior to MIL, for malignant cell detection under extreme rarity.
[79] Direct Segmentation without Logits Optimization for Training-Free Open-Vocabulary Semantic Segmentation
Jiahao Li,Yang Lu,Yachao Zhang,Fangyong Wang,Yuan Xie,Yanyun Qu
Main category: cs.CV
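TL;DR: This paper proposes a new open-vocabulary semantic segmentation method that needs neither iterative training nor model-specific attention modulation: it generates semantic maps by directly deriving an analytic solution for the distribution discrepancy, reaching state-of-the-art performance on eight benchmark datasets.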
Details
Motivation: Existing open-vocabulary semantic segmentation methods rely on time-consuming iterative training or model-specific attention modulation to optimize the logits computed from vision-language features, which limits efficiency and generality.
Method: Posits a key hypothesis: the distribution discrepancy encodes semantic information and is consistent across patches of the same category but inconsistent across categories; on that basis, the analytic solution of this discrepancy is used directly as the semantic segmentation map, skipping logits optimization.
Result: Achieves state-of-the-art performance on eight benchmark datasets while avoiding iterative training and model-specific attention modulation.
Conclusion: The distribution discrepancy itself can be used directly for semantic segmentation; the analytic approach is more efficient and general and improves performance.
Abstract: Open-vocabulary semantic segmentation (OVSS) aims to segment arbitrary category regions in images using open-vocabulary prompts, necessitating that existing methods possess pixel-level vision-language alignment capability. Typically, this capability involves computing the cosine similarity, i.e., logits, between visual and linguistic features, and minimizing the distribution discrepancy between the logits and the ground truth (GT) to generate optimal logits that are subsequently used to construct segmentation maps, yet it depends on time-consuming iterative training or model-specific attention modulation. In this work, we propose a more direct approach that eschews the logits-optimization process by directly deriving an analytic solution for the segmentation map. We posit a key hypothesis: the distribution discrepancy encodes semantic information; specifically, this discrepancy exhibits consistency across patches belonging to the same category but inconsistency across different categories. Based on this hypothesis, we directly utilize the analytic solution of this distribution discrepancy as the semantic maps. In other words, we reformulate the optimization of the distribution discrepancy as deriving its analytic solution, thereby eliminating time-consuming iterative training, freeing us from model-specific attention modulation, and achieving state-of-the-art performance on eight benchmark datasets.
[80] GEAR: GEometry-motion Alternating Refinement for Articulated Object Modeling with Gaussian Splatting
Jialin Li,Bin Fu,Ruiping Wang,Xilin Chen
Main category: cs.CV
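TL;DR: This paper proposes GEAR, a framework that jointly models geometry and motion within a Gaussian Splatting representation through EM-style alternating optimization, improving high-fidelity reconstruction of complex articulated objects.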
Details
Motivation: High-fidelity interactive digital assets are essential for embodied intelligence and robotic interaction, but existing articulated-object reconstruction methods suffer from unstable joint geometry-motion optimization and limited generalization.
Method: Proposes GEAR, an EM-style alternating optimization framework that models geometry and motion as interdependent components within a Gaussian Splatting representation; part segmentation is treated as a latent variable and joint motion parameters as explicit variables, and the two are refined alternately; a 2D segmentation model supplies multi-view part priors, and a weakly supervised constraint regularizes the latent variable.
Result: On multiple benchmarks and the newly constructed GEAR-Multi dataset, GEAR achieves state-of-the-art geometric reconstruction and motion parameter estimation, particularly on complex articulated objects with multiple movable parts.
Conclusion: GEAR improves the stability, consistency, and generalization of articulated-object reconstruction, offering a more reliable way to generate digital assets for embodied intelligence.
Abstract: High-fidelity interactive digital assets are essential for embodied intelligence and robotic interaction, yet articulated objects remain challenging to reconstruct due to their complex structures and coupled geometry-motion relationships. Existing methods suffer from instability in geometry-motion joint optimization, while their generalization remains limited on complex multi-joint or out-of-distribution objects. To address these challenges, we propose GEAR, an EM-style alternating optimization framework that jointly models geometry and motion as interdependent components within a Gaussian Splatting representation. GEAR treats part segmentation as a latent variable and joint motion parameters as explicit variables, alternately refining them for improved convergence and geometric-motion consistency. To enhance part segmentation quality without sacrificing generalization, we leverage a vanilla 2D segmentation model to provide multi-view part priors, and employ a weakly supervised constraint to regularize the latent variable. Experiments on multiple benchmarks and our newly constructed dataset GEAR-Multi demonstrate that GEAR achieves state-of-the-art results in geometric reconstruction and motion parameters estimation, particularly on complex articulated objects with multiple movable parts.
[81] Beyond Pedestrians: Caption-Guided CLIP Framework for High-Difficulty Video-based Person Re-Identification
Shogo Hamano,Shunya Wakasugi,Tatsuhito Sato,Sayaka Nakamura
Main category: cs.CV
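TL;DR: This paper proposes CG-CLIP, a framework that combines textual descriptions with learnable tokens, using Caption-guided Memory Refinement (CMR) and Token-based Feature Extraction (TFE) to improve video-based person re-identification in high-difficulty scenarios such as sports and dance, with state-of-the-art results on several datasets.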
Details
Motivation: Existing video-based person re-identification methods perform poorly in high-difficulty scenarios, such as sports and dance, where multiple people wear similar clothing and move dynamically; richer semantic information and more efficient feature modeling are needed.
Method: Proposes the CG-CLIP framework with two core components: (1) Caption-guided Memory Refinement (CMR), which uses captions generated by multi-modal large language models to refine identity-specific features; (2) Token-based Feature Extraction (TFE), which uses fixed-length learnable tokens and cross-attention to aggregate spatiotemporal features at lower computational cost.
Result: Outperforms current state-of-the-art methods on four datasets (MARS, iLIDS-VID, SportsVReID, and DanceVReID), with especially large gains on the newly constructed high-difficulty SportsVReID and DanceVReID.
Conclusion: By combining explicit textual guidance with learnable-token modeling, CG-CLIP improves the robustness and discriminative power of video-based person re-identification in complex dynamic scenes and shows the potential of multi-modal modeling for ReID.
Abstract: In recent years, video-based person Re-Identification (ReID) has gained attention for its ability to leverage spatiotemporal cues to match individuals across non-overlapping cameras. However, current methods struggle with high-difficulty scenarios, such as sports and dance performances, where multiple individuals wear similar clothing while performing dynamic movements. To overcome these challenges, we propose CG-CLIP, a novel caption-guided CLIP framework that leverages explicit textual descriptions and learnable tokens. Our method introduces two key components: Caption-guided Memory Refinement (CMR) and Token-based Feature Extraction (TFE). CMR utilizes captions generated by Multi-modal Large Language Models (MLLMs) to refine identity-specific features, capturing fine-grained details. TFE employs a cross-attention mechanism with fixed-length learnable tokens to efficiently aggregate spatiotemporal features, reducing computational overhead. We evaluate our approach on two standard datasets (MARS and iLIDS-VID) and two newly constructed high-difficulty datasets (SportsVReID and DanceVReID). Experimental results demonstrate that our method outperforms current state-of-the-art approaches, achieving significant improvements across all benchmarks.
[82] MSCT: Differential Cross-Modal Attention for Deepfake Detection
Fangda Wei,Miao Liu,Yingxue Wang,Jing Wang,Shenghui Zhao,Nan Li
Main category: cs.CV
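TL;DR: This paper proposes a multi-scale cross-modal transformer encoder (MSCT) for audio-visual deepfake detection. Multi-scale self-attention and differential cross-modal attention improve feature extraction and modality alignment, and the approach is validated on the FakeAVCeleb dataset.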
Details
Motivation: Traditional multimodal forgery detection methods suffer from insufficient feature extraction and modal alignment deviation.
Method: The paper proposes a multi-scale cross-modal transformer encoder (MSCT) with a multi-scale self-attention mechanism that integrates the features of adjacent embeddings and a differential cross-modal attention mechanism that fuses multimodal features.
Result: Competitive detection performance on the FakeAVCeleb dataset.
Conclusion: The MSCT structure effectively improves the accuracy and robustness of audio-visual deepfake detection.
Abstract: Audio-visual deepfake detection typically employs a complementary multi-modal model to check the forgery traces in the video. These methods primarily extract forgery traces through audio-visual alignment, which results from the inconsistency between audio and video modalities. However, the traditional multi-modal forgery detection method has the problem of insufficient feature extraction and modal alignment deviation. To address this, we propose a multi-scale cross-modal transformer encoder (MSCT) for deepfake detection. Our approach includes a multi-scale self-attention to integrate the features of adjacent embeddings and a differential cross-modal attention to fuse multi-modal features. Our experiments demonstrate competitive performance on the FakeAVCeleb dataset, validating the effectiveness of the proposed structure.
[83] Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding
Xiangyue Liu,Zijian Zhang,Miles Yang,Zhao Zhong,Liefeng Bo,Ping Tan
Main category: cs.CV
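TL;DR: This paper proposes Symbiotic-MoE, a framework that unifies the understanding and generation capabilities of pre-trained large multimodal models with zero parameter overhead. Modality-aware expert disentanglement and a progressive training strategy mitigate gradient conflicts and routing collapse, so that understanding and generation improve together.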
Details
Motivation: Existing approaches such as MoT mitigate gradient conflicts between image generation and understanding in large multimodal models, but they sever cross-modal synergy and fragment capacity; a unified training paradigm is needed that avoids catastrophic forgetting while preserving synergy between modalities.
Method: Symbiotic-MoE (1) builds on a native multimodal MoE architecture with zero additional parameters; (2) applies modality-aware expert disentanglement, partitioning experts into task-specific groups with shared experts acting as a multimodal semantic bridge; and (3) uses a progressive training strategy with differential learning rates and early-stage gradient shielding, which protects existing knowledge and turns generative gradients into gains for understanding.
Result: Symbiotic-MoE achieves rapid convergence on image generation while markedly improving the model's inherent understanding ability, with notable gains on MMLU and OCRBench, and the experiments confirm cross-modal synergy.
Conclusion: Symbiotic-MoE jointly optimizes understanding and generation in a symbiotic manner, overcoming the limitations of structural isolation and offering a new path for unified multimodal pre-training.
Abstract: Empowering Large Multimodal Models (LMMs) with image generation often leads to catastrophic forgetting in understanding tasks due to severe gradient conflicts. While existing paradigms like Mixture-of-Transformers (MoT) mitigate this conflict through structural isolation, they fundamentally sever cross-modal synergy and suffer from capacity fragmentation. In this work, we present Symbiotic-MoE, a unified pre-training framework that resolves task interference within a native multimodal Mixture-of-Experts (MoE) Transformers architecture with zero-parameter overhead. We first identify that standard MoE tuning leads to routing collapse, where generative gradients dominate expert utilization. To address this, we introduce Modality-Aware Expert Disentanglement, which partitions experts into task-specific groups while utilizing shared experts as a multimodal semantic bridge. Crucially, this design allows shared experts to absorb fine-grained visual semantics from generative tasks to enrich textual representations. To optimize this, we propose a Progressive Training Strategy featuring differential learning rates and early-stage gradient shielding. This mechanism not only shields pre-trained knowledge from early volatility but eventually transforms generative signals into constructive feedback for understanding.
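The expert partition described in this entry can be pictured as a routing mask. The following is a minimal sketch, not the authors' implementation: it assumes (hypothetically) that each token carries a task tag, that `groups` maps each task to its dedicated experts, and that the `shared` experts stay reachable from every token. All names, and the masking mechanism itself, are illustrative assumptions; the abstract does not say how the partition is enforced.

```python
import numpy as np

def allowed_expert_mask(task_ids, groups, shared, num_experts):
    """Boolean (tokens, experts) mask: True where a token may be routed.

    task_ids: one task tag per token (hypothetical tagging scheme).
    groups:   dict mapping a task tag to its task-specific expert indices.
    shared:   expert indices open to every task (the "semantic bridge").
    """
    mask = np.zeros((len(task_ids), num_experts), dtype=bool)
    for i, task in enumerate(task_ids):
        mask[i, list(groups[task])] = True   # task-specific group
        mask[i, list(shared)] = True         # shared experts
    return mask

def route_top_k(router_logits, mask, k):
    """Pick the k highest-scoring allowed experts per token.

    Assumes k does not exceed the number of allowed experts per token;
    otherwise masked-out experts would be selected.
    """
    masked = np.where(mask, router_logits, -np.inf)  # forbid other groups
    top = np.argsort(-masked, axis=1)[:, :k]         # descending by logit
    return np.sort(top, axis=1)                      # stable, readable order
```

Because a mask of this kind only restricts which existing experts a token can reach, it adds no parameters, which is consistent with the zero-parameter overhead stated in the abstract.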
Extensive experiments demonstrate that Symbiotic-MoE achieves rapid generative convergence while unlocking cross-modal synergy, boosting inherent understanding with remarkable gains on MMLU and OCRBench.
[84] DailyArt: Discovering Articulation from Single Static Images via Latent Dynamics
Hang Zhang,Qijian Tian,Jingyu Gong,Daoguo Dong,Xuhong Wang,Yuan Xie,Xin Tan
Main category: cs.CV
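TL;DR: DailyArt estimates joint parameters from a single static image by synthesizing the object's fully opened state to expose occluded motion cues, without requiring multi-state observations, multiple views, or explicit part annotations.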
Details
Motivation: Existing methods struggle to infer the kinematic structure of an articulated object from a single closed-state image, because key motion cues are often occluded and most methods depend on multi-state observations or additional priors.
Method: DailyArt formulates single-image joint estimation as a synthesis-mediated reasoning problem: it first synthesizes the object's maximally opened state under the same camera view to expose articulation cues, then jointly predicts all joint parameters from the discrepancy between the observed and synthesized images. A set-prediction formulation supports end-to-end inference without templates, multi-view inputs, or part annotations.
Result: Strong performance on articulated joint estimation, plus part-level novel state synthesis conditioned on the estimated joints.
Conclusion: DailyArt validates the synthesis-mediated reasoning paradigm for understanding articulated structure from a single image and extends the modeling of object dynamics in embodied AI and world models.
Abstract: Articulated objects are essential for embodied AI and world models, yet inferring their kinematics from a single closed-state image remains challenging because crucial motion cues are often occluded. Existing methods either require multi-state observations or rely on explicit part priors, retrieval, or other auxiliary inputs that partially expose the structure to be inferred. In this work, we present DailyArt, which formulates articulated joint estimation from a single static image as a synthesis-mediated reasoning problem. Instead of directly regressing joints from a heavily occluded observation, DailyArt first synthesizes a maximally articulated opened state under the same camera view to expose articulation cues, and then estimates the full set of joint parameters from the discrepancy between the observed and synthesized states. Using a set-prediction formulation, DailyArt recovers all joints simultaneously without requiring object-specific templates, multi-view inputs, or explicit part annotations at test time. Taking estimated joints as conditions, the framework further supports part-level novel state synthesis as a downstream capability. Extensive experiments show that DailyArt achieves strong performance in articulated joint estimation and supports part-level novel state synthesis conditioned on joints. Project page is available at https://rangooo123.github.io/DaliyArt.github.io/.
[85] WUTDet: A 100K-Scale Ship Detection Dataset and Benchmarks with Dense Small Objects
Junxiong Liang,Mengwei Bao,Tianxiang Wang,Xinggang Wang,An-An Liu,Ryan Wen Liu
Main category: cs.CV
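TL;DR: This paper builds WUTDet, a large-scale ship detection dataset with more than 100,000 images and more than 380,000 annotated instances covering diverse and complex maritime scenes, and uses it to systematically evaluate the performance and generalization of three mainstream families of detection models: CNN, Transformer, and Mamba.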
Details
Motivation: Existing public ship detection datasets are limited in scale, proportion of small objects, and scene diversity, which makes it hard to systematically evaluate detection algorithms and study their generalization in complex maritime environments.
Method: The authors construct WUTDet, a large-scale ship detection dataset spanning many scenes and imaging conditions; systematically evaluate 20 baseline models (CNN, Transformer, and Mamba) on it; and build a cross-dataset test set, Ship-GEN, to assess generalization.
Result: Transformers achieve the best detection accuracy (AP) and small-object detection performance (APs); CNNs have the highest inference efficiency and suit real-time applications; Mamba strikes a good balance between accuracy and efficiency; and models trained on WUTDet generalize better on Ship-GEN.
Conclusion: WUTDet provides effective data support for research, evaluation, and generalization analysis of ship detection algorithms in complex maritime scenes, and it is publicly available.
Abstract: Ship detection for navigation is a fundamental perception task in intelligent waterway transportation systems. However, existing public ship detection datasets remain limited in terms of scale, the proportion of small-object instances, and scene diversity, which hinders the systematic evaluation and generalization study of detection algorithms in complex maritime environments. To this end, we construct WUTDet, a large-scale ship detection dataset. WUTDet contains 100,576 images and 381,378 annotated ship instances, covering diverse operational scenarios such as ports, anchorages, navigation, and berthing, as well as various imaging conditions including fog, glare, low-lightness, and rain, thereby exhibiting substantial diversity and challenge. Based on WUTDet, we systematically evaluate 20 baseline models from three mainstream detection architectures, namely CNN, Transformer, and Mamba. Experimental results show that the Transformer architecture achieves superior overall detection accuracy (AP) and small-object detection performance (APs), demonstrating stronger adaptability to complex maritime scenes; the CNN architecture maintains an advantage in inference efficiency, making it more suitable for real-time applications; and the Mamba architecture achieves a favorable balance between detection accuracy and computational efficiency. Furthermore, we construct a unified cross-dataset test set, Ship-GEN, to evaluate model generalization. Results on Ship-GEN show that models trained on WUTDet exhibit stronger generalization under different data distributions.
These findings demonstrate that WUTDet provides effective data support for the research, evaluation, and generalization analysis of ship detection algorithms in complex maritime scenarios. The dataset is publicly available at: https://github.com/MAPGroup/WUTDet.
[86] Beyond Surface Artifacts: Capturing Shared Latent Forgery Knowledge Across Modalities
Jingtong Dou,Chuancheng Shi,Jian Wang,Fei Shen,Zhiyong Wang,Tat-Seng Chua
Main category: cs.CV
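TL;DR: This paper proposes the first modality-agnostic forgery (MAF) detection framework, which decouples modality-specific styles to extract forgery features shared across modalities, defines two generalization dimensions (Weak MAF and Strong MAF), and builds the DeepModal-Bench benchmark to test generalization to unseen "dark modalities."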
Details
Motivation: Existing deepfake detection methods rely too heavily on superficial, modality-specific artifacts and neglect the shared forgery knowledge hidden beneath different physical representations, so performance drops sharply on unseen "dark modalities."
Method: The paper proposes a modality-agnostic forgery (MAF) detection framework that explicitly decouples modality-specific styles to extract shared cross-modal forgery knowledge; defines two generalization dimensions, Weak MAF (transfer to semantically correlated modalities) and Strong MAF (robustness to completely isolated "dark modalities"); and builds the DeepModal-Bench multimodal forgery detection benchmark.
Result: The study empirically demonstrates the existence of universal forgery traces; the MAF framework achieves significant performance gains on unknown modalities; and DeepModal-Bench provides a unified evaluation platform for generalized learning methods.
Conclusion: The work shifts multimodal forensics from conventional "feature fusion" to "modality generalization," and the MAF framework offers a pioneering technical pathway toward universal multimodal defense.
Abstract: As generative artificial intelligence evolves, deepfake attacks have escalated from single-modality manipulations to complex, multimodal threats. Existing forensic techniques face a severe generalization bottleneck: by relying excessively on superficial, modality-specific artifacts, they neglect the shared latent forgery knowledge hidden beneath variable physical appearances. Consequently, these models suffer catastrophic performance degradation when confronted with unseen "dark modalities." To break this limitation, this paper introduces a paradigm shift that redefines multimodal forensics from conventional "feature fusion" to "modality generalization." We propose the first modality-agnostic forgery (MAF) detection framework. By explicitly decoupling modality-specific styles, MAF precisely extracts the essential, cross-modal latent forgery knowledge. Furthermore, we define two progressive dimensions to quantify model generalization: transferability toward semantically correlated modalities (Weak MAF), and robustness against completely isolated signals of "dark modality" (Strong MAF). To rigorously assess these generalization limits, we introduce the DeepModal-Bench benchmark, which integrates diverse multimodal forgery detection algorithms and adapts state-of-the-art generalized learning methods.
This study not only empirically proves the existence of universal forgery traces but also achieves significant performance breakthroughs on unknown modalities via the MAF framework, offering a pioneering technical pathway for universal multimodal defense.
[87] RemoteAgent: Bridging Vague Human Intents and Earth Observation with RL-based Agentic MLLMs
Liang Yao,Shengxiang Xu,Fan Liu,Chuanyi Zhang,Bishun Yao,Rui Min,Yongjun Li,Chaoqian Ouyang,Shimin Di,Min-Ling Zhang
Main category: cs.CV
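TL;DR: This paper proposes RemoteAgent, a framework that combines a multimodal large language model (MLLM) with external tools to bridge the gap between vague natural-language queries and high-precision visual analysis tasks in remote sensing. It introduces the VagueEO dataset for reinforcement fine-tuning, so that the MLLM handles coarse-grained tasks on its own and invokes tools for fine-grained predictions.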
Details
Motivation: Users of Earth Observation systems often express their needs in vague natural language, yet existing MLLMs cannot directly produce dense spatial predictions, and conventional agentic frameworks invoke tools inefficiently and underuse the MLLM's own capabilities.
Method: The paper proposes the RemoteAgent agentic framework and constructs VagueEO, a human-centric dataset of vague instructions. Reinforcement fine-tuning on VagueEO turns the MLLM into a cognitive core that handles image-level and sparse region-level tasks itself; specialized tools are invoked through the Model Context Protocol only when dense pixel-level prediction is required.
Result: RemoteAgent recognizes intent robustly and achieves competitive performance across a range of remote sensing tasks, while significantly improving computational efficiency and utilization of the MLLM's capabilities.
Conclusion: By respecting the intrinsic capability boundaries of MLLMs and dispatching tasks in a human-machine collaborative manner, RemoteAgent offers an efficient, scalable paradigm for remote sensing analysis driven by vague natural language.
Abstract: Earth Observation (EO) systems are essentially designed to support domain experts who often express their requirements through vague natural language rather than precise, machine-friendly instructions. Depending on the specific application scenario, these vague queries can demand vastly different levels of visual precision. Consequently, a practical EO AI system must bridge the gap between ambiguous human queries and the appropriate multi-granularity visual analysis tasks, ranging from holistic image interpretation to fine-grained pixel-wise predictions. While Multi-modal Large Language Models (MLLMs) demonstrate strong semantic understanding, their text-based output format is inherently ill-suited for dense, precision-critical spatial predictions. Existing agentic frameworks address this limitation by delegating tasks to external tools, but indiscriminate tool invocation is computationally inefficient and underutilizes the MLLM's native capabilities. To this end, we propose RemoteAgent, an agentic framework that strategically respects the intrinsic capability boundaries of MLLMs. To empower this framework to understand real user intents, we construct VagueEO, a human-centric instruction dataset pairing EO tasks with simulated vague natural-language queries. By leveraging VagueEO for reinforcement fine-tuning, we align an MLLM into a robust cognitive core that directly resolves image- and sparse region-level tasks.
Consequently, RemoteAgent processes suitable tasks internally while intelligently orchestrating specialized tools via the Model Context Protocol exclusively for dense predictions. Extensive experiments demonstrate that RemoteAgent achieves robust intent recognition capabilities while delivering highly competitive performance across diverse EO tasks.
[88] ESOM: Efficiently Understanding Streaming Video Anomalies with Open-world Dynamic Definitions
Zihao Liu,Xiaoyu Wu,Wenna Li,Jianqin Wu,Linlin Yang
Main category: cs.CV
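TL;DR: This paper proposes ESOM, a training-free, efficient streaming model for open-world video anomaly detection that addresses the shortcomings of existing MLLM-based methods in efficiency, streaming processing, and support for dynamic anomaly definitions, and introduces a new benchmark, OpenDef-Bench, for evaluation.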
Details
Motivation: Existing open-world video anomaly detection methods based on multimodal large language models (MLLMs) have three major problems: inefficient deployment, no support for streaming processing, and difficulty adapting to dynamically changing anomaly definitions.
Method: ESOM consists of a Definition Normalization module (to reduce hallucination), an Inter-frame-matched Intra-frame Token Merging module (to compress redundant visual tokens), a Hybrid Streaming Memory module (for efficient causal inference), and a Probabilistic Scoring module (to convert interval-level textual outputs into frame-level anomaly scores). The paper also builds the OpenDef-Bench benchmark.
Result: ESOM runs in real time on a single GPU and achieves state-of-the-art performance in anomaly temporal localization, classification, and description generation.
Conclusion: ESOM is an efficient, training-free open-world video anomaly detection framework that supports streaming processing and dynamic anomaly definitions, combining practicality with generalization.
Abstract: Open-world video anomaly detection (OWVAD) aims to detect and explain abnormal events under different anomaly definitions, which is important for applications such as intelligent surveillance and live-streaming content moderation. Recent MLLM-based methods have shown promising open-world generalization, but still suffer from three major limitations: inefficiency for practical deployment, lack of streaming processing adaptation, and limited support for dynamic anomaly definitions in both modeling and evaluation. To address these issues, this paper proposes ESOM, an efficient streaming OWVAD model that operates in a training-free manner. ESOM includes a Definition Normalization module to structure user prompts for reducing hallucination, an Inter-frame-matched Intra-frame Token Merging module to compress redundant visual tokens, a Hybrid Streaming Memory module for efficient causal inference, and a Probabilistic Scoring module that converts interval-level textual outputs into frame-level anomaly scores. In addition, this paper introduces OpenDef-Bench, a new benchmark with clean surveillance videos and diverse natural anomaly definitions for evaluating performance under varying conditions. Extensive experiments show that ESOM achieves real-time efficiency on a single GPU and state-of-the-art performance in anomaly temporal localization, classification, and description generation. The code and benchmark will be released at https://github.com/Kamino666/ESOM_OpenDef-Bench.
[89] Plug-and-Play Logit Fusion for Heterogeneous Pathology Foundation Models
Gexin Huang,Anqi Li,Yusheng Tan,Beidi Zhao,Gang Wang,Gaozu Hua,Xiaoxiao Li
Main category: cs.CV
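TL;DR: This paper proposes LogitProd, a lightweight model fusion strategy that requires no retraining. It applies sample-adaptive weighted fusion at the logit level to multiple independently trained pathology foundation models, significantly improving performance across 22 benchmark tasks (about 3% on average) at roughly one-twelfth the training cost of feature-fusion methods.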
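A weighted product of experts over class distributions is equivalent to a weighted sum of per-expert log-probabilities followed by renormalization. The sketch below illustrates that reading of the logit-level fusion in this entry; it is a toy under stated assumptions, not the LogitProd implementation. In particular, the weights here are passed in by hand, whereas the entry describes them as learned and sample-adaptive, and the function names are hypothetical.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def weighted_product_fusion(expert_logits, weights):
    """Fuse E experts' logits for one sample into class probabilities.

    expert_logits: array of shape (E, C), one row of logits per fixed expert.
    weights:       array of shape (E,), non-negative fusion weights
                   (assumed given here; learned per sample in the paper).
    The product prod_e p_e(c)^{w_e} is computed in log space as
    sum_e w_e * log p_e(c), then renormalized over classes.
    """
    log_p = log_softmax(expert_logits)   # (E, C) per-expert log-probabilities
    fused = weights @ log_p              # (C,) unnormalized fused log-scores
    return np.exp(log_softmax(fused))    # (C,) fused class probabilities
```

With a one-hot weight vector the fusion reduces to a single expert's prediction, which gives some intuition for why the best weighted product cannot do worse than the best individual expert on the training objective.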
Details
Motivation: The rapid growth of pathology foundation models makes model selection difficult: no single model is best on all downstream tasks, and adapting and validating many candidates one by one is prohibitively expensive.
Method: LogitProd treats multiple already-trained foundation models as fixed experts and learns sample-adaptive weights for a weighted product fusion over their slide-level logits only. It requires neither retraining the encoders nor aligning feature spaces across heterogeneous backbones, and it comes with a theoretical guarantee.
Result: Across 22 benchmarks covering WSI-level classification, tile-level classification, gene mutation prediction, and discrete-time survival modeling, LogitProd ranks first on 20 tasks, improves average performance by about 3% over the strongest single model, and has roughly one-twelfth the training cost of feature-fusion methods.
Conclusion: LogitProd is an efficient, plug-and-play multi-model fusion scheme that substantially improves heterogeneous pathology foundation model pipelines at very low cost.
Abstract: Pathology foundation models (FMs) have become central to computational histopathology, offering strong transfer performance across a wide range of diagnostic and prognostic tasks. The rapid proliferation of pathology foundation models creates a model-selection bottleneck: no single model is uniformly best, yet exhaustively adapting and validating many candidates for each downstream endpoint is prohibitively expensive. We address this challenge with a lightweight and novel model fusion strategy, LogitProd, which treats independently trained FM-based predictors as fixed experts and learns sample-adaptive fusion weights over their slide-level outputs. The fusion operates purely on logits, requiring no encoder retraining and no feature-space alignment across heterogeneous backbones. We further provide a theoretical analysis showing that the optimal weighted product fusion is guaranteed to perform at least as well as the best individual expert under the training objective. We systematically evaluate LogitProd on 22 benchmarks spanning WSI-level classification, tile-level classification, gene mutation prediction, and discrete-time survival modeling. LogitProd ranks first on 20/22 tasks and improves the average performance across all tasks by ~3% over the strongest single expert. LogitProd enables practitioners to upgrade heterogeneous FM-based pipelines in a plug-and-play manner, achieving multi-expert gains with ~12× lower training cost than feature-fusion alternatives.
[90] Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video
Chanhyuk Choi,Taesoo Kim,Donggyu Lee,Siyeol Jung,Taehwan Kim
Main category: cs.CV
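TL;DR: This paper proposes C-MET, a cross-modal emotion transfer method that models emotion semantic vectors between speech and visual feature spaces to generate more accurate and expressive talking face videos, including for unseen extended emotions such as sarcasm.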
Details
Motivation: Existing approaches to emotion editing are limited: label-based methods cannot cover a wide range of emotions; audio-based methods struggle to express the target emotion accurately because emotion and linguistic content are entangled in speech; and image-based methods depend on high-quality reference images, which are hard to obtain for extended emotions.
Method: Cross-Modal Emotion Transfer (C-MET) uses a large-scale pretrained audio encoder and a disentangled facial expression encoder to learn cross-modal emotion semantic vectors (representing the difference between embeddings of different emotions), and generates the corresponding facial expressions from speech.
Result: On the MEAD and CREMA-D datasets, emotion accuracy improves by 14% over the state of the art, and the method generates realistic talking face videos that include unseen extended emotions.
Conclusion: C-MET effectively disentangles emotional information in speech and enables flexible, accurate, and generalizable cross-modal emotion-driven face generation, offering a new paradigm for emotion-controllable talking face generation.
Abstract: Talking face generation has gained significant attention as a core application of generative models. To enhance the expressiveness and realism of synthesized videos, emotion editing in talking face video plays a crucial role. However, existing approaches often limit expressive flexibility and struggle to generate extended emotions. Label-based methods represent emotions with discrete categories, which fail to capture a wide range of emotions. Audio-based methods can leverage emotionally rich speech signals - and even benefit from expressive text-to-speech (TTS) synthesis - but they fail to express the target emotions because emotions and linguistic contents are entangled in emotional speech. Image-based methods, on the other hand, rely on target reference images to guide emotion transfer, yet they require high-quality frontal views and face challenges in acquiring reference data for extended emotions (e.g., sarcasm). To address these limitations, we propose Cross-Modal Emotion Transfer (C-MET), a novel approach that generates facial expressions based on speech by modeling emotion semantic vectors between speech and visual feature spaces. C-MET leverages a large-scale pretrained audio encoder and a disentangled facial expression encoder to learn emotion semantic vectors that represent the difference between two different emotional embeddings across modalities. Extensive experiments on the MEAD and CREMA-D datasets demonstrate that our method improves emotion accuracy by 14% over state-of-the-art methods, while generating expressive talking face videos - even for unseen extended emotions.
Code, checkpoint, and demo are available at https://chanhyeok-choi.github.io/C-MET/
[91] Image-Guided Geometric Stylization of 3D Meshes
Changwoon Choi,Hyunsoo Lee,Clément Jambon,Yael Vinker,Young Min Kim
Main category: cs.CV
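TL;DR: This paper proposes GeoStyle, a geometric stylization framework based on pre-trained diffusion models that can substantially deform a 3D mesh to match the style of a reference image while preserving topology and part-level semantics.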
Details
Motivation: Existing generative models typically accept only implicit control signals such as contextual descriptions and rarely support bold geometric deformations beyond the data distribution; style is inherently ambiguous, so a strong prior is needed to extract an abstract style representation.
Method: The method uses pre-trained diffusion models to extract a style representation from the image, designs a coarse-to-fine mesh deformation pipeline, and proposes an approximate VAE encoder that provides efficient and reliable gradients from mesh renderings.
Result: The method produces stylized 3D meshes that reflect the distinctive geometric features of the reference image (such as exaggerated poses and silhouettes), achieving diverse geometric variation while preserving the original mesh topology and part-level semantics.
Conclusion: GeoStyle offers a new paradigm for artistic 3D content creation, enabling image-driven 3D geometric style transfer that is controllable, large in magnitude, and semantics-preserving.
Abstract: Recent generative models can create visually plausible 3D representations of objects. However, the generation process often allows for implicit control signals, such as contextual descriptions, and rarely supports bold geometric distortions beyond existing data distributions. We propose a geometric stylization framework that deforms a 3D mesh, allowing it to express the style of an image. While style is inherently ambiguous, we utilize pre-trained diffusion models to extract an abstract representation of the provided image. Our coarse-to-fine stylization pipeline can drastically deform the input 3D model to express a diverse range of geometric variations while retaining the valid topology of the original mesh and part-level semantics. We also propose an approximate VAE encoder that provides efficient and reliable gradients from mesh renderings. Extensive experiments demonstrate that our method can create stylized 3D meshes that reflect unique geometric features of the pictured assets, such as expressive poses and silhouettes, thereby supporting the creation of distinctive artistic 3D creations. Project page: https://changwoonchoi.github.io/GeoStyle
[92] Latent Anomaly Knowledge Excavation: Unveiling Sparse Sensitive Neurons in Vision-Language Models
Shaotian Li,Shangze Li,Chuancheng Shi,Wenhua Wu,Yanqiu Wu,Xiaohan Yu,Fei Shen,Tat-Seng Chua
Main category: cs.CV
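TL;DR: This paper proposes LAKE, a training-free framework that excavates latent anomaly knowledge from pre-trained vision-language models. By identifying sparse anomaly-sensitive neurons, it builds a compact normality representation, achieves state-of-the-art performance on industrial anomaly detection benchmarks, and provides neuron-level interpretability.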
Details
Motivation: Existing methods treat large vision-language models as black-box feature extractors and acquire anomaly knowledge through external adapters or memory banks. This paper challenges that assumption, arguing that anomaly knowledge is already embedded in pre-trained models but remains latent and under-activated.
Method: LAKE is a training-free framework that uses a small number of normal samples to identify and elicit sparse anomaly-sensitive neurons in the model, and integrates visual structural deviations with cross-modal semantic activations to build a compact normality representation.
Result: State-of-the-art performance on multiple industrial anomaly detection benchmarks, together with intrinsic, neuron-level interpretability.
Conclusion: Anomaly detection should be redefined as the targeted activation of latent knowledge in pre-trained models rather than the additional acquisition of downstream-task knowledge.
Abstract: Large-scale vision-language models (VLMs) exhibit remarkable zero-shot capabilities, yet the internal mechanisms driving their anomaly detection (AD) performance remain poorly understood. Current methods predominantly treat VLMs as black-box feature extractors, assuming that anomaly-specific knowledge must be acquired through external adapters or memory banks. In this paper, we challenge this assumption by arguing that anomaly knowledge is intrinsically embedded within pre-trained models but remains latent and under-activated. We hypothesize that this knowledge is concentrated within a sparse subset of anomaly-sensitive neurons. To validate this, we propose latent anomaly knowledge excavation (LAKE), a training-free framework that identifies and elicits these critical neuronal signals using only a minimal set of normal samples. By isolating these sensitive neurons, LAKE constructs a highly compact normality representation that integrates visual structural deviations with cross-modal semantic activations. Extensive experiments on industrial AD benchmarks demonstrate that LAKE achieves state-of-the-art performance while providing intrinsic, neuron-level interpretability. Ultimately, our work advocates for a paradigm shift: redefining anomaly detection as the targeted activation of latent pre-trained knowledge rather than the acquisition of a downstream task.
[93] HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models
Qihui Zhu,Tao Zhang,Yuchen Wang,Zijian Wen,Mengjie Zhang,Shuangwu Chen,Xiaobin Tan,Jian Yang,Yang Liu,Zhenhua Dong,Xianzhi Yu,Yinfei Pan
Main category: cs.CV
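TL;DR: This paper proposes HAWK, a training-free, head importance-aware visual token pruning technique for multimodal large language models (MLLMs) that substantially reduces inference cost while maintaining high accuracy.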
Details
Motivation: Existing visual token pruning methods assume that all attention heads contribute equally to visual understanding, whereas in practice different heads capture different visual semantics and play different roles. Meanwhile, the surge of visual tokens in MLLMs makes inference slow and computationally expensive, which is a poor fit for real-time or resource-constrained settings.
Method: HAWK jointly uses head importance weights and text-guided attention to score visual tokens, retaining task-relevant tokens and pruning redundant ones; it needs no additional training and can be applied plug-and-play to various MLLMs.
Result: State-of-the-art accuracy on multiple mainstream vision-language benchmarks. On Qwen2.5-VL, HAWK retains 96.0% of the original accuracy after pruning 80.2% of the visual tokens, reduces end-to-end latency to 74.4% of the original, and significantly lowers GPU memory usage.
Conclusion: HAWK is an efficient, general, training-free visual token pruning method that balances accuracy and efficiency, making it practical and extensible for real deployment.
Abstract: In multimodal large language models (MLLMs), the surge of visual tokens significantly increases the inference time and computational overhead, making them impractical for real-time or resource-constrained applications. Visual token pruning is a promising strategy for reducing the cost of MLLM inference by removing redundant visual tokens. Existing research usually assumes that all attention heads contribute equally to the visual interpretation. However, our study reveals that different heads may capture distinct visual semantics and inherently play distinct roles in visual processing. In light of this observation, we propose HAWK, a head importance-aware visual token pruning method that perceives the varying importance of attention heads in visual tasks to maximize the retention of crucial tokens. By leveraging head importance weights and text-guided attention to assess visual token significance, HAWK effectively retains task-relevant visual tokens while removing redundant ones. The proposed HAWK is entirely training-free and can be seamlessly applied to various MLLMs. Extensive experiments on multiple mainstream vision-language benchmarks demonstrate that HAWK achieves state-of-the-art accuracy. When applied to Qwen2.5-VL, HAWK retains 96.0% of the original accuracy after pruning 80.2% of the visual tokens. Additionally, it reduces end-to-end latency to 74.4% of the original and further decreases GPU memory usage across the tested models.
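As a rough illustration of how head importance weights and text-guided attention might be combined into a per-token score, the sketch below weights each head's text-to-visual attention by an importance value and keeps the highest-scoring visual tokens. This is a hedged sketch under stated assumptions, not HAWK's actual algorithm: the function name, the averaging over text tokens, and the way `head_importance` is obtained are all hypothetical.

```python
import numpy as np

def prune_visual_tokens(attn, head_importance, keep_ratio):
    """Return sorted indices of the visual tokens to keep.

    attn:            (H, T, V) attention from T text tokens to V visual
                     tokens for each of H heads (assumed to be available).
    head_importance: (H,) non-negative importance per head (hypothetical;
                     how it is estimated is not specified here).
    keep_ratio:      fraction of visual tokens to retain, in (0, 1].
    """
    w = head_importance / head_importance.sum()   # normalize head weights
    per_head = attn.mean(axis=1)                  # (H, V): average over text tokens
    scores = w @ per_head                         # (V,): importance-weighted score
    k = max(1, int(round(keep_ratio * scores.shape[0])))
    keep = np.argsort(scores)[-k:]                # k highest-scoring tokens
    return np.sort(keep)                          # preserve original token order
```

Pruning 80.2% of the visual tokens, as in the Qwen2.5-VL result above, would correspond to a `keep_ratio` of about 0.198 in this sketch.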
The code is available at https://github.com/peppery77/HAWK.git.
[94] AgriChain Visually Grounded Expert Verified Reasoning for Interpretable Agricultural Vision Language Models
Hazza Mahmood,Yongqiang Yu,Rao Anwer
Main category: cs.CV
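TL;DR: This paper introduces AgriChain, a dataset of about 11,000 expert-annotated images of diseased plant leaves, each with a disease label, a confidence score, and an expert-verified chain-of-thought rationale. Fine-tuning Qwen2.5-VL-3B on it yields AgriChain-VL3B, which reaches 73.1% top-1 accuracy on disease diagnosis, outperforming several strong baselines and markedly improving both accuracy and interpretability.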
Details
Motivation: Existing vision-language models struggle to deliver both accuracy and interpretability for plant disease diagnosis in real-world agriculture.
Method: The authors build the expert-verified AgriChain dataset (with disease labels, confidence scores, and chain-of-thought rationales drafted by GPT-4o and then verified by an engineer) and fine-tune Qwen2.5-VL-3B on it to jointly predict diseases and generate visually grounded reasoning.
Result: On a 1,000-image test set, AgriChain-VL3B achieves 73.1% top-1 accuracy (macro F1 = 0.466; weighted F1 = 0.655), outperforming Gemini 1.5 Flash, Gemini 2.5 Pro, and GPT-4o Mini; the generated explanations align closely with expert reasoning and consistently reference key visual cues.
Conclusion: Expert-verified reasoning supervision significantly improves the accuracy and interpretability of VLMs for agricultural disease diagnosis, bridging the gap between generic multimodal models and human experts and advancing trustworthy, globally deployable agricultural AI.
Abstract: Accurate and interpretable plant disease diagnosis remains a major challenge for vision-language models (VLMs) in real-world agriculture. We introduce AgriChain, a dataset of approximately 11,000 expert-curated leaf images spanning diverse crops and pathologies, each paired with (i) a disease label, (ii) a calibrated confidence score (High/Medium/Low), and (iii) an expert-verified chain-of-thought (CoT) rationale. Draft explanations were first generated by GPT-4o and then verified by a professional agricultural engineer using standardized descriptors (e.g., lesion color, margin, and distribution). We fine-tune Qwen2.5-VL-3B on AgriChain, resulting in a specialized model termed AgriChain-VL3B, to jointly predict diseases and generate visually grounded reasoning. On a 1,000-image test set, our CoT-supervised model achieves 73.1% top-1 accuracy (macro F1 = 0.466; weighted F1 = 0.655), outperforming strong baselines including Gemini 1.5 Flash, Gemini 2.5 Pro, and GPT-4o Mini. The generated explanations align closely with expert reasoning, consistently referencing key visual cues. These findings demonstrate that expert-verified reasoning supervision significantly enhances both accuracy and interpretability, bridging the gap between generic multimodal models and human expertise, and advancing trustworthy, globally deployable AI for sustainable agriculture. The dataset and code are publicly available at: https://github.com/hazzanabeel12-netizen/agrichain
[95] LPM 1.0: Video-based Character Performance Model
Ailing Zeng,Casper Yang,Chauncey Ge,Eddie Zhang,Garvey Xu,Gavin Lin,Gilbert Gu,Jeremy Pi,Leo Li,Mingyi Shi,Sheng Bi,Steven Tang,Thorn Hang,Tobey Guo,Vincent Li,Xin Tong,Yikang Li,Yuchen Sun,Yue,Zhao,Yuhan Lu,Yuwei Li,Zane Zhang,Zeshi Yang,Zi Ye
Main category: cs.CV
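TL;DR: This paper presents LPM 1.0 (Large Performance Model), a generative model for single-person, full-duplex audio-visual conversation. It builds a high-quality multimodal dataset of human behavior, trains a 17B-parameter Diffusion Transformer (Base LPM), and distills it into a causal streaming generator (Online LPM), balancing high expressiveness, real-time inference, and long-horizon identity stability to address the "performance trilemma"; it achieves state-of-the-art results on the newly proposed LPM-Bench.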
Details
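Motivation: Existing video generation models struggle to achieve high expressiveness, real-time inference, and long-horizon identity stability at the same time (the "performance trilemma"). Conversation is the most challenging and comprehensive scenario for modeling character performance, and it calls for an end-to-end, controllable, stable, real-time method for audio-visual character generation.
Method: (1) Build a human-centric multimodal dataset through strict filtering, speaking-listening pairing, and identity-aware multi-reference extraction; (2) train a 17B-parameter multimodally conditioned Diffusion Transformer (Base LPM) for high controllability and identity consistency; (3) distill it into a low-latency, infinite-length causal streaming generator (Online LPM); (4) propose LPM-Bench, the first benchmark for interactive character performance.
Result: Given a character image and identity references, LPM 1.0 generates in real time listening videos driven by user speech and speaking videos controlled by text prompts, supporting infinite-length, identity-stable interaction; it achieves state-of-the-art results on all dimensions of LPM-Bench while retaining real-time inference.
Conclusion: LPM 1.0 is the first to systematically address the trilemma in audio-visual character performance generation, provides a deployable visual engine for conversational agents, live-streaming virtual characters, and game NPCs, and establishes the field's first standardized benchmark.
Abstract: Performance, the externalization of intent, emotion, and personality through visual, vocal, and temporal behavior, is what makes a character alive. Learning such performance from video is a promising alternative to traditional 3D pipelines. However, existing video models struggle to jointly achieve high expressiveness, real-time inference, and long-horizon identity stability, a tension we call the performance trilemma. Conversation is the most comprehensive performance scenario, as characters simultaneously speak, listen, react, and emote while maintaining identity over time. To address this, we present LPM 1.0 (Large Performance Model), focusing on single-person full-duplex audio-visual conversational performance. Concretely, we build a multimodal human-centric dataset through strict filtering, speaking-listening audio-video pairing, performance understanding, and identity-aware multi-reference extraction; train a 17B-parameter Diffusion Transformer (Base LPM) for highly controllable, identity-consistent performance through multimodal conditioning; and distill it into a causal streaming generator (Online LPM) for low-latency, infinite-length interaction. At inference, given a character image with identity-aware references, LPM 1.0 generates listening videos from user audio and speaking videos from synthesized audio, with text prompts for motion control, all at real-time speed with identity-stable, infinite-length generation.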
LPM 1.0 thus serves as a visual engine for conversational agents, live streaming characters, and game NPCs. To systematically evaluate this setting, we propose LPM-Bench, the first benchmark for interactive character performance. LPM 1.0 achieves state-of-the-art results across all evaluated dimensions while maintaining real-time inference.
[96] FlowGuard: Towards Lightweight In-Generation Safety Detection for Diffusion Models via Linear Latent Decoding
Jinghan Yang,Yihe Fan,Xudong Pan,Min Yang
Main category: cs.CV
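TL;DR: This paper proposes FlowGuard, a cross-model framework that detects NSFW content in real time during diffusion model generation. Using a linear approximation of latent decoding and curriculum learning, it identifies unsafe content early, efficiently, and accurately at intermediate denoising steps.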
Details
Motivation: Existing NSFW detection methods (pre-generation methods based on text prompts and post-generation methods based on final images) either suffer from a mismatch between prompt safety and image safety or cannot handle intermediate noisy images, making it hard to achieve both accuracy and efficiency.
Method: FlowGuard (1) uses a linear approximation of latent decoding to mitigate interference from early-stage noise; (2) adopts a curriculum learning strategy to stabilize training; and (3) performs cross-model NSFW detection at intermediate denoising steps of the diffusion process.
Result: On a cross-model benchmark covering nine diffusion backbones, FlowGuard improves F1 score by more than 30% over existing methods in both in-distribution and out-of-distribution settings, cuts peak GPU memory by more than 97%, and reduces projection time from 8.1 seconds to 0.2 seconds.
Conclusion: FlowGuard is the first to achieve accurate, low-overhead, real-time NSFW detection during diffusion model generation, significantly improving the safety and computational efficiency of generative AI.
Abstract: Diffusion-based image generation models have advanced rapidly but pose a safety risk due to their potential to generate Not-Safe-For-Work (NSFW) content. Existing NSFW detection methods mainly operate either before or after image generation. Pre-generation methods rely on text prompts and struggle with the gap between prompt safety and image safety. Post-generation methods apply classifiers to final outputs, but they are poorly suited to intermediate noisy images. To address this, we introduce FlowGuard, a cross-model in-generation detection framework that inspects intermediate denoising steps. This is particularly challenging in latent diffusion, where early-stage noise obscures visual signals. FlowGuard employs a novel linear approximation for latent decoding and leverages a curriculum learning approach to stabilize training. By detecting unsafe content early, FlowGuard reduces unnecessary diffusion steps to cut computational costs. Our cross-model benchmark spanning nine diffusion-based backbones shows the effectiveness of FlowGuard for in-generation NSFW detection in both in-distribution and out-of-distribution settings, outperforming existing methods by over 30% in F1 score while delivering transformative efficiency gains, including slashing peak GPU memory demand by over 97% and projection time from 8.1 seconds to 0.2 seconds compared to standard VAE decoding.
[97] ReconPhys: Reconstruct Appearance and Physical Attributes from Single Video
Boyuan Wang,Xiaofeng Wang,Yongkang Li,Zheng Zhu,Yifan Chang,Angen Ye,Guosheng Zhao,Chaojun Ni,Guan Huang,Yijie Ren,Yueqi Duan,Xingang Wang
Main category: cs.CV
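TL;DR: ReconPhys is an annotation-free, end-to-end feedforward framework that uses dual-branch self-supervised learning to jointly estimate physical attributes and reconstruct 3D Gaussian Splatting from monocular video. It substantially improves reconstruction accuracy and prediction performance, with inference taking on the order of a second.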
Details
Motivation: Existing non-rigid object reconstruction methods rely on differentiable rendering for per-scene optimization, which requires expensive tuning or manual annotation and limits practicality and generalizability.
Method: ReconPhys uses a dual-branch architecture trained with a self-supervised strategy to jointly learn physical attribute estimation and 3D Gaussian Splatting reconstruction. It takes a monocular video as input and outputs geometry, appearance, and physical attributes.
Result: On a synthetic dataset, future-frame prediction reaches 21.64 PSNR (versus 13.27 for state-of-the-art optimization baselines), Chamfer Distance drops from 0.349 to 0.004, and inference takes less than 1 second (versus hours).
Conclusion: ReconPhys is the first fast, general, feedforward approach to physics-aware non-rigid reconstruction, supporting efficient generation of simulation-ready assets for robotics and graphics.
Abstract: Reconstructing non-rigid objects with physical plausibility remains a significant challenge. Existing approaches leverage differentiable rendering for per-scene optimization, recovering geometry and dynamics but requiring expensive tuning or manual annotation, which limits practicality and generalizability. To address this, we propose ReconPhys, the first feedforward framework that jointly learns physical attribute estimation and 3D Gaussian Splatting reconstruction from a single monocular video. Our method employs a dual-branch architecture trained via a self-supervised strategy, eliminating the need for ground-truth physics labels. Given a video sequence, ReconPhys simultaneously infers geometry, appearance, and physical attributes. Experiments on a large-scale synthetic dataset demonstrate superior performance: our method achieves 21.64 PSNR in future prediction compared to 13.27 by state-of-the-art optimization baselines, while reducing Chamfer Distance from 0.349 to 0.004. Crucially, ReconPhys enables fast inference (<1 second) versus hours required by existing methods, facilitating rapid generation of simulation-ready assets for robotics and graphics.
[98] Reinforcement-Guided Synthetic Data Generation for Privacy-Sensitive Identity Recognition
Xuemei Jia,Jiawei Du,Hui Wei,Jun Chen,Joey Tianyi Zhou,Zheng Wang
Main category: cs.CV
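TL;DR: This paper proposes a reinforcement-guided synthetic data generation framework that improves generative models in privacy-sensitive, data-scarce settings. Using cold-start adaptation, multi-objective reward optimization, and dynamic sample selection, it significantly improves generation fidelity and downstream task accuracy.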
Details
Motivation: High-fidelity generative models are urgently needed in privacy-sensitive scenarios, but regulatory and copyright constraints make real data hard to obtain. This hampers model development and creates a vicious cycle: little data leads to poor models, which in turn fail to relieve the data shortage.
Method: The reinforcement-guided framework has three parts: (1) a cold-start stage adapts a pretrained generator to the target domain; (2) a multi-objective reward jointly optimizes semantic consistency, coverage diversity, and expression richness; (3) a dynamic sample selection mechanism prioritizes high-utility synthetic samples during downstream training.
Result: Experiments on multiple benchmark datasets show that the framework significantly improves generation fidelity and classification accuracy and generalizes strongly to novel categories in small-data regimes.
Conclusion: The framework breaks the negative cycle between data scarcity and low generation quality and offers a scalable, task-driven approach to generative modeling in privacy-sensitive domains.
Abstract: High-fidelity generative models are increasingly needed in privacy-sensitive scenarios, where access to data is severely restricted due to regulatory and copyright constraints. This scarcity hampers model development--ironically, in settings where generative models are most needed to compensate for the lack of data. This creates a self-reinforcing challenge: limited data leads to poor generative models, which in turn fail to mitigate data scarcity. To break this cycle, we propose a reinforcement-guided synthetic data generation framework that adapts general-domain generative priors to privacy-sensitive identity recognition tasks. We first perform a cold-start adaptation to align a pretrained generator with the target domain, establishing semantic relevance and initial fidelity. Building on this foundation, we introduce a multi-objective reward that jointly optimizes semantic consistency, coverage diversity, and expression richness, guiding the generator to produce both realistic and task-effective samples. During downstream training, a dynamic sample selection mechanism further prioritizes high-utility synthetic samples, enabling adaptive data scaling and improved domain alignment. Extensive experiments on benchmark datasets demonstrate that our framework significantly improves both generation fidelity and classification accuracy, while also exhibiting strong generalization to novel categories in small-data regimes.
[99] Sampling-Aware 3D Spatial Analysis in Multiplexed Imaging
Ido Harlev,Tamar Oukhanov,Raz Ben-Uri,Leeat Keren,Shai Bagon
Main category: cs.CV
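TL;DR: This paper studies how sampling geometry affects the stability of spatial statistics and proposes a geometry-aware reconstruction module that supports stable, consistent 3D spatial analysis from sparse serial sections.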
Details
Motivation: Highly multiplexed microscopy can characterize tissue organization at single-cell resolution, yet most analyses remain limited to 2D sections. Dense 3D spatial proteomics data is costly and technically difficult to acquire, so practitioners must often trade off between 2D and sparse 3D sampling.
Method: Controlled simulations analyze how different sampling geometries affect the stability of spatial statistics such as cell clustering and cell-cell interactions. A reconstruction module links cell projections across sections using phenotype and proximity constraints and recovers single-cell 3D coordinates using cell-type-specific shape priors. The trade-off between section spacing, coverage, and redundancy is also quantified.
Result: Planar sampling reliably estimates global cell-type abundance, but local statistics such as cell-cell interactions and neighborhood relationships show high variance, especially for rare or spatially localized populations. The reconstruction method is validated on a public IMC dataset and supports reliable structure-level 3D analysis on an in-house CODEX dataset.
Conclusion: The paper provides diagnostic tools and practical guidance for deciding when 2D sampling suffices and when sparse 3D reconstruction is needed, maximizing the reliability of spatial analysis and biological insight under a limited imaging budget.
Abstract: Highly multiplexed microscopy enables rich spatial characterization of tissues at single-cell resolution, yet most analyses rely on two-dimensional sections despite inherently three-dimensional tissue organization. Acquiring dense volumetric data in spatial proteomics remains costly and technically challenging, leaving practitioners to choose between 2D sections or 3D serial sections under limited imaging budgets. In this work, we study how sampling geometry impacts the stability of commonly used spatial statistics, and we introduce a geometry-aware reconstruction module that enables sparse yet consistent 3D analysis from serial sections. Using controlled simulations, we show that planar sampling reliably recovers global cell-type abundance but exhibits high variance for local statistics such as cell clustering and cell-cell interactions, particularly for rare or spatially localized populations. We observe consistent behavior in real multiplexed datasets, where interaction metrics and neighborhood relationships fluctuate substantially across individual sections. To support sparse 3D analysis in practice, we present a reconstruction approach that links cell projections across adjacent sections using phenotype and proximity constraints and recovers single-cell 3D centroids using cell-type-specific shape priors. We further analyze the trade-off between section spacing, coverage, and redundancy, identifying acquisition regimes that maximize reconstruction utility under fixed imaging budgets.
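The contrast between stable global abundance and unstable local statistics under planar sampling can be pictured with a toy simulation. The sketch below is not the paper's simulation setup: the cell counts, the spherical niche for the rare population, the section thickness, the interaction radius, and all function names are assumptions chosen only for illustration. It cuts thin sections through a synthetic volume and compares the section-to-section coefficient of variation of an abundance estimate with that of a simple neighbor-based interaction score.

```python
import numpy as np

def simulate_tissue(n_cells=20000, seed=0):
    """Synthetic unit-cube tissue (hypothetical parameters)."""
    rng = np.random.default_rng(seed)
    xyz = rng.random((n_cells, 3))
    # Type 1: abundant and spread uniformly. Type 0: everything else.
    cell_type = np.where(rng.random(n_cells) < 0.3, 1, 0)
    # Type 2: rare and confined to a small ball around the cube center.
    in_ball = np.linalg.norm(xyz - 0.5, axis=1) < 0.15
    cell_type[in_ball & (rng.random(n_cells) < 0.8)] = 2
    return xyz, cell_type

def section_stats(xyz, cell_type, z_center, thickness=0.02, radius=0.05):
    """Abundance of type 1 and a type1-type2 interaction score in one thin section."""
    in_section = np.abs(xyz[:, 2] - z_center) < thickness / 2
    xy, types = xyz[in_section, :2], cell_type[in_section]
    abundance = float(np.mean(types == 1))
    a, b = xy[types == 1], xy[types == 2]
    if len(a) == 0 or len(b) == 0:
        return abundance, 0.0
    # Fraction of type-1 cells with at least one type-2 cell within `radius` (2D distance).
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    interaction = float(np.mean(dists.min(axis=1) < radius))
    return abundance, interaction

def coefficient_of_variation(values):
    values = np.asarray(values, dtype=float)
    return float(values.std() / values.mean())

xyz, cell_type = simulate_tissue()
stats = [section_stats(xyz, cell_type, z) for z in np.linspace(0.05, 0.95, 10)]
cv_abundance = coefficient_of_variation([s[0] for s in stats])
cv_interaction = coefficient_of_variation([s[1] for s in stats])
print(f"CV of abundance across sections:   {cv_abundance:.2f}")
print(f"CV of interaction across sections: {cv_interaction:.2f}")
```

In this toy setting most sections miss the localized population entirely, so the interaction score swings between zero and a positive value while the abundance estimate stays close to its true level.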
We validate the reconstruction module on a public imaging mass cytometry dataset with dense axial sampling and demonstrate its downstream utility on an in-house CODEX dataset by enabling structure-level 3D analyses that are unreliable in 2D. Together, our results provide diagnostic tools and practical guidance for deciding when 2D sampling suffices and when sparse 3D reconstruction is warranted.
[100] AnomalyAgent: Agentic Industrial Anomaly Synthesis via Tool-Augmented Reinforcement Learning
Jiaming Su,Tengchao Yang,Ruikang Zhang,Zhengan Yan,Haoyu Sun,Linfeng Zhang
Main category: cs.CV
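TL;DR: This paper proposes AnomalyAgent, an anomaly synthesis agent with self-reflection, knowledge retrieval, and iterative refinement. Through multi-tool collaboration and a two-stage training framework, it markedly improves the realism and diversity of industrial anomaly generation on MVTec-AD and surpasses zero-shot SOTA methods on multiple metrics.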
Details
Motivation: Existing industrial anomaly synthesis methods rely on single-step generation and lack complex reasoning and iterative optimization, so the anomalies they generate fall short in semantic realism.
Method: AnomalyAgent integrates five tools (prompt generation, image generation, quality evaluation, knowledge retrieval, and mask generation) into a closed optimization loop. It is trained in two stages (supervised fine-tuning followed by reinforcement learning) on structured trajectories built from real anomaly images, with a three-part reward mechanism: task, reflection, and behavioral rewards.
Result: On MVTec-AD, AnomalyAgent reaches IS/IC-L of 2.10/0.33, 57.0% classification accuracy with ResNet34, and 99.3%/74.2% image/pixel-level AP with a UNet, surpassing zero-shot SOTA methods across the board.
Conclusion: By introducing an agentic paradigm and closed-loop optimization, AnomalyAgent improves the quality and diversity of industrial anomaly synthesis and offers a new approach to anomaly detection under data scarcity.
Abstract: Industrial anomaly generation is a crucial method for alleviating the data scarcity problem in anomaly detection tasks. Most existing anomaly synthesis methods rely on single-step generation mechanisms, lacking complex reasoning and iterative optimization capabilities, making it difficult to generate anomaly samples with high semantic realism. We propose AnomalyAgent, an anomaly synthesis agent with self-reflection, knowledge retrieval, and iterative refinement capabilities, aiming to generate realistic and diverse anomalies. Specifically, AnomalyAgent is equipped with five tools: Prompt Generation (PG), Image Generation (IG), Quality Evaluation (QE), Knowledge Retrieval (KR), and Mask Generation (MG), enabling closed-loop optimization. To improve decision-making and self-reflection, we construct structured trajectories from real anomaly images and design a two-stage training framework: supervised fine-tuning followed by reinforcement learning. This process is driven by a three-part reward mechanism: (1) task rewards to supervise the quality and location rationality of generated anomalies; (2) reflection rewards to train the model's ability to improve the anomaly synthesis prompt; (3) behavioral rewards to ensure adherence to the trajectory. On the MVTec-AD dataset, AnomalyAgent achieves IS/IC-L of 2.10/0.33 for anomaly generation, 57.0% classification accuracy using ResNet34, and 99.3%/74.2% AP at the image/pixel level using a simple UNet, surpassing all zero-shot SOTA methods.
The code and data will be made publicly available.
[101] PanoSAM2: Lightweight Distortion- and Memory-aware Adaptions of SAM2 for 360 Video Object Segmentation
Dingwen Xiao,Weiming Zhang,Shiqi Wen,Lin Wang
Main category: cs.CV
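TL;DR: This paper proposes PanoSAM2, a lightweight SAM2-based framework for 360 video object segmentation. A Pano-Aware Decoder, a Distortion-Guided Mask Loss, and a Long-Short Memory Module address spherical projection distortion, semantic inconsistency, and sparse memory, yielding gains of 5.6 points on 360VOTS and 6.7 points on PanoVOS.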
Details
Motivation: 360 video object segmentation (360VOS) faces scarce high-quality labeled data, spherical projection distortion, semantic inconsistency between the left and right borders, and sparse object mask information in SAM2's memory, so applying SAM2 directly performs poorly.
Method: PanoSAM2 has three components: (1) a Pano-Aware Decoder with seam-consistent receptive fields and iterative distortion refinement that maintains continuity across the 0/360 degree boundary; (2) a Distortion-Guided Mask Loss that weights the per-pixel loss by distortion magnitude; (3) a Long-Short Memory Module that uses a compact long-term object pointer to re-instantiate and align short-term memories, improving temporal coherence.
Result: On 360VOTS and PanoVOS, PanoSAM2 improves over SAM2 by 5.6 and 6.7 points respectively, supporting the effectiveness of the method.
Conclusion: With lightweight, distortion-aware, and memory-enhanced adaptation strategies, PanoSAM2 transfers SAM2 to the 360VOS task and markedly improves performance while keeping SAM2's user-friendly prompting.
Abstract: 360 video object segmentation (360VOS) aims to predict temporally-consistent masks in 360 videos, offering full-scene coverage and benefiting applications such as VR/AR and embodied AI. Learning a 360VOS model is nontrivial due to the lack of high-quality labeled datasets. Recently, Segment Anything Models (SAMs), especially SAM2 with its memory module design, have shown strong, promptable VOS capability. However, directly using SAM2 for 360VOS yields implausible results, as 360 videos suffer from projection distortion, semantic inconsistency between the left and right sides, and sparse object mask information in SAM2's memory. To this end, we propose PanoSAM2, a novel 360VOS framework based on our lightweight distortion- and memory-aware adaptation strategies of SAM2 to achieve reliable 360VOS while retaining SAM2's user-friendly prompting design. Concretely, to tackle the projection distortion and semantic inconsistency issues, we propose a Pano-Aware Decoder with seam-consistent receptive fields and iterative distortion refinement to maintain continuity across the 0/360 degree boundary. Meanwhile, a Distortion-Guided Mask Loss is introduced to weight pixels by distortion magnitude, stressing stretched regions and boundaries. To address the object sparsity issue, we propose a Long-Short Memory Module to maintain a compact long-term object pointer to re-instantiate and align short-term memories, thereby enhancing temporal coherence.
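The abstract describes the Distortion-Guided Mask Loss only as weighting pixels by distortion magnitude. As a rough illustration, the sketch below assumes an equirectangular frame, where horizontal stretch at latitude φ grows as 1/cos(φ), and uses that clipped stretch as a per-pixel weight on binary cross-entropy. The weighting function, the clip value `max_weight`, and the names `latitude_stretch_weights` and `distortion_weighted_bce` are assumptions for illustration, not the paper's formulation.

```python
import numpy as np

def latitude_stretch_weights(height, width, max_weight=4.0):
    """Per-pixel weights for an equirectangular image (assumed projection).

    Row centers map to latitudes strictly inside (-pi/2, pi/2). The horizontal
    stretch 1 / cos(latitude) is clipped to max_weight and normalized to mean 1.
    """
    rows = (np.arange(height) + 0.5) / height       # row centers in (0, 1)
    latitude = (0.5 - rows) * np.pi                 # top row near +pi/2
    stretch = np.minimum(1.0 / np.cos(latitude), max_weight)
    weights = np.repeat(stretch[:, None], width, axis=1)
    return weights / weights.mean()

def distortion_weighted_bce(pred_prob, target, weights, eps=1e-7):
    """Weighted binary cross-entropy between predicted probabilities and a 0/1 mask."""
    p = np.clip(pred_prob, eps, 1.0 - eps)
    bce = -(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))
    return float((weights * bce).mean())

# The same false-positive band costs more near the pole than near the equator.
weights = latitude_stretch_weights(8, 16)
target = np.zeros((8, 16))
pred_pole = np.full((8, 16), 0.01)
pred_pole[0, :] = 0.9
pred_equator = np.full((8, 16), 0.01)
pred_equator[4, :] = 0.9
loss_pole = distortion_weighted_bce(pred_pole, target, weights)
loss_equator = distortion_weighted_bce(pred_equator, target, weights)
```

Normalizing the weights to mean 1 keeps the loss on the same scale as unweighted cross-entropy, which is a design choice of this sketch rather than something the abstract states.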
Extensive experiments show that PanoSAM2 yields substantial gains over SAM2: +5.6 on 360VOTS and +6.7 on PanoVOS, showing the effectiveness of our method.
[102] ParkSense: Where Should a Delivery Driver Park? Leveraging Idle AV Compute and Vision-Language Models
Die Hu,Henan Li
Main category: cs.CV
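TL;DR: This paper proposes ParkSense, a framework that uses idle compute on autonomous vehicles during low-risk states to run a Vision-Language Model (VLM) on satellite and street view imagery, identifying merchant entrances and legal parking zones. It targets the Delivery-Aware Precision Parking (DAPP) problem in food delivery and aims to raise driver income.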
Details
Motivation: Finding parking takes too long in food delivery, yet existing systems do not select precise parking spots relative to merchant entrances.
Method: ParkSense repurposes idle compute during low-risk AV states (waiting at red lights, traffic congestion, parking-lot crawl) to run a quantized 7B Vision-Language Model (VLM) on pre-cached satellite and street view imagery, identifying merchant entrances and legal parking zones. The paper also formalizes the Delivery-Aware Precision Parking (DAPP) problem.
Result: A quantized 7B VLM completes inference in 4-8 seconds on HW4-class hardware; estimated annual income gains for U.S. delivery drivers are 3,000-8,000 USD per driver; five open research directions are identified at the intersection of autonomous driving, computer vision, and last-mile logistics.
Conclusion: ParkSense puts idle compute to use on a practical logistics pain point, supports the feasibility and economic value of the DAPP problem, and opens new cross-disciplinary research directions.
Abstract: Finding parking consumes a disproportionate share of food delivery time, yet no system addresses precise parking-spot selection relative to merchant entrances. We propose ParkSense, a framework that repurposes idle compute during low-risk AV states -- queuing at red lights, traffic congestion, parking-lot crawl -- to run a Vision-Language Model (VLM) on pre-cached satellite and street view imagery, identifying entrances and legal parking zones. We formalize the Delivery-Aware Precision Parking (DAPP) problem, show that a quantized 7B VLM completes inference in 4-8 seconds on HW4-class hardware, and estimate annual per-driver income gains of 3,000-8,000 USD in the U.S. Five open research directions are identified at this unexplored intersection of autonomous driving, computer vision, and last-mile logistics.
[103] Mitigating Entangled Steering in Large Vision-Language Models for Hallucination Reduction
Yuanhong Zhang,Zhaoyang Wang,Xin Zhang,Weizhan Zhang,Joey Tianyi Zhou
Main category: cs.CV
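TL;DR: This paper proposes MESA, a framework that mitigates hallucinations in Large Vision-Language Models (LVLMs) through controlled and selective latent intervention while preserving the model's original generation behavior.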
Details
Motivation: Existing hallucination mitigation methods for LVLMs often alter generation behavior, producing shorter outputs and shifted token distributions. The root cause is entangled steering signals.
Method: MESA is a plug-and-play framework that applies targeted latent-space intervention to hallucination-relevant responses while preserving the original token distribution.
Result: Experiments on diverse generative and discriminative benchmarks show that MESA reduces hallucinations while better preserving generation behavior, outperforming prior methods across multiple LVLM families.
Conclusion: MESA is an effective, general, and non-intrusive hallucination mitigation approach that balances accuracy with natural generation.
Abstract: Large Vision-Language Models (LVLMs) have achieved remarkable success across cross-modal tasks but remain hindered by hallucinations, producing textual outputs inconsistent with visual content. Existing methods mitigate hallucinations but often alter generation behavior, resulting in shorter outputs and shifted token distributions, especially in latent space steering approaches. We identify that this issue stems from entangled steering signals, where suppressing hallucinations inadvertently disrupts the model's intrinsic generation behavior. To address this, we propose MESA, an effective plug-and-play framework that performs controlled and selective latent intervention for hallucination mitigation. Specifically, MESA targets hallucination-relevant responses while preserving the model's original token distribution, enabling effective hallucination reduction without compromising generation behavior. Extensive experiments across diverse generative and discriminative benchmarks demonstrate that MESA consistently reduces hallucinations while better preserving generation behavior, outperforming prior methods across multiple LVLM families.
[104] Tarot-SAM3: Training-free SAM3 for Any Referring Expression Segmentation
Weiming Zhang,Dingwen Xiao,Songyue Guo,Guangyu Xiang,Shiqi Wen,Minwei Zhao,Lei Chen,Lin Wang
Main category: cs.CV
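TL;DR: This paper proposes Tarot-SAM3, a training-free framework with two phases, an Expression Reasoning Interpreter (ERI) and Mask Self-Refining (MSR), that improves the robustness and generalization of SAM3 on Referring Expression Segmentation (RES) for any expression, covering explicit and implicit expressions as well as open-world scenarios.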
Details
Motivation: Existing Referring Expression Segmentation (RES) methods depend on large annotated datasets and struggle to handle both explicit and implicit expressions. SAM3 performs well at promptable segmentation but poorly on long or implicit expressions, and coupling it directly with an MLLM makes results overly dependent on the MLLM's reasoning without any refinement of the segmentation output.
Method: Tarot-SAM3 is a training-free framework with two phases. The Expression Reasoning Interpreter (ERI) uses reasoning-assisted prompt options for structured expression parsing and evaluation-aware rephrasing, producing robust heterogeneous prompts that drive SAM3. Mask Self-Refining (MSR) selects the best mask using DINOv3 feature relationships and self-refines it, correcting over- and under-segmentation.
Result: Tarot-SAM3 performs strongly on explicit and implicit RES benchmarks and in open-world scenarios; ablation studies validate both the ERI and MSR phases.
Conclusion: Tarot-SAM3 overcomes SAM3's poor handling of complex language expressions and its lack of post-hoc refinement in RES, offering a training-free, generalizable, and robust approach to referring expression segmentation.
Abstract: Referring Expression Segmentation (RES) aims to segment image regions described by natural-language expressions, serving as a bridge between vision and language understanding. Existing RES methods, however, rely heavily on large annotated datasets and are limited to either explicit or implicit expressions, hindering their ability to generalize to any referring expression. Recently, the Segment Anything Model 3 (SAM3) has shown impressive robustness in Promptable Concept Segmentation. Nonetheless, applying it to RES remains challenging: (1) SAM3 struggles with longer or implicit expressions; (2) naive coupling of SAM3 with a multimodal large language model (MLLM) makes the final results overly dependent on the MLLM's reasoning capability, without enabling refinement of SAM3's segmentation outputs. To this end, we present Tarot-SAM3, a novel training-free framework that can accurately segment from any referring expression. Specifically, Tarot-SAM3 consists of two key phases. First, the Expression Reasoning Interpreter (ERI) phase introduces reasoning-assisted prompt options to support structured expression parsing and evaluation-aware rephrasing. This transforms arbitrary queries into robust heterogeneous prompts for generating reliable masks with SAM3. Second, the Mask Self-Refining (MSR) phase selects the best mask across prompt types and performs self-refinement by leveraging rich feature relationships from DINOv3 to compare discriminative regions among ERI outputs. It then infers region affiliation to the target, thereby correcting over- and under-segmentation. Extensive experiments demonstrate that Tarot-SAM3 achieves strong performance on both explicit and implicit RES benchmarks, as well as open-world scenarios. Ablation studies further validate the effectiveness of each phase.
[105] Stitch4D: Sparse Multi-Location 4D Urban Reconstruction via Spatio-Temporal Interpolation
Hina Kogure,Kei Katsumata,Taiki Miyanishi,Komei Sugiura
Main category: cs.CV
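TL;DR: This paper proposes Stitch4D, a framework for 4D reconstruction of urban environments from sparse multi-location views. It synthesizes intermediate bridge views and jointly optimizes real and synthesized observations in a unified coordinate frame with inter-location consistency, significantly improving geometric completeness and dynamic continuity.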
Details
Motivation: Existing 4D reconstruction methods rely on densely overlapping views, whereas real urban deployments mostly use sparse, non-overlapping multi-location camera setups. This leads to failed reconstruction of intermediate regions and temporal artifacts, and the setting remains underexplored.
Method: Stitch4D has two parts: (i) synthesizing intermediate bridge views to densify spatial constraints and coverage; (ii) jointly optimizing real and synthesized observations in a unified coordinate frame under explicit inter-location consistency constraints.
Result: On U-S4D, a CARLA-based benchmark introduced in the paper, Stitch4D clearly outperforms representative 4D reconstruction baselines, with more complete geometry, smoother dynamics, and higher visual quality.
Conclusion: Recovering intermediate spatial coverage is essential for stable 4D reconstruction in sparse urban environments, and Stitch4D offers an effective, unified solution.
Abstract: Dynamic urban environments are often captured by cameras placed at spatially separated locations with little or no view overlap. However, most existing 4D reconstruction methods assume densely overlapping views. When applied to such sparse observations, these methods fail to reconstruct intermediate regions and often introduce temporal artifacts. To address this practical yet underexplored sparse multi-location setting, we propose Stitch4D, a unified 4D reconstruction framework that explicitly compensates for missing spatial coverage in sparse observations. Stitch4D (i) synthesizes intermediate bridge views to densify spatial constraints and improve spatial coverage, and (ii) jointly optimizes real and synthesized observations within a unified coordinate frame under explicit inter-location consistency constraints. By restoring intermediate coverage before optimization, Stitch4D prevents geometric collapse and reconstructs coherent geometry and smooth scene dynamics even in sparsely observed environments. To evaluate this setting, we introduce Urban Sparse 4D (U-S4D), a CARLA-based benchmark designed to assess spatiotemporal alignment under sparse multi-location configurations. Experimental results on U-S4D show that Stitch4D surpasses representative 4D reconstruction baselines and achieves superior visual quality. These results indicate that recovering intermediate spatial coverage is essential for stable 4D reconstruction in sparse urban environments.
[106] Generative 3D Gaussian Splatting for Arbitrary-Resolution Atmospheric Downscaling and Forecasting
Tao Hana,Zhibin Wen,Zhenghao Chen,Fenghua Lin,Junyu Gao,Song Guo,Lei Bai
Main category: cs.CV
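TL;DR: This paper proposes GSSA-ViT, a weather prediction framework based on 3D Gaussian splatting and scale-aware attention. It supports arbitrary-resolution forecasting and flexible downscaling, and significantly improves the efficiency and generalization of multi-scale atmospheric field modeling.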
Details
Motivation: AI-based numerical weather prediction (NWP) is fast, but generating high-resolution output is still limited by poor multi-scale adaptability and inefficient data representations.
Method: Latitude-longitude grid points are modeled as centers of 3D Gaussians; a generative scheme predicts the 3D Gaussian parameters (covariance, attributes, and opacity); a scale-aware attention module models cross-scale dependencies and supports continuous resolution adaptation.
Result: On ERA5, the method forecasts 87 atmospheric variables at arbitrary resolutions; on ERA5 and CMIP6, it outperforms existing methods at downscaling; according to the authors, it is the first to combine generative 3D Gaussian modeling with scale-aware attention for unified multi-scale NWP.
Conclusion: GSSA-ViT offers an efficient and scalable approach to high-resolution, multi-scale atmospheric prediction and downscaling.
Abstract: While AI-based numerical weather prediction (NWP) enables rapid forecasting, generating high-resolution outputs remains computationally demanding due to limited multi-scale adaptability and inefficient data representations. We propose the 3D Gaussian splatting-based scale-aware vision transformer (GSSA-ViT), a novel framework for arbitrary-resolution forecasting and flexible downscaling of high-dimensional atmospheric fields. Specifically, latitude-longitude grid points are treated as centers of 3D Gaussians. A generative 3D Gaussian prediction scheme is introduced to estimate key parameters, including covariance, attributes, and opacity, for unseen samples, improving generalization and mitigating overfitting. In addition, a scale-aware attention module is designed to capture cross-scale dependencies, enabling the model to effectively integrate information across varying downscaling ratios and support continuous resolution adaptation. To our knowledge, this is the first NWP approach that combines generative 3D Gaussian modeling with scale-aware attention for unified multi-scale prediction. Experiments on ERA5 show that the proposed method accurately forecasts 87 atmospheric variables at arbitrary resolutions, while evaluations on ERA5 and CMIP6 demonstrate its superior performance in downscaling tasks. The proposed framework provides an efficient and scalable solution for high-resolution, multi-scale atmospheric prediction and downscaling. Code is available at: https://github.com/binbin2xs/weather-GS.
[107] Shortcut Learning in Glomerular AI: Adversarial Penalties Hurt, Entropy Helps
Mohammad Daouk,Jan Ulrich Becker,Neeraja Kambham,Anthony Chang,Hien Nguyen,Chandra Mohan
Main category: cs.CV
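TL;DR: This paper studies distribution shift and shortcut learning caused by stain variation in renal pathology AI. It proposes an unsupervised entropy regularization method that needs no stain or site labels and evaluates its robustness and effectiveness for lupus nephritis glomerular lesion classification on a multi-center, multi-stain dataset.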
Details
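Motivation: Stain variability is a pervasive source of distribution shift and a potential cause of shortcut learning in renal pathology AI. The question is whether models exploit stain as a shortcut and how to mitigate that bias without stain or center labels.
Method: A multi-center, multi-stain dataset of 9,674 glomerular image patches is built from three centers and four stains (PAS, H&E, Jones, Trichrome). Bayesian CNN and ViT backbones with Monte Carlo dropout are compared in three settings: (1) stain-only classification; (2) a dual-head model jointly predicting lesion and stain with a supervised stain loss; (3) a dual-head model whose stain head is regularized by label-free entropy maximization.
Result: (1) Stain identity is trivially learnable, confirming it as a strong candidate shortcut. (2) The strength and sign of stain supervision strongly affect stain performance but leave lesion classification metrics essentially unchanged, indicating that this dataset is itself fairly resistant to the shortcut. (3) Entropy regularization keeps stain predictions near chance without hurting lesion accuracy or calibration.
Conclusion: A carefully curated multi-stain dataset can be inherently robust to stain shortcuts, and a Bayesian dual-head architecture with label-free entropy regularization is a simple, deployable safeguard against potential stain-related drift in glomerular AI.
Abstract: Stain variability is a pervasive source of distribution shift and potential shortcut learning in renal pathology AI. We ask whether lupus nephritis glomerular lesion classifiers exploit stain as a shortcut, and how to mitigate such bias without stain or site labels. We curate a multi-center, multi-stain dataset of 9,674 glomerular patches (224×224) from 365 WSIs across three centers and four stains (PAS, H&E, Jones, Trichrome), labeled as proliferative vs. non-proliferative. We evaluate Bayesian CNN and ViT backbones with Monte Carlo dropout in three settings: (1) stain-only classification; (2) a dual-head model jointly predicting lesion and stain with supervised stain loss; and (3) a dual-head model with label-free stain regularization via entropy maximization on the stain head. In (1), stain identity is trivially learnable, confirming a strong candidate shortcut. In (2), varying the strength and sign of stain supervision strongly modulates stain performance but leaves lesion metrics essentially unchanged, indicating no measurable stain-driven shortcut learning on this multi-stain, multi-center dataset, while overly adversarial stain penalties inflate predictive uncertainty. In (3), entropy-based regularization holds stain predictions near chance without degrading lesion accuracy or calibration.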
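Setting (3) can be pictured as a lesion cross-entropy term minus a weighted entropy bonus on the stain head, which needs no stain labels. The NumPy sketch below is an assumption about the general form of such an objective; the weight `lam`, the function names, and the toy logits are hypothetical, and the paper's exact loss may differ.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def mean_entropy(probs, eps=1e-12):
    """Mean Shannon entropy (in nats) over a batch of probability vectors."""
    return float(-(probs * np.log(probs + eps)).sum(axis=1).mean())

def dual_head_loss(lesion_logits, lesion_labels, stain_logits, lam=0.1):
    """Lesion cross-entropy minus lam * entropy of the stain head.

    Minimizing this pushes the stain head toward uniform (chance-level)
    predictions without using any stain labels.
    """
    lesion_probs = softmax(lesion_logits)
    n = lesion_labels.shape[0]
    ce = -np.log(lesion_probs[np.arange(n), lesion_labels] + 1e-12).mean()
    return float(ce - lam * mean_entropy(softmax(stain_logits)))

# Toy batch: two patches, two lesion classes, four stains.
lesion_logits = np.array([[2.0, 0.0], [0.0, 2.0]])
lesion_labels = np.array([0, 1])
uniform_stain = np.zeros((2, 4))
confident_stain = np.array([[10.0, 0.0, 0.0, 0.0], [0.0, 10.0, 0.0, 0.0]])
loss_uniform = dual_head_loss(lesion_logits, lesion_labels, uniform_stain)
loss_confident = dual_head_loss(lesion_logits, lesion_labels, confident_stain)
```

With identical lesion predictions, the batch whose stain head is confident incurs the higher loss, so gradient descent on this objective would flatten the stain predictions.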
Overall, a carefully curated multi-stain dataset can be inherently robust to stain shortcuts, and a Bayesian dual-head architecture with label-free entropy regularization offers a simple, deployment-friendly safeguard against potential stain-related drift in glomerular AI.
[108] ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks
Jiayang Xu,Fan Zhuo,Majun Zhang,Changhao Pan,Zehan Wang,Siyu Chen,Xiaoda Yang,Tao Jin,Zhou Zhao
Main category: cs.CV
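TL;DR: ImVideoEdit is an efficient framework that learns video editing from image pairs alone. By freezing the 3D attention modules and introducing Predict-Update Spatial Difference Attention and Text-Guided Dynamic Semantic Gating, it reaches editing quality and temporal consistency comparable to models trained on large-scale video data at very low computational cost.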
Details
Motivation: Existing video editing models depend on expensive, scarce paired video data, which limits scalability. Yet most video editing tasks can be decoupled into preserving temporal dynamics while modifying only spatial content.
Method: ImVideoEdit (1) treats images as single-frame videos and freezes the pretrained 3D attention modules to preserve temporal dynamics; (2) uses a Predict-Update Spatial Difference Attention module to progressively extract and inject spatial differences; (3) uses a Text-Guided Dynamic Semantic Gating mechanism for adaptive, text-driven editing without explicit masks.
Result: Trained on only 13K image pairs for 5 epochs at very low computational cost, it reaches editing fidelity and temporal consistency comparable to larger models trained on extensive video data.
Conclusion: High-quality video editing can be learned efficiently from image pairs alone, which lowers training cost and improves practicality.
Abstract: Current video editing models often rely on expensive paired video data, which limits their practical scalability. In essence, most video editing tasks can be formulated as a decoupled spatiotemporal process, where the temporal dynamics of the pretrained model are preserved while spatial content is selectively and precisely modified. Based on this insight, we propose ImVideoEdit, an efficient framework that learns video editing capabilities entirely from image pairs. By freezing the pre-trained 3D attention modules and treating images as single-frame videos, we decouple the 2D spatial learning process to help preserve the original temporal dynamics. The core of our approach is a Predict-Update Spatial Difference Attention module that progressively extracts and injects spatial differences. Rather than relying on rigid external masks, we incorporate a Text-Guided Dynamic Semantic Gating mechanism for adaptive and implicit text-driven modifications. Despite training on only 13K image pairs for 5 epochs with exceptionally low computational overhead, ImVideoEdit achieves editing fidelity and temporal consistency comparable to larger models trained on extensive video datasets.
[109] TOOLCAD: Exploring Tool-Using Large Language Models in Text-to-CAD Generation with Reinforcement Learning
Yifei Gong,Xing Wu,Wenda Liu,Kang Tu
Main category: cs.CV
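TL;DR: This paper proposes ToolCAD, a framework that uses large language models as tool-using agents for text-to-CAD modeling, improving them through an interactive CAD modeling environment and online curriculum reinforcement learning.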
Details
Motivation: No prior work has investigated how tool-using LLMs can best interact with CAD engines, which holds back LLM-based text-to-CAD modeling systems.
Method: ToolCAD builds an interactive CAD modeling training environment and applies an end-to-end post-training strategy with online curriculum reinforcement learning, so that the LLM produces refined CAD Modeling Chain of Thought (CAD-CoT) and becomes a proficient CAD tool user.
Result: ToolCAD fills the gap in adopting and training open-source LLMs as CAD tool-using agents, enabling performance comparable to proprietary models.
Conclusion: ToolCAD paves the way for more accessible and robust autonomous text-to-CAD modeling systems.
Abstract: Computer-Aided Design (CAD) is an expert-level task that relies on long-horizon reasoning and coherent modeling actions. Large Language Models (LLMs) have shown remarkable advancements in enabling language agents to tackle real-world tasks. Notably, there has been no investigation into how tool-using LLMs optimally interact with CAD engines, hindering the emergence of LLM-based agentic text-to-CAD modeling systems. We propose ToolCAD, a novel agentic CAD framework deploying LLMs as tool-using agents for text-to-CAD generation. Furthermore, we introduce an interactive CAD modeling gym to roll out reasoning and tool-augmented interaction trajectories with the CAD engine, incorporating hybrid feedback and human supervision. Meanwhile, an end-to-end post-training strategy is presented to enable the LLM agent to elicit refined CAD Modeling Chain of Thought (CAD-CoT) and evolve into proficient CAD tool-using agents via online curriculum reinforcement learning. Our findings demonstrate ToolCAD fills the gap in adopting and training open-source LLMs for CAD tool-using agents, enabling them to perform comparably to proprietary models, paving the way for more accessible and robust autonomous text-to-CAD modeling systems.
[110] DSCA: Dynamic Subspace Concept Alignment for Lifelong VLM Editing
Gyanendra Das,Sai Satyam Jena
Main category: cs.CV
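TL;DR: This paper proposes Dynamic Subspace Concept Alignment (DSCA), which decomposes a vision-language model's representation space into orthogonal semantic subspaces so that concepts are structurally isolated. This enables precise, interference-free continual knowledge editing and markedly improves edit stability and knowledge retention.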
Details
Motivation: Existing knowledge editing methods for Vision Language Models (VLMs) operate in a shared representation space where concepts are entangled, so edits interfere with one another and stable long-run sequential editing is hard to support.
Method: DSCA builds orthogonal semantic subspaces from joint vision-language representations using incremental clustering and PCA and performs edits within those subspaces. A multi-term loss maintains task fidelity, edit locality, and cross-modal alignment. The base model's parameters stay frozen.
Result: DSCA reaches 98% single-edit success, stays above 95% after 1,000 sequential edits, lowers hallucination by 3-5%, achieves the best backward transfer (BWT) scores on continual instruction tuning benchmarks, and shows state-of-the-art overall stability and knowledge retention.
Conclusion: Structurally separating semantic subspaces is key to reducing concept interference and catastrophic forgetting in continual VLM editing. DSCA turns knowledge isolation from a training objective into an architectural property, offering a new approach to lifelong editing.
Abstract: Model editing aims to update knowledge to add new concepts and change relevant information without retraining. Lifelong editing is a challenging task, prone to disrupting previously learned concepts, especially for Vision Language Models (VLMs), because sequential edits can lead to degraded reasoning and cross-modal misalignment. Existing VLM knowledge editing methods based on gated adapters, activation edits, and parameter merging techniques address the catastrophic forgetting seen in full fine-tuning; however, they still operate in the shared representation space of the VLM, where concepts are entangled, so edits interfere with other, unrelated concepts. We hypothesize that this instability persists because current methods algorithmically control edits via optimization rather than structurally separating knowledge. We introduce Dynamic Subspace Concept Alignment (DSCA), which by design mitigates this limitation by decomposing the representation space into a set of orthogonal semantic subspaces and proposing edits only in those transformed spaces. These subspaces are obtained through incremental clustering and PCA on joint vision-language representations. This process structurally isolates concepts, enabling precise, non-interfering edits by turning isolation from a soft training objective into an architectural property. The surgical edits are guided by a multi-term loss function for maintaining task fidelity, edit locality, and cross-modal alignment.
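To make the structural-isolation idea concrete, the toy sketch below splits PCA directions of random features into two mutually orthogonal subspaces, applies an edit confined to one of them, and checks that the coordinates in the other do not move. It is an illustration under assumptions: the features are random rather than joint vision-language representations, the incremental clustering step is omitted, and `pca_basis` and `edit_in_subspace` are hypothetical helpers, not DSCA's implementation.

```python
import numpy as np

def pca_basis(features, n_components):
    """Orthonormal principal directions (as rows) of centered features, via SVD."""
    centered = features - features.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:n_components]

def edit_in_subspace(h, basis, delta):
    """Shift representation h by delta, expressed in the coordinates of `basis`."""
    return h + basis.T @ delta

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 16))      # stand-in for joint representations
components = pca_basis(features, 4)
subspace_a, subspace_b = components[:2], components[2:4]   # two "concept" subspaces

h = rng.normal(size=16)
delta = np.array([1.5, -0.5])
h_edited = edit_in_subspace(h, subspace_a, delta)

# Coordinates of the edit as seen from each subspace.
change_in_a = subspace_a @ (h_edited - h)
change_in_b = subspace_b @ (h_edited - h)
```

Because the rows of the PCA basis are orthonormal, the edit shows up exactly as `delta` in subspace A and as zero in subspace B, which is the sense in which orthogonality makes isolation a structural property rather than something a loss term has to enforce.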
With the base model frozen, our method achieves 98 percent single-edit success, remains over 95 percent after 1000 sequential edits, lowers hallucination by 3 to 5 percent, and achieves the best backward transfer (BWT) scores on continual instruction tuning benchmarks. Extensive experiments demonstrate DSCA's state-of-the-art stability and knowledge retention capability in continual lifelong editing across various datasets and benchmarks.
[111] Lighting-grounded Video Generation with Renderer-based Agent Reasoning
Ziqi Cai,Taoyu Yang,Zheng Chang,Si Li,Han Jiang,Shuchen Weng,Boxin Shi
Main category: cs.CV
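TL;DR: This paper proposes LiVER, a framework that controls diffusion-based video generation through explicit 3D scene properties (layout, lighting, camera trajectory), achieving high-fidelity, temporally consistent, and editable controllable video generation.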
Details
Motivation: Existing video diffusion models lack explicit, disentangled control over key scene factors such as layout, lighting, and camera trajectory, which falls short of the fine-grained scene control needed in filmmaking and virtual production.
Method: Control signals are rendered from a unified 3D representation to disentangle scene properties; a large-scale, densely annotated 3D scene dataset is built; a lightweight conditioning module and a progressive training strategy integrate the 3D control signals into a foundational video diffusion model; a scene agent automatically translates user instructions into 3D control signals.
Result: LiVER achieves state-of-the-art photorealism and temporal consistency on image-to-video and video-to-video tasks, with precise, disentangled control over layout, lighting, and camera trajectory.
Conclusion: LiVER sets a new standard for controllable video generation and makes diffusion models considerably more usable and controllable in professional creative settings.
Abstract: Diffusion models have achieved remarkable progress in video generation, but their controllability remains a major limitation. Key scene factors such as layout, lighting, and camera trajectory are often entangled or only weakly modeled, restricting their applicability in domains like filmmaking and virtual production where explicit scene control is essential. We present LiVER, a diffusion-based framework for scene-controllable video generation. To achieve this, we introduce a novel framework that conditions video synthesis on explicit 3D scene properties, supported by a new large-scale dataset with dense annotations of object layout, lighting, and camera parameters. Our method disentangles these properties by rendering control signals from a unified 3D representation. We propose a lightweight conditioning module and a progressive training strategy to integrate these signals into a foundational video diffusion model, ensuring stable convergence and high fidelity. Our framework enables a wide range of applications, including image-to-video and video-to-video synthesis where the underlying 3D scene is fully editable. To further enhance usability, we develop a scene agent that automatically translates high-level user instructions into the required 3D control signals. Experiments show that LiVER achieves state-of-the-art photorealism and temporal consistency while enabling precise, disentangled control over scene factors, setting a new standard for controllable video generation.
[112] Object-Centric Stereo Ranging for Autonomous Driving: From Dense Disparity to Census-Based Template Matching
Qihao Huang
Main category: cs.CV
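TL;DR: This paper presents a stereo ranging system that combines dense stereo matching, object-centric Census-based template matching, and monocular geometric priors. Its focus is a GPU-accelerated, object-centric sparse stereo matching algorithm and an online calibration refinement framework, which together deliver fast, robust, real-time depth estimation for long-range vehicles.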
Details
Motivation: Traditional dense stereo matching methods such as BM and SGM have high computational cost, are sensitive to radiometric differences between cameras, and are inaccurate at long range, so long-range vehicle detection in autonomous driving needs a more robust and efficient ranging approach.
Method: A unified detection-ranging-tracking pipeline integrates three complementary depth estimation approaches: (1) dense BM/SGM; (2) object-centric Census-based template matching (with a far-close divide-and-conquer strategy, forward-backward verification, occlusion-aware sampling, and robust multi-block aggregation); (3) monocular geometric priors. An online calibration refinement framework adds auto-rectification offset search, radar-stereo voting-based disparity correction, and object-level radar-stereo association.
Result: The system ranges robustly in difficult driving conditions such as nighttime, rain, and varying illumination, and reaches real-time performance through an asynchronous GPU pipeline.
Conclusion: The fused approach and online calibration mechanism improve the accuracy, robustness, and real-time performance of long-range stereo ranging, providing a practical solution for autonomous driving perception.
Abstract: Accurate depth estimation is critical for autonomous driving perception systems, particularly for long-range vehicle detection on highways. Traditional dense stereo matching methods such as Block Matching (BM) and Semi-Global Matching (SGM) produce per-pixel disparity maps but suffer from high computational cost, sensitivity to radiometric differences between stereo cameras, and poor accuracy at long range where disparity values are small. In this report, we present a comprehensive stereo ranging system that integrates three complementary depth estimation approaches: dense BM/SGM disparity, object-centric Census-based template matching, and monocular geometric priors, within a unified detection-ranging-tracking pipeline. Our key contribution is a novel object-centric Census-based template matching algorithm that performs GPU-accelerated sparse stereo matching directly within detected bounding boxes, employing a far-close divide-and-conquer strategy, forward-backward verification, occlusion-aware sampling, and robust multi-block aggregation. We further describe an online calibration refinement framework that combines auto-rectification offset search, radar-stereo voting-based disparity correction, and object-level radar-stereo association for continuous extrinsic drift compensation. The complete system achieves real-time performance through asynchronous GPU pipeline design and delivers robust ranging across diverse driving conditions including nighttime, rain, and varying illumination.
[113] DP-DeGauss: Dynamic Probabilistic Gaussian Decomposition for Egocentric 4D Scene Reconstruction
Tingxi Chen,Zhengxue Cheng,Houqiang Zhong,Su Wang,Rong Xie,Li Song
Main category: cs.CV
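TL;DR: This paper proposes DP-DeGauss, a framework for dynamic 4D scene reconstruction from egocentric video. It uses probabilistic Gaussian decomposition to explicitly disentangle background, hands, and objects, which markedly improves reconstruction quality and editability.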
Details
Motivation: Existing methods struggle with the complex ego-motion, occlusions, and hand-object interactions in egocentric video; they assume fixed viewpoints or simply merge the dynamic foreground, and so cannot effectively disentangle dynamic components.
Method: Proposes DP-DeGauss, a dynamic probabilistic Gaussian decomposition framework: a unified 3D Gaussian set is initialized from COLMAP, each Gaussian is given a learnable category probability, and category-specific branches model deformation separately for background, hands, and objects. Category masks plus brightness and optical-flow control are introduced to improve static rendering and dynamic reconstruction.
Result: Outperforms baselines by +1.70 dB PSNR on average, with SSIM and LPIPS gains; the first to achieve explicit, fine-grained disentanglement of background, hand, and object components.
Conclusion: DP-DeGauss offers a more robust, interpretable, and editable modeling paradigm for egocentric 4D reconstruction, advancing scene understanding in AR/VR and embodied AI.
Abstract: Egocentric video is crucial for next-generation 4D scene reconstruction, with applications in AR/VR and embodied AI. However, reconstructing dynamic first-person scenes is challenging due to complex ego-motion, occlusions, and hand-object interactions. Existing decomposition methods are ill-suited, assuming fixed viewpoints or merging dynamics into a single foreground. To address these limitations, we introduce DP-DeGauss, a dynamic probabilistic Gaussian decomposition framework for egocentric 4D reconstruction. Our method initializes a unified 3D Gaussian set from COLMAP priors, augments each with a learnable category probability, and dynamically routes them into specialized deformation branches for background, hands, or object modeling. We employ category-specific masks for better disentanglement and introduce brightness and motion-flow control to improve static rendering and dynamic reconstruction. Extensive experiments show that DP-DeGauss outperforms baselines by +1.70 dB in PSNR on average with SSIM and LPIPS gains. More importantly, our framework achieves the first and state-of-the-art disentanglement of background, hand, and object components, enabling explicit, fine-grained separation, paving the way for more intuitive ego scene understanding and editing.
[114] SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations
Yunnan Wang,Kecheng Zheng,Jianyuan Wang,Minghao Chen,David Novotny,Christian Rupprecht,Yinghao Xu,Xing Zhu,Wenjun Zeng,Xin Jin,Yujun Shen
Main category: cs.CV
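TL;DR: This paper introduces SceneScribe-1M, a large-scale multi-modal dataset of one million in-the-wild videos, each annotated with text descriptions, camera parameters, depth maps, and 3D point tracks, intended to advance both 3D perception and video generation research.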
Details
Motivation: Existing datasets support either 3D understanding or video generation, and no large-scale unified resource serves both domains.
Method: Builds the SceneScribe-1M dataset of one million in-the-wild videos, annotating each with text descriptions, camera parameters, dense depth maps, and consistent 3D point tracks; designs multi-faceted benchmarks spanning perception and generation tasks.
Result: Establishes new benchmarks on monocular depth estimation, scene reconstruction, dynamic point tracking, and text-to-video synthesis (with or without camera control), demonstrating the dataset's utility and generalization.
Conclusion: SceneScribe-1M fills the data gap at the intersection of 3D perception and video synthesis; once open-sourced, it could become important infrastructure for research on embodied AI and controllable video generation.
Abstract: The convergence of 3D geometric perception and video synthesis has created an unprecedented demand for large-scale video data that is rich in both semantic and spatio-temporal information. While existing datasets have advanced either 3D understanding or video generation, a significant gap remains in providing a unified resource that supports both domains at scale. To bridge this chasm, we introduce SceneScribe-1M, a new large-scale, multi-modal video dataset. It comprises one million in-the-wild videos, each meticulously annotated with detailed textual descriptions, precise camera parameters, dense depth maps, and consistent 3D point tracks. We demonstrate the versatility and value of SceneScribe-1M by establishing benchmarks across a wide array of downstream tasks, including monocular depth estimation, scene reconstruction, and dynamic point tracking, as well as generative tasks such as text-to-video synthesis, with or without camera control. By open-sourcing SceneScribe-1M, we aim to provide a comprehensive benchmark and a catalyst for research, fostering the development of models that can both perceive the dynamic 3D world and generate controllable, realistic video content.
[115] MotionScape: A Large-Scale Real-World Highly Dynamic UAV Video Dataset for World Models
Zile Guo,Zhan Chen,Enze Zhu,Kan Wei,Yongkang Zou,Xiaoxuan Liu,Lei Wang
Main category: cs.CV
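TL;DR: This paper presents MotionScape, described as the first large-scale, highly dynamic, real-world UAV-view video dataset for training world models. It contains over 30 hours of 4K video (4.5M frames) paired with 6-DoF camera trajectories and fine-grained language descriptions, and is built with a multi-stage automated pipeline. Experiments show that it markedly improves world models' handling of complex 3D dynamics and large viewpoint changes, which benefits navigation and decision-making for UAV agents.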
Details
Motivation: Existing world models struggle to maintain spatiotemporal physical consistency under highly dynamic UAV views, mainly because of distribution bias in the training data: mainstream datasets (such as autonomous driving and human-centric egocentric video) cover only restricted 2.5D motion and lack realistic 6-DoF UAV motion priors.
Method: Builds the MotionScape dataset by collecting large-scale real-world UAV video and designing an automated multi-stage processing pipeline that combines CLIP-based relevance filtering, temporal segmentation, robust visual SLAM for 6-DoF trajectory recovery, and large-language-model-driven semantic annotation, so that video frames are tightly aligned, semantically and geometrically, with trajectories and language descriptions.
Result: MotionScape contains over 30 hours of 4K video (4.5M frames) with accurate 6-DoF trajectories and fine-grained language descriptions; experiments show that world models trained or fine-tuned on it improve markedly at predicting complex 3D dynamics and modeling large viewpoint shifts, strengthening UAV navigation and decision-making.
Conclusion: MotionScape fills the gap in high-quality dynamic data for UAV world modeling; its joint semantic-geometric alignment provides a new benchmark and a practical basis for improving the physical consistency of embodied agents in open environments.
Abstract: Recent advances in world models have demonstrated strong capabilities in simulating physical reality, making them an increasingly important foundation for embodied intelligence. For UAV agents in particular, accurate prediction of complex 3D dynamics is essential for autonomous navigation and robust decision-making in unconstrained environments. However, under the highly dynamic camera trajectories typical of UAV views, existing world models often struggle to maintain spatiotemporal physical consistency. A key reason lies in the distribution bias of current training data: most existing datasets exhibit restricted 2.5D motion patterns, such as ground-constrained autonomous driving scenes or relatively smooth human-centric egocentric videos, and therefore lack realistic high-dynamic 6-DoF UAV motion priors. To address this gap, we present MotionScape, a large-scale real-world UAV-view video dataset with highly dynamic motion for world modeling. MotionScape contains over 30 hours of 4K UAV-view videos, totaling more than 4.5M frames. This novel dataset features semantically and geometrically aligned training samples, where diverse real-world UAV videos are tightly coupled with accurate 6-DoF camera trajectories and fine-grained natural language descriptions. To build the dataset, we develop an automated multi-stage processing pipeline that integrates CLIP-based relevance filtering, temporal segmentation, robust visual SLAM for trajectory recovery, and large-language-model-driven semantic annotation.
Extensive experiments show that incorporating such semantically and geometrically aligned annotations effectively improves the ability of existing world models to simulate complex 3D dynamics and handle large viewpoint shifts, thereby benefiting decision-making and planning for UAV agents in complex environments. The dataset is publicly available at https://github.com/Thelegendzz/MotionScape
[116] SAT: Selective Aggregation Transformer for Image Super-Resolution
Dinh Phu Tran,Thao Do,Saad Wazir,Seongah Kim,Seon Kwon Kim,Daeyoung Kim
Main category: cs.CV
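TL;DR: This paper proposes the Selective Aggregation Transformer (SAT), a new Transformer architecture that selectively aggregates the key-value matrices with a density-driven token aggregation algorithm. It cuts computation substantially (97% fewer tokens, up to 27% fewer FLOPs) while enlarging the receptive field and preserving query resolution and high-frequency detail, and it outperforms the state-of-the-art method PFT on image super-resolution by up to 0.22 dB.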
Details
Motivation: Conventional Transformers for image super-resolution struggle to balance efficiency and global modeling because self-attention has quadratic computational complexity; existing window-attention methods are more efficient but have restricted receptive fields.
Method: Proposes the Selective Aggregation Transformer (SAT). Its core is the Density-driven Token Aggregation algorithm: using density and isolation metrics, the feature map is clustered and each cluster is represented by a single aggregation token. Only the key-value matrices are compressed (97% fewer tokens) while the query keeps its full resolution, which enables efficient long-range dependency modeling.
Result: On image super-resolution, SAT improves PSNR over the state-of-the-art method PFT by up to 0.22 dB while reducing FLOPs by up to 27%.
Conclusion: Through selective token aggregation, SAT enlarges the receptive field and preserves critical high-frequency information at markedly lower computational cost, achieving a better balance of efficiency and performance and suggesting a new direction for efficient Transformer design.
Abstract: Transformer-based approaches have revolutionized image super-resolution by modeling long-range dependencies. However, the quadratic computational complexity of vanilla self-attention mechanisms poses significant challenges, often leading to compromises between efficiency and global context exploitation. Recent window-based attention methods mitigate this by localizing computations, but they often yield restricted receptive fields. To mitigate these limitations, we propose Selective Aggregation Transformer (SAT). This novel transformer efficiently captures long-range dependencies, leading to an enlarged model receptive field by selectively aggregating key-value matrices (reducing the number of tokens by 97%) via our Density-driven Token Aggregation algorithm while maintaining the full resolution of the query matrix. This design significantly reduces computational costs, resulting in lower complexity and enabling scalable global interactions without compromising reconstruction fidelity. SAT identifies and represents each cluster with a single aggregation token, utilizing density and isolation metrics to ensure that critical high-frequency details are preserved. Experimental results demonstrate that SAT outperforms the state-of-the-art method PFT by up to 0.22 dB, while the total number of FLOPs can be reduced by up to 27%.
[117] Few-Shot Incremental 3D Object Detection in Dynamic Indoor Environments
Yun Zhu,Jianjun Qian,Jian Yang,Jin Xie,Na Zhao
Main category: cs.CV
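TL;DR: This paper proposes FI3Det, the first framework for few-shot incremental 3D object detection. It uses vision-language models (VLMs) to learn knowledge of unseen categories and clearly outperforms baseline methods on the ScanNet V2 and SUN RGB-D datasets.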
Details
Motivation: Existing incremental 3D detection methods depend on extensive annotations of novel classes and cannot meet the practical need, in dynamic indoor environments, to generalize from only a few samples.
Method: Proposes the FI3Det framework: (1) a VLM-guided unknown object learning module that mines unknown objects and extracts 2D semantic features and class-agnostic 3D boxes; (2) a weighting mechanism based on spatial location and in-box feature consistency to suppress noise; (3) a gated multimodal prototype imprinting module that fuses aligned 2D semantic and 3D geometric features to build category prototypes and compute classification scores.
Result: On ScanNet V2 and SUN RGB-D, FI3Det achieves strong and consistent gains under both batch and sequential evaluation settings, as the first few-shot incremental 3D detection method.
Conclusion: FI3Det reduces the dependence of incremental 3D detection on extensive annotation; with VLMs and multimodal prototype modeling it detects novel classes from only a few samples, moving embodied AI closer to practical use in dynamic environments.
Abstract: Incremental 3D object perception is a critical step toward embodied intelligence in dynamic indoor environments. However, existing incremental 3D detection methods rely on extensive annotations of novel classes for satisfactory performance. To address this limitation, we propose FI3Det, a Few-shot Incremental 3D Detection framework that enables efficient 3D perception with only a few novel samples by leveraging vision-language models (VLMs) to learn knowledge of unseen categories. FI3Det introduces a VLM-guided unknown object learning module in the base stage to enhance perception of unseen categories. Specifically, it employs VLMs to mine unknown objects and extract comprehensive representations, including 2D semantic features and class-agnostic 3D bounding boxes. To mitigate noise in these representations, a weighting mechanism is further designed to re-weight the contributions of point- and box-level features based on their spatial locations and feature consistency within each box. Moreover, FI3Det proposes a gated multimodal prototype imprinting module, where category prototypes are constructed from aligned 2D semantic and 3D geometric features to compute classification scores, which are then fused via a multimodal gating mechanism for novel object detection. As the first framework for few-shot incremental 3D object detection, we establish both batch and sequential evaluation settings on two datasets, ScanNet V2 and SUN RGB-D, where FI3Det achieves strong and consistent improvements over baseline methods.
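The prototype imprinting and gated fusion described above can be illustrated with a minimal sketch. This is not the FI3Det implementation: the function names, the fixed scalar `gate` (the paper describes a multimodal gating mechanism whose form is not given in the abstract), and the toy 2-D features are all assumptions made for illustration.

```python
import math

def normalize(vec):
    """Scale a vector to unit L2 norm (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def imprint_prototype(support_features):
    """Average the unit-normalized support features of one class, then re-normalize."""
    unit = [normalize(f) for f in support_features]
    dim = len(unit[0])
    mean = [sum(f[i] for f in unit) / len(unit) for i in range(dim)]
    return normalize(mean)

def cosine_scores(query, prototypes):
    """Cosine similarity between a query feature and every class prototype."""
    q = normalize(query)
    return {name: sum(a * b for a, b in zip(q, proto))
            for name, proto in prototypes.items()}

def gated_fusion(scores_2d, scores_3d, gate):
    """Blend per-class scores: gate=1 uses 2D semantics only, gate=0 uses 3D geometry only.
    A fixed scalar gate is an assumption; a learned gate would replace it."""
    return {name: gate * scores_2d[name] + (1.0 - gate) * scores_3d[name]
            for name in scores_2d}

# Toy few-shot supports (made-up features) for two hypothetical novel classes.
protos_2d = {"lamp": imprint_prototype([[1.0, 0.1], [0.9, 0.0]]),
             "stool": imprint_prototype([[0.0, 1.0]])}
protos_3d = {"lamp": imprint_prototype([[0.8, 0.3]]),
             "stool": imprint_prototype([[0.2, 0.9]])}

fused = gated_fusion(cosine_scores([1.0, 0.2], protos_2d),
                     cosine_scores([0.7, 0.4], protos_3d),
                     gate=0.6)
best = max(fused, key=fused.get)
```

Because prototypes are imprinted directly from a handful of support features, adding a novel class in this sketch requires no gradient updates, which is the property that makes imprinting attractive in few-shot settings.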
Code is available at https://github.com/zyrant/FI3Det.
[118] SearchAD: Large-Scale Rare Image Retrieval Dataset for Autonomous Driving
Felix Embacher,Jonas Uhrig,Marius Cordts,Markus Enzweiler
Main category: cs.CV
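TL;DR: This paper introduces SearchAD, a large-scale rare image retrieval dataset for autonomous driving (AD). It contains 423k frames with 513k high-quality manually annotated bounding boxes covering 90 rare categories, and supports text-to-image and image-to-image retrieval, few-shot learning, and fine-tuning of multi-modal models. Experiments show that text-based methods outperform image-based ones, but overall retrieval performance still has room to improve.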
Details
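Motivation: Efficiently retrieving rare, safety-critical driving scenarios from large-scale datasets is a key challenge in building robust autonomous driving systems; existing benchmarks focus on instance-level retrieval, and there is no standardized dataset for long-tail semantic retrieval.
Method: Builds the SearchAD dataset: combines 11 existing datasets, manually annotates more than 513k bounding boxes across 423k frames covering 90 extremely rare categories (some with fewer than 50 occurrences), defines a clear data split that supports text-to-image and image-to-image retrieval, few-shot learning, and multi-modal model fine-tuning, and runs a comprehensive evaluation.
Result: Text-based retrieval outperforms image-based retrieval owing to stronger semantic grounding; models that directly align spatial visual features with language perform best in the zero-shot setting; the fine-tuning baseline improves performance significantly, but absolute retrieval capability remains unsatisfactory.
Conclusion: SearchAD is the first large-scale retrieval-driven benchmark for data curation and long-tail perception research in autonomous driving, advancing standardization and technical progress in semantic image retrieval for AD.
Abstract: Retrieving rare and safety-critical driving scenarios from large-scale datasets is essential for building robust autonomous driving (AD) systems. As dataset sizes continue to grow, the key challenge shifts from collecting more data to efficiently identifying the most relevant samples. We introduce SearchAD, a large-scale rare image retrieval dataset for AD containing over 423k frames drawn from 11 established datasets. SearchAD provides high-quality manual annotations of more than 513k bounding boxes covering 90 rare categories. It specifically targets the needle-in-a-haystack problem of locating extremely rare classes, with some appearing fewer than 50 times across the entire dataset. Unlike existing benchmarks, which focus on instance-level retrieval, SearchAD emphasizes semantic image retrieval with a well-defined data split, enabling text-to-image and image-to-image retrieval, few-shot learning, and fine-tuning of multi-modal retrieval models. Comprehensive evaluations show that text-based methods outperform image-based ones due to stronger inherent semantic grounding. While models directly aligning spatial visual features with language achieve the best zero-shot results, and our fine-tuning baseline significantly improves performance, absolute retrieval capabilities remain unsatisfactory.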
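As a rough illustration of how retrieval of a rare class from a ranked list can be scored, the sketch below computes average precision. The abstract does not say which metric SearchAD uses, so the choice of metric, the function name `average_precision`, and the toy image IDs are assumptions.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked list: the mean of precision@k
    evaluated at every rank k where a relevant item appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, image_id in enumerate(ranked_ids, start=1):
        if image_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant items that never appear in the ranking contribute zero.
    return precision_sum / len(relevant)

# Hypothetical ranking returned for a text query naming a rare class.
ranking = ["img7", "img2", "img9", "img4", "img1"]
relevant = {"img2", "img4"}
ap = average_precision(ranking, relevant)  # hits at ranks 2 and 4
```

For needle-in-a-haystack classes with only a few dozen positives, a rank-sensitive score of this kind rewards placing the rare positives near the top, whereas plain accuracy would be dominated by the negatives.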
With a held-out test set on a public benchmark server, SearchAD establishes the first large-scale dataset for retrieval-driven data curation and long-tail perception research in AD: https://iis-esslingen.github.io/searchad/
[119] Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding
Xuezhen Tu,Jingyu Wu,Fangyu Kang,Qingpeng Nong,Kaijin Zhang,Chaoyue Niu,Fan Wu
Main category: cs.CV
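TL;DR: This paper proposes Bridge-STG, a framework that decouples temporal and spatial localization and adds semantic bridging and query-guided mechanisms. It targets two challenges in spatio-temporal video grounding, entangled spatio-temporal alignment and dual-domain visual token redundancy, and reaches state-of-the-art performance on multiple benchmarks.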
Details
Motivation: Existing multimodal large language models face two core challenges in spatio-temporal video grounding: entangled spatio-temporal alignment (caused by coupling heterogeneous sub-tasks in the same autoregressive output space) and dual-domain visual token redundancy (target objects are sparse in both time and space, so the vast majority of visual tokens are irrelevant).
Method: Proposes Bridge-STG, an end-to-end framework with two key designs: (1) the Spatio-Temporal Semantic Bridging (STSB) mechanism with Explicit Temporal Alignment (ETA), which distills the MLLM's temporal reasoning context into enriched bridging queries; (2) the Query-Guided Spatial Localization (QGSL) module, which uses these queries to drive a dedicated spatial decoder and adds multi-layer interactive queries and a positive/negative frame sampling strategy.
Result: Average m_vIoU on VidSTG rises from 26.4 to 34.3, well ahead of existing MLLM-based methods; the model also shows strong cross-task transfer on a range of fine-grained video understanding tasks.
Conclusion: By decoupling temporal and spatial localization and building a robust semantic interface, Bridge-STG mitigates spatio-temporal entanglement and visual redundancy, demonstrating effectiveness and generalization in joint video-language understanding.
Abstract: Spatio-Temporal Video Grounding requires jointly localizing target objects across both temporal and spatial dimensions based on natural language queries, posing fundamental challenges for existing Multimodal Large Language Models (MLLMs). We identify two core challenges: entangled spatio-temporal alignment, arising from coupling two heterogeneous sub-tasks within the same autoregressive output space, and dual-domain visual token redundancy, where target objects exhibit simultaneous temporal and spatial sparsity, rendering the overwhelming majority of visual tokens irrelevant to the grounding query. To address these, we propose Bridge-STG, an end-to-end framework that decouples temporal and spatial localization while maintaining semantic coherence. While decoupling is the natural solution to this entanglement, it risks creating a semantic gap between the temporal MLLM and the spatial decoder. Bridge-STG resolves this through two pivotal designs: the Spatio-Temporal Semantic Bridging (STSB) mechanism with Explicit Temporal Alignment (ETA) distills the MLLM's temporal reasoning context into enriched bridging queries as a robust semantic interface; and the Query-Guided Spatial Localization (QGSL) module leverages these queries to drive a purpose-built spatial decoder with multi-layer interactive queries and positive/negative frame sampling, jointly eliminating dual-domain visual token redundancy.
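Because the task scores temporal and spatial localization jointly, a small sketch of the vIoU measure behind the reported m_vIoU may help. The abstract does not restate the definition, so the form used here (per-frame box IoU summed over frames shared by prediction and ground truth, divided by the number of frames in their temporal union, then averaged over samples) is the commonly used one and should be treated as an assumption, as should all function names and toy boxes.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def v_iou(pred, gt):
    """pred and gt map frame index -> box for the frames inside each tube."""
    union_frames = set(pred) | set(gt)
    if not union_frames:
        return 0.0
    shared_frames = set(pred) & set(gt)
    return sum(box_iou(pred[t], gt[t]) for t in shared_frames) / len(union_frames)

def mean_v_iou(samples):
    """Average vIoU over (pred, gt) pairs."""
    return sum(v_iou(p, g) for p, g in samples) / len(samples)

# Toy example: the prediction starts one frame late and drifts in frame 2.
gt = {0: (0, 0, 10, 10), 1: (0, 0, 10, 10), 2: (0, 0, 10, 10)}
pred = {1: (0, 0, 10, 10), 2: (5, 0, 15, 10), 3: (0, 0, 10, 10)}
score = v_iou(pred, gt)  # (1 + 1/3) / 4 frames in the temporal union
```

Under this definition a prediction is penalized both for boxes that drift within shared frames and for frames predicted outside the true temporal segment, which is why temporal and spatial errors are hard to separate when both are emitted from one output space.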
Extensive experiments across multiple benchmarks demonstrate that Bridge-STG achieves state-of-the-art performance among MLLM-based methods. Bridge-STG improves average m_vIoU from 26.4 to 34.3 on VidSTG and demonstrates strong cross-task transfer across various fine-grained video understanding tasks under a unified multi-task training regime.
[120] Component-Adaptive and Lesion-Level Supervision for Improved Small Structure Segmentation in Brain MRI
Minh Sao Khue Luu,Evgeniy N. Pavlovskiy,Bair N. Tuchinov
Main category: cs.CV
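TL;DR: This paper proposes CATMIL, a unified objective function that augments the base nnU-Net segmentation loss with a connected-component-adaptive Tversky loss and a lesion-level supervision term based on multiple instance learning (MIL). On the highly class-imbalanced task of segmenting small multiple sclerosis lesions, it markedly raises small-lesion recall while keeping false positives under control.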
Details
Motivation: Small-lesion segmentation in multiple sclerosis (MS) MRI is severely class-imbalanced, and standard segmentation losses struggle to balance lesion detection against segmentation accuracy; higher-level supervision (for example at the lesion-instance level) is needed on top of voxel-level segmentation to improve small-lesion detection.
Method: Proposes the CATMIL joint loss: (1) Component-Adaptive Tversky (CAT), which reweights voxel losses by connected component to rebalance the contributions of large and small lesions; (2) a Multiple Instance Learning (MIL) term, which provides lesion-instance-level supervision and encourages the model to detect every lesion. Both are optimized jointly with the standard nnU-Net loss. Evaluation uses 5-fold cross-validation on the MSLesSeg dataset with a fixed nnU-Net architecture.
Result: CATMIL achieves the best overall performance on MSLesSeg: Dice reaches 0.7834 and boundary error decreases; small-lesion recall rises markedly and false negatives fall, while false positive volume stays the lowest.
Conclusion: Integrating component-level and lesion-level supervision into a single objective is an effective and practical way to improve small-lesion segmentation, especially in highly imbalanced medical image segmentation settings.
Abstract: We propose a unified objective function, termed CATMIL, that augments the base segmentation loss with two auxiliary supervision terms operating at different levels. The first term, Component-Adaptive Tversky, reweights voxel contributions based on connected components to balance the influence of lesions of different sizes. The second term, based on Multiple Instance Learning, introduces lesion-level supervision by encouraging the detection of each lesion instance. These terms are combined with the standard nnU-Net loss to jointly optimize voxel-level segmentation accuracy and lesion-level detection. We evaluate the proposed objective on the MSLesSeg dataset using a consistent nnU-Net framework and 5-fold cross-validation. The results show that CATMIL achieves the most balanced performance across segmentation accuracy, lesion detection, and error control. It improves Dice score (0.7834) and reduces boundary error compared to standard losses. More importantly, it substantially increases small lesion recall and reduces false negatives, while maintaining the lowest false positive volume among compared methods. These findings demonstrate that integrating component-level and lesion-level supervision within a unified objective provides an effective and practical approach for improving small lesion segmentation in highly imbalanced settings.
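To make the idea of component-based reweighting concrete, here is a minimal 2D sketch of a Tversky loss whose foreground voxels are weighted by the inverse size of their connected component. This is not the authors' formulation: the abstract does not give the weighting rule, so the 1/size weights, the `alpha` and `beta` values, the 4-connectivity, and all names are assumptions.

```python
def label_components(mask):
    """4-connected components of a 2D binary mask, returned as lists of (row, col)."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                stack, pixels = [(r, c)], []
                seen[r][c] = True
                while stack:
                    y, x = stack.pop()
                    pixels.append((y, x))
                    for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                components.append(pixels)
    return components

def tversky_loss(prob, target, adaptive=True, alpha=0.3, beta=0.7, eps=1e-6):
    """1 - Tversky index. With adaptive=True each lesion's voxels are weighted
    by 1/size (assumed rule), so every lesion contributes one unit of mass."""
    rows, cols = len(target), len(target[0])
    weight = [[1.0] * cols for _ in range(rows)]
    if adaptive:
        for pixels in label_components(target):
            for y, x in pixels:
                weight[y][x] = 1.0 / len(pixels)
    tp = fp = fn = 0.0
    for y in range(rows):
        for x in range(cols):
            p, g, w = prob[y][x], target[y][x], weight[y][x]
            tp += w * p * g            # weighted true positives
            fn += w * (1.0 - p) * g    # weighted false negatives
            fp += p * (1 - g)          # background keeps unit weight
    return 1.0 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)

# One 4-voxel lesion that is found and one 1-voxel lesion that is missed.
target = [[1, 1, 0, 0],
          [1, 1, 0, 0],
          [0, 0, 0, 1]]
pred = [[1.0, 1.0, 0.0, 0.0],
        [1.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0]]
plain_loss = tversky_loss(pred, target, adaptive=False)
adaptive_loss = tversky_loss(pred, target, adaptive=True)
```

In this toy case the missed single-voxel lesion costs far more under the component-weighted variant than under the plain Tversky loss, which is the imbalance the component-adaptive term is meant to correct.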
All code and pretrained models are available at https://github.com/luumsk/SmallLesionMRI.
[121] Rotation Equivariant Convolutions in Deformable Registration of Brain MRI
Arghavan Rezvani,Kun Han,Anthony T. Wu,Pooya Khosravi,Xiaohui Xie
Main category: cs.CV
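TL;DR: This paper introduces rotation-equivariant convolutions into deformable brain MRI registration networks. Replacing the encoder in three baseline architectures shows the benefits: higher registration accuracy, fewer parameters, greater robustness to rotated inputs, and better performance with limited training data.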
Details
Motivation: CNNs lack rotation equivariance and so cannot exploit the rotational symmetries inherent in anatomical structures, particularly in brain MRI, which limits registration performance.
Method: Integrates rotation-equivariant convolutions into deformable brain MRI registration networks, replaces the standard encoders in three baseline architectures with equivariant ones, and evaluates on multiple public brain MRI datasets.
Result: Equivariant encoders outperform the baselines in three respects: registration accuracy with fewer parameters, robustness to rotated inputs, and sample efficiency with limited training data.
Conclusion: Incorporating geometric priors such as rotation equivariance is a key step toward more robust, accurate, and efficient registration models.
Abstract: Image registration is a fundamental task that aligns anatomical structures between images. While CNNs perform well, they lack rotation equivariance - a rotated input does not produce a correspondingly rotated output. This hinders performance by failing to exploit the rotational symmetries inherent in anatomical structures, particularly in brain MRI. In this work, we integrate rotation-equivariant convolutions into deformable brain MRI registration networks. We evaluate this approach by replacing standard encoders with equivariant ones in three baseline architectures, testing on multiple public brain MRI datasets. Our experiments demonstrate that equivariant encoders have three key advantages: 1) They achieve higher registration accuracy while reducing network parameters, confirming the benefit of this anatomical inductive bias. 2) They outperform baselines on rotated input pairs, demonstrating robustness to orientation variations common in clinical practice. 3) They show improved performance with less training data, indicating greater sample efficiency. Our results demonstrate that incorporating geometric priors is a critical step toward building more robust, accurate, and efficient registration models.
[122] Small Vision-Language Models are Smart Compressors for Long Video Understanding
Junjie Fei,Jun Chen,Zechun Liu,Yunyang Xiong,Chong Zhou,Wei Wen,Junlin Han,Mingchen Zhuge,Saksham Suri,Qi Qian,Shuming Liu,Lemeng Wu,Raghuraman Krishnamoorthi,Vikas Chandra,Mohamed Elhoseiny,Chenchen Zhu
Main category: cs.CV
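TL;DR: Tempo is an efficient query-aware framework for long video understanding. It uses a small vision-language model (SVLM) for early cross-modal distillation and a training-free Adaptive Token Allocation (ATA) mechanism to achieve intent-aligned dynamic compression under a strict token budget, markedly improving hour-long video understanding.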
Details
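Motivation: Existing methods are limited by the context-length bottleneck of multimodal large models: dense visual streams exhaust the token budget and worsen the lost-in-the-middle phenomenon. Heuristic sampling strategies such as sparse sampling or uniform pooling blindly sacrifice key-frame information and waste bandwidth on irrelevant backgrounds, so they cannot balance fidelity and efficiency.
Method: Proposes the Tempo framework: (1) a small vision-language model (SVLM) acts as a local temporal compressor, casting token reduction as early cross-modal distillation in a single forward pass; (2) a training-free, O(1) Adaptive Token Allocation (ATA) that uses the SVLM's zero-shot relevance prior and semantic front-loading to assign dense tokens to query-critical segments while compressing redundant segments into minimal temporal anchors that preserve the global storyline.
Result: On the extreme-long LVBench benchmark (4101 s), the 6B architecture scores 52.3 under a strict 8K visual token budget, outperforming GPT-4o and Gemini 1.5 Pro; scaling to 2048 frames raises the score to 53.7. The model supports aggressive dynamic compression at 0.5-16 tokens/frame and compresses videos to well below theoretical limits.
Conclusion: The key to true long-video understanding is intent-driven efficient compression rather than blindly enlarging the context window; Tempo shows that query-aware, causality-preserving dynamic compression enables robust hour-long video understanding under a tight token budget.
Abstract: Adapting Multimodal Large Language Models (MLLMs) for hour-long videos is bottlenecked by context limits. Dense visual streams saturate token budgets and exacerbate the lost-in-the-middle phenomenon. Existing heuristics, like sparse sampling or uniform pooling, blindly sacrifice fidelity by discarding decisive moments and wasting bandwidth on irrelevant backgrounds. We propose Tempo, an efficient query-aware framework compressing long videos for downstream understanding. Tempo leverages a Small Vision-Language Model (SVLM) as a local temporal compressor, casting token reduction as an early cross-modal distillation process to generate compact, intent-aligned representations in a single forward pass. To enforce strict budgets without breaking causality, we introduce Adaptive Token Allocation (ATA). Exploiting the SVLM's zero-shot relevance prior and semantic front-loading, ATA acts as a training-free O(1) dynamic router. It allocates dense bandwidth to query-critical segments while compressing redundancies into minimal temporal anchors to maintain the global storyline. Extensive experiments show our 6B architecture achieves state-of-the-art performance with aggressive dynamic compression (0.5-16 tokens/frame). On the extreme-long LVBench (4101s), Tempo scores 52.3 under a strict 8K visual budget, outperforming GPT-4o and Gemini 1.5 Pro. Scaling to 2048 frames reaches 53.7.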
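The budgeted allocation idea can be sketched as follows. This is only an illustration of relevance-proportional allocation with a per-frame floor and cap; it is not Tempo's ATA, it does not reproduce the O(1) routing claimed in the abstract, and the function name, the relevance scores, and the proportional rule are assumptions. The 0.5 and 16 tokens/frame bounds are taken from the abstract.

```python
def allocate_tokens(relevance, frames_per_segment, budget,
                    min_per_frame=0.5, max_per_frame=16.0):
    """Split a visual token budget across video segments.

    Every segment first receives a small floor (its 'temporal anchor');
    the remaining budget is shared in proportion to query relevance and
    clipped at a per-segment cap. Budget left over after clipping is
    simply unused in this sketch.
    """
    floors = [min_per_frame * n for n in frames_per_segment]
    caps = [max_per_frame * n for n in frames_per_segment]
    spare = budget - sum(floors)
    if spare < 0:
        raise ValueError("budget is too small to keep a minimal anchor per segment")
    total_relevance = sum(relevance)
    allocation = []
    for rel, floor, cap in zip(relevance, floors, caps):
        share = spare * rel / total_relevance if total_relevance > 0 else 0.0
        allocation.append(min(cap, floor + share))
    return allocation

# Three 8-frame segments; the middle one is most relevant to the query.
tokens = allocate_tokens(relevance=[0.1, 0.8, 0.1],
                         frames_per_segment=[8, 8, 8],
                         budget=64)
```

The allocation is returned as fractional token counts; rounding to integers while staying within the budget is left out to keep the sketch short.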
Crucially, Tempo compresses hour-long videos substantially below theoretical limits, proving true long-form video understanding relies on intent-driven efficiency rather than greedily padded context windows.
[123] Beyond Mamba: Enhancing State-space Models with Deformable Dilated Convolutions for Multi-scale Traffic Object Detection
Jun Li,Yingying Shi,Zhixuan Ruan,Nan Guo,Jianhua Xu
Main category: cs.CV
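TL;DR: This paper proposes MDDCNet, which combines deformable dilated convolutions with the Mamba model to improve detection accuracy for multi-scale objects, especially small ones, in traffic scenes.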
Details
Motivation: Existing Mamba-based methods have difficulty capturing the local details of small objects, and state-space models lack hierarchical feature representation and cross-scale interaction, which limits detection performance in complex traffic scenes.
Method: Proposes MDDCNet: (1) a hybrid backbone combining Multi-Scale Deformable Dilated Convolution (MSDDC) blocks with Mamba blocks; (2) a Channel-Enhanced Feed-Forward Network (CE-FFN) that strengthens channel interaction; (3) a Mamba-based Attention-Aggregating Feature Pyramid Network (A²FPN) that enhances multi-scale feature fusion.
Result: Clearly outperforms current advanced detectors on several public benchmarks and real-world traffic datasets.
Conclusion: By jointly modeling local details and global semantics and strengthening cross-scale and channel interaction, MDDCNet improves multi-scale object detection in complex traffic scenes.
Abstract: In a real-world traffic scenario, varying-scale objects are usually distributed in a cluttered background, which poses great challenges to accurate detection. Although current Mamba-based methods can efficiently model long-range dependencies, they still struggle to capture small objects with abundant local details, which hinders joint modeling of local structures and global semantics. Moreover, state-space models exhibit limited hierarchical feature representation and weak cross-scale interaction due to flat sequential modeling and insufficient spatial inductive biases, leading to sub-optimal performance in complex scenes. To address these issues, we propose a Mamba with Deformable Dilated Convolutions Network (MDDCNet) for accurate traffic object detection in this study. In MDDCNet, a well-designed hybrid backbone with successive Multi-Scale Deformable Dilated Convolution (MSDDC) blocks and Mamba blocks enables hierarchical feature representation from local details to global semantics. Meanwhile, a Channel-Enhanced Feed-Forward Network (CE-FFN) is further devised to overcome the limited channel interaction capability of conventional feed-forward networks, whilst a Mamba-based Attention-Aggregating Feature Pyramid Network (A²FPN) is constructed to achieve enhanced multi-scale feature fusion and interaction. Extensive experimental results on public benchmark and real-world datasets demonstrate the superiority of our method over various advanced detectors. The code is available at https://github.com/Bettermea/MDDCNet.
[124] Can Vision Language Models Judge Action Quality? An Empirical Evaluation
Miguel Monte e Freitas,Rui Henriques,Ricardo Rei,Pedro Henrique Martins
Main category: cs.CV
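TL;DR: This paper systematically evaluates how vision language models (VLMs) actually perform on Action Quality Assessment (AQA). Current state-of-the-art models score only slightly above chance and show fundamental limitations such as prediction bias and sensitivity to linguistic framing, suggesting that VLMs have an inherent difficulty with fine-grained judgments of movement quality.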
Details
Motivation: Although vision language models (VLMs) hold great promise for Action Quality Assessment (AQA), their actual performance in this domain has not been systematically evaluated; a reliable baseline and an inventory of key failure modes are needed.
Method: Comprehensively evaluates several state-of-the-art VLMs (e.g., Gemini 3.1 Pro, Qwen3-VL, InternVL3.5) across activity domains (fitness, figure skating, diving), task settings, visual representations, and prompting strategies; analyzes prediction distributions to identify systematic biases; tries mitigation strategies such as contrastive task reformulation.
Result: All models perform only slightly better than random guessing on AQA. Adding skeleton information, grounding instructions, reasoning structures, or in-context learning yields only isolated gains. Two systematic biases emerge: a tendency to predict "correct execution" regardless of visual evidence, and susceptibility to superficial linguistic framing. Contrastive task reformulation helps little.
Conclusion: The current limitations of VLMs on AQA are not explained by the identifiable biases alone; they stem from a fundamental lack of ability to judge fine-grained movement quality. The study provides a rigorous baseline and a list of key failure modes for future VLM-based AQA research.
Abstract: Action Quality Assessment (AQA) has broad applications in physical therapy, sports coaching, and competitive judging. Although Vision Language Models (VLMs) hold considerable promise for AQA, their actual performance in this domain remains largely uncharacterised. We present a comprehensive evaluation of state-of-the-art VLMs across activity domains (e.g. fitness, figure skating, diving), tasks, representations, and prompting strategies. Baseline results reveal that Gemini 3.1 Pro, Qwen3-VL and InternVL3.5 models perform only marginally above random chance, and although strategies such as incorporation of skeleton information, grounding instructions, reasoning structures and in-context learning lead to isolated gains, none is consistently effective. Analysis of prediction distributions uncovers two systematic biases: a tendency to predict correct execution regardless of visual evidence, and a sensitivity to superficial linguistic framing. Reformulating tasks contrastively to mitigate these biases yields minimal improvement, suggesting that the models' limitations go beyond these biases, pointing to a fundamental difficulty with fine-grained movement quality assessment. Our findings establish a rigorous baseline for future VLM-based AQA research and provide an actionable outline for failure modes requiring mitigation prior to reliable real-world deployment.
[125] LINE: LLM-based Iterative Neuron Explanations for Vision Models
Vladimir Zaigrajew,Michał Piechota,Gaspar Sekula,Przemysław Biecek
Main category: cs.CV
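TL;DR: This paper proposes LINE, a training-free iterative method for open-vocabulary concept labeling that explains the concepts encoded by individual neurons in vision models. Working in a black-box setting, it combines a large language model with a text-to-image generator and uses activation history to guide concept proposal and refinement. This markedly improves concept labeling performance and also supports polysemanticity evaluation and visual explanations.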
Details
Motivation: Existing neuron concept labeling methods are limited to predefined concept vocabularies or produce overly specific descriptions, so they fail to capture higher-order, global concepts, which hampers understanding of deep neural network decision-making and assurance of AI safety.
Method: Proposes LINE, a training-free iterative framework for open-vocabulary concept labeling. In a strictly black-box setting, a large language model and a text-to-image generator form a closed loop that iteratively proposes and refines concepts based on the neuron's activation history.
Result: LINE reaches state-of-the-art performance across multiple model architectures, with AUC gains of up to 0.18 on ImageNet and 0.05 on Places365; on average it discovers 29% new concepts that large predefined vocabularies miss. It also provides a complete generation history, which supports polysemanticity evaluation and visual explanations that rival gradient-based activation maximization methods.
Conclusion: LINE is an efficient, general, and interpretable approach to neuron concept labeling that removes the dependence of earlier methods on vocabularies and training, offering a new tool for research on neural network interpretability and AI safety.
Abstract: Interpreting the concepts encoded by individual neurons in deep neural networks is a crucial step towards understanding their complex decision-making processes and ensuring AI safety. Despite recent progress in neuron labeling, existing methods often limit the search space to predefined concept vocabularies or produce overly specific descriptions that fail to capture higher-order, global concepts. We introduce LINE, a novel, training-free iterative approach tailored for open-vocabulary concept labeling in vision models. Operating in a strictly black-box setting, LINE leverages a large language model and a text-to-image generator to iteratively propose and refine concepts in a closed loop, guided by activation history. We demonstrate that LINE achieves state-of-the-art performance across multiple model architectures, yielding AUC improvements of up to 0.18 on ImageNet and 0.05 on Places365, while discovering, on average, 29% of new concepts missed by massive predefined vocabularies. Beyond identifying the top concept, LINE provides a complete generation history, which enables polysemanticity evaluation and produces supporting visual explanations that rival gradient-dependent activation maximization methods.
[126] Entropy-Gradient Grounding: Training-Free Evidence Retrieval in Vision-Language Models
Marcel Gröpl,Jaewoo Jung,Seungryong Kim,Marc Pollefeys,Sunghwan Hong
Main category: cs.CV
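TL;DR: This paper proposes a training-free visual grounding method based on the model's intrinsic uncertainty. It backpropagates the entropy of the token prediction to the visual token embeddings to produce a relevance map, and combines this with multi-region extraction and an iterative zoom-and-reground strategy, markedly improving performance on fine visual details and multi-clue compositional queries.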
Details
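Motivation: Pretrained vision-language models still perform poorly on tasks that depend on tiny visual details or on combining clues from multiple regions (such as document understanding and compositional queries), so more precise and interpretable visual grounding is needed.
Method: Proposes a training-free, model-intrinsic grounding method: the entropy of the model's next-token prediction distribution serves as an uncertainty-based supervision signal and is backpropagated to the visual token embeddings to produce an entropy-gradient relevance map. Multiple coherent regions are then extracted and ranked to support multi-evidence queries, and an iterative zoom-and-reground procedure with a spatial-entropy stopping rule is added.
Result: Consistently outperforms existing methods on seven benchmarks across four VLM architectures, with the largest gains in detail-critical and high-resolution settings, while producing more interpretable evidence localizations.
Conclusion: Uncertainty-driven test-time evidence retrieval compensates for VLM weaknesses in fine-grained reasoning and multi-region understanding, and opens a path to model-intrinsic interpretability without extra training or external modules.
Abstract: Despite rapid progress, pretrained vision-language models still struggle when answers depend on tiny visual details or on combining clues spread across multiple regions, as in documents and compositional queries. We address this by framing grounding as test-time evidence retrieval: given a query, the model should actively identify where to look next to resolve ambiguity. To this end, we propose a training-free, model-intrinsic grounding method that uses uncertainty as supervision. Specifically, we compute the entropy of the model's next-token distribution and backpropagate it to the visual token embeddings to obtain an entropy-gradient relevance map, without auxiliary detectors or attention-map heuristics. We then extract and rank multiple coherent regions to support multi-evidence queries, and introduce an iterative zoom-and-reground procedure with a spatial-entropy stopping rule to avoid over-refinement. Experiments on seven benchmarks across four VLM architectures demonstrate consistent improvements over existing methods, with the largest gains on detail-critical and high-resolution settings, while also producing more interpretable evidence localizations.
[127] 3DrawAgent: Teaching LLM to Draw in 3D with Early Contrastive Experience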
Hongcan Xiao,Xinyue Xiao,Yilin Wang,Yue Zhang,Yonggang Qi
Main category: cs.CV
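TL;DR: This paper proposes 3DrawAgent, a training-free, language-driven framework for 3D sketch generation. A large language model (LLM) draws 3D Bezier curves segment by segment under geometric feedback, and a relative experience optimization strategy lets it improve itself without parameter updates.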
Details
Motivation: Generating 3D sketches from natural language remains challenging; existing methods mostly rely on supervised learning or parameter updates, and a training-free generation framework with spatial understanding is lacking.
Method: Proposes the 3DrawAgent framework: an LLM parses the text and generates 3D Bezier curves. A relative experience optimization strategy builds pairs of better and worse sketches from CLIP-based perceptual rewards and fine-grained LLM assessment, adapts the GRPO paradigm, and performs black-box reinforcement of 3D awareness without updating parameters.
Result: Experiments show that the method generates complex, coherent 3D Bezier sketches from diverse text prompts, exhibits emergent geometric reasoning, and generalizes to novel shapes.
Conclusion: 3DrawAgent establishes a training-free paradigm for 3D sketch intelligence and offers a scalable route, free of parameter updates, to improving a model's spatial understanding and generation ability.
Abstract: Sketching in 3D space enables expressive reasoning about shape, structure, and spatial relationships, yet generating 3D sketches through natural language remains a major challenge. In this work, we introduce 3DrawAgent, a training-free, language-driven framework for 3D sketch generation that leverages large language models (LLMs) to sequentially draw 3D Bezier curves under geometric feedback. Unlike prior 2D sketch agents, our method introduces a relative experience optimization strategy that adapts the recently proposed Group Reward Policy Optimization (GRPO) paradigm. Instead of relying on explicit ground-truth supervision, we construct pairwise comparisons among generated sketches, with each pair consisting of a relatively better and a worse result based on CLIP-based perceptual rewards and LLM-based fine-grained qualitative assessment. These experiences are then used to iteratively refine the prior knowledge of 3D drawing, enabling black-box reinforcement of the model's 3D awareness. This design allows our model to self-improve its spatial understanding and drawing quality without parameter updates. Experiments show that 3DrawAgent can generate complex and coherent 3D Bezier sketches from diverse textual prompts, exhibit emergent geometric reasoning, and generalize to novel shapes, establishing a new paradigm for advancing the field of training-free 3D sketch intelligence.
[128] What They Saw, Not Just Where They Looked: Semantic Scanpath Similarity via VLMs and NLP metric
Mohamed Amine Kerkouri,Marouane Tliba,Bin Wang,Aladine Chetouani,Ulas Bagci,Alessandro Bruno
Main category: cs.CV
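TL;DR: This paper proposes a semantic scanpath similarity framework based on vision-language models (VLMs): fixations are encoded as text descriptions and compared by semantic similarity, complementing traditional scanpath analysis that relies only on spatial and temporal alignment.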
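As background for the spatial baselines this entry compares against, the sketch below computes a plain dynamic time warping (DTW) cost between two scanpaths. It assumes each scanpath is a list of (x, y) fixation coordinates and uses Euclidean distance as the local cost; the function name `dtw_distance` and these choices are illustrative assumptions, not the configuration used in the paper.

```python
import math


def dtw_distance(path_a, path_b):
    """Dynamic time warping cost between two sequences of (x, y) fixations."""
    n, m = len(path_a), len(path_b)
    inf = float("inf")
    # cost[i][j] = best alignment cost of the first i and first j fixations.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(path_a[i - 1], path_b[j - 1])
            # Extend the cheapest of: step in a, step in b, or step in both.
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]
```

Because DTW only sees coordinates, two viewers who look at the same object from different positions can score as dissimilar, which is the gap a semantic measure is meant to cover.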
Details
Motivation: Existing scanpath similarity metrics focus on spatial and temporal alignment and neglect semantic equivalence between attended image regions.
Method: A vision-language model encodes each fixation under controlled visual context (patch-based and marker-based) into a concise text description, and the descriptions are aggregated into scanpath-level representations. Semantic similarity is then computed with embedding-based and lexical NLP metrics and compared with classical spatial measures such as MultiMatch and DTW.
Result: Experiments show that semantic similarity captures variance partially independent of geometric alignment, revealing cases that diverge spatially but agree closely in content; the context encoding strategy affects description fidelity and metric stability.
Conclusion: Multimodal foundation models enable interpretable, content-aware extensions of classical scanpath analysis, providing a complementary dimension for eye-movement research in the ETRA community.
Abstract: Scanpath similarity metrics are central to eye-movement research, yet existing methods predominantly evaluate spatial and temporal alignment while neglecting semantic equivalence between attended image regions. We present a semantic scanpath similarity framework that integrates vision-language models (VLMs) into eye-tracking analysis. Each fixation is encoded under controlled visual context (patch-based and marker-based strategies) and transformed into concise textual descriptions, which are aggregated into scanpath-level representations. Semantic similarity is then computed using embedding-based and lexical NLP metrics and compared against established spatial measures, including MultiMatch and DTW. Experiments on free-viewing eye-tracking data demonstrate that semantic similarity captures partially independent variance from geometric alignment, revealing cases of high content agreement despite spatial divergence. We further analyze the impact of contextual encoding on description fidelity and metric stability. Our findings suggest that multimodal foundation models enable interpretable, content-aware extensions of classical scanpath analysis, providing a complementary dimension for gaze research within the ETRA community.
[129] Adapting Foundation Models for Annotation-Efficient Adnexal Mass Segmentation in Cine Images
Francesca Fati,Alberto Rota,Adriana V. Gregory,Anna Catozzo,Maria C. Giuliano,Mrinal Dhar,Luigi De Vitis,Annie T. Packard,Francesco Multinu,Elena De Momi,Carrie L. Langstraat,Timothy L. Kline
Main category: cs.CV
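TL;DR: This paper proposes a label-efficient framework for adnexal mass segmentation in ultrasound, built on a DINOv3 vision transformer with a DPT-style decoder, that significantly outperforms conventional fully supervised CNN models in low-data and domain-shift settings.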
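For readers unfamiliar with the Dice score reported in this entry, it measures the overlap between a predicted mask and a reference mask. A minimal sketch, assuming flat binary masks given as equal-length sequences of 0/1 and treating two empty masks as a perfect match (the function name and both conventions are assumptions made here, not details from the paper):

```python
def dice_score(pred, target):
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks agree perfectly.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total
```

A score of 1.0 means the masks coincide and 0.0 means they share no foreground pixels.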
Details
Motivation: Ultrasound assessment of adnexal masses is subjective and shows large inter-observer variability; conventional fully supervised segmentation models need many pixel-level annotations and degrade under the domain shifts common in medical imaging.
Method: A pretrained DINOv3 backbone provides robust semantic priors, and a Dense Prediction Transformer (DPT)-style decoder reassembles multi-scale features to fuse global semantics with fine spatial detail.
Result: On 7,777 clinical ultrasound frames the method reaches a Dice score of 0.945 with better boundary accuracy (95th-percentile Hausdorff distance reduced by 11.4%); with only 25% of the data it still significantly outperforms fully supervised baselines.
Conclusion: Large-scale self-supervised pretrained foundation models can mitigate annotation scarcity and domain shift in medical image segmentation, offering an efficient, practical option for data-constrained clinical settings.
Abstract: Adnexal mass evaluation via ultrasound is a challenging clinical task, often hindered by subjective interpretation and significant inter-observer variability. While automated segmentation is a foundational step for quantitative risk assessment, traditional fully supervised convolutional architectures frequently require large amounts of pixel-level annotations and struggle with domain shifts common in medical imaging. In this work, we propose a label-efficient segmentation framework that leverages the robust semantic priors of a pretrained DINOv3 foundational vision transformer backbone. By integrating this backbone with a Dense Prediction Transformer (DPT)-style decoder, our model hierarchically reassembles multi-scale features to combine global semantic representations with fine-grained spatial details. Evaluated on a clinical dataset of 7,777 annotated frames from 112 patients, our method achieves state-of-the-art performance compared to established fully supervised baselines, including U-Net, U-Net++, DeepLabV3, and MAnet. Specifically, we obtain a Dice score of 0.945 and improved boundary adherence, reducing the 95th-percentile Hausdorff Distance by 11.4% relative to the strongest convolutional baseline. Furthermore, we conduct an extensive efficiency analysis demonstrating that our DINOv3-based approach retains significantly higher performance under data starvation regimes, maintaining strong results even when trained on only 25% of the data.
These results suggest that leveraging large-scale self-supervised foundations provides a promising and data-efficient solution for medical image segmentation in data-constrained clinical environments. Project Repository: https://github.com/FrancescaFati/MESA
[130] OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks
Wenbo Hu,Xin Chen,Yan Gao-Tian,Yihe Deng,Nanyun Peng,Kai-Wei Chang
Main category: cs.CV
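TL;DR: This paper proposes G²RPO, a new reinforcement learning training objective that uses non-linear distributional matching to make the advantage distribution converge to a standard normal distribution, addressing the widely varying reward topologies across visual tasks and the difficulty of balancing perception and reasoning in multimodal large models; combined with response length shaping and entropy shaping, it yields OpenVLThinkerV2, a high-performing open-source generalist multimodal model.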
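As a rough illustration of the Gaussian advantage idea summarized above, the sketch below maps one group of rewards to standard-normal quantiles by rank. The rank-based inverse-normal transform, the function name `gaussian_advantages`, and the averaged-rank handling of ties are assumptions made for illustration; this digest does not give the paper's exact distribution-matching rule.

```python
from statistics import NormalDist


def gaussian_advantages(rewards):
    """Map a group of rewards to N(0, 1) quantiles by rank (illustrative only)."""
    n = len(rewards)
    if n == 0:
        return []
    std_normal = NormalDist(0.0, 1.0)
    # Indices sorted by reward, so position 0 holds the lowest reward.
    order = sorted(range(n), key=lambda i: rewards[i])
    advantages = [0.0] * n
    pos = 0
    while pos < n:
        # Find the run of tied rewards starting at `pos`.
        end = pos
        while end + 1 < n and rewards[order[end + 1]] == rewards[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2.0
        # (rank + 0.5) / n keeps the quantile strictly inside (0, 1).
        value = std_normal.inv_cdf((mean_rank + 0.5) / n)
        for k in range(pos, end + 1):
            advantages[order[k]] = value
        pos = end + 1
    return advantages
```

In this sketch only the ordering of rewards matters, so a single extreme reward cannot inflate the scale of the whole group, and positive and negative advantages are mirror images of each other when there are no ties.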
Details
Motivation: For open-source multimodal generalist models, existing GRPO training is limited by highly heterogeneous reward topologies across visual tasks and by the difficulty of jointly optimizing fine-grained perception and multi-step reasoning.
Method: Gaussian GRPO (G²RPO) forces the advantage distribution to match a standard normal distribution. Two task-level shaping mechanisms are added: response length shaping (eliciting longer reasoning chains or enforcing direct outputs as needed) and entropy shaping (bounding exploration to prevent entropy collapse and entropy explosion).
Result: Across 18 diverse benchmarks the model outperforms strong open-source models and leading proprietary frontier models, supporting the robustness and generality of OpenVLThinkerV2.
Conclusion: G²RPO and its shaping mechanisms improve the stability of multimodal RL training and generalization across tasks, offering a scalable, robust training paradigm for open-source multimodal generalist models.
Abstract: Group Relative Policy Optimization (GRPO) has emerged as the de facto Reinforcement Learning (RL) objective driving recent advancements in Multimodal Large Language Models. However, extending this success to open-source multimodal generalist models remains heavily constrained by two primary challenges: the extreme variance in reward topologies across diverse visual tasks, and the inherent difficulty of balancing fine-grained perception with multi-step reasoning capabilities. To address these issues, we introduce Gaussian GRPO (G$^2$RPO), a novel RL training objective that replaces standard linear scaling with non-linear distributional matching. By mathematically forcing the advantage distribution of any given task to strictly converge to a standard normal distribution, $\mathcal{N}(0,1)$, G$^2$RPO theoretically ensures inter-task gradient equity, mitigates vulnerabilities to heavy-tail outliers, and offers symmetric updates for positive and negative rewards. Leveraging the enhanced training stability provided by G$^2$RPO, we introduce two task-level shaping mechanisms to seamlessly balance perception and reasoning. First, response length shaping dynamically elicits extended reasoning chains for complex queries while enforcing direct outputs to bolster visual grounding. Second, entropy shaping tightly bounds the model's exploration zone, effectively preventing both entropy collapse and entropy explosion. Integrating these methodologies, we present OpenVLThinkerV2, a highly robust, general-purpose multimodal model.
Extensive evaluations across 18 diverse benchmarks demonstrate its superior performance over strong open-source and leading proprietary frontier models.
[131] Guiding a Diffusion Model by Swapping Its Tokens
Weijia Zhang,Yuehao Liu,Shanyan Guan,Wu Ran,Yanhao Ge,Wei Li,Chao Ma
Main category: cs.CV
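TL;DR: This paper proposes Self-Swap Guidance (SSG), which swaps the most semantically dissimilar token latents along the spatial or channel dimension to obtain an effect similar to Classifier-Free Guidance (CFG); it applies to both conditional and unconditional diffusion generation and improves image fidelity and robustness.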
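A toy sketch of the token-swap guidance idea summarized above. It assumes tokens are plain Python lists of floats, swaps only the single least-similar pair by cosine similarity, and combines predictions with a CFG-style extrapolation `clean + scale * (clean - perturbed)`. The helper names, the single-pair swap, and the exact combination rule are illustrative assumptions, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def swap_most_dissimilar(tokens):
    """Return a copy of `tokens` with the least-similar pair exchanged."""
    best_pair, best_sim = None, float("inf")
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            sim = cosine(tokens[i], tokens[j])
            if sim < best_sim:
                best_pair, best_sim = (i, j), sim
    swapped = [list(t) for t in tokens]
    if best_pair is not None:
        i, j = best_pair
        swapped[i], swapped[j] = swapped[j], swapped[i]
    return swapped


def guided_prediction(clean, perturbed, scale):
    """Push the clean prediction away from the perturbed one (assumed form)."""
    return [c + scale * (c - p) for c, p in zip(clean, perturbed)]
```

In a real sampler, `perturbed` would be the model's prediction when run on the swapped token latents; here it is simply passed in, so the sketch shows only the swap and the combination step.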
Details
Motivation: CFG improves image quality but depends on text conditions, so it cannot be used for unconditional generation; existing condition-free guidance methods perturb coarsely, cause large side effects, and lack fine-grained control.
Method: Self-Swap Guidance (SSG) identifies and swaps the most semantically dissimilar token latents (along the spatial or channel dimension) during diffusion sampling to build a perturbed prediction, then uses the direction between it and the original prediction to guide sampling.
Result: On MS-COCO 2014/2017 and ImageNet, SSG outperforms existing condition-free guidance methods in image fidelity, prompt alignment, and robustness to perturbation strength, and works as a plug-in.
Conclusion: SSG extends the idea behind CFG to unconditional generation, providing fine-grained, controllable, general guidance that improves generation quality and stability across diffusion models.
Abstract: Classifier-Free Guidance (CFG) is a widely used inference-time technique to boost the image quality of diffusion models. Yet, its reliance on text conditions prevents its use in unconditional generation. We propose a simple method to enable CFG-like guidance for both conditional and unconditional generation. The key idea is to generate a perturbed prediction via simple token swap operations, and use the direction between it and the clean prediction to steer sampling towards higher-fidelity distributions. In practice, we swap pairs of most semantically dissimilar token latents in either spatial or channel dimensions. Unlike existing methods that apply perturbation in a global or less constrained manner, our approach selectively exchanges and recomposes token latents, allowing finer control over perturbation and its influence on generated samples. Experiments on MS-COCO 2014, MS-COCO 2017, and ImageNet datasets demonstrate that the proposed Self-Swap Guidance (SSG), when applied to popular diffusion models, outperforms previous condition-free methods in image fidelity and prompt alignment under different set-ups. Its fine-grained perturbation granularity also improves robustness, reducing side-effects across a wider range of perturbation strengths. Overall, SSG extends CFG to a broader scope of applications including both conditional and unconditional generation, and can be readily inserted into any diffusion model as a plug-in to gain immediate improvements.
[132] AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation
Ziwei Zhou,Zeyuan Lai,Rui Wang,Yifan Yang,Zhen Xing,Yuqing Yang,Qi Dai,Lili Qiu,Chong Luo
Main category: cs.CV
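TL;DR: This paper presents AVGen-Bench, a task-driven benchmark for text-to-audio-video (T2AV) generation, together with a multi-granular evaluation framework; it reveals marked weaknesses of current T2AV models in semantic reliability, such as text rendering, speech coherence, physical reasoning, and musical pitch control.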
Details
Motivation: Existing T2AV evaluation is fragmented and fails to capture the fine-grained joint audio-video correctness that realistic prompts require.
Method: AVGen-Bench is built from high-quality prompts covering 11 real-world categories; a multi-granular evaluation framework combines lightweight specialist models with multimodal large language models (MLLMs), spanning perceptual quality to fine-grained semantic controllability.
Result: Current T2AV models show strong audio-visual aesthetics but widespread, serious failures in text rendering, speech coherence, physical reasoning, and musical pitch control.
Conclusion: T2AV generation needs more precise, task-driven evaluation; AVGen-Bench offers a new standard and open resources to advance the field.
Abstract: Text-to-Audio-Video (T2AV) generation is rapidly becoming a core interface for media creation, yet its evaluation remains fragmented. Existing benchmarks largely assess audio and video in isolation or rely on coarse embedding similarity, failing to capture the fine-grained joint correctness required by realistic prompts. We introduce AVGen-Bench, a task-driven benchmark for T2AV generation featuring high-quality prompts across 11 real-world categories. To support comprehensive assessment, we propose a multi-granular evaluation framework that combines lightweight specialist models with Multimodal Large Language Models (MLLMs), enabling evaluation from perceptual quality to fine-grained semantic controllability. Our evaluation reveals a pronounced gap between strong audio-visual aesthetics and weak semantic reliability, including persistent failures in text rendering, speech coherence, physical reasoning, and a universal breakdown in musical pitch control. Code and benchmark resources are available at http://aka.ms/avgenbench.
[133] ABMAMBA: Multimodal Large Language Model with Aligned Hierarchical Bidirectional Scan for Efficient Video Captioning
Daichi Yashima,Shuhei Kurita,Yusuke Oda,Shuntaro Suzuki,Seitaro Otsuki,Komei Sugiura
Main category: cs.CV
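TL;DR: This paper proposes ABMamba, a fully open multimodal large language model with linear computational complexity for video captioning; by extending state space models and introducing an Aligned Hierarchical Bidirectional Scan module, it markedly improves efficiency and performance on long video sequences.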
Details
Motivation: Attention in existing Transformer-based methods scales quadratically with sequence length, which makes long video sequences computationally expensive and complex temporal dependencies hard to model.
Method: Aligned Hierarchical Bidirectional Scan Mamba (ABMamba) replaces Transformer attention with deep state space models and adds an Aligned Hierarchical Bidirectional Scan module that processes video at multiple temporal resolutions.
Result: On standard video captioning benchmarks such as VATEX and MSR-VTT, ABMamba is competitive with typical multimodal large models while achieving roughly three times higher throughput.
Conclusion: ABMamba offers a new paradigm for efficient, scalable video understanding and shows that linear-complexity state space models are feasible and advantageous in open multimodal large models.
Abstract: In this study, we focus on video captioning by fully open multimodal large language models (MLLMs). The comprehension of visual sequences is challenging because of their intricate temporal dependencies and substantial sequence length. The core attention mechanisms of existing Transformer-based approaches scale quadratically with the sequence length, making them computationally prohibitive. To address these limitations, we propose Aligned Hierarchical Bidirectional Scan Mamba (ABMamba), a fully open MLLM with linear computational complexity that enables the scalable processing of video sequences. ABMamba extends Deep State Space Models as its language backbone, replacing the costly quadratic attention mechanisms, and employs a novel Aligned Hierarchical Bidirectional Scan module that processes videos across multiple temporal resolutions. On standard video captioning benchmarks such as VATEX and MSR-VTT, ABMamba demonstrates competitive performance compared to typical MLLMs while achieving approximately three times higher throughput.
[134] Seeing but Not Thinking: Routing Distraction in Multimodal Mixture-of-Experts
Haolei Xu,Haiwen Hong,Hongxing Li,Rui Zhou,Yang Zhang,Longtao Huang,Hui Xue,Yongliang Shen,Weiming Lu,Yueting Zhuang
Main category: cs.CV
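TL;DR: This paper identifies a 'Seeing but Not Thinking' phenomenon in multimodal Mixture-of-Experts (MoE) models: they perceive image content accurately yet perform worse on visual reasoning than on the same problems given as pure text. The analysis shows that routing fails to adequately activate task-relevant reasoning experts for image inputs, and a routing-guided intervention improves visual reasoning performance.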
Details
Motivation: To address the anomaly that multimodal MoE models perform markedly worse on visual reasoning tasks than on pure-text tasks (Seeing but Not Thinking).
Method: Systematically analyze cross-modal semantic sharing and the layer-wise distribution of experts, propose the Routing Distraction hypothesis, and design a routing-guided intervention that strengthens domain expert activation.
Result: The method is validated on three multimodal MoE models and six benchmarks, with gains of up to 3.17% on complex visual reasoning tasks; domain expert identification is found to locate cognitive functions, which supports transfer across tasks.
Conclusion: Visual inputs cause routing to diverge in the middle layers, suppressing the activation of key reasoning experts; explicitly guiding the routing mitigates the problem and reveals a link between expert specialization and cognitive function in multimodal MoE.
Abstract: Multimodal Mixture-of-Experts (MoE) models have achieved remarkable performance on vision-language tasks. However, we identify a puzzling phenomenon termed Seeing but Not Thinking: models accurately perceive image content yet fail in subsequent reasoning, while correctly solving identical problems presented as pure text. Through systematic analysis, we first verify that cross-modal semantic sharing exists in MoE architectures, ruling out semantic alignment failure as the sole explanation. We then reveal that visual experts and domain experts exhibit layer-wise separation, with image inputs inducing significant routing divergence from text inputs in middle layers where domain experts concentrate. Based on these findings, we propose the Routing Distraction hypothesis: when processing visual inputs, the routing mechanism fails to adequately activate task-relevant reasoning experts. To validate this hypothesis, we design a routing-guided intervention method that enhances domain expert activation. Experiments on three multimodal MoE models across six benchmarks demonstrate consistent improvements, with gains of up to 3.17% on complex visual reasoning tasks. Our analysis further reveals that domain expert identification locates cognitive functions rather than sample-specific solutions, enabling effective transfer across tasks with different information structures.
[135] EEG2Vision: A Multimodal EEG-Based Framework for 2D Visual Reconstruction in Cognitive Neuroscience
Emanuele Balloni,Emanuele Frontoni,Chiara Matti,Marina Paolanti,Roberto Pierdicca,Emiliano Santarnecchi
Main category: cs.CV
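TL;DR: This paper proposes the EEG2Vision framework, which reconstructs high-quality images from low-density EEG signals using diffusion models and a prompt-guided post-processing mechanism, markedly improving visual quality and semantic consistency at low channel counts.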
Details
Motivation: Non-invasive electroencephalography (EEG) has low spatial resolution and high noise, which makes high-quality reconstruction of visual stimuli difficult, especially with realistic low-density electrode configurations.
Method: EEG2Vision is a modular, end-to-end EEG-to-image framework: an EEG-conditioned diffusion model produces an initial reconstruction; a multimodal large language model then extracts a semantic description that drives an image-to-image diffusion model to improve geometric and perceptual coherence. Performance is evaluated systematically at 128, 64, 32, and 24 channels.
Result: Reducing the channel count sharply lowers semantic decoding accuracy (e.g., 50-way Top-1 accuracy falls from 89% to 38%) but degrades reconstruction quality only slightly (FID rises from 76.77 to 80.51); the proposed boosting mechanism improves perceptual metrics in every configuration, with Inception Score gains of up to 9.71% in low-channel settings; a user study confirms a clear perceptual preference for boosted reconstructions.
Conclusion: EEG2Vision makes real-time brain-to-image applications on low-resolution EEG devices considerably more feasible and may help move such technology out of the laboratory.
Abstract: Reconstructing visual stimuli from non-invasive electroencephalography (EEG) remains challenging due to its low spatial resolution and high noise, particularly under realistic low-density electrode configurations. To address this, we present EEG2Vision, a modular, end-to-end EEG-to-image framework that systematically evaluates reconstruction performance across different EEG resolutions (128, 64, 32, and 24 channels) and enhances visual quality through a prompt-guided post-reconstruction boosting mechanism. Starting from EEG-conditioned diffusion reconstruction, the boosting stage uses a multimodal large language model to extract semantic descriptions and leverages image-to-image diffusion to refine geometry and perceptual coherence while preserving EEG-grounded structure. Our experiments show that semantic decoding accuracy degrades significantly with channel reduction (e.g., 50-way Top-1 Acc from 89% to 38%), while reconstruction quality decreases only slightly (e.g., FID from 76.77 to 80.51). The proposed boosting consistently improves perceptual metrics across all configurations, achieving up to 9.71% IS gains in low-channel settings. A user study confirms the clear perceptual preference for boosted reconstructions. The proposed approach significantly boosts the feasibility of real-time brain-2-image applications using low-resolution EEG devices, potentially unlocking this type of application outside laboratory settings.
[136] Brain3D: EEG-to-3D Decoding of Visual Representations via Multimodal Reasoning
Emanuele Balloni,Emanuele Frontoni,Chiara Matti,Marina Paolanti,Roberto Pierdicca,Emiliano Santarnecchi
Main category: cs.CV
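TL;DR: This paper proposes Brain3D, a multimodal EEG-to-3D reconstruction architecture based on EEG-to-image decoding; its staged pipeline (image generation, 3D description extraction, diffusion generation, single-image-to-3D) achieves geometry-aware, brain-signal-driven 3D reconstruction with good semantic alignment and geometric fidelity.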
Details
Motivation: Existing work focuses mainly on reconstructing 2D images from EEG, while reconstruction of 3D representations remains largely unexplored, which limits the potential of neural decoding for geometric understanding and for use across different contexts.
Method: The Brain3D pipeline (1) generates visually grounded images from EEG signals; (2) uses a multimodal large language model to extract structured, 3D-aware text descriptions; (3) conditions a diffusion model on those descriptions; and (4) feeds the generated image to a single-image-to-3D model that outputs a coherent 3D mesh. The pipeline avoids a direct EEG-to-3D mapping and relies on geometry-aware generative reasoning throughout.
Result: The method achieves up to 85.4% 10-way Top-1 EEG decoding accuracy and a CLIPScore of 0.648, and the reconstructions agree closely with the original stimuli both semantically and geometrically.
Conclusion: Brain3D supports the feasibility of multimodal EEG-driven 3D reconstruction and offers a scalable technical route for brain-computer interfaces in new applications such as 3D content generation.
Abstract: Decoding visual information from electroencephalography (EEG) has recently achieved promising results, primarily focusing on reconstructing two-dimensional (2D) images from brain activity. However, the reconstruction of three-dimensional (3D) representations remains largely unexplored. This limits the geometric understanding and reduces the applicability of neural decoding in different contexts. To address this gap, we propose Brain3D, a multimodal architecture for EEG-to-3D reconstruction based on EEG-to-image decoding. It progressively transforms neural representations into the 3D domain using geometry-aware generative reasoning. Our pipeline first produces visually grounded images from EEG signals, then employs a multimodal large language model to extract structured 3D-aware descriptions, which guide a diffusion-based generation stage whose outputs are finally converted into coherent 3D meshes via a single-image-to-3D model. By decomposing the problem into structured stages, the proposed approach avoids direct EEG-to-3D mappings and enables scalable brain-driven 3D generation. We conduct a comprehensive evaluation comparing the reconstructed 3D outputs against the original visual stimuli, assessing both semantic alignment and geometric fidelity. Experimental results demonstrate strong performance of the proposed architecture, achieving up to 85.4% 10-way Top-1 EEG decoding accuracy and 0.648 CLIPScore, supporting the feasibility of multimodal EEG-driven 3D reconstruction.
[137] AtlasOCR: Building the First Open-Source Darija OCR Model with Vision Language Models
Imane Momayiz,Soufiane Ait Elaouad,Abdeljalil Elmajjodi,Haitame Bouanane
Main category: cs.CV
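TL;DR: This paper presents AtlasOCR, the first open-source OCR model dedicated to Moroccan Arabic (Darija), fine-tuned from a 3B-parameter vision language model (VLM); it achieves state-of-the-art performance on the purpose-built AtlasOCRBench and the established KITAB-Bench benchmark.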
Details
Motivation: Darija (Moroccan Arabic) is rich in visual content but lacks dedicated OCR tools, and existing models struggle with its distinctive language and writing variants.
Method: Build a Darija-specific dataset that combines synthetic generation with OCRSmith and real-world data, fine-tune Qwen2.5-VL 3B parameter-efficiently with QLoRA and Unsloth, and run ablations on key hyperparameters.
Result: State-of-the-art performance on both the new AtlasOCRBench and the public KITAB-Bench, challenging larger models and showing robustness and generalization on Darija and standard Arabic OCR tasks.
Conclusion: AtlasOCR shows that lightweight, parameter-efficient fine-tuning is effective for OCR in low-resource dialects, and offers a reproducible, open-source template for OCR in smaller Arabic language varieties.
Abstract: Darija, the Moroccan Arabic dialect, is rich in visual content yet lacks specialized Optical Character Recognition (OCR) tools. This paper introduces AtlasOCR, the first open-source Darija OCR model built by fine-tuning a 3B parameter Vision Language Model (VLM). We detail our comprehensive approach, from curating a unique Darija-specific dataset leveraging both synthetic generation with our OCRSmith library and carefully sourced real-world data, to implementing efficient fine-tuning strategies. We utilize QLoRA and Unsloth for parameter-efficient training of Qwen2.5-VL 3B and present comprehensive ablation studies optimizing key hyperparameters. Our evaluation on the newly curated AtlasOCRBench and the established KITAB-Bench demonstrates state-of-the-art performance, challenging larger models and highlighting AtlasOCR's robustness and generalization capabilities for both Darija and standard Arabic OCR tasks.
[138] Tensor-Augmented Convolutional Neural Networks: Enhancing Expressivity with Generic Tensor Kernels
Chia-Wei Hsing,Wei-Lin Tu
Main category: cs.CV
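TL;DR: This paper proposes a physically-guided shallow tensor-augmented convolutional neural network (TACNN) that replaces conventional convolution kernels with generic tensors to increase representational capacity and capture high-order feature correlations; with only two layers it reaches accuracy on Fashion-MNIST comparable to VGG-16 and GoogLeNet.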
Details
Motivation: Conventional CNNs rely on deep architectures to capture complex correlations, which is computationally costly and hard to interpret; a model that is both highly expressive and structurally simple is needed.
Method: The tensor-augmented CNN (TACNN) generalizes convolution kernels to high-order tensors, so that each layer's output becomes a multilinear form able to model high-order correlations, and it exploits the correspondence between tensors and quantum superposition states in Hilbert space to increase expressivity.
Result: On Fashion-MNIST, a TACNN with only two convolution layers reaches 93.7% test accuracy, surpassing VGG-16 (93.5%) and matching GoogLeNet (93.7%).
Conclusion: Through a physics-inspired tensor design, TACNN markedly increases expressivity while keeping a shallow architecture, opening a path towards more efficient and interpretable deep learning models.
Abstract: Convolutional Neural Networks (CNNs) excel at extracting local features hierarchically, but their performance in capturing complex correlations hinges heavily on deep architectures, which are usually computationally demanding and difficult to interpret. To address these issues, we propose a physically-guided shallow model: tensor-augmented CNN (TACNN), which replaces conventional convolution kernels with generic tensors to enhance representational capacity. This choice is motivated by the fact that an order-$N$ tensor naturally encodes an arbitrary quantum superposition state in the Hilbert space of dimension $d^N$, where $d$ is the local physical dimension, thus offering substantially richer expressivity. Furthermore, in our design the convolution output of each layer becomes a multilinear form capable of capturing high-order feature correlations, thereby equipping a shallow multilayer architecture with an expressive power competitive with that of deep CNNs. On the Fashion-MNIST benchmark, TACNN demonstrates clear advantages over conventional CNNs, achieving remarkable accuracies with only a few layers. In particular, a TACNN with only two convolution layers attains a test accuracy of 93.7$\%$, surpassing or matching considerably deeper models such as VGG-16 (93.5$\%$) and GoogLeNet (93.7$\%$). These findings highlight TACNN as a promising framework that strengthens model expressivity while preserving architectural simplicity, paving the way towards more interpretable and efficient deep learning models.
[139] DinoRADE: Full Spectral Radar-Camera Fusion with Vision Foundation Model Features for Multi-class Object Detection in Adverse Weather
Christof Leitgeb,Thomas Puchleitner,Max Peter Ronecker,Daniel Watzenig
Main category: cs.CV
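TL;DR: This paper proposes DinoRADE, a radar-centered detection framework that combines FMCW radar tensors with DINOv3 vision features and fuses the modalities through deformable cross-attention; on the K-Radar dataset it targets detection of vulnerable road users (VRUs) in adverse weather, reports per-class results for five object classes, and outperforms recent radar-camera methods by 12.1%.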
Details
Motivation: Existing FMCW radar methods perform well in adverse weather but struggle to resolve fine spatial detail, which weakens detection of small and vulnerable road users (VRUs); in addition, VRU detection has not been evaluated systematically on adverse-weather datasets such as K-Radar.
Method: DinoRADE takes dense radar tensors as input, extracts image features with the DINOv3 vision foundation model, and uses deformable cross-attention to aggregate vision features around transformed reference points in the camera perspective, fusing radar and vision across modalities.
Result: A full evaluation covers all weather conditions in K-Radar and is among the first to report detection performance separately for five object classes; compared with existing single-class detection approaches and recent radar-camera methods, mAP improves by 12.1%.
Conclusion: DinoRADE supports the value of fusing a modern vision foundation model with high-resolution radar data, offering a route to robust, fine-grained multimodal perception in adverse weather, particularly for the safe detection of vulnerable road users.
Abstract: Reliable and weather-robust perception systems are essential for safe autonomous driving and typically employ multi-modal sensor configurations to achieve comprehensive environmental awareness. While recent automotive FMCW Radar-based approaches achieved remarkable performance on detection tasks in adverse weather conditions, they exhibited limitations in resolving fine-grained spatial details particularly critical for detecting smaller and vulnerable road users (VRUs). Furthermore, existing research has not adequately addressed VRU detection in adverse weather datasets such as K-Radar. We present DinoRADE, a Radar-centered detection pipeline that processes dense Radar tensors and aggregates vision features around transformed reference points in the camera perspective via deformable cross-attention. Vision features are provided by a DINOv3 Vision Foundation Model. We present a comprehensive performance evaluation on the K-Radar dataset in all weather conditions and are among the first to report detection performance individually for five object classes. Additionally, we compare our method with existing single-class detection approaches and outperform recent Radar-camera approaches by 12.1%. The code is available at https://github.com/chr-is-tof/RADE-Net.
[140] AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding
Handong Li,Zikang Liu,Longteng Guo,Tongtian Yue,Yepeng Tang,Xinxin Zhu,Chuanyang Zheng,Ziming Wang,Zhibin Wang,Jun Song,Cheng Yu,Bo Zheng,Jing Liu
Main category: cs.CV
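TL;DR: AdaSpark is an adaptive sparsity framework that lowers the computational cost of Video-LLMs on long videos by adaptively selecting video cubes and key tokens, while preserving fine-grained perception and long-range temporal modeling.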
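To make the idea of adaptive selection concrete, the sketch below shows a generic Top-p rule: keep the smallest set of items whose softmax probability mass reaches `p`. Treating per-cube relevance scores as a plain list, and the names `top_p_select` and `entropy`, are assumptions for illustration; how AdaSpark actually scores cubes and tokens is not described in this digest.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a non-empty list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(q * math.log(q) for q in probs if q > 0)


def top_p_select(scores, p):
    """Indices of the smallest high-probability set whose mass reaches p."""
    if not scores:
        return []
    probs = softmax(scores)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, mass = [], 0.0
    for i in ranked:
        chosen.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return sorted(chosen)
```

With a peaked (low-entropy) score distribution the rule keeps very few items, and with a flat (high-entropy) one it keeps many, so the compute budget follows the input rather than a fixed sparsity pattern.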
Details
Motivation: When handling long videos, existing efficiency methods either harm fine-grained perception by discarding information irreversibly or limit long-range temporal modeling through fixed sparse patterns.
Method: AdaSpark partitions video into 3D spatio-temporal cubes and uses two co-designed components, Adaptive Cube-Selective Attention (AdaS-Attn) and Adaptive Token-Selective FFN (AdaS-FFN), with an entropy-based Top-p mechanism that allocates compute dynamically.
Result: On hour-scale video benchmarks, FLOPs drop by up to 57% with performance comparable to dense models, while fine-grained perception and long-range dependency modeling are preserved.
Conclusion: AdaSpark strikes a better balance between computational efficiency and modeling capacity, offering a new paradigm for efficient Video-LLMs.
Abstract: Processing long-form videos with Video Large Language Models (Video-LLMs) is computationally prohibitive. Current efficiency methods often compromise fine-grained perception through irreversible information disposal or inhibit long-range temporal modeling via rigid, predefined sparse patterns. This paper introduces AdaSpark, an adaptive sparsity framework designed to address these limitations. AdaSpark first partitions video inputs into 3D spatio-temporal cubes. It then employs two co-designed, context-aware components: (1) Adaptive Cube-Selective Attention (AdaS-Attn), which adaptively selects a subset of relevant video cubes to attend for each query token, and (2) Adaptive Token-Selective FFN (AdaS-FFN), which selectively processes only the most salient tokens within each cube. An entropy-based (Top-p) selection mechanism adaptively allocates computational resources based on input complexity. Experiments demonstrate that AdaSpark significantly reduces computational load, by up to 57% in FLOPs, while maintaining comparable performance to dense models and preserving fine-grained, long-range dependencies, as validated on challenging hour-scale video benchmarks.
[141] DiffVC: A Non-autoregressive Framework Based on Diffusion Model for Video Captioning
Junbo Wang,Liangyu Fu,Yuke Li,Yining Zhu,Ya Jing,Xuecheng Wu,Jiangbin Zheng
Main category: cs.CV
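TL;DR: This paper proposes DiffVC, a non-autoregressive video captioning framework based on diffusion models; parallel decoding speeds up generation and reduces cumulative error, and a discriminative conditional diffusion model strengthens multimodal interaction, improving caption quality while keeping generation efficient.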
Details
Motivation: Existing autoregressive video captioning methods generate slowly and accumulate error, while non-autoregressive methods produce lower-quality captions because they model multimodal interaction insufficiently.
Method: DiffVC is a non-autoregressive captioning framework built on a discriminative conditional diffusion model. The video is first encoded into a visual representation. During training, Gaussian noise is added to the text representation of the ground-truth caption, and a discriminative denoiser conditioned on the visual representation reconstructs the text representation, which is then passed to a non-autoregressive language model to generate the caption. At inference, noise is sampled directly from a Gaussian distribution.
Result: On MSVD, MSR-VTT, and VATEX, DiffVC outperforms previous non-autoregressive methods and is comparable to autoregressive methods, with maximum gains of 9.9 in CIDEr and 2.6 in BLEU@4, and faster generation.
Conclusion: DiffVC balances generation efficiency and quality, supporting the effectiveness and potential of diffusion models for non-autoregressive video captioning.
Abstract: Current video captioning methods usually use an encoder-decoder structure to generate text autoregressively. However, autoregressive methods have inherent limitations such as slow generation speed and large cumulative error. Furthermore, the few non-autoregressive counterparts suffer from deficiencies in generation quality due to the lack of sufficient multimodal interaction modeling. Therefore, we propose a non-autoregressive framework based on Diffusion model for Video Captioning (DiffVC) to address these issues. Its parallel decoding can effectively solve the problems of generation speed and cumulative error. At the same time, our proposed discriminative conditional Diffusion Model can generate higher-quality textual descriptions. Specifically, we first encode the video into a visual representation. During training, Gaussian noise is added to the textual representation of the ground-truth caption. Then, a new textual representation is generated via the discriminative denoiser with the visual representation as a conditional constraint. Finally, we input the new textual representation into a non-autoregressive language model to generate captions. During inference, we directly sample noise from the Gaussian distribution for generation. Experiments on MSVD, MSR-VTT, and VATEX show that our method can outperform previous non-autoregressive methods and achieve comparable performance to autoregressive methods, e.g., it achieved a maximum improvement of 9.9 on the CIDEr and improvement of 2.6 on the B@4, while having faster generation speed.
The source code will be available soon.
[142] Coordinate-Based Dual-Constrained Autoregressive Motion Generation
Kang Ding,Hongsong Wang,Jie Gui,Liang Wang
Main category: cs.CV
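TL;DR: This paper proposes CDAMD, a coordinate-based dual-constrained autoregressive motion generation framework that combines the strengths of autoregressive and diffusion models to address error amplification and mode collapse, reaching state-of-the-art performance in text-to-motion generation and editing.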
Details
Motivation: Diffusion models suffer from error amplification during noise prediction, while autoregressive models exhibit mode collapse caused by motion discretization; a method that offers both high fidelity and semantic consistency is needed.
Method: The CDAMD framework takes motion coordinates as input and follows the autoregressive paradigm; diffusion-inspired multi-layer perceptrons improve motion fidelity, and a Dual-Constrained Causal Mask treats motion tokens as priors and concatenates them with textual encodings.
Result: On newly established coordinate-based motion synthesis benchmarks (text-to-motion generation and motion editing), the method achieves state-of-the-art fidelity and semantic consistency.
Conclusion: CDAMD combines the strengths of autoregressive and diffusion modeling, supports the value of modeling directly in coordinate space, and provides a new approach and practical benchmarks for text-driven motion generation.
Abstract: Text-to-motion generation has attracted increasing attention in the research community recently, with potential applications in animation, virtual reality, robotics, and human-computer interaction. Diffusion and autoregressive models are two popular and parallel research directions for text-to-motion generation. However, diffusion models often suffer from error amplification during noise prediction, while autoregressive models exhibit mode collapse due to motion discretization. To address these limitations, we propose a flexible, high-fidelity, and semantically faithful text-to-motion framework, named Coordinate-based Dual-constrained Autoregressive Motion Generation (CDAMD). With motion coordinates as input, CDAMD follows the autoregressive paradigm and leverages diffusion-inspired multi-layer perceptrons to enhance the fidelity of predicted motions. Furthermore, a Dual-Constrained Causal Mask is introduced to guide autoregressive generation, where motion tokens act as priors and are concatenated with textual encodings. Since there is limited work on coordinate-based motion synthesis, we establish new benchmarks for both text-to-motion generation and motion editing. Experimental results demonstrate that our approach achieves state-of-the-art performance in terms of both fidelity and semantic consistency on these benchmarks.
[143] EPIR: An Efficient Patch Tokenization, Integration and Representation Framework for Micro-expression Recognition
Junbo Wang,Liangyu Fu,Yuke Li,Yining Zhu,Xuecheng Wu,Kun Hu
Main category: cs.CV
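TL;DR: This paper proposes EPIR, an efficient micro-expression recognition framework whose dual norm shifted tokenization (DNSPT), token integration, and discriminative token extraction modules reduce computational complexity while improving recognition, achieving state-of-the-art results on several public datasets.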
Details
Motivation: Existing Transformer-based micro-expression recognition methods have high computational complexity and struggle to learn effective representations from small datasets.
Method: The EPIR framework comprises (1) a dual norm shifted tokenization (DNSPT) module that models spatial relationships between pixels in the face region; (2) a token integration module that reduces the number of tokens across cascaded Transformer blocks without losing information; and (3) a discriminative token extractor that combines improved attention with a dynamic token selection module (DTSM) to capture key micro-expression features.
Result: EPIR significantly outperforms state-of-the-art methods on CASME II, SAMM, SMIC, and CAS(ME)$^3$, for example a 9.6% UF1 gain on CAS(ME)$^3$ and a 4.58% UAR gain on SMIC.
Conclusion: EPIR keeps recognition performance high while clearly lowering computational cost, easing the tension between small datasets and high model complexity and offering a new approach to micro-expression recognition.
Abstract: Micro-expression recognition can obtain the real emotion of the individual at the current moment. Although deep learning-based methods, especially Transformer-based methods, have achieved impressive results, these methods have high computational complexity due to the large number of tokens in the multi-head self-attention. In addition, the existing micro-expression datasets are small-scale, which makes it difficult for Transformer-based models to learn effective micro-expression representations. Therefore, we propose a novel Efficient Patch tokenization, Integration and Representation framework (EPIR), which can balance high recognition performance and low computational complexity. Specifically, we first propose a dual norm shifted tokenization (DNSPT) module to learn the spatial relationship between neighboring pixels in the face region, which is implemented by a refined spatial transformation and dual norm projection. Then, we propose a token integration module to integrate partial tokens among multiple cascaded Transformer blocks, thereby reducing the number of tokens without information loss. Furthermore, we design a discriminative token extractor, which first improves the attention in the Transformer block to reduce the unnecessary focus of the attention calculation on self-tokens, and uses the dynamic token selection module (DTSM) to select key tokens, thereby capturing more discriminative micro-expression representations. We conduct extensive experiments on four popular public datasets (i.e., CASME II, SAMM, SMIC, and CAS(ME)$^3$).
The experimental results show that our method achieves significant performance gains over the state-of-the-art methods, such as 9.6% improvement on the CAS(ME)$^3$ dataset in terms of UF1 and 4.58% improvement on the SMIC dataset in terms of UAR metric.
[144] OV-Stitcher: A Global Context-Aware Framework for Training-Free Open-Vocabulary Semantic Segmentation
Seungjae Moon,Seunghyun Oh,Youngmin Ro
Main category: cs.CV
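TL;DR: This paper proposes OV-Stitcher, a training-free open-vocabulary semantic segmentation framework that stitches sub-image features inside the final encoder block to enable global attention, improving context aggregation and segmentation consistency.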
Details
Motivation: Existing training-free open-vocabulary semantic segmentation methods are constrained by the input resolution of pretrained encoders; their reliance on a sliding-window strategy leads to local processing without global attention or contextual reasoning.
Method: OV-Stitcher directly stitches the sub-image features extracted by the sliding window inside the final encoder block and reconstructs the attention representations, enabling global attention within that block.
Result: Evaluated on eight benchmarks, mIoU improves from 48.7 to 50.7, clearly surpassing prior training-free baselines.
Conclusion: OV-Stitcher is a scalable and efficient training-free open-vocabulary segmentation approach that effectively mitigates fragmented features and missing context.
Abstract: Training-free open-vocabulary semantic segmentation (TF-OVSS) has recently attracted attention for its ability to perform dense prediction by leveraging the pretrained knowledge of large vision and vision-language models, without requiring additional training. However, due to the limited input resolution of these pretrained encoders, existing TF-OVSS methods commonly adopt a sliding-window strategy that processes cropped sub-images independently. While effective for managing high-resolution inputs, this approach prevents global attention over the full image, leading to fragmented feature representations and limited contextual reasoning. We propose OV-Stitcher, a training-free framework that addresses this limitation by stitching fragmented sub-image features directly within the final encoder block. By reconstructing attention representations from fragmented sub-image features, OV-Stitcher enables global attention within the final encoder block, producing coherent context aggregation and spatially consistent, semantically aligned segmentation maps. Extensive evaluations across eight benchmarks demonstrate that OV-Stitcher establishes a scalable and effective solution for open-vocabulary segmentation, achieving a notable improvement in mean Intersection over Union (mIoU) from 48.7 to 50.7 compared with prior training-free baselines.
[145] Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator
Luozheng Qin,Jia Gong,Qian Qiao,Tianjiao Li,Li Xu,Haoyu Pan,Chao Qu,Zhiyu Tan,Hao Li
Main category: cs.CV
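TL;DR: This paper proposes Uni-ViGU, a framework that unifies video understanding and generation on top of a video generation model; through unified continuous/discrete flow matching, a modality-driven MoE structure, and a bidirectional training mechanism, it achieves efficient synergy and competitive results on both kinds of tasks.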
Details
Motivation: Visual generation, especially for video, is far more computationally expensive than understanding, so conventional understanding-centric multimodal large models are hard to extend efficiently to generation; a generation-centric paradigm is needed.
Method: Uni-ViGU comprises: 1) a unified flow matching method, continuous for video and discrete for text; 2) a modality-driven MoE architecture that adds lightweight text generation capacity to the Transformer while preserving video generation priors; 3) a bidirectional training mechanism with Knowledge Recall (reconstructing input prompts) and Capability Refinement (fine-tuning on detailed captions).
Result: Competitive performance on multiple video generation and understanding benchmarks, validating the effectiveness and scalability of generation-centric architectures for unified multimodal intelligence.
Conclusion: Generation-based unified architectures such as Uni-ViGU are a viable and important path toward efficient, scalable unified multimodal models, moving beyond the limits of the understanding-first paradigm.
Abstract: Unified multimodal models integrating visual understanding and generation face a fundamental challenge: visual generation incurs substantially higher computational costs than understanding, particularly for video. This imbalance motivates us to invert the conventional paradigm: rather than extending understanding-centric MLLMs to support generation, we propose Uni-ViGU, a framework that unifies video generation and understanding by extending a video generator as the foundation. We introduce a unified flow method that performs continuous flow matching for video and discrete flow matching for text within a single process, enabling coherent multimodal generation. We further propose a modality-driven MoE-based framework that augments Transformer blocks with lightweight layers for text generation while preserving generative priors. To repurpose generation knowledge for understanding, we design a bidirectional training mechanism with two stages: Knowledge Recall reconstructs input prompts to leverage learned text-video correspondences, while Capability Refinement fine-tunes on detailed captions to establish discriminative shared representations. Experiments demonstrate that Uni-ViGU achieves competitive performance on both video generation and understanding, validating generation-centric architectures as a scalable path toward unified multimodal intelligence. Project Page and Code: https://fr0zencrane.github.io/uni-vigu-page/.
[146] PolySLGen: Online Multimodal Speaking-Listening Reaction Generation in Polyadic Interaction
Zhi-Yi Lin,Thomas Markhorst,Jouh Yeong Chew,Xucong Zhang
Main category: cs.CV
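TL;DR: This paper proposes PolySLGen, a framework that generates multimodal reactions (speech, body motion, and speaking state) for a target participant in polyadic interactions; a pose fusion module and a social cue encoder model the group interaction, markedly improving the contextual appropriateness, temporal coherence, and realism of the reactions.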
Details
Motivation: Existing methods are limited to single-modality or speaking-only reactions and mostly target dyadic interactions; they overlook nonverbal cues and the complex dynamics of polyadic interaction, making them unsuitable for realistic social scenarios.
Method: PolySLGen is an online framework with a pose fusion module and a social cue encoder that jointly aggregate the group's motion and social signals to generate the target participant's speech, body motion, and speaking state score.
Result: Experiments show that PolySLGen outperforms several adapted baselines and SOTA methods in motion quality, motion-speech alignment, speaking state prediction, and human-perceived realism.
Conclusion: PolySLGen effectively models polyadic multimodal interaction and offers a new paradigm for reaction generation by embodied AI in natural group interaction.
Abstract: Human-like multimodal reaction generation is essential for natural group interactions between humans and embodied AI. However, existing approaches are limited to single-modality or speaking-only responses in dyadic interactions, making them unsuitable for realistic social scenarios. Many also overlook nonverbal cues and complex dynamics of polyadic interactions, both critical for engagement and conversational coherence. In this work, we present PolySLGen, an online framework for Polyadic multimodal Speaking and Listening reaction Generation. Given past conversation and motion from all participants, PolySLGen generates a future speaking or listening reaction for a target participant, including speech, body motion, and speaking state score. To model group interactions effectively, we propose a pose fusion module and a social cue encoder that jointly aggregate motion and social signals from the group. Extensive experiments, along with quantitative and qualitative evaluations, show that PolySLGen produces contextually appropriate and temporally coherent multimodal reactions, outperforming several adapted and state-of-the-art baselines in motion quality, motion-speech alignment, speaking state prediction, and human-perceived realism.
[147] Bag of Bags: Adaptive Visual Vocabularies for Genizah Join Image Retrieval
Sharva Gogawale,Gal Grudka,Daria Vasyutinsky-Shapira,Omer Ventura,Berat Kurar-Barakat,Nachum Dershowitz
Main category: cs.CV
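TL;DR: This paper proposes Bag of Bags (BoB), an image-level representation for manuscript fragment join retrieval; using local visual vocabularies and set-to-set distances, it outperforms classical Bag of Words baselines on Cairo Genizah data.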
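As a rough illustration of the set-to-set comparison this entry describes, the sketch below computes a symmetric Chamfer distance between two small sets of prototype vectors and ranks candidate fragments by it. The averaging convention, the Euclidean metric, and every name used here (`chamfer_distance`, `rank_by_chamfer`, the toy gallery) are assumptions made for illustration; the paper's exact formulation may differ.

```python
import math


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two non-empty sets of vectors."""
    def one_way(src, dst):
        # Mean distance from each source prototype to its nearest neighbour in dst.
        return sum(min(math.dist(s, d) for d in dst) for s in src) / len(src)

    return one_way(a, b) + one_way(b, a)


def rank_by_chamfer(query, gallery):
    """Return gallery keys sorted from most to least similar to the query."""
    return sorted(gallery, key=lambda name: chamfer_distance(query, gallery[name]))


# Toy per-image "vocabularies" (hypothetical 2-D prototypes, not real data).
query = [(0.0, 0.0), (1.0, 0.0)]
gallery = {
    "same_manuscript": [(0.0, 0.1), (1.0, 0.1)],
    "other_manuscript": [(5.0, 5.0), (6.0, 5.0)],
}
ranking = rank_by_chamfer(query, gallery)
```

Because each image carries its own small vocabulary, the comparison cost grows with the product of the two vocabulary sizes, which is one reason a cheaper shortlist stage before reranking can be attractive.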
Details
Motivation: To address manuscript join retrieval, i.e., given an image of a fragment, retrieve other fragments from the same manuscript, which matters for the restoration and study of historical documents.
Method: The Bag of Bags (BoB) representation trains a sparse convolutional autoencoder on binarized fragment patches, encodes the connected components of each page, clusters the embeddings of each image with k-means, and compares images with set-to-set distances such as Chamfer; a mass-weighted BoB-OT variant with an approximation guarantee is also introduced, and a BoW shortlist followed by BoB-OT reranking forms a practical two-stage pipeline.
Result: On Cairo Genizah fragments, the best BoB variant (Chamfer) reaches Hit@1 = 0.78 and MRR = 0.84, a 6.1% relative improvement in top-1 accuracy over the strongest BoW baseline (BoW-RawPatches-$\chi^2$); BoB-OT comes with a formal approximation guarantee, and the two-stage pipeline balances retrieval strength and computational cost.
Conclusion: BoB outperforms classical BoW for manuscript join retrieval; its local vocabularies and set matching better fit the visual heterogeneity of fragments, and the two-stage strategy supports scaling to larger manuscript collections.
Abstract: A join is a set of manuscript fragments identified as originally emanating from the same manuscript. We study manuscript join retrieval: Given a query image of a fragment, retrieve other fragments originating from the same physical manuscript. We propose Bag of Bags (BoB), an image-level representation that replaces the global-level visual codebook of classical Bag of Words (BoW) with a fragment-specific vocabulary of local visual words. Our pipeline trains a sparse convolutional autoencoder on binarized fragment patches, encodes connected components from each page, clusters the resulting embeddings with per-image $k$-means, and compares images using set-to-set distances between their local vocabularies. Evaluated on fragments from the Cairo Genizah, the best BoB variant (viz. Chamfer) achieves Hit@1 of 0.78 and MRR of 0.84, compared to 0.74 and 0.80, respectively, for the strongest BoW baseline (BoW-RawPatches-$\chi^2$), a 6.1% relative improvement in top-1 accuracy. We furthermore study a mass-weighted BoB-OT variant that incorporates cluster population into prototype matching and present a formal approximation guarantee bounding its deviation from full component-level optimal transport. A two-stage pipeline using a BoW shortlist followed by BoB-OT reranking provides a practical compromise between retrieval strength and computational cost, supporting applicability to larger manuscript collections.
[148] Face-D$^2$CL: Multi-Domain Synergistic Representation with Dual Continual Learning for Facial DeepFake Detection
Yushuo Zhang,Yu Cheng,Yongkang Hu,Jiuan Zhou,Jiawei Chen,Yuan Xie,Zhaoxia Yin
Main category: cs.CV
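TL;DR: This paper proposes the Face-D²CL framework, which uses multi-domain synergistic representation and a dual continual learning mechanism (EWC + OGC) to address insufficient feature representation and catastrophic forgetting in facial DeepFake detection; without replaying historical data, it surpasses existing SOTA methods in both stability and plasticity.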
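To make the two continual learning ingredients named in this entry more concrete, here is a minimal sketch of a standard EWC penalty and a generic orthogonal gradient projection. It is not the authors' implementation: the plain-list parameters, the single Fisher vector (the paper distinguishes importance for real versus fake samples), the assumption that the protected directions are orthonormal, and the function names are all illustrative assumptions.

```python
def ewc_penalty(params, anchor_params, fisher, lam=1.0):
    """Standard EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2."""
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher)
    )


def project_orthogonal(grad, directions):
    """Remove from grad its components along each direction.

    Assumes the directions are orthonormal (an assumption of this sketch).
    """
    out = list(grad)
    for u in directions:
        dot = sum(g * ui for g, ui in zip(out, u))
        out = [g - dot * ui for g, ui in zip(out, u)]
    return out


# Toy numbers: only the first parameter moved, and it has high importance.
penalty = ewc_penalty([1.0, 2.0], [0.0, 2.0], fisher=[4.0, 1.0], lam=2.0)
# Toy gradient with the first axis protected by earlier tasks.
projected = project_orthogonal([3.0, 4.0], directions=[[1.0, 0.0]])
```

The penalty discourages moving parameters that mattered for earlier tasks, while the projection keeps new updates out of directions reserved for old knowledge; combining the two is one plausible way to trade stability against plasticity without replaying old data.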
Details
Motivation: Fast-advancing facial forgery techniques threaten public trust and information security, while existing continual learning methods face two bottlenecks in real-world scenarios: insufficient feature representation and catastrophic forgetting.
Method: Face-D²CL comprises: 1) multi-domain synergistic representation that fuses spatial and frequency-domain features to capture forgery traces comprehensively; 2) a dual continual learning mechanism combining Elastic Weight Consolidation (which distinguishes parameter importance for real versus fake samples) and an Orthogonal Gradient Constraint (which keeps task-specific adapter updates from interfering with earlier knowledge).
Result: A 60.7% relative reduction in average detection error rate and a 7.9% improvement in average AUC on unseen forgery domains, with stability and plasticity both better than current SOTA methods.
Conclusion: Face-D²CL balances resistance to forgetting with adaptability to emerging forgery paradigms, offering an efficient, replay-free approach to continual facial DeepFake detection.
Abstract: The rapid advancement of facial forgery techniques poses severe threats to public trust and information security, making facial DeepFake detection a critical research priority. Continual learning provides an effective approach to adapt facial DeepFake detection models to evolving forgery patterns. However, existing methods face two key bottlenecks in real-world continual learning scenarios: insufficient feature representation and catastrophic forgetting. To address these issues, we propose Face-D$^2$CL, a framework for facial DeepFake detection. It leverages multi-domain synergistic representation to fuse spatial and frequency-domain features for the comprehensive capture of diverse forgery traces, and employs a dual continual learning mechanism that combines Elastic Weight Consolidation (EWC), which distinguishes parameter importance for real versus fake samples, and Orthogonal Gradient Constraint (OGC), which ensures updates to task-specific adapters do not interfere with previously learned knowledge. This synergy enables the model to achieve a dynamic balance between robust anti-forgetting capabilities and agile adaptability to emerging facial forgery paradigms, all without relying on historical data replay. Extensive experiments demonstrate that our method surpasses current SOTA approaches in both stability and plasticity, achieving a 60.7% relative reduction in average detection error rate. On unseen forgery domains, it further improves the average detection AUC by 7.9% compared to the current SOTA method.
[149] T-Gated Adapter: A Lightweight Temporal Adapter for Vision-Language Medical Segmentation
Pranjal Khadka
Main category: cs.CV
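TL;DR: This paper proposes a temporal adapter for 3D medical image segmentation that injects adjacent-slice context into a vision-language model (VLM), substantially improving segmentation accuracy and cross-domain and cross-modality generalization.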
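The results in this entry are reported as mean Dice over several organs. For readers less familiar with the metric, the sketch below computes Dice for flat binary masks and averages it over organs. The tiny masks, the smoothing term `eps`, and the function names are illustrative assumptions, not the paper's evaluation code.

```python
def dice_score(pred, target, eps=1e-8):
    """Dice = 2|A and B| / (|A| + |B|) for flat binary masks with 0/1 values."""
    inter = sum(p * t for p, t in zip(pred, target))
    # eps avoids division by zero when both masks are empty.
    return (2.0 * inter + eps) / (sum(pred) + sum(target) + eps)


def mean_dice(per_organ_masks):
    """Average Dice over a list of (prediction, ground_truth) mask pairs."""
    scores = [dice_score(p, t) for p, t in per_organ_masks]
    return sum(scores) / len(scores)


# Two hypothetical organs on a four-voxel "volume".
liver = ([1, 1, 0, 0], [1, 0, 0, 0])
spleen = ([0, 1, 1, 0], [0, 1, 1, 0])
score = mean_dice([liver, spleen])
```

Here the liver prediction overlaps the ground truth on one voxel out of three labeled in total (Dice 2/3) and the spleen is perfect (Dice 1), so the mean is about 0.83.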
Details
Motivation: Conventional fully supervised 3D segmentation depends on large amounts of costly voxel-level annotation; applying existing VLMs directly to 2D slices yields poor anatomical continuity and noisy segmentations.
Method: A temporal adapter comprising: 1) a temporal Transformer that attends at the token level across adjacent slices in a fixed window; 2) a spatial context block that refines within-slice representations; 3) an adaptive gate that fuses temporal and single-slice features. It is trained on 30 labeled volumes from FLARE22.
Result: Mean Dice of 0.704 on FLARE22 (+0.206 over the baseline); zero-shot transfer to BTCV and AMOS22 improves by +0.210 and +0.230; in cross-modality (CT to MRI) testing, Dice reaches 0.366, exceeding the CT-only supervised 3D model DynUNet (0.224).
Conclusion: The temporal adapter exploits the VLM's general visual semantics and compensates for its slice-by-slice processing by explicitly modeling 3D continuity, generalizing well in few-shot, zero-shot, and cross-modality settings.
Abstract: Medical image segmentation traditionally relies on fully supervised 3D architectures that demand a large amount of dense, voxel-level annotations from clinical experts, which is a prohibitively expensive process. Vision Language Models (VLMs) offer a powerful alternative by leveraging broad visual semantic representations learned from billions of images. However, when applied independently to 2D slices of a 3D scan, these models often produce noisy and anatomically implausible segmentations that violate the inherent continuity of anatomical structures. We propose a temporal adapter that addresses this by injecting adjacent-slice context directly into the model's visual token representations. The adapter comprises a temporal transformer attending across a fixed context window at the token level, a spatial context block refining within-slice representations, and an adaptive gate balancing temporal and single-slice features. Training on 30 labeled volumes from the FLARE22 dataset, our method achieves a mean Dice of 0.704 across 13 abdominal organs with a gain of +0.206 over the baseline VLM trained with no temporal context. Zero-shot evaluation on BTCV and AMOS22 datasets yields consistent improvements of +0.210 and +0.230, with the average cross-domain performance drop reducing from 38.0% to 24.9%. Furthermore, in a cross-modality evaluation on AMOS22 MRI with neither model receiving any MRI supervision, our method achieves a mean Dice of 0.366, outperforming a fully supervised 3D baseline (DynUNet, 0.224) trained exclusively on CT, suggesting that CLIP's visual semantic representations generalize more gracefully across imaging modalities than convolutional features.
[150] OceanMAE: A Foundation Model for Ocean Remote Sensing
Viola-Joanna Stamer,Panagiotis Agrafiotis,Behnood Rasti,Begüm Demir
Main category: cs.CV
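TL;DR: This paper proposes OceanMAE, a masked autoencoder for ocean remote sensing that is pre-trained in a self-supervised manner on multispectral Sentinel-2 imagery combined with physically meaningful ocean descriptors, improving downstream marine segmentation and bathymetry estimation.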
Details
Motivation: Ocean remote sensing is limited by scarce labeled data and by the weak transferability of models pre-trained mainly on land imagery; self-supervised pre-training tailored to the ocean domain is needed.
Method: OceanMAE extends the standard MAE framework with multispectral Sentinel-2 data and physical ocean features as auxiliary inputs for masked autoencoding pre-training; downstream, a modified UNet handles marine pollutant segmentation and bathymetry regression.
Result: Experiments on MADOS, MARIDA, and MagicBathyNet show that OceanMAE clearly outperforms baselines on marine segmentation and is competitive, with task-dependent gains, on bathymetry; an ablation shows that adding ocean descriptors improves segmentation.
Conclusion: Physically informed, domain-aligned self-supervised pre-training effectively improves ocean remote sensing performance and offers a new paradigm for the field.
Abstract: Accurate ocean mapping is essential for applications such as bathymetry estimation, seabed characterization, marine litter detection, and ecosystem monitoring. However, ocean remote sensing (RS) remains constrained by limited labeled data and by the reduced transferability of models pre-trained mainly on land-dominated Earth observation imagery. In this paper, we propose OceanMAE, an ocean-specific masked autoencoder that extends standard MAE pre-training by integrating multispectral Sentinel-2 observations with physically meaningful ocean descriptors during self-supervised learning. By incorporating these auxiliary ocean features, OceanMAE is designed to learn more informative and ocean-aware latent representations from large-scale unlabeled data. To transfer these representations to downstream applications, we further employ a modified UNet-based framework for marine segmentation and bathymetry estimation. Pre-trained on the Hydro dataset, OceanMAE is evaluated on MADOS and MARIDA for marine pollutant and debris segmentation, and on MagicBathyNet for bathymetry regression. The experiments show that OceanMAE yields the strongest gains on marine segmentation, while bathymetry benefits are competitive and task-dependent. In addition, an ablation against a standard MAE on MARIDA indicates that incorporating auxiliary ocean descriptors during pre-training improves downstream segmentation quality. These findings highlight the value of physically informed and domain-aligned self-supervised pre-training for ocean RS. Code and weights are publicly available at https://git.tu-berlin.de/joanna.stamer/SSLORS2.
[151] On the Global Photometric Alignment for Low-Level Vision
Mingjia Li,Tianle Du,Hainuo Wang,Qiming Hu,Xiaojie Guo
Main category: cs.CV
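TL;DR: This paper proposes Photometric Alignment Loss (PAL), which uses closed-form affine color alignment to reduce interference from photometric inconsistency, improving the performance and generalization of low-level vision models.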
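The idea of discounting a global photometric mismatch before measuring error can be sketched with a one-channel least-squares fit. This is a deliberate simplification and not the authors' PAL: it fits a scalar gain and bias per channel rather than a full color matrix, it aligns the prediction to the target (the direction is an assumption), and the names `affine_fit` and `aligned_l1` are hypothetical.

```python
def affine_fit(pred, target, eps=1e-8):
    """Closed-form least-squares gain and bias so that gain * pred + bias fits target."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    cov_pt = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target)) / n
    gain = cov_pt / (var_p + eps)  # eps guards against a constant prediction
    bias = mean_t - gain * mean_p
    return gain, bias


def aligned_l1(pred, target):
    """L1 error measured after removing the best global affine mismatch."""
    gain, bias = affine_fit(pred, target)
    return sum(abs(gain * p + bias - t) for p, t in zip(pred, target)) / len(pred)


# Toy pixels: the target is the prediction under a global brightness change.
pred = [0.1, 0.2, 0.4, 0.8]
target = [2.0 * p + 0.1 for p in pred]
plain = sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)
aligned = aligned_l1(pred, target)
```

In this toy case the plain L1 error is large even though the content is identical, while the aligned error is essentially zero, which is the kind of nuisance discrepancy the entry says such a loss is meant to discount.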
Details
Motivation: Supervised low-level vision models rely on pixel-wise losses, but paired training sets exhibit per-pair photometric inconsistency (e.g., brightness, color, and white-balance differences), which hampers optimization and content restoration.
Method: Theoretical analysis proves that, under least-squares decomposition, the photometric and structural components of the prediction-target residual are orthogonal and that the photometric component dominates the gradient energy; the resulting PAL loss performs closed-form affine color alignment from covariance statistics and a small matrix inversion, suppressing irrelevant photometric differences.
Result: Across 6 tasks, 16 datasets, and 16 architectures, PAL consistently improves metrics and generalization.
Conclusion: PAL is an efficient, general modification of the supervision objective with negligible overhead that decouples photometric nuisance from content restoration and strengthens the training of low-level vision models.
Abstract: Supervised low-level vision models rely on pixel-wise losses against paired references, yet paired training sets exhibit per-pair photometric inconsistency; that is, different image pairs demand different global brightness, color, or white-balance mappings. This inconsistency enters through task-intrinsic photometric transfer (e.g., low-light enhancement) or unintended acquisition shifts (e.g., de-raining), and in either case causes an optimization pathology. Standard reconstruction losses allocate disproportionate gradient budget to conflicting per-pair photometric targets, crowding out content restoration. In this paper, we investigate this issue and prove that, under least-squares decomposition, the photometric and structural components of the prediction-target residual are orthogonal, and that the spatially dense photometric component dominates the gradient energy. Motivated by this analysis, we propose Photometric Alignment Loss (PAL). This flexible supervision objective discounts nuisance photometric discrepancy via closed-form affine color alignment while preserving restoration-relevant supervision, requiring only covariance statistics and tiny matrix inversion with negligible overhead. Across 6 tasks, 16 datasets, and 16 architectures, PAL consistently improves metrics and generalization. The implementation is in the appendix.
[152] MedVR: Annotation-Free Medical Visual Reasoning via Agentic Reinforcement Learning
Zheng Jiang,Heng Guo,Chengyu Fang,Changchen Xiao,Xinyang Hu,Lifeng Sun,Minfeng Xu
Main category: cs.CV
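TL;DR: This paper proposes MedVR, an annotation-free reinforcement learning framework that improves the visual reasoning of medical vision-language models (VLMs); with Entropy-guided Visual Regrounding (EVR) and Consensus-based Credit Assignment (CCA), it reaches SOTA performance on several medical VQA benchmarks.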
Details
Motivation: Existing medical VLMs are confined to text-only reasoning and struggle to incorporate visual evidence, leading to weak fine-grained visual analysis and visual hallucination that undermine clinical safety and reliability.
Method: MedVR has two core mechanisms: 1) Entropy-guided Visual Regrounding (EVR), which uses model uncertainty to direct visual exploration; 2) Consensus-based Credit Assignment (CCA), which extracts pseudo-supervision from agreement among rollouts. No human annotation of intermediate steps is required.
Result: State-of-the-art performance on multiple public medical VQA benchmarks, significantly surpassing existing methods, with improved robustness and interpretability.
Conclusion: MedVR achieves annotation-free, visually grounded reasoning and offers a new paradigm for the safe, trustworthy clinical deployment of medical AI.
Abstract: Medical Vision-Language Models (VLMs) hold immense promise for complex clinical tasks, but their reasoning capabilities are often constrained by text-only paradigms that fail to ground inferences in visual evidence. This limitation not only curtails performance on tasks requiring fine-grained visual analysis but also introduces risks of visual hallucination in safety-critical applications. Thus, we introduce MedVR, a novel reinforcement learning framework that enables annotation-free visual reasoning for medical VLMs. Its core innovation lies in two synergistic mechanisms: Entropy-guided Visual Regrounding (EVR) uses model uncertainty to direct exploration, while Consensus-based Credit Assignment (CCA) distills pseudo-supervision from rollout agreement. Without any human annotations for intermediate steps, MedVR achieves state-of-the-art performance on diverse public medical VQA benchmarks, significantly outperforming existing models. By learning to reason directly with visual evidence, MedVR promotes the robustness and transparency essential for accelerating the clinical deployment of medical AI.
[153] OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering
Yiduo Jia,Muzhi Zhu,Hao Zhong,Mingyu Liu,Yuling Xi,Hao Chen,Bin Qin,Yongjie Yang,Zhenbo Luo,Chunhua Shen
Main category: cs.CV
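TL;DR: This paper proposes OmniJigsaw, a generic self-supervised framework built on a temporal reordering proxy task to strengthen omni-modal (video-audio) understanding and collaborative reasoning. Three strategies (Joint Modality Integration, Sample-level Modality Selection, and Clip-level Modality Masking) promote cross-modal integration, and a two-stage data filtering pipeline adapts the framework to massive unannotated data. Experiments show significant gains on 15 benchmarks and both reveal and mitigate a "bi-modal shortcut phenomenon".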
Details
Motivation: To extend the reinforcement learning post-training paradigm to omni-modal models, improving video-audio understanding and collaborative reasoning together, and to address the "bi-modal shortcut phenomenon" that modality redundancy causes in the proxy task.
Method: OmniJigsaw centers on a proxy task of restoring shuffled audio-visual clips to chronological order; it uses three cross-modal integration strategies (Joint Modality Integration, Sample-level Modality Selection, and Clip-level Modality Masking) and a coarse-to-fine two-stage data filtering pipeline to improve data quality and adaptation efficiency.
Result: Substantial gains on 15 video, audio, and collaborative reasoning benchmarks; clip-level modality masking outperforms sample-level modality selection and mitigates the bi-modal shortcut phenomenon; the framework scales efficiently to large unannotated omni-modal data.
Conclusion: OmniJigsaw is a scalable, efficient self-supervised paradigm for omni-modal learning that offers a practical framework for joint audio-visual understanding and reasoning.
Abstract: To extend the reinforcement learning post-training paradigm to omni-modal models for concurrently bolstering video-audio understanding and collaborative reasoning, we propose OmniJigsaw, a generic self-supervised framework built upon a temporal reordering proxy task. Centered on the chronological reconstruction of shuffled audio-visual clips, this paradigm strategically orchestrates visual and auditory signals to compel cross-modal integration through three distinct strategies: Joint Modality Integration, Sample-level Modality Selection, and Clip-level Modality Masking. Recognizing that the efficacy of such proxy tasks is fundamentally tied to puzzle quality, we design a two-stage coarse-to-fine data filtering pipeline, which facilitates the efficient adaptation of OmniJigsaw to massive unannotated omni-modal data. Our analysis reveals a "bi-modal shortcut phenomenon" in joint modality integration and demonstrates that fine-grained clip-level modality masking mitigates this issue while outperforming sample-level modality selection. Extensive evaluations on 15 benchmarks show substantial gains in video, audio, and collaborative reasoning, validating OmniJigsaw as a scalable paradigm for self-supervised omni-modal learning.
[154] SciFigDetect: A Benchmark for AI-Generated Scientific Figure Detection
You Hu,Chenzhuo Zhao,Changfa Mo,Haotian Liu,Xiaobai Li
Main category: cs.CV
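TL;DR: This paper presents SciFigDetect, the first benchmark dataset for detecting AI-generated scientific figures, and shows that existing AI-generated image detection methods perform poorly in this setting.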
Details
Motivation: Modern multimodal generators can produce scientific figures at near-publishable quality, but existing AI-generated image detection methods target natural images and cannot reliably detect structured, text-dense, semantically tight scientific figures.
Method: An agent-based data pipeline retrieves licensed papers, performs multimodal understanding of text and figures, builds structured prompts, synthesizes candidate figures, and filters them through a review-driven refinement loop, yielding a benchmark that covers multiple figure categories, generation sources, and aligned real-synthetic pairs.
Result: Current detectors perform very poorly under zero-shot transfer, cross-generator generalization, and degraded-image settings, showing severe overfitting and insufficient robustness.
Conclusion: There is a substantial gap between existing AI-generated image detection capabilities and the new distribution of high-quality scientific figures; the benchmark is intended to promote robust, generalizable scientific-figure forensics.
Abstract: Modern multimodal generators can now produce scientific figures at near-publishable quality, creating a new challenge for visual forensics and research integrity. Unlike conventional AI-generated natural images, scientific figures are structured, text-dense, and tightly aligned with scholarly semantics, making them a distinct and difficult detection target. However, existing AI-generated image detection benchmarks and methods are almost entirely developed for open-domain imagery, leaving this setting largely unexplored. We present the first benchmark for AI-generated scientific figure detection. To construct it, we develop an agent-based data pipeline that retrieves licensed source papers, performs multimodal understanding of paper text and figures, builds structured prompts, synthesizes candidate figures, and filters them through a review-driven refinement loop. The resulting benchmark covers multiple figure categories, multiple generation sources, and aligned real-synthetic pairs. We benchmark representative detectors under zero-shot, cross-generator, and degraded-image settings. Results show that current methods fail dramatically in zero-shot transfer, exhibit strong generator-specific overfitting, and remain fragile under common post-processing corruptions. These findings reveal a substantial gap between existing AIGI detection capabilities and the emerging distribution of high-quality scientific figures. We hope this benchmark can serve as a foundation for future research on robust and generalizable scientific-figure forensics. The dataset is available at https://github.com/Joyce-yoyo/SciFigDetect.
[155] Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment
Blessing Agyei Kyem,Joshua Kofi Asamoah,Anthony Dontoh,Armstrong Aboah
Main category: cs.CV
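TL;DR: This paper proposes the PaveInstruct dataset and the PaveGPT model; domain-specific instruction tuning markedly improves vision-language models on pavement condition assessment, supports outputs that comply with engineering standards, and offers a general AI paradigm for infrastructure inspection.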
Details
Motivation: General-purpose vision-language models perform poorly in specialized engineering domains such as pavement inspection because they lack precise terminology, structured reasoning, and adherence to engineering standards.
Method: Build PaveInstruct, a dataset of 278,889 image-instruction-response pairs spanning 32 task types; instruction-tune a vision-language model on it to obtain PaveGPT; and compare its performance on perception, understanding, and reasoning tasks.
Result: PaveGPT improves on SOTA models by more than 20% in spatial grounding, reasoning, and generation tasks and produces ASTM D6433-compliant assessments.
Conclusion: Domain-specific instruction tuning enables vision-language models to perform high-precision engineering assessment, lets transportation agencies replace multiple specialized systems with one conversational assessment tool, and provides a transferable AI framework for inspecting bridges, railways, buildings, and other infrastructure.
Abstract: General-purpose vision-language models demonstrate strong performance in everyday domains but struggle with specialized technical fields requiring precise terminology, structured reasoning, and adherence to engineering standards. This work addresses whether domain-specific instruction tuning can enable comprehensive pavement condition assessment through vision-language models. PaveInstruct, a dataset containing 278,889 image-instruction-response pairs spanning 32 task types, was created by unifying annotations from nine heterogeneous pavement datasets. PaveGPT, a pavement foundation model trained on this dataset, was evaluated against state-of-the-art vision-language models across perception, understanding, and reasoning tasks. Instruction tuning transformed model capabilities, achieving improvements exceeding 20% in spatial grounding, reasoning, and generation tasks while producing ASTM D6433-compliant outputs. These results enable transportation agencies to deploy unified conversational assessment tools that replace multiple specialized systems, simplifying workflows and reducing technical expertise requirements. The approach establishes a pathway for developing instruction-driven AI systems across infrastructure domains including bridge inspection, railway maintenance, and building condition assessment.
[156] EditCaption: Human-Aligned Instruction Synthesis for Image Editing via Supervised Fine-Tuning and Direct Preference Optimization
Xiangyuan Wang,Honghao Cai,Yunhao Bai,Tianze Zhou,Haohua Chen,Yao Hu,Xu Tang,Yibo Chen,Wei Zhu
Main category: cs.CV
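TL;DR: This paper proposes EditCaption, a two-stage post-training pipeline that improves the accuracy of vision-language models (VLMs) in synthesizing image editing instructions, markedly reducing orientation, viewpoint, and attribute description errors and improving downstream model performance.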
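The second stage of the pipeline in this entry relies on direct preference optimization. As a reminder of what that objective looks like, the sketch below evaluates the standard DPO loss for one preference pair from sequence log-probabilities. The numeric log-probabilities, the `beta` value, and the function name are illustrative assumptions; this is the textbook loss, not the authors' training code.

```python
import math


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities.

    loss = -log sigmoid(beta * [(pol_chosen - ref_chosen)
                                - (pol_rejected - ref_rejected)])
    """
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    # -log(sigmoid(m)) equals softplus(-m); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities of a preferred and a rejected instruction.
good = dpo_loss(pol_chosen=-1.0, pol_rejected=-5.0, ref_chosen=-2.0, ref_rejected=-2.0)
bad = dpo_loss(pol_chosen=-5.0, pol_rejected=-1.0, ref_chosen=-2.0, ref_rejected=-2.0)
neutral = dpo_loss(0.0, 0.0, 0.0, 0.0)
```

The loss falls when the policy raises the preferred instruction's likelihood relative to the reference model more than it raises the rejected one's, which is how preference pairs that target specific error types can steer the model beyond supervised fine-tuning.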
Details
Motivation: High-quality image pairs with editing instructions are scarce, and instructions generated automatically by existing VLMs suffer from three systematic flaws (orientation confusion, viewpoint ambiguity, and coarse attribute description), leaving over 47% of instructions unusable for training.
Method: A two-stage EditCaption pipeline: stage 1 builds a 100K-sample supervised fine-tuning (SFT) dataset combining GLM automatic annotation, EditScore filtering, and human refinement; stage 2 collects 10K human preference pairs targeting the three error types and applies direct preference optimization (DPO).
Result: The fine-tuned Qwen3-VL surpasses mainstream models such as Gemini-3-Pro on Eval-400 and ByteMorph-Bench; the critical error rate falls from 47.75% to 23% and correctness rises from 41.75% to 66%.
Conclusion: EditCaption offers a scalable, high-fidelity, human-aligned route to synthesizing data for instruction-guided image editing, easing the training data bottleneck.
Abstract: High-quality training triplets (source-target image pairs with precise editing instructions) are a critical bottleneck for scaling instruction-guided image editing models. Vision-language models (VLMs) are widely used for automated instruction synthesis, but we identify three systematic failure modes in image-pair settings: orientation inconsistency (e.g., left/right confusion), viewpoint ambiguity, and insufficient fine-grained attribute description. Human evaluation shows that over 47% of instructions from strong baseline VLMs contain critical errors unusable for downstream training. We propose EditCaption, a scalable two-stage post-training pipeline for VLM-based instruction synthesis. Stage 1 builds a 100K supervised fine-tuning (SFT) dataset by combining GLM automatic annotation, EditScore-based filtering, and human refinement for spatial, directional, and attribute-level accuracy. Stage 2 collects 10K human preference pairs targeting the three failure modes and applies direct preference optimization (DPO) for alignment beyond SFT alone. On Eval-400, ByteMorph-Bench, and HQ-Edit, fine-tuned Qwen3-VL models outperform open-source baselines; the 235B model reaches 4.712 on Eval-400 (vs. Gemini-3-Pro 4.706, GPT-4.1 4.220, Kimi-K2.5 4.111) and 4.588 on ByteMorph-Bench (vs. Gemini-3-Pro 4.522, GPT-4.1 3.412). Human evaluation shows critical errors falling from 47.75% to 23% and correctness rising from 41.75% to 66%. The work offers a practical path to scalable, human-aligned instruction synthesis for image editing data.
[157] Generalization Under Scrutiny: Cross-Domain Detection Progresses, Pitfalls, and Persistent Challenges
Saniya M. Deshmukh,Kailash A. Hambarde,Hugo Proença
Main category: cs.CV
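TL;DR: This survey of cross-domain object detection (CDOD) systematically reviews the field's challenges, method taxonomy, domain shift propagation mechanisms, datasets, and evaluation standards, and identifies future research directions.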
Details
Motivation: Existing object detection models degrade significantly when deployed across domains, and the literature lacks a unified view of the fundamental challenges of domain shift and the effectiveness of adaptation strategies.
Method: The survey proposes a multi-stage problem formulation, builds a conceptual taxonomy based on adaptation paradigms, modeling assumptions, and detection pipeline components, and analyzes how domain shift propagates across detection stages.
Result: A unified analytical framework for CDOD that systematically organizes mainstream methods, common datasets, and evaluation protocols, and identifies key challenges and future directions.
Conclusion: Cross-domain object detection must be modeled separately from classification because its complexity stems from the coupled, multi-stage nature of the detection pipeline; the survey provides a theoretical basis and practical guidance for building more robust detection systems.
Abstract: Object detection models trained on a source domain often exhibit significant performance degradation when deployed in unseen target domains, due to variations in sensing conditions, environments, and data distributions. Hence, despite the recent breakthroughs in deep learning-based detection technology, cross-domain object detection (CDOD) remains a critical research area. Moreover, the existing literature remains fragmented, lacking a unified perspective on the structural challenges underlying domain shift and the effectiveness of adaptation strategies. This survey provides a comprehensive and systematic analysis of CDOD. We start from a problem formulation that highlights the multi-stage nature of object detection under domain shift. Then, we organize the existing methods through a conceptual taxonomy that categorizes approaches based on adaptation paradigms, modeling assumptions, and pipeline components. Furthermore, we analyze how domain shift propagates across detection stages and discuss why adaptation in object detection is inherently more complex than in classification. In addition, we review commonly used datasets, evaluation protocols, and benchmarking practices. Finally, we identify the key challenges and outline promising future research directions. Overall, this survey aims to provide a unified framework for understanding CDOD and to guide the development of more robust detection systems.
[158] $\oslash$ Source Models Leak What They Shouldn't $\nrightarrow$: Unlearning Zero-Shot Transfer in Domain Adaptation Through Adversarial Optimization
Arnav Devalapally,Poornima Jain,Kartik Srinivas,Vineeth N. Balasubramanian
Main category: cs.CV
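TL;DR: This paper proposes SCADA-UL, a new machine unlearning setting for the unintended leakage of source-exclusive classes in source-free domain adaptation (SFDA), and designs an unlearning method that combines adversarial sample generation with a rescaled labeling strategy, protecting source-domain privacy while preserving target-domain performance.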
Details
Motivation: Existing source-free domain adaptation (SFDA) methods unintentionally leak knowledge of source-exclusive classes into the target domain, creating a privacy risk, while conventional machine unlearning methods do not account for distribution shift and cannot handle this scenario.
Method: The paper defines the SCADA-UL unlearning setting; designs a new unlearning method based on adversarially generated "forget class" samples, a rescaled labeling strategy, and adversarial optimization; and extends it to two variants, a continual setting and one with unknown forget classes.
Result: The method clearly outperforms baselines in the SCADA-UL setting and achieves unlearning comparable to retraining on several benchmark datasets while preserving target-domain classification performance.
Conclusion: This work is the first to systematically define and address privacy-motivated unlearning of source-exclusive classes in SFDA, and it demonstrates the effectiveness, generality, and practicality of the proposed method for secure domain adaptation.
Abstract: The increasing adaptation of vision models across domains, such as satellite imagery and medical scans, has raised an emerging privacy risk: models may inadvertently retain and leak sensitive source-domain specific information in the target domain. This creates a compelling use case for machine unlearning to protect the privacy of sensitive source-domain data. Among adaptation techniques, source-free domain adaptation (SFDA) creates an urgent need for machine unlearning (MU): the source data itself is protected, yet the source model exposed during adaptation encodes its influence. Our experiments reveal that existing SFDA methods exhibit strong zero-shot performance on source-exclusive classes in the target domain, indicating they inadvertently leak knowledge of these classes into the target domain, even when they are not represented in the target data. We identify and address this risk by proposing an MU setting called SCADA-UL: Unlearning Source-exclusive ClAsses in Domain Adaptation. Existing MU methods do not address this setting as they are not designed to handle data distribution shifts. We propose a new unlearning method, where an adversarially generated forget class sample is unlearned by the model during the domain adaptation process using a novel rescaled labeling strategy and adversarial optimization. We also extend our study to two variants: a continual version of this problem setting and one where the specific source classes to be forgotten may be unknown. Alongside theoretical interpretations, our comprehensive empirical results show that our method consistently outperforms baselines in the proposed setting while achieving retraining-level unlearning performance on benchmark datasets. Our code is available at https://github.com/D-Arnav/SCADA
[159] DBMF: A Dual-Branch Multimodal Framework for Out-of-Distribution Detection
Jiangbei Yue,Sharib Ali
Main category: cs.CV
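TL;DR: This paper proposes a dual-branch multimodal framework that combines a text-image branch and a vision branch to improve the detection of out-of-distribution (OOD) samples in endoscopic images, significantly outperforming existing methods.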
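This entry combines a text-image score and a vision score into one OOD score that is compared with a threshold. The source does not say how the two scores are combined, so the sketch below simply assumes a convex combination and assumes that higher scores mean "more likely OOD"; the weight `alpha`, the threshold value, and the function names are all hypothetical.

```python
def fuse_scores(s_text_image, s_vision, alpha=0.5):
    """Assumed fusion rule: convex combination of the two branch scores."""
    return alpha * s_text_image + (1.0 - alpha) * s_vision


def is_ood(s_text_image, s_vision, threshold, alpha=0.5):
    """Flag a sample as OOD when the fused score exceeds the threshold."""
    return fuse_scores(s_text_image, s_vision, alpha) > threshold


# Two hypothetical samples given as (S_t, S_v) pairs.
samples = [(0.9, 0.7), (0.2, 0.3)]
decisions = [is_ood(s_t, s_v, threshold=0.5) for s_t, s_v in samples]
```

In practice the threshold would be chosen on held-out in-distribution data (for example, to fix a target false-positive rate), and the fusion weight would be tuned or replaced by whatever rule the paper actually uses.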
Details
Motivation: Existing OOD detection methods usually rely on a single visual modality or on image-text matching alone; they underuse multimodal information and struggle with the complex, dynamic out-of-distribution data seen in clinical practice, such as unseen disease cases.
Method: A dual-branch multimodal framework with a text-image branch (producing score $S_t$) and a vision-only branch (producing score $S_v$); the two complementary scores are fused into a final OOD score $S$, which is compared with a threshold to decide whether a sample is OOD.
Result: Experiments on public endoscopic image datasets show that the framework is robust across backbones and improves OOD detection over the current best methods by up to 24.84%.
Conclusion: The dual-branch multimodal framework makes fuller use of multimodal semantic information and improves the OOD detection capability and reliability of DL systems in clinical settings.
Abstract: The complex and dynamic real-world clinical environment demands reliable deep learning (DL) systems. Out-of-distribution (OOD) detection plays a critical role in enhancing the reliability and generalizability of DL models when encountering data that deviate from the training distribution, such as unseen disease cases. However, existing OOD detection methods typically rely either on a single visual modality or solely on image-text matching, failing to fully leverage multimodal information. To overcome this challenge, we propose a novel dual-branch multimodal framework by introducing a text-image branch and a vision branch. Our framework fully exploits multimodal representations to identify OOD samples through these two complementary branches. After training, we compute scores from the text-image branch ($S_t$) and vision branch ($S_v$), and integrate them to obtain the final OOD score $S$ that is compared with a threshold for OOD detection. Comprehensive experiments on publicly available endoscopic image datasets demonstrate that our proposed framework is robust across diverse backbones and improves state-of-the-art performance in OOD detection by up to 24.84%.
[160] Orion-Lite: Distilling LLM Reasoning into Efficient Vision-Only Driving Models
Jing Gu,Niccolò Cavagnero,Gijs Dubbelman
Main category: cs.CV
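TL;DR: This paper presents Orion-Lite, a lightweight vision-only model for autonomous driving. Using latent feature distillation and ground-truth trajectory supervision, it distills the knowledge of the large vision-language-action (VLA) model ORION into a compact vision-only model that surpasses its teacher in closed-loop evaluation on complex interactive scenarios and reaches a Driving Score of 80.6 on the Bench2Drive benchmark, a new state of the art.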
Details
Motivation: Large language models (LLMs) carry rich world knowledge that can help autonomous driving systems handle rare and complex scenarios, but their massive parameter counts conflict with low-latency, low-power deployment. Existing distillation work is mostly limited to simple scenarios and open-loop evaluation, and does not validate performance under complex interaction and closed-loop driving.
Method: Latent feature distillation combined with ground-truth trajectory supervision transfers the knowledge of the VLA teacher ORION to the lightweight vision-only student Orion-Lite.
Result: Orion-Lite achieves a Driving Score of 80.6 on the demanding closed-loop Bench2Drive benchmark, surpassing its large VLA teacher ORION and setting a new state of the art. This indicates that vision-only architectures still hold significant untapped potential for high-performance reactive planning.
Conclusion: LLM knowledge can be distilled efficiently into lightweight vision models that match or even exceed the closed-loop driving performance of large multimodal teachers at low computational cost, showing that the vision-only route remains competitive and practical for autonomous driving.
Abstract: Leveraging the general world knowledge of Large Language Models (LLMs) holds significant promise for improving the ability of autonomous driving systems to handle rare and complex scenarios. While integrating LLMs into Vision-Language-Action (VLA) models has yielded state-of-the-art performance, their massive parameter counts pose severe challenges for latency-sensitive and energy-efficient deployment. Distilling LLM knowledge into a compact driving model offers a compelling solution to retain these reasoning capabilities while maintaining a manageable computational footprint. Although previous works have demonstrated the efficacy of distillation, these efforts have primarily focused on relatively simple scenarios and open-loop evaluations. Therefore, in this work, we investigate LLM distillation in more complex, interactive scenarios under closed-loop evaluation. We demonstrate that through a combination of latent feature distillation and ground-truth trajectory supervision, an efficient vision-only student model, Orion-Lite, can even surpass the performance of its massive VLA teacher, ORION. It sets a new state of the art on the rigorous Bench2Drive benchmark, with a Driving Score of 80.6. Ultimately, this reveals that vision-only architectures still possess significant, untapped potential for high-performance reactive planning.
[161] Preventing Overfitting in Deep Image Prior for Hyperspectral Image Denoising
Panagiotis Gkotsis,Athanasios A. Rontogiannis
Main category: cs.CV
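TL;DR: This paper proposes a deep image prior (DIP) method for hyperspectral image denoising that combines a robust data fidelity term with explicit sensitivity regularization, mitigating overfitting and improving performance.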
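The robust data fidelity term in this entry is a Smooth ℓ1 loss. For reference, the sketch below implements the standard Smooth ℓ1 function, which is quadratic for small residuals and linear for large ones, and averages it over a flattened image. The transition point `beta`, the mean reduction, and the function names are assumptions; the paper's exact settings, its divergence-based regularizer, and its input optimization are not reproduced here.

```python
from typing import Sequence


def smooth_l1(residual: float, beta: float = 1.0) -> float:
    """Standard Smooth L1: 0.5*r^2/beta if |r| < beta, else |r| - 0.5*beta."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    a = abs(residual)
    if a < beta:
        return 0.5 * a * a / beta
    return a - 0.5 * beta


def smooth_l1_data_term(pred: Sequence[float], noisy: Sequence[float],
                        beta: float = 1.0) -> float:
    """Mean Smooth L1 between a network output and the noisy observation.

    Both inputs are flattened pixel sequences of equal length (assumption).
    """
    if len(pred) != len(noisy):
        raise ValueError("inputs must have the same length")
    return sum(smooth_l1(p - n, beta) for p, n in zip(pred, noisy)) / len(pred)


# A small residual is penalized quadratically, a large one only linearly,
# so an outlier (for example a stripe or sparse-noise pixel) contributes
# 9.5 here instead of the 50.0 that 0.5 * r^2 would give.
small = smooth_l1(0.5)   # 0.125
large = smooth_l1(10.0)  # 9.5
term = smooth_l1_data_term([0.0, 0.0], [0.5, 10.0])  # 4.8125
```

The linear tail is what makes the term less sensitive to sparse and stripe noise than a plain squared error.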
Details
Motivation: DIP methods are prone to overfitting in inverse imaging tasks, which degrades performance and requires early stopping.
Method: Training combines a Smooth ℓ1 data term, a divergence-based regularizer, and input optimization.
Result: On real hyperspectral images corrupted by Gaussian, sparse, and stripe noise, the method suppresses overfitting and denoises better than existing DIP-based methods.
Conclusion: Combining robust data fidelity with sensitivity regularization markedly improves the generalization and robustness of DIP for HSI denoising.
Abstract: Deep image prior (DIP) is an unsupervised deep learning framework that has been successfully applied to a variety of inverse imaging problems. However, DIP-based methods are inherently prone to overfitting, which leads to performance degradation and necessitates early stopping. In this paper, we propose a method to mitigate overfitting in DIP-based hyperspectral image (HSI) denoising by combining robust data fidelity and explicit sensitivity regularization. The proposed approach employs a Smooth $\ell_1$ data term together with a divergence-based regularization and input optimization during training. Experimental results on real HSIs corrupted by Gaussian, sparse, and stripe noise demonstrate that the proposed method effectively prevents overfitting and achieves superior denoising performance compared to state-of-the-art DIP-based HSI denoising methods.
[162] Revisiting Radar Perception With Spectral Point Clouds
Hamza Alsharif,Jing Gu,Pavol Jancura,Satish Ravindran,Gijs Dubbelman
Main category: cs.CV
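TL;DR: This paper introduces the spectral point cloud paradigm, which treats point clouds as sparse, compressed representations of the radar spectrum and enriches them with spectral information, so that they match or even surpass dense range-Doppler (RD) spectrum inputs on radar perception tasks.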
Details
Motivation: Dense range-Doppler spectra are generally assumed to outperform sparse point clouds, but they vary with sensor and configuration, which hinders model transfer. The potential of point clouds as a general-purpose representation remains underexplored.
Method: Introduces the spectral point cloud paradigm, builds an experimental framework that compares point cloud models at different densities against a dense RD benchmark, and explores two basic spectral enrichment methods that inject target-relevant spectral information into the point clouds.
Result: At certain point cloud densities, spectral point cloud models match the RD benchmark, and with spectral enrichment they surpass it.
Conclusion: Spectral point clouds are a robust, unified input representation for radar perception that could support future radar foundation models.
Abstract: Radar perception models are trained with different inputs, from range-Doppler spectra to sparse point clouds. Dense spectra are assumed to outperform sparse point clouds, yet they can vary considerably across sensors and configurations, which hinders transfer. In this paper, we provide alternatives for incorporating spectral information into radar point clouds and show that point clouds need not underperform compared to spectra. We introduce the spectral point cloud paradigm, where point clouds are treated as sparse, compressed representations of the radar spectra, and argue that, when enriched with spectral information, they serve as strong candidates for a unified input representation that is more robust against sensor-specific differences. We develop an experimental framework that compares spectral point cloud (PC) models at varying densities against a dense range-Doppler (RD) benchmark, and report the density levels where the PC configurations meet the performance of the RD benchmark. Furthermore, we experiment with two basic spectral enrichment approaches that inject additional target-relevant information into the point clouds. Contrary to the common belief that the dense RD approach is superior, we show that point clouds can do just as well, and can surpass the RD benchmark when enrichment is applied. Spectral point clouds can therefore serve as strong candidates for unified radar perception, paving the way for future radar foundation models.
[163] CAMotion: A High-Quality Benchmark for Camouflaged Moving Object Detection in the Wild
Siyuan Yao,Hao Sun,Ruiqi Yu,Xiwei Jiang,Wenqi Ren,Xiaochun Cao
Main category: cs.CV
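TL;DR: This paper builds CAMotion, a high-quality benchmark dataset for video camouflaged object detection (VCOD) that addresses the small scale and limited diversity of existing VCOD datasets, and comprehensively evaluates current SOTA models on it.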
Details
Motivation: Existing video camouflaged object detection (VCOD) datasets are severely limited in scale and diversity, which makes it hard to analyze and broadly evaluate data-driven deep learning methods.
Method: Builds the CAMotion dataset, which covers many species and challenging attributes (uncertain edges, occlusion, motion blur, shape complexity, and others), presents sequence annotation details and statistical distributions from several perspectives, and comprehensively evaluates existing SOTA models on the dataset.
Result: Releases CAMotion, a large-scale, highly diverse video benchmark for camouflaged moving object detection in the wild, and identifies the main challenges of the VCOD task.
Conclusion: CAMotion is expected to advance algorithm research and practical applications in camouflaged object detection, particularly in video.
Abstract: Discovering camouflaged objects is a challenging task in computer vision due to the high similarity between camouflaged objects and their surroundings. While the problem of camouflaged object detection over sequential video frames has received increasing attention, the scale and diversity of existing video camouflaged object detection (VCOD) datasets are greatly limited, which hinders the deeper analysis and broader evaluation of recent deep learning-based algorithms with data-hungry training strategies. To break this bottleneck, in this paper, we construct CAMotion, a high-quality benchmark that covers a wide range of species for camouflaged moving object detection in the wild. CAMotion comprises various sequences with multiple challenging attributes such as uncertain edges, occlusion, motion blur, and shape complexity. The sequence annotation details and statistical distribution are presented from various perspectives, allowing CAMotion to provide in-depth analyses of the camouflaged object's motion characteristics in different challenging scenarios. Additionally, we conduct a comprehensive evaluation of existing SOTA models on CAMotion, and discuss the major challenges in the VCOD task. The benchmark is available at https://www.camotion.focuslab.net.cn. We hope that CAMotion can lead to further advancements in the research community.
[164] GroundingAnomaly: Spatially-Grounded Diffusion for Few-Shot Anomaly Synthesis
Yishen Liu,Hongcang Chen,Pengcheng Zhao,Yunfan Bao,Yuxi Tian,Jieming Zhang,Hao Chen,Zheng Zhi,Yongchun Liu,Ying Li,Dongpu Cao
Main category: cs.CV
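TL;DR: This paper proposes GroundingAnomaly, a few-shot anomaly image generation framework. A Spatial Conditioning Module and a Gated Self-Attention Module give precise spatial control over synthesized anomalies while preserving pretrained priors, yielding SOTA performance on downstream anomaly detection, segmentation, and instance-level detection on the MVTec AD and VisA datasets.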
Details
Motivation: Visual anomaly inspection in industrial quality control is limited by the scarcity of real anomalous samples, and existing anomaly synthesis methods either produce inaccurate masks or integrate poorly because of inpainting.
Method: The GroundingAnomaly framework has two core modules: (1) a Spatial Conditioning Module that uses per-pixel semantic maps for precise spatial control over the synthesized anomalies; (2) a Gated Self-Attention Module that injects conditioning tokens into a frozen U-Net through gated attention layers, preserving pretrained priors while enabling stable few-shot adaptation.
Result: Evaluations on the MVTec AD and VisA datasets show that the method generates high-quality anomaly images and reaches SOTA performance on several downstream tasks, including anomaly detection, segmentation, and instance-level detection.
Conclusion: GroundingAnomaly addresses imprecise spatial control and loss of pretrained knowledge in few-shot anomaly generation, providing a more reliable data augmentation approach for industrial visual anomaly inspection.
Abstract: The performance of visual anomaly inspection in industrial quality control is often constrained by the scarcity of real anomalous samples. Consequently, anomaly synthesis techniques have been developed to enlarge training sets and enhance downstream inspection. However, existing methods either suffer from poor integration caused by inpainting or fail to provide accurate masks. To address these limitations, we propose GroundingAnomaly, a novel few-shot anomaly image generation framework. Our framework introduces a Spatial Conditioning Module that leverages per-pixel semantic maps to enable precise spatial control over the synthesized anomalies. Furthermore, a Gated Self-Attention Module is designed to inject conditioning tokens into a frozen U-Net via gated attention layers. This carefully preserves pretrained priors while ensuring stable few-shot adaptation. Extensive evaluations on the MVTec AD and VisA datasets demonstrate that GroundingAnomaly generates high-quality anomalies and achieves state-of-the-art performance across multiple downstream tasks, including anomaly detection, segmentation, and instance-level detection.
[165] Weakly-Supervised Lung Nodule Segmentation via Training-Free Guidance of 3D Rectified Flow
Richard Petersen,Fredrik Kahl,Jennifer Alvén
Main category: cs.CV
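TL;DR: This paper proposes a weakly supervised lung nodule segmentation method that fine-tunes only a predictor and does not retrain the generative model. It combines a pretrained 3D rectified flow model with a predictor and uses image-level labels to obtain high-quality segmentations, outperforming existing attribution-based methods, especially on small structures.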
Details
Motivation: Dense annotations such as segmentation masks are costly for 3D medical images, and existing weakly supervised methods, especially attribution-based ones, struggle to capture small structures such as lung nodules.
Method: A pretrained 3D rectified flow model and a predictor are combined in a plug-and-play manner. The generative model is guided without any training, and only the predictor is fine-tuned with image-level labels.
Result: The method clearly outperforms baselines on LUNA16, reliably detects lung nodules of different sizes and shapes, and improves segmentation quality for two separate predictors.
Conclusion: Generative foundation models can serve as effective tools for weakly supervised 3D medical image segmentation, and the method offers a new approach to accurate small-lesion segmentation at low annotation cost.
Abstract: Dense annotations, such as segmentation masks, are expensive and time-consuming to obtain, especially for 3D medical images where expert voxel-wise labeling is required. Weakly supervised approaches aim to address this limitation, but often rely on attribution-based methods that struggle to accurately capture small structures such as lung nodules. In this paper, we propose a weakly-supervised segmentation method for lung nodules by combining pretrained state-of-the-art rectified flow and predictor models in a plug-and-play manner. Our approach uses training-free guidance of a 3D rectified flow model, requiring only fine-tuning of the predictor using image-level labels and no retraining of the generative model. The proposed method produces improved-quality segmentations for two separate predictors, consistently detecting lung nodules of varying sizes and shapes. Experiments on LUNA16 demonstrate improvements over baseline methods, highlighting the potential of generative foundation models as tools for weakly supervised 3D medical image segmentation.
[166] Fundus-R1: Training a Fundus-Reading MLLM with Knowledge-Aware Reasoning on Public Data
Yuchuan Deng,Qijie Wei,Kaiheng Qian,Jiazhen Liu,Zijie Xin,Bangxiang Lan,Jingyu Liu,Jianfeng Dong,Xirong Li
Main category: cs.CV
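TL;DR: This paper proposes Fundus-R1, a reasoning-capable multimodal large language model for fundus image understanding trained exclusively on public data, of which over 94% carries only image-level labels. It uses RAG to generate knowledge-aware reasoning traces and adds a process reward to RLVR to improve the self-consistency of the reasoning, clearly outperforming baselines on several benchmarks.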
Details
Motivation: Existing fundus image understanding models depend on large numbers of private samples paired with high-quality clinical reports. These data are not public, which hurts reproducibility and confines research to a few institutions.
Method: (1) A RAG-based method automatically composes image-specific reasoning traces that link visual findings to image labels using ophthalmic knowledge; (2) a process reward is added to reinforcement learning with verifiable rewards (RLVR) to encourage self-consistency of the reasoning trace in each rollout.
Result: On three fundus-reading benchmarks (FunBench, Omni-Fundus, and GMAI-Fundus), Fundus-R1 clearly outperforms the generic multimodal model Qwen2.5-VL and a post-trained edition that does not use the generated reasoning traces.
Conclusion: Public data alone, even with a high share of image-level labels, is enough to train a strong, reasoning-enhanced fundus-reading MLLM, which opens the field to a wider range of researchers.
Abstract: Fundus imaging such as CFP, OCT and UWF is crucial for the early detection of retinal anomalies and diseases. Fundus image understanding, due to its knowledge-intensive nature, poses a challenging vision-language task. An emerging approach to addressing the task is to post-train a generic multimodal large language model (MLLM), either by supervised finetuning (SFT) or by reinforcement learning with verifiable rewards (RLVR), on a considerable amount of in-house samples paired with high-quality clinical reports. However, these valuable samples are not publicly accessible, which not only hinders reproducibility but also practically limits research to a few players. To overcome the barrier, we make a novel attempt to train a reasoning-enhanced fundus-reading MLLM, which we term Fundus-R1, using exclusively public datasets, wherein over 94% of the data are annotated with only image-level labels. Our technical contributions are two-fold. First, we propose a RAG-based method for composing image-specific, knowledge-aware reasoning traces. Such auto-generated traces link visual findings identified by a generic MLLM to the image labels in terms of ophthalmic knowledge. Second, we enhance RLVR with a process reward that encourages self-consistency of the generated reasoning trace in each rollout. Extensive experiments on three fundus-reading benchmarks, i.e., FunBench, Omni-Fundus and GMAI-Fundus, show that Fundus-R1 clearly outperforms multiple baselines, including its generic counterpart (Qwen2.5-VL) and a stronger edition post-trained without using the generated traces.
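To make the idea of adding a process reward to a verifiable outcome reward concrete, here is a minimal, purely hypothetical sketch. The abstract does not define how self-consistency is measured, so the check used here (does the label the reasoning trace argues for match the final answer?), the `Rollout` fields, and the weight are all assumptions and should not be read as the paper's reward.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    trace_conclusion: str  # label the reasoning trace argues for (hypothetical field)
    final_answer: str      # label the model finally outputs (hypothetical field)


def outcome_reward(rollout: Rollout, gold_label: str) -> float:
    """Verifiable reward: 1 if the final answer matches the image-level label."""
    return 1.0 if rollout.final_answer == gold_label else 0.0


def process_reward(rollout: Rollout) -> float:
    """Assumed self-consistency check: the trace and the answer agree."""
    return 1.0 if rollout.trace_conclusion == rollout.final_answer else 0.0


def total_reward(rollout: Rollout, gold_label: str, weight: float = 0.5) -> float:
    """Outcome reward plus a weighted process reward (the weight is an assumption)."""
    return outcome_reward(rollout, gold_label) + weight * process_reward(rollout)


# Three hypothetical rollouts for an image labeled "glaucoma".
consistent_correct = total_reward(Rollout("glaucoma", "glaucoma"), "glaucoma")
inconsistent_correct = total_reward(Rollout("cataract", "glaucoma"), "glaucoma")
consistent_wrong = total_reward(Rollout("cataract", "cataract"), "glaucoma")
```

Under these assumptions, a rollout that reaches the right answer through a trace that argues for something else scores lower than one whose reasoning and answer agree.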
This work paves the way for training powerful fundus-reading MLLMs with publicly available data.
[167] Lost in the Hype: Revealing and Dissecting the Performance Degradation of Medical Multimodal Large Language Models in Image Classification
Xun Zhu,Fanbin Mo,Xi Chen,Kaili Zheng,Shaoshuai Yang,Yiming Shi,Jian Gao,Miao Li,Ji Wu
Main category: cs.CV
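TL;DR: Using feature probing, this paper systematically analyzes why 14 open-source medical multimodal large language models (MLLMs) underperform on image classification, identifies four failure modes, and proposes quantitative evaluation scores.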
Details
Motivation: Although medical MLLMs have large advantages in pre-training data and parameter count, they consistently trail traditional deep learning models on basic medical image classification. This paradox motivates a closer look at where the performance degradation comes from.
Method: Large-scale experiments on 14 open-source medical MLLMs across three representative medical image classification datasets, with module-level and layer-level visual feature probing that tracks the flow of visual features through the whole MLLM pipeline and visualizes where classification signals are distorted, diluted, or overridden.
Result: The study identifies four failure modes behind the classification degradation of medical MLLMs: limited quality of the visual representation, fidelity loss in the connector projection, comprehension deficit in LLM reasoning, and misalignment of the semantic mapping. It also proposes quantitative scores that characterize the healthiness of feature evolution.
Conclusion: Current medical MLLMs remain far from clinical deployment and must overcome key bottlenecks. The study calls on the community to rethink the gap between high expectations and actual clinical potential.
Abstract: The rise of multimodal large language models (MLLMs) has sparked an unprecedented wave of applications in the field of medical imaging analysis. However, as one of the earliest and most fundamental tasks integrated into this paradigm, medical image classification reveals a sobering reality: state-of-the-art medical MLLMs consistently underperform compared to traditional deep learning models, despite their overwhelming advantages in pre-training data and model parameters. This paradox prompts a critical rethinking: where exactly does the performance degradation originate? In this paper, we conduct extensive experiments on 14 open-source medical MLLMs across three representative image classification datasets. Moving beyond superficial performance benchmarking, we employ feature probing to track the information flow of visual features module-by-module and layer-by-layer throughout the entire MLLM pipeline, enabling explicit visualization of where and how classification signals are distorted, diluted, or overridden. As the first attempt to dissect classification performance degradation in medical MLLMs, our findings reveal four failure modes: 1) quality limitation in visual representation, 2) fidelity loss in connector projection, 3) comprehension deficit in LLM reasoning, and 4) misalignment of semantic mapping. Meanwhile, we introduce quantitative scores that characterize the healthiness of feature evolution, enabling principled comparisons across diverse MLLMs and datasets.
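Stage-by-stage feature probing can be pictured as fitting the same simple classifier on features taken from different points of the pipeline and comparing accuracies. The sketch below uses a nearest-class-centroid probe on tiny synthetic features; the stage names, the toy numbers, and the choice of probe are assumptions for illustration, and the paper's own probes and healthiness scores are not reproduced.

```python
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (feature vector, class label)


def centroid(vectors: List[Sequence[float]]) -> List[float]:
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def probe_accuracy(train: List[Sample], test: List[Sample]) -> float:
    """Fit class centroids on train features, then score test accuracy."""
    by_label: Dict[int, List[Sequence[float]]] = {}
    for feat, label in train:
        by_label.setdefault(label, []).append(feat)
    centroids = {label: centroid(feats) for label, feats in by_label.items()}
    correct = 0
    for feat, label in test:
        pred = min(centroids, key=lambda c: sq_dist(feat, centroids[c]))
        correct += int(pred == label)
    return correct / len(test)


# Synthetic features: well separated after the (hypothetical) vision encoder,
# partly collapsed after the (hypothetical) connector.
stages = {
    "vision_encoder": (
        [([0.0, 0.1], 0), ([0.1, 0.0], 0), ([1.0, 0.9], 1), ([0.9, 1.0], 1)],
        [([0.05, 0.05], 0), ([0.95, 0.95], 1)],
    ),
    "connector": (
        [([0.4, 0.5], 0), ([0.5, 0.4], 0), ([0.6, 0.5], 1), ([0.5, 0.6], 1)],
        [([0.52, 0.52], 0), ([0.56, 0.56], 1)],
    ),
}
accuracy_by_stage = {name: probe_accuracy(tr, te) for name, (tr, te) in stages.items()}
```

A drop in probe accuracy between two stages suggests that class information was lost or obscured between them, which is the kind of signal the failure-mode analysis relies on.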
Furthermore, we provide insightful discussions centered on the critical barriers that prevent current medical MLLMs from fulfilling their promised clinical potential. We hope that our work provokes rethinking within the community, highlighting that the road from high expectations to clinically deployable MLLMs remains long and winding.
[168] InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding
Ashutosh Kumar,Rajat Saini,Jingjing Pan,Mustafa Erdogan,Mingfang Zhang,Betty Le Dem,Norimasa Kobori,Quan Kong
Main category: cs.CV
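TL;DR: This paper proposes InstAP, a framework that improves vision-language pre-training through joint global and instance-level alignment, addressing the weakness of existing methods in instance-level reasoning.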
Details
Motivation: Existing vision-language pre-training paradigms are good at global scene understanding but are limited in instance-level reasoning because they use only global supervision.
Method: The InstAP (Instance-Aware Pre-training) framework jointly optimizes global vision-text alignment and fine-grained instance-level contrastive alignment. It is supported by InstVL, a large-scale dual-granularity dataset (2 million images, 50,000 videos) with holistic scene captions and dense, spatially and temporally grounded instance descriptions.
Result: On the InstVL benchmark, InstAP substantially outperforms existing VLP models on instance-level retrieval and also beats a strong baseline trained on the same data, which isolates the benefit of the instance-aware objective. It also improves global understanding, with competitive zero-shot performance on video benchmarks such as MSR-VTT and DiDeMo, and visualizations show that it localizes textual mentions to the correct instances.
Conclusion: Instance-aware pre-training strengthens fine-grained understanding and in turn benefits global understanding, making it a more complete and more general approach to joint vision-language modeling.
Abstract: Current vision-language pre-training (VLP) paradigms excel at global scene understanding but struggle with instance-level reasoning due to global-only supervision. We introduce InstAP, an Instance-Aware Pre-training framework that jointly optimizes global vision-text alignment and fine-grained, instance-level contrastive alignment by grounding textual mentions to specific spatial-temporal regions. To support this, we present InstVL, a large-scale dataset (2 million images, 50,000 videos) with dual-granularity annotations: holistic scene captions and dense, grounded instance descriptions. On the InstVL benchmark, InstAP substantially outperforms existing VLP models on instance-level retrieval, and also surpasses a strong VLP baseline trained on the exact same data corpus, isolating the benefit of our instance-aware objective. Moreover, instance-centric pre-training improves global understanding: InstAP achieves competitive zero-shot performance on multiple video benchmarks, including MSR-VTT and DiDeMo. Qualitative visualizations further show that InstAP localizes textual mentions to the correct instances, while global-only models exhibit more diffuse, scene-level attention.
[169] PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models
Ruizhi Zhang,Ye Huang,Yuangang Pan,Chuanfu Shen,Zhilin Liu,Ting Xie,Wen Li,Lixin Duan
Main category: cs.CV
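TL;DR: This paper introduces PokeGym, a visually driven, long-horizon 3D embodied benchmark built on Pokemon Legends: Z-A that targets four deficiencies of existing VLM benchmarks: lack of interactivity, lack of 3D depth perception, state leakage, and poor scalability. With strictly isolated RGB inputs and memory-based verification, it evaluates the visual grounding, semantic reasoning, and autonomous exploration of VLMs on navigation and interaction tasks. It finds that physical deadlock recovery, rather than high-level planning, is the main bottleneck of current VLMs, and that models of different strength differ metacognitively, showing "Unaware Deadlocks" versus "Aware Deadlocks".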
Details
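Motivation: Existing VLM benchmarks have four deficiencies: passive perception ignores interactive dynamics, 2D environments cannot test depth perception, state leakage bypasses genuine visual processing, and human evaluation is expensive and does not scale. A new benchmark is needed for complex 3D embodied environments that is strictly vision-driven and automatically evaluable.
Method: The PokeGym benchmark implements 30 long-horizon tasks (30 to 220 steps) in Pokemon Legends: Z-A, covering navigation, interaction, and mixed scenarios, with three instruction granularities (Visual-Guided, Step-Guided, Goal-Only). It enforces code-level isolation: the agent receives only raw RGB frames, and an independent evaluator verifies outcomes through memory scanning, which ensures purely vision-based decisions and automated evaluation.
Result: Experiments show that the core bottleneck of current VLMs is physical deadlock recovery rather than high-level planning, and that deadlock frequency is strongly negatively correlated with task success. They also reveal a metacognitive divergence: weaker models mostly fall into Unaware Deadlocks (they do not notice they are stuck), while stronger models show Aware Deadlocks (they notice but cannot escape).
Conclusion: Spatial intuition must be built explicitly into VLM architectures to improve robustness and autonomy in complex 3D embodied environments. PokeGym provides a strict, scalable, vision-native evaluation platform for future embodied VLM research.
Abstract: While Vision-Language Models (VLMs) have achieved remarkable progress in static visual understanding, their deployment in complex 3D embodied environments remains severely limited. Existing benchmarks suffer from four critical deficiencies: (1) passive perception tasks circumvent interactive dynamics; (2) simplified 2D environments fail to assess depth perception; (3) privileged state leakage bypasses genuine visual processing; and (4) human evaluation is prohibitively expensive and unscalable. We introduce PokeGym, a visually-driven long-horizon benchmark instantiated within Pokemon Legends: Z-A, a visually complex 3D open-world Role-Playing Game. PokeGym enforces strict code-level isolation: agents operate solely on raw RGB observations while an independent evaluator verifies success via memory scanning, ensuring pure vision-based decision-making and automated, scalable assessment. The benchmark comprises 30 tasks (30-220 steps) spanning navigation, interaction, and mixed scenarios, with three instruction granularities (Visual-Guided, Step-Guided, Goal-Only) to systematically deconstruct visual grounding, semantic reasoning, and autonomous exploration capabilities. Our evaluation reveals a key limitation of current VLMs: physical deadlock recovery, rather than high-level planning, constitutes the primary bottleneck, with deadlocks showing a strong negative correlation with task success.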
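The relationship between deadlocks and task success reported above is a correlation claim, which can be checked with a Pearson coefficient over per-model statistics. The sketch below computes it from scratch. The deadlock counts and success rates are invented placeholders, not numbers from the benchmark, and the function name is an assumption.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Placeholder per-model statistics (hypothetical): average deadlocks per
# episode and task success rate.
deadlocks_per_episode = [1.0, 3.0, 5.0, 7.0]
success_rate = [0.8, 0.6, 0.5, 0.2]

r = pearson(deadlocks_per_episode, success_rate)  # strongly negative here
```

A value of `r` close to -1 on real per-model data would correspond to the strong negative correlation the abstract describes.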
Furthermore, we uncover a metacognitive divergence: weaker models predominantly suffer from Unaware Deadlocks (oblivious to entrapment), whereas advanced models exhibit Aware Deadlocks (recognizing entrapment yet failing to recover). These findings highlight the need to integrate explicit spatial intuition into VLM architectures. The code and benchmark will be available on GitHub.
[170] MegaStyle: Constructing Diverse and Scalable Style Dataset via Consistent Text-to-Image Style Mapping
Junyao Gao,Sibo Liu,Jiaxing Li,Yanan Sun,Yuanpeng Tu,Fei Shen,Weidong Zhang,Cairong Zhao,Jun Zhang
Main category: cs.CV
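TL;DR: This paper proposes MegaStyle, a data curation pipeline for building a high-quality style dataset, and trains a style encoder and a style transfer model on the resulting data.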
Details
Motivation: Existing style datasets lack intra-style consistency, inter-style diversity, and high quality, which limits the performance of style transfer models.
Method: Using the text-to-image style mapping capability of large generative models, the authors build a prompt gallery of 170K style prompts and 400K content prompts and generate the large-scale style dataset MegaStyle-1.4M. On this dataset they train MegaStyle-Encoder with style-supervised contrastive learning and a FLUX-based style transfer model, MegaStyle-FLUX.
Result: Experiments confirm the importance of intra-style consistency, inter-style diversity, and high quality, as achieved by MegaStyle-1.4M. MegaStyle-Encoder provides a reliable style similarity measure, and MegaStyle-FLUX shows style transfer that generalizes well.
Conclusion: MegaStyle provides a high-quality data foundation and effective models for style transfer.
Abstract: In this paper, we introduce MegaStyle, a novel and scalable data curation pipeline that constructs an intra-style consistent, inter-style diverse and high-quality style dataset. We achieve this by leveraging the consistent text-to-image style mapping capability of current large generative models, which can generate images in the same style from a given style description. Building on this foundation, we curate a diverse and balanced prompt gallery with 170K style prompts and 400K content prompts, and generate a large-scale style dataset MegaStyle-1.4M via content-style prompt combinations. With MegaStyle-1.4M, we propose style-supervised contrastive learning to fine-tune a style encoder MegaStyle-Encoder for extracting expressive, style-specific representations, and we also train a FLUX-based style transfer model MegaStyle-FLUX. Extensive experiments demonstrate the importance of maintaining intra-style consistency, inter-style diversity and high quality in a style dataset, as well as the effectiveness of the proposed MegaStyle-1.4M. Moreover, when trained on MegaStyle-1.4M, MegaStyle-Encoder and MegaStyle-FLUX provide reliable style similarity measurement and generalizable style transfer, making a significant contribution to the style transfer community. More results are available at our project website https://jeoyal.github.io/MegaStyle/.
[171] SurfelSplat: Learning Efficient and Generalizable Gaussian Surfel Representations for Sparse-View Surface Reconstruction
Chensheng Dai,Shengjun Zhang,Min Chen,Yueqi Duan
Main category: cs.CV
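TL;DR: This paper proposes SurfelSplat, a feed-forward 3D surface reconstruction method for sparse-view images. Cross-view feature aggregation and sampling-rate-guided low-pass filtering let it generate pixel-aligned Gaussian surfels quickly and in a generalizable way, reaching accuracy comparable to the state of the art on the DTU dataset with a 100x speedup at inference.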
Details
Motivation: Existing optimization-based surface reconstruction methods built on 3D Gaussian Splatting (3DGS) require dense input views and long per-scene optimization, and they lack generalization and real-time capability. An efficient, generalizable, feed-forward framework driven by sparse views is needed.
Method: SurfelSplat (1) adopts a pixel-aligned Gaussian surfel representation; (2) designs a cross-view feature aggregation module based on the Nyquist sampling theorem, including low-pass filtering guided by the spatial sampling rate and multi-view projection; (3) builds a dedicated feature fusion network that regresses geometrically accurate surfels.
Result: On the DTU benchmark, reconstruction accuracy is comparable to SOTA optimization-based methods, with inference in under 1 second per scene, roughly a 100x speedup, and no per-scene training.
Conclusion: SurfelSplat shows that feed-forward sparse-view 3D surface reconstruction is feasible and efficient, offering a new approach to real-time, generalizable neural rendering and reconstruction.
Abstract: 3D Gaussian Splatting (3DGS) has demonstrated impressive performance in 3D scene reconstruction. Beyond novel view synthesis, it shows great potential for multi-view surface reconstruction. Existing methods employ optimization-based reconstruction pipelines that achieve precise and complete surface extractions. However, these approaches typically require dense input views and high time consumption for per-scene optimization. To address these limitations, we propose SurfelSplat, a feed-forward framework that generates efficient and generalizable pixel-aligned Gaussian surfel representations from sparse-view images. We observe that conventional feed-forward structures struggle to recover accurate geometric attributes of Gaussian surfels because the spatial frequency of pixel-aligned primitives exceeds Nyquist sampling rates. Therefore, we propose a cross-view feature aggregation module based on the Nyquist sampling theorem. Specifically, we first adapt the geometric forms of Gaussian surfels with spatial sampling rate-guided low-pass filters. We then project the filtered surfels across all input views to obtain cross-view feature correlations. By processing these correlations through a specially designed feature fusion network, we can finally regress Gaussian surfels with precise geometry. Extensive experiments on DTU reconstruction benchmarks demonstrate that our model achieves comparable results with state-of-the-art methods, and predicts Gaussian surfels within 1 second, offering a 100x speedup without costly per-scene training.
[172] Phantasia: Context-Adaptive Backdoors in Vision Language Models
Nam Duong Tran,Phi Le Nguyen
Main category: cs.CV
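TL;DR: This paper shows that the stealthiness of existing backdoor attacks on vision-language models (VLMs) has been overestimated and proposes Phantasia, a context-adaptive backdoor attack that substantially improves stealth and adaptability.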
Details
Motivation: Existing VLM backdoor attacks mostly rely on fixed, easily identifiable poisoned response patterns, and their actual stealthiness has not been rigorously evaluated. There is also no method that produces dynamic poisoned outputs that are semantically consistent and hard to detect.
Method: (1) Systematically evaluate the detectability of existing attacks by adapting defense techniques from other domains; (2) propose Phantasia, a context-adaptive backdoor attack framework that uses the semantics of each input to generate plausible but malicious responses instead of static patterns.
Result: Experiments show that several SOTA backdoor attacks can be detected efficiently by simply adapted defenses. Phantasia achieves state-of-the-art attack success rates across multiple VLM architectures while maintaining benign performance under various defenses.
Conclusion: The real stealthiness of VLM backdoor attacks has been overestimated. Phantasia offers a new approach to stealthier, adaptive multimodal backdoor attacks and underlines the need for stronger VLM security defenses.
Abstract: Recent advances in Vision-Language Models (VLMs) have greatly enhanced the integration of visual perception and linguistic reasoning, driving rapid progress in multimodal understanding. Despite these achievements, the security of VLMs, particularly their vulnerability to backdoor attacks, remains significantly underexplored. Existing backdoor attacks on VLMs are still in an early stage of development, with most current methods relying on generating poisoned responses that contain fixed, easily identifiable patterns. In this work, we make two key contributions. First, we demonstrate for the first time that the stealthiness of existing VLM backdoor attacks has been substantially overestimated. By adapting defense techniques originally designed for other domains (e.g., vision-only and text-only models), we show that several state-of-the-art attacks can be detected with surprising ease. Second, to address this gap, we introduce Phantasia, a context-adaptive backdoor attack that dynamically aligns its poisoned outputs with the semantics of each input. Instead of producing static poisoned patterns, Phantasia encourages models to generate contextually coherent yet malicious responses that remain plausible, thereby significantly improving stealth and adaptability. Extensive experiments across diverse VLM architectures reveal that Phantasia achieves state-of-the-art attack success rates while maintaining benign performance under various defensive settings.
[173] SyncBreaker: Stage-Aware Multimodal Adversarial Attacks on Audio-Driven Talking Head Generation
Wenli Zhang,Xianglong Shi,Sirui Zhao,Xinqi Chen,Guo Cheng,Yifan Xu,Tong Xu,Yong Liao
Main category: cs.CV
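TL;DR: This paper proposes SyncBreaker, a stage-aware multimodal protection framework that jointly perturbs image and audio inputs to suppress lip synchronization and facial dynamics in diffusion-based talking-head generation, while preserving the perceptual quality of the inputs and remaining robust to purification.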
Details
Motivation: Existing protection methods are limited to a single modality (image only or audio only) and cannot effectively suppress speech-driven facial dynamics, leaving a risk of misuse for fraud and misinformation.
Method: The SyncBreaker framework: (1) the image stream uses nullifying supervision with Multi-Interval Sampling (MIS), aggregating denoising guidance across diffusion stages; (2) the audio stream uses Cross-Attention Fooling (CAF), which suppresses interval-specific audio-conditioned cross-attention responses. The two streams are optimized independently and combined at inference time.
Result: In a white-box proactive protection setting, SyncBreaker clearly outperforms strong single-modality baselines, degrading lip synchronization and facial dynamics while preserving input perceptual quality and remaining robust under purification.
Conclusion: SyncBreaker shows that stage-aware, coordinated multimodal perturbation is an effective new approach to defending against misuse of audio-driven talking-head generation.
Abstract: Diffusion-based audio-driven talking-head generation enables realistic portrait animation, but also introduces risks of misuse, such as fraud and misinformation. Existing protection methods are largely limited to a single modality, and neither image-only nor audio-only attacks can effectively suppress speech-driven facial dynamics. To address this gap, we propose SyncBreaker, a stage-aware multimodal protection framework that jointly perturbs portrait and audio inputs under modality-specific perceptual constraints. Our key contributions are twofold. First, for the image stream, we introduce nullifying supervision with Multi-Interval Sampling (MIS) across diffusion stages to steer the generation toward the static reference portrait by aggregating guidance from multiple denoising intervals. Second, for the audio stream, we propose Cross-Attention Fooling (CAF), which suppresses interval-specific audio-conditioned cross-attention responses. Both streams are optimized independently and combined at inference time to enable flexible deployment. We evaluate SyncBreaker in a white-box proactive protection setting. Extensive experiments demonstrate that SyncBreaker more effectively degrades lip synchronization and facial dynamics than strong single-modality baselines, while preserving input perceptual quality and remaining robust under purification. Code: https://github.com/kitty384/SyncBreaker.
[174] BLaDA: Bridging Language to Functional Dexterous Actions within 3DGS Fields
Fan Yang,Wenrui Chen,Guorun Yan,Ruize Liao,Wanjun Jia,Dongsheng Luo,Kailun Yang,Zhiyong Li,Yaonan Wang
Main category: cs.CV
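TL;DR: This paper proposes BLaDA, a framework that achieves zero-shot, interpretable dexterous manipulation through language parsing, triangular functional point localization, and 3D keypoint grasp matrix transformation, substantially improving functional grasping precision and success rate.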
Details
Motivation: Existing modular methods rely on predefined affordance labels and lack tight coupling between semantics and pose, which makes functional dexterous manipulation under open-vocabulary instructions difficult.
Method: The BLaDA framework: (1) a Knowledge-guided Language Parsing (KLP) module parses natural language into a sextuple of manipulation constraints; (2) a Triangular Functional Point Localization (TriLocation) module locates functional regions using 3D Gaussian Splatting and triangular geometric constraints; (3) a 3D Keypoint Grasp Matrix Transformation Execution (KGT3D+) module generates wrist poses and finger-level commands.
Result: On complex benchmarks, BLaDA significantly outperforms existing methods in both functional localization precision and functional manipulation success rate across categories and tasks.
Conclusion: BLaDA provides a zero-shot dexterous manipulation framework that is driven by open-vocabulary instructions, tightly couples semantics, geometry, and physics, and is highly interpretable, advancing modular robotic manipulation systems.
Abstract: In unstructured environments, functional dexterous grasping calls for the tight integration of semantic understanding, precise 3D functional localization, and physically interpretable execution. Modular hierarchical methods are more controllable and interpretable than end-to-end VLA approaches, but existing ones still rely on predefined affordance labels and lack the tight semantic-pose coupling needed for functional dexterous manipulation. To address this, we propose BLaDA (Bridging Language to Dexterous Actions in 3DGS fields), an interpretable zero-shot framework that grounds open-vocabulary instructions as perceptual and control constraints for functional dexterous manipulation. BLaDA establishes an interpretable reasoning chain by first parsing natural language into a structured sextuple of manipulation constraints via a Knowledge-guided Language Parsing (KLP) module. To achieve pose-consistent spatial reasoning, we introduce the Triangular Functional Point Localization (TriLocation) module, which utilizes 3D Gaussian Splatting as a continuous scene representation and identifies functional regions under triangular geometric constraints. Finally, the 3D Keypoint Grasp Matrix Transformation Execution (KGT3D+) module decodes these semantic-geometric constraints into physically plausible wrist poses and finger-level commands. Extensive experiments on complex benchmarks demonstrate that BLaDA significantly outperforms existing methods in both affordance grounding precision and the success rate of functional manipulation across diverse categories and tasks. Code will be publicly available at https://github.com/PopeyePxx/BLaDA.
[175] HST-HGN: Heterogeneous Spatial-Temporal Hypergraph Networks with Bidirectional State Space Models for Global Fatigue Assessment
Changdao Chen
Main category: cs.CV
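TL;DR: This paper proposes HST-HGN, which combines a hierarchical hypergraph network with a bidirectional state space model (Bi-Mamba) to model high-order synergies and long-range temporal evolution of facial expressions at low computational cost, improving the accuracy and real-time capability of driver fatigue detection in untrimmed videos.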
Details
Motivation: Existing methods struggle to assess driver fatigue accurately from untrimmed videos under computational constraints, because long-range temporal dependencies in subtle facial expressions are hard to model: heavy models are too costly, while lightweight graph models lack high-order synergy and global temporal modeling.
Method: HST-HGN uses, spatially, a hierarchical hypergraph network that fuses pose-disentangled geometric topologies with multi-modal texture patches and, temporally, a linear-complexity Bi-Mamba module for bidirectional sequence modeling that explicitly filters the temporal evolution.
Result: The model reaches SOTA performance on several fatigue detection benchmarks and balances discriminative power with computational efficiency, making it suitable for real-time in-cabin edge deployment.
Conclusion: Through heterogeneous spatial-temporal hypergraphs and bidirectional state space modeling, HST-HGN addresses high-order synergy and long-range temporal modeling in a lightweight setting, offering a new approach to real-time fatigue detection.
Abstract: It remains challenging to assess driver fatigue from untrimmed videos under constrained computational budgets, due to the difficulty of modeling long-range temporal dependencies in subtle facial expressions. Some existing approaches rely on computationally heavy architectures, whereas others employ traditional lightweight pairwise graph networks, despite their limited capacity to model high-order synergies and global temporal context. Therefore, we propose HST-HGN, a novel Heterogeneous Spatial-Temporal Hypergraph Network driven by Bidirectional State Space Models. Spatially, we introduce a hierarchical hypergraph network to fuse pose-disentangled geometric topologies with multi-modal texture patches dynamically. This formulation encapsulates high-order synergistic facial deformations, effectively overcoming the limitations of conventional methods. Temporally, a Bi-Mamba module with linear complexity is applied to perform bidirectional sequence modeling. This explicit temporal-evolution filtering enables the network to distinguish highly ambiguous transient actions, such as yawning versus speaking, while encompassing their complete physiological lifecycles. Extensive evaluations across diverse fatigue benchmarks demonstrate that HST-HGN achieves state-of-the-art performance. In particular, our method strikes a balance between discriminative power and computational efficiency, making it well-suited for real-time in-cabin edge deployment.
[176] CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning
Rui Gan,Junyi Ma,Pei Li,Xingyou Yang,Kai Chen,Sikai Chen,Bin Ran
Main category: cs.CV
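TL;DR: This paper presents CrashSight, a large-scale vision-language benchmark built on real-world roadside camera data for evaluating models on roadway crash understanding, with a focus on the infrastructure perspective in cooperative autonomous driving.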
Details
Motivation: The performance of vision-language models (VLMs) in safety-critical traffic scenarios is insufficiently evaluated, because mainstream benchmarks focus on the ego-vehicle view and overlook the roadside infrastructure perspective.
Method: The CrashSight benchmark contains 250 crash videos and 13K multiple-choice question-answer pairs under a two-tier taxonomy: Tier 1 evaluates visual grounding of scene context and involved parties; Tier 2 probes higher-level reasoning about crash mechanics, causal attribution, temporal progression, and post-crash outcomes. Eight state-of-the-art VLMs are evaluated systematically.
Result: Current VLMs describe scenes well but remain markedly weak at temporal and causal reasoning; the paper also analyzes typical failure cases in detail.
Conclusion: CrashSight provides a standardized evaluation framework for infrastructure-assisted perception in cooperative autonomous driving and points to research directions for improving VLM reasoning in safety-critical scenarios.
Abstract: Cooperative autonomous driving requires traffic scene understanding from both vehicle and infrastructure perspectives. While vision-language models (VLMs) show strong general reasoning capabilities, their performance in safety-critical traffic scenarios remains insufficiently evaluated due to the ego-vehicle focus of existing benchmarks. To bridge this gap, we present CrashSight, a large-scale vision-language benchmark for roadway crash understanding using real-world roadside camera data. The dataset comprises 250 crash videos, annotated with 13K multiple-choice question-answer pairs organized under a two-tier taxonomy. Tier 1 evaluates the visual grounding of scene context and involved parties, while Tier 2 probes higher-level reasoning, including crash mechanics, causal attribution, temporal progression, and post-crash outcomes. We benchmark 8 state-of-the-art VLMs and show that, despite strong scene description capabilities, current models struggle with temporal and causal reasoning in safety-critical scenarios. We provide a detailed analysis of failure scenarios and discuss directions for improving VLM crash understanding. The benchmark provides a standardized evaluation framework for infrastructure-assisted perception in cooperative autonomous driving. The CrashSight benchmark, including the full dataset and code, is accessible at https://mcgrche.github.io/crashsight.
[177] OVS-DINO: Open-Vocabulary Segmentation via Structure-Aligned SAM-DINO with Language Guidance
Haoxi Zeng,Qiankun Liu,Yi Bin,Haiyue Zhang,Yujuan Ding,Guoqing Wang,Deqiang Ouyang,Heng Tao Shen
Main category: cs.CV
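TL;DR: This paper proposes OVS-DINO, a framework that aligns DINO with SAM's structural priors to restore boundary awareness in DINO's deep features, significantly improving open-vocabulary segmentation in complex scenes.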
Details
Motivation: Existing CLIP- and DINO-based open-vocabulary segmentation methods still fall short in fine-grained spatial perception, especially edge perception, and the boundary information inherent in DINO's internal representations attenuates progressively with network depth.
Method: A Structure-Aware Encoder (SAE) and a Structure-Modulated Decoder (SMD) use structural priors from SAM to activate DINO's boundary features, supervised by SAM-generated pseudo-masks.
Result: State of the art on multiple weakly-supervised OVS benchmarks, raising the average score by 2.1 percentage points (from 44.8% to 46.9%) and the Cityscapes score by 6.3 percentage points (from 36.6% to 42.9%).
Conclusion: Structural alignment effectively restores DINO's edge sensitivity, showing that combining semantic generalization with structural priors is a key route to better open-vocabulary dense prediction.
Abstract: Open-Vocabulary Segmentation (OVS) aims to segment image regions beyond predefined category sets by leveraging semantic descriptions. While CLIP-based approaches excel in semantic generalization, they frequently lack the fine-grained spatial awareness required for dense prediction. Recent efforts have incorporated Vision Foundation Models (VFMs) like DINO to alleviate these limitations. However, these methods still struggle with the precise edge perception necessary for high-fidelity segmentation. In this paper, we analyze internal representations of DINO and discover that its inherent boundary awareness is not absent but rather undergoes progressive attenuation as features transition into deeper transformer blocks. To address this, we propose OVS-DINO, a novel framework that revitalizes the latent edge sensitivity of DINO through structural alignment with the Segment Anything Model (SAM). Specifically, we introduce a Structure-Aware Encoder (SAE) and a Structure-Modulated Decoder (SMD) to effectively activate boundary features of DINO using SAM's structural priors, complemented by a supervision strategy utilizing SAM-generated pseudo-masks. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple weakly-supervised OVS benchmarks, improving the average score by 2.1% (from 44.8% to 46.9%). Notably, our approach significantly enhances segmentation accuracy in complex, cluttered scenarios, with a gain of 6.3% on Cityscapes (from 36.6% to 42.9%).
[178] LAMP: Lift Image-Editing as General 3D Priors for Open-world Manipulation
Jingjing Wang,Zhengdong Hong,Chong Bao,Yuke Zhu,Junhan Sun,Guofeng Zhang
Main category: cs.CV
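TL;DR: This paper proposes LAMP, which uses image editing as a 3D prior to extract continuous, geometry-aware 3D transformations between objects, improving generalization in open-world robotic manipulation.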
Details
Motivation: Existing learning-based methods (reinforcement learning, imitation learning, and vision-language-action models) generalize poorly to novel tasks and unseen environments, while large language models and vision-language models have limited 3D awareness and cannot support fine-grained manipulation.
Method: LAMP exploits the rich 2D spatial cues implicit in image editing and lifts them into continuous, geometry-aware inter-object 3D transformations that serve as a general representation for open-world manipulation.
Result: Experiments show that LAMP outputs accurate 3D transformations and achieves strong zero-shot generalization on open-world manipulation tasks.
Conclusion: By introducing image editing as a 3D prior, LAMP improves a robot's ability to generalize to new tasks and environments in the open world and offers a new approach to geometry-aware representation learning.
Abstract: Human-like generalization in open-world settings remains a fundamental challenge for robotic manipulation. Existing learning-based methods, including reinforcement learning, imitation learning, and vision-language-action models (VLAs), often struggle with novel tasks and unseen environments. Another promising direction is to explore generalizable representations that capture fine-grained spatial and geometric relations for open-world manipulation. While large language models (LLMs) and vision-language models (VLMs) provide strong semantic reasoning based on language or annotated 2D representations, their limited 3D awareness restricts their applicability to fine-grained manipulation. To address this, we propose LAMP, which lifts image editing as a 3D prior to extract inter-object 3D transformations as continuous, geometry-aware representations. Our key insight is that image editing inherently encodes rich 2D spatial cues, and lifting these implicit cues into 3D transformations provides fine-grained and accurate guidance for open-world manipulation. Extensive experiments demonstrate that LAMP delivers precise 3D transformations and achieves strong zero-shot generalization in open-world manipulation. Project page: https://zju3dv.github.io/LAMP/.
[179] Faithful GRPO: Improving Visual Spatial Reasoning in Multimodal Language Models via Constrained Policy Optimization
Sai Srinivas Kancheti,Aditya Kanade,Rohit Sinha,Vineeth N Balasubramanian,Tanuja Ganu
Main category: cs.CV
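TL;DR: This paper proposes Faithful GRPO (FGRPO), a reinforcement learning variant that uses Lagrangian dual ascent to explicitly constrain the logical consistency and visual grounding of chain-of-thought (CoT) reasoning during training, significantly improving both the reasoning quality and the answer accuracy of multimodal reasoning models on spatial reasoning tasks.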
Details
Motivation: Existing RL-trained multimodal reasoning models (such as ViGoRL-Spatial, TreeVGR, and models trained with standard GRPO) improve answer accuracy, but their reasoning chains are often inconsistent with the final answer and describe the image evidence inaccurately; that is, the reasoning is unfaithful.
Method: FGRPO adds batch-level logical consistency and visual grounding constraints to the standard GRPO framework, adapts the constraint weights via Lagrangian dual ascent, and folds them into the within-group advantage computation.
Result: On Qwen2.5-VL-7B/3B across seven spatial reasoning datasets, FGRPO reduces the inconsistency rate from 24.5% to 1.7%, improves visual grounding scores by 13%, and achieves higher answer accuracy than standard GRPO.
Conclusion: Enforcing logical consistency and visual grounding in reasoning chains improves reasoning quality and also benefits final-answer performance, confirming that faithful reasoning is a key route to stronger multimodal reasoning.
Abstract: Multimodal reasoning models (MRMs) trained with reinforcement learning with verifiable rewards (RLVR) show improved accuracy on visual reasoning benchmarks. However, we observe that accuracy gains often come at the cost of reasoning quality: generated Chain-of-Thought (CoT) traces are frequently inconsistent with the final answer and poorly grounded in the visual evidence. We systematically study this phenomenon across seven challenging real-world spatial reasoning benchmarks and find that it affects contemporary MRMs such as ViGoRL-Spatial and TreeVGR, as well as our own models trained with standard Group Relative Policy Optimization (GRPO). We characterize CoT reasoning quality along two complementary axes: "logical consistency" (does the CoT entail the final answer?) and "visual grounding" (does each reasoning step accurately describe objects, attributes, and spatial relationships in the image?). To address this, we propose Faithful GRPO (FGRPO), a variant of GRPO that enforces consistency and grounding as constraints via Lagrangian dual ascent. FGRPO incorporates batch-level consistency and grounding constraints into the advantage computation within a group, adaptively adjusting the relative importance of constraints during optimization. We evaluate FGRPO on Qwen2.5-VL-7B and 3B backbones across seven spatial datasets. Our results show that FGRPO substantially improves reasoning quality, reducing the inconsistency rate from 24.5% to 1.7% and improving visual grounding scores by 13%.
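As a rough illustration of the constraint mechanism described in this abstract, the Python sketch below shows one way Lagrangian dual ascent could adapt constraint weights and fold them into group-relative advantages. It is not the authors' implementation: the function names (`dual_ascent_step`, `group_advantages`), the thresholds, the learning rate, and the additive reward shaping are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def dual_ascent_step(lmbda, batch_scores, threshold, lr=0.1):
    """One dual-ascent update for a constraint of the form mean(score) >= threshold.

    The multiplier grows while the batch mean is below the threshold and
    shrinks (never below zero) once the constraint is satisfied.
    All hyperparameters here are hypothetical.
    """
    violation = threshold - mean(batch_scores)
    return max(0.0, lmbda + lr * violation)


def group_advantages(rewards, consistency, grounding, lam_c, lam_g, eps=1e-8):
    """Group-relative advantages on a Lagrangian-shaped reward (assumed form).

    Each sample's task reward is augmented with its consistency and grounding
    scores, weighted by the current multipliers, then standardized within the group.
    """
    shaped = [r + lam_c * c + lam_g * g
              for r, c, g in zip(rewards, consistency, grounding)]
    mu, sigma = mean(shaped), pstdev(shaped)
    return [(s - mu) / (sigma + eps) for s in shaped]


# Two responses with equal task reward; only the first is logically consistent.
lam_c = dual_ascent_step(0.0, [1.0, 0.0], threshold=0.9)
adv = group_advantages([1.0, 1.0], [1.0, 0.0], [0.0, 0.0], lam_c=lam_c, lam_g=0.0)
print(lam_c, adv)
```

Under this update rule a multiplier increases only while its batch-level score stays below target, so the relative weight of each constraint adapts during training instead of being fixed by hand.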
It also improves final answer accuracy over simple GRPO, demonstrating that faithful reasoning enables better answers.
[180] Novel View Synthesis as Video Completion
Qi Wu,Khiem Vuong,Minsik Jeon,Srinivasa Narasimhan,Deva Ramanan
Main category: cs.CV
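TL;DR: This paper proposes FrameCrafter, which casts sparse novel view synthesis (NVS) as low-frame-rate video completion. It modifies a video diffusion model (for example, by removing temporal positional encodings and adding per-frame latent encodings) to be invariant to input view order, exploiting the multi-view knowledge implicit in video models to generate high-quality novel views from a few multi-view images.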
Details
Motivation: Existing methods built on single-image diffusion models lack multi-view priors; video diffusion models naturally carry multi-view consistency knowledge and suit sparse NVS better, but their sensitivity to input order must be addressed.
Method: Sparse NVS is reformulated as video completion over a sequence of views. The FrameCrafter architecture uses per-frame latent encodings, removes temporal positional embeddings, and introduces permutation invariance so that the video model accepts unordered sparse views.
Result: Competitive performance on sparse-view NVS benchmarks, showing that video models can "forget" temporal information with minimal supervision and transfer effectively to NVS.
Conclusion: With lightweight modifications, video diffusion models adapt efficiently to sparse NVS, confirming the feasibility and advantage of exploiting their implicit multi-view knowledge.
Abstract: We tackle the problem of sparse novel view synthesis (NVS) using video diffusion models; given $K$ ($\approx 5$) multi-view images of a scene and their camera poses, we predict the view from a target camera pose. Many prior approaches leverage generative image priors encoded via diffusion models. However, models trained on single images lack multi-view knowledge. We instead argue that video models already contain implicit multi-view knowledge and so should be easier to adapt for NVS. Our key insight is to formulate sparse NVS as a low frame-rate video completion task. However, one challenge is that sparse NVS is defined over an unordered set of inputs, often too sparse to admit a meaningful order, so the models should be invariant to permutations of that input set. To this end, we present FrameCrafter, which adapts video models (naturally trained with coherent frame orderings) to permutation-invariant NVS through several architectural modifications, including per-frame latent encodings and removal of temporal positional embeddings. Our results suggest that video models can be easily trained to "forget" about time with minimal supervision, producing competitive performance on sparse-view NVS benchmarks. Project page: https://frame-crafter.github.io/
[181] Quantifying Explanation Consistency: The C-Score Metric for CAM-Based Explainability in Medical Image Classification
Kabilan Elangovan,Daniel Ting
Main category: cs.CV
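TL;DR: This paper proposes the C-Score, an annotation-free, confidence-weighted consistency metric that measures how stable the spatial reasoning strategy of CAM-based explanation methods is across patients with the same pathology. It reveals several hidden dissociations between classification performance (AUC) and explanation consistency, providing early warnings for model deployment and architecture-level clinical recommendations.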
Details
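Motivation: Existing CAM evaluation frameworks only assess localisation accuracy against radiologist annotations and ignore whether explanations are consistent across patients with the same pathology (whether the model applies the same spatial reasoning strategy), which can yield high classification performance with untrustworthy explanations.
Method: The C-Score is an annotation-free consistency metric based on intensity-weighted pairwise soft IoU over correctly classified samples. On the Kermany chest X-ray dataset, 6 CAM methods × 3 CNN architectures × 30 training epochs are evaluated, covering the transfer learning and fine-tuning phases.
Result: Three new mechanisms of AUC-consistency dissociation are identified: threshold-mediated "gold list collapse", technique-specific attribution collapse at peak AUC, and class-level inconsistency masked by global aggregation. The C-Score warns one checkpoint before catastrophic AUC collapse (for example, ScoreCAM degradation on ResNet50V2) and supports architecture-specific clinical deployment recommendations.
Conclusion: The C-Score fills a gap in evaluating consistency in explainable AI and is an earlier, more robust indicator of model stability than conventional classification metrics, moving clinical AI from pursuing predictive accuracy alone toward also valuing robust, trustworthy reasoning.
Abstract: Class Activation Mapping (CAM) methods are widely used to generate visual explanations for deep learning classifiers in medical imaging. However, existing evaluation frameworks assess whether explanations are correct, measured by localisation fidelity against radiologist annotations, rather than whether they are consistent: whether the model applies the same spatial reasoning strategy across different patients with the same pathology. We propose the C-Score (Consistency Score), a confidence-weighted, annotation-free metric that quantifies intra-class explanation reproducibility via intensity-emphasised pairwise soft IoU across correctly classified instances. We evaluate six CAM techniques (GradCAM, GradCAM++, LayerCAM, EigenCAM, ScoreCAM, and MS GradCAM++) across three CNN architectures (DenseNet201, InceptionV3, ResNet50V2) over thirty training epochs on the Kermany chest X-ray dataset, covering transfer learning and fine-tuning phases. We identify three distinct mechanisms of AUC-consistency dissociation, invisible to standard classification metrics: threshold-mediated gold list collapse, technique-specific attribution collapse at peak AUC, and class-level consistency masking in global aggregation. C-Score provides an early warning signal of impending model instability.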
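To make the idea of a pairwise soft-IoU consistency measure concrete, the Python sketch below computes a confidence-weighted average of soft IoU over all pairs of heatmaps from one class. It is only an illustration in the spirit of the metric: the exact C-Score formula is not given in this abstract, so the power-law intensity emphasis (`gamma`), the product-of-confidences pair weight, and the function names `soft_iou` and `consistency_score` are assumptions.

```python
from itertools import combinations


def soft_iou(a, b, gamma=2.0, eps=1e-8):
    """Soft IoU of two flattened heatmaps with values in [0, 1].

    Raising activations to the power gamma emphasises high-intensity
    regions (an assumed form of intensity emphasis).
    """
    a = [x ** gamma for x in a]
    b = [y ** gamma for y in b]
    inter = sum(min(x, y) for x, y in zip(a, b))
    union = sum(max(x, y) for x, y in zip(a, b))
    return inter / (union + eps)


def consistency_score(maps, confidences, gamma=2.0):
    """Confidence-weighted mean soft IoU over all pairs of same-class heatmaps.

    `maps` holds heatmaps of correctly classified instances of one class;
    each pair is weighted by the product of the two prediction confidences
    (a hypothetical weighting choice).
    """
    num = den = 0.0
    for i, j in combinations(range(len(maps)), 2):
        w = confidences[i] * confidences[j]
        num += w * soft_iou(maps[i], maps[j], gamma)
        den += w
    return num / den if den > 0 else 0.0


# Three toy 2x2 heatmaps (flattened) for one class, with model confidences.
maps = [[0.9, 0.1, 0.0, 0.0], [0.8, 0.2, 0.0, 0.0], [0.0, 0.0, 0.1, 0.9]]
print(consistency_score(maps, [0.95, 0.90, 0.60]))
```

A score near 1 means the heatmaps highlight the same regions across instances, while a score near 0 means the highlighted regions barely overlap; no radiologist annotation is needed in either case.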
ScoreCAM deterioration on ResNet50V2 is detectable one full checkpoint before catastrophic AUC collapse, and the metric yields architecture-specific clinical deployment recommendations grounded in explanation quality rather than predictive ranking alone.
[182] Phantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics
Ying Shen,Jerry Xiong,Tianjiao Yu,Ismini Lourentzou
Main category: cs.CV
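TL;DR: This paper proposes Phantom, a model that jointly models visual content and latent physical dynamics during video generation to improve the physical consistency and visual realism of generated videos.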
Details
Motivation: Existing generative video models achieve high visual fidelity but neither understand nor follow real-world physical laws, producing unrealistic motion and dynamics.
Method: Phantom embeds the inference of physical properties into the video generation pipeline; it builds a physics-aware video representation and jointly predicts latent physical dynamics and future frames without explicitly specifying complex physical equations.
Result: On standard video generation and physics-aware benchmarks, Phantom outperforms existing methods in physical consistency while keeping competitive perceptual quality.
Conclusion: Integrating physical reasoning directly into generation improves the physical plausibility of videos without sacrificing visual realism, offering a new paradigm for physics-driven video generation.
Abstract: Recent advances in generative video modeling, driven by large-scale datasets and powerful architectures, have yielded remarkable visual realism. However, emerging evidence suggests that simply scaling data and model size does not endow these systems with an understanding of the underlying physical laws that govern real-world dynamics. Existing approaches often fail to capture or enforce such physical consistency, resulting in unrealistic motion and dynamics. In this work, we investigate whether integrating the inference of latent physical properties directly into the video generation process can equip models with the ability to produce physically plausible videos. To this end, we propose Phantom, a Physics-Infused Video Generation model that jointly models the visual content and latent physical dynamics. Conditioned on observed video frames and inferred physical states, Phantom jointly predicts latent physical dynamics and generates future video frames. Phantom leverages a physics-aware video representation that serves as an abstract yet informative embedding of the underlying physics, facilitating the joint prediction of physical dynamics alongside video content without requiring an explicit specification of a complex set of physical dynamics and properties. By integrating the inference of the physics-aware video representation directly into the video generation process, Phantom produces video sequences that are both visually realistic and physically consistent.
Quantitative and qualitative results on both standard video generation and physics-aware benchmarks demonstrate that Phantom not only outperforms existing methods in terms of adherence to physical dynamics but also delivers competitive perceptual fidelity.
[183] Visually-grounded Humanoid Agents
Hang Ye,Xiaoxuan Ma,Fan Lu,Wayne Wu,Kwan-Yee Lin,Yizhou Wang
Main category: cs.CV
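TL;DR: This paper proposes Visually-grounded Humanoid Agents, a vision-based two-layer (world-agent) paradigm that lets digital humans actively and naturally perform goal-directed behaviors in novel scenes using only visual observations and specified goals.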
Details
Motivation: Most existing digital human systems are passively driven, rely on privileged state or scripted control, and scale poorly to new environments; this work aims for active, embodied digital human behavior that depends only on visual input and goal instructions.
Method: A two-layer architecture: the World Layer reconstructs semantically rich 3D Gaussian scenes from real-world videos via an occlusion-aware Gaussian reconstruction pipeline and supports animatable Gaussian human avatars; the Agent Layer gives these avatars first-person RGB-D perception, performs embodied planning with spatial awareness and iterative reasoning, and outputs full-body actions. A new benchmark for evaluating humanoid-scene interaction is also introduced.
Result: The method outperforms ablations and state-of-the-art planning methods in both task success rate and collision rate, achieving robust autonomous behavior.
Conclusion: The work advances scalable deployment of active digital humans and human-centric embodied AI; all data, code, and models will be open-sourced.
Abstract: Digital human generation has been studied for decades and supports a wide range of real-world applications. However, most existing systems are passively animated, relying on privileged state or scripted control, which limits scalability to novel environments. We instead ask: how can digital humans actively behave using only visual observations and specified goals in novel scenes? Achieving this would enable populating any 3D environment at scale with digital humans that exhibit spontaneous, natural, goal-directed behaviors. To this end, we introduce Visually-grounded Humanoid Agents, a coupled two-layer (world-agent) paradigm that replicates humans at multiple levels: they look, perceive, reason, and behave like real people in real-world 3D scenes. The World Layer reconstructs semantically rich 3D Gaussian scenes from real-world videos via an occlusion-aware pipeline and accommodates animatable Gaussian-based human avatars. The Agent Layer transforms these avatars into autonomous humanoid agents, equipping them with first-person RGB-D perception and enabling them to perform accurate, embodied planning with spatial awareness and iterative reasoning, which is then executed at the low level as full-body actions to drive their behaviors in the scene. We further introduce a benchmark to evaluate humanoid-scene interaction in diverse reconstructed environments. Experiments show our agents achieve robust autonomous behavior, yielding higher task success rates and fewer collisions than ablations and state-of-the-art planning methods. This work enables active digital human population and advances human-centric embodied AI.
Data, code, and models will be open-sourced.
[184] When Fine-Tuning Changes the Evidence: Architecture-Dependent Semantic Drift in Chest X-Ray Explanations
Kabilan Elangovan,Daniel Ting
Main category: cs.CV
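TL;DR: This paper studies the stability of a model's attribution structure across transfer learning and fine-tuning in medical image classification, introduces the concept of "semantic drift", and quantifies the drift across multiple architectures and attribution methods.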
Details
Motivation: Although transfer learning followed by fine-tuning improves diagnostic performance in medical image classification, accuracy gains do not guarantee stable visual evidence behind predictions in multi-class settings with overlapping visual features.
Method: On a five-class chest X-ray task, DenseNet201, ResNet50V2, and InceptionV3 are trained in two stages, and semantic drift is quantified with reference-free metrics covering spatial localization and structural consistency of attribution maps.
Result: Coarse anatomical localization stays stable, but IoU overlap reveals pronounced, architecture-dependent reorganization of the evidential structure; LayerCAM and GradCAM++ can give opposite stability rankings under converged predictive performance.
Conclusion: Explanation stability results from the interaction of model architecture, optimization phase, and attribution objective, so model trustworthiness cannot be judged by classification accuracy alone.
Abstract: Transfer learning followed by fine-tuning is widely adopted in medical image classification due to consistent gains in diagnostic performance. However, in multi-class settings with overlapping visual features, improvements in accuracy do not guarantee stability of the visual evidence used to support predictions. We define semantic drift as systematic changes in the attribution structure supporting a model's predictions between transfer learning and full fine-tuning, reflecting potential shifts in underlying visual reasoning despite stable classification performance. Using a five-class chest X-ray task, we evaluate DenseNet201, ResNet50V2, and InceptionV3 under a two-stage training protocol and quantify drift with reference-free metrics capturing spatial localization and structural consistency of attribution maps. Across architectures, coarse anatomical localization remains stable, while overlap IoU reveals pronounced architecture-dependent reorganization of evidential structure. Beyond single-method analysis, stability rankings can reverse across LayerCAM and GradCAM++ under converged predictive performance, establishing explanation stability as an interaction between architecture, optimization phase, and attribution objective.
[185] MolmoWeb: Open Visual Web Agent and Open Data for the Open Web
Tanmay Gupta,Piper Wolters,Zixian Ma,Peter Sushko,Rock Yuren Pang,Diego Llanes,Yue Yang,Taira Anderson,Boyuan Zheng,Zhongzheng Ren,Harsh Trivedi,Taylor Blanton,Caleb Ouellette,Winson Han,Ali Farhadi,Ranjay Krishna
Main category: cs.CV
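TL;DR: This paper presents MolmoWeb, a fully open-source multimodal web agent, and its training dataset MolmoWebMix, to promote open and reproducible web agent research. The agent predicts browser actions from webpage screenshots and task instructions alone, without HTML or API access, achieves state-of-the-art results on several benchmarks, and all models, data, and code are to be released.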
Details
Motivation: The strongest web agents rely on closed models and lack transparency and reproducibility, which hinders scientific progress and community collaboration; the authors argue that agents for the open web should be built in the open.
Method: A large mixed dataset, MolmoWebMix (over 100K synthetic trajectories, 30K+ human demonstrations, atomic web-skill trajectories, and GUI perception data), is used to train MolmoWeb (4B and 8B parameters), an instruction-conditioned vision-language action policy that takes only a webpage screenshot and a natural-language instruction and directly outputs browser actions.
Result: On WebVoyager, Online-Mind2Web, and DeepShop, MolmoWeb outperforms open models of similar scale (such as Fara-7B and UI-Tars-1.5-7B) and some SoM agents built on large closed models (such as GPT-4o); test-time scaling with best-of-N selection further improves performance (for example, 94.7% pass@4 on WebVoyager).
Conclusion: MolmoWeb shows that a fully open, vision-only web agent can match or surpass closed alternatives, and its complete release (models, data, code, and evaluation harness) gives the community a solid foundation for open web agent research and applications.
Abstract: Web agents--autonomous systems that navigate and execute tasks on the web on behalf of users--have the potential to transform how people interact with the digital world. However, the most capable web agents today rely on proprietary models with undisclosed training data and recipes, limiting scientific understanding, reproducibility, and community-driven progress. We believe agents for the open web should be built in the open. To this end, we introduce (1) MolmoWebMix, a large and diverse mixture of browser task demonstrations and web-GUI perception data and (2) MolmoWeb, a family of fully open multimodal web agents. Specifically, MolmoWebMix combines over 100K synthetic task trajectories from multiple complementary generation pipelines with 30K+ human demonstrations, atomic web-skill trajectories, and GUI perception data, including referring expression grounding and screenshot question answering. MolmoWeb agents operate as instruction-conditioned visual-language action policies: given a task instruction and a webpage screenshot, they predict the next browser action, requiring no access to HTML, accessibility trees, or specialized APIs. Available in 4B and 8B sizes, MolmoWeb agents achieve state-of-the-art results on browser-use benchmarks such as WebVoyager, Online-Mind2Web, and DeepShop, outperforming similar-scale open-weight-only models such as Fara-7B, UI-Tars-1.5-7B, and Holo1-7B.
MolmoWeb-8B also surpasses set-of-marks (SoM) agents built on much larger closed frontier models like GPT-4o. We further demonstrate consistent gains through test-time scaling via parallel rollouts with best-of-N selection, achieving 94.7% and 60.5% pass@4 (compared to 78.2% and 35.3% pass@1) on WebVoyager and Online-Mind2Web respectively. We will release model checkpoints, training data, code, and a unified evaluation harness to enable reproducibility and accelerate open research on web agents.
[186] UniversalVTG: A Universal and Lightweight Foundation Model for Video Temporal Grounding
Joungbin An,Agrim Jain,Kristen Grauman
Main category: cs.CV
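TL;DR: This paper proposes UniversalVTG, a lightweight, universal video temporal grounding model that reaches state-of-the-art performance on multiple benchmarks through cross-dataset pretraining and query unification, with far fewer parameters than existing MLLM-based methods.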
Details
Motivation: Existing VTG methods are mostly dataset-specific and generalize poorly, while MLLM-based methods are computationally expensive and have limited video context, making long videos hard to handle.
Method: UniversalVTG (1) uses an offline Query Unifier to map heterogeneous query formats into a shared declarative space, reducing linguistic mismatch and negative transfer; (2) adds an efficient grounding head that supports long, untrimmed videos; and (3) is pretrained with unified supervision on large-scale cross-dataset data.
Result: A single UniversalVTG checkpoint achieves state-of-the-art results on GoalStep-StepGrounding, Ego4D-NLQ, TACoS, Charades-STA, and ActivityNet-Captions; it is more than 100 times smaller than MLLM-based methods while matching or exceeding their accuracy on multiple benchmarks.
Conclusion: Unified supervision with a lightweight architecture improves the generality and practicality of VTG models, enabling high-performance long-video grounding without large MLLMs.
Abstract: Video temporal grounding (VTG) is typically tackled with dataset-specific models that transfer poorly across domains and query styles. Recent efforts to overcome this limitation have adapted large multimodal language models (MLLMs) to VTG, but their high compute cost and limited video context still hinder long-video grounding. We instead scale unified supervision while keeping the model lightweight. We present UniversalVTG, a single VTG model trained with large-scale cross-dataset pretraining. An offline Query Unifier canonicalizes heterogeneous query formats into a shared declarative space, reducing linguistic mismatch and preventing the negative transfer observed under naïve joint training. Combined with an efficient grounding head, UniversalVTG scales to long, untrimmed videos. Across diverse benchmarks (GoalStep-StepGrounding, Ego4D-NLQ, TACoS, Charades-STA, and ActivityNet-Captions), one UniversalVTG checkpoint achieves state-of-the-art performance versus dedicated VTG models. Moreover, despite being $>100\times$ smaller than recent MLLM-based approaches, UniversalVTG matches or exceeds their accuracy on multiple benchmarks, offering a practical alternative to parameter-heavy MLLMs.
[187] FIT: A Large-Scale Dataset for Fit-Aware Virtual Try-On
Johanna Karras,Yuanhao Wang,Yingwei Li,Ira Kemelmacher-Shlizerman
Main category: cs.CV
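TL;DR: This paper presents FIT, the first large-scale virtual try-on dataset with precise body and garment measurements, and builds on it the first virtual try-on model that accounts for garment fit.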
Details
Motivation: Existing virtual try-on methods ignore garment fit (for example, a large garment on a small body), mainly because training data with precise size annotations is lacking, especially for ill-fit cases.
Method: The FIT dataset is built by (1) generating 3D garments with GarmentCode and draping them via physics simulation; (2) using a new re-texturing framework to turn synthetic renderings into photorealistic images while strictly preserving geometry; and (3) adding person identity preservation to re-texturing to generate paired images of the same person in different garments. A fit-aware VTO baseline is trained on FIT.
Result: FIT contains over 1.13M try-on triplets with precise body and garment measurements; the trained model is the first VTO model to reflect realistic garment fit, sets a new state of the art, and provides a public benchmark.
Conclusion: FIT fills the data gap for fit-aware virtual try-on and provides the first high-quality benchmark and a feasible technical path for this direction, moving VTO from visual realism toward physical realism.
Abstract: Given a person and a garment image, virtual try-on (VTO) aims to synthesize a realistic image of the person wearing the garment, while preserving their original pose and identity. Although recent VTO methods excel at visualizing garment appearance, they largely overlook a crucial aspect of the try-on experience: the accuracy of garment fit -- for example, depicting how an extra-large shirt looks on an extra-small person. A key obstacle is the absence of datasets that provide precise garment and body size information, particularly for "ill-fit" cases, where garments are significantly too large or too small. Consequently, current VTO methods default to generating well-fitted results regardless of the garment or person size. In this paper, we take the first steps towards solving this open problem. We introduce FIT (Fit-Inclusive Try-on), a large-scale VTO dataset comprising over 1.13M try-on image triplets accompanied by precise body and garment measurements. We overcome the challenges of data collection via a scalable synthetic strategy: (1) We programmatically generate 3D garments using GarmentCode and drape them via physics simulation to capture realistic garment fit. (2) We employ a novel re-texturing framework to transform synthetic renderings into photorealistic images while strictly preserving geometry. (3) We introduce person identity preservation into our re-texturing model to generate paired person images (same person, different garments) for supervised training. Finally, we leverage our FIT dataset to train a baseline fit-aware virtual try-on model.
Our data and results set a new state of the art for fit-aware virtual try-on and offer a robust benchmark for future research. We will make all data and code publicly available on our project page: https://johannakarras.github.io/FIT.
[188] Self-Improving 4D Perception via Self-Distillation
Nan Huang,Pengcheng Yu,Weijia Zeng,James M. Rehg,Angjoo Kanazawa,Haiwen Feng,Qianqian Wang
Main category: cs.CV
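TL;DR: SelfEvo is a label-free self-improving framework that uses spatiotemporal context asymmetry for self-distillation, continually improving the 4D perception performance of multi-view reconstruction models in dynamic scenes.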
Details
Motivation: Large-scale multi-view reconstruction models depend heavily on expensive, scarce ground-truth 3D/4D annotations, which limits scaling, especially for dynamic scenes.
Method: SelfEvo uses a self-distillation scheme based on spatiotemporal context asymmetry, and the paper systematically studies the design choices needed for effective self-improvement (loss signals, forms of asymmetry, and training strategies).
Result: On eight benchmarks spanning different datasets and domains, SelfEvo consistently improves pretrained baselines (such as VGGT and $π^3$), with up to 36.5% relative improvement in video depth estimation and 20.1% in camera estimation, without any labeled data.
Conclusion: SelfEvo shows that multi-view 4D reconstruction models can be improved continually using only unlabeled videos, offering a new paradigm for low-cost, scalable 4D perception.
Abstract: Large-scale multi-view reconstruction models have made remarkable progress, but most existing approaches still rely on fully supervised training with ground-truth 3D/4D annotations. Such annotations are expensive and particularly scarce for dynamic scenes, limiting scalability. We propose SelfEvo, a self-improving framework that continually improves pretrained multi-view reconstruction models using unlabeled videos. SelfEvo introduces a self-distillation scheme using spatiotemporal context asymmetry, enabling self-improvement for learning-based 4D perception without external annotations. We systematically study design choices that make self-improvement effective, including loss signals, forms of asymmetry, and other training strategies. Across eight benchmarks spanning diverse datasets and domains, SelfEvo consistently improves pretrained baselines and generalizes across base models (e.g., VGGT and $π^3$), with significant gains on dynamic scenes. Overall, SelfEvo achieves up to 36.5% relative improvement in video depth estimation and 20.1% in camera estimation, without using any labeled data. Project Page: https://self-evo.github.io/.
[189] RewardFlow: Generate Images by Optimizing What You Reward
Onkar Susladkar,Dong-Hwan Jang,Tushar Prakash,Adheesh Juvekar,Vedant Shah,Ayush Barik,Nabeel Bashir,Muntasir Wahed,Ritish Shrirao,Ismini Lourentzou
Main category: cs.CV
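TL;DR: RewardFlow is an inversion-free, inference-time guidance framework that steers pretrained diffusion and flow-matching models through multi-reward Langevin dynamics. It combines multiple differentiable rewards, adds a differentiable VQA reward, and uses a prompt-aware adaptive policy to deliver high-quality image editing and compositional generation.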
Details
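Motivation: Existing methods struggle to balance semantic alignment, perceptual fidelity, localized grounding, object consistency, and human preference in image editing and compositional generation, and they lack fine-grained semantic supervision.
Method: RewardFlow (1) builds a multi-reward Langevin dynamics sampling process that integrates differentiable rewards for semantic alignment, perceptual fidelity, localized grounding, object consistency, and human preference; (2) designs a differentiable VQA reward that provides fine-grained semantic supervision through language-vision reasoning; and (3) develops a prompt-aware adaptive policy that extracts semantic primitives from the instruction, infers edit intent, and dynamically adjusts reward weights and step sizes.
Result: State-of-the-art performance on several image editing and compositional generation benchmarks, with clearly improved edit fidelity and compositional alignment.
Conclusion: RewardFlow shows that multi-objective optimization at inference time, without model modification or retraining, can substantially improve generation quality and controllability, offering a new paradigm for general controllable generation.
Abstract: We introduce RewardFlow, an inversion-free framework that steers pretrained diffusion and flow-matching models at inference time through multi-reward Langevin dynamics. RewardFlow unifies complementary differentiable rewards for semantic alignment, perceptual fidelity, localized grounding, object consistency, and human preference, and further introduces a differentiable VQA-based reward that provides fine-grained semantic supervision through language-vision reasoning. To coordinate these heterogeneous objectives, we design a prompt-aware adaptive policy that extracts semantic primitives from the instruction, infers edit intent, and dynamically modulates reward weights and step sizes throughout sampling. Across several image editing and compositional generation benchmarks, RewardFlow delivers state-of-the-art edit fidelity and compositional alignment.
[190] ParseBench: A Document Parsing Benchmark for AI Agents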
Boyang Zhang,Sebastián G. Acosta,Preston Carlson,Sacha Bron,Pierre-Loïc Doulcet,Simon Suo
Main category: cs.CV
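TL;DR: This paper presents ParseBench, a new document parsing benchmark designed around the needs of AI agents. It emphasizes semantic correctness across five dimensions (tables, charts, content faithfulness, semantic formatting, and visual grounding), evaluates 14 methods, and exposes capability gaps in existing systems.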
Details
Motivation: Existing benchmarks do not adequately reflect the semantic correctness that AI agents need from document parsing in enterprise automation (table structure, chart data precision, semantic formatting, and visual grounding); they rely on narrow document distributions and text-similarity metrics that miss critical failures.
Method: ParseBench comprises about 2,000 human-verified pages of enterprise documents (insurance, finance, and government), defines five capability dimensions, and systematically evaluates 14 methods, including multimodal models, specialized parsers, and LlamaParse.
Result: Current methods show fragmented capabilities, and none performs consistently well on all five dimensions; LlamaParse Agentic leads with the highest overall score (\agenticoverall\%), but clear capability gaps remain.
Conclusion: ParseBench fills the gap in agent-oriented document parsing evaluation, exposes the key challenges of semantically correct parsing, and gives future research and system development a standardized testbed and clear directions for improvement.
Abstract: AI agents are changing the requirements for document parsing. What matters is semantic correctness: parsed output must preserve the structure and meaning needed for autonomous decisions, including correct table structure, precise chart data, semantically meaningful formatting, and visual grounding. Existing benchmarks do not fully capture this setting for enterprise automation, relying on narrow document distributions and text-similarity metrics that miss agent-critical failures. We introduce ParseBench, a benchmark of ~2,000 human-verified pages from enterprise documents spanning insurance, finance, and government, organized around five capability dimensions: tables, charts, content faithfulness, semantic formatting, and visual grounding. Across 14 methods spanning vision-language models, specialized document parsers, and LlamaParse, the benchmark reveals a fragmented capability landscape: no method is consistently strong across all five dimensions. LlamaParse Agentic achieves the highest overall score at \agenticoverall\%, and the benchmark highlights the remaining capability gaps across current systems. Dataset and evaluation code are available on HuggingFace (https://huggingface.co/datasets/llamaindex/ParseBench) and GitHub (https://github.com/run-llama/ParseBench).
[191] Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction
Tao Xie,Peishan Yang,Yudong Jin,Yingfeng Cai,Wei Yin,Weiqiang Ren,Qian Zhang,Wei Hua,Sida Peng,Xiaoyang Guo,Xiaowei Zhou
Main category: cs.CV
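TL;DR: This paper proposes a neural global context representation, adapted quickly at test time by lightweight sub-networks under self-supervision, to improve the accuracy and consistency of large-scale 3D scene reconstruction from long video sequences.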
Details
Motivation: Existing feed-forward 3D reconstruction models degrade on long video sequences because of limited memory and difficulty capturing global context; inspired by how humans use global scene understanding to inform local perception, long-range scene context needs to be modeled.
Method: A neural global context representation, built from lightweight neural sub-networks that are quickly fine-tuned under self-supervision, efficiently compresses and retains long-range scene information and is integrated into the reconstruction pipeline.
Result: On large-scale benchmarks such as KITTI Odometry and Oxford Spires, the method achieves leading pose accuracy and state-of-the-art 3D reconstruction accuracy while remaining efficient.
Conclusion: The proposed global context representation markedly improves the accuracy, consistency, and scalability of large-scale 3D reconstruction over long sequences, balancing performance and efficiency.
Abstract: This paper addresses the task of large-scale 3D scene reconstruction from long video sequences. Recent feed-forward reconstruction models have shown promising results by directly regressing 3D geometry from RGB images without explicit 3D priors or geometric constraints. However, these methods often struggle to maintain reconstruction accuracy and consistency over long sequences due to limited memory capacity and the inability to effectively capture global contextual cues. In contrast, humans can naturally exploit the global understanding of the scene to inform local perception. Motivated by this, we propose a novel neural global context representation that efficiently compresses and retains long-range scene information, enabling the model to leverage extensive contextual cues for enhanced reconstruction accuracy and consistency. The context representation is realized through a set of lightweight neural sub-networks that are rapidly adapted during test time via self-supervised objectives, which substantially increases memory capacity without incurring significant computational overhead. The experiments on multiple large-scale benchmarks, including the KITTI Odometry and Oxford Spires datasets, demonstrate the effectiveness of our approach in handling ultra-large scenes, achieving leading pose accuracy and state-of-the-art 3D reconstruction accuracy while maintaining efficiency. Code is available at https://zju3dv.github.io/scal3r.
[192] E-3DPSM: A State Machine for Event-Based Egocentric 3D Human Pose Estimation
Mayur Deshmukh,Hiroyasu Akada,Helge Rhodin,Christian Theobalt,Vladislav Golyanik
Main category: cs.CV
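TL;DR: This paper proposes E-3DPSM, a continuous pose state machine tailored to the characteristics of event streams, for 3D human pose estimation from a monocular event camera on a head-mounted device, significantly improving accuracy and temporal stability.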
Details
Motivation: Existing methods are not fully adapted to the asynchronous, continuous nature of event streams, which leads to low 3D estimation accuracy and sensitivity to self-occlusion and temporal jitter, falling short of the needs of applications such as VR/AR.
Method: E-3DPSM aligns continuous human motion with fine-grained event dynamics. It evolves latent states and predicts event-driven continuous changes in 3D joint positions, then fuses them with direct 3D pose predictions to obtain stable, drift-free reconstructions.
Result: State of the art on two benchmarks, with accuracy (MPJPE) improved by up to 19% and temporal stability improved by up to 2.7x; runs in real time at 80 Hz on a single workstation.
Conclusion: E-3DPSM effectively models the continuity and asynchrony of event streams, offering a more robust and efficient paradigm for event-based egocentric 3D pose estimation.
Abstract: Event cameras offer multiple advantages in monocular egocentric 3D human pose estimation from head-mounted devices, such as millisecond temporal resolution, high dynamic range, and negligible motion blur. Existing methods effectively leverage these properties, but suffer from low 3D estimation accuracy, which is insufficient for many applications (e.g., immersive VR/AR). This is because their design is not fully tailored to event streams (e.g., their asynchronous and continuous nature), leading to high sensitivity to self-occlusions and temporal jitter in the estimates. This paper rethinks the setting and introduces E-3DPSM, an event-driven continuous pose state machine for event-based egocentric 3D human pose estimation. E-3DPSM aligns continuous human motion with fine-grained event dynamics; it evolves latent states and predicts continuous changes in 3D joint positions associated with observed events, which are fused with direct 3D human pose predictions, leading to stable and drift-free final 3D pose reconstructions. E-3DPSM runs in real time at 80 Hz on a single workstation and sets a new state of the art in experiments on two benchmarks, improving accuracy by up to 19% (MPJPE) and temporal stability by up to 2.7x. See our project page for the source code and trained models.
[193] Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models
Shilin Yan,Jintao Tong,Hongwei Xue,Xiaojun Tang,Yangyang Wang,Kunyu Shi,Guannan Zhang,Ruixuan Li,Yixiong Zou
Main category: cs.CV
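TL;DR: This paper proposes the HDPO framework, which decouples accuracy and efficiency into two optimization channels to address excessive tool invocation by multimodal agents, substantially reducing the number of tool calls while improving reasoning accuracy.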
Details
Motivation: Existing multimodal agents lack meta-cognitive ability: they cannot reasonably judge when to rely on internal knowledge and when to call external tools, which leads to blind tool invocation, high latency, and added noise. Existing reinforcement learning methods based on a scalarized reward face an optimization dilemma and struggle to balance tool use against task accuracy.
Method: HDPO abandons reward scalarization and builds two orthogonal optimization channels: an accuracy channel that maximizes task correctness, and an efficiency channel that enforces economy only within accurate trajectories via conditional advantage estimation. This naturally yields a cognitive curriculum of "master the task first, then refine self-reliance".
Result: The resulting model, Metis, reduces tool invocations by orders of magnitude across multiple evaluations while improving reasoning accuracy.
Conclusion: By conditioning the efficiency objective, HDPO effectively mitigates the meta-cognitive deficit of multimodal agents and offers a new paradigm for building efficient, low-latency, high-accuracy agents.
Abstract: The advent of agentic multimodal models has empowered systems to actively interact with external environments. However, current agents suffer from a profound meta-cognitive deficit: they struggle to arbitrate between leveraging internal knowledge and querying external utilities. Consequently, they frequently fall prey to blind tool invocation, resorting to reflexive tool execution even when queries are resolvable from the raw visual context. This pathological behavior precipitates severe latency bottlenecks and injects extraneous noise that derails sound reasoning. Existing reinforcement learning protocols attempt to mitigate this via a scalarized reward that penalizes tool usage. Yet, this coupled formulation creates an irreconcilable optimization dilemma: an aggressive penalty suppresses essential tool use, whereas a mild penalty is entirely subsumed by the variance of the accuracy reward during advantage normalization, rendering it impotent against tool overuse. To transcend this bottleneck, we propose HDPO, a framework that reframes tool efficiency from a competing scalar objective to a strictly conditional one. By eschewing reward scalarization, HDPO maintains two orthogonal optimization channels: an accuracy channel that maximizes task correctness, and an efficiency channel that enforces execution economy exclusively within accurate trajectories via conditional advantage estimation. This decoupled architecture naturally induces a cognitive curriculum, compelling the agent to first master task resolution before refining its self-reliance.
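The idea of an efficiency signal computed only within accurate trajectories can be illustrated with a minimal sketch. The abstract gives no formulas, so everything below is an assumption made for illustration: the group-wise normalization, the names `normalize` and `decoupled_advantages`, the `efficiency_weight` parameter, and the choice to sum the two channels. It is not HDPO's actual implementation.

```python
from statistics import mean, pstdev


def normalize(values, eps=1e-8):
    """Zero-mean, unit-variance scaling; all zeros if the values are constant."""
    if not values:
        return []
    mu = mean(values)
    sigma = pstdev(values)
    if sigma < eps:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def decoupled_advantages(correct, tool_calls, efficiency_weight=1.0):
    """Hypothetical two-channel advantage for one group of rollouts.

    correct:    list of bools, one per rollout (task solved or not)
    tool_calls: list of ints, number of tool invocations per rollout
    Returns (total, accuracy_channel, efficiency_channel).
    """
    # Accuracy channel: normalized over the whole group.
    acc_adv = normalize([1.0 if c else 0.0 for c in correct])

    # Efficiency channel: normalized only among the correct rollouts,
    # so fewer tool calls is rewarded relative to other correct rollouts.
    correct_idx = [i for i, c in enumerate(correct) if c]
    eff_among_correct = normalize([-float(tool_calls[i]) for i in correct_idx])

    # Incorrect rollouts receive no efficiency signal at all.
    eff_adv = [0.0] * len(correct)
    for i, a in zip(correct_idx, eff_among_correct):
        eff_adv[i] = a

    total = [a + efficiency_weight * e for a, e in zip(acc_adv, eff_adv)]
    return total, acc_adv, eff_adv


if __name__ == "__main__":
    total, acc, eff = decoupled_advantages(
        correct=[True, True, False, False],
        tool_calls=[0, 4, 0, 9],
    )
    print(acc, eff, total)
```

In this sketch the tool count of an incorrect rollout never enters the efficiency channel, so the tool-use term cannot be swamped by the variance of the accuracy reward, and a wrong answer is never rewarded for using few tools.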
Extensive evaluations demonstrate that our resulting model, Metis, reduces tool invocations by orders of magnitude while simultaneously elevating reasoning accuracy.
[194] When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models
Zhengyang Sun,Yu Chen,Xin Zhou,Xiaofan Li,Xiwu Chen,Dingkang Liang,Xiang Bai
Main category: cs.CV
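TL;DR: This paper proposes NUMINA, a training-free framework that improves the accuracy of object counts in text-to-video diffusion models. It identifies inconsistencies between the prompt and the generated layout, then conservatively refines the layout and modulates cross-attention to achieve more accurate counts.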
Details
Motivation: Text-to-video diffusion models perform poorly at generating the number of objects specified in a prompt, so a method is needed that improves numerical alignment without additional training.
Method: NUMINA is a training-free identify-then-guide framework. It first selects discriminative self-attention and cross-attention heads to derive a countable latent layout, which exposes inconsistencies between the prompt and the layout. It then conservatively refines this layout and modulates cross-attention to guide regeneration.
Result: On the newly constructed CountBench, NUMINA improves the counting accuracy of the Wan2.1 family (1.3B, 5B, 14B) by up to 7.4%, 4.9%, and 5.5%, respectively, while also improving CLIP alignment and preserving temporal consistency.
Conclusion: Structural guidance effectively complements seed search and prompt enhancement, offering a practical path toward count-accurate text-to-video generation.
Abstract: Text-to-video diffusion models have enabled open-ended video synthesis, but often struggle with generating the correct number of objects specified in a prompt. We introduce NUMINA, a training-free identify-then-guide framework for improved numerical alignment. NUMINA identifies prompt-layout inconsistencies by selecting discriminative self- and cross-attention heads to derive a countable latent layout. It then refines this layout conservatively and modulates cross-attention to guide regeneration. On the introduced CountBench, NUMINA improves counting accuracy by up to 7.4% on Wan2.1-1.3B, and by 4.9% and 5.5% on 5B and 14B models, respectively. Furthermore, CLIP alignment is improved while maintaining temporal consistency. These results demonstrate that structural guidance complements seed search and prompt enhancement, offering a practical path toward count-accurate text-to-video diffusion. The code is available at https://github.com/H-EmbodVis/NUMINA.
[195] GaussiAnimate: Reconstruct and Rig Animatable Categories with Level of Dynamics
Jiaxin Wang,Dongxin Lyu,Zeyu Cai,Zhiyang Dou,Cheng Lin,Anpei Chen,Yuliang Xiu
Main category: cs.CV
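TL;DR: This paper proposes a Scaffold-Skin Rigging System called Skelebones. Through three steps, free-form bones (Bones), a Mean Curvature Skeleton (Skeleton), and non-parametric partwise motion matching (PartMM), it compresses the dynamics of 4D shapes into an efficient, controllable representation and significantly outperforms baselines such as LBS and BoB in reanimation performance and reconstruction fidelity.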
Details
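Motivation: Free-form bones capture non-rigid deformations effectively but lack the kinematic structure needed for intuitive control; existing methods struggle to offer both controllability and expressiveness.
Method: The Skelebones system has three steps: (1) compress temporally consistent deformable Gaussians into free-form bones that approximate non-rigid surface deformations; (2) extract a Mean Curvature Skeleton from the canonical Gaussians and refine it temporally so that it is category-agnostic, motion-adaptive, and topologically correct; (3) bind the skeleton and bones with non-parametric partwise motion matching (PartMM), synthesizing novel bone motions by matching, retrieving, and blending existing ones.
Result: Validated on synthetic and real-world datasets. For reanimation on unseen poses, PSNR improves by 17.3% over LBS and 21.7% over BoB. In the low-data regime (~1000 frames), PartMM improves RMSE by 48.4% over robust LBS and outperforms GRU- and MLP-based methods by more than 20%, while maintaining high reconstruction fidelity, particularly for characters with complex non-rigid dynamics.
Conclusion: Skelebones compresses the dynamics of 4D shapes into a compact, controllable, and expressive skelebones representation, offering a new paradigm for character animation driven by Gaussian radiance fields.
Abstract: Free-form bones that conform closely to the surface can effectively capture non-rigid deformations, but lack the kinematic structure necessary for intuitive control. Thus, we propose a Scaffold-Skin Rigging System, termed "Skelebones", with three key steps: (1) Bones: compress temporally-consistent deformable Gaussians into free-form bones, approximating non-rigid surface deformations; (2) Skeleton: extract a Mean Curvature Skeleton from canonical Gaussians and refine it temporally, ensuring a category-agnostic, motion-adaptive, and topology-correct kinematic structure; (3) Binding: bind the skeleton and bones via non-parametric partwise motion matching (PartMM), synthesizing novel bone motions by matching, retrieving, and blending existing ones. Collectively, these three steps enable us to compress the Level of Dynamics of 4D shapes into compact skelebones that are both controllable and expressive. We validate our approach on both synthetic and real-world datasets, achieving significant improvements in reanimation performance across unseen poses (17.3% PSNR gains over Linear Blend Skinning (LBS) and 21.7% over Bag-of-Bones (BoB)) while maintaining excellent reconstruction fidelity, particularly for characters exhibiting complex non-rigid surface dynamics.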
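The "match, retrieve, blend" pattern named in step (3) can be sketched generically as a nearest-neighbor lookup. The abstract does not specify PartMM's features, distance, or blending rule, so the flat feature vectors, the Euclidean distance, the inverse-distance weights, and the function name `part_motion_match` are all assumptions for illustration; this is not the authors' algorithm. In particular, real bone transforms contain rotations, which should not be averaged linearly as done here.

```python
import math


def part_motion_match(query, database, k=2, eps=1e-8):
    """Hypothetical non-parametric motion matching for one body part.

    query:    skeleton feature vector for the new pose (list of floats)
    database: list of (skeleton_feature, bone_motion) pairs observed
              in the captured sequence for this part
    Returns a bone motion vector blended from the k closest entries.
    """
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Match: score every stored example against the query pose.
    scored = [(distance(query, feat), motion) for feat, motion in database]

    # Retrieve: keep the k closest examples.
    nearest = sorted(scored, key=lambda pair: pair[0])[:k]

    # Blend: inverse-distance weighted average of the retrieved motions.
    weights = [1.0 / (d + eps) for d, _ in nearest]
    total = sum(weights)
    dim = len(nearest[0][1])
    return [
        sum(w * motion[j] for w, (_, motion) in zip(weights, nearest)) / total
        for j in range(dim)
    ]


if __name__ == "__main__":
    db = [
        ([0.0], [0.0, 0.0]),
        ([1.0], [2.0, 4.0]),
        ([10.0], [100.0, 100.0]),
    ]
    # The query sits halfway between the first two stored poses.
    print(part_motion_match([0.5], db, k=2))
```

Because nothing is trained, a lookup of this kind depends only on the examples stored for each part.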
Our Partwise Motion Matching algorithm demonstrates strong generalization to both Gaussian and mesh representations, especially in the low-data regime (~1000 frames), achieving a 48.4% RMSE improvement over robust LBS and outperforming GRU- and MLP-based learning methods by >20%. Code will be made publicly available for research purposes at cookmaker.cn/gaussianimate.
[196] ETCH-X: Robustify Expressive Body Fitting to Clothed Humans with Composable Datasets
Xiaoben Li,Jingyi Wu,Zeyu Cai,Yu Siyuan,Boqian Li,Yuliang Xiu
Main category: cs.CV
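TL;DR: This paper proposes ETCH-X, an improved body-fitting method with a modular two-stage design: tightness-aware fitting ("undress") and implicit dense correspondence ("dense fit"). It improves robustness to clothing, pose, and missing input as well as the ability to capture detail, and significantly outperforms the original ETCH on multiple datasets.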