cs.CL

[1] Contextual Earnings-22: A Speech Recognition Benchmark with Custom Vocabulary in the Wild

Berkin Durmus,Chen Cen,Eduardo Pacheco,Arda Okan,Atila Orhon

Main category: cs.CL

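TL;DR: This paper introduces the Contextual Earnings-22 dataset to address the gap in speech recognition performance between academic benchmarks and industrial settings. It emphasizes the importance of contextual conditioning, especially for rare and domain-specific custom vocabulary, and sets six strong baselines to validate contextual speech recognition.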

Details

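Motivation: Speech recognition accuracy has plateaued on academic benchmarks, yet industry and high-stakes domains still show substantial room for improvement. The authors argue that the core difference is contextual conditioning: academic benchmarks are dominated by common general vocabulary, whereas in real applications rare, context-defined custom vocabulary has a larger effect on usability, and no standardized contextual benchmark exists. Method: Build the open dataset Contextual Earnings-22 (based on Earnings-22) with realistic custom vocabulary contexts, and set six strong baselines covering the two dominant contextual speech recognition approaches: keyword prompting and keyword boosting. Result: Experiments show that both approaches reach comparable and significantly improved accuracy in large-scale systems, indicating that contextual modeling is effective in large-scale deployment. Conclusion: Contextual conditioning is a key factor in advancing practical speech recognition; Contextual Earnings-22 provides the first standardized benchmark for this direction and helps reveal progress that has been underestimated. Abstract: The accuracy frontier of speech-to-text systems has plateaued on academic benchmarks. In contrast, industrial benchmarks and adoption in high-stakes domains suggest otherwise. We hypothesize that the primary difference between the two is contextual conditioning: Academic benchmarks are dominated by frequently encountered general vocabulary that is relatively easy to recognize compared with rare and context-defined custom vocabulary that has disproportionate impact on the usability of speech transcripts. Despite progress on contextual speech-to-text, there is no standardized benchmark. We introduce Contextual Earnings-22, an open dataset built upon Earnings-22, with realistic custom vocabulary contexts to foster research and reveal latent progress. We set six strong baselines for two dominant approaches: keyword prompting and keyword boosting. Experiments show both reach comparable and significantly improved accuracy when scaled from proof-of-concept to large-scale systems.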

[2] Hybrid CNN-Transformer Architecture for Arabic Speech Emotion Recognition

Youcef Soufiane Gheffari,Oussama Mustapha Benouddane,Samiya Silarbi

Main category: cs.CL

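TL;DR: This paper proposes an Arabic speech emotion recognition (SER) system based on a hybrid CNN-Transformer architecture. It reaches 97.8% accuracy and a macro F1-score of 0.98 on the EYASE dataset, supporting the effectiveness of the approach for low-resource languages.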

Details

Motivation: Research on Arabic speech emotion recognition is scarce, mainly because annotated datasets are limited. Method: A hybrid CNN-Transformer architecture in which the CNN extracts discriminative spectral features from Mel-spectrograms and the Transformer encoder models long-range temporal dependencies in speech. Result: 97.8% accuracy and a macro F1-score of 0.98 on the EYASE (Egyptian Arabic speech emotion) corpus. Conclusion: Combining CNNs with attention mechanisms effectively improves Arabic SER performance, and Transformer-based methods show potential for low-resource languages. Abstract: Recognizing emotions from speech using machine learning has become an active research area due to its importance in building human-centered applications. However, while many studies have been conducted in English, German, and other European and Asian languages, research in Arabic remains scarce because of the limited availability of annotated datasets. In this paper, we present an Arabic Speech Emotion Recognition (SER) system based on a hybrid CNN-Transformer architecture. The model leverages convolutional layers to extract discriminative spectral features from Mel-spectrogram inputs and Transformer encoders to capture long-range temporal dependencies in speech. Experiments were conducted on the EYASE (Egyptian Arabic speech emotion) corpus, and the proposed model achieved 97.8% accuracy and a macro F1-score of 0.98. These results demonstrate the effectiveness of combining convolutional feature extraction with attention-based modeling for Arabic SER and highlight the potential of Transformer-based approaches in low-resource languages.

[3] Cross-Tokenizer LLM Distillation through a Byte-Level Interface

Avyav Kumar Singh,Yen-Chen Wu,Alexandru Cioba,Alberto Bernacchia,Davide Buffelli

Main category: cs.CL

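TL;DR: This paper proposes Byte-Level Distillation (BLD), which addresses cross-tokenizer distillation (CTD) by unifying the teacher and student interfaces at the byte level. The method is simple and performs competitively with more complex approaches.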

Details

Motivation: Existing cross-tokenizer distillation methods rely on heuristic vocabulary alignment strategies that are complex and of limited effectiveness, so a simpler and more robust solution is needed. Method: Convert the teacher's output distribution to byte-level probabilities, attach a lightweight byte-level decoder head to the student, and distill through this shared byte-level interface. Result: BLD matches, and on several benchmarks surpasses, more sophisticated CTD methods, with models from 1B to 8B parameters. Conclusion: The byte level is a natural common interface for cross-tokenizer knowledge transfer, but CTD remains an open problem. Abstract: Cross-tokenizer distillation (CTD), the transfer of knowledge from a teacher to a student language model when the two use different tokenizers, remains a largely unsolved problem. Existing approaches rely on heuristic strategies to align mismatched vocabularies, introducing considerable complexity. In this paper, we propose a simple but effective baseline called Byte-Level Distillation (BLD) which enables CTD by operating at a common interface across tokenizers: the byte level. In more detail, we convert the teacher's output distribution to byte-level probabilities, attach a lightweight byte-level decoder head to the student, and distill through this shared byte-level interface. Despite its simplicity, BLD performs competitively with, and on several benchmarks surpasses, significantly more sophisticated CTD methods, across a range of distillation tasks with models from 1B to 8B parameters. Our results suggest that the byte level is a natural common ground for cross-tokenizer knowledge transfer, while also highlighting that consistent improvements across all tasks and benchmarks remain elusive, underscoring that CTD is still an open problem.
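
To make the idea of converting a token-level distribution to byte-level probabilities concrete, here is a minimal sketch that marginalizes a toy next-token distribution onto the first UTF-8 byte of each token. The vocabulary, the probabilities, and the function name `first_byte_distribution` are illustrative assumptions; the abstract does not specify how BLD performs the conversion, and the actual procedure may differ.

```python
from collections import defaultdict


def first_byte_distribution(token_probs: dict[str, float]) -> dict[int, float]:
    """Marginalize a next-token distribution onto each token's first UTF-8 byte.

    Illustrative only: a full byte-level conversion would also need to handle
    the remaining bytes of multi-byte tokens.
    """
    byte_probs: dict[int, float] = defaultdict(float)
    for token, prob in token_probs.items():
        encoded = token.encode("utf-8")
        if not encoded:  # an empty token emits no byte, so it is skipped
            continue
        byte_probs[encoded[0]] += prob
    total = sum(byte_probs.values())
    # Renormalize so the byte-level probabilities sum to one.
    return {byte: prob / total for byte, prob in byte_probs.items()}


# Hypothetical teacher distribution over a three-token vocabulary.
teacher_next_token = {"the": 0.5, "this": 0.2, "a": 0.3}
byte_dist = first_byte_distribution(teacher_next_token)
```

Because two tokenizers that disagree on token boundaries still agree on the underlying bytes, a distribution of this kind can be compared between teacher and student regardless of their vocabularies.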

[4] Lexical Tone is Hard to Quantize: Probing Discrete Speech Units in Mandarin and Yorùbá

Opeyemi Osakuade,Simon King

Main category: cs.CL

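TL;DR: This paper studies the limitations of discrete speech units (DSUs) in encoding suprasegmental information such as lexical tone. Existing quantization methods tend to preserve segmental structure while weakening suprasegmental features like tone: the SSL latent representations themselves encode tone, but tone is encoded less reliably after quantization. The authors propose a staged residual quantization strategy to improve tone encoding.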

Details

Motivation: DSUs are widely used in speech tasks, especially where text and speech are modelled jointly, but they encode suprasegmental information (such as tone and intonation) unreliably, which calls for closer study and improvement. Method: On two tone languages (Mandarin and Yorùbá), systematically evaluate how well SSL latent representations and several quantization methods (K-means and other variants) encode tone, and propose a two-stage residual K-means strategy: first apply K-means to encode segmental information, then cluster the residual representation a second time to strengthen the tone representation. Result: The raw SSL latent representations encode tone effectively, but DSUs obtained by standard quantization (such as K-means) substantially weaken tone information; this holds across a variety of quantization methods; the proposed residual quantization improves the reliability of tone encoding while preserving segmental structure. Conclusion: Current DSU quantization strategies have inherent limitations for suprasegmental features, and new tone-aware or prosody-aware representation learning techniques are needed; staged residual quantization is a promising direction. Abstract: Discrete speech units (DSUs) are derived by quantising representations from models trained using self-supervised learning (SSL). They are a popular representation for a wide variety of spoken language tasks, including those where prosody matters. DSUs are especially convenient for tasks where text and speech are jointly modelled, such as text-to-speech and multimodal dialogue systems. But we have found that DSUs encode suprasegmental information less reliably than segmental structure, which we demonstrate in this work using lexical tone, though this limitation likely extends to other suprasegmental features such as prosody. Our investigations using the tone languages Mandarin and Yorùbá show that the SSL latent representations themselves do encode tone, yet DSUs obtained using quantisation tend to prioritise phonetic structure, which makes lexical tone less reliably encoded. This remains true for a variety of quantisation methods, not only the most common, K-means. We conclude that current DSU quantisation strategies have limitations for suprasegmental features, which suggests a need for new, tone-aware (or prosody-aware) techniques in speech representation learning. We point towards a potential form of the solution by performing K-means clustering once to encode phonetic information, then again on the residual representation, which better encodes lexical tone.
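
The two-pass clustering described at the end of the abstract can be sketched on toy data as follows. This is a minimal illustration under stated assumptions: the 2-D features (a large "phone" dimension and a small "tone" dimension), the farthest-point initialization, and the helper names `kmeans` and `residual_kmeans` are all hypothetical and not taken from the paper.

```python
def dist2(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=10):
    """Lloyd's k-means with a deterministic farthest-point initialization."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(dist2(p, c) for c in centroids))
        centroids.append(list(farthest))
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid for every point.
        assign = [min(range(k), key=lambda c: dist2(p, centroids[c])) for p in points]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assign


def residual_kmeans(points, k_first, k_second):
    """Cluster once, subtract the assigned centroid, then cluster the residuals."""
    centroids, first_ids = kmeans(points, k_first)
    residuals = [
        [x - c for x, c in zip(p, centroids[i])] for p, i in zip(points, first_ids)
    ]
    _, second_ids = kmeans(residuals, k_second)
    return first_ids, second_ids


# Toy features: dimension 0 separates two phones, dimension 1 two tones.
features = [(0.0, 0.1), (0.0, -0.1), (10.0, 0.1), (10.0, -0.1)]
phone_ids, tone_ids = residual_kmeans(features, k_first=2, k_second=2)
```

In this toy setup the first pass groups points by the dominant dimension, and only the second pass on the residuals separates the small tone-like dimension, which mirrors the intuition that a single codebook prioritises phonetic structure.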

[5] Enabling Intrinsic Reasoning over Dense Geospatial Embeddings with DFR-Gemma

Xuechen Zhang,Aviv Slobodkin,Joydeep Paul,Mandar Sharma,Samet Oymak,Shravya Shetty,Gautam Prasad

Main category: cs.CL

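TL;DR: This paper proposes the DFR-Gemma framework, which lets large language models reason directly over dense geospatial embeddings without intermediate textual representations, improving efficiency and accuracy.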

Details

Motivation: Existing ways of integrating geospatial foundation model embeddings with large language models (LLMs) suffer from redundancy, token inefficiency, and numerical inaccuracies. Method: Direct Feature Reasoning-Gemma (DFR-Gemma) aligns high-dimensional geospatial embeddings with the LLM's latent space via a lightweight projector and feeds them in as semantic tokens alongside natural language instructions. Result: On a multi-task geospatial benchmark, DFR-Gemma performs accurate zero-shot reasoning and decodes latent spatial patterns while significantly improving efficiency over text-based baselines. Conclusion: Treating embeddings as primary data inputs is a viable path toward more direct, efficient, and scalable multimodal geospatial intelligence. Abstract: Representation learning for geospatial and spatio-temporal data plays a critical role in enabling general-purpose geospatial intelligence. Recent geospatial foundation models, such as the Population Dynamics Foundation Model (PDFM), encode complex population and mobility dynamics into compact embeddings. However, their integration with Large Language Models (LLMs) remains limited. Existing approaches to LLM integration treat these embeddings as retrieval indices or convert them into textual descriptions for reasoning, introducing redundancy, token inefficiency, and numerical inaccuracies. We propose Direct Feature Reasoning-Gemma (DFR-Gemma), a novel framework that enables LLMs to reason directly over dense geospatial embeddings. DFR aligns high-dimensional embeddings with the latent space of an LLM via a lightweight projector, allowing embeddings to be injected as semantic tokens alongside natural language instructions. This design eliminates the need for intermediate textual representations and enables intrinsic reasoning over spatial features. To evaluate this paradigm, we introduce a multi-task geospatial benchmark that pairs embeddings with diverse question-answer tasks, including feature querying, comparison, and semantic description. Experimental results show that DFR allows LLMs to decode latent spatial patterns and perform accurate zero-shot reasoning across tasks, while significantly improving efficiency compared to text-based baselines. Our results demonstrate that treating embeddings as primary data inputs provides a more direct, efficient, and scalable approach to multimodal geospatial intelligence.

[6] Decompose, Look, and Reason: Reinforced Latent Reasoning for VLMs

Mengdan Zhu,Senhao Cheng,Liang Zhao

Main category: cs.CL

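TL;DR: This paper proposes the DLR framework, which addresses visual information loss in complex visual reasoning for vision-language models by decomposing queries, extracting premise-conditioned visual latents, and reasoning through grounded rationales.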

Details

Motivation: Existing methods lose visual information in complex visual reasoning; they either add the cost of tool calls or rely on localized patch embeddings, which makes multi-step reasoning hard to support. Method: "Decompose, Look, and Reason" (DLR) is a reinforced latent reasoning framework with a three-stage training pipeline and a Spherical Gaussian Latent Policy, enabling dynamic query decomposition, extraction of premise-conditioned continuous visual latents, and answer derivation through grounded rationales. Result: On vision-centric benchmarks, DLR consistently outperforms strong baselines, including text-only, interleaved multimodal CoT, and latent reasoning methods, and provides better stepwise interpretability. Conclusion: DLR effectively mitigates visual information loss in complex visual reasoning for vision-language models and improves multi-step reasoning and interpretability. Abstract: Vision-Language Models often struggle with complex visual reasoning due to the visual information loss in textual CoT. Existing methods either add the cost of tool calls or rely on localized patch-based embeddings that are insufficient to extract semantics in multi-step reasoning. We propose "Decompose, Look, and Reason" (DLR), a reinforced latent reasoning framework that dynamically decomposes queries into textual premises, extracts premise-conditioned continuous visual latents, and deduces answers through grounded rationales. We introduce a three-stage training pipeline and propose a novel Spherical Gaussian Latent Policy to enable effective exploration in the latent space. Extensive experiments on vision-centric benchmarks show that DLR consistently outperforms strong baselines, including text-only, interleaved multimodal CoT, and latent reasoning methods, while providing superior stepwise interpretability.

[7] EMSDialog: Synthetic Multi-person Emergency Medical Service Dialogue Generation from Electronic Patient Care Reports via Multi-LLM Agents

Xueren Ge,Sahil Murtaza,Anthony Cortez,Homa Alemzadeh

Main category: cs.CL

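TL;DR: This paper proposes a multi-agent generation pipeline grounded in ePCR data and topic flow, uses it to build the EMSDialog dataset, and shows that the dataset improves diagnosis prediction in emergency medical service conversations.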

Details

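Motivation: Existing medical dialogue corpora are mostly two-party and lack the multi-role workflow and fine-grained annotations needed for dynamic evidence tracking and timely diagnostic decisions in streaming clinical conversations. Method: A multi-agent generation pipeline grounded in ePCR data and topic-flow modelling that iteratively plans, generates, and self-refines dialogues, with rule-based checks on factuality and topic coherence, to produce high-quality synthetic conversations. Result: The resulting EMSDialog dataset contains 4,414 multi-speaker EMS conversations annotated with 43 diagnoses, speaker roles, and turn-level topics; human and LLM evaluations confirm its quality and realism; augmenting training with it improves the accuracy, timeliness, and stability of diagnosis prediction. Conclusion: EMSDialog fills a gap in multi-role clinical dialogue data for diagnosis; the generation paradigm can be extended to other specialised domains and provides a new benchmark and data foundation for conversational medical AI. Abstract: Conversational diagnosis prediction requires models to track evolving evidence in streaming clinical conversations and decide when to commit to a diagnosis. Existing medical dialogue corpora are largely dyadic or lack the multi-party workflow and annotations needed for this setting. We introduce an ePCR-grounded, topic-flow-based multi-agent generation pipeline that iteratively plans, generates, and self-refines dialogues with rule-based factual and topic flow checks. The pipeline yields EMSDialog, a dataset of 4,414 synthetic multi-speaker EMS conversations based on a real-world ePCR dataset, annotated with 43 diagnoses, speaker roles, and turn-level topics. Human and LLM evaluations confirm high quality and realism of EMSDialog using both utterance- and conversation-level metrics. Results show that EMSDialog-augmented training improves accuracy, timeliness, and stability of EMS conversational diagnosis prediction.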

[8] TR-EduVSum: A Turkish-Focused Dataset and Consensus Framework for Educational Video Summarization

Figen Eğin,Aytuğ Onan

Main category: cs.CL

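TL;DR: This paper proposes AutoMUP, which automatically generates gold-standard summaries of Turkish educational videos from multiple human summaries. It produces graded summaries through embedding-based clustering and consensus modelling, and on the TR-EduVSum dataset its summaries show high semantic overlap with those of strong LLMs.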

Details

Motivation: There is no reproducible, automated method for producing gold-standard summaries of Turkish educational videos; existing evaluation relies on manual work or a single model and does not reflect real consensus. Method: Build the TR-EduVSum dataset (82 Turkish course videos on data structures and algorithms with 3,281 human summaries) and propose AutoMUP: extract meaning units, cluster them with embeddings, statistically model inter-participant agreement, and generate graded summaries by consensus weight; the gold summary is the highest-consensus configuration. Result: AutoMUP summaries show high semantic overlap with summaries from strong LLMs such as Flash 2.5 and GPT-5.1; ablations show that consensus weight and clustering play a decisive role in summary quality; the method can be transferred to other Turkic languages at low cost. Conclusion: AutoMUP offers a fully automatic, reproducible, consensus-driven framework for gold-standard summaries of Turkish educational videos, with value for evaluation and potential for cross-lingual generalisation. Abstract: This study presents a framework for generating the gold-standard summary fully automatically and reproducibly based on multiple human summaries of Turkish educational videos. Within the scope of the study, a new dataset called TR-EduVSum was created, encompassing 82 Turkish course videos in the field of "Data Structures and Algorithms" and containing a total of 3,281 independent human summaries. Inspired by existing pyramid-based evaluation approaches, the AutoMUP (Automatic Meaning Unit Pyramid) method is proposed, which extracts consensus-based content from multiple human summaries. AutoMUP clusters the meaning units extracted from human summaries using embedding, statistically models inter-participant agreement, and generates graded summaries based on consensus weight. In this framework, the gold summary corresponds to the highest-consensus AutoMUP configuration, constructed from the most frequently supported meaning units across human summaries. Experimental results show that AutoMUP summaries exhibit high semantic overlap with robust LLM (Large Language Model) summaries such as Flash 2.5 and GPT-5.1. Furthermore, ablation studies clearly demonstrate the decisive role of consensus weight and clustering in determining summary quality. The proposed approach can be generalized to other Turkic languages at low cost.
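
The consensus-weighting step can be illustrated with a small pyramid-style sketch. It assumes that meaning units have already been extracted and clustered, so each human summary is represented as a set of cluster labels; the labels, the definition of weight as the fraction of summaries supporting a unit, and the threshold are hypothetical choices, not AutoMUP's published formulation.

```python
from collections import Counter


def consensus_weights(summaries: list[set[str]]) -> dict[str, float]:
    """Weight each meaning unit by the fraction of summaries that contain it."""
    counts = Counter(unit for units in summaries for unit in units)
    return {unit: n / len(summaries) for unit, n in counts.items()}


def graded_summary(weights: dict[str, float], min_weight: float) -> list[str]:
    """Keep units at or above the threshold, highest consensus first."""
    kept = (unit for unit, weight in weights.items() if weight >= min_weight)
    return sorted(kept, key=lambda unit: (-weights[unit], unit))


# Hypothetical clustered meaning units from four human summaries of one video.
summaries = [
    {"stack_is_lifo", "push_adds_top", "pop_removes_top"},
    {"stack_is_lifo", "push_adds_top"},
    {"stack_is_lifo", "array_implementation"},
    {"stack_is_lifo", "pop_removes_top"},
]
weights = consensus_weights(summaries)
gold = graded_summary(weights, min_weight=0.5)
```

Raising `min_weight` yields a shorter, higher-consensus summary, and lowering it yields a longer one, which is one simple way to obtain graded summaries from the same set of weights.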

[9] Reasoning-Based Refinement of Unsupervised Text Clusters with LLMs

Tunazzina Islam

Main category: cs.CL

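TL;DR: This paper proposes a framework that refines unsupervised clustering results through large language model (LLM) reasoning. Its three stages (coherence verification, redundancy adjudication, and label grounding) improve cluster coherence, interpretability, and alignment with human judgment without labeled data.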

Details

Motivation: Unsupervised semantic clustering methods often produce incoherent, redundant, or poorly grounded clusters that are hard to validate without labeled data. Method: A three-stage LLM reasoning framework: (i) coherence verification (whether a cluster summary is supported by its member texts); (ii) redundancy adjudication (merging or rejecting candidate clusters based on semantic overlap); and (iii) label grounding (generating interpretable cluster labels in a fully unsupervised manner), which decouples representation learning from structural validation. Result: On real-world corpora from two different social platforms, the framework consistently improves on classical topic models and recent representation-based baselines; human evaluation shows strong agreement between LLM-generated labels and human judgment; robustness analyses under matched temporal and volume conditions assess cross-platform stability. Conclusion: LLMs can serve as a general mechanism for validating and refining semantic structure, making unsupervised analysis of large text collections more reliable and interpretable. Abstract: Unsupervised methods are widely used to induce latent semantic structure from large text collections, yet their outputs often contain incoherent, redundant, or poorly grounded clusters that are difficult to validate without labeled data. We propose a reasoning-based refinement framework that leverages large language models (LLMs) not as embedding generators, but as semantic judges that validate and restructure the outputs of arbitrary unsupervised clustering algorithms. Our framework introduces three reasoning stages: (i) coherence verification, where LLMs assess whether cluster summaries are supported by their member texts; (ii) redundancy adjudication, where candidate clusters are merged or rejected based on semantic overlap; and (iii) label grounding, where clusters are assigned interpretable labels in a fully unsupervised manner. This design decouples representation learning from structural validation and mitigates common failure modes of embedding-only approaches. We evaluate the framework on real-world social media corpora from two platforms with distinct interaction models, demonstrating consistent improvements in cluster coherence and human-aligned labeling quality over classical topic models and recent representation-based baselines. Human evaluation shows strong agreement with LLM-generated labels, despite the absence of gold-standard annotations. We further conduct robustness analyses under matched temporal and volume conditions to assess cross-platform stability. Beyond empirical gains, our results suggest that LLM-based reasoning can serve as a general mechanism for validating and refining unsupervised semantic structure, enabling more reliable and interpretable analyses of large text collections without supervision.

[10] CAMO: A Class-Aware Minority-Optimized Ensemble for Robust Language Model Evaluation on Imbalanced Data

Mohamed Ehab,Ali Hamdi,Khaled Shaban

Main category: cs.CL

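TL;DR: This paper proposes CAMO (Class-Aware Minority-Optimized), a new ensemble method for class imbalance. A hierarchical mechanism dynamically strengthens minority-class predictions, yielding the highest macro F1-scores on imbalanced text classification benchmarks, and the method is model- and domain-agnostic.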

Details

Motivation: Real-world classification tasks often face severe class imbalance; traditional ensembles favour majority classes, which hurts minority-class performance and overall F1-score. Method: CAMO uses a hierarchical procedure that combines vote distributions, confidence calibration, and inter-model uncertainty to dynamically boost underrepresented classes and strengthen their predictions. Result: On two highly imbalanced datasets, DIAR-AI/Emotion and BEA 2025, CAMO with fine-tuned models consistently achieves the highest strict macro F1-score, outperforming seven baseline ensemble methods across eight language models (three LLMs and five SLMs). Conclusion: CAMO is a reliable, domain-neutral ensemble framework for imbalanced classification; its benefit works together with model adaptation, indicating that the best ensemble strategy depends on model properties. Abstract: Real-world categorization is severely hampered by class imbalance because traditional ensembles favor majority classes, which lowers minority performance and overall F1-score. We provide a unique ensemble technique for imbalanced problems called CAMO (Class-Aware Minority-Optimized). Through a hierarchical procedure that incorporates vote distributions, confidence calibration, and inter-model uncertainty, CAMO dynamically boosts underrepresented classes while preserving and amplifying minority forecasts. We verify CAMO on two highly unbalanced, domain-specific benchmarks: the DIAR-AI/Emotion dataset and the ternary BEA 2025 dataset. We benchmark against seven proven ensemble algorithms using eight different language models (three LLMs and five SLMs) under zero-shot and fine-tuned settings. With refined models, CAMO consistently earns the greatest strict macro F1-score, setting a new benchmark. Its benefit works in concert with model adaptation, showing that the best ensemble choice depends on model properties. This proves that CAMO is a reliable, domain-neutral framework for unbalanced categorization.

[11] ADAG: Automatically Describing Attribution Graphs

Aryaman Arora,Zhengxuan Wu,Jacob Steinhardt,Sarah Schwettmann

Main category: cs.CL

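TL;DR: This paper proposes ADAG, an automated pipeline for explaining circuit tracing in language models. Using attribution profiles, clustering, and LLM-generated natural-language explanations, it recovers known interpretable circuits and finds steerable feature clusters responsible for a harmful-advice jailbreak.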

Details

Motivation: Existing circuit tracing methods depend on manual interpretation of each feature's role and lack automation and systematic rigour, which limits scalability and reliability. Method: Introduce attribution profiles that quantify a feature's functional role, design a new clustering algorithm to group features, and build an LLM explainer-simulator setup that generates and scores natural-language explanations of feature groups. Result: The system recovers interpretable circuits on circuit-tracing tasks previously analysed by humans, and finds steerable feature clusters responsible for a harmful-advice jailbreak in Llama 3.1 8B Instruct. Conclusion: ADAG automates circuit-tracing explanation end to end, improving the scalability and practicality of language model interpretability and providing a new tool for safety analysis. Abstract: In language model interpretability research, circuit tracing aims to identify which internal features causally contributed to a particular output and how they affected each other, with the goal of explaining the computations underlying some behaviour. However, all prior circuit tracing work has relied on ad-hoc human interpretation of the role that each feature in the circuit plays, via manual inspection of data artifacts such as the dataset examples the component activates on. We introduce ADAG, an end-to-end pipeline for describing these attribution graphs which is fully automated. To achieve this, we introduce attribution profiles which quantify the functional role of a feature via its input and output gradient effects. We then introduce a novel clustering algorithm for grouping features, and an LLM explainer-simulator setup which generates and scores natural-language explanations of the functional role of these feature groups. We run our system on known human-analysed circuit-tracing tasks and recover interpretable circuits, and further show ADAG can find steerable clusters which are responsible for a harmful advice jailbreak in Llama 3.1 8B Instruct.

[12] DIVERSED: Relaxed Speculative Decoding via Dynamic Ensemble Verification

Ziyi Wang,Siva Rajesh Kasa,Ankith M S,Santhosh Kumar Kasa,Jiaru Zou,Sumit Negi,Ruqi Zhang,Nan Jiang,Qifan Song

Main category: cs.CL

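TL;DR: This paper proposes DIVERSED, which dynamically relaxes the verification step to make speculative decoding more efficient while preserving generation quality.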

Details

Motivation: The strict verification step in standard speculative decoding limits the acceptance rate and therefore the achievable speedup. Method: An ensemble-based dynamic verifier that adaptively blends the draft and target model distributions according to task and context. Result: Theoretical analysis and experiments show that DIVERSED substantially improves inference efficiency over standard speculative decoding. Conclusion: DIVERSED eases the verification bottleneck in speculative decoding without sacrificing generation quality, speeding up large language model inference. Abstract: Speculative decoding is an effective technique for accelerating large language model inference by drafting multiple tokens in parallel. In practice, its speedup is often bottlenecked by a rigid verification step that strictly enforces the accepted token distribution to exactly match the target model. This constraint leads to the rejection of many plausible tokens, lowering the acceptance rate and limiting overall time speedup. To overcome this limitation, we propose Dynamic Verification Relaxed Speculative Decoding (DIVERSED), a relaxed verification framework that improves time efficiency while preserving generation quality. DIVERSED learns an ensemble-based verifier that blends the draft and target model distributions with a task-dependent and context-dependent weight. We provide theoretical justification for our approach and demonstrate empirically that DIVERSED achieves substantially higher inference efficiency compared to standard speculative decoding methods. Code is available at: https://github.com/comeusr/diversed.
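
The effect of blending the draft and target distributions can be sketched with the usual speculative-sampling acceptance rule, in which a drafted token is accepted with probability min(1, p_target / p_draft). The sketch below replaces the target with a mixture; the fixed scalar `weight`, the toy distributions, and the function name are assumptions for illustration, whereas DIVERSED learns a task- and context-dependent weight. The resampling step that follows a rejection is omitted.

```python
def acceptance_probability(token, draft, target, weight):
    """Probability of accepting a drafted token under a blended verifier.

    weight = 0.0 recovers standard speculative decoding (verify against the
    target only); weight = 1.0 would accept every drafted token.
    """
    blended = weight * draft[token] + (1.0 - weight) * target[token]
    return min(1.0, blended / draft[token])


# Hypothetical next-token distributions over a two-word vocabulary.
draft = {"cat": 0.6, "dog": 0.4}
target = {"cat": 0.3, "dog": 0.7}

strict = acceptance_probability("cat", draft, target, weight=0.0)
relaxed = acceptance_probability("cat", draft, target, weight=0.5)
```

With these numbers the strict rule accepts "cat" half of the time, while the blended rule accepts it more often, which is the mechanism by which a relaxed verifier raises the acceptance rate at the cost of deviating from the exact target distribution.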

[13] Efficient and Effective Internal Memory Retrieval for LLM-Based Healthcare Prediction

Mingchen Li,Jiatan Huang,Zonghai Yao,Hong yu

Main category: cs.CL

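TL;DR: This paper proposes the K2K framework, which encodes clinical knowledge into the model's parameter space so that it can be retrieved quickly from an internal key-value memory, avoiding the high latency of external retrieval. Activation-guided probe construction and cross-attention reranking improve retrieval quality, and the method reaches state-of-the-art performance on four healthcare outcome prediction benchmarks.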

Details

Motivation: Large language models are unreliable in healthcare because of hallucinations and a lack of fine-grained medical context; conventional RAG relies on computationally intensive retrieval over large external knowledge bases, causing latency that is unsuitable for time-sensitive clinical decisions. Method: The Keys to Knowledge (K2K) framework encodes essential clinical information into the model's parameter space to build an internal key-value memory, uses activation-guided probe construction and cross-attention reranking to improve internal retrieval quality, and needs no external retrieval at inference time, removing that overhead. Result: On four benchmark healthcare outcome prediction datasets, K2K outperforms existing methods and achieves state-of-the-art performance. Conclusion: K2K offers a new paradigm for reliable, low-latency use of LLMs in high-stakes clinical settings and shows that internal knowledge encoding can effectively replace external retrieval. Abstract: Large language models (LLMs) hold significant promise for healthcare, yet their reliability in high-stakes clinical settings is often compromised by hallucinations and a lack of granular medical context. While Retrieval Augmented Generation (RAG) can mitigate these issues, standard supervised pipelines require computationally intensive searches over massive external knowledge bases, leading to high latency that is impractical for time-sensitive care. To address this, we introduce Keys to Knowledge (K2K), a novel framework that replaces external retrieval with internal, key-based knowledge access. By encoding essential clinical information directly into the model's parameter space, K2K enables rapid retrieval from internal key-value memory without inference-time overhead. We further enhance retrieval quality through activation-guided probe construction and cross-attention reranking. Experimental results demonstrate that K2K achieves state-of-the-art performance across four benchmark healthcare outcome prediction datasets.

Ziyi Chen,Yasir Khan,Mengyuan Zhang,Cheng Peng,Mengxian Lyu,Yiyang Liu,Krishna Vaddiparti,Robert L Cook,Mattia Prosperi,Yonghui Wu

Main category: cs.CL

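TL;DR: This study develops a large language model (LLM)-based tool for automatically identifying HIV-related stigma in clinical notes. Using 1,332 manually annotated sentences to compare several models, it finds that GatorTron-large performs best (Micro F1 = 0.62) and that few-shot prompting substantially improves generative models.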

Details

Motivation: HIV-related stigma is a key psychosocial factor affecting the mental health, treatment adherence, and outcomes of people living with HIV, yet there are no off-the-shelf NLP tools for extracting and classifying stigma content in clinical notes. Method: Using clinical notes of PLWH at the University of Florida from 2012 to 2022, candidate sentences were found with expert-defined keywords and expanded via clinical word embeddings; 1,332 sentences were manually annotated across four stigma subscales; encoder models (GatorTron-large, BERT) were compared with generative LLMs (GPT-OSS-20B, LLaMA-8B, MedGemma-27B) under zero-shot and few-shot settings. Result: GatorTron-large was best overall (Micro F1 = 0.62); with few-shot prompting, GPT-OSS-20B and LLaMA-8B reached 0.57 and 0.59 respectively; Negative Self-Image was the easiest subscale to predict and Personalized Stigma the hardest; zero-shot generative inference failed in up to 32% of cases. Conclusion: This study builds the first practical NLP tool for identifying HIV stigma in clinical notes, confirms the effectiveness of domain-adapted encoder models, and shows that few-shot prompting is key to improving generative models. Abstract: Human immunodeficiency virus (HIV)-related stigma is a critical psychosocial determinant of health for people living with HIV (PLWH), influencing mental health, engagement in care, and treatment outcomes. Although stigma-related experiences are documented in clinical narratives, there is a lack of off-the-shelf tools to extract and categorize them. This study aims to develop a large language model (LLM)-based tool for identifying HIV stigma from clinical notes. We identified clinical notes from PLWH receiving care at the University of Florida (UF) Health between 2012 and 2022. Candidate sentences were identified using expert-curated stigma-related keywords and iteratively expanded via clinical word embeddings. A total of 1,332 sentences were manually annotated across four stigma subscales: Concern with Public Attitudes, Disclosure Concerns, Negative Self-Image, and Personalized Stigma. We compared GatorTron-large and BERT as encoder-based baselines, and GPT-OSS-20B, LLaMA-8B, and MedGemma-27B as generative LLMs, under zero-shot and few-shot prompting. GatorTron-large achieved the best overall performance (Micro F1 = 0.62). Few-shot prompting substantially improved generative model performance, with 5-shot GPT-OSS-20B and LLaMA-8B achieving Micro-F1 scores of 0.57 and 0.59, respectively. Performance varied by stigma subscale, with Negative Self-Image showing the highest predictability and Personalized Stigma remaining the most challenging. Zero-shot generative inference exhibited non-trivial failure rates (up to 32%). This study develops the first practical NLP tool for identifying HIV stigma in clinical notes.
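
For readers unfamiliar with the Micro F1 metric reported above, the following sketch shows how it pools true positives, false positives, and false negatives across all labels before computing F1. It assumes a multi-label setup in which each sentence carries a set of subscale labels; whether the study treats the task as multi-label or single-label is not stated in the abstract, and the example annotations are invented.

```python
def micro_f1(gold: list[set[str]], pred: list[set[str]]) -> float:
    """Micro-averaged F1: pool counts over all sentences and labels."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # labels correctly predicted
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # predicted but not in gold
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # in gold but missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Invented gold and predicted labels for three sentences.
gold = [{"Disclosure Concerns"}, {"Negative Self-Image"}, {"Personalized Stigma"}]
pred = [{"Disclosure Concerns"}, {"Negative Self-Image"}, {"Negative Self-Image"}]
score = micro_f1(gold, pred)
```

Because counts are pooled, frequent subscales dominate the micro average, which is worth keeping in mind when comparing it with per-subscale results.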

[15] SepSeq: A Training-Free Framework for Long Numerical Sequence Processing in LLMs

Jie Sun,Yu Liu,Lu Han,Qiwen Deng,Xiang Shu,Yang Xiao,Xingyu Lu,Jun Zhou,Pengfei Liu,Lintao Ma,Jiancan Wu,Xiang Wang

Main category: cs.CL

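TL;DR: This paper proposes SepSeq, a framework that inserts separator tokens to mitigate the performance drop LLMs suffer on long numerical sequences due to attention dispersion in Softmax. It is training-free and plug-and-play; across 9 mainstream LLMs it improves relative accuracy by 35.6% on average and reduces inference token consumption by 16.4%.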

Details

Motivation: Transformer-based large language models (LLMs) theoretically support large context windows, but their performance degrades severely on long numerical sequences; the authors attribute this to attention dispersion caused by the Softmax mechanism, which prevents the model from focusing on key information. Method: Separate Sequence (SepSeq) is a training-free, plug-and-play framework that strategically inserts separator tokens into the input sequence; the separators act as attention sinks that recalibrate the attention distribution, balancing local focus with global context. Result: In extensive experiments on 9 widely adopted LLMs, SepSeq improves relative accuracy by 35.6% on average across diverse domains while reducing total inference token consumption by 16.4% on average. Conclusion: Separator tokens effectively mitigate attention dispersion; SepSeq is a lightweight, general, and efficient way to improve LLM handling of long numerical sequences without adding training cost. Abstract: While transformer-based Large Language Models (LLMs) theoretically support massive context windows, they suffer from severe performance degradation when processing long numerical sequences. We attribute this failure to the attention dispersion in the Softmax mechanism, which prevents the model from concentrating attention. To overcome this, we propose Separate Sequence (SepSeq), a training-free, plug-and-play framework to mitigate dispersion by strategically inserting separator tokens. Mechanistically, we demonstrate that separator tokens act as an attention sink, recalibrating attention to focus on local segments while preserving global context. Extensive evaluations on 9 widely-adopted LLMs confirm the effectiveness of our approach: SepSeq yields an average relative accuracy improvement of 35.6% across diverse domains while reducing total inference token consumption by 16.4% on average.
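
As a rough picture of what inserting separator tokens looks like at the input level, the sketch below places a separator after every fixed-length segment of a numerical sequence. The separator string `<SEP>`, the fixed segment length, and the function name are assumptions; the abstract says only that separators are inserted "strategically" and does not give the placement rule.

```python
def insert_separators(values, segment_len, separator="<SEP>"):
    """Serialize a numerical sequence, adding a separator between segments."""
    parts = []
    for index, value in enumerate(values):
        # Start a new segment every `segment_len` values (never before the first).
        if index > 0 and index % segment_len == 0:
            parts.append(separator)
        parts.append(str(value))
    return " ".join(parts)


# Hypothetical input: eight readings split into segments of three.
prompt_sequence = insert_separators([3, 1, 4, 1, 5, 9, 2, 6], segment_len=3)
```

Since the change happens purely in how the prompt is serialized, a scheme like this needs no access to model weights, which is consistent with the training-free, plug-and-play framing.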

Steven Au,Sujit Noronha

Main category: cs.CL

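TL;DR: This paper introduces PPT-Bench, a benchmark for evaluating how large language models respond to "epistemic attack" under four types of philosophical pressure (Epistemic Destabilization, Value Nullification, Authority Inversion, and Identity Dissolution). It reveals a cognitive fragility distinct from conventional sycophancy and tests the effectiveness of several mitigation strategies.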

Details

Motivation: Prior work focuses mostly on LLM sycophancy (such as preference alignment and flattery) and overlooks a broader set of epistemic failures; the authors aim to evaluate systematically how stable and consistent models remain when challenged at the level of knowledge, values, and identity. Method: Build the PPT-Bench diagnostic benchmark around the Philosophical Pressure Taxonomy (PPT) with four pressure types, and test 5 mainstream LLMs at three layers (L0 baseline, L1 single-turn pressure, L2 multi-turn Socratic escalation); compare mitigation strategies such as prompt-level anchoring, persona-stability prompts, and contrastive decoding. Result: The four pressure types produce statistically separable inconsistency patterns, indicating that epistemic attack exposes weaknesses that standard social-pressure benchmarks miss; mitigation effectiveness depends strongly on pressure type and model: prompt engineering works best in API settings, while Leading Query Contrastive Decoding is the most robust for open models. Conclusion: LLMs are systematically fragile at their epistemic foundations, and dedicated benchmarks such as PPT-Bench together with type-specific interventions are needed to evaluate and improve their epistemic robustness. Abstract: Large language models (LLMs) can shift their answers under pressure in ways that reflect accommodation rather than reasoning. Prior work on sycophancy has focused mainly on disagreement, flattery, and preference alignment, leaving a broader set of epistemic failures less explored. We introduce PPT-Bench, a diagnostic benchmark for evaluating epistemic attack, where prompts challenge the legitimacy of knowledge, values, or identity rather than simply opposing a previous answer. PPT-Bench is organized around the Philosophical Pressure Taxonomy (PPT), which defines four types of philosophical pressure: Epistemic Destabilization, Value Nullification, Authority Inversion, and Identity Dissolution. Each item is tested at three layers: a baseline prompt (L0), a single-turn pressure condition (L1), and a multi-turn Socratic escalation (L2). This allows us to measure epistemic inconsistency between L0 and L1, and conversational capitulation in L2. Across five models, these pressure types produce statistically separable inconsistency patterns, suggesting that epistemic attack exposes weaknesses not captured by standard social-pressure benchmarks. Mitigation results are strongly type- and model-dependent: prompt-level anchoring and persona-stability prompts perform best in API settings, while Leading Query Contrastive Decoding is the most reliable intervention for open models.

[17] An Empirical Analysis of Static Analysis Methods for Detection and Mitigation of Code Library Hallucinations

Clarissa Miranda-Pena,Andrew Reeson,Cécile Paris,Josiah Poon,Jonathan K. Kummerfeld

Main category: cs.CL

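TL;DR: This paper studies the potential and limits of static analysis tools for detecting and mitigating hallucinations in code generated by large language models (LLMs), especially library-related hallucinations. The tools detect 16%-70% of errors but have an inherent ceiling (48.5%-77%) and cannot fully solve the problem.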

Details

Motivation: Large language models still hallucinate frequently when generating code that calls libraries (for example, calling non-existent APIs), so effective and cheap detection and mitigation methods are needed. Method: Systematically evaluate several static analysis tools on multiple NL-to-code benchmarks for their ability to detect hallucinations, particularly library hallucinations, in LLM-generated code, and determine their theoretical detection ceiling through manual analysis. Result: Static analysis tools detect 16%-70% of all errors and 14%-85% of library hallucinations; manual analysis puts their theoretical upper bound at 48.5%-77%. Conclusion: Static analysis is a cheap, partially effective way to mitigate hallucinations, but its inherent limits mean it can never fully solve LLM code hallucination. Abstract: Despite extensive research, Large Language Models continue to hallucinate when generating code, particularly when using libraries. On NL-to-code benchmarks that require library use, we find that LLMs generate code that uses non-existent library features in 8.1-40% of responses. One intuitive approach for detection and mitigation of hallucinations is static analysis. In this paper, we analyse the potential of static analysis tools, both in terms of what they can solve and what they cannot. We find that static analysis tools can detect 16-70% of all errors, and 14-85% of library hallucinations, with performance varying by LLM and dataset. Through manual analysis, we identify cases a static method could not plausibly catch, which gives an upper bound on their potential from 48.5% to 77%. Overall, we show that static analysis methods are a cheap method for addressing some forms of hallucination, and we quantify how far short of solving the problem they will always be.
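
One simple form such a check can take is sketched below: parse the generated code with Python's `ast` module, collect plain `import` statements, and flag any `module.attribute` access where the attribute does not exist on the imported module. This is an illustrative toy, not one of the tools evaluated in the paper; the function name is hypothetical, it handles only top-level `import x` and `import x as y` forms, and it imports the referenced library (without running the generated code) to inspect it.

```python
import ast
import importlib


def find_missing_attributes(source: str) -> list[str]:
    """Return 'module.attr' names used in `source` that the module lacks."""
    tree = ast.parse(source)
    aliases = {}  # local name -> real module name
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                aliases[alias.asname or alias.name] = alias.name
    missing = []
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Attribute)
            and isinstance(node.value, ast.Name)
            and node.value.id in aliases
        ):
            module_name = aliases[node.value.id]
            module = importlib.import_module(module_name)
            if not hasattr(module, node.attr):
                missing.append(f"{module_name}.{node.attr}")
    return missing


# Hypothetical LLM output that calls a function `math` does not provide.
generated = (
    "import math\n"
    "import json as j\n"
    "x = math.sqrt(4)\n"
    "y = math.cube_root(8)\n"
    "z = j.dumps({})\n"
)
report = find_missing_attributes(generated)
```

A check like this catches a misspelled or invented function name, but it cannot tell whether an existing function is called with the wrong semantics, which illustrates why static methods have a ceiling on what they can detect.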

[18] Sensitivity-Positional Co-Localization in GQA Transformers

Manoj Chandrashekar Rao

Main category: cs.CL

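TL;DR: This paper asks whether, in Grouped Query Attention (GQA) Transformers, the task-sensitive layers coincide with the layers most influenced by positional encoding (RoPE). It finds strong anti-localization (late layers are task-sensitive, early layers are RoPE-sensitive), yet applying both adaptation methods (LSLoRA and GARFA) to the task-sensitive layers still improves performance substantially.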

Details

Motivation: Test whether the layers most sensitive to task correctness in GQA models coincide with the layers where positional encoding (RoPE) adaptation has the most leverage (the "co-localization hypothesis"), in order to guide more efficient parameter-efficient fine-tuning. Method: Propose LSLoRA (select task-sensitive layers with a correctness-differential hidden-state metric and restrict LoRA adaptation to them) and GARFA (attach 8 learnable per-KV-head RoPE frequency scaling factors to each targeted layer); run layer sensitivity analysis and cross-layer ablations on Llama 3.1 8B (32 layers, 4:1 Q:KV head ratio). Result: Task-sensitive layers concentrate late in the network (layers 23-31) and RoPE-influential layers early (layers 0-9), with Spearman r_s = -0.735 (p < 0.001), confirming strong anti-localization; however, applying LSLoRA and GARFA together to the task-sensitive layers outperforms all other configurations on 6 benchmarks (by 4-16 percentage points), reaching 67.1% on HumanEval+, close to Claude 3.5 Haiku (68.3%), at a total compute cost of only $100. Conclusion: Task sensitivity and RoPE sensitivity are spatially separated in GQA models, but jointly applying structure-aware adaptation (LSLoRA + GARFA) to the task-sensitive layers is still the best strategy, which challenges the co-localization intuition and suggests a new approach to efficient fine-tuning. Abstract: We investigate a fundamental structural question in Grouped Query Attention (GQA) transformers: do the layers most sensitive to task correctness coincide with the layers where positional encoding adaptation has the greatest leverage? We term this the co-localization hypothesis and test it on Llama 3.1 8B, a 32-layer GQA model with a 4:1 query-to-key-value head ratio. We introduce LSLoRA, which restricts LoRA adaptation to layers identified via a novel correctness-differential hidden-state metric, and GARFA (GQA-Aware RoPE Frequency Adaptation), which attaches 8 learnable per-KV-head scalar multipliers to each targeted layer. Contrary to the co-localization hypothesis, we discover strong anti-localization: task-sensitive layers concentrate in the late network (layers 23-31) while RoPE-influential layers dominate the early network (layers 0-9), yielding Spearman r_s = -0.735 (p = 1.66 × 10^-6). Despite this anti-localization, a 4-way cross-layer ablation shows that applying both interventions to the sensitivity-identified layers outperforms all alternative configurations by 4-16 percentage points across six diverse benchmarks (MMLU, GPQA, HumanEval+, MATH, MGSM, ARC), approaching Claude 3.5 Haiku on HumanEval+ (67.1% vs. 68.3%) at $100 total compute cost.
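
The anti-localization finding rests on a Spearman rank correlation between two per-layer scores. A minimal sketch of that statistic follows, using the closed form r_s = 1 - 6 Σd² / (n(n² - 1)), which is valid only when there are no tied values. The per-layer scores are invented for illustration and are not the paper's measurements.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation via the no-ties closed form."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Invented per-layer scores for a five-layer toy model (early to late layers).
task_sensitivity = [0.1, 0.2, 0.3, 0.8, 0.9]
rope_influence = [0.9, 0.7, 0.8, 0.2, 0.1]
r_s = spearman(task_sensitivity, rope_influence)
```

A strongly negative value, as in this toy example, means layers that rank high on one score tend to rank low on the other, which is the pattern the abstract describes as anti-localization.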

[19] TEMPER: Testing Emotional Perturbation in Quantitative Reasoning

Atahan Dokme,Benjamin Reichman,Larry Heck

Main category: cs.CL

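TL;DR: This paper studies whether emotional phrasing affects large language models on quantitative reasoning tasks. Emotional framing lowers accuracy by 2-10 percentage points, and neutralizing the text recovers most of the loss, indicating that the problem stems from emotional style rather than corrupted content.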

Details

Motivation: Real-world queries often carry emotion (such as frustration, urgency, or enthusiasm), while LLMs are mostly trained and evaluated on neutral language, so it is worth testing whether emotional framing alone, purely through style, harms reasoning. Method: Builds a controlled emotion translation framework that rewrites math reasoning problems into emotional variants while keeping all quantities and logical relationships unchanged; uses it to create the Temper-5400 benchmark (5,400 semantically verified emotion-neutral pairs) and evaluates 18 models of different scales. Result: Emotional framing reduces accuracy by 2-10 percentage points; neutralizing the emotional variants largely restores performance; non-emotional surface paraphrases cause no degradation, confirming that emotional content rather than surface change drives the drop. Conclusion: Emotional expression alone can interfere with LLM quantitative reasoning, and lightweight inference-time neutralization mitigates the effect; the framework also applies to other stylistic robustness evaluations. Abstract: Large language models are trained and evaluated on quantitative reasoning tasks written in clean, emotionally neutral language. However, real-world queries are often wrapped in frustration, urgency or enthusiasm. Does emotional framing alone degrade reasoning when all numerical content is preserved? To investigate this, a controlled emotion translation framework is developed that rewrites problems into emotional variants while preserving all quantities and relationships. Using this framework, Temper-5400 (5,400 semantically verified emotion--neutral pairs) is constructed across GSM8K, MultiArith, and ARC-Challenge, and evaluated on eighteen models (1B to frontier scale). Two core results emerge: First, emotional framing reduces accuracy by 2-10 percentage points even though all numerical content is preserved. Second, neutralizing emotional variants recovers most of the lost performance, showing both that the degradation is tied to emotional style rather than content corruption and that neutralization can serve as a lightweight inference-time mitigation. Non-emotional paraphrases cause no such degradation, implicating emotional content rather than surface-level changes. Beyond emotion specifically, the benchmark construction procedure provides a general framework for controlled stylistic translation and robustness evaluation.

[20] GRASS: Gradient-based Adaptive Layer-wise Importance Sampling for Memory-efficient Large Language Model Fine-tuning

Kaiyuan Tian,Yu Tang,Gongqingjian Jiang,Baihui Liu,Yifu Gao,Xialin Su,Linbo Qiao,Dongsheng Li

Main category: cs.CL

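TL;DR: This paper proposes GRASS, a framework that combines gradient-based adaptive layer-wise importance sampling with layer-wise optimizer state offloading to reduce memory consumption while improving fine-tuning performance.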

Details

Motivation: Existing low-rank adaptation methods limit model expressiveness, and layer-wise fine-tuning methods ignore how layer importance varies with the task and the training stage, which hurts downstream performance. Method: GRASS uses mean gradient norms as a task-aware and training-stage-aware measure of layer importance and adjusts layer sampling probabilities through an adaptive training strategy; it also adds layer-wise optimizer state offloading that overlaps computation with communication. Result: Across multiple models and benchmarks, GRASS improves average accuracy by up to 4.38 points and reduces memory usage by up to 19.97%. Conclusion: GRASS is an efficient, adaptive layer-wise fine-tuning framework that improves performance and lowers GPU memory usage while maintaining comparable training throughput. Abstract: Full-parameter fine-tuning of large language models is constrained by substantial GPU memory requirements. Low-rank adaptation methods mitigate this challenge by updating only a subset of parameters. However, these approaches often limit model expressiveness and yield lower performance than full-parameter fine-tuning. Layer-wise fine-tuning methods have emerged as an alternative, enabling memory-efficient training through static layer importance sampling strategies. However, these methods overlook variations in layer importance across tasks and training stages, resulting in suboptimal performance on downstream tasks. To address these limitations, we propose GRASS, a gradient-based adaptive layer-wise importance sampling framework. GRASS utilizes mean gradient norms as a task-aware and training-stage-aware metric for estimating layer importance. Furthermore, GRASS adaptively adjusts layer sampling probabilities through an adaptive training strategy. We also introduce a layer-wise optimizer state offloading mechanism that overlaps computation and communication to further reduce memory usage while maintaining comparable training throughput. Extensive experiments across multiple models and benchmarks demonstrate that GRASS consistently outperforms state-of-the-art methods, achieving an average accuracy improvement of up to 4.38 points and reducing memory usage by up to 19.97%.
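
One way to picture gradient-based layer sampling is sketched below. It assumes, purely for illustration, that mean gradient norms are mapped to probabilities with a temperature-scaled softmax and that layers are drawn without replacement; the abstract does not give GRASS's actual mapping or schedule, and `layer_probs` and `sample_layers` are hypothetical names.

```python
import math
import random


def layer_probs(mean_grad_norms, temperature=1.0):
    """Map per-layer mean gradient norms to sampling probabilities.

    Assumption: a numerically stable softmax; larger norms get more mass.
    """
    scaled = [g / temperature for g in mean_grad_norms]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_layers(probs, k, rng):
    """Draw k distinct layer indices, weighted by probs, without replacement."""
    remaining = list(range(len(probs)))
    weights = list(probs)
    chosen = []
    for _ in range(k):
        pick = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        chosen.append(remaining.pop(pick))
        weights.pop(pick)
    return sorted(chosen)


# Toy gradient norms for a 4-layer model (made-up numbers).
probs = layer_probs([0.1, 0.5, 2.0, 0.3])
active = sample_layers(probs, k=2, rng=random.Random(0))
print(probs, active)
```

In a training loop, the norms would be re-estimated periodically so that the probabilities track the current task and training stage; only the sampled layers would then be unfrozen for the next interval.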

[21] AsyncTLS: Efficient Generative LLM Inference with Asynchronous Two-level Sparse Attention

Yuxuan Hu,Jianchao Tan,Jiaqi Zhang,Wen Zan,Pingwei Sun,Yifan Lu,Yerui Sun,Yuchen Xie,Xunliang Cai,Jing Zhang

Main category: cs.CL

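TL;DR: This paper proposes AsyncTLS, a hierarchical sparse attention mechanism that combines coarse-grained block filtering with fine-grained token selection and exploits temporal locality to offload the KV cache asynchronously, substantially improving long-context inference efficiency at accuracy comparable to full attention.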

Details

Motivation: Long-context inference faces both high attention compute cost and a large KV cache memory footprint, and existing sparse attention methods struggle to deliver accuracy and efficiency together. Method: Proposes AsyncTLS: (1) hierarchical sparse attention (coarse block-level filtering followed by fine token-level selection); (2) an asynchronous KV cache offloading engine based on temporal locality that overlaps data transfer with computation. Result: On Qwen3 and GLM-4.7-Flash with 48k-96k contexts, AsyncTLS keeps accuracy comparable to full attention while delivering 1.2x-10.0x operator speedups and 1.3x-4.7x end-to-end throughput improvements. Conclusion: AsyncTLS achieves a better accuracy-efficiency tradeoff and offers a practical, scalable approach to long-context LLM inference. Abstract: Long-context inference in LLMs faces the dual challenges of quadratic attention complexity and prohibitive KV cache memory. While token-level sparse attention offers superior accuracy, its indexing overhead is costly; block-level methods improve efficiency but sacrifice precision. We propose AsyncTLS, a hierarchical sparse attention system that combines coarse-grained block filtering with fine-grained token selection to balance accuracy and efficiency, coupled with an asynchronous offloading engine that overlaps KV cache transfers with computation via temporal locality exploitation. Evaluated on Qwen3 and GLM-4.7-Flash across GQA and MLA architectures, AsyncTLS achieves accuracy comparable to full attention while delivering 1.2x - 10.0x operator speedups and 1.3x - 4.7x end-to-end throughput improvements on 48k - 96k contexts.
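
The two-level idea (filter blocks coarsely, then pick tokens finely inside the surviving blocks) can be sketched in a few lines. This is a toy illustration under stated assumptions: blocks are scored by the dot product between the query and the block's mean key, and tokens by their individual dot products. The abstract does not specify AsyncTLS's scoring functions, and `two_level_select` is a hypothetical name.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def two_level_select(query, keys, block_size, num_blocks, num_tokens):
    """Return sorted indices of the tokens kept for sparse attention.

    Stage 1 keeps the num_blocks blocks whose mean key best matches the query.
    Stage 2 keeps the num_tokens best-matching tokens inside those blocks.
    """
    blocks = [
        list(range(start, min(start + block_size, len(keys))))
        for start in range(0, len(keys), block_size)
    ]

    def block_score(indices):
        mean_key = [
            sum(keys[i][d] for i in indices) / len(indices)
            for d in range(len(query))
        ]
        return dot(query, mean_key)

    top_blocks = sorted(blocks, key=block_score, reverse=True)[:num_blocks]
    candidates = [i for indices in top_blocks for i in indices]
    top_tokens = sorted(
        candidates, key=lambda i: dot(query, keys[i]), reverse=True
    )[:num_tokens]
    return sorted(top_tokens)


# Six toy 2-d keys in three blocks of two tokens each.
keys = [[1.0, 0.0], [0.9, 0.0], [0.0, 1.0], [0.0, 0.8], [-1.0, 0.0], [-1.0, 0.0]]
print(two_level_select([1.0, 0.0], keys, block_size=2, num_blocks=2, num_tokens=2))
```

The token-level scoring only touches keys in the selected blocks, which is where the indexing savings over pure token-level sparsity would come from.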

[22] Tool Retrieval Bridge: Aligning Vague Instructions with Retriever Preferences via Bridge Model

Kunfeng Chen,Luyao Zhuang,Fei Liao,Juhua Liu,Jian Wang,Bo Du

Main category: cs.CL

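TL;DR: This paper proposes Tool Retrieval Bridge (TRB), which uses a bridge model to rewrite vague instructions into more specific ones, improving tool retrieval under realistic vague instructions, and builds a new benchmark, VGToolBench, to validate it.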

Details

Motivation: Existing tool retrieval methods rely on academic benchmarks whose instructions contain detailed API information, whereas real user instructions are often vague, which degrades performance; tool retrieval therefore needs to be optimized for vague instructions. Method: Builds a new benchmark, VGToolBench, to simulate vague instructions; proposes TRB, which uses a bridge model to rewrite vague instructions into more specific forms that match retriever preferences. Result: TRB substantially improves performance across multiple retrieval settings; for example, it raises BM25's average NDCG from 9.73 to 19.59 (reported as a 111.51% relative improvement). Conclusion: TRB is a simple and effective method that mitigates the ambiguity of vague instructions and delivers consistent, substantial gains across baseline retrievers. Abstract: Tool learning has emerged as a promising paradigm for large language models (LLMs) to address real-world challenges. Due to the extensive and irregularly updated number of tools, tool retrieval for selecting the desired tool subset is essential. However, current tool retrieval methods are usually based on academic benchmarks containing overly detailed instructions (e.g., specific API names and parameters), while real-world instructions are more vague. Such a discrepancy would hinder the tool retrieval in real-world applications. In this paper, we first construct a new benchmark, VGToolBench, to simulate human vague instructions. Based on this, we conduct a series of preliminary analyses and find that vague instructions indeed damage the performance of tool retrieval. To this end, we propose a simple-yet-effective Tool Retrieval Bridge (TRB) approach to boost the performance of tool retrieval for vague instructions. The principle of TRB is to introduce a bridge model to rewrite the vague instructions into more specific ones and alleviate the gap between vague instructions and retriever preferences. We conduct extensive experiments under multiple commonly used retrieval settings, and the results show that TRB effectively mitigates the ambiguity of vague instructions while delivering consistent and substantial improvements across all baseline retrievers. For example, with the help of TRB, BM25 achieves a relative improvement of up to 111.51%, i.e., increasing the average NDCG score from 9.73 to 19.59. The source code and models are publicly available at https://github.com/kfchenhn/TRB.
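
The headline numbers are NDCG scores. As a reminder of what that metric measures, here is a minimal sketch of NDCG over a ranked list of relevance labels. The cutoff and gain function used in the paper are not stated in the abstract, so the linear gain and full-list normalization below are assumptions.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1),
    with ranks starting at 1 (so the top result is undiscounted)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(relevances):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Relevance of retrieved tools in ranked order (1 = the desired tool).
print(ndcg([1, 0, 0]))  # desired tool ranked first
print(ndcg([0, 1, 0]))  # desired tool ranked second
```

Rewriting a vague instruction helps on this metric whenever it moves the desired tools closer to the top of the retriever's ranking.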

[23] Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers

Harsh Kohli,Srinivasan Parthasarathy,Huan Sun,Yuekun Yao

Main category: cs.CL

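TL;DR: This paper studies implicit reasoning and uses recurrent-depth transformers to strengthen compositional generalization in multi-hop reasoning, showing their effectiveness for systematic generalization and depth extrapolation and analyzing the underlying mechanisms.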

Details

Motivation: Large language models store substantial knowledge and rules but lack compositional generalization in implicit multi-hop reasoning and struggle to compose their parametric knowledge. Method: Studies recurrent-depth transformers, which iterate computation over the same transformer layers; trains models from scratch and runs controlled experiments on systematic generalization and depth extrapolation, combined with mechanistic analysis and a study of training strategies. Result: Recurrent-depth transformers markedly improve both systematic generalization and depth extrapolation: systematic generalization emerges through a three-stage grokking process, and depth extrapolation can be unlocked by increasing the number of inference-time iterations, although an "overthinking" problem appears. Conclusion: Recurrent-depth architectures are an effective route to better compositional generalization in implicit reasoning, but recurrence depth must be balanced against the risk of overthinking; the study offers guidance for training and deploying such models. Abstract: We study implicit reasoning, i.e. the ability to combine knowledge or rules within a single forward pass. While transformer-based large language models store substantial factual knowledge and rules, they often fail to compose this knowledge for implicit multi-hop reasoning, suggesting a lack of compositional generalization over their parametric knowledge. To address this limitation, we study recurrent-depth transformers, which enable iterative computation over the same transformer layers. We investigate two compositional generalization challenges under the implicit reasoning scenario: systematic generalization, i.e. combining knowledge that is never used for compositions during training, and depth extrapolation, i.e. generalizing from limited reasoning depth (e.g. training on up to 5-hop) to deeper compositions (e.g. 10-hop). Through controlled studies with models trained from scratch, we show that while vanilla transformers struggle with both generalization challenges, recurrent-depth transformers can effectively make such generalization. For systematic generalization, we find that this ability emerges through a three-stage grokking process, transitioning from memorization to in-distribution generalization and finally to systematic generalization, supported by mechanistic analysis. For depth extrapolation, we show that generalization beyond training depth can be unlocked by scaling inference-time recurrence, with more iterations enabling deeper reasoning. We further study how training strategies affect extrapolation, providing guidance on training recurrent-depth transformers, and identify a key limitation, overthinking, where excessive recurrence degrades predictions and limits generalization to very deep compositions.
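
The core mechanism, reusing one block for a variable number of iterations, can be illustrated with a toy analogy. In this sketch the "block" is a plain dictionary lookup that resolves one hop per application; it is not a transformer layer, and `recurrent_depth`, `one_hop`, and the fact table are hypothetical names chosen for illustration.

```python
def recurrent_depth(block, state, num_iters):
    """Apply the same block num_iters times, feeding each output back in."""
    for _ in range(num_iters):
        state = block(state)
    return state


# Toy knowledge: each entity points to the next one in a chain.
facts = {"a": "b", "b": "c", "c": "d"}


def one_hop(entity):
    """Resolve a single hop; entities with no outgoing fact map to themselves."""
    return facts.get(entity, entity)


print(recurrent_depth(one_hop, "a", 1))  # one iteration resolves one hop
print(recurrent_depth(one_hop, "a", 3))  # three iterations resolve three hops
```

Because the block is shared, raising `num_iters` at inference time deepens the composition without adding parameters. This toy block is idempotent at the end of the chain, so it does not reproduce the overthinking failure the abstract describes for learned models.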

[24] Why Are We Lonely? Leveraging LLMs to Measure and Understand Loneliness in Caregivers and Non-caregivers

Michelle Damin Kim,Ellie S. Paek,Yufen Lin,Emily Mroz,Jane Chung,Jinho D. Choi

Main category: cs.CL

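TL;DR: This paper presents an LLM-based approach for building diverse social media datasets to measure and compare loneliness in caregivers and non-caregivers, with an expert-developed loneliness evaluation framework and a typology of causes, validated on Reddit data.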

Details

Motivation: Accurately measuring and comparing loneliness expressed on social media by caregivers and non-caregivers requires high-quality, diverse, and demographically representative datasets, but existing approaches are limited in annotation quality, fine-grained cause analysis, and population coverage. Method: Builds an expert-informed loneliness evaluation framework and cause typology; uses a human-validated data processing pipeline with GPT-4o, GPT-5-nano, and GPT-5 to construct a high-quality Reddit corpus; classifies loneliness and cause types, extracts demographic information, and compares distributions across groups. Result: The loneliness evaluation framework reaches accuracies of 76.09% for caregivers and 79.78% for non-caregivers; the cause categorization framework reaches micro-aggregate F1 scores of 0.825 and 0.80, respectively; the two groups differ substantially in the distribution of causes, with caregiver loneliness mostly linked to caregiving roles, identity recognition, and feelings of abandonment; Reddit data can support building a diverse caregiver loneliness dataset. Conclusion: The work establishes an LLM-based pipeline for building and analyzing high-quality social media datasets on loneliness and shows that it can reveal population-level differences in how loneliness manifests. Abstract: This paper presents an LLM-driven approach for constructing diverse social media datasets to measure and compare loneliness in the caregiver and non-caregiver populations. We introduce an expert-developed loneliness evaluation framework and an expert-informed typology for categorizing causes of loneliness for analyzing social media text. Using a human-validated data processing pipeline, we apply GPT-4o, GPT-5-nano, and GPT-5 to build a high-quality Reddit corpus and analyze loneliness across both populations. The loneliness evaluation framework achieved average accuracies of 76.09% and 79.78% for caregivers and non-caregivers, respectively. The cause categorization framework achieved micro-aggregate F1 scores of 0.825 and 0.80 for caregivers and non-caregivers, respectively. Across populations, we observe substantial differences in the distribution of types of causes of loneliness. Caregivers' loneliness was predominantly linked to caregiving roles, identity recognition, and feelings of abandonment, indicating distinct loneliness experiences between the two groups. Demographic extraction further demonstrates the viability of Reddit for building a diverse caregiver loneliness dataset. Overall, this work establishes an LLM-based pipeline for creating high quality social media datasets for studying loneliness and demonstrates its effectiveness in analyzing population-level differences in the manifestation of loneliness.
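
For readers less familiar with the reported metric, the sketch below shows how a micro-aggregated F1 score pools counts across cause categories before computing a single score. The category names and counts are made up for illustration and are not taken from the paper.

```python
def micro_f1(per_class_counts):
    """Micro-averaged F1: pool true positives, false positives, and false
    negatives over all classes, then compute F1 = 2TP / (2TP + FP + FN)."""
    tp = sum(c["tp"] for c in per_class_counts.values())
    fp = sum(c["fp"] for c in per_class_counts.values())
    fn = sum(c["fn"] for c in per_class_counts.values())
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical confusion counts for two cause categories.
counts = {
    "caregiving_role": {"tp": 8, "fp": 2, "fn": 1},
    "abandonment": {"tp": 2, "fp": 1, "fn": 2},
}
print(round(micro_f1(counts), 4))
```

Because counts are pooled, frequent categories weigh more heavily than rare ones, which is worth keeping in mind when cause distributions differ between two populations.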

[25] MemReader: From Passive to Active Extraction for Long-Term Agent Memory

Jingyi Kang,Chunyu Li,Ding Chen,Bo Tang,Feiyu Xiong,Zhiyu Li

Main category: cs.CL

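TL;DR: This paper introduces the MemReader model family for active long-term memory extraction. MemReader-0.6B is a lightweight passive extractor that ensures schema consistency; MemReader-4B is an active extractor optimized with GRPO that, under a ReAct-style paradigm, evaluates information value, reference ambiguity, and completeness and can choose to write, defer, retrieve, or discard, improving knowledge updating, temporal reasoning, and hallucination reduction.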

Details

Motivation: Existing memory extraction methods perform one-shot passive transcription and cope poorly with noisy dialogue, missing references, and cross-turn dependencies, leading to memory pollution, low-value writes, and inconsistency. Method: Proposes the MemReader family: the 0.6B model is a compact passive extractor obtained by distillation that ensures accurate, schema-consistent outputs; the 4B model is an active extractor optimized with Group Relative Policy Optimization (GRPO) that, under a ReAct-style paradigm, judges information value, reference ambiguity, and completeness and supports write, defer, retrieve, and discard actions. Result: On the LOCOMO, LongMemEval, and HaluMem benchmarks, MemReader consistently outperforms existing extraction-based baselines; MemReader-4B reaches state-of-the-art performance on knowledge updating, temporal reasoning, and hallucination reduction; it has been integrated into MemOS and is being deployed in real applications. Conclusion: Effective long-term agent memory is not about extracting more information but about reasoning-driven, selective extraction that builds low-noise, dynamically evolving memory. Abstract: Long-term memory is fundamental for personalized and autonomous agents, yet populating it remains a bottleneck. Existing systems treat memory extraction as a one-shot, passive transcription from context to structured entries, which struggles with noisy dialogue, missing references, and cross-turn dependencies, leading to memory pollution, low-value writes, and inconsistency. In this paper, we introduce the MemReader family for active long-term memory extraction in agent systems: MemReader-0.6B, a compact and cost-efficient passive extractor distilled for accurate and schema-consistent structured outputs, and MemReader-4B, an active extractor optimized with Group Relative Policy Optimization (GRPO) to make memory writing decisions. Under a ReAct-style paradigm, MemReader-4B explicitly evaluates information value, reference ambiguity, and completeness before acting, and can selectively write memories, defer incomplete inputs, retrieve historical context, or discard irrelevant chatter. Experiments on LOCOMO, LongMemEval, and HaluMem show that MemReader consistently outperforms existing extraction-based baselines. In particular, MemReader-4B achieves state-of-the-art performance on tasks involving knowledge updating, temporal reasoning, and hallucination reduction. These results suggest that effective agent memory requires not merely extracting more information, but performing reasoning-driven and selective memory extraction to build low-noise and dynamically evolving long-term memory. Furthermore, MemReader has been integrated into MemOS and is being deployed in real-world applications. To support future research and adoption, we release the models and provide public API access.

[26] Contextualising (Im)plausible Events Triggers Figurative Language

Annerose Eichel,Tonmoy Rakshit,Sabine Schulte im Walde

Main category: cs.CL

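TL;DR: This paper examines the relationship between (non-)literalness and plausibility in English subject-verb-object events. Using a systematic design of plausible and implausible event triples with abstract and concrete constituent categories, it compares human and LLM plausibility judgments: humans make nuanced, context-aware distinctions between (non-)literal and implausible events, whereas LLMs show only shallow contextualization and tend to reinterpret implausible events as non-literal but plausible.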

Details

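Motivation: Investigate the relationship between (non-)literalness and event plausibility, and how humans and large language models differ when judging such events. Method: Constructs subject-verb-object event triples that vary plausible/implausible status and abstract/concrete constituents, then collects and analyzes human judgments alongside LLM-generated judgments and example contexts. Result: Humans distinguish (non-)literalness from implausibility in a nuanced way and use context effectively; LLMs show only shallow contextualization and a bias toward reading implausible events as non-literal but plausible. Conclusion: (Non-)literalness and plausibility are finely separable in human judgments, but current LLMs lack the corresponding depth of semantic understanding and tend to conflate the two. Abstract: This work explores the connection between (non-)literalness and plausibility using the example of subject-verb-object events in English. We design a systematic setup of plausible and implausible event triples in combination with abstract and concrete constituent categories. Our analysis of human and LLM-generated judgments and example contexts reveals substantial differences between assessments of plausibility. While humans excel at nuanced detection and contextualization of (non-)literal vs. implausible events, LLM results reveal only shallow contextualization patterns with a bias to trade implausibility for non-literal, plausible interpretations.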

[27] Linear Representations of Hierarchical Concepts in Language Models

Masaki Sakata,Benjamin Heinzerling,Takumi Ito,Sho Yokoi,Kentaro Inui

Main category: cs.CL

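TL;DR: This paper studies how language models encode hierarchical relations (e.g., Japan ⊂ Eastern Asia ⊂ Asia) in their internal representations. Building on Linear Relational Concepts, it trains linear transformations specific to each hierarchical depth and semantic domain and analyzes how they differ. Hierarchical relations can be linearly recovered within a domain and are encoded in low-dimensional, domain-specific subspaces, yet hierarchy representation is highly similar across these subspaces, indicating that the models encode concept hierarchies in a highly interpretable linear form.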

Details

Motivation: Examine how and to what extent hierarchical relations (such as geographic containment) are encoded in language model representations, addressing gaps in prior work regarding multi-token entities and cross-layer analysis. Method: Building on Linear Relational Concepts, trains a dedicated linear transformation for each hierarchical depth and semantic domain and compares the transformations to characterize representational differences tied to hierarchy; covers multi-token entities and cross-layer representations, and evaluates in-domain generalization and cross-domain transfer. Result: Hierarchical relations can be linearly recovered from model representations within each domain; hierarchical information lives in low-dimensional, domain-specific subspaces; hierarchy representation is highly similar across the subspaces of different domains. Conclusion: All language models studied encode concept hierarchies in a highly interpretable linear form. Abstract: We investigate how and to what extent hierarchical relations (e.g., Japan $\subset$ Eastern Asia $\subset$ Asia) are encoded in the internal representations of language models. Building on Linear Relational Concepts, we train linear transformations specific to each hierarchical depth and semantic domain, and characterize representational differences associated with hierarchical relations by comparing these transformations. Going beyond prior work on the representational geometry of hierarchies in LMs, our analysis covers multi-token entities and cross-layer representations. Across multiple domains we learn such transformations and evaluate in-domain generalization to unseen data and cross-domain transfer. Experiments show that, within a domain, hierarchical relations can be linearly recovered from model representations. We then analyze how hierarchical information is encoded in representation space. We find that it is encoded in a relatively low-dimensional subspace and that this subspace tends to be domain-specific. Our main result is that hierarchy representation is highly similar across these domain-specific subspaces. Overall, we find that all models considered in our experiments encode concept hierarchies in the form of highly interpretable linear representations.

[28] Data Selection for Multi-turn Dialogue Instruction Tuning

Bo Li,Shikun Zhang,Wei Ye

Main category: cs.CL

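TL;DR: This paper proposes MDS (Multi-turn Dialogue Selection), a framework that improves instruction-tuning data quality from a data selection perspective by scoring and filtering whole dialogues in two stages: global coverage and local structure.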

Details

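Motivation: The large multi-turn dialogue datasets that instruction-tuned language models rely on are noisy and structurally inconsistent, with topic drift, repetitive chitchat, and mismatched answer formats. Method: Proposes the MDS framework: (1) a global coverage stage that performs bin-wise, dialogue-level selection in the user-query trajectory space for representativeness and non-redundancy; (2) a local structural stage that assesses within-dialogue reliability through entity-grounded topic coherence, information progress, and query-answer form consistency. Result: MDS outperforms strong single-turn selectors, dialogue-level LLM scorers, and heuristic baselines on three multi-turn benchmarks and an in-domain Banking test set, achieves the best overall rank on both reference-free and reference-based metrics, and is more robust on long conversations under the same training budget. Conclusion: Dialogue-level data selection is more effective than single-turn selection; by combining global representativeness with local structure, MDS improves multi-turn dialogue data quality and strengthens instruction tuning. Abstract: Instruction-tuned language models increasingly rely on large multi-turn dialogue corpora, but these datasets are often noisy and structurally inconsistent, with topic drift, repetitive chitchat, and mismatched answer formats across turns. We address this from a data selection perspective and propose MDS (Multi-turn Dialogue Selection), a dialogue-level framework that scores whole conversations rather than isolated turns. MDS combines a global coverage stage that performs bin-wise selection in the user-query trajectory space to retain representative yet non-redundant dialogues, with a local structural stage that evaluates within-dialogue reliability through entity-grounded topic grounding and information progress, together with query-answer form consistency for functional alignment. MDS outperforms strong single-turn selectors, dialogue-level LLM scorers, and heuristic baselines on three multi-turn benchmarks and an in-domain Banking test set, achieving the best overall rank across reference-free and reference-based metrics, and is more robust on long conversations under the same training budget. Code and resources are included in the supplementary materials.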

[29] TSUBASA: Improving Long-Horizon Personalization via Evolving Memory and Self-Learning with Context Distillation

Xinliang Frederick Zhang,Lu Wang

Main category: cs.CL

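TL;DR: This paper proposes TSUBASA, which improves memory writing and reading for personalized LLMs on long-horizon tasks through dynamic memory evolution and self-learning with context distillation, outperforming existing memory-augmented systems on the Qwen-3 model family.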

Details

Motivation: Personalized LLMs perform poorly on long-horizon tasks (such as long-running conversations or activity tracking): memory mechanisms fail to capture evolving user behavior, RAG faces a quality-efficiency tradeoff, and parametric adaptation suffers from a train-inference gap caused by scarce labeled data. Method: TSUBASA takes a two-pronged design: (1) dynamic memory evolution to improve memory writing; (2) self-learning with a context distillation objective to improve memory reading, so that the model internalizes user experiences. Result: On multiple long-horizon benchmarks, TSUBASA on Qwen-3 (4B to 32B) surpasses memory-augmented systems such as Mem0 and Memory-R1, achieving Pareto improvements in quality and efficiency with lower token consumption. Conclusion: TSUBASA breaks the quality-efficiency barrier in long-horizon personalization and provides a more robust, high-fidelity, and resource-efficient memory-augmentation approach for PLLMs. Abstract: Personalized large language models (PLLMs) have garnered significant attention for their ability to align outputs with individuals' needs and preferences. However, they still struggle with long-horizon tasks, such as tracking a user's extensive history of conversations or activities. Existing memory mechanisms often fail to capture evolving behaviors, and RAG paradigms are trapped by a quality-efficiency tradeoff. Meanwhile, parametric adaptation is bottlenecked by a train-inference gap due to the scarcity of labeled data. To enhance the long-horizon capabilities of PLLMs, we introduce TSUBASA, a two-pronged approach designed to improve memory writing via dynamic memory evolution, and memory reading via self-learning with a context distillation objective to internalize user experiences. Extensive evaluations on long-horizon benchmarks using the Qwen-3 model family (4B to 32B) validate the effectiveness of TSUBASA, surpassing competitive memory-augmented systems that rely primarily on memory writing, such as Mem0 and Memory-R1. Our analyses further confirm that TSUBASA breaks the quality-efficiency barrier to achieve Pareto improvements, delivering robust, high-fidelity personalization with a reduced token budget.

[30] HCRE: LLM-based Hierarchical Classification for Cross-Document Relation Extraction with a Prediction-then-Verification Strategy

Guoqi Ma,Liang Zhang,Hongyao Tu,Hao Fu,Hui Li,Yujie Lin,Longyue Wang,Weihua Luo,Jinsong Su

Main category: cs.CL

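TL;DR: This paper explores large language models (LLMs) for cross-document relation extraction (RE) and finds that using them directly is limited by the classification difficulty posed by the many predefined relations. It proposes HCRE, a hierarchical classification model that combines a hierarchical relation tree with a prediction-then-verification inference strategy and improves performance substantially.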

Details

Motivation: Existing methods based on small language models (SLMs) are limited by their language understanding ability, while preliminary experiments show that LLMs do not consistently surpass SLMs on cross-document RE, mainly because the large number of predefined relations makes classification hard; a new method is needed to adapt LLMs to the task. Method: Proposes HCRE: (1) build a hierarchical relation tree from the predefined relation set; (2) use an LLM to predict the relation level by level; (3) apply a prediction-then-verification strategy with multi-view verification at each level to limit error propagation. Result: HCRE outperforms existing baselines across multiple datasets, validating hierarchical classification and the prediction-then-verification strategy. Conclusion: The potential of LLMs for cross-document RE has not been fully realized; the key is improving how they reason over a complex relation space, and HCRE's structured relation representation and robust inference offer a new way to apply LLMs to this task. Abstract: Cross-document relation extraction (RE) aims to identify relations between the head and tail entities located in different documents. Existing approaches typically adopt the paradigm of "Small Language Model (SLM) + Classifier". However, the limited language understanding ability of SLMs hinders further improvement of their performance. In this paper, we conduct a preliminary study to explore the performance of Large Language Models (LLMs) in cross-document RE. Despite their extensive parameters, our findings indicate that LLMs do not consistently surpass existing SLMs. Further analysis suggests that the underperformance is largely attributed to the challenges posed by the numerous predefined relations. To overcome this issue, we propose an LLM-based Hierarchical Classification model for cross-document RE (HCRE), which consists of two core components: 1) an LLM for relation prediction and 2) a hierarchical relation tree derived from the predefined relation set. This tree enables the LLM to perform hierarchical classification, where the target relation is inferred level by level. Since the number of child nodes is much smaller than the size of the entire predefined relation set, the hierarchical relation tree significantly reduces the number of relation options that the LLM needs to consider during inference. However, hierarchical classification introduces the risk of error propagation across levels. To mitigate this, we propose a prediction-then-verification inference strategy that improves prediction reliability through multi-view verification at each level. Extensive experiments show that HCRE outperforms existing baselines, validating its effectiveness.
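
The level-by-level inference with a verification step can be sketched as a tree walk. In this illustration the relation tree, the `rank_children` callable (standing in for the LLM's prediction), and the `verify` callable (standing in for multi-view verification) are all hypothetical; the fallback to the top-ranked child when nothing verifies is likewise an assumption, not a detail from the abstract.

```python
def classify(tree, rank_children, verify):
    """Walk a relation tree from the root to a leaf.

    tree: nested dict mapping a relation name to its child subtree
          (an empty dict marks a leaf).
    rank_children(path, children): returns children ordered best-first.
    verify(path, child): returns True if the candidate passes verification.
    """
    node, path = tree, []
    while node:  # an empty dict means we have reached a leaf
        candidates = rank_children(path, list(node))
        # Take the best-ranked candidate that passes verification;
        # fall back to the top prediction if none passes.
        chosen = next((c for c in candidates if verify(path, c)), candidates[0])
        path.append(chosen)
        node = node[chosen]
    return path


# Toy two-level relation tree (illustrative labels only).
relation_tree = {
    "organization": {"founded_by": {}},
    "person": {"born_in": {}, "spouse": {}},
}


def rank_alphabetically(path, children):
    """Stub predictor: rank candidates in alphabetical order."""
    return sorted(children)


def reject_some(path, child):
    """Stub verifier: reject two candidates to force a fallback choice."""
    return child not in {"organization", "born_in"}


print(classify(relation_tree, rank_alphabetically, reject_some))
```

At each level the predictor only has to choose among a node's children rather than the full relation set, and the verifier gets a chance to stop an early mistake before it propagates to lower levels.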

[31] Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning

Shiwan Zhao,Zhihu Wang,Xuyang Zhao,Jiaming Zhou,Caiyue Xu,Chenfei Liu,Liting Zhang,Yuhang Jia,Yanzhe Zhang,Hualong Yu,Zichen Xu,Qicheng Li,Yong Qin

Main category: cs.CL

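TL;DR: This survey offers a unified framework for understanding LLM post-training as structured intervention on model behavior, organizing existing methods by trajectory provenance (off-policy vs. on-policy) and by three roles: support expansion, policy reshaping, and behavioral consolidation.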

Details

Motivation: Post-training methods (SFT, preference optimization, RL, and others) are usually discussed in fragments, grouped by objective or label, without a unified view of the behavioral bottlenecks they address. Method: Proposes a two-dimensional analysis based on trajectory provenance (off-policy vs. on-policy) and the role of the behavioral intervention (support expansion, policy reshaping, behavioral consolidation), and uses it to reclassify and interpret mainstream post-training paradigms. Result: Gives a unified account of the behavioral mechanisms and use cases of SFT, preference learning, RL, distillation, and multi-stage pipelines, showing that post-training is at heart a coordinated systems-level design problem. Conclusion: Progress in LLM post-training increasingly depends on coordinated multi-stage system design rather than optimizing any single objective; the framework helps diagnose bottlenecks and guide stage composition and method choice. Abstract: Post-training has become central to turning pretrained large language models (LLMs) into aligned and deployable systems. Recent progress spans supervised fine-tuning (SFT), preference optimization, reinforcement learning (RL), process supervision, verifier-guided methods, distillation, and multi-stage pipelines. Yet these methods are often discussed in fragmented ways, organized by labels or objective families rather than by the behavioral bottlenecks they address. This survey argues that LLM post-training is best understood as structured intervention on model behavior. We organize the field first by trajectory provenance, which defines two primary learning regimes: off-policy learning on externally supplied trajectories, and on-policy learning on learner-generated rollouts. We then interpret methods through two recurring roles -- effective support expansion, which makes useful behaviors more reachable, and policy reshaping, which improves behavior within already reachable regions -- together with a complementary systems-level role, behavioral consolidation, which preserves, transfers, and amortizes behavior across stages and model transitions. This perspective yields a unified reading of major paradigms. SFT may serve either support expansion or policy reshaping, whereas preference-based methods are usually off-policy reshaping. On-policy RL often improves behavior on learner-generated states, though under stronger guidance it can also make hard-to-reach reasoning paths reachable. Distillation is often best understood as consolidation rather than only compression, and hybrid pipelines emerge as coordinated multi-stage compositions. Overall, the framework helps diagnose post-training bottlenecks and reason about stage composition, suggesting that progress in LLM post-training increasingly depends on coordinated system design rather than any single dominant objective.

[32] Rethinking Data Mixing from the Perspective of Large Language Models

Yuanjian Xu,Tianze Sun,Changwei Xu,XinLong Zhao,Jianing Hao,Ran Chen,Yang Liu,Ruijie Xu,Stephen Chen,Guang Zhang

Main category: cs.CL

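TL;DR: This paper proposes DoGraph, a framework that links gradient dynamics to domain distributions and formulates data scheduling as a graph-constrained optimization problem, addressing open questions in LLM training about what a domain is, whether human and model perceptions of domains align, and how domain weighting affects generalization.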

Details

Motivation: Address key open questions about data mixing in LLM training: domains are not clearly defined, it is unclear whether humans and models perceive domains the same way, and the effect of domain weighting on generalization is not well understood. Method: Establishes a formal connection between gradient dynamics and domain distributions, and proposes the DoGraph reweighting framework, which models data scheduling as a graph-constrained optimization problem. Result: Extensive experiments on GPT-2 models of varying scales show that DoGraph consistently achieves competitive performance. Conclusion: Domains play a key role in training dynamics; DoGraph provides a theory-driven and practical data scheduling method that improves LLM generalization. Abstract: Data mixing strategy is essential for large language model (LLM) training. Empirical evidence shows that inappropriate strategies can significantly reduce generalization. Although recent methods have improved empirical performance, several fundamental questions remain open: what constitutes a domain, whether human and model perceptions of domains are aligned, and how domain weighting influences generalization. We address these questions by establishing formal connections between gradient dynamics and domain distributions, offering a theoretical framework that clarifies the role of domains in training dynamics. Building on this analysis, we introduce DoGraph, a reweighting framework that formulates data scheduling as a graph-constrained optimization problem. Extensive experiments on GPT-2 models of varying scales demonstrate that DoGraph consistently achieves competitive performance.

[33] AtomEval: Atomic Evaluation of Adversarial Claims in Fact Verification

Hongyi Cen,Mingxin Wang,Yule Liu,Jingyi Zheng,Hanze Jia,Tan Tang,Yingcai Wu

Main category: cs.CL

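TL;DR: This paper proposes AtomEval, a framework that decomposes claims into SROM atoms and scores adversarial rewrites with Atomic Validity Scoring (AVS) to detect factual corruption more reliably; it finds that stronger LLMs do not necessarily generate more effective adversarial claims.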

Details

Motivation: Standard metrics fail to capture truth-conditional consistency and often count semantically corrupted rewrites as successful attacks, so an evaluation method that can identify factual corruption is needed. Method: Proposes AtomEval, which decomposes claims into SROM (subject-relation-object-modifier) atoms and uses Atomic Validity Scoring (AVS) to quantify the factual consistency of adversarial rewrites; runs experiments on the FEVER dataset across several attack strategies and LLM generators. Result: AtomEval provides more reliable evaluation signals than conventional metrics across attack strategies and LLM generators; the analysis shows that stronger LLMs do not necessarily produce more effective adversarial claims. Conclusion: Current adversarial evaluation practice has overlooked limitations, and validity-aware evaluation such as AtomEval is essential for accurately measuring the robustness of fact-checking systems. Abstract: Adversarial claim rewriting is widely used to test fact-checking systems, but standard metrics fail to capture truth-conditional consistency and often label semantically corrupted rewrites as successful. We introduce AtomEval, a validity-aware evaluation framework that decomposes claims into subject-relation-object-modifier (SROM) atoms and scores adversarial rewrites with Atomic Validity Scoring (AVS), enabling detection of factual corruption beyond surface similarity. Experiments on the FEVER dataset across representative attack strategies and LLM generators show that AtomEval provides more reliable evaluation signals in our experiments. Using AtomEval, we further analyze LLM-based adversarial generators and observe that stronger models do not necessarily produce more effective adversarial claims under validity-aware evaluation, highlighting previously overlooked limitations in current adversarial evaluation practices.

[34] Kathleen: Oscillator-Based Byte-Level Text Classification Without Tokenization or Attention

George Fountzoulas

Main category: cs.CL

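TL;DR: Kathleen is a text classification architecture that operates directly on raw UTF-8 bytes using frequency-domain processing, with no tokenizer and no attention mechanism. It has only 733K parameters and outperforms a larger tokenized model on several benchmark datasets.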

Details

Motivation: Conventional text classification models depend on tokenization, carry many parameters, and have high computational complexity (such as the O(L^2) cost of Transformers); the paper explores a lighter, more efficient, end-to-end byte-level alternative. Method: Introduces three components: (1) RecurrentOscillatorBanks, damped sinusoid convolutions with temporal memory that give O(L) sequence processing; (2) an FFT-Rotate Wavetable Encoder, which maps all 256 byte values with a single learnable vector in place of a conventional embedding table; (3) PhaseHarmonics, a sinusoidal non-linearity with only 6 learnable phase parameters. The overall architecture is purely frequency-domain, attention-free, and tokenizer-free. Result: Kathleen-Clean reaches 88.6% on IMDB, 92.3% on AG News, and 83.3% on SST-2; it beats a tokenized counterpart with 16x more parameters by 1.6% on IMDB and 2.1% on AG News; PhaseHarmonics alone contributes +2.6% accuracy with 6 parameters; O(L) time and memory allow very long byte sequences. Conclusion: Frequency-domain modeling is promising for byte-level text classification; minimal components such as PhaseHarmonics can outperform complex cognitive architectures, which challenges the assumption that larger models are required for strong performance. Abstract: We present Kathleen, a text classification architecture that operates directly on raw UTF-8 bytes using frequency-domain processing -- requiring no tokenizer, no attention mechanism, and only 733K parameters. Kathleen introduces three novel components: (1) RecurrentOscillatorBanks -- damped sinusoid convolutions with temporal memory for O(L) sequence processing; (2) an FFT-Rotate Wavetable Encoder that maps all 256 byte values using a single learnable vector (256 floats), replacing conventional embedding tables (65K parameters) while improving accuracy; (3) PhaseHarmonics -- a sinusoidal non-linearity with just 6 learnable phase parameters that our ablation identifies as the single most impactful component (+2.6% accuracy, <0.001% of model parameters). Through comprehensive ablation of a 1.8M-parameter predecessor, we show that frequency-domain components systematically outperform complex cognitive architectures: removing a 560K-parameter bio-inspired framework costs only -0.2%, while removing the 6-parameter PhaseHarmonics costs -2.6%. The resulting Kathleen-Clean achieves 88.6% on IMDB, 92.3% on AG News, and 83.3% on SST-2 -- outperforming a tokenized counterpart with 16x more parameters on IMDB (+1.6%) and AG News (+2.1%). Kathleen processes sequences in O(L) time and memory, enabling byte-level operation at sequence lengths where O(L^2) Transformers exhaust GPU memory.
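
Two of the quantitative claims are easy to check with a few lines of arithmetic, and the same snippet shows what "operating on raw UTF-8 bytes" means for the input. This is only a sketch: `to_byte_ids` is a hypothetical helper, and the embedding width of 256 is an assumption chosen because 256 x 256 = 65,536 matches the "65K parameters" figure in the abstract.

```python
def to_byte_ids(text):
    """Tokenizer-free input: every UTF-8 byte becomes an integer in 0..255."""
    return list(text.encode("utf-8"))


# A non-ASCII character expands to several bytes ("\u00e9" is e-acute).
byte_ids = to_byte_ids("h\u00e9llo")
print(byte_ids)

# Parameter arithmetic behind the abstract's comparison.
VOCAB_SIZE = 256          # all possible byte values
ASSUMED_EMBED_DIM = 256   # assumption: width of a conventional embedding table
embedding_params = VOCAB_SIZE * ASSUMED_EMBED_DIM  # 65,536, i.e. about 65K
wavetable_params = 256    # single learnable vector of 256 floats (from the abstract)

# Share of the 733K parameters taken by the 6 PhaseHarmonics phases, in percent.
phase_share_percent = 6 / 733_000 * 100
print(embedding_params, wavetable_params, phase_share_percent)
```

The byte vocabulary is fixed at 256 regardless of language, which is why a single 256-entry vector can cover every possible input symbol.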

[35] A Decomposition Perspective to Long-context Reasoning for LLMs

Yanling Xiao,Huaibing Xie,Guoliang Zhao,Shihan Dou,Shaolei Wang,Yiting Liu,Nantao Zheng,Cheng Zhang,Pluto Zhou,Zhisong Zhang,Lemao Liu

Main category: cs.CL

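TL;DR: This paper decomposes long-context reasoning into fundamental atomic skills and uses reinforcement learning on synthesized pseudo datasets to strengthen those skills, improving the long-context reasoning ability of large language models.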

Details

Motivation: Existing work overlooks the internal complexity of the long-context reasoning task and lacks a systematic analysis and modeling of its constituent skills. Method: Long-context reasoning is decomposed into a set of atomic skills; pseudo datasets targeting each skill are synthesized automatically; reinforcement learning on these datasets then strengthens each atomic skill. Result: Across Loogle, Loong, LongBench-v2, BrowscompLong, Ruler-qa2, and MRCR, the average score improves by 7.7 points (from 46.3% to 54.0%). Conclusion: Mastering the underlying atomic skills is key to long-context reasoning; reinforcement learning targeted at those skills substantially improves overall long-text reasoning performance. Abstract: Long-context reasoning is essential for complex real-world applications, yet remains a significant challenge for Large Language Models (LLMs). Despite the rapid evolution in long-context reasoning, current research often overlooks the internal complexity of the long-context reasoning task itself. In this paper, we move beyond this holistic view and decompose long-context reasoning into a set of fundamental atomic skills, and we then automatically synthesize a suite of pseudo datasets, each explicitly targeting a specific atomic skill. Our empirical analysis confirms that proficiency in these atomic skills is strongly correlated with general long-text reasoning performance. Building on this insight, we employ reinforcement learning on these pseudo datasets to sharpen the model's atomic skills, in the hope of boosting its general long-context reasoning ability. Extensive experiments across multiple benchmarks demonstrate the effectiveness of our approach: it outperforms a strong baseline by an average margin of 7.7% (improving from 46.3% to 54.0%) across Loogle, Loong, LongBench-v2, BrowscompLong, Ruler-qa2, and MRCR.

[36] RAG Performance Prediction for Question Answering

Or Dado,David Carmel,Oren Kurland

Main category: cs.CL

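TL;DR: The paper studies how to predict the performance gain that RAG (retrieval-augmented generation) brings to question answering relative to not using it, and proposes a novel supervised predictor that explicitly models the semantic relationships among the question, the retrieved passages, and the generated answer, achieving the best prediction quality.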

Details

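Motivation: Predict whether RAG improves question-answering performance, in order to guide decisions about when to apply it. Method: Several pre-retrieval and post-retrieval predictors are evaluated, together with post-generation predictors, including a novel supervised one that explicitly models the semantic relationships among the question, the retrieved passages, and the generated answer. Result: The proposed supervised predictor achieves the best prediction quality. Conclusion: A supervised predictor that explicitly models the semantic relationships among the question, the retrieved content, and the generated answer is the most effective way to predict the gain from RAG. Abstract: We address the task of predicting the gain of using RAG (retrieval augmented generation) for question answering with respect to not using it. We study the performance of a few pre-retrieval and post-retrieval predictors originally devised for ad hoc retrieval. We also study a few post-generation predictors, one of which is novel to this study and posts the best prediction quality. Our results show that the most effective prediction approach is a novel supervised predictor that explicitly models the semantic relationships among the question, retrieved passages, and the generated answer.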

[37] Guaranteeing Knowledge Integration with Joint Decoding for Retrieval-Augmented Generation

Zhengyi Zhao,Shubo Zhang,Zezhong Wang,Yuxi Zhang,Huimin Wang,Yutian Zhao,Yefeng Zheng,Binyang Li,Kam-Fai Wong,Xian Wu

Main category: cs.CL

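TL;DR: The paper proposes GuarantRAG, a framework that explicitly decouples reasoning from evidence integration: it produces an Inner-Answer from parametric knowledge and a Refer-Answer that a Contrastive DPO objective forces to rely on external evidence, then fuses the two through joint decoding, improving the factual accuracy of RAG and reducing hallucinations.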

Details

Motivation: Even when existing RAG methods retrieve relevant documents, LLMs often fail to use the evidence because it conflicts with their internal parametric knowledge (the "integration bottleneck"); resolving this conflict implicitly works poorly. Method: GuarantRAG proceeds in three steps: (1) generate an Inner-Answer that relies only on parametric knowledge, to capture the reasoning flow; (2) train a Refer-Answer with a Contrastive DPO objective that treats the Inner-Answer as the negative sample and the retrieved documents as the positive signal, suppressing hallucinations; (3) apply a token-level joint decoding mechanism that dynamically fuses the logical coherence of the Inner-Answer with the factual precision of the Refer-Answer. Result: On five QA benchmarks, accuracy improves by up to 12.1% and hallucinations drop by 16.3% compared with standard and dynamic RAG baselines. Conclusion: Explicitly decoupling reasoning from evidence integration, and then optimizing the two jointly, is an effective way past the RAG integration bottleneck; GuarantRAG offers a new approach to improving the faithfulness and reliability of RAG. Abstract: Retrieval-Augmented Generation (RAG) significantly enhances Large Language Models (LLMs) by providing access to external knowledge. However, current research primarily focuses on retrieval quality, often overlooking the critical "integration bottleneck": even when relevant documents are retrieved, LLMs frequently fail to utilize them effectively due to conflicts with their internal parametric knowledge. In this paper, we argue that implicitly resolving this conflict in a single generation pass is suboptimal. We introduce GuarantRAG, a framework that explicitly decouples reasoning from evidence integration. First, we generate an "Inner-Answer" based solely on parametric knowledge to capture the model's reasoning flow. Second, to guarantee faithful evidence extraction, we generate a "Refer-Answer" using a novel Contrastive DPO objective. This objective treats the parametric Inner-Answer as a negative constraint and the retrieved documents as positive ground truth, forcing the model to suppress internal hallucinations in favor of external evidence during this phase. Finally, rather than naive concatenation or using the DPO-trained model directly, we propose a joint decoding mechanism that dynamically fuses the logical coherence of the Inner-Answer with the factual precision of the Refer-Answer at the token level. Experiments on five QA benchmarks demonstrate that GuarantRAG improves accuracy by up to 12.1% and reduces hallucinations by 16.3% compared to standard and dynamic RAG baselines.
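
The abstract does not give the exact form of the Contrastive DPO objective. As a point of reference only, the sketch below shows the standard DPO loss for one preference pair, with the evidence-grounded answer playing the "chosen" role and the parametric Inner-Answer playing the "rejected" role. That role assignment follows the abstract; the function name, the `beta` default, and the use of summed sequence log-probabilities are assumptions for illustration.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a single preference pair.

    logp_*     : sequence log-probabilities under the policy being trained
    ref_logp_* : the same quantities under the frozen reference model
    Here "chosen" stands for an evidence-grounded answer and "rejected" for
    the parametric Inner-Answer (an assumption about how the pair is built).
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # -log(sigmoid(z)) written as a numerically stable softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

The loss equals log 2 when the policy and reference agree, and falls as the policy raises the grounded answer's likelihood relative to the Inner-Answer.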

[38] Efficient Provably Secure Linguistic Steganography via Range Coding

Ruiyi Yan,Yugo Murawaki

Main category: cs.CL

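TL;DR: The paper proposes an efficient, provably secure linguistic steganography method based on range coding with a rotation mechanism; it keeps embedding efficiency high (around 100% entropy utilization) while reaching embedding speeds of up to 1554.66 bits/s.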

Details

Motivation: Achieve provable security for language-model-based steganography while combining perfect imperceptibility with high embedding capacity. Method: A classic entropy coding method (range coding) is used directly, and a rotation mechanism is added to obtain a provably secure linguistic steganography scheme. Result: Experiments across several language models show around 100% entropy utilization and embedding speeds of up to 1554.66 bits/s, outperforming existing baselines. Conclusion: The method preserves provable security and perfect imperceptibility while substantially improving embedding capacity and efficiency, offering a practical route for linguistic steganography. Abstract: Linguistic steganography involves embedding secret messages within seemingly innocuous texts to enable covert communication. Provable security, which is a long-standing goal and key motivation, has been extended to language-model-based steganography. Previous provably secure approaches have achieved perfect imperceptibility, measured by zero Kullback-Leibler (KL) divergence, but at the expense of embedding capacity. In this paper, we attempt to directly use a classic entropy coding method (range coding) to achieve secure steganography, and then propose an efficient and provably secure linguistic steganographic method with a rotation mechanism. Experiments across various language models show that our method achieves around 100% entropy utilization (embedding efficiency) for embedding capacity, outperforming the existing baseline methods. Moreover, it achieves high embedding speeds (up to 1554.66 bits/s on GPT-2). The code is available at github.com/ryehr/RRC_steganography.
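
To make the entropy-coding idea concrete, here is a toy sketch of interval-based embedding in the style of arithmetic or range coding. It is not the paper's method: there is no rotation mechanism, no security argument, and it uses exact fractions instead of fixed-precision range coding. The `next_dist` callback is a hypothetical stand-in for a language model's next-token distribution, and the message is assumed to be a non-empty bit string.

```python
import math
from fractions import Fraction


def _intervals(dist):
    """Return (token, low, high) cumulative intervals for a {token: Fraction} distribution."""
    low = Fraction(0)
    out = []
    for token, prob in dist.items():
        out.append((token, low, low + prob))
        low += prob
    return out


def embed(bits, next_dist):
    """Choose tokens whose nested intervals pin down the message bits."""
    n = len(bits)
    # Midpoint of the dyadic cell [k / 2^n, (k + 1) / 2^n) that encodes the message k.
    target = Fraction(2 * int(bits, 2) + 1, 2 ** (n + 1))
    lo, hi = Fraction(0), Fraction(1)
    tokens = []
    # Stop once the interval is narrow enough to lie inside the message's cell.
    while hi - lo >= Fraction(1, 2 ** (n + 1)):
        width = hi - lo
        for token, a, b in _intervals(next_dist(tokens)):
            if lo + width * a <= target < lo + width * b:
                tokens.append(token)
                lo, hi = lo + width * a, lo + width * b
                break
    return tokens


def extract(tokens, n_bits, next_dist):
    """Replay the interval narrowing and read the message off the final interval."""
    lo, hi = Fraction(0), Fraction(1)
    for i, token in enumerate(tokens):
        width = hi - lo
        for candidate, a, b in _intervals(next_dist(tokens[:i])):
            if candidate == token:
                lo, hi = lo + width * a, lo + width * b
                break
    return format(math.floor(lo * 2 ** n_bits), f"0{n_bits}b")
```

Sender and receiver must share the same model (`next_dist`) and the message length. High-probability tokens narrow the interval slowly and so carry few bits, which is why capacity tracks the entropy of the model's distribution.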

[39] Dual-Pool Token-Budget Routing for Cost-Efficient and Reliable LLM Serving

Xunzhuo Liu,Bowei He,Xue Liu,Andy Luo,Haichen Zhang,Huamin Chen

Main category: cs.CL

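TL;DR: The paper proposes dual-pool token-budget routing: a homogeneous GPU fleet is partitioned into a high-throughput short-context pool and a high-capacity long-context pool, and each request is routed by a token budget estimated from a bytes-to-token ratio learned online, which substantially improves the efficiency and reliability of vLLM serving.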

Details

Motivation: Production vLLM fleets are usually provisioned for the worst-case context length, which over-allocates the KV cache and under-utilizes concurrency; most requests are short yet are served under long-context configurations, wasting 4-8x throughput capacity and causing reliability problems such as OOM crashes, preemption, and request rejections. The root cause is a configuration-traffic mismatch. Method: Dual-pool token-budget routing partitions the fleet into two specialized pools (a high-throughput short-context pool and a high-capacity long-context pool). Each request is routed by its estimated total token budget, derived from the request's byte length and a per-category bytes-to-token ratio learned online, so no tokenizer is needed. An accompanying analytical model predicts cost savings from workload characteristics and measured throughput differences. Result: On real traces from the Azure LLM Inference Dataset and LMSYS-Chat-1M (Llama-3-70B on A100), GPU-hours fall by 31-42%, corresponding to $2.86M in annual savings; preemption rates drop by 5.4x and P99 TTFT improves by 6%. A Qwen3-235B-A22B case study on MI300X at 10,000 req/s projects $15.4M in annual savings. Dispatch overhead is O(1), the method adapts automatically to heterogeneous traffic, and it composes with existing optimizations such as PagedAttention and continuous batching. Conclusion: Dual-pool token-budget routing is a lightweight, adaptive, drop-in dispatch mechanism that mitigates the configuration-traffic mismatch in production vLLM deployments and improves throughput, cost, and reliability together. Abstract: Production vLLM fleets typically provision each instance for the worst-case context length, leading to substantial KV-cache over-allocation and under-utilized concurrency. In practice, 80-95% of requests are short, yet are served under configurations optimized for long contexts, wasting 4-8x throughput capacity and triggering reliability issues such as OOM crashes, preemption, and request rejections. We identify a common root cause for these inefficiencies: configuration-traffic mismatch. We propose dual-pool token-budget routing, a lightweight dispatch mechanism that partitions a homogeneous fleet into two specialized pools: a high-throughput short-context pool and a high-capacity long-context pool. Each request is routed based on its estimated total token budget, computed using a per-category bytes-to-token ratio that is learned online via exponential moving average from usage.prompt_tokens feedback, eliminating the need for a tokenizer. We also develop a simple analytical model that predicts fleet-level cost savings from workload characteristics and measured throughput differences, enabling practitioners to estimate benefits prior to deployment. Evaluations on real-world traces from the Azure LLM Inference Dataset and LMSYS-Chat-1M, serving Llama-3-70B on A100 GPUs, show that our approach reduces GPU-hours by 31-42%, corresponding to $2.86M annual savings at fleet scale, while lowering preemption rates by 5.4x and improving P99 TTFT by 6%. A case study with Qwen3-235B-A22B on AMD MI300X at 10,000 req/s projects $15.4M in annual savings. The method incurs only O(1) dispatch overhead, adapts automatically to heterogeneous workloads, and composes seamlessly with existing optimizations such as PagedAttention, continuous batching, and prefill-decode disaggregation.
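
The routing rule described in the abstract can be sketched in a few lines. The class below is a minimal illustration, not the authors' implementation: the class and method names, the 8192-token pool limit, the smoothing factor, the default ratio of 4 bytes per token, and the decision to add `max_output_tokens` to the prompt estimate are all assumptions chosen for the example.

```python
import math
from dataclasses import dataclass, field


@dataclass
class TokenBudgetRouter:
    short_pool_limit: int = 8192      # assumed context limit of the short pool
    alpha: float = 0.1                # assumed EMA smoothing factor
    default_ratio: float = 4.0        # assumed prior: bytes per token
    ratios: dict = field(default_factory=dict)  # per-category learned ratios

    def estimate_budget(self, category, prompt_bytes, max_output_tokens):
        """Estimate total tokens from the byte length, without running a tokenizer."""
        ratio = self.ratios.get(category, self.default_ratio)
        return math.ceil(prompt_bytes / ratio) + max_output_tokens

    def route(self, category, prompt_bytes, max_output_tokens):
        """O(1) dispatch decision: 'short' or 'long' pool."""
        budget = self.estimate_budget(category, prompt_bytes, max_output_tokens)
        return "short" if budget <= self.short_pool_limit else "long"

    def observe(self, category, prompt_bytes, prompt_tokens):
        """Update the category's ratio from the server-reported prompt token count."""
        if prompt_tokens <= 0:
            return
        observed = prompt_bytes / prompt_tokens
        previous = self.ratios.get(category, self.default_ratio)
        self.ratios[category] = (1 - self.alpha) * previous + self.alpha * observed
```

In use, `route` would be called before dispatch and `observe` after each response, fed with the `usage.prompt_tokens` value the abstract mentions. Categories whose text tokenizes densely (for example code or non-Latin scripts) would then drift toward lower ratios and larger budget estimates.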

[40] Quantum Vision Theory Applied to Audio Classification for Deepfake Speech Detection

Khalid Zaman,Melike Sah,Anuwat Chaiwongyenc,Cem Direkoglu

Main category: cs.CL

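TL;DR: The paper brings Quantum Vision (QV) theory, inspired by particle-wave duality in quantum physics, to deep-learning audio classification and in particular to deepfake speech detection. A QV block converts speech representations (STFT, Mel-spectrograms, MFCC) into "information waves" that feed QV-CNN and QV-ViT models, which improve detection accuracy and robustness on the ASVspoof dataset.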

Details

Motivation: Extend QV theory, previously shown to help in image classification, to the audio domain and examine its suitability and benefits for deepfake speech detection. Method: A QV block maps speech features (STFT, Mel-spectrograms, MFCC) to information waves; the block is embedded in CNN and ViT architectures to form QV-CNN and QV-ViT, which are trained and evaluated on the ASVspoof dataset. Result: QV-CNN and QV-ViT both outperform their baselines; QV-CNN with MFCC reaches 94.20% accuracy (EER = 9.04%), and QV-CNN with Mel-spectrograms attains the highest accuracy, 94.57%. Conclusion: QV theory is an effective approach to audio deepfake detection and opens new directions for quantum-inspired learning in audio perception. Abstract: We propose Quantum Vision (QV) theory as a new perspective for deep learning-based audio classification, applied to deepfake speech detection. Inspired by particle-wave duality in quantum physics, QV theory is based on the idea that data can be represented not only in its observable, collapsed form, but also as information waves. In conventional deep learning, models are trained directly on these collapsed representations, such as images. In QV theory, inputs are first transformed into information waves using a QV block, and then fed into deep learning models for classification. QV-based models improve performance in image classification compared to their non-QV counterparts. What if QV theory is applied to speech spectrograms for audio classification tasks? This is the motivation and novelty of the proposed approach. In this work, Short-Time Fourier Transform (STFT), Mel-spectrograms, and Mel-Frequency Cepstral Coefficients (MFCC) of speech signals are converted into information waves using the proposed QV block and used to train QV-based Convolutional Neural Networks (QV-CNN) and QV-based Vision Transformers (QV-ViT). Extensive experiments are conducted on the ASVspoof dataset for deepfake speech classification. The results show that QV-CNN and QV-ViT consistently outperform standard CNN and ViT models, achieving higher classification accuracy and improved robustness in distinguishing genuine and spoofed speech. Moreover, the QV-CNN model using MFCC features achieves the best overall performance on the ASVspoof dataset, with an accuracy of 94.20% and an EER of 9.04%, while the QV-CNN with Mel-spectrograms attains the highest accuracy of 94.57%. These findings demonstrate that QV theory is an effective and promising approach for audio deepfake detection and opens new directions for quantum-inspired learning in audio perception tasks.

[41] Initialisation Determines the Basin: Efficient Codebook Optimisation for Extreme LLM Quantization

Ian W. Kennedy,Nafise Sadat Moosavi

Main category: cs.CL

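TL;DR: The paper proposes OA-EM, an output-aware EM initialisation that addresses the performance bottleneck of additive quantization (AQ) under extreme 2-bit compression, where conventional greedy sequential initialisation traps optimisation in poor regions. It uses the representational ratio ρ = N/KM to characterise the relationship between weight groups and codebook capacity, initialises with a Hessian-weighted Mahalanobis distance, and improves the quality-compute trade-off after PV-tuning across several models and compression settings.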

Details

Motivation: Additive quantization often fails catastrophically at 2-bit precision, even with extensive search and fine-tuning; the authors find that the main cause is poor codebook initialisation, which places optimisation in a bad region. Method: OA-EM (Output-Aware EM) initialises codebooks with an output-aware EM procedure based on a Hessian-weighted Mahalanobis distance; the representational ratio ρ = N/KM is used to analyse how initialisation affects the optimisation geometry. Result: OA-EM consistently beats the baseline on Llama 3.2 3B, Llama 3.1 8B, and Qwen 2.5 3B; at 2 bpp it avoids perplexity degradation of orders of magnitude, and it dominates the quality-compute frontier. Conclusion: Under extreme compression, codebook initialisation is decisive for final quality, and optimisation geometry matters more than subsequent search and fine-tuning; OA-EM provides a reliable initialisation for efficient edge deployment. Abstract: Additive quantization enables extreme LLM compression with O(1) lookup-table dequantization, making it attractive for edge deployment. Yet at 2-bit precision, it often fails catastrophically, even with extensive search and finetuning. We show that the dominant bottleneck is codebook initialisation. Greedy sequential initialisation frequently places the model in poor optimisation regions that subsequent beam search and PV-tuning struggle to overcome. We analyse this behaviour through the representational ratio ρ = N/KM, which characterises the relationship between weight groups and codebook capacity, and propose OA-EM, an output-aware EM initialisation method using Hessian-weighted Mahalanobis distance. Across compression rates, search budgets, and three architectures (Llama 3.2 3B, Llama 3.1 8B, Qwen 2.5 3B), OA-EM consistently produces better solutions after PV-tuning and dominates the quality-compute frontier. The severity of the bottleneck scales with ρ: moderate at 3 bpp but extreme at 2 bpp, where poor initialisation can degrade perplexity by orders of magnitude. More broadly, our results highlight the importance of optimisation geometry in compressed model spaces, where initialisation can dominate subsequent search and fine-tuning.
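
The abstract names a Hessian-weighted Mahalanobis distance but gives no further detail on OA-EM. The sketch below shows only the generic idea under assumptions: a squared distance (w - c)^T H (w - c) between a weight group w and a codeword c, and a hard nearest-codeword assignment of the kind an EM-style or k-means-style initialiser might use. How the paper builds H, whether assignments are soft, and how multiple codebooks interact are not covered here; all names are hypothetical.

```python
def hessian_weighted_sq_distance(w, c, H):
    """Squared distance (w - c)^T H (w - c) for vectors w, c and a square matrix H."""
    d = [wi - ci for wi, ci in zip(w, c)]
    n = len(d)
    return sum(d[i] * H[i][j] * d[j] for i in range(n) for j in range(n))


def assign_groups(groups, codebook, H):
    """Hard assignment: index of the nearest codeword for each weight group under H."""
    return [
        min(
            range(len(codebook)),
            key=lambda k: hessian_weighted_sq_distance(g, codebook[k], H),
        )
        for g in groups
    ]
```

With H equal to the identity this reduces to ordinary Euclidean assignment. A non-uniform H penalises errors more along directions to which the layer's output is sensitive, which is the sense in which such an initialiser would be "output-aware".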

[42] LLM-Based Data Generation and Clinical Skills Evaluation for Low-Resource French OSCEs

Tian Huang,Tom Bourgeade,Irina Illina

Main category: cs.CL

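TL;DR: The paper proposes an LLM-based controlled generation and automatic evaluation approach for building a silver-labeled dataset of French OSCE doctor-patient dialogues in a low-resource setting, and shows that mid-size open-source LLMs reach evaluation accuracy close to GPT-4o (about 90%), supporting locally deployed, privacy-preserving assessment systems for medical education.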

Details

Motivation: OSCE training in France is limited by human and logistical constraints, so students lack repeated practice and structured feedback; annotated real French OSCE transcripts are also extremely scarce, which hinders reproducible research and reliable benchmarking. Method: A controlled generation pipeline, guided by scenario-specific evaluation criteria, produces French doctor-patient dialogues covering both ideal and perturbed performances; an LLM-assisted framework silver-labels the synthetic dialogues automatically with adjustable strictness; multiple open-source and proprietary LLMs are then benchmarked as evaluators on the synthetic data. Result: Mid-size LLMs (≤32B parameters) reach about 90% evaluation accuracy on synthetic data, comparable to GPT-4o, which supports the feasibility of locally deployed, privacy-preserving automatic OSCE evaluation. Conclusion: LLMs can support both high-quality synthetic data generation and automatic evaluation in low-resource French medical education, offering a scalable and compliant approach to clinical skills training. Abstract: Objective Structured Clinical Examinations (OSCEs) are the standard method for assessing medical students' clinical and communication skills through structured patient interviews. In France, however, the organization of training sessions is limited by human and logistical constraints, restricting students' access to repeated practice and structured feedback. Recent advances in Natural Language Processing (NLP) and Large Language Models (LLMs) now offer the opportunity to automatically evaluate such medical interviews, thereby alleviating the need for human examiners during training. Yet, real French OSCE annotated transcripts remain extremely scarce, limiting reproducible research and reliable benchmarking. To address these challenges, we investigate the use of LLMs for both generating and evaluating French OSCE dialogues in a low-resource context. We introduce a controlled pipeline that produces synthetic doctor-patient interview transcripts guided by scenario-specific evaluation criteria, combining ideal and perturbed performances to simulate varying student skill levels. The resulting dialogues are automatically silver-labeled through an LLM-assisted framework supporting adjustable evaluation strictness. Benchmarking multiple open-source and proprietary LLMs shows that mid-size models (≤32B parameters) achieve accuracies comparable to GPT-4o (~90%) on synthetic data, highlighting the feasibility of locally deployable, privacy-preserving evaluation systems for medical education.

[43] Graph Neural Networks for Misinformation Detection: Performance-Efficiency Trade-offs

Soveatin Kuntur,Maciej Krzywda,Anna Wróblewska,Marcin Paprzycki,Maria Ganzha,Szymon Łukasik,Amir H. Gandomi

Main category: cs.CL

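TL;DR: The paper benchmarks graph neural networks (GNNs) against non-graph machine learning methods for misinformation detection under controlled, comparable conditions. Lightweight GNNs such as GraphSAGE and ChebNet clearly outperform Logistic Regression, SVM, and MLP baselines on several multilingual datasets with comparable or lower inference time, showing that classic GNNs remain efficient and practical.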

Details

Motivation: Current misinformation detection models (such as large language models and hybrid architectures) are computationally expensive and hard to deploy, so lighter and more practical alternatives need to be evaluated. Method: On seven public datasets in English, Indonesian, and Polish, all models use identical TF-IDF features; lightweight GNNs (GCN, GraphSAGE, GAT, ChebNet) are compared with non-graph models (Logistic Regression, SVM, MLP), using F1 score and inference time as metrics. Result: GNNs outperform the non-graph baselines on every dataset: GraphSAGE reaches 96.8%, 91.9%, and 90.5% F1 on Kaggle, WELFake, and COVID-19, well above MLP; ChebNet reaches 79.1% on FakeNewsNet versus 66.4% for MLP; inference times are comparable or lower. Conclusion: Classic lightweight GNNs combine strong performance with efficiency in misinformation detection, challenging the drive toward ever more complex models and offering a better option for practical deployment. Abstract: The rapid spread of online misinformation has led to increasingly complex detection models, including large language models and hybrid architectures. However, their computational cost and deployment limitations raise concerns about practical applicability. In this work, we benchmark graph neural networks (GNNs) against non-graph-based machine learning methods under controlled and comparable conditions. We evaluate lightweight GNN architectures (GCN, GraphSAGE, GAT, ChebNet) against Logistic Regression, Support Vector Machines, and Multilayer Perceptrons across seven public datasets in English, Indonesian, and Polish. All models use identical TF-IDF features to isolate the impact of relational structure. Performance is measured using F1 score, with inference time reported to assess efficiency. GNNs consistently outperform non-graph baselines across all datasets. For example, GraphSAGE achieves 96.8% F1 on Kaggle and 91.9% on WELFake, compared to 73.2% and 66.8% for MLP, respectively. On COVID-19, GraphSAGE reaches 90.5% F1 vs. 74.9%, while ChebNet attains 79.1% vs. 66.4% on FakeNewsNet. These gains are achieved with comparable or lower inference times. Overall, the results show that classic GNNs remain effective and efficient, challenging the need for increasingly complex architectures in misinformation detection.

[44] Clickbait detection: quick inference with maximum impact

Soveatin Kuntur,Panggih Kusuma Ningrum,Anna Wróblewska,Maria Ganzha,Marcin Paprzycki

Main category: cs.CL

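TL;DR: The paper proposes a lightweight hybrid approach to clickbait detection that combines OpenAI semantic embeddings with six compact heuristic features, reduces the embeddings with PCA, and evaluates XGBoost, GraphSAGE, and GCN classifiers; the graph-based models keep strong discrimination while substantially reducing inference time.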

Details

Motivation: Make clickbait detection more efficient and practical, balancing performance and speed in resource-constrained settings. Method: OpenAI semantic embeddings are fused with six lightweight heuristic features; PCA reduces the embedding dimensionality; XGBoost, GraphSAGE, and GCN serve as classifiers. Result: The graph neural network models (GraphSAGE, GCN) reach competitive F1 scores and high ROC-AUC values with substantially reduced inference time. Conclusion: A lightweight hybrid approach can markedly improve inference efficiency while keeping detection reliable, which suits real-time or edge deployment. Abstract: We propose a lightweight hybrid approach to clickbait detection that combines OpenAI semantic embeddings with six compact heuristic features capturing stylistic and informational cues. To improve efficiency, embeddings are reduced using PCA and evaluated with XGBoost, GraphSAGE, and GCN classifiers. While the simplified feature design yields slightly lower F1-scores, graph-based models achieve competitive performance with substantially reduced inference time. High ROC-AUC values further indicate strong discrimination capability, supporting reliable detection of clickbait headlines under varying decision thresholds.

[45] Training Data Size Sensitivity in Unsupervised Rhyme Recognition

Petr Plecháč,Artjoms Šeļa,Silvie Cinková,Mirella De Sisto,Lara Nugues,Neža Kočnik,Antonina Martynenko,Ben Nagy,Luca Giovannini,Robert Kolár

Main category: cs.CL

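TL;DR: The paper studies the unsupervised rhyme recognition tool RhymeTagger across seven languages, examining how much training data is required and how language differences affect accuracy, and compares it with human inter-annotator agreement and with large language models.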

Details

Motivation: Judgments about rhyme are historically constructed, subjective, and complex across languages, which makes automated rhyme recognition difficult, especially in multilingual settings. Method: The language-independent tool RhymeTagger is used for unsupervised rhyme recognition on poetry corpora in seven languages; performance is measured at different training sizes; inter-annotator agreement (IAA) among experts is computed and the sources of disagreement (such as phonetic similarity and distance between words) are analysed; RhymeTagger is compared with three large language models under a one-shot strategy. Result: Given sufficient training data, RhymeTagger consistently exceeds human agreement, while large language models without explicit phonetic representation perform markedly worse. Conclusion: In the unsupervised setting, RhymeTagger's reliance on repeating patterns generalizes well and is robust across languages; phonetic modeling is essential for rhyme recognition, and text-only LLMs cannot readily replace dedicated tools. Abstract: Rhyme is deceptively intuitive: what is or is not a rhyme is constructed historically, scholars struggle with rhyme classification, and people disagree on whether two words are rhymed or not. This complicates automated rhyme recognition and evaluation, especially in multilingual contexts. This article investigates how much training data is needed for reliable unsupervised rhyme recognition using RhymeTagger, a language-independent tool that identifies rhymes based on repeating patterns in poetry corpora. We evaluate its performance across seven languages (Czech, German, English, French, Italian, Russian, and Slovene), examining how training size and language differences affect accuracy. To set a realistic performance benchmark, we assess inter-annotator agreement on a manually annotated subset of poems and analyze factors contributing to disagreement in expert annotations: phonetic similarity between rhyming words and their distance from each other in a poem. We also compare RhymeTagger to three large language models using a one-shot learning strategy. Our findings show that, once provided with sufficient training data, RhymeTagger consistently outperforms human agreement, while LLMs lacking phonetic representation significantly struggle with the task.

[46] Self-Debias: Self-correcting for Debiasing Large Language Models

Xuan Feng,Shuai Zhao,Luwei Xiao,Tianlong Gu,Bo An

Main category: cs.CL

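TL;DR: The paper proposes Self-Debias, a framework that treats debiasing as a redistribution of probability mass and imposes dynamic constraints at the level of reasoning trajectories, so that an LLM can identify and interrupt bias propagation within its own chain of thought. Combined with an online self-improvement mechanism based on consistency filtering, it achieves efficient debiasing with little annotated data while preserving reasoning ability.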

Details

Motivation: Existing debiasing methods struggle to identify and interrupt ongoing "Bias Propagation" during the Chain-of-Thought process, and models lack an intrinsic ability to correct themselves. Method: Debiasing is formulated as a strategic redistribution of output probability mass; a fine-grained trajectory-level objective is combined with dynamic debiasing constraints; an online self-improvement mechanism based on consistency filtering synthesizes supervision signals automatically. Result: With only 20k annotated samples, the method achieves efficient self-correction and stronger debiasing on several benchmarks while preserving general reasoning ability, without continuous external intervention. Conclusion: By giving the model an intrinsic self-correction capability, Self-Debias addresses the problem of interrupting bias propagation in CoT and offers a new approach to fairer, more reliable LLM reasoning. Abstract: Although Large Language Models (LLMs) demonstrate remarkable reasoning capabilities, inherent social biases often cascade throughout the Chain-of-Thought (CoT) process, leading to continuous "Bias Propagation". Existing debiasing methods primarily focus on static constraints or external interventions, failing to identify and interrupt this propagation once triggered. To address this limitation, we introduce Self-Debias, a progressive framework designed to instill intrinsic self-correction capabilities. Specifically, we reformulate the debiasing process as a strategic resource redistribution problem, treating the model's output probability mass as a limited resource to be reallocated from biased heuristics to unbiased reasoning paths. Unlike standard preference optimization which applies broad penalties, Self-Debias employs a fine-grained trajectory-level objective subject to dynamic debiasing constraints. This enables the model to selectively revise biased reasoning suffixes while preserving valid contextual prefixes. Furthermore, we integrate an online self-improvement mechanism utilizing consistency filtering to autonomously synthesize supervision signals. With merely 20k annotated samples, Self-Debias activates efficient self-correction, achieving superior debiasing performance while preserving general reasoning capabilities without continuous external oversight.

[47] HyperMem: Hypergraph Memory for Long-Term Conversations

Juwei Yue,Chuanrui Hu,Jiawei Sheng,Zuyi Zhou,Wenyuan Zhang,Tingwen Liu,Li Guo,Yafeng Deng

Main category: cs.CL

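TL;DR: The paper proposes HyperMem, a hypergraph-based hierarchical memory architecture that addresses the difficulty existing conversational agents have in capturing high-order associations in long-term memory. Hyperedges model memory at three levels (topics, episodes, and facts), and a hybrid index with a coarse-to-fine retrieval strategy yields a state-of-the-art 92.73% accuracy on the LoCoMo benchmark.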

Details

Motivation: Existing approaches (such as RAG and graph-based memory) rely on pairwise relations and can hardly model high-order joint dependencies among multiple elements, which fragments memory retrieval. Method: HyperMem is a hypergraph-based three-level memory structure (topics, episodes, facts) in which hyperedges group related episodes and their facts; it adds a hybrid lexical-semantic index and a coarse-to-fine retrieval strategy. Result: On the LoCoMo benchmark it reaches 92.73% LLM-as-a-judge accuracy, clearly ahead of existing methods. Conclusion: A hypergraph structure can model high-order associations in long-term conversations, improving memory coherence and retrieval completeness and giving long-horizon conversational agents a more robust memory mechanism. Abstract: Long-term memory is essential for conversational agents to maintain coherence, track persistent tasks, and provide personalized interactions across extended dialogues. However, existing approaches such as Retrieval-Augmented Generation (RAG) and graph-based memory mostly rely on pairwise relations, which can hardly capture high-order associations, i.e., joint dependencies among multiple elements, causing fragmented retrieval. To this end, we propose HyperMem, a hypergraph-based hierarchical memory architecture that explicitly models such associations using hyperedges. Particularly, HyperMem structures memory into three levels: topics, episodes, and facts, and groups related episodes and their facts via hyperedges, unifying scattered content into coherent units. Leveraging this structure, we design a hybrid lexical-semantic index and a coarse-to-fine retrieval strategy, supporting accurate and efficient retrieval of high-order associations. Experiments on the LoCoMo benchmark show that HyperMem achieves state-of-the-art performance with 92.73% LLM-as-a-judge accuracy, demonstrating the effectiveness of HyperMem for long-term conversations.
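
A rough illustration of the three-level structure and coarse-to-fine retrieval follows. It is a sketch under stated assumptions rather than HyperMem itself: a hyperedge is modelled as one topic grouping several episodes with their facts, and scoring uses plain word overlap in place of the paper's hybrid lexical-semantic index. The class name, field names, and `top_*` parameters are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Hyperedge:
    """One topic that groups several episodes, each carrying its own facts."""
    topic: str
    episodes: dict = field(default_factory=dict)  # episode summary -> list of facts


def _overlap(query, text):
    """Number of distinct lowercase words shared by the query and the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve(query, memory, top_topics=1, top_episodes=1):
    """Coarse-to-fine lookup: rank topics, then episodes, then return their facts."""
    facts = []
    topics = sorted(memory, key=lambda e: _overlap(query, e.topic), reverse=True)
    for edge in topics[:top_topics]:
        episodes = sorted(
            edge.episodes, key=lambda ep: _overlap(query, ep), reverse=True
        )
        for episode in episodes[:top_episodes]:
            facts.extend(edge.episodes[episode])
    return facts
```

Because facts are reached through the hyperedge that groups them, everything attached to a selected episode is returned together, rather than as isolated pairwise matches.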

[48] Behavior-Aware Item Modeling via Dynamic Procedural Solution Representations for Knowledge Tracing

Jun Seo,Sangwon Ryu,Heejin Do,Hyounghun Kim,Gary Geunbae Lee

Main category: cs.CL

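TL;DR: The paper proposes BAIM, a behavior-aware item modeling framework that enriches item representations with dynamic procedural information from Polya's four problem-solving stages and adds a context-conditioned mechanism that adaptively routes the stage-wise representations, improving knowledge tracing, especially under repeated learner interactions.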

Details

Motivation: Existing knowledge tracing methods model items through Knowledge Components but ignore the procedural dynamics of how students solve problems, so they capture real cognitive behavior poorly. Method: BAIM uses a reasoning language model to decompose each item's solution into four stages grounded in Polya's framework (understand, plan, carry out, look back), derives stage-level representations from per-stage embedding trajectories, and integrates them into a KT backbone through a context-conditioned adaptive routing mechanism. Result: On XES3G5M and NIPS34, BAIM consistently outperforms strong pretraining-based baselines, with particularly large gains when learners interact with items repeatedly. Conclusion: Integrating procedural dynamics of problem solving and adaptively modeling learner-level differences improves the predictive accuracy and cognitive interpretability of knowledge tracing. Abstract: Knowledge Tracing (KT) aims to predict learners' future performance from past interactions. While recent KT approaches have improved via learning item representations aligned with Knowledge Components, they overlook the procedural dynamics of problem solving. We propose Behavior-Aware Item Modeling (BAIM), a framework that enriches item representations by integrating dynamic procedural solution information. BAIM leverages a reasoning language model to decompose each item's solution into four problem-solving stages (i.e., understand, plan, carry out, and look back), pedagogically grounded in Polya's framework. Specifically, it derives stage-level representations from per-stage embedding trajectories, capturing latent signals beyond surface features. To reflect learner heterogeneity, BAIM adaptively routes these stage-wise representations, introducing a context-conditioned mechanism within a KT backbone, allowing different procedural stages to be emphasized for different learners. Experiments on XES3G5M and NIPS34 show that BAIM consistently outperforms strong pretraining-based baselines, achieving particularly large gains under repeated learner interactions.

[49] Floating or Suggesting Ideas? A Large-Scale Contrastive Analysis of Metaphorical and Literal Verb-Object Constructions

Prisca Piccirilli,Alexander Fraser,Sabine Schulte im Walde

Main category: cs.CL

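TL;DR: The paper analyzes 297 English verb-object pairs in about 2 million corpus sentences, using 2,293 cognitive and linguistic features, to examine how metaphorical and literal expressions differ across pairs and within pairs. The differences turn out not to be universal; they depend strongly on the specific construction.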

Details

Motivation: Prior work has focused on the cognitive and psycholinguistic properties of metaphor, but large-scale systematic comparisons of metaphorical and literal language for near-synonymous expressions are lacking. Method: The contextual usage of 297 English verb-object pairs (e.g., float idea vs. suggest idea) is studied in about 2 million sentences; five NLP tools extract 2,293 features covering affective, lexical, syntactic, and discourse-level properties; both cross-pair and within-pair contrastive analyses are carried out. Result: Cross-pair analysis shows that literal contexts have higher lexical frequency, cohesion, and structural regularity, while metaphorical contexts show greater affective load, imageability, lexical diversity, and constructional specificity. Within-pair analysis reveals inconsistent effects inside most verb-object pairs, with substantial heterogeneity. Conclusion: No single distributional pattern separates metaphorical from literal usage; the differences are mainly construction-specific. Combining large-scale data with diverse features allows a fine-grained account of the contrast in verb-object constructions. Abstract: Metaphor pervades everyday language, allowing speakers to express abstract concepts via concrete domains. While prior work has studied metaphors cognitively and psycholinguistically, large-scale comparisons with literal language remain limited, especially for near-synonymous expressions. We analyze 297 English verb-object pairs (e.g., float idea vs. suggest idea) in ~2M corpus sentences, examining their contextual usage. Using five NLP tools, we extract 2,293 cognitive and linguistic features capturing affective, lexical, syntactic, and discourse-level properties. We address: (i) whether features differ between metaphorical and literal contexts (cross-pair analysis), and (ii) whether individual VO pairs diverge internally (within-pair analysis). Cross-pair results show literal contexts have higher lexical frequency, cohesion, and structural regularity, while metaphorical contexts show greater affective load, imageability, lexical diversity, and constructional specificity. Within-pair analyses reveal substantial heterogeneity, with most pairs showing non-uniform effects. These results suggest no single, consistent distributional pattern that distinguishes metaphorical from literal usage. Instead, differences are largely construction-specific. Overall, large-scale data combined with diverse features provides a fine-grained understanding of metaphor-literal contrasts in VO usage.

[50] When to Trust Tools? Adaptive Tool Trust Calibration For Tool-Integrated Math Reasoning

Ruotao Xu,Yixin Ji,Yu Luo,Jinpeng Li,Dong Li,Peifeng Li,Juntao Li,Min Zhang

Main category: cs.CL

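TL;DR: The paper proposes Adaptive Tool Trust Calibration (ATTC), a framework that addresses the tendency of Tool-Integrated Reasoning (TIR) models to disregard correct tool results ("Tool Ignored"). ATTC uses the confidence of generated code blocks to decide dynamically whether to trust the tool output, improving several open-source TIR models across multiple datasets by 4.1% to 7.5%.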

Details

Motivation: When a TIR model's own reasoning conflicts with a tool result, the model tends to trust its reasoning and ignore a correct tool output ("Tool Ignored"); it lacks the ability to judge how far tool results can be trusted. Method: Adaptive Tool Trust Calibration (ATTC) uses the confidence score of the generated code block to guide the model in deciding whether to trust or ignore the tool's execution result. Result: Across open-source TIR models of different sizes and multiple datasets, ATTC reduces the "Tool Ignored" problem and improves performance by 4.1% to 7.5%. Conclusion: ATTC gives TIR models a generalizable trust-decision mechanism and improves their use of tool results, an important direction for making large reasoning models more reliable. Abstract: Large reasoning models (LRMs) have achieved strong performance enhancement through scaling test time computation, but due to the inherent limitations of the underlying language models, they still have shortcomings in tasks that require precise computation and extensive knowledge reserves. Tool-Integrated Reasoning (TIR) has emerged as a promising paradigm that incorporates tool call and execution within the reasoning trajectory. Although recent works have released some powerful open-source TIR models, our analysis reveals that these models still suffer from critical deficiencies. We find that when the reasoning of the model conflicts with the tool results, the model tends to believe in its own reasoning. There are also cases where the tool results are correct but are ignored by the model, resulting in incorrect answers, which we define as "Tool Ignored". This indicates that the model does not know when to trust or ignore the tool. To overcome these limitations, we introduce Adaptive Tool Trust Calibration (ATTC), a novel framework that guides the model to adaptively choose to trust or ignore the tool results based on the confidence score of generated code blocks. The experimental results from various open-source TIR models of different sizes and across multiple datasets demonstrate that ATTC effectively reduces the "Tool Ignored" issue, resulting in a performance increase of 4.1% to 7.5%.
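
The decision rule can be illustrated with a small sketch. The abstract says only that the trust decision is based on a confidence score for the generated code block. The particular score used here (the geometric mean of token probabilities, computed from token log-probabilities), the 0.8 threshold, and the function names are assumptions for illustration and may differ from ATTC.

```python
import math


def code_confidence(token_logprobs):
    """Geometric-mean token probability of the generated code block (assumed score)."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def choose_answer(model_answer, tool_answer, code_token_logprobs, threshold=0.8):
    """Trust the tool result when the code that produced it was generated confidently."""
    if model_answer == tool_answer:
        return tool_answer
    if code_confidence(code_token_logprobs) >= threshold:
        return tool_answer
    return model_answer
```

The intent is that a confidently written program whose output disagrees with the model's free-form reasoning is more likely to be right than the reasoning, while output from a low-confidence (and possibly buggy) program is not allowed to override it.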

[51] Distributed Multi-Layer Editing for Rule-Level Knowledge in Large Language Models

Yating Wang,Wenting Zhao,Yaqi Zhao,Yongshun Gong,Yilong Yin,Haoliang Sun

Main category: cs.CL

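TL;DR: The paper studies the editing of rule-level knowledge in large language models, finds that rule knowledge is distributed across transformer layers according to its form (formula, description, instance), and proposes Distributed Multi-Layer Editing (DMLE), which substantially improves rule-level editing.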

Details

Motivation: Existing model editing methods target fact-level knowledge and assume that an edit can be achieved through a localized intervention, but rule-level knowledge must stay consistent across several forms (symbolic expressions, natural language explanations, concrete instances) and cannot be edited reliably by a single-layer or contiguous-block intervention. Method: The RuleEdit benchmark is extended from 80 to 200 manually verified rules; fine-grained causal tracing is used to analyze how rule knowledge is distributed over transformer layers; based on the findings, DMLE applies a shared early-layer update to formulas and descriptions and a separate middle-layer update to instances. Result: Across GPT-J-6B, Qwen2.5-7B, Qwen2-7B, and LLaMA-3-8B, DMLE improves instance portability and rule understanding over the strongest baseline by 13.91 and 50.19 percentage points on average, respectively. Conclusion: Rule-level knowledge has a form-dependent layered organization and calls for a distributed editing strategy coordinated across layers; DMLE validates this idea and offers a new approach to rule-level knowledge editing. Abstract: Large language models store not only isolated facts but also rules that support reasoning across symbolic expressions, natural language explanations, and concrete instances. Yet most model editing methods are built for fact-level knowledge, assuming that a target edit can be achieved through a localized intervention. This assumption does not hold for rule-level knowledge, where a single rule must remain consistent across multiple interdependent forms. We investigate this problem through a mechanistic study of rule-level knowledge editing. To support this study, we extend the RuleEdit benchmark from 80 to 200 manually verified rules spanning mathematics and physics. Fine-grained causal tracing reveals a form-specific organization of rule knowledge in transformer layers: formulas and descriptions are concentrated in earlier layers, while instances are more associated with middle layers. These results suggest that rule knowledge is not uniformly localized, and therefore cannot be reliably edited by a single-layer or contiguous-block intervention. Based on this insight, we propose Distributed Multi-Layer Editing (DMLE), which applies a shared early-layer update to formulas and descriptions and a separate middle-layer update to instances. While remaining competitive on standard editing metrics, DMLE achieves substantially stronger rule-level editing performance. On average, it improves instance portability and rule understanding by 13.91 and 50.19 percentage points, respectively, over the strongest baseline across GPT-J-6B, Qwen2.5-7B, Qwen2-7B, and LLaMA-3-8B.
The code is available at https://github.com/Pepper66/DMLE.

[52] SeLaR: Selective Latent Reasoning in Large Language Models

Renyu Fu,Guibo Luo

Main category: cs.CL

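TL;DR: This paper proposes SeLaR (Selective Latent Reasoning), a lightweight, training-free framework that uses an entropy-gated mechanism to enable soft embeddings selectively (only at low-confidence reasoning steps) and adds an entropy-aware contrastive regularization to keep soft embeddings from collapsing, improving the stability and exploration ability of chain-of-thought reasoning in large language models.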

Details

Motivation: Existing latent reasoning methods suffer from two problems: global activation perturbs high-confidence steps, and soft embeddings tend to collapse toward the highest-probability token, which limits exploration. Method: The authors propose an entropy-gated mechanism (soft embeddings are enabled only at low-confidence steps) and an entropy-aware contrastive regularization (soft embeddings are pushed away from the direction of the dominant token). Result: On five reasoning benchmarks, SeLaR consistently outperforms standard CoT and state-of-the-art training-free methods. Conclusion: Selectively combining discrete and continuous representations, together with targeted regularization, improves reasoning stability and path diversity without additional training. Abstract: Chain-of-Thought (CoT) has become a cornerstone of reasoning in large language models, yet its effectiveness is constrained by the limited expressiveness of discrete token sampling. Recent latent reasoning approaches attempt to alleviate this limitation by replacing discrete tokens with soft embeddings (probability-weighted mixtures of token embeddings) or hidden states, but they commonly suffer from two issues: (1) global activation injects perturbations into high-confidence steps, impairing reasoning stability; and (2) soft embeddings quickly collapse toward the highest-probability token, limiting exploration of alternative trajectories. To address these challenges, we propose SeLaR (Selective Latent Reasoning), a lightweight and training-free framework. SeLaR introduces an entropy-gated mechanism that activates soft embeddings only at low-confidence steps, while preserving discrete decoding at high-confidence steps. Additionally, we propose an entropy-aware contrastive regularization that pushes soft embeddings away from the dominant (highest-probability) token's direction, encouraging sustained exploration of multiple latent reasoning paths. Experiments on five reasoning benchmarks demonstrate that SeLaR consistently outperforms standard CoT and state-of-the-art training-free methods.
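
The entropy gate described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `threshold` value, the use of greedy argmax as the discrete branch, and the function names are assumptions made for illustration, and the entropy-aware contrastive regularization is left out.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def next_input_embedding(logits, embeddings, threshold):
    """Pick the next-step input embedding with an entropy gate.

    `threshold` is a hypothetical hyperparameter, not a value from the paper.
    Low entropy (high confidence): keep discrete decoding (greedy here).
    High entropy (low confidence): use the probability-weighted mixture
    of token embeddings, i.e. a soft embedding.
    """
    probs = softmax(logits)
    if entropy(probs) < threshold:
        top = max(range(len(probs)), key=probs.__getitem__)
        return "discrete", list(embeddings[top])
    dim = len(embeddings[0])
    soft = [sum(p * emb[d] for p, emb in zip(probs, embeddings)) for d in range(dim)]
    return "soft", soft

# Toy vocabulary of two tokens with 2-d embeddings.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0]]
confident = next_input_embedding([10.0, 0.0], toy_embeddings, threshold=0.5)
uncertain = next_input_embedding([0.0, 0.0], toy_embeddings, threshold=0.5)
```

In this toy setting the peaked distribution keeps the discrete token embedding, while the uniform distribution yields the averaged soft embedding.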

[53] Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces

Jiawei Chen,Ruoxi Xu,Boxi Cao,Ruotong Pan,Yunfei Zhang,Yifei Hu,Yong Du,Tingting Gao,Yaojie Lu,Yingfei Sun,Xianpei Han,Le Sun,Xiangyu Wu,Hongyu Lin

Main category: cs.CL

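TL;DR: This paper presents OmniBehavior, the first user simulation benchmark built entirely from real-world data, and shows that existing LLMs exhibit structural biases (such as a Utopian bias and persona homogenization) when simulating complex, long-horizon, cross-scenario human behavior, so they struggle to capture individual differences and long-tail behaviors.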

Details

Motivation: Existing user simulation benchmarks are limited to isolated scenarios, narrow action spaces, or synthetic data, and cannot reflect the holistic and complex nature of real human behavior. Method: The authors build OmniBehavior, a benchmark driven entirely by real-world data that covers long-horizon, cross-scenario, and heterogeneous behavioral patterns. Through empirical analysis and large-scale LLM evaluation, they compare simulated behavior with real behavior and identify structural biases. Result: Current LLM performance plateaus on long-horizon, cross-scenario behavior modeling. The study systematically reveals a 'Utopian bias': hyper-activity, persona homogenization, and convergence toward a positive average person, which erases individual differences and long-tail behaviors. Conclusion: Realistic user simulation requires moving past the structural limitations of current LLMs; future research should focus on high-fidelity modeling of individual differences and long-tail behaviors. Abstract: The emergence of Large Language Models (LLMs) has illuminated the potential for a general-purpose user simulator. However, existing benchmarks remain constrained to isolated scenarios, narrow action spaces, or synthetic data, failing to capture the holistic nature of authentic human behavior. To bridge this gap, we introduce OmniBehavior, the first user simulation benchmark constructed entirely from real-world data, integrating long-horizon, cross-scenario, and heterogeneous behavioral patterns into a unified framework. Based on this benchmark, we first provide empirical evidence that previous datasets with isolated scenarios suffer from tunnel vision, whereas real-world decision-making relies on long-term, cross-scenario causal chains. Extensive evaluations of state-of-the-art LLMs reveal that current models struggle to accurately simulate these complex behaviors, with performance plateauing even as context windows expand. Crucially, a systematic comparison between simulated and authentic behaviors uncovers a fundamental structural bias: LLMs tend to converge toward a positive average person, exhibiting hyper-activity, persona homogenization, and a Utopian bias. This results in the loss of individual differences and long-tail behaviors, highlighting critical directions for future high-fidelity simulation research.

[54] A GAN and LLM-Driven Data Augmentation Framework for Dynamic Linguistic Pattern Modeling in Chinese Sarcasm Detection

Wenxian Wang,Xiaohu Luo,Junfeng Hao,Xiaoming Gu,Xingshu Chen,Zhu Wang,Haizhou Wang

Main category: cs.CL

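TL;DR: This paper proposes a data augmentation framework that combines a GAN with a large language model (LLM) to model users' linguistic patterns and improve Chinese sarcasm detection, and builds the SinaSarc dataset, which includes user historical behavior; the proposed method surpasses existing SOTA approaches in F1-score.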

Details

Motivation: Existing Chinese sarcasm detection methods are constrained by small datasets and high construction costs, and most ignore how user-specific linguistic patterns shape sarcastic expression. Method: The authors propose a data augmentation framework driven jointly by a GAN and an LLM. They first collect raw multi-topic data from Sina Weibo, train a GAN on it, and apply GPT-3.5 based augmentation to build the SinaSarc dataset, which contains target comments, context, and user historical behavior. They then extend the BERT architecture to incorporate multi-dimensional information, including user historical behavior, to capture dynamic linguistic patterns. Result: The model reaches F1-scores of 0.9138 and 0.9151 on the non-sarcastic and sarcastic categories, respectively, both higher than all existing SOTA methods. Conclusion: The study offers a new paradigm for dynamically modeling users' long-term linguistic patterns in Chinese sarcasm detection and contributes to both dataset construction and methodology. Abstract: Sarcasm is a rhetorical device that expresses criticism or emphasizes characteristics of certain individuals or situations through exaggeration, irony, or comparison. Existing methods for Chinese sarcasm detection are constrained by limited datasets and high construction costs, and they mainly focus on textual features, overlooking user-specific linguistic patterns that shape how opinions and emotions are expressed. This paper proposes a Generative Adversarial Network (GAN) and Large Language Model (LLM)-driven data augmentation framework to dynamically model users' linguistic patterns for enhanced Chinese sarcasm detection. First, we collect raw data from various topics on Sina Weibo. Then, we train a GAN on these data and apply a GPT-3.5 based data augmentation technique to synthesize an extended sarcastic comment dataset, named SinaSarc. This dataset contains target comments, contextual information, and user historical behavior. Finally, we extend the BERT architecture to incorporate multi-dimensional information, particularly user historical behavior, enabling the model to capture dynamic linguistic patterns and uncover implicit sarcastic cues in comments. Experimental results demonstrate the effectiveness of our proposed method. Specifically, our model achieves the highest F1-scores on both the non-sarcastic and sarcastic categories, with values of 0.9138 and 0.9151 respectively, which outperforms all existing state-of-the-art (SOTA) approaches.
This study presents a novel framework for dynamically modeling users' long-term linguistic patterns in Chinese sarcasm detection, contributing to both dataset construction and methodological advancement in this field.

[55] Synthetic Data for any Differentiable Target

Tristan Thrush,Sung Min Park,Herman Brunborg,Luke Bailey,Marcel Roed,Neil Band,Christopher Potts,Tatsunori Hashimoto

Main category: cs.CL

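TL;DR: This paper proposes a reinforcement learning method called Dataset Policy Gradient (DPG), which uses higher-order gradients to precisely optimize a synthetic data generator, so that fine-tuning a language model on the synthetic data gives targeted control over properties such as model weights and output format.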

Details

Motivation: To explore the limits of controlling language models through synthetic training data and to address the difficulty of steering model behavior precisely with conventional methods. Method: The authors propose Dataset Policy Gradient (DPG), which computes exact data attribution via higher-order gradients and uses the resulting scores as policy gradient rewards to optimize the synthetic data generator; the generated data is then used for supervised fine-tuning (SFT) of the target model. Result: Five control objectives are achieved. The target model's LM head weights are made to (1) embed a QR code, (2) embed the pattern '67', and (3) have lower $\ell^2$ norm; the generator is made to (4) rephrase inputs in a new language and (5) produce a specific UUID, even though neither of the last two objectives is stated in the generator's prompts. Conclusion: DPG is a powerful and flexible technique that can shape a language model's internal parameters and output behavior using only synthetic training examples, revealing new potential for data-driven control. Abstract: What are the limits of controlling language models via synthetic training data? We develop a reinforcement learning (RL) primitive, the Dataset Policy Gradient (DPG), which can precisely optimize synthetic data generators to produce a dataset of targeted examples. When used for supervised fine-tuning (SFT) of a target model, these examples cause the target model to do well on a differentiable metric of our choice. Our approach achieves this by taking exact data attribution via higher-order gradients and using those scores as policy gradient rewards. We prove that this procedure closely approximates the true, intractable gradient for the synthetic data generator. To illustrate the potential of DPG, we show that, using only SFT on generated examples, we can cause the target model's LM head weights to (1) embed a QR code, (2) embed the pattern $\texttt{67}$, and (3) have lower $\ell^2$ norm. We additionally show that we can cause the generator to (4) rephrase inputs in a new language and (5) produce a specific UUID, even though neither of these objectives is conveyed in the generator's input prompts. These findings suggest that DPG is a powerful and flexible technique for shaping model properties using only synthetic training examples.

[56] AfriVoices-KE: A Multilingual Speech Dataset for Kenyan Languages

Lilian Wanzare,Cynthia Amol,Ezekiel Maina,Nelson Odhiambo,Hope Kerubo,Leila Misula,Vivian Oloo,Rennish Mboya,Edwin Onkoba,Edward Ombui,Joseph Muguro,Ciira wa Maina,Andrew Kipkebut,Alfred Omondi Otom,Ian Ndung'u Kang'ethe,Angela Wambui Kanyi,Brian Gichana Omwenga

Main category: cs.CL

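TL;DR: AfriVoices-KE is a large multilingual speech dataset covering five Kenyan languages, with about 3,000 hours of scripted and spontaneous speech, intended to ease the severe underrepresentation of African languages in speech technology.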

Details

Motivation: To address the long-standing underrepresentation of African languages in speech technology and the lack of high-quality data resources, to advance inclusive speech technology, and to support the digital preservation of Kenya's linguistic heritage. Method: Data collection followed a dual-track strategy: scripted speech was built from texts spanning 11 domains relevant to Kenya, and spontaneous speech was elicited with textual and image prompts. A customized mobile app enabled recording on smartphones, and multi-layer quality control combined automated signal-to-noise ratio checks with human content review. Result: The resulting AfriVoices-KE dataset contains about 3,000 hours of speech (750 hours scripted plus 2,250 hours spontaneous) from 4,777 native speakers across diverse regions and demographics, and supports the development of ASR and TTS systems. Conclusion: AfriVoices-KE provides a high-quality, diverse, and scalable data foundation for speech technology in low-resource African languages, and offers a reusable methodology for community-based data collection. Abstract: AfriVoices-KE is a large-scale multilingual speech dataset comprising approximately 3,000 hours of audio across five Kenyan languages: Dholuo, Kikuyu, Kalenjin, Maasai, and Somali. The dataset includes 750 hours of scripted speech and 2,250 hours of spontaneous speech, collected from 4,777 native speakers across diverse regions and demographics. This work addresses the critical underrepresentation of African languages in speech technology by providing a high-quality, linguistically diverse resource. Data collection followed a dual methodology: scripted recordings drew from compiled text corpora, translations, and domain-specific generated sentences spanning eleven domains relevant to the Kenyan context, while unscripted speech was elicited through textual and image prompts to capture natural linguistic variation and dialectal nuances. A customized mobile application enabled contributors to record using smartphones. Quality assurance operated at multiple layers, encompassing automated signal-to-noise ratio validation prior to recording and human review for content accuracy. Though the project encountered challenges common to low-resource settings, including unreliable infrastructure, device compatibility issues, and community trust barriers, these were mitigated through local mobilizers, stakeholder partnerships, and adaptive training protocols. AfriVoices-KE provides a foundational resource for developing inclusive automatic speech recognition and text-to-speech systems, while advancing the digital preservation of Kenya's linguistic heritage.
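
An automated signal-to-noise ratio check of the kind mentioned in the abstract might look like the sketch below. This is only an illustration under stated assumptions: the abstract does not say how the app estimates SNR or which threshold it uses, so the `min_snr_db` default, the separate ambient-noise sample, and the function names are all hypothetical.

```python
import math

def mean_power(samples):
    """Average power of a sequence of audio samples."""
    return sum(x * x for x in samples) / len(samples)

def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels: 10 * log10(P_signal / P_noise)."""
    p_noise = mean_power(noise)
    if p_noise == 0.0:
        return math.inf  # perfectly silent background
    return 10.0 * math.log10(mean_power(signal) / p_noise)

def passes_snr_check(signal, noise, min_snr_db=15.0):
    """Gate a recording on a minimum SNR.

    `min_snr_db` is a hypothetical threshold, not a value from the paper.
    `noise` is assumed to be a short ambient sample captured beforehand.
    """
    return snr_db(signal, noise) >= min_snr_db

# Toy example: speech-level samples at amplitude 1.0, background at 0.1.
speech = [1.0, -1.0, 1.0, -1.0]
background = [0.1, -0.1, 0.1, -0.1]
toy_snr = snr_db(speech, background)  # power ratio of about 100, i.e. about 20 dB
```

A tenfold amplitude ratio corresponds to a hundredfold power ratio, which is why the toy example lands near 20 dB.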

[57] AI generates well-liked but templatic empathic responses

Emma Gueorguieva,Hongli Zhan,Jina Suh,Javier Hernandez,Tatiana Lau,Junyi Jessy Li,Desmond C. Ong

Main category: cs.CL

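TL;DR: This paper finds that large language models (LLMs) come across as more empathic when giving emotional support because they consistently follow a structured template of empathic language. The authors build a taxonomy of 10 empathic language tactics and, in two studies (4,555 responses in total), find that LLM responses are highly formulaic (83–90% match the template), whereas human responses are more diverse.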

Details

Motivation: To explain why users rate LLM-generated emotional support responses as more empathic than human-written ones. Method: The authors build a taxonomy of 10 empathic language tactics (such as validating feelings and paraphrasing), then code and template-match 3,265 empathic responses generated by six LLMs and 1,290 written by humans. Result: They find a high-coverage template of empathic expression: it matches 83–90% of LLM responses (60–83% in a held-out sample) and, where it matches, covers 81–92% of the response content; human responses are markedly more diverse. Conclusion: The empathic advantage of LLMs comes from consistently reusing a structured language template rather than from genuine understanding; this finding has important implications for the design, evaluation, and ethical use of AI empathy. Abstract: Recent research shows that greater numbers of people are turning to Large Language Models (LLMs) for emotional support, and that people rate LLM responses as more empathic than human-written responses. We suggest a reason for this success: LLMs have learned and consistently deploy a well-liked template for expressing empathy. We develop a taxonomy of 10 empathic language "tactics" that include validating someone's feelings and paraphrasing, and apply this taxonomy to characterize the language that people and LLMs produce when writing empathic responses. Across a set of 2 studies comparing a total of n = 3,265 AI-generated (by six models) and n = 1,290 human-written responses, we find that LLM responses are highly formulaic at a discourse functional level. We discovered a template -- a structured sequence of tactics -- that matches between 83--90% of LLM responses (and 60--83% in a held out sample), and when those are matched, covers 81--92% of the response. By contrast, human-written responses are more diverse. We end with a discussion of implications for the future of AI-generated empathy.

[58] What do Language Models Learn and When? The Implicit Curriculum Hypothesis

Emmy Liu,Kaiser Sun,Millicent Li,Isabelle Lee,Lindia Tjuatja,Jen-tse Huang,Graham Neubig

Main category: cs.CL

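TL;DR: This paper proposes the Implicit Curriculum Hypothesis, which holds that LLM pretraining follows a predictable, compositional order of skill acquisition. The hypothesis is tested with a suite of simple, composable tasks: skill emergence orderings are highly consistent across models, and task representations can effectively predict the training trajectories of new tasks.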

Details

Motivation: Existing work describes overall model improvement only through scaling laws on validation loss, and lacks a fine-grained understanding of which capabilities emerge during pretraining and in what order. Method: The authors propose the Implicit Curriculum Hypothesis; design a suite of simple, composable tasks covering retrieval, morphological transformations, coreference, logical reasoning, and mathematics; track the point at which each task reaches a fixed accuracy threshold across four model families (410M–13B parameters); analyze the relationship between the similarity of task representation vectors and training trajectories; and use the representation space to predict the training trajectories of unseen compositional tasks. Result: Skill emergence orderings are highly consistent across models (ρ = .81); composite tasks usually emerge after their component tasks; similarity between task function vector representations correlates positively with similarity of training trajectories; and the training trajectories of new tasks can be predicted well from task representations (R² = .68–.84). Conclusion: LLM pretraining is intrinsically structured: skills emerge in a predictable, compositional way, and this structure is consistent across models and readable from their internal representations. Abstract: Large language models (LLMs) can perform remarkably complex tasks, yet the fine-grained details of how these capabilities emerge during pretraining remain poorly understood. Scaling laws on validation loss tell us how much a model improves with additional compute, but not what skills it acquires in which order. To remedy this, we propose the Implicit Curriculum Hypothesis: pretraining follows a compositional and predictable curriculum across models and data mixtures. We test this by designing a suite of simple, composable tasks spanning retrieval, morphological transformations, coreference, logical reasoning, and mathematics. Using these tasks, we track emergence points across four model families spanning sizes from 410M-13B parameters. We find that emergence orderings of when models reach fixed accuracy thresholds are strikingly consistent ($\rho = .81$ across 45 model pairs), and that composite tasks most often emerge after their component tasks. Furthermore, we find that this structure is encoded in model representations: tasks with similar function vector representations also tend to follow similar trajectories in training. By using the space of representations derived from our task set, we can effectively predict the training trajectories of simple held-out compositional tasks throughout the course of pretraining ($R^2 = .68$-$.84$ across models) without previously evaluating them. Together, these results suggest that pretraining is more structured than loss curves reveal: skills emerge in a compositional order that is consistent across models and readable from their internals.
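
The consistency statistic quoted above (a rank correlation between emergence orderings of two models) can be sketched as follows. This is a minimal illustration that assumes no tied emergence points and uses the textbook Spearman formula; the task names and step counts are invented, and the paper's exact procedure may differ.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n, assuming no ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman_rho(xs, ys):
    """Spearman rank correlation for tie-free data:
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two equally long sequences with n >= 2")
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Hypothetical emergence points: the training step at which each task
# first crosses a fixed accuracy threshold, for two different models.
tasks = ["copy", "plural", "coreference", "copy_then_plural"]
model_a_steps = [1000, 2000, 3000, 4000]
model_b_steps = [1500, 3500, 2500, 5000]
rho = spearman_rho(model_a_steps, model_b_steps)
```

Here the two models disagree only on the order of the middle two tasks, so the correlation stays high at 0.8.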

[59] Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts

Jiayuan Ye,Vitaly Feldman,Kunal Talwar

Main category: cs.CL

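TL;DR: This paper formalizes fact memorization in large language models from an information-theoretic perspective and shows that fact accuracy drops when the amount of fact information in the training data exceeds model capacity or when the fact frequency distribution is skewed. It proposes data selection methods based on training loss alone that limit the number of facts and flatten their frequency distribution, substantially improving fact memorization.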

Details

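Motivation: Large language models have limited ability to memorize factual knowledge in their parameters, which leads to hallucinations and poor performance on knowledge-intensive tasks; an information-theoretic account is needed to understand and improve fact memorization. Method: The authors formalize fact memorization from an information-theoretic perspective, analyze how the training data distribution affects fact accuracy, and propose data selection schemes that rely only on the training loss to limit the number of facts and balance their frequency distribution. Result: On semi-synthetic datasets of high-entropy facts, the method raises fact accuracy to the capacity limit. When pretraining on a Wikipedia corpus, a GPT2-Small model memorizes 1.3X more entity facts, matching a 10X larger model (1.3B parameters) trained on the full dataset. Conclusion: The amount and distribution of fact information in the training data strongly affect fact memorization; well-chosen data selection can greatly improve the memorization efficiency of small models and bring them close to the performance of larger ones. Abstract: Large language models (LLMs) can struggle to memorize factual knowledge in their parameters, often leading to hallucinations and poor performance on knowledge-intensive tasks. In this paper, we formalize fact memorization from an information-theoretic perspective and study how training data distributions affect fact accuracy. We show that fact accuracy is suboptimal (below the capacity limit) whenever the amount of information contained in the training data facts exceeds model capacity. This is further exacerbated when the fact frequency distribution is skewed (e.g. a power law). We propose data selection schemes based on the training loss alone that aim to limit the number of facts in the training data and flatten their frequency distribution. On semi-synthetic datasets containing high-entropy facts, our selection method effectively boosts fact accuracy to the capacity limit. When pretraining language models from scratch on an annotated Wikipedia corpus, our selection method enables a GPT2-Small model (110M parameters) to memorize 1.3X more entity facts compared to standard training, matching the performance of a 10X larger model (1.3B parameters) pretrained on the full dataset.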

[60] ClawBench: Can AI Agents Complete Everyday Online Tasks?

Yuxuan Zhang,Yubo Wang,Yipeng Zhu,Penghui Du,Junwen Miao,Xuan Lu,Wendong Xu,Yunzhuo Hao,Songcheng Cai,Xiaochen Wang,Huaisong Zhang,Xian Wu,Yi Lu,Minyi Lei,Kai Zou,Huifeng Yin,Ping Nie,Liang Chen,Dongfu Jiang,Wenhu Chen,Kelsey R. Allen

Main category: cs.CL

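TL;DR: This paper introduces ClawBench, an evaluation framework for AI agents on real online platforms. It contains 153 everyday tasks across 144 live websites and 15 categories of life and work scenarios, and stresses demanding capabilities such as information extraction, multi-step navigation across platforms, and extensive form filling. Experiments show that current frontier models still have low completion rates (for example, Claude Sonnet 4.6 reaches only 33.3%), underscoring the importance of evaluating real-world web interaction.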

Details

Motivation: Existing AI agent benchmarks mostly evaluate in offline sandboxes or on static pages and do not reflect the challenges of real, dynamic, complex online interaction; a new benchmark is needed to assess whether agents can complete everyday online tasks in real life and work scenarios. Method: The authors build ClawBench, which contains 153 tasks from 144 live online platforms across 15 categories of everyday activity (such as shopping, booking appointments, and applying for jobs); design a lightweight interception layer so that tasks run safely on production sites without real-world side effects; and evaluate the task completion rates of 7 frontier models (both proprietary and open-source). Result: All evaluated models complete only a small portion of ClawBench tasks (for example, Claude Sonnet 4.6 achieves only 33.3%), indicating that current AI agents cannot yet handle complex, dynamic, real-world web interaction. Conclusion: ClawBench fills a gap in the evaluation of real-world online tasks and provides a more challenging and practical standard for developing AI agents into reliable general-purpose assistants; progress on this benchmark is an important step toward practical use. Abstract: AI agents may be able to automate your inbox, but can they automate other routine aspects of your life? Everyday online tasks offer a realistic yet unsolved testbed for evaluating the next generation of AI agents. To this end, we introduce ClawBench, an evaluation framework of 153 simple tasks that people need to accomplish regularly in their lives and work, spanning 144 live platforms across 15 categories, from completing purchases and booking appointments to submitting job applications. These tasks require demanding capabilities beyond existing benchmarks, such as obtaining relevant information from user-provided documents, navigating multi-step workflows across diverse platforms, and write-heavy operations like filling in many detailed forms correctly. Unlike existing benchmarks that evaluate agents in offline sandboxes with static pages, ClawBench operates on production websites, preserving the full complexity, dynamic nature, and challenges of real-world web interaction. A lightweight interception layer captures and blocks only the final submission request, ensuring safe evaluation without real-world side effects. Our evaluations of 7 frontier models show that both proprietary and open-source models can complete only a small portion of these tasks. For example, Claude Sonnet 4.6 achieves only 33.3%. Progress on ClawBench brings us closer to AI agents that can function as reliable general-purpose assistants.
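
The idea of an interception layer that captures and blocks only the final submission request can be sketched in a few lines. Everything here is hypothetical: the `Request` and `InterceptionLayer` classes, the matching rule (HTTP method plus URL prefix), and the example URLs are assumptions for illustration, not ClawBench's actual mechanism.

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    """A simplified outgoing HTTP request (hypothetical structure)."""
    method: str
    url: str
    body: dict = field(default_factory=dict)

@dataclass
class InterceptionLayer:
    """Forwards every request except the one that would finalize the task."""
    final_method: str
    final_url_prefix: str
    captured: list = field(default_factory=list)

    def handle(self, request):
        is_final = (
            request.method.upper() == self.final_method.upper()
            and request.url.startswith(self.final_url_prefix)
        )
        if is_final:
            # Keep the payload for grading, but never send it to the live site.
            self.captured.append(request)
            return "blocked"
        return "forwarded"

# Browsing and form filling pass through; only the final submit is stopped.
layer = InterceptionLayer("POST", "https://shop.example/checkout/submit")
browse = layer.handle(Request("GET", "https://shop.example/cart"))
submit = layer.handle(
    Request("POST", "https://shop.example/checkout/submit", {"item": "book"})
)
```

Capturing the blocked payload is what would allow an evaluator to check the agent's final form contents without any real purchase or application being made.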

[61] Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models

Feng Luo,Yu-Neng Chuang,Guanchu Wang,Zicheng Xu,Xiaotian Han,Tianyi Zhang,Vladimir Braverman

Main category: cs.CL

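TL;DR: This paper proposes the StableOPD framework, which addresses truncation collapse and training instability in on-policy distillation caused by student models generating overly long, repetitive trajectories; it combines a reference-based divergence constraint with rollout mixture distillation to improve math reasoning performance.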

Details

Motivation: During training, on-policy distillation (OPD) can suffer from abrupt trajectory length inflation, truncation collapse, and repetition saturation, which bias the gradients and cause sharp performance drops; the root cause is an adverse interaction between student-induced data collection and the distillation objective. Method: The authors propose StableOPD, which combines a reference-based divergence constraint with a rollout mixture distillation strategy to suppress repetition-induced length inflation and stabilize training. Result: Across multiple math reasoning datasets, StableOPD prevents truncation collapse, stabilizes training dynamics, and improves performance by 7.2% on average. Conclusion: By explicitly controlling the length and diversity of student rollouts, StableOPD mitigates the inherent instability of OPD and offers a more robust paradigm for on-policy model distillation. Abstract: On-policy distillation (OPD) trains student models under their own induced distribution while leveraging supervision from stronger teachers. We identify a failure mode of OPD: as training progresses, on-policy rollouts can undergo abrupt length inflation, causing truncated trajectories to dominate the training data. This truncation collapse coincides with abrupt repetition saturation and induces biased gradient signals, leading to severe training instability and sharp degradation in validation performance. We attribute this problem to the interaction between student-induced data collection and the distillation objective, which implicitly favors long and repetitive rollouts. To address this issue, we propose StableOPD, a stabilized OPD framework that combines a reference-based divergence constraint with rollout mixture distillation. These together mitigate repetition-induced length inflation and further stabilize OPD training. Across multiple math reasoning datasets, our approach prevents truncation collapse, stabilizes training dynamics, and improves performance by 7.2% on average.

cs.CV [Back]

[62] FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios

Xiangru Jian,Hao Xu,Wei Pang,Xinjian Zhao,Chengyu Tao,Qixin Zhang,Xikun Zhang,Chao Zhang,Guanzhi Deng,Alex Xue,Juan Du,Tianshu Yu,Garth Tarr,Linqi Song,Qiuzhuang Sun,Dacheng Tao

Main category: cs.CV

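TL;DR: This paper introduces FORGE, an evaluation framework for multimodal large language models in manufacturing. By building a high-quality multimodal dataset (2D images, 3D point clouds, and fine-grained semantic annotations), it shows that the performance bottleneck of current MLLMs on manufacturing tasks is insufficient domain knowledge rather than visual grounding, and it confirms that fine-tuning a small model on the dataset is effective.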

Details

Motivation: Existing evaluation methods do not reflect the rigorous demands of real manufacturing environments and are limited by data scarcity and a lack of fine-grained domain semantic annotations. Method: The authors build a high-quality multimodal dataset that combines real 2D images and 3D point clouds with fine-grained domain semantics (such as exact model numbers); systematically evaluate 18 state-of-the-art MLLMs on three manufacturing tasks; perform a bottleneck analysis; and run supervised fine-tuning experiments to test the dataset's value for training. Result: Insufficient domain knowledge, not visual grounding, is the main bottleneck; supervised fine-tuning of a 3B-parameter model yields up to 90.8% relative improvement in accuracy on held-out manufacturing scenarios. Conclusion: FORGE gives manufacturing MLLMs a more realistic evaluation benchmark and a practical training resource, and indicates that future research should focus on strengthening domain knowledge modeling. Abstract: The manufacturing sector is increasingly adopting Multimodal Large Language Models (MLLMs) to transition from simple perception to autonomous execution, yet current evaluations fail to reflect the rigorous demands of real-world manufacturing environments. Progress is hindered by data scarcity and a lack of fine-grained domain semantics in existing datasets. To bridge this gap, we introduce FORGE. We first construct a high-quality multimodal dataset that combines real-world 2D images and 3D point clouds, annotated with fine-grained domain semantics (e.g., exact model numbers). We then evaluate 18 state-of-the-art MLLMs across three manufacturing tasks, namely workpiece verification, structural surface inspection, and assembly verification, revealing significant performance gaps. Counter to conventional understanding, the bottleneck analysis shows that visual grounding is not the primary limiting factor. Instead, insufficient domain-specific knowledge is the key bottleneck, setting a clear direction for future research. Beyond evaluation, we show that our structured annotations can serve as an actionable training resource: supervised fine-tuning of a compact 3B-parameter model on our data yields up to 90.8% relative improvement in accuracy on held-out manufacturing scenarios, providing preliminary evidence for a practical pathway toward domain-adapted manufacturing MLLMs. The code and datasets are available at https://ai4manufacturing.github.io/forge-web.

[63] Personalizing Text-to-Image Generation to Individual Taste

Anne-Sofie Maerten,Juliane Verwiebe,Shyamgopal Karthik,Ameya Prabhu,Johan Wagemans,Matthias Bethge

Main category: cs.CV

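TL;DR: This paper presents the PAMELA framework and a new dataset for modeling personalized image aesthetic evaluation; it substantially improves the accuracy of predicting individual preferences and supports personalized image generation through prompt optimization.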

Details

Motivation: Existing text-to-image models and reward models optimize only for "average" human preference and cannot model the subjectivity of aesthetic judgment. Method: The authors build a personalized evaluation dataset (PAMELA) with 70,000 ratings across 5,000 AI-generated images, train a personalized reward model jointly on these annotations and existing aesthetic assessment data, and combine it with prompt optimization for preference-guided generation. Result: The personalized reward model predicts individual liking more accurately than most existing methods predict population-level preferences; the experiments also show that prompt optimization can improve how well generated images match user preferences. Conclusion: High-quality personalized data and modeling are essential for handling subjective preferences in T2I; the authors release the dataset and model to support research on personalized alignment and subjective visual quality assessment. Abstract: Modern text-to-image (T2I) models generate high-fidelity visuals but remain indifferent to individual user preferences. While existing reward models optimize for "average" human appeal, they fail to capture the inherent subjectivity of aesthetic judgment. In this work, we introduce a novel dataset and predictive framework, called PAMELA, designed to model personalized image evaluations. Our dataset comprises 70,000 ratings across 5,000 diverse images generated by state-of-the-art models (Flux 2 and Nano Banana). Each image is evaluated by 15 unique users, providing a rich distribution of subjective preferences across domains such as art, design, fashion, and cinematic photography. Leveraging this data, we propose a personalized reward model trained jointly on our high-quality annotations and existing aesthetic assessment subsets. We demonstrate that our model predicts individual liking with higher accuracy than the majority of current state-of-the-art methods predict population-level preferences. Using our personalized predictor, we demonstrate how simple prompt optimization methods can be used to steer generations towards individual user preferences. Our results highlight the importance of data quality and personalization to handle the subjectivity of user preferences. We release our dataset and model to facilitate standardized research in personalized T2I alignment and subjective visual quality assessment.

[64] GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents

Mingyu Ouyang,Siyuan Hu,Kevin Qinghong Lin,Hwee Tou Ng,Mike Zheng Shou

Main category: cs.CV

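TL;DR: This paper introduces the GameWorld benchmark for standardized and verifiable evaluation of multimodal large language models (MLLMs) as generalist game agents in browser environments. It covers 34 diverse games and 170 tasks, and shows that current models remain far from human-level performance.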

Details

Motivation: Existing MLLM agents face challenges such as high latency, sparse feedback, and irreversible mistakes in real-world interaction; video games are an ideal testbed, but a unified, verifiable evaluation standard is missing. Method: The authors build the GameWorld benchmark, define two agent interfaces (computer-use and semantic-action), design state-verifiable metrics, and run systematic experiments on 18 model-interface pairs. Result: Even the best agent is far from human game-playing ability; repeated benchmark reruns confirm the benchmark's robustness; and analyses of real-time interaction, context-memory sensitivity, and action validity expose further challenges. Conclusion: GameWorld provides a standardized, verifiable, and reproducible evaluation framework for research on multimodal game agents and lays a solid foundation for future work in the area. Abstract: Towards an embodied generalist for real-world interaction, Multimodal Large Language Model (MLLM) agents still suffer from challenging latency, sparse feedback, and irreversible mistakes. Video games offer an ideal testbed with rich visual observations and closed-loop interaction, demanding fine-grained perception, long-horizon planning, and precise control. However, systematically evaluating these capabilities is currently hindered by heterogeneous action interfaces and heuristic verification. To this end, we introduce GameWorld, a benchmark designed for standardized and verifiable evaluation of MLLMs as generalist game agents in browser environments. Two game agent interfaces are studied: (i) computer-use agents that directly emit keyboard and mouse controls, and (ii) generalist multimodal agents that act in a semantic action space via deterministic Semantic Action Parsing. GameWorld contains 34 diverse games and 170 tasks, each paired with state-verifiable metrics for outcome-based evaluation. The results across 18 model-interface pairs suggest that even the best performing agent is far from achieving human capabilities on video games. Extensive experiments of repeated full-benchmark reruns demonstrate the robustness of the benchmark, while further studies on real-time interaction, context-memory sensitivity, and action validity expose more challenges ahead for game agents. Together, by offering a standardized, verifiable, and reproducible evaluation framework, GameWorld lays a robust foundation for advancing research on multimodal game agents and beyond. The project page is at https://gameworld-bench.github.io.

[65] HY-Embodied-0.5: Embodied Foundation Models for Real-World Agents

Tencent Robotics X,HY Vision Team,Xumin Yu,Zuyan Liu,Ziyi Wang,He Zhang,Yongming Rao,Fangfu Liu,Yani Zhang,Ruowen Zhao,Oran Wang,Yves Liang,Haitao Lin,Minghui Wang,Yubo Dong,Kevin Cheng,Bolin Ni,Rui Huang,Han Hu,Zhengyou Zhang,Linus,Shunyu Yao

Main category: cs.CV

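TL;DR: This paper introduces the HY-Embodied-0.5 family of foundation models, designed for real-world embodied agents. Using a MoT architecture and an iterative, self-evolving post-training paradigm, the models improve spatial and temporal visual perception and embodied reasoning, and perform strongly on a range of benchmarks and real robot control tasks.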

Details

Motivation: To bridge the gap between general Vision-Language Models (VLMs) and the needs of real embodied agents by strengthening the core capabilities they require: spatial and temporal visual perception, and embodied reasoning for prediction, interaction, and planning. Method: The authors propose the HY-Embodied-0.5 model family with two variants, 2B (edge deployment) and 32B (complex reasoning); adopt a Mixture-of-Transformers (MoT) architecture with modality-specific computation and introduce latent tokens to enhance perceptual representations; design an iterative, self-evolving post-training paradigm to improve reasoning; and use on-policy distillation to transfer the large model's capabilities to the small one. Result: Effectiveness is validated on 22 benchmarks covering visual perception, spatial reasoning, and embodied understanding: MoT-2B outperforms similarly sized SOTA models on 16 benchmarks, the 32B variant is comparable to Gemini 3.0 Pro, and a downstream VLA model performs well in real-world robot control experiments. Conclusion: The HY-Embodied-0.5 models substantially improve the multi-dimensional capabilities needed for embodied intelligence, combining efficiency with strong reasoning; the code and models are open-sourced to support the community. Abstract: We introduce HY-Embodied-0.5, a family of foundation models specifically designed for real-world embodied agents. To bridge the gap between general Vision-Language Models (VLMs) and the demands of embodied agents, our models are developed to enhance the core capabilities required by embodied intelligence: spatial and temporal visual perception, alongside advanced embodied reasoning for prediction, interaction, and planning. The HY-Embodied-0.5 suite comprises two primary variants: an efficient model with 2B activated parameters designed for edge deployment, and a powerful model with 32B activated parameters targeted for complex reasoning. To support the fine-grained visual perception essential for embodied tasks, we adopt a Mixture-of-Transformers (MoT) architecture to enable modality-specific computing. By incorporating latent tokens, this design effectively enhances the perceptual representation of the models. To improve reasoning capabilities, we introduce an iterative, self-evolving post-training paradigm. Furthermore, we employ on-policy distillation to transfer the advanced capabilities of the large model to the smaller variant, thereby maximizing the performance potential of the compact model. Extensive evaluations across 22 benchmarks, spanning visual perception, spatial reasoning, and embodied understanding, demonstrate the effectiveness of our approach.
Our MoT-2B model outperforms similarly sized state-of-the-art models on 16 benchmarks, while the 32B variant achieves performance comparable to frontier models such as Gemini 3.0 Pro. In downstream robot control experiments, we leverage our robust VLM foundation to train an effective Vision-Language-Action (VLA) model, achieving compelling results in real-world physical evaluations. Code and models are open-sourced at https://github.com/Tencent-Hunyuan/HY-Embodied.

[66] SMFD-UNet: Semantic Face Mask Is The Only Thing You Need To Deblur Faces

Abduz Zami

Main category: cs.CV

TL;DR: 本文提出了一种名为SMFD-UNet的轻量级人脸图像去模糊框架，利用语义面部掩码引导去模糊过程，无需高质量参考图像；该方法通过双阶段UNet结构分别生成面部部件掩码并融合特征进行重建，在CelebA数据集上优于现有方法，具备高PSNR/SSIM及良好自然度指标。

Details

Motivation: Traditional deblurring methods rely on general image priors and struggle to model the structure and identity features specific to faces; most also require high-quality reference images, which limits practical use. Method: The authors propose SMFD-UNet. In the first stage, a UNet generates fine-grained semantic facial masks (such as eyes, nose, and mouth) from the blurry input; in the second stage, the masks and the blurry input are integrated in a lightweight UNet through multi-stage feature fusion (with RDC blocks, CBAM attention, efficient upsampling, and post-processing) to reconstruct the image. A randomized blurring pipeline covering about 1.74 trillion degradation scenarios is built to improve robustness. Result: On the CelebA dataset, SMFD-UNet surpasses state-of-the-art methods in PSNR and SSIM while performing well on naturalness metrics such as NIQE, LPIPS, and FID; the model is lightweight, efficient, and scalable. Conclusion: By using semantic facial priors to drive deblurring, SMFD-UNet improves the quality and realism of face image restoration and offers a flexible, efficient approach to practical face image enhancement. Abstract: For applications including facial identification, forensic analysis, photographic improvement, and medical imaging diagnostics, facial image deblurring is an essential computer vision task that restores high-quality images from blurry inputs. Traditional deblurring techniques, often based on general image priors, find it difficult to capture the particular structural and identity-specific features of human faces. To address these difficulties, we present SMFD-UNet (Semantic Mask Fusion Deblurring UNet), a new lightweight framework that uses semantic face masks to drive the deblurring process, removing the need for high-quality reference photos. First, our dual-step method uses a UNet-based semantic mask generator to extract detailed facial component masks (e.g., eyes, nose, mouth) directly from blurry photos. Sharp, high-fidelity facial images are subsequently produced by integrating these masks with the blurry input using a multi-stage feature fusion technique within a computationally efficient UNet framework. To ensure robustness, we created a randomized blurring pipeline that roughly replicates real-world situations by simulating around 1.74 trillion degradation scenarios. Evaluated on the CelebA dataset, SMFD-UNet shows better performance than state-of-the-art models, attaining higher Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) while preserving satisfactory naturalness measures, including NIQE, LPIPS, and FID.
SMFD-UNet is powered by Residual Dense Convolution Blocks (RDC), a multi-stage feature fusion strategy, efficient upsampling techniques, attention techniques like CBAM, and post-processing techniques. Its lightweight design ensures scalability and efficiency, making SMFD-UNet a flexible solution for advancing facial image restoration research and practical applications.
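
For readers unfamiliar with the headline metric, the sketch below computes PSNR using its standard definition, 10 * log10(MAX^2 / MSE). It is a generic illustration, not the paper's evaluation code; treating an image as a flat list of 8-bit pixel values (so `max_value=255.0`) is an assumption made to keep the example dependency-free.

```python
import math

def psnr(reference, restored, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB between two equally sized images,
    each given as a flat list of pixel intensities."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Toy 2x2 "images": a sharp reference and two candidate restorations.
sharp = [10.0, 200.0, 30.0, 120.0]
close = [11.0, 199.0, 31.0, 119.0]   # every pixel off by 1
rough = [20.0, 190.0, 40.0, 110.0]   # every pixel off by 10
psnr_close = psnr(sharp, close)
psnr_rough = psnr(sharp, rough)
```

Higher PSNR means the restored image is closer to the reference; the tenfold larger pixel error in `rough` costs exactly 20 dB relative to `close`.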

[67] Training-free Spatially Grounded Geometric Shape Encoding (Technical Report)

Yuhang He

Main category: cs.CV

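TL;DR: This paper proposes XShapeEnc, a training-free, general-purpose encoding method for 2D geometric shapes that uses orthogonal Zernike bases and a frequency-propagation operation to produce compact encodings of spatial shapes that are invertible, adaptive, and frequency-rich, suitable for a variety of shape-aware tasks.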

Details

Motivation: Extending positional encoding from 1D sequences to 2D spatial geometric shapes raises challenges in modeling geometry and pose and in staying compatible with neural networks, so a general, efficient, learning-compatible encoding strategy is needed. Method: XShapeEnc decomposes a 2D shape into its normalized geometry within the unit disk and a pose vector; the pose is mapped to a harmonic pose field inside the unit disk; orthogonal Zernike bases encode geometry and pose independently or jointly, and frequency propagation adds high-frequency content. Result: XShapeEnc is validated for theoretical validity, computational efficiency, discriminability, and applicability across tasks (e.g., shape matching, reconstruction, classification), and performs well on the self-curated XShapeCorpus. Conclusion: XShapeEnc is a training-free, general-purpose 2D shape encoding tool with invertibility, adaptivity, and frequency richness, and may advance research on 2D spatial intelligence. Abstract: Positional encoding has become the de facto standard for grounding deep neural networks on discrete point-wise positions, and it has achieved remarkable success in tasks where the input can be represented as a one-dimensional sequence. However, extending this concept to 2D spatial geometric shapes demands carefully designed encoding strategies that account not only for shape geometry and pose, but also for compatibility with neural network learning. In this work, we address these challenges by introducing a training-free, general-purpose encoding strategy, dubbed XShapeEnc, that encodes an arbitrary spatially grounded 2D geometric shape into a compact representation exhibiting five favorable properties, including invertibility, adaptivity, and frequency richness. Specifically, a 2D spatially grounded geometric shape is decomposed into its normalized geometry within the unit disk and its pose vector, where the pose is further transformed into a harmonic pose field that also lies within the unit disk. A set of orthogonal Zernike bases is constructed to encode shape geometry and pose either independently or jointly, followed by a frequency-propagation operation to introduce high-frequency content into the encoding. We demonstrate the theoretical validity, efficiency, discriminability, and applicability of XShapeEnc via extensive analysis and experiments across a wide range of shape-aware tasks and our self-curated XShapeCorpus. We envision XShapeEnc as a foundational tool for research that goes beyond one-dimensional sequential data toward frontier 2D spatial intelligence.
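Illustrative sketch (not from the paper): the abstract relies on orthogonal Zernike bases on the unit disk. The snippet below evaluates the standard textbook Zernike polynomial for indices (n, m). How XShapeEnc actually projects geometry and pose onto these bases, and its frequency-propagation step, are not described in the abstract and are not modeled here; the function names are hypothetical.

```python
from math import cos, factorial, sin


def zernike_radial(n: int, m: int, rho: float) -> float:
    """Standard Zernike radial polynomial R_n^|m|(rho) for 0 <= rho <= 1."""
    m = abs(m)
    # The radial part is zero unless n - m is even and m <= n.
    if m > n or (n - m) % 2 != 0:
        return 0.0
    total = 0.0
    for k in range((n - m) // 2 + 1):
        numerator = (-1) ** k * factorial(n - k)
        denominator = (
            factorial(k)
            * factorial((n + m) // 2 - k)
            * factorial((n - m) // 2 - k)
        )
        total += numerator / denominator * rho ** (n - 2 * k)
    return total


def zernike(n: int, m: int, rho: float, theta: float) -> float:
    """Real-valued Zernike basis: cosine branch for m >= 0, sine branch for m < 0."""
    radial = zernike_radial(n, m, rho)
    return radial * (cos(m * theta) if m >= 0 else sin(-m * theta))
```

For instance, `zernike_radial(2, 0, rho)` is the defocus term 2*rho^2 - 1, and every valid radial polynomial equals 1 at the disk boundary.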

[68] On the Uphill Battle of Image frequency Analysis

Nader Bazyari,Hedieh Sajedi

Main category: cs.CV

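TL;DR: This paper formulates a special case of the Inverse Square Mean Shift algorithm for non-homogeneous data and uses the three-dimensional fast Fourier transform to analyze images for hidden patterns.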

Details

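Motivation: To handle non-homogeneous data and uncover hidden patterns in images, the Inverse Square Mean Shift algorithm is extended. Method: Formulates a special case of the Inverse Square Mean Shift algorithm for non-homogeneous data and applies the three-dimensional fast Fourier transform to image analysis. Result: Produces an algorithm variant for non-homogeneous data and explores the potential of the 3D FFT for finding hidden patterns in images. Conclusion: The work broadens the scope of the Inverse Square Mean Shift algorithm and gives a preliminary indication that the 3D FFT is useful for image pattern mining. Abstract: This work is a follow up on the newly proposed clustering algorithm called The Inverse Square Mean Shift Algorithm. In this paper a special case of algorithm for dealing with non-homogenous data is formulated and the three dimensional Fast Fourier Transform of images is investigated with the aim of finding hidden patterns.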

[69] Mathematical Analysis of Image Matching Techniques

Oleh Samoilenko

Main category: cs.CV

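TL;DR: This paper analyzes and experimentally evaluates two classical local feature matching algorithms, SIFT and ORB, on satellite imagery, using a common pipeline (keypoint detection, descriptor extraction, matching, and RANSAC geometric verification) and comparing them by Inlier Ratio.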

Details

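Motivation: Image matching is a fundamental problem in computer vision with important applications in remote sensing, robotics, and geospatial data analysis; however, the behavior of existing methods on satellite imagery has not been systematically evaluated. Method: Uses a common pipeline of keypoint detection, descriptor extraction, descriptor matching, and geometric verification via RANSAC with homography estimation; uses a manually constructed dataset of GPS-annotated satellite image tiles and analyzes how the number of keypoints affects the Inlier Ratio. Result: Provides a quantitative comparison of SIFT and ORB matching performance on satellite imagery, in particular the trend of the Inlier Ratio as the number of keypoints varies. Conclusion: SIFT and ORB each have strengths and weaknesses for satellite image matching; their performance depends strongly on the number of keypoints, so accuracy and efficiency must be balanced for the scenario at hand. Abstract: Image matching is a fundamental problem in Computer Vision with direct applications in robotics, remote sensing, and geospatial data analysis. We present an analytical and experimental evaluation of classical local feature-based image matching algorithms on satellite imagery, focusing on the Scale-Invariant Feature Transform (SIFT) and the Oriented FAST and Rotated BRIEF (ORB). Each method is evaluated through a common pipeline: keypoint detection, descriptor extraction, descriptor matching, and geometric verification via RANSAC with homography estimation. Matching quality is assessed using the Inlier Ratio - the fraction of correspondences consistent with the estimated homography. The study uses a manually constructed dataset of GPS-annotated satellite image tiles with intentional overlaps. We examine the impact of the number of extracted keypoints on the resulting Inlier Ratio.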
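Illustrative sketch (not from the paper): the Inlier Ratio described above can be computed once a homography has been estimated. The snippet assumes the 3x3 homography `H` and the matched point pairs are already available (for example from a RANSAC step), and it assumes a reprojection threshold of 3 pixels, a value the abstract does not specify. Function names are hypothetical.

```python
def project(H, point):
    """Apply a 3x3 homography (nested lists) to a 2D point."""
    x, y = point
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    px = (H[0][0] * x + H[0][1] * y + H[0][2]) / w
    py = (H[1][0] * x + H[1][1] * y + H[1][2]) / w
    return px, py


def inlier_ratio(H, matches, threshold=3.0):
    """Fraction of (src, dst) correspondences whose reprojection error is <= threshold."""
    if not matches:
        return 0.0
    inliers = 0
    for src, dst in matches:
        px, py = project(H, src)
        error = ((px - dst[0]) ** 2 + (py - dst[1]) ** 2) ** 0.5
        if error <= threshold:
            inliers += 1
    return inliers / len(matches)


# Usage: a pure translation by (10, 5) with one bad correspondence out of four.
H_shift = [[1.0, 0.0, 10.0], [0.0, 1.0, 5.0], [0.0, 0.0, 1.0]]
pairs = [((0, 0), (10, 5)), ((1, 2), (11, 7)), ((4, 4), (14, 9)), ((2, 2), (40, 40))]
ratio = inlier_ratio(H_shift, pairs)
```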

[70] Event-Level Detection of Surgical Instrument Handovers in Videos with Interpretable Vision Models

Katerina Katsarou,George Zountsas,Karam Tomotaki-Dawoud,Alexander Ehrenhoefer,Paul Chojecki,David Przewozny,Igor Maximilian Sauer,Amira Mouakher,Sebastian Bosse

Main category: cs.CV

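TL;DR: This paper proposes a ViT-LSTM spatiotemporal vision framework for detecting instrument handover events in surgical videos and classifying their direction, achieving high-accuracy recognition through multi-task learning and peak detection and using Layer-CAM to improve interpretability.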

Details

Motivation: Automatic monitoring of surgical instrument handovers matters for surgical efficiency and patient safety, but existing methods are limited by frequent occlusions, background clutter, and the temporal dynamics of interaction events. Method: Uses a Vision Transformer (ViT) to extract spatial features and a unidirectional LSTM for temporal modeling; a unified multi-task loss jointly predicts whether a handover occurs and its direction; discrete handover events are localized by peak detection on the temporal confidence signal; Layer-CAM provides visual attribution. Result: On a kidney transplant dataset, handover detection reaches an F1 of 0.84 and direction classification a mean F1 of 0.72, outperforming a single-task variant and a VideoMamba-based baseline on direction prediction with comparable detection performance; Layer-CAM localizes the key hand-instrument interaction regions. Conclusion: The framework performs well in both accuracy and interpretability and offers an effective event-level modeling approach for surgical video understanding. Abstract: Reliable monitoring of surgical instrument exchanges is essential for maintaining procedural efficiency and patient safety in the operating room. Automatic detection of instrument handovers in intraoperative video remains challenging due to frequent occlusions, background clutter, and the temporally evolving nature of interaction events. We propose a spatiotemporal vision framework for event-level detection and direction classification of surgical instrument handovers in surgical videos. The model combines a Vision Transformer (ViT) backbone for spatial feature extraction with a unidirectional Long Short-Term Memory (LSTM) network for temporal aggregation. A unified multi-task formulation jointly predicts handover occurrence and interaction direction, enabling consistent modeling of transfer dynamics while avoiding error propagation typical of cascaded pipelines. Predicted confidence scores form a temporal signal over the video, from which discrete handover events are identified via peak detection. Experiments on a dataset of kidney transplant procedures demonstrate strong performance, achieving an F1-score of 0.84 for handover detection and a mean F1-score of 0.72 for direction classification, outperforming both a single-task variant and a VideoMamba-based baseline for direction prediction while maintaining comparable detection performance. To improve interpretability, we employ Layer-CAM attribution to visualize spatial regions driving model decisions, highlighting hand-instrument interaction cues.
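Illustrative sketch (not from the paper): the step from per-frame confidence scores to discrete handover events can be pictured as simple peak picking. The threshold and minimum gap between events below are assumed values, and the abstract does not say which peak detector the authors use, so this is only one plausible reading.

```python
def detect_events(scores, threshold=0.5, min_gap=5):
    """Return frame indices of local maxima above threshold, at least min_gap frames apart."""
    events = []
    for t in range(1, len(scores) - 1):
        is_peak = (
            scores[t] >= threshold
            and scores[t] > scores[t - 1]
            and scores[t] >= scores[t + 1]
        )
        if not is_peak:
            continue
        if events and t - events[-1] < min_gap:
            # Two peaks close together count as one event; keep the stronger one.
            if scores[t] > scores[events[-1]]:
                events[-1] = t
            continue
        events.append(t)
    return events


# Usage: two well-separated peaks yield two events.
confidence = [0.1, 0.2, 0.9, 0.3, 0.1, 0.1, 0.1, 0.1, 0.6, 0.8, 0.7, 0.2]
handovers = detect_events(confidence, threshold=0.5, min_gap=3)
```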

Muhammad Imran Sharif,Doina Caragea

Main category: cs.CV

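TL;DR: This paper proposes MSGL-Transformer, which uses multi-scale global-local attention and a Behavior-Aware Modulation block to improve rodent social behavior recognition from pose sequences, and reports state-of-the-art results on two public datasets.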

Details

Motivation: Manual annotation of rodent behavior is time-consuming and error-prone, so automatic, robust, and generalizable recognition methods are needed. Method: Proposes MSGL-Transformer, a lightweight transformer encoder with parallel short-range, medium-range, and global attention branches, plus a Behavior-Aware Modulation (BAM) block inspired by SE-Networks that emphasizes behavior-relevant features in the temporal embeddings. Result: On RatSI, reaches 75.4% mean accuracy (F1 = 0.745), outperforming TCN, LSTM, and Bi-LSTM; on CalMS21, reaches 87.1% accuracy (F1 = 0.8745), a +10.7% improvement over HSTWFormer, and outperforms graph-based models such as ST-GCN and MS-G3D; the same architecture generalizes across datasets with only the input dimensionality and number of classes adjusted. Conclusion: MSGL-Transformer shows that multi-scale temporal modeling and behavior-aware feature modulation are effective for pose-based behavior recognition, with strong generalization and practical value. Abstract: Recognition of rodent behavior is important for understanding neural and behavioral mechanisms. Traditional manual scoring is time-consuming and prone to human error. We propose MSGL-Transformer, a Multi-Scale Global-Local Transformer for recognizing rodent social behaviors from pose-based temporal sequences. The model employs a lightweight transformer encoder with multi-scale attention to capture motion dynamics across different temporal scales. The architecture integrates parallel short-range, medium-range, and global attention branches to explicitly capture behavior dynamics at multiple temporal scales. We also introduce a Behavior-Aware Modulation (BAM) block, inspired by SE-Networks, which modulates temporal embeddings to emphasize behavior-relevant features prior to attention. We evaluate on two datasets: RatSI (5 behavior classes, 12D pose inputs) and CalMS21 (4 behavior classes, 28D pose inputs). On RatSI, MSGL-Transformer achieves 75.4% mean accuracy and F1-score of 0.745 across nine cross-validation splits, outperforming TCN, LSTM, and Bi-LSTM. On CalMS21, it achieves 87.1% accuracy and F1-score of 0.8745, a +10.7% improvement over HSTWFormer, and outperforms ST-GCN, MS-G3D, CTR-GCN, and STGAT. The same architecture generalizes across both datasets with only input dimensionality and number of classes adjusted.

[72] Bootstrapping Sign Language Annotations with Sign Language Models

Colin Lea,Vasileios Baltatzis,Connor Gillis,Raja Kushalnagar,Lorna Quandt,Leah Findlater

Main category: cs.CV

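TL;DR: This paper proposes a pseudo-annotation pipeline to improve sign language recognition when high-quality annotated data is scarce; it combines sparse prediction models with a K-shot large language model to produce time-stamped candidate annotations for glosses, fingerspelling, and classifiers, and releases a human-annotated benchmark along with a large set of pseudo-annotations.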

Details

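Motivation: AI-driven sign language interpretation is limited by the lack of high-quality annotated data; recent large datasets such as ASL STEM Wiki and FLEURS-ASL contain hundreds of hours of professional signing video but are only partially annotated, and the high cost of manual annotation leaves them underutilized. Method: Builds a pseudo-annotation pipeline that takes signed video and the corresponding English text as input and outputs ranked candidate annotations, with time intervals, for glosses, fingerspelled words, and classifiers; it combines the authors' sparse fingerspelling recognizer and isolated sign recognizer (ISR) with a K-shot LLM for context modeling and annotation estimation, and establishes lightweight baseline fingerspelling and ISR models. Result: Fingerspelling recognition reaches a 6.7% character error rate (CER) on FSBoard, and isolated sign recognition reaches 74% top-1 accuracy on ASL Citizen; a professional interpreter annotated nearly 500 ASL STEM Wiki videos at the sequence level (glosses, classifiers, fingerspelling); this gold-standard set and over 300 hours of pseudo-annotations are released. Conclusion: The pipeline can substantially ease the annotation bottleneck for sign language data and support end-to-end sign language understanding research; the released benchmark and pseudo-annotations give the community an important foundation. Abstract: AI-driven sign language interpretation is limited by a lack of high-quality annotated data. New datasets including ASL STEM Wiki and FLEURS-ASL contain professional interpreters and 100s of hours of data but remain only partially annotated and thus underutilized, in part due to the prohibitive costs of annotating at this scale. In this work, we develop a pseudo-annotation pipeline that takes signed video and English as input and outputs a ranked set of likely annotations, including time intervals, for glosses, fingerspelled words, and sign classifiers. Our pipeline uses sparse predictions from our fingerspelling recognizer and isolated sign recognizer (ISR), along with a K-Shot LLM approach, to estimate these annotations. In service of this pipeline, we establish simple yet effective baseline fingerspelling and ISR models, achieving state-of-the-art on FSBoard (6.7% CER) and on ASL Citizen datasets (74% top-1 accuracy). To validate and provide a gold-standard benchmark, a professional interpreter annotated nearly 500 videos from ASL STEM Wiki with sequence-level gloss labels containing glosses, classifiers, and fingerspelling signs. These human annotations and over 300 hours of pseudo-annotations are being released in supplemental material.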
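Illustrative sketch (not from the paper): the fingerspelling result above is reported as a character error rate (CER). Under the usual definition, CER is the total Levenshtein edit distance between hypotheses and references divided by the total number of reference characters. The snippet assumes that definition; the authors' exact normalization (casing, whitespace handling) is not stated in the abstract.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance with unit-cost insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (0 if ref_char == hyp_char else 1)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def character_error_rate(references, hypotheses) -> float:
    """Corpus-level CER: summed edit distance over summed reference length."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars
```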

[73] VSAS-BENCH: Real-Time Evaluation of Visual Streaming Assistant Models

Pavan Kumar Anasosalu Vasu,Cem Koc,Fartash Faghri,Chun-Liang Li,Bo Feng,Zhengfeng Lai,Meng Cao,Oncel Tuzel,Hadi Pouransari

Main category: cs.CV

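TL;DR: This paper proposes VSAS-Bench, a new benchmark and evaluation framework for visual streaming assistants (streaming VLMs) that targets new metrics for real-time responses such as proactiveness and consistency; large-scale experiments characterize the accuracy-latency trade-off and show that some conventional VLMs, with simple adaptation, can outperform dedicated streaming models.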

Details

Motivation: Existing VLM evaluation is mostly done offline, whereas streaming VLMs also need to be assessed on new dimensions such as response timeliness (proactiveness) and temporal robustness (consistency), for which no dedicated benchmark exists. Method: Builds VSAS-Bench, with over 18,000 temporally dense annotations across multiple domains and task types; designs synchronous and asynchronous evaluation protocols; defines new metrics; and systematically evaluates recent video and streaming VLMs, analyzing how memory buffer length, memory access policy, and input resolution affect the accuracy-latency trade-off. Result: Conventional VLMs such as Qwen3-VL-4B can be adapted to the streaming setting without additional training, and Qwen3-VL-4B surpasses Dispider, the best streaming VLM on the benchmark, by 3% under the asynchronous protocol; the study also yields several practical design insights. Conclusion: VSAS-Bench fills a gap in streaming VLM evaluation, shows that simply adapting conventional VLMs is an efficient and viable route, and provides standardized tooling and empirical grounding for future real-time visual assistant research. Abstract: Streaming vision-language models (VLMs) continuously generate responses given an instruction prompt and an online stream of input frames. This is a core mechanism for real-time visual assistants. Existing VLM frameworks predominantly assess models in offline settings. In contrast, the performance of a streaming VLM depends on additional metrics beyond pure video understanding, including proactiveness, which reflects the timeliness of the model's responses, and consistency, which captures the robustness of its responses over time. To address this limitation, we propose VSAS-Bench, a new framework and benchmark for Visual Streaming Assistants. In contrast to prior benchmarks that primarily employ single-turn question answering on video inputs, VSAS-Bench features temporally dense annotations with over 18,000 annotations across diverse input domains and task types. We introduce standardized synchronous and asynchronous evaluation protocols, along with metrics that isolate and measure distinct capabilities of streaming VLMs. Using this framework, we conduct large-scale evaluations of recent video and streaming VLMs, analyzing the accuracy-latency trade-off under key design factors such as memory buffer length, memory access policy, and input resolution, yielding several practical insights. Finally, we show empirically that conventional VLMs can be adapted to streaming settings without additional training, and demonstrate that these adapted models outperform recent streaming VLMs.
For example, Qwen3-VL-4B surpasses Dispider, the best streaming VLM on our benchmark, by 3% under the asynchronous protocol. The benchmark and code will be available at https://github.com/apple/ml-vsas-bench.

[74] Monocular Depth Estimation From the Perspective of Feature Restoration: A Diffusion Enhanced Depth Restoration Approach

Huibin Bai,Shuai Li,Hanxiao Zhai,Yanbo Gao,Chong Lv,Yibo Wang,Haipeng Ping,Wei Hua,Xingyu Gao

Main category: cs.CV

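TL;DR: This paper proposes a new approach to monocular depth estimation from a feature restoration perspective: an Invertible Transform-enhanced Indirect Diffusion (InvT-IndDiffusion) module restores pretrained encoder features, and an Auxiliary Viewpoint-based Low-level Feature Enhancement (AV-LFE) module is added, outperforming existing methods on KITTI and other datasets.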

Details

Motivation: Mainstream monocular depth estimation (MDE) methods use multi-scale encoder-decoder architectures, but the limitations of this architecture and the effect of different feature levels on prediction accuracy have not been systematically evaluated; the authors find that the existing framework still has substantial headroom if encoder feature quality can be improved. Method: Formulates MDE as feature restoration, treating pretrained encoder features as degraded versions of ideal features; designs the Invertible Transform-enhanced Indirect Diffusion (InvT-IndDiffusion) module, which, under indirect supervision from the sparse depth map only, uses an invertible transform-based decoder satisfying the bi-Lipschitz condition to reduce feature deviation across diffusion steps; and adds a plug-and-play Auxiliary Viewpoint-based Low-level Feature Enhancement (AV-LFE) module to enhance local details. Result: Outperforms state-of-the-art methods on several standard datasets; on KITTI, RMSE improves over the baseline by 4.09% and 37.77% under different training settings. Conclusion: Feature quality is a key bottleneck for monocular depth estimation; InvT-IndDiffusion and AV-LFE improve the encoder's feature representation and local detail modeling, supporting the effectiveness and generality of the feature restoration formulation. Abstract: Monocular Depth Estimation (MDE) is a fundamental computer vision task with important applications in 3D vision. The current mainstream MDE methods employ an encoder-decoder architecture with multi-level/scale feature processing. However, the limitations of the current architecture and the effects of different-level features on the prediction accuracy are not evaluated. In this paper, we first investigate the above problem and show that there is still substantial potential in the current framework if encoder features can be improved. Therefore, we propose to formulate the depth estimation problem from the feature restoration perspective, by treating pretrained encoder features as degraded features of an assumed ground truth feature that yields the ground truth depth map. Then an Invertible Transform-enhanced Indirect Diffusion (InvT-IndDiffusion) module is developed for feature restoration. Due to the absence of direct supervision on feature, only indirect supervision from the final sparse depth map is used. During the iterative procedure of diffusion, this results in feature deviations among steps. The proposed InvT-IndDiffusion solves this problem by using an invertible transform-based decoder under the bi-Lipschitz condition. Finally, a plug-and-play Auxiliary Viewpoint-based Low-level Feature Enhancement module (AV-LFE) is developed to enhance local details with auxiliary viewpoint when available.
Experiments demonstrate that the proposed method achieves better performance than the state-of-the-art methods on various datasets. Specifically on the KITTI benchmark, compared with the baseline, the performance is improved by 4.09% and 37.77% under different training settings in terms of RMSE. Code is available at https://github.com/whitehb1/IID-RDepth.

[75] Adaptive Depth-converted-Scale Convolution for Self-supervised Monocular Depth Estimation

Yanbo Gao,Huibin Bai,Huasong Zhou,Xingyu Gao,Shuai Li,Xun Cai,Hui Yuan,Wei Hua,Tian Xie

Main category: cs.CV

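TL;DR: This paper proposes a monocular depth estimation framework enhanced with Depth-converted-Scale Convolution (DcSConv), which models the relationship between object depth and scale and adapts the convolution kernel scale to reduce the depth ambiguity caused by changing object scale in monocular video; a DcS-F module fuses the new and conventional features. As a plug-and-play module it improves existing CNN models, reducing squared relative error (SqRel) on KITTI by up to 11.6%.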

Details

Motivation: In monocular video, the apparent scale of the same object changes continuously as its depth changes, creating scale and depth ambiguity, and earlier methods do not explicitly model how object scale varies with depth. Method: Proposes Depth-converted-Scale Convolution (DcSConv), which turns the depth prior into an adaptive convolution kernel scale, emphasizing scale rather than local deformation; designs a Depth-converted-Scale aware Fusion (DcS-F) module to fuse DcSConv features with conventional convolution features; the whole design plugs into existing CNN-based depth estimation frameworks. Result: Achieves the best results on the KITTI benchmark, with SqRel reduced by up to 11.6%; ablations confirm the effectiveness of DcSConv and DcS-F. Conclusion: The scale of the convolution kernel matters for monocular depth estimation; explicitly introducing a depth-scale prior and adapting the convolution scale improves performance, and DcSConv is a general plug-and-play enhancement module. Abstract: Self-supervised monocular depth estimation (MDE) has received increasing interests in the last few years. The objects in the scene, including the object size and relationship among different objects, are the main clues to extract the scene structure. However, previous works lack the explicit handling of the changing sizes of the object due to the change of its depth. Especially in a monocular video, the size of the same object is continuously changed, resulting in size and depth ambiguity. To address this problem, we propose a Depth-converted-Scale Convolution (DcSConv) enhanced monocular depth estimation framework, by incorporating the prior relationship between the object depth and object scale to extract features from appropriate scales of the convolution receptive field. The proposed DcSConv focuses on the adaptive scale of the convolution filter instead of the local deformation of its shape. It establishes that the scale of the convolution filter matters no less (or even more in the evaluated task) than its local deformation. Moreover, a Depth-converted-Scale aware Fusion (DcS-F) is developed to adaptively fuse the DcSConv features and the conventional convolution features. Our DcSConv enhanced monocular depth estimation framework can be applied on top of existing CNN based methods as a plug-and-play module to enhance the conventional convolution block. Extensive experiments with different baselines have been conducted on the KITTI benchmark and our method achieves the best results with an improvement up to 11.6% in terms of SqRel reduction.
Ablation study also validates the effectiveness of each proposed module.
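Illustrative sketch (not from the paper): the SqRel and RMSE figures quoted in this entry and the previous one are standard depth-error metrics. The snippet assumes the definitions commonly used in KITTI depth evaluation, SqRel = mean((pred - gt)^2 / gt) and RMSE = sqrt(mean((pred - gt)^2)), computed over pixels with valid ground truth. Masking of invalid pixels, depth caps, and median scaling are omitted, and the function names are hypothetical.

```python
from math import sqrt


def sq_rel(pred, gt):
    """Squared relative error over valid pixels (gt values must be > 0)."""
    return sum((p - g) ** 2 / g for p, g in zip(pred, gt)) / len(gt)


def rmse(pred, gt):
    """Root mean squared error in the same units as the depth values."""
    return sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))


def relative_reduction(baseline_error, new_error):
    """Fractional error reduction relative to a baseline, e.g. 0.116 for 11.6%."""
    return (baseline_error - new_error) / baseline_error
```

A statement such as "up to 11.6% SqRel reduction" corresponds to `relative_reduction(baseline, new)` being about 0.116 for the best baseline pairing.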

[76] Weight Group-wise Post-Training Quantization for Medical Foundation Model

Yineng Chen,Peng Huang,Aozhong Zhang,Hui Guo,Penghang Yin,Shu Hu,Shao Lin,Xin Li,Tzu-Jen Kao,Balakrishnan Prabhakaran,MingChing Chang,Xin Wang

Main category: cs.CV

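TL;DR: This paper proposes Permutation-COMQ, a backpropagation-free post-training quantization algorithm that improves low-bit quantization accuracy by reordering weights and using simple operations, and reports the best results at 2, 4, and 8 bits.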

Details

Motivation: Foundation models perform well in medical image analysis, but their large architectures and high computational cost limit real-time inference on terminal medical devices, so efficient, lightweight quantization is needed. Method: Proposes the post-training quantization algorithm Permutation-COMQ: (1) it uses only dot products and rounding, avoiding backpropagation and hyperparameter tuning; (2) it introduces a weight-aware strategy that reorders the weights within each layer to reduce the accuracy loss caused by channel-wise scaling while preserving channel structure. Result: At 2-bit, 4-bit, and 8-bit settings, the method achieves the best results among compared approaches on medical imaging tasks. Conclusion: Permutation-COMQ is an efficient, simple, and accurate post-training quantization method that is particularly suited to deployment on resource-constrained terminal medical devices. Abstract: Foundation models have achieved remarkable results in medical image analysis. However, its large network architecture and high computational complexity significantly impact inference speed, limiting its application on terminal medical devices. Quantization, a technique that compresses models into low-bit versions, is a solution to this challenge. In this paper, we propose a post-training quantization algorithm, Permutation-COMQ. It eliminates the need for backpropagation by using simple dot products and rounding operations, thereby removing hyperparameter tuning and simplifying the process. Additionally, we introduce a weight-aware strategy that reorders the weight within each layer to address the accuracy degradation induced by channel-wise scaling during quantization, while preserving channel structure. Experiments demonstrate that our method achieves the best results in 2-bit, 4-bit, and 8-bit quantization.
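Illustrative sketch (not the paper's algorithm): to make "channel-wise scaling" and "rounding" concrete, the snippet below shows plain symmetric round-to-nearest quantization of one weight channel. This is the simple baseline that post-training methods typically improve on; Permutation-COMQ's weight reordering and its dot-product-based updates are not described in enough detail in the abstract to reproduce and are not modeled here. Function names are hypothetical.

```python
def quantize_channel(weights, bits):
    """Symmetric round-to-nearest quantization of one channel with a single scale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return [0] * len(weights), 1.0
    # One scale per channel: the largest magnitude maps to qmax.
    scale = max_abs / qmax
    quantized = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map integer codes back to approximate real-valued weights."""
    return [q * scale for q in quantized]


# Usage: the same channel at 4 bits and at 2 bits.
channel = [7.0, -3.0, 1.0, 0.4]
codes4, scale4 = quantize_channel(channel, bits=4)
codes2, scale2 = quantize_channel(channel, bits=2)
```

At 2 bits the small weights all collapse to zero, which illustrates why a single per-channel scale loses accuracy at low bit widths.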

[77] FireSenseNet: A Dual-Branch CNN with Cross-Attentive Feature Interaction for Next-Day Wildfire Spread Prediction

Jinzhen Han,JinByeong Lee,Hak Han,YeonJu Na,Jae-Joon Lee

Main category: cs.CV

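TL;DR: This paper proposes FireSenseNet, a dual-branch CNN whose CAFIM module explicitly models the spatial interaction between fuel and weather inputs; it outperforms existing methods on wildfire spread prediction and exposes evaluation bias and the relative importance of key features.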

Details

Motivation: Existing deep learning methods simply concatenate heterogeneous geospatial inputs, ignoring the physical difference between static fuel/terrain properties and dynamic meteorological conditions, which leads to inaccurate modeling. Method: Proposes FireSenseNet, a dual-branch CNN with a Cross-Attentive Feature Interaction Module (CAFIM) that models spatially varying interaction between fuel and weather modalities through learnable attention gates at multiple encoder scales; Monte Carlo Dropout provides pixel-level uncertainty quantification. Result: On the Google Next-Day Wildfire Spread benchmark, FireSenseNet reaches F1 = 0.4176 and AUC-PR = 0.3435, outperforming all compared models, including a SegFormer with 3.8x more parameters; CAFIM gives a 7.1% relative F1 gain; the previous-day fire mask dominates prediction, while wind speed behaves as noise at the dataset's coarse temporal resolution; common evaluation shortcuts inflate F1 by over 44%. Conclusion: Explicitly modeling the physical interaction between geospatial modalities matters for wildfire prediction; CAFIM improves performance; evaluation should avoid shortcut bias; and feature importance analysis with uncertainty quantification makes the model more trustworthy and interpretable. Abstract: Accurate prediction of next-day wildfire spread is critical for disaster response and resource allocation. Existing deep learning approaches typically concatenate heterogeneous geospatial inputs into a single tensor, ignoring the fundamental physical distinction between static fuel/terrain properties and dynamic meteorological conditions. We propose FireSenseNet, a dual-branch convolutional neural network equipped with a novel Cross-Attentive Feature Interaction Module (CAFIM) that explicitly models the spatially varying interaction between fuel and weather modalities through learnable attention gates at multiple encoder scales. Through a systematic comparison of seven architectures -- spanning pure CNNs, Vision Transformers, and hybrid designs -- on the Google Next-Day Wildfire Spread benchmark, we demonstrate that FireSenseNet achieves an F1 of 0.4176 and AUC-PR of 0.3435, outperforming all alternatives including a SegFormer with 3.8x more parameters (F1 = 0.3502). Ablation studies confirm that CAFIM provides a 7.1% relative F1 gain over naive concatenation, and channel-wise feature importance analysis reveals that the previous-day fire mask dominates prediction while wind speed acts as noise at the dataset's coarse temporal resolution. We further incorporate Monte Carlo Dropout for pixel-level uncertainty quantification and present a critical analysis showing that common evaluation shortcuts inflate reported F1 scores by over 44%.
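Illustrative sketch (not from the paper): the F1 figures above are pixel-level scores on binary fire masks, and "relative F1 gain" compares two such scores. The snippet assumes flattened 0/1 masks and the usual F1 definition 2TP / (2TP + FP + FN). How the benchmark handles unlabeled pixels, and which evaluation shortcuts the authors criticize, are not specified in the abstract and are not modeled.

```python
def f1_score(pred, target):
    """Pixel-level F1 for flattened binary masks (1 = fire, 0 = no fire)."""
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    denominator = 2 * tp + fp + fn
    if denominator == 0:
        return 0.0
    return 2 * tp / denominator


def relative_gain(new_score, baseline_score):
    """Relative improvement of new_score over baseline_score."""
    return (new_score - baseline_score) / baseline_score
```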

[78] Needle in a Haystack -- One-Class Representation Learning for Detecting Rare Malignant Cells in Computational Cytology

Swarnadip Chatterjee,Vladimir Basic,Arrigo Capitanio,Orcun Goksel,Joakim Lindblad

Main category: cs.CV

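TL;DR: This paper uses one-class representation learning (OCC) methods such as DSVDD and DROC to detect rare malignant cells at extremely low witness rates (≤1%); training needs only negative slides and no instance-level annotation, and the approach reaches state-of-the-art performance on bone marrow and oral cancer cytology datasets, outperforming conventional MIL and some supervised methods.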

Details

Motivation: Malignant cells in whole-slide images are morphologically diverse and extremely rare, causing severe class imbalance and scarce annotations; conventional weakly supervised methods such as MIL generalize poorly at very low witness rates. Method: Uses one-class representation learning methods (DSVDD, DROC) trained only on negative patches to learn a compact representation of normality and detect deviations at test time; compares against FS-SIL, WS-SIL, and ItS2CLR. Result: DSVDD achieves state-of-the-art instance-level abnormality ranking at witness rates ≤1% and in some cases even outperforms fully supervised learning; DROC, which uses distribution-augmented contrastive learning, is also competitive under extreme rarity. Conclusion: One-class representation learning is a more robust and more interpretable choice than MIL for detecting extremely rare malignant cells. Abstract: In computational cytology, detecting malignancy on whole-slide images is difficult because malignant cells are morphologically diverse yet vanishingly rare amid a vast background of normal cells. Accurate detection of these extremely rare malignant cells remains challenging due to large class imbalance and limited annotations. Conventional weakly supervised approaches, such as multiple instance learning (MIL), often fail to generalize at the instance level, especially when the fraction of malignant cells (witness rate) is exceedingly low. In this study, we explore the use of one-class representation learning techniques for detecting malignant cells in low-witness-rate scenarios. These methods are trained exclusively on slide-negative patches, without requiring any instance-level supervision. Specifically, we evaluate two OCC approaches, DSVDD and DROC, and compare them with FS-SIL, WS-SIL, and the recent ItS2CLR method. The one-class methods learn compact representations of normality and detect deviations at test time. Experiments on a publicly available bone marrow cytomorphology dataset (TCIA) and an in-house oral cancer cytology dataset show that DSVDD achieves state-of-the-art performance in instance-level abnormality ranking, particularly in ultra-low witness-rate regimes ($\leq 1\%$) and, in some cases, even outperforming fully supervised learning, which is typically not a practical option in whole-slide cytology due to the infeasibility of exhaustive instance-level annotations. DROC is also competitive under extreme rarity, benefiting from distribution-augmented contrastive learning.
These findings highlight one-class representation learning as a robust and interpretable superior choice to MIL for malignant cell detection under extreme rarity.
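Illustrative sketch (not from the paper): Deep SVDD-style scoring can be pictured as "distance from the center of normal embeddings". The snippet assumes patch embeddings have already been produced by some trained encoder (omitted here), fixes the center as the mean embedding of slide-negative patches, and ranks test patches by squared distance to that center. The encoder training objective and the DROC variant are not modeled, and the function names are hypothetical.

```python
def center_of(features):
    """Mean embedding of normal (slide-negative) patches."""
    dim = len(features[0])
    return [sum(f[i] for f in features) / len(features) for i in range(dim)]


def anomaly_score(feature, center):
    """Squared Euclidean distance to the center; larger means more abnormal."""
    return sum((x - c) ** 2 for x, c in zip(feature, center))


def rank_by_abnormality(features, center):
    """Indices of test patches sorted from most to least abnormal."""
    return sorted(
        range(len(features)),
        key=lambda i: anomaly_score(features[i], center),
        reverse=True,
    )


# Usage with toy 2D embeddings.
normal_patches = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
test_patches = [[1.0, 1.0], [5.0, 5.0], [1.0, 2.0]]
c = center_of(normal_patches)
ranking = rank_by_abnormality(test_patches, c)
```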

[79] Direct Segmentation without Logits Optimization for Training-Free Open-Vocabulary Semantic Segmentation

Jiahao Li,Yang Lu,Yachao Zhang,Fangyong Wang,Yuan Xie,Yanyun Qu

Main category: cs.CV

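TL;DR: This paper proposes a new open-vocabulary semantic segmentation method that needs neither iterative training nor model-specific attention modulation; it derives the semantic map directly from an analytic solution of the distribution discrepancy and achieves state-of-the-art performance.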

Details

Motivation: Existing open-vocabulary semantic segmentation methods rely on time-consuming iterative training or model-specific attention modulation to optimize the logits of vision-language features, which limits efficiency and generality. Method: Derives an analytic solution of the distribution discrepancy and uses it directly as the semantic map, based on the core hypothesis that the discrepancy is consistent across patches of the same category and inconsistent across different categories, thereby bypassing logits optimization. Result: Achieves state-of-the-art performance on eight benchmark datasets while removing the dependence on iterative training and model-specific attention modulation. Conclusion: The distribution discrepancy itself carries semantic information, and its analytic solution can be used directly for segmentation, giving OVSS a more efficient and general approach. Abstract: Open-vocabulary semantic segmentation (OVSS) aims to segment arbitrary category regions in images using open-vocabulary prompts, necessitating that existing methods possess pixel-level vision-language alignment capability. Typically, this capability involves computing the cosine similarity, i.e., logits, between visual and linguistic features, and minimizing the distribution discrepancy between the logits and the ground truth (GT) to generate optimal logits that are subsequently used to construct segmentation maps, yet it depends on time-consuming iterative training or model-specific attention modulation. In this work, we propose a more direct approach that eschews the logits-optimization process by directly deriving an analytic solution for the segmentation map. We posit a key hypothesis: the distribution discrepancy encodes semantic information; specifically, this discrepancy exhibits consistency across patches belonging to the same category but inconsistency across different categories. Based on this hypothesis, we directly utilize the analytic solution of this distribution discrepancy as the semantic maps. In other words, we reformulate the optimization of the distribution discrepancy as deriving its analytic solution, thereby eliminating time-consuming iterative training, freeing us from model-specific attention modulation, and achieving state-of-the-art performance on eight benchmark datasets.
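Illustrative sketch (not the paper's method): the abstract describes the conventional starting point, cosine-similarity logits between patch features and text features. The snippet shows only that baseline step, assigning each patch to the category whose text embedding is most similar. The paper's analytic solution of the distribution discrepancy is not given in the abstract and is not reproduced. Feature vectors are assumed to come from some vision-language encoder, and the function names are hypothetical.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def logits(patch_features, text_features):
    """Matrix of cosine similarities: one row per patch, one column per category."""
    return [[cosine(p, t) for t in text_features] for p in patch_features]


def segment(patch_features, text_features):
    """Assign each patch the index of its highest-scoring category."""
    scores = logits(patch_features, text_features)
    return [max(range(len(row)), key=row.__getitem__) for row in scores]
```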

Jialin Li,Bin Fu,Ruiping Wang,Xilin Chen

Main category: cs.CV

TL;DR: GEAR is an EM-style alternating optimization framework using Gaussian Splatting to jointly reconstruct geometry and motion of articulated objects, improving stability and generalization via latent part segmentation and weakly supervised multi-view priors.

Details

Motivation: High-fidelity reconstruction of articulated objects is challenging due to complex structures and coupled geometry-motion relationships; existing methods suffer from unstable joint optimization and poor generalization on multi-joint or out-of-distribution objects. Method: GEAR proposes an EM-style alternating optimization framework within a Gaussian Splatting representation, treating part segmentation as a latent variable and joint motion parameters as explicit variables; it uses a vanilla 2D segmentation model for multi-view part priors and a weakly supervised constraint to regularize segmentation. Result: GEAR achieves state-of-the-art performance in geometric reconstruction and motion parameter estimation on multiple benchmarks and the new GEAR-Multi dataset, especially for complex multi-part articulated objects. Conclusion: GEAR effectively addresses instability and generalization limitations in articulated object reconstruction by decoupling and alternately refining geometry and motion with latent segmentation and multi-view priors. Abstract: High-fidelity interactive digital assets are essential for embodied intelligence and robotic interaction, yet articulated objects remain challenging to reconstruct due to their complex structures and coupled geometry-motion relationships. Existing methods suffer from instability in geometry-motion joint optimization, while their generalization remains limited on complex multi-joint or out-of-distribution objects. To address these challenges, we propose GEAR, an EM-style alternating optimization framework that jointly models geometry and motion as interdependent components within a Gaussian Splatting representation. GEAR treats part segmentation as a latent variable and joint motion parameters as explicit variables, alternately refining them for improved convergence and geometric-motion consistency. 
To enhance part segmentation quality without sacrificing generalization, we leverage a vanilla 2D segmentation model to provide multi-view part priors, and employ a weakly supervised constraint to regularize the latent variable. Experiments on multiple benchmarks and our newly constructed dataset GEAR-Multi demonstrate that GEAR achieves state-of-the-art results in geometric reconstruction and motion parameters estimation, particularly on complex articulated objects with multiple movable parts.

[81] Beyond Pedestrians: Caption-Guided CLIP Framework for High-Difficulty Video-based Person Re-Identification

Shogo Hamano,Shunya Wakasugi,Tatsuhito Sato,Sayaka Nakamura

Main category: cs.CV

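TL;DR: This paper proposes the CG-CLIP framework, which combines textual descriptions with learnable tokens and uses Caption-guided Memory Refinement (CMR) and Token-based Feature Extraction (TFE) to improve video-based person re-identification in high-difficulty scenarios such as sports and dance, achieving state-of-the-art results on multiple datasets.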

Details

Motivation: Existing video-based person re-identification methods perform poorly in high-difficulty scenarios such as sports and dance, where multiple people wear similar clothing and move dynamically; richer semantic information and efficient spatiotemporal feature modeling are needed. Method: Proposes the CG-CLIP framework with two core modules: (1) Caption-guided Memory Refinement (CMR), which uses captions generated by multimodal large language models to refine identity-specific features; (2) Token-based Feature Extraction (TFE), which aggregates spatiotemporal features using fixed-length learnable tokens and cross-attention. Result: Outperforms state-of-the-art methods on four datasets (MARS, iLIDS-VID, SportsVReID, and DanceVReID), with especially large gains on the newly constructed high-difficulty SportsVReID and DanceVReID. Conclusion: By combining explicit caption guidance with learnable-token modeling, CG-CLIP improves the robustness and discriminability of video person re-identification in complex dynamic scenes and offers a new approach to fine-grained cross-camera person matching. Abstract: In recent years, video-based person Re-Identification (ReID) has gained attention for its ability to leverage spatiotemporal cues to match individuals across non-overlapping cameras. However, current methods struggle with high-difficulty scenarios, such as sports and dance performances, where multiple individuals wear similar clothing while performing dynamic movements. To overcome these challenges, we propose CG-CLIP, a novel caption-guided CLIP framework that leverages explicit textual descriptions and learnable tokens. Our method introduces two key components: Caption-guided Memory Refinement (CMR) and Token-based Feature Extraction (TFE). CMR utilizes captions generated by Multi-modal Large Language Models (MLLMs) to refine identity-specific features, capturing fine-grained details. TFE employs a cross-attention mechanism with fixed-length learnable tokens to efficiently aggregate spatiotemporal features, reducing computational overhead. We evaluate our approach on two standard datasets (MARS and iLIDS-VID) and two newly constructed high-difficulty datasets (SportsVReID and DanceVReID). Experimental results demonstrate that our method outperforms current state-of-the-art approaches, achieving significant improvements across all benchmarks.

Fangda Wei,Miao Liu,Yingxue Wang,Jing Wang,Shenghui Zhao,Nan Li

Main category: cs.CV

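TL;DR: This paper proposes a multi-scale cross-modal transformer encoder (MSCT) for audio-visual deepfake detection, using multi-scale self-attention and differential cross-modal attention to improve feature extraction and modality alignment, and validates it on the FakeAVCeleb dataset.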

Details

Motivation: Traditional multi-modal forgery detection methods suffer from insufficient feature extraction and modality alignment deviation. Method: Proposes a multi-scale cross-modal transformer encoder (MSCT) with a multi-scale self-attention mechanism that integrates the features of adjacent embeddings and a differential cross-modal attention mechanism that fuses multi-modal features. Result: Achieves competitive detection performance on the FakeAVCeleb dataset. Conclusion: The MSCT structure captures audio-visual inconsistency traces more effectively and improves deepfake detection accuracy. Abstract: Audio-visual deepfake detection typically employs a complementary multi-modal model to check the forgery traces in the video. These methods primarily extract forgery traces through audio-visual alignment, which results from the inconsistency between audio and video modalities. However, the traditional multi-modal forgery detection method has the problem of insufficient feature extraction and modal alignment deviation. To address this, we propose a multi-scale cross-modal transformer encoder (MSCT) for deepfake detection. Our approach includes a multi-scale self-attention to integrate the features of adjacent embeddings and a differential cross-modal attention to fuse multi-modal features. Our experiments demonstrate competitive performance on the FakeAVCeleb dataset, validating the effectiveness of the proposed structure.

[83] Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding

Xiangyue Liu,Zijian Zhang,Miles Yang,Zhao Zhong,Liefeng Bo,Ping Tan

Main category: cs.CV

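TL;DR: This paper proposes Symbiotic-MoE, a framework that unifies the understanding and generation capabilities of pre-trained large multimodal models with zero parameter overhead; modality-aware expert disentanglement and a progressive training strategy mitigate gradient conflicts and routing collapse, yielding cross-modal synergy.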

Details

Motivation: Existing methods such as MoT mitigate gradient conflicts between understanding and generation tasks, but they sacrifice cross-modal synergy and fragment capacity; a new architecture is needed that balances task isolation with semantic sharing. Method: Proposes Symbiotic-MoE: (1) Modality-Aware Expert Disentanglement, which partitions experts into task-specific groups and uses shared experts as a multimodal semantic bridge; and (2) a Progressive Training Strategy with differential learning rates and early-stage gradient shielding. Result: Significantly improves performance on understanding benchmarks such as MMLU and OCRBench while achieving rapid generative convergence, supporting the claimed cross-modal synergy. Conclusion: Without adding parameters, Symbiotic-MoE resolves the interference between understanding and generation tasks in large multimodal models and shows that shared experts can feed fine-grained visual semantics from generative tasks back into text understanding, offering a new paradigm for unified multimodal pre-training. Abstract: Empowering Large Multimodal Models (LMMs) with image generation often leads to catastrophic forgetting in understanding tasks due to severe gradient conflicts. While existing paradigms like Mixture-of-Transformers (MoT) mitigate this conflict through structural isolation, they fundamentally sever cross-modal synergy and suffer from capacity fragmentation. In this work, we present Symbiotic-MoE, a unified pre-training framework that resolves task interference within a native multimodal Mixture-of-Experts (MoE) Transformers architecture with zero-parameter overhead. We first identify that standard MoE tuning leads to routing collapse, where generative gradients dominate expert utilization. To address this, we introduce Modality-Aware Expert Disentanglement, which partitions experts into task-specific groups while utilizing shared experts as a multimodal semantic bridge. Crucially, this design allows shared experts to absorb fine-grained visual semantics from generative tasks to enrich textual representations. To optimize this, we propose a Progressive Training Strategy featuring differential learning rates and early-stage gradient shielding. This mechanism not only shields pre-trained knowledge from early volatility but eventually transforms generative signals into constructive feedback for understanding. Extensive experiments demonstrate that Symbiotic-MoE achieves rapid generative convergence while unlocking cross-modal synergy, boosting inherent understanding with remarkable gains on MMLU and OCRBench.

[84] DailyArt: Discovering Articulation from Single Static Images via Latent Dynamics

Hang Zhang,Qijian Tian,Jingyu Gong,Daoguo Dong,Xuhong Wang,Yuan Xie,Xin Tan

Main category: cs.CV

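TL;DR: DailyArt proposes a new method for inferring the kinematic structure of articulated objects from a single closed-state image: it first synthesizes a maximally opened state to expose motion cues, then estimates all joint parameters from the discrepancy between the observed and synthesized states, without multi-view inputs, templates, or explicit part annotations.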

Details

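Motivation: Existing methods struggle to infer the kinematic structure of an articulated object accurately from a single closed-state image because key motion cues are often occluded; most also depend on multi-state observations, explicit part priors, or auxiliary inputs, which limits generalization and practicality. Method: DailyArt formulates single-image articulated joint estimation as a synthesis-guided reasoning problem: it first synthesizes the object's maximally opened state under the same viewpoint to expose joint information, then jointly predicts all joint parameters from the discrepancy between the observed and synthesized images. A set-prediction formulation supports end-to-end, template-free, annotation-free joint estimation, and the estimated joints can further condition part-level novel state synthesis. Result: On multiple benchmarks, DailyArt achieves strong performance in articulated joint estimation and supports joint-conditioned part-level novel state synthesis, demonstrating both its reasoning and generation capabilities. Conclusion: DailyArt shows that synthesis-mediated reasoning can overcome the occlusion challenge in single-image articulation estimation, providing a more robust and general paradigm for articulation understanding in embodied AI and world models. Abstract: Articulated objects are essential for embodied AI and world models, yet inferring their kinematics from a single closed-state image remains challenging because crucial motion cues are often occluded. Existing methods either require multi-state observations or rely on explicit part priors, retrieval, or other auxiliary inputs that partially expose the structure to be inferred. In this work, we present DailyArt, which formulates articulated joint estimation from a single static image as a synthesis-mediated reasoning problem. Instead of directly regressing joints from a heavily occluded observation, DailyArt first synthesizes a maximally articulated opened state under the same camera view to expose articulation cues, and then estimates the full set of joint parameters from the discrepancy between the observed and synthesized states. Using a set-prediction formulation, DailyArt recovers all joints simultaneously without requiring object-specific templates, multi-view inputs, or explicit part annotations at test time. Taking estimated joints as conditions, the framework further supports part-level novel state synthesis as a downstream capability. Extensive experiments show that DailyArt achieves strong performance in articulated joint estimation and supports part-level novel state synthesis conditioned on joints. Project page is available at https://rangooo123.github.io/DaliyArt.github.io/.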

[85] WUTDet: A 100K-Scale Ship Detection Dataset and Benchmarks with Dense Small Objects

Junxiong Liang,Mengwei Bao,Tianxiang Wang,Xinggang Wang,An-An Liu,Ryan Wen Liu

Main category: cs.CV

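TL;DR: This paper builds WUTDet, a large-scale ship detection dataset with more than 100,000 images and more than 380,000 annotated instances covering diverse, complex maritime scenes; systematically evaluates 20 mainstream detection models (CNN, Transformer, Mamba) on it, characterizing their accuracy, small-object detection, and inference efficiency; and introduces the Ship-GEN cross-dataset test set to assess generalization.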

Details

Motivation: Existing public ship detection datasets are limited in scale, proportion of small objects, and scene diversity, making it hard to support systematic evaluation and generalization studies of detection algorithms in complex maritime environments. Method: Constructs WUTDet, a large-scale multi-scene ship detection dataset; systematically evaluates 20 CNN, Transformer, and Mamba baseline models on WUTDet; and builds Ship-GEN, a unified cross-dataset test set, to assess generalization. Result: Transformers achieve the best detection accuracy (AP) and small-object detection performance (APs); CNNs have the highest inference efficiency and suit real-time applications; Mamba strikes a good balance between accuracy and efficiency; and models trained on WUTDet generalize better on Ship-GEN. Conclusion: WUTDet provides high-quality data support for research, evaluation, and generalization analysis of ship detection algorithms in complex maritime scenes, advancing intelligent waterway perception. Abstract: Ship detection for navigation is a fundamental perception task in intelligent waterway transportation systems. However, existing public ship detection datasets remain limited in terms of scale, the proportion of small-object instances, and scene diversity, which hinders the systematic evaluation and generalization study of detection algorithms in complex maritime environments. To this end, we construct WUTDet, a large-scale ship detection dataset. WUTDet contains 100,576 images and 381,378 annotated ship instances, covering diverse operational scenarios such as ports, anchorages, navigation, and berthing, as well as various imaging conditions including fog, glare, low light, and rain, thereby exhibiting substantial diversity and challenge. Based on WUTDet, we systematically evaluate 20 baseline models from three mainstream detection architectures, namely CNN, Transformer, and Mamba. Experimental results show that the Transformer architecture achieves superior overall detection accuracy (AP) and small-object detection performance (APs), demonstrating stronger adaptability to complex maritime scenes; the CNN architecture maintains an advantage in inference efficiency, making it more suitable for real-time applications; and the Mamba architecture achieves a favorable balance between detection accuracy and computational efficiency. Furthermore, we construct a unified cross-dataset test set, Ship-GEN, to evaluate model generalization. Results on Ship-GEN show that models trained on WUTDet exhibit stronger generalization under different data distributions. These findings demonstrate that WUTDet provides effective data support for the research, evaluation, and generalization analysis of ship detection algorithms in complex maritime scenarios. The dataset is publicly available at: https://github.com/MAPGroup/WUTDet.

[86] Beyond Surface Artifacts: Capturing Shared Latent Forgery Knowledge Across Modalities

Jingtong Dou,Chuancheng Shi,Jian Wang,Fei Shen,Zhiyong Wang,Tat-Seng Chua

Main category: cs.CV

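TL;DR: This paper proposes the first modality-agnostic forgery (MAF) detection framework, which decouples modality-specific styles to extract forgery knowledge shared across modalities, and defines two dimensions, Weak MAF and Strong MAF, to measure generalization; it also builds the DeepModal-Bench benchmark to assess robustness on unseen "dark modalities", empirically confirming that universal forgery traces exist and substantially improving cross-modal detection performance.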

Details

Motivation: Existing deepfake detection methods rely too heavily on superficial, modality-specific artifacts and overlook the underlying shared latent forgery knowledge, so they generalize very poorly to unseen "dark modalities". Method: Proposes the modality-agnostic forgery (MAF) detection framework, which explicitly decouples modality-specific styles to extract cross-modal shared forgery knowledge; defines two generalization dimensions, Weak MAF (transfer to semantically correlated modalities) and Strong MAF (robustness on completely isolated "dark modalities"); and builds DeepModal-Bench, a multimodal forgery detection benchmark. Result: Empirically confirms the existence of universal forgery traces and achieves significant performance gains on unknown modalities, opening a new path toward universal multimodal defense. Conclusion: Shifting multimodal forensics from the traditional "feature fusion" paradigm to a "modality generalization" paradigm is key to breaking the generalization bottleneck; the MAF framework and its evaluation suite lay a foundation for future robust deepfake detection. Abstract: As generative artificial intelligence evolves, deepfake attacks have escalated from single-modality manipulations to complex, multimodal threats. Existing forensic techniques face a severe generalization bottleneck: by relying excessively on superficial, modality-specific artifacts, they neglect the shared latent forgery knowledge hidden beneath variable physical appearances. Consequently, these models suffer catastrophic performance degradation when confronted with unseen "dark modalities." To break this limitation, this paper introduces a paradigm shift that redefines multimodal forensics from conventional "feature fusion" to "modality generalization." We propose the first modality-agnostic forgery (MAF) detection framework. By explicitly decoupling modality-specific styles, MAF precisely extracts the essential, cross-modal latent forgery knowledge. Furthermore, we define two progressive dimensions to quantify model generalization: transferability toward semantically correlated modalities (Weak MAF), and robustness against completely isolated signals of "dark modality" (Strong MAF). To rigorously assess these generalization limits, we introduce the DeepModal-Bench benchmark, which integrates diverse multimodal forgery detection algorithms and adapts state-of-the-art generalized learning methods. This study not only empirically proves the existence of universal forgery traces but also achieves significant performance breakthroughs on unknown modalities via the MAF framework, offering a pioneering technical pathway for universal multimodal defense.

[87] RemoteAgent: Bridging Vague Human Intents and Earth Observation with RL-based Agentic MLLMs

Liang Yao,Shengxiang Xu,Fan Liu,Chuanyi Zhang,Bishun Yao,Rui Min,Yongjun Li,Chaoqian Ouyang,Shimin Di,Min-Ling Zhang

Main category: cs.CV

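TL;DR: This paper proposes the RemoteAgent framework, which builds the VagueEO dataset and applies reinforcement fine-tuning so that a multimodal large language model (MLLM) can understand vague natural-language queries and allocate tasks intelligently: it handles image-level and sparse region-level tasks internally and calls external specialized tools only for dense prediction tasks, achieving efficient, accurate intent recognition and execution in remote sensing.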

Details

Motivation: Users of remote sensing systems often express their needs in vague natural language, yet existing MLLMs cannot directly output high-precision spatial predictions, and existing agentic frameworks invoke tools inefficiently and underuse the MLLM's native capabilities. Method: Proposes the RemoteAgent agentic framework; builds VagueEO, a human-centric instruction dataset, and uses it for reinforcement fine-tuning of an MLLM so that it becomes a cognitive core that directly handles image-level and sparse region-level tasks; and dispatches external specialized tools through the Model Context Protocol only when dense prediction is required. Result: RemoteAgent shows robust intent recognition and highly competitive performance across diverse remote sensing tasks, improving the efficiency and accuracy of mapping vague queries to multi-granularity visual analysis. Conclusion: An agent design that respects the MLLM's inherent capability boundaries, models realistic vague human queries, and invokes tools selectively is a key path toward practical remote sensing AI systems. Abstract: Earth Observation (EO) systems are essentially designed to support domain experts who often express their requirements through vague natural language rather than precise, machine-friendly instructions. Depending on the specific application scenario, these vague queries can demand vastly different levels of visual precision. Consequently, a practical EO AI system must bridge the gap between ambiguous human queries and the appropriate multi-granularity visual analysis tasks, ranging from holistic image interpretation to fine-grained pixel-wise predictions. While Multi-modal Large Language Models (MLLMs) demonstrate strong semantic understanding, their text-based output format is inherently ill-suited for dense, precision-critical spatial predictions. Existing agentic frameworks address this limitation by delegating tasks to external tools, but indiscriminate tool invocation is computationally inefficient and underutilizes the MLLM's native capabilities. To this end, we propose RemoteAgent, an agentic framework that strategically respects the intrinsic capability boundaries of MLLMs. To empower this framework to understand real user intents, we construct VagueEO, a human-centric instruction dataset pairing EO tasks with simulated vague natural-language queries. By leveraging VagueEO for reinforcement fine-tuning, we align an MLLM into a robust cognitive core that directly resolves image- and sparse region-level tasks. Consequently, RemoteAgent processes suitable tasks internally while intelligently orchestrating specialized tools via the Model Context Protocol exclusively for dense predictions. Extensive experiments demonstrate that RemoteAgent achieves robust intent recognition capabilities while delivering highly competitive performance across diverse EO tasks.

[88] ESOM: Efficiently Understanding Streaming Video Anomalies with Open-world Dynamic Definitions

Zihao Liu,Xiaoyu Wu,Wenna Li,Jianqin Wu,Linlin Yang

Main category: cs.CV

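TL;DR: This paper proposes ESOM, a training-free, efficient streaming model for open-world video anomaly detection that addresses the shortcomings of existing MLLM-based methods in efficiency, streaming processing, and support for dynamic anomaly definitions, and introduces the new OpenDef-Bench benchmark for evaluation.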

Details

Motivation: Existing open-world video anomaly detection methods based on multimodal large language models (MLLMs) are inefficient to deploy, not adapted to streaming processing, and offer limited support for dynamic anomaly definitions. Method: Proposes ESOM, comprising a Definition Normalization module (to reduce hallucination), an Inter-frame-matched Intra-frame Token Merging module (to compress redundant visual tokens), a Hybrid Streaming Memory module (for efficient causal inference), and a Probabilistic Scoring module (to convert interval-level textual outputs into frame-level anomaly scores); also builds the OpenDef-Bench benchmark. Result: ESOM runs in real time on a single GPU and achieves state-of-the-art performance in anomaly temporal localization, classification, and description generation. Conclusion: ESOM is an efficient, training-free open-world video anomaly detection solution that supports streaming processing and dynamic anomaly definitions, improving practical deployability and generalization. Abstract: Open-world video anomaly detection (OWVAD) aims to detect and explain abnormal events under different anomaly definitions, which is important for applications such as intelligent surveillance and live-streaming content moderation. Recent MLLM-based methods have shown promising open-world generalization, but still suffer from three major limitations: inefficiency for practical deployment, lack of streaming processing adaptation, and limited support for dynamic anomaly definitions in both modeling and evaluation. To address these issues, this paper proposes ESOM, an efficient streaming OWVAD model that operates in a training-free manner. ESOM includes a Definition Normalization module to structure user prompts for reducing hallucination, an Inter-frame-matched Intra-frame Token Merging module to compress redundant visual tokens, a Hybrid Streaming Memory module for efficient causal inference, and a Probabilistic Scoring module that converts interval-level textual outputs into frame-level anomaly scores. In addition, this paper introduces OpenDef-Bench, a new benchmark with clean surveillance videos and diverse natural anomaly definitions for evaluating performance under varying conditions. Extensive experiments show that ESOM achieves real-time efficiency on a single GPU and state-of-the-art performance in anomaly temporal localization, classification, and description generation. The code and benchmark will be released at https://github.com/Kamino666/ESOM_OpenDef-Bench.
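
The ESOM entry above mentions converting interval-level outputs into frame-level anomaly scores. One simple way such a conversion could work is sketched below in Python. It assumes each interval carries a start time, an end time, and a probability, and that overlapping intervals are combined by taking the maximum. The function name `intervals_to_frame_scores`, the tuple layout, and the max rule are illustrative assumptions, not details taken from the paper.

```python
import math


def intervals_to_frame_scores(intervals, num_frames, fps):
    """Spread interval-level probabilities onto individual frames.

    intervals: list of (start_seconds, end_seconds, probability) tuples
               (hypothetical layout, assumed for this sketch).
    Returns one score per frame; frames outside every interval get 0.0.
    """
    scores = [0.0] * num_frames
    for start_s, end_s, prob in intervals:
        first = max(0, int(start_s * fps))
        last = min(num_frames, int(math.ceil(end_s * fps)))
        for frame in range(first, last):
            # Assumption: overlapping intervals keep the highest probability.
            scores[frame] = max(scores[frame], prob)
    return scores


if __name__ == "__main__":
    demo = intervals_to_frame_scores([(1.0, 2.0, 0.8), (1.5, 3.0, 0.3)],
                                     num_frames=8, fps=2)
    print(demo)
```

Frame-level scores of this kind are what temporal localization metrics usually consume, which is presumably why a text-producing model needs such a step at all.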

[89] Plug-and-Play Logit Fusion for Heterogeneous Pathology Foundation Models

Gexin Huang,Anqi Li,Yusheng Tan,Beidi Zhao,Gang Wang,Gaozu Hua,Xiaoxiao Li

Main category: cs.CV

TL;DR: 本文提出了一种轻量级、无需重训练的模型融合策略LogitProd，通过在logit层对多个独立训练的病理基础模型进行样本自适应加权融合，在22个基准任务上显著提升性能，平均提升约3%，且训练成本仅为特征融合方法的1/12。

Details

Motivation: The rapid growth in the number of pathology foundation models makes model selection difficult: no single model is best on every downstream task, and adapting and validating each candidate one by one is too expensive. Method: Proposes the LogitProd fusion strategy, which treats multiple pre-trained pathology foundation models as fixed experts and learns sample-adaptive weights for a weighted-product fusion of their slide-level logit outputs only; it requires neither encoder retraining nor feature alignment across heterogeneous backbones, and comes with a theoretical result that the optimal fusion performs no worse than the best single expert. Result: Across 22 tasks covering WSI classification, tile classification, gene mutation prediction, and discrete-time survival modeling, LogitProd ranks first on 20, improves average performance by about 3% over the strongest single model, and incurs about one-twelfth the training cost of feature-fusion methods. Conclusion: LogitProd is an efficient, plug-and-play multi-model fusion scheme that obtains multi-expert gains at very low cost and eases the selection and deployment bottleneck for pathology foundation models. Abstract: Pathology foundation models (FMs) have become central to computational histopathology, offering strong transfer performance across a wide range of diagnostic and prognostic tasks. The rapid proliferation of pathology foundation models creates a model-selection bottleneck: no single model is uniformly best, yet exhaustively adapting and validating many candidates for each downstream endpoint is prohibitively expensive. We address this challenge with a lightweight and novel model fusion strategy, LogitProd, which treats independently trained FM-based predictors as fixed experts and learns sample-adaptive fusion weights over their slide-level outputs. The fusion operates purely on logits, requiring no encoder retraining and no feature-space alignment across heterogeneous backbones. We further provide a theoretical analysis showing that the optimal weighted product fusion is guaranteed to perform at least as well as the best individual expert under the training objective. We systematically evaluate LogitProd on 22 benchmarks spanning WSI-level classification, tile-level classification, gene mutation prediction, and discrete-time survival modeling. LogitProd ranks first on 20/22 tasks and improves the average performance across all tasks by ~3% over the strongest single expert. LogitProd enables practitioners to upgrade heterogeneous FM-based pipelines in a plug-and-play manner, achieving multi-expert gains with ~12× lower training cost than feature-fusion alternatives.
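
To make the weighted-product idea in the LogitProd entry concrete, here is a minimal Python sketch of a generic weighted product of experts computed in log space, p(y) ∝ ∏ₖ pₖ(y)^{wₖ}. It is a sketch under stated assumptions: the weights are passed in by hand, whereas the paper learns sample-adaptive weights, and the function names are hypothetical rather than taken from the authors' code.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def weighted_product_fusion(expert_logits, weights):
    """Fuse experts as p(y) proportional to prod_k p_k(y) ** w_k.

    expert_logits: one list of class logits per expert.
    weights: one non-negative weight per expert (assumed given here;
             learning them per sample is outside this sketch).
    Returns the fused class probabilities.
    """
    log_probs = [log_softmax(logits) for logits in expert_logits]
    num_classes = len(expert_logits[0])
    fused = [sum(w * lp[c] for w, lp in zip(weights, log_probs))
             for c in range(num_classes)]
    return [math.exp(v) for v in log_softmax(fused)]


if __name__ == "__main__":
    experts = [[2.0, 0.0], [0.0, 1.0]]
    print(weighted_product_fusion(experts, [0.5, 0.5]))
    # A one-hot weight vector reproduces the selected expert exactly.
    print(weighted_product_fusion(experts, [1.0, 0.0]))
```

Because a one-hot weight vector recovers any single expert, the family of weighted products contains every individual expert, which is consistent with the abstract's claim that the optimal fusion is at least as good as the best expert under the training objective.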

Chanhyuk Choi,Taesoo Kim,Donggyu Lee,Siyeol Jung,Taehwan Kim

Main category: cs.CV

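TL;DR: This paper proposes C-MET, a cross-modal emotion transfer method that models emotion semantic vectors between speech and visual feature spaces to generate more accurate and expressive talking face videos, with notable gains on unseen extended emotions such as sarcasm.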

Details

Motivation: Existing methods are constrained by discrete labels, speech-language entanglement, or reliance on high-quality reference images, making it hard to edit and generate talking face videos with rich emotions (especially extended emotions) flexibly and accurately. Method: Proposes Cross-Modal Emotion Transfer (C-MET), which uses a large-scale pretrained audio encoder and a disentangled facial expression encoder to learn, in a cross-modal space, semantic vectors that represent emotion differences, enabling speech-driven generation of emotional facial expressions. Result: On the MEAD and CREMA-D datasets, emotion accuracy improves by 14% over the state of the art, and the method generates realistic talking face videos for unseen extended emotions such as sarcasm. Conclusion: By modeling cross-modal emotion semantics, C-MET disentangles emotion from linguistic content in speech and improves the accuracy, flexibility, and generalization of emotion editing in talking face videos. Abstract: Talking face generation has gained significant attention as a core application of generative models. To enhance the expressiveness and realism of synthesized videos, emotion editing in talking face video plays a crucial role. However, existing approaches often limit expressive flexibility and struggle to generate extended emotions. Label-based methods represent emotions with discrete categories, which fail to capture a wide range of emotions. Audio-based methods can leverage emotionally rich speech signals - and even benefit from expressive text-to-speech (TTS) synthesis - but they fail to express the target emotions because emotions and linguistic contents are entangled in emotional speech. Image-based methods, on the other hand, rely on target reference images to guide emotion transfer, yet they require high-quality frontal views and face challenges in acquiring reference data for extended emotions (e.g., sarcasm). To address these limitations, we propose Cross-Modal Emotion Transfer (C-MET), a novel approach that generates facial expressions from speech by modeling emotion semantic vectors between speech and visual feature spaces. C-MET leverages a large-scale pretrained audio encoder and a disentangled facial expression encoder to learn emotion semantic vectors that represent the difference between two different emotional embeddings across modalities. Extensive experiments on the MEAD and CREMA-D datasets demonstrate that our method improves emotion accuracy by 14% over state-of-the-art methods, while generating expressive talking face videos - even for unseen extended emotions. Code, checkpoint, and demo are available at https://chanhyeok-choi.github.io/C-MET/

[91] Image-Guided Geometric Stylization of 3D Meshes

Changwoon Choi,Hyunsoo Lee,Clément Jambon,Yael Vinker,Young Min Kim

Main category: cs.CV

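TL;DR: This paper proposes GeoStyle, a geometric stylization framework that uses a pre-trained diffusion model to extract image style and transfers it to a 3D mesh through a coarse-to-fine deformation pipeline, supporting large geometric deformations while preserving topology and semantic structure.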

Details

Motivation: Existing generative models rarely support bold geometric deformations beyond the data distribution and offer limited support for implicit control signals such as contextual descriptions; style is inherently ambiguous, so geometric style features must be extracted from images and transferred effectively. Method: Proposes extracting an abstract style representation with a pre-trained diffusion model; builds a coarse-to-fine 3D mesh deformation pipeline; and designs an approximate VAE encoder that obtains gradients from mesh renderings efficiently and stably. Result: Experiments show that the method produces stylized 3D meshes reflecting distinctive geometric features of the image, such as expressive poses and silhouettes, supporting artistic 3D creation. Conclusion: GeoStyle achieves image-driven, controllable 3D geometric stylization that preserves the original mesh topology and part semantics while expanding the creative expressiveness of 3D content. Abstract: Recent generative models can create visually plausible 3D representations of objects. However, the generation process often allows for implicit control signals, such as contextual descriptions, and rarely supports bold geometric distortions beyond existing data distributions. We propose a geometric stylization framework that deforms a 3D mesh, allowing it to express the style of an image. While style is inherently ambiguous, we utilize pre-trained diffusion models to extract an abstract representation of the provided image. Our coarse-to-fine stylization pipeline can drastically deform the input 3D model to express a diverse range of geometric variations while retaining the valid topology of the original mesh and part-level semantics. We also propose an approximate VAE encoder that provides efficient and reliable gradients from mesh renderings. Extensive experiments demonstrate that our method can create stylized 3D meshes that reflect unique geometric features of the pictured assets, such as expressive poses and silhouettes, thereby supporting the creation of distinctive artistic 3D creations. Project page: https://changwoonchoi.github.io/GeoStyle

[92] Latent Anomaly Knowledge Excavation: Unveiling Sparse Sensitive Neurons in Vision-Language Models

Shaotian Li,Shangze Li,Chuancheng Shi,Wenhua Wu,Yanqiu Wu,Xiaohan Yu,Fei Shen,Tat-Seng Chua

Main category: cs.CV

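TL;DR: This paper proposes LAKE, a training-free framework that excavates latent anomaly knowledge from large-scale vision-language models by identifying sparse anomaly-sensitive neurons, enabling zero-shot anomaly detection and reaching state-of-the-art performance on industrial benchmarks.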

Details

Motivation: Existing methods treat vision-language models as black-box feature extractors and rely on external adapters or memory banks to acquire anomaly knowledge; this paper challenges that assumption, arguing that anomaly knowledge is already embedded in pre-trained models but remains latent. Method: Proposes the training-free LAKE framework, which uses a small number of normal samples to identify and elicit sparse anomaly-sensitive neurons and builds a compact normality representation that combines visual structural deviations with cross-modal semantic activations. Result: Achieves state-of-the-art performance on multiple industrial anomaly detection benchmarks and provides intrinsic, neuron-level interpretability. Conclusion: Anomaly detection should be redefined as the targeted activation of latent knowledge in pre-trained models rather than the additional learning of downstream-task knowledge. Abstract: Large-scale vision-language models (VLMs) exhibit remarkable zero-shot capabilities, yet the internal mechanisms driving their anomaly detection (AD) performance remain poorly understood. Current methods predominantly treat VLMs as black-box feature extractors, assuming that anomaly-specific knowledge must be acquired through external adapters or memory banks. In this paper, we challenge this assumption by arguing that anomaly knowledge is intrinsically embedded within pre-trained models but remains latent and under-activated. We hypothesize that this knowledge is concentrated within a sparse subset of anomaly-sensitive neurons. To validate this, we propose latent anomaly knowledge excavation (LAKE), a training-free framework that identifies and elicits these critical neuronal signals using only a minimal set of normal samples. By isolating these sensitive neurons, LAKE constructs a highly compact normality representation that integrates visual structural deviations with cross-modal semantic activations. Extensive experiments on industrial AD benchmarks demonstrate that LAKE achieves state-of-the-art performance while providing intrinsic, neuron-level interpretability. Ultimately, our work advocates for a paradigm shift: redefining anomaly detection as the targeted activation of latent pre-trained knowledge rather than the acquisition of a downstream task.

[93] HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models

Qihui Zhu,Tao Zhang,Yuchen Wang,Zijian Wen,Mengjie Zhang,Shuangwu Chen,Xiaobin Tan,Jian Yang,Yang Liu,Zhenhua Dong,Xianzhi Yu,Yinfei Pan

Main category: cs.CV

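TL;DR: HAWK is a training-free, head importance-aware visual token pruning method that assesses how important each attention head is for visual tasks, retaining key visual tokens and removing redundant ones; it greatly reduces the number of visual tokens while maintaining high accuracy and lowering inference overhead.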

Details

Motivation: Existing visual token pruning methods assume that all attention heads contribute equally, but in practice different heads capture different visual semantics and play different roles, so a finer-grained pruning strategy is needed. Method: Proposes HAWK, which uses head importance weights and text-guided attention to assess visual token importance, yielding training-free, plug-and-play visual token pruning. Result: Reaches state-of-the-art accuracy on multiple vision-language benchmarks; on Qwen2.5-VL, it retains 96.0% of the original accuracy after pruning 80.2% of visual tokens, reduces end-to-end latency to 74.4% of the original, and markedly lowers GPU memory usage. Conclusion: By modeling attention-head heterogeneity, HAWK delivers efficient, general, training-free visual token pruning that makes MLLMs more practical in resource-constrained settings. Abstract: In multimodal large language models (MLLMs), the surge of visual tokens significantly increases the inference time and computational overhead, making them impractical for real-time or resource-constrained applications. Visual token pruning is a promising strategy for reducing the cost of MLLM inference by removing redundant visual tokens. Existing research usually assumes that all attention heads contribute equally to the visual interpretation. However, our study reveals that different heads may capture distinct visual semantics and inherently play distinct roles in visual processing. In light of this observation, we propose HAWK, a head importance-aware visual token pruning method that perceives the varying importance of attention heads in visual tasks to maximize the retention of crucial tokens. By leveraging head importance weights and text-guided attention to assess visual token significance, HAWK effectively retains task-relevant visual tokens while removing redundant ones. The proposed HAWK is entirely training-free and can be seamlessly applied to various MLLMs. Extensive experiments on multiple mainstream vision-language benchmarks demonstrate that HAWK achieves state-of-the-art accuracy. When applied to Qwen2.5-VL, HAWK retains 96.0% of the original accuracy after pruning 80.2% of the visual tokens. Additionally, it reduces end-to-end latency to 74.4% of the original and further decreases GPU memory usage across the tested models. The code is available at https://github.com/peppery77/HAWK.git.
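
As a rough illustration of head-weighted token scoring as described in the HAWK entry, the Python sketch below scores each visual token by a weighted sum of per-head text-to-visual attention and keeps the top fraction. How HAWK actually derives head importance and combines it with text-guided attention is not specified in the abstract, so the scoring rule, the function name `prune_visual_tokens`, and all inputs here are assumptions made for illustration.

```python
def prune_visual_tokens(text_to_visual_attn, head_weights, keep_ratio):
    """Return the indices of visual tokens to keep, in original order.

    text_to_visual_attn: one list per attention head, giving the attention
        mass each visual token receives from the text query (assumed input).
    head_weights: one importance weight per head (assumed to be given).
    keep_ratio: fraction of visual tokens to retain, e.g. 0.198 would
        correspond to pruning 80.2% of the tokens.
    """
    num_tokens = len(text_to_visual_attn[0])
    # Head-weighted score: heads with larger weights count for more.
    scores = [
        sum(w * head[t] for w, head in zip(head_weights, text_to_visual_attn))
        for t in range(num_tokens)
    ]
    num_keep = max(1, round(num_tokens * keep_ratio))
    ranked = sorted(range(num_tokens), key=lambda t: scores[t], reverse=True)
    return sorted(ranked[:num_keep])


if __name__ == "__main__":
    attn = [
        [0.1, 0.7, 0.1, 0.1, 0.0],  # head 0 focuses on token 1
        [0.5, 0.0, 0.0, 0.0, 0.5],  # head 1 focuses on tokens 0 and 4
    ]
    print(prune_visual_tokens(attn, [0.9, 0.1], keep_ratio=0.4))
    print(prune_visual_tokens(attn, [0.1, 0.9], keep_ratio=0.4))
```

The two calls differ only in the head weights yet keep different tokens, which is the behavior a uniform average over heads cannot produce.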

[94] AgriChain Visually Grounded Expert Verified Reasoning for Interpretable Agricultural Vision Language Models

Hazza Mahmood,Yongqiang Yu,Rao Anwer

Main category: cs.CV

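TL;DR: This paper introduces the AgriChain dataset and the AgriChain-VL3B model, which use expert-verified chain-of-thought (CoT) supervision to improve the accuracy and interpretability of vision-language models for crop disease diagnosis.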

Details

Motivation: Existing vision-language models struggle to combine accurate disease diagnosis with interpretability in real agricultural settings; high-quality datasets with reasoning rationales, built with expert knowledge, and specialized models are needed. Method: Builds the AgriChain dataset of about 11,000 expert-annotated leaf images, each with a disease label, a confidence level, and a chain-of-thought rationale generated by GPT-4o and verified by an agricultural engineer; fine-tunes Qwen2.5-VL-3B on it to obtain AgriChain-VL3B, which jointly produces disease predictions and visually grounded reasoning. Result: On a 1,000-image test set, AgriChain-VL3B reaches 73.1% top-1 accuracy (macro F1 = 0.466, weighted F1 = 0.655), outperforming strong baselines including Gemini 1.5 Flash, Gemini 2.5 Pro, and GPT-4o Mini; its explanations align closely with expert reasoning and consistently cite key visual cues. Conclusion: Expert-verified reasoning supervision significantly improves both performance and interpretability, narrowing the gap between generic multimodal models and human agricultural experts and advancing trustworthy, globally deployable agricultural AI. Abstract: Accurate and interpretable plant disease diagnosis remains a major challenge for vision-language models (VLMs) in real-world agriculture. We introduce AgriChain, a dataset of approximately 11,000 expert-curated leaf images spanning diverse crops and pathologies, each paired with (i) a disease label, (ii) a calibrated confidence score (High/Medium/Low), and (iii) an expert-verified chain-of-thought (CoT) rationale. Draft explanations were first generated by GPT-4o and then verified by a professional agricultural engineer using standardized descriptors (e.g., lesion color, margin, and distribution). We fine-tune Qwen2.5-VL-3B on AgriChain, resulting in a specialized model termed AgriChain-VL3B, to jointly predict diseases and generate visually grounded reasoning. On a 1,000-image test set, our CoT-supervised model achieves 73.1% top-1 accuracy (macro F1 = 0.466; weighted F1 = 0.655), outperforming strong baselines including Gemini 1.5 Flash, Gemini 2.5 Pro, and GPT-4o Mini. The generated explanations align closely with expert reasoning, consistently referencing key visual cues. These findings demonstrate that expert-verified reasoning supervision significantly enhances both accuracy and interpretability, bridging the gap between generic multimodal models and human expertise, and advancing trustworthy, globally deployable AI for sustainable agriculture. The dataset and code are publicly available at: https://github.com/hazzanabeel12-netizen/agrichain
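
The AgriChain entry reports a macro F1 (0.466) well below its weighted F1 (0.655). A gap of that shape typically appears when rare classes are predicted poorly, because macro F1 averages per-class scores equally while weighted F1 weights them by class frequency. The Python sketch below computes both from scratch on a toy example; the class names and data are invented for illustration and are not from the dataset.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro_f1, weighted_f1) for two equal-length label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    per_class = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        per_class[c] = 2 * tp / denom if denom else 0.0
    macro = sum(per_class.values()) / len(labels)
    weighted = sum(per_class[c] * support[c] for c in labels) / len(y_true)
    return macro, weighted


if __name__ == "__main__":
    # Toy data: a common class predicted well, a rare class always missed.
    y_true = ["rust"] * 8 + ["blight"] * 2
    y_pred = ["rust"] * 10
    macro, weighted = f1_scores(y_true, y_pred)
    print(f"macro F1 = {macro:.3f}, weighted F1 = {weighted:.3f}")
```

In the toy data the missed rare class pulls macro F1 down to half of the common class's score, while weighted F1 stays closer to it.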

[95] LPM 1.0: Video-based Character Performance Model

Ailing Zeng,Casper Yang,Chauncey Ge,Eddie Zhang,Garvey Xu,Gavin Lin,Gilbert Gu,Jeremy Pi,Leo Li,Mingyi Shi,Sheng Bi,Steven Tang,Thorn Hang,Tobey Guo,Vincent Li,Xin Tong,Yikang Li,Yuchen Sun,Yue,Zhao,Yuhan Lu,Yuwei Li,Zane Zhang,Zeshi Yang,Zi Ye

Main category: cs.CV

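TL;DR: This paper presents LPM 1.0 (Large Performance Model), which targets the "performance trilemma" in video-driven character performance generation (high expressiveness, real-time inference, and long-horizon identity stability are hard to achieve together). It builds a human-centric multimodal dataset, trains a 17B-parameter Diffusion Transformer (Base LPM), and distills it into a causal streaming model (Online LPM), enabling real-time, infinite-length, identity-stable character performance generation for single-person full-duplex audio-visual conversation; it also proposes LPM-Bench, the first benchmark for interactive character performance.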

Details

Motivation: Existing video generation models struggle to deliver high expressiveness, real-time inference, and long-horizon identity stability at the same time, especially in conversation, the most complex character performance scenario; an end-to-end visual performance engine is needed to support real interactive applications such as conversational agents, live streaming characters, and game NPCs. Method: Builds a human-centric multimodal dataset through strict filtering, speaking-listening pairing, performance understanding, and identity-aware multi-reference extraction; trains a 17B-parameter multimodally conditioned Diffusion Transformer (Base LPM); distills it into a causal streaming model (Online LPM) that supports low-latency, infinite-length generation; and accepts a character image, identity references, and user audio or text prompts as input to generate listening and speaking videos in real time. Result: LPM 1.0 reaches state-of-the-art results on all dimensions of LPM-Bench and supports real-time, identity-stable, infinite-length full-duplex audio-visual generation; Online LPM enables low-latency streaming interaction; and LPM-Bench is introduced as the first benchmark for interactive character performance. Conclusion: LPM 1.0 is presented as the first systematic solution to the character performance trilemma, providing a scalable, practical visual generation foundation model for embodied, interactive digital characters. Abstract: Performance, the externalization of intent, emotion, and personality through visual, vocal, and temporal behavior, is what makes a character alive. Learning such performance from video is a promising alternative to traditional 3D pipelines. However, existing video models struggle to jointly achieve high expressiveness, real-time inference, and long-horizon identity stability, a tension we call the performance trilemma. Conversation is the most comprehensive performance scenario, as characters simultaneously speak, listen, react, and emote while maintaining identity over time. To address this, we present LPM 1.0 (Large Performance Model), focusing on single-person full-duplex audio-visual conversational performance. Concretely, we build a multimodal human-centric dataset through strict filtering, speaking-listening audio-video pairing, performance understanding, and identity-aware multi-reference extraction; train a 17B-parameter Diffusion Transformer (Base LPM) for highly controllable, identity-consistent performance through multimodal conditioning; and distill it into a causal streaming generator (Online LPM) for low-latency, infinite-length interaction. At inference, given a character image with identity-aware references, LPM 1.0 generates listening videos from user audio and speaking videos from synthesized audio, with text prompts for motion control, all at real-time speed with identity-stable, infinite-length generation. LPM 1.0 thus serves as a visual engine for conversational agents, live streaming characters, and game NPCs. To systematically evaluate this setting, we propose LPM-Bench, the first benchmark for interactive character performance. LPM 1.0 achieves state-of-the-art results across all evaluated dimensions while maintaining real-time inference.

[96] FlowGuard: Towards Lightweight In-Generation Safety Detection for Diffusion Models via Linear Latent Decoding

Jinghan Yang,Yihe Fan,Xudong Pan,Min Yang

Main category: cs.CV

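TL;DR: This paper proposes FlowGuard, a cross-model framework that detects NSFW content in real time during diffusion model generation; using a linear approximation of latent decoding and curriculum learning, it identifies NSFW content efficiently and accurately at intermediate denoising steps, greatly reducing compute and memory overhead.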

Details

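Motivation: Existing NSFW detection methods have clear shortcomings: pre-generation methods based on prompts cannot guarantee that a safe prompt yields a safe image, and post-generation methods based on final images do not work on noisy intermediate images; in latent diffusion models, early-stage noise also heavily obscures visual signals, so a method that detects reliably during generation is needed. Method: Proposes the FlowGuard framework, whose core components are: (1) a new linear approximation of latent decoding for latent diffusion models, which recovers discriminative visual features from noisy intermediate latents; (2) a curriculum learning strategy that trains the detector in stages to gradually increase robustness to noise; and (3) applying the detector at multiple intermediate denoising steps for real-time in-generation decisions. Result: Validated on a cross-model benchmark spanning nine diffusion backbones: F1 score improves by more than 30% over existing methods; peak GPU memory drops by more than 97%; latent projection time falls from 8.1 seconds to 0.2 seconds; and both in-distribution and out-of-distribution settings are supported. Conclusion: FlowGuard is presented as the first robust and efficient in-generation NSFW detector for diffusion models, balancing safety with generation efficiency and offering a new paradigm for safe, controllable AIGC deployment. Abstract: Diffusion-based image generation models have advanced rapidly but pose a safety risk due to their potential to generate Not-Safe-For-Work (NSFW) content. Existing NSFW detection methods mainly operate either before or after image generation. Pre-generation methods rely on text prompts and struggle with the gap between prompt safety and image safety. Post-generation methods apply classifiers to final outputs, but they are poorly suited to intermediate noisy images. To address this, we introduce FlowGuard, a cross-model in-generation detection framework that inspects intermediate denoising steps. This is particularly challenging in latent diffusion, where early-stage noise obscures visual signals. FlowGuard employs a novel linear approximation for latent decoding and leverages a curriculum learning approach to stabilize training. By detecting unsafe content early, FlowGuard reduces unnecessary diffusion steps to cut computational costs. Our cross-model benchmark spanning nine diffusion-based backbones shows the effectiveness of FlowGuard for in-generation NSFW detection in both in-distribution and out-of-distribution settings, outperforming existing methods by over 30% in F1 score while delivering transformative efficiency gains, including slashing peak GPU memory demand by over 97% and projection time from 8.1 seconds to 0.2 seconds compared to standard VAE decoding.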

[97] ReconPhys: Reconstruct Appearance and Physical Attributes from Single Video

Boyuan Wang,Xiaofeng Wang,Yongkang Li,Zheng Zhu,Yifan Chang,Angen Ye,Guosheng Zhao,Chaojun Ni,Guan Huang,Yijie Ren,Yueqi Duan,Xingang Wang

Main category: cs.CV

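TL;DR: ReconPhys is a feedforward framework that works from monocular video without manual annotation and simultaneously estimates physical attributes and reconstructs a 3D Gaussian Splatting representation, combining physical plausibility with efficiency.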

Details

Motivation: Existing non-rigid object reconstruction methods rely on differentiable rendering for per-scene optimization and require extensive tuning or manual annotation, which limits practicality and generalization. Method: Proposes ReconPhys, which uses a dual-branch architecture trained with self-supervision to jointly learn physical attribute estimation and 3D Gaussian Splatting reconstruction from monocular video input. Result: On a synthetic dataset, future-frame prediction PSNR reaches 21.64 (versus 13.27 for state-of-the-art baselines), Chamfer Distance drops from 0.349 to 0.004, and inference takes under 1 second (previously hours). Conclusion: ReconPhys is presented as the first method to achieve physics-aware, real-time, simulation-ready non-rigid reconstruction from monocular video, substantially improving efficiency and usability. Abstract: Reconstructing non-rigid objects with physical plausibility remains a significant challenge. Existing approaches leverage differentiable rendering for per-scene optimization, recovering geometry and dynamics but requiring expensive tuning or manual annotation, which limits practicality and generalizability. To address this, we propose ReconPhys, the first feedforward framework that jointly learns physical attribute estimation and 3D Gaussian Splatting reconstruction from a single monocular video. Our method employs a dual-branch architecture trained via a self-supervised strategy, eliminating the need for ground-truth physics labels. Given a video sequence, ReconPhys simultaneously infers geometry, appearance, and physical attributes. Experiments on a large-scale synthetic dataset demonstrate superior performance: our method achieves 21.64 PSNR in future prediction compared to 13.27 by state-of-the-art optimization baselines, while reducing Chamfer Distance from 0.349 to 0.004. Crucially, ReconPhys enables fast inference (<1 second) versus hours required by existing methods, facilitating rapid generation of simulation-ready assets for robotics and graphics.
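
The two metrics quoted in the ReconPhys entry, PSNR and Chamfer Distance, have standard textbook forms; the Python sketch below shows them for readers who want to sanity-check the reported numbers. The paper does not state which Chamfer convention it uses (squared or unsquared distances, sum or mean), so the variant here, the mean squared nearest-neighbor distance summed over both directions, is an assumption, as are the function names.

```python
import math


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for two flat lists of pixel values."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / mse)


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two point sets.

    Assumed convention: mean squared nearest-neighbor distance in each
    direction, summed over the two directions. Brute force, O(N * M).
    """
    def one_way(src, dst):
        return sum(min(math.dist(p, q) ** 2 for q in dst) for p in src) / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)


if __name__ == "__main__":
    print(psnr([0.0, 0.0], [0.1, 0.1]))
    print(chamfer_distance([(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)]))
```

Since PSNR is logarithmic, the reported gap between 21.64 and 13.27 dB corresponds to a mean squared error roughly 6.9 times smaller (10^(8.37/10)), assuming both values use the same peak value.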

[98] Reinforcement-Guided Synthetic Data Generation for Privacy-Sensitive Identity Recognition

Xuemei Jia,Jiawei Du,Hui Wei,Jun Chen,Joey Tianyi Zhou,Zheng Wang

Main category: cs.CV

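TL;DR: This paper proposes a reinforcement-guided synthetic data generation framework that improves generative model performance for identity recognition in data-limited, privacy-sensitive scenarios.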

Details

Motivation: In privacy-sensitive scenarios, regulatory and copyright restrictions limit data access, which hampers generative model development and creates a vicious cycle: scarce data leads to poor models, which in turn leave the scarcity unresolved. Method: A reinforcement-guided synthetic data generation framework: (1) cold-start adaptation of a pretrained generator to align it with the target domain; (2) a multi-objective reward (semantic consistency, coverage diversity, expression richness) to optimize generation quality; (3) a dynamic sample selection mechanism during downstream training to raise data utility. Result: On benchmark datasets, the framework significantly improves generation fidelity and classification accuracy and generalizes strongly to novel categories in small-data regimes. Conclusion: The framework breaks the cycle between data scarcity and weak generative models, offering a new paradigm for high-fidelity, task-effective synthetic data generation in privacy-sensitive tasks. Abstract: High-fidelity generative models are increasingly needed in privacy-sensitive scenarios, where access to data is severely restricted due to regulatory and copyright constraints. This scarcity hampers model development--ironically, in settings where generative models are most needed to compensate for the lack of data. This creates a self-reinforcing challenge: limited data leads to poor generative models, which in turn fail to mitigate data scarcity. To break this cycle, we propose a reinforcement-guided synthetic data generation framework that adapts general-domain generative priors to privacy-sensitive identity recognition tasks. We first perform a cold-start adaptation to align a pretrained generator with the target domain, establishing semantic relevance and initial fidelity. Building on this foundation, we introduce a multi-objective reward that jointly optimizes semantic consistency, coverage diversity, and expression richness, guiding the generator to produce both realistic and task-effective samples. During downstream training, a dynamic sample selection mechanism further prioritizes high-utility synthetic samples, enabling adaptive data scaling and improved domain alignment. Extensive experiments on benchmark datasets demonstrate that our framework significantly improves both generation fidelity and classification accuracy, while also exhibiting strong generalization to novel categories in small-data regimes.

[99] Sampling-Aware 3D Spatial Analysis in Multiplexed Imaging

Ido Harlev,Tamar Oukhanov,Raz Ben-Uri,Leeat Keren,Shai Bagon

Main category: cs.CV

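TL;DR: This paper studies how sampling geometry affects the stability of spatial statistics and proposes a geometry-aware reconstruction module that supports stable, consistent 3D spatial analysis from sparse serial sections.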

Details

Motivation: Highly multiplexed microscopy can characterize tissue spatially at single-cell resolution, yet most analyses remain confined to 2D sections, and acquiring dense 3D spatial proteomics data is costly and technically difficult, so practitioners must often trade off 2D against sparse 3D. Method: Controlled simulations analyze how different sampling geometries affect the stability of spatial statistics (e.g., cell clustering, cell-cell interactions); a 3D single-cell reconstruction method combines phenotype constraints, proximity constraints, and cell-type-specific shape priors; the trade-off between section spacing, coverage, and redundancy is analyzed systematically. Result: Planar sampling reliably estimates global cell abundance, but local statistics (e.g., cell interactions, neighborhood relationships) show high variance, especially for rare or spatially localized populations; the reconstruction module is validated on a public IMC dataset and enables reliable structure-level 3D analysis on an in-house CODEX dataset. Conclusion: The work provides diagnostic tools and practical guidance for deciding when 2D sampling suffices and when sparse 3D reconstruction is needed, improving spatial analysis quality under limited imaging budgets. Abstract: Highly multiplexed microscopy enables rich spatial characterization of tissues at single-cell resolution, yet most analyses rely on two-dimensional sections despite inherently three-dimensional tissue organization. Acquiring dense volumetric data in spatial proteomics remains costly and technically challenging, leaving practitioners to choose between 2D sections or 3D serial sections under limited imaging budgets. In this work, we study how sampling geometry impacts the stability of commonly used spatial statistics, and we introduce a geometry-aware reconstruction module that enables sparse yet consistent 3D analysis from serial sections. Using controlled simulations, we show that planar sampling reliably recovers global cell-type abundance but exhibits high variance for local statistics such as cell clustering and cell-cell interactions, particularly for rare or spatially localized populations. We observe consistent behavior in real multiplexed datasets, where interaction metrics and neighborhood relationships fluctuate substantially across individual sections. To support sparse 3D analysis in practice, we present a reconstruction approach that links cell projections across adjacent sections using phenotype and proximity constraints and recovers single-cell 3D centroids using cell-type-specific shape priors. We further analyze the trade-off between section spacing, coverage, and redundancy, identifying acquisition regimes that maximize reconstruction utility under fixed imaging budgets.
We validate the reconstruction module on a public imaging mass cytometry dataset with dense axial sampling and demonstrate its downstream utility on an in-house CODEX dataset by enabling structure-level 3D analyses that are unreliable in 2D. Together, our results provide diagnostic tools and practical guidance for deciding when 2D sampling suffices and when sparse 3D reconstruction is warranted.

[100] AnomalyAgent: Agentic Industrial Anomaly Synthesis via Tool-Augmented Reinforcement Learning

Jiaming Su,Tengchao Yang,Ruikang Zhang,Zhengan Yan,Haoyu Sun,Linfeng Zhang

Main category: cs.CV

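TL;DR: This paper proposes AnomalyAgent, an industrial anomaly synthesis agent with self-reflection, knowledge retrieval, and iterative refinement; through multi-tool coordination and a two-stage training framework, it significantly improves anomaly generation quality and downstream detection performance on MVTec-AD.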

Details

Motivation: Most existing anomaly synthesis methods generate in a single step and lack complex reasoning and iterative optimization, so they struggle to produce anomalies with high semantic realism, which limits how far they can relieve data scarcity in industrial anomaly detection. Method: AnomalyAgent integrates five tools (prompt generation, image generation, quality evaluation, knowledge retrieval, and mask generation) into a closed optimization loop; it is trained in two stages (supervised fine-tuning, then reinforcement learning) on structured trajectories built from real anomaly images, with a three-part reward covering task, reflection, and behavior. Result: On MVTec-AD, anomaly generation reaches IS/IC-L of 2.10/0.33; ResNet34 classification accuracy reaches 57.0%; a UNet reaches 99.3%/74.2% AP at the image/pixel level, surpassing all zero-shot SOTA methods. Conclusion: By introducing the agent paradigm and closed-loop optimization, AnomalyAgent improves the realism, diversity, and practicality of industrial anomaly synthesis and offers a new paradigm for anomaly detection under data scarcity. Abstract: Industrial anomaly generation is a crucial method for alleviating the data scarcity problem in anomaly detection tasks. Most existing anomaly synthesis methods rely on single-step generation mechanisms, lacking complex reasoning and iterative optimization capabilities, making it difficult to generate anomaly samples with high semantic realism. We propose AnomalyAgent, an anomaly synthesis agent with self-reflection, knowledge retrieval, and iterative refinement capabilities, aiming to generate realistic and diverse anomalies. Specifically, AnomalyAgent is equipped with five tools: Prompt Generation (PG), Image Generation (IG), Quality Evaluation (QE), Knowledge Retrieval (KR), and Mask Generation (MG), enabling closed-loop optimization. To improve decision-making and self-reflection, we construct structured trajectories from real anomaly images and design a two-stage training framework: supervised fine-tuning followed by reinforcement learning. This process is driven by a three-part reward mechanism: (1) task rewards to supervise the quality and location rationality of generated anomalies; (2) reflection rewards to train the model's ability to improve anomaly synthesis prompts; (3) behavioral rewards to ensure adherence to the trajectory. On the MVTec-AD dataset, AnomalyAgent achieves IS/IC-L of 2.10/0.33 for anomaly generation, 57.0% classification accuracy using ResNet34, and 99.3%/74.2% AP at the image/pixel level using a simple UNet, surpassing all zero-shot SOTA methods. The code and data will be made publicly available.

[101] PanoSAM2: Lightweight Distortion- and Memory-aware Adaptions of SAM2 for 360 Video Object Segmentation

Dingwen Xiao,Weiming Zhang,Shiqi Wen,Lin Wang

Main category: cs.CV

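TL;DR: This paper proposes PanoSAM2, a new SAM2-based framework for 360 video object segmentation (360VOS). A distortion-aware decoder, a distortion-guided mask loss, and a long-short memory module mitigate spherical projection distortion, left-right semantic inconsistency, and sparse object information in memory, yielding gains of 5.6 points on 360VOTS and 6.7 points on PanoVOS.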

Details

Motivation: 360 video object segmentation faces scarce high-quality annotated data, spherical projection distortion, semantic inconsistency across the left-right boundary, and sparse object mask information in SAM2's memory, so applying SAM2 directly performs poorly. Method: PanoSAM2 comprises: (1) a Pano-Aware Decoder with seam-consistent receptive fields and iterative distortion refinement to keep continuity across the 0/360 degree boundary; (2) a Distortion-Guided Mask Loss that weights per-pixel loss by distortion magnitude; (3) a Long-Short Memory Module that uses a compact long-term object pointer to re-instantiate and align short-term memories, strengthening temporal coherence. Result: Compared with SAM2, gains are +5.6 on 360VOTS and +6.7 on PanoVOS, validating the method. Conclusion: With lightweight distortion- and memory-aware adaptations, PanoSAM2 keeps SAM2's interactive prompting while markedly improving accuracy and temporal coherence in 360 video object segmentation, providing a reliable basis for VR/AR and embodied AI. Abstract: 360 video object segmentation (360VOS) aims to predict temporally-consistent masks in 360 videos, offering full-scene coverage and benefiting applications such as VR/AR and embodied AI. Learning a 360VOS model is nontrivial due to the lack of high-quality labeled datasets. Recently, Segment Anything Models (SAMs), especially SAM2 -- with its design of memory module -- have shown strong, promptable VOS capability. However, directly using SAM2 for 360VOS yields implausible results as 360 videos suffer from the projection distortion, semantic inconsistency of left-right sides, and sparse object mask information in SAM2's memory. To this end, we propose PanoSAM2, a novel 360VOS framework based on our lightweight distortion- and memory-aware adaptation strategies of SAM2 to achieve reliable 360VOS while retaining SAM2's user-friendly prompting design. Concretely, to tackle the projection distortion and semantic inconsistency issues, we propose a Pano-Aware Decoder with seam-consistent receptive fields and iterative distortion refinement to maintain continuity across the 0/360 degree boundary. Meanwhile, a Distortion-Guided Mask Loss is introduced to weight pixels by distortion magnitude, stressing stretched regions and boundaries. To address the object sparsity issue, we propose a Long-Short Memory Module to maintain a compact long-term object pointer to re-instantiate and align short-term memories, thereby enhancing temporal coherence. Extensive experiments show that PanoSAM2 yields substantial gains over SAM2: +5.6 on 360VOTS and +6.7 on PanoVOS, showing the effectiveness of our method.
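The idea of weighting a mask loss by distortion magnitude can be illustrated with a small sketch. The abstract does not give the actual weighting, so everything below is an assumption: frames are taken to be equirectangular, the distortion of a row is approximated by the horizontal stretch factor 1/cos(latitude), the weight is capped by a hypothetical `cap` parameter, and only the "stretched regions" part (not the boundary emphasis) is modeled.

```python
import math


def row_distortion_weights(height, cap=4.0):
    """Per-row weight for an equirectangular frame (assumed projection).

    Row v maps to a latitude in (-pi/2, pi/2); horizontal stretch grows as
    1 / cos(latitude) toward the poles. `cap` is a hypothetical clamp.
    """
    weights = []
    for v in range(height):
        lat = (0.5 - (v + 0.5) / height) * math.pi
        weights.append(min(1.0 / math.cos(lat), cap))
    return weights


def distortion_weighted_bce(pred, target, cap=4.0, eps=1e-7):
    """Binary cross-entropy over a 2D mask, weighted per row by distortion.

    pred: 2D list of foreground probabilities; target: 2D list of 0/1 labels.
    The result is normalized by the total weight.
    """
    w_rows = row_distortion_weights(len(pred), cap)
    total, norm = 0.0, 0.0
    for v, (pred_row, target_row) in enumerate(zip(pred, target)):
        for p, t in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            bce = -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
            total += w_rows[v] * bce
            norm += w_rows[v]
    return total / norm
```

Under this weighting, the same prediction error costs more in a row near the poles than in a row near the equator, which matches the stated aim of stressing stretched regions.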

[102] ParkSense: Where Should a Delivery Driver Park? Leveraging Idle AV Compute and Vision-Language Models

Die Hu,Henan Li

Main category: cs.CV

TL;DR: ParkSense is a framework that uses idle compute time in autonomous vehicles to run a Vision-Language Model on cached imagery, enabling precise parking spot selection near merchant entrances to reduce food delivery time.

Details

Motivation: Finding parking consumes a disproportionate share of food delivery time, and no existing system addresses precise parking-spot selection relative to merchant entrances. Method: ParkSense repurposes idle compute during low-risk AV states (e.g., queuing at red lights, traffic congestion, parking-lot crawl) to run a quantized 7B Vision-Language Model on pre-cached satellite and street view imagery, identifying merchant entrances and legal parking zones. Result: The framework solves the Delivery-Aware Precision Parking (DAPP) problem; the quantized 7B VLM completes inference in 4–8 seconds on HW4-class hardware; estimated annual per-driver income gains are $3,000–$8,000 in the U.S. Conclusion: ParkSense demonstrates feasibility and economic impact of leveraging idle AV compute for delivery-optimized parking, opening five key research directions at the intersection of autonomous driving, computer vision, and last-mile logistics. Abstract: Finding parking consumes a disproportionate share of food delivery time, yet no system addresses precise parking-spot selection relative to merchant entrances. We propose ParkSense, a framework that repurposes idle compute during low-risk AV states -- queuing at red lights, traffic congestion, parking-lot crawl -- to run a Vision-Language Model (VLM) on pre-cached satellite and street view imagery, identifying entrances and legal parking zones. We formalize the Delivery-Aware Precision Parking (DAPP) problem, show that a quantized 7B VLM completes inference in 4-8 seconds on HW4-class hardware, and estimate annual per-driver income gains of 3,000-8,000 USD in the U.S. Five open research directions are identified at this unexplored intersection of autonomous driving, computer vision, and last-mile logistics.

[103] Mitigating Entangled Steering in Large Vision-Language Models for Hallucination Reduction

Yuanhong Zhang,Zhaoyang Wang,Xin Zhang,Weizhan Zhang,Joey Tianyi Zhou

Main category: cs.CV

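TL;DR: This paper proposes MESA, a framework that mitigates hallucination in large vision-language models (LVLMs) through selective latent-space intervention while leaving the original generation behavior unchanged.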

Details

Motivation: Existing hallucination mitigation methods often alter generation behavior (e.g., shorter outputs, shifted token distributions), because the intervention signal is too tightly coupled with the generation process. Method: MESA is a plug-and-play latent intervention framework that intervenes only on hallucination-relevant responses and preserves the model's original token distribution. Result: Across diverse generative and discriminative benchmarks, MESA reduces hallucinations while better preserving generation behavior, outperforming prior methods across multiple LVLM families. Conclusion: MESA achieves controlled, selective suppression of LVLM hallucinations, balancing effectiveness with generation fidelity and offering a new paradigm for latent-space intervention. Abstract: Large Vision-Language Models (LVLMs) have achieved remarkable success across cross-modal tasks but remain hindered by hallucinations, producing textual outputs inconsistent with visual content. Existing methods mitigate hallucinations but often alter generation behavior, resulting in shorter outputs and shifted token distributions, especially in latent space steering approaches. We identify that this issue stems from entangled steering signals, where suppressing hallucinations inadvertently disrupts the model's intrinsic generation behavior. To address this, we propose MESA, an effective plug-and-play framework that performs controlled and selective latent intervention for hallucination mitigation. Specifically, MESA targets hallucination-relevant responses while preserving the model's original token distribution, enabling effective hallucination reduction without compromising generation behavior. Extensive experiments across diverse generative and discriminative benchmarks demonstrate that MESA consistently reduces hallucinations while better preserving generation behavior, outperforming prior methods across multiple LVLM families.

[104] Tarot-SAM3: Training-free SAM3 for Any Referring Expression Segmentation

Weiming Zhang,Dingwen Xiao,Songyue Guo,Guangyu Xiang,Shiqi Wen,Minwei Zhao,Lei Chen,Lin Wang

Main category: cs.CV

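TL;DR: This paper proposes Tarot-SAM3, a training-free framework with two phases, an Expression Reasoning Interpreter (ERI) and Mask Self-Refining (MSR), that improves the robustness and generalization of SAM3 on referring expression segmentation (RES) for arbitrary expressions.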

Details

Motivation: Existing referring expression segmentation (RES) methods depend on large annotated datasets and cannot handle explicit and implicit expressions at once. SAM3 performs well on promptable concept segmentation but poorly on long or implicit expressions, and coupling it directly with an MLLM makes results overly dependent on the MLLM's reasoning, with no mechanism to refine the segmentation output. Method: Tarot-SAM3 has two phases. The Expression Reasoning Interpreter (ERI) uses reasoning-assisted prompt options for structured parsing and evaluation-aware rephrasing, producing robust heterogeneous prompts that drive SAM3. Mask Self-Refining (MSR) uses DINOv3 feature relationships to select the best mask among those produced by different prompts, then corrects over- and under-segmentation through discriminative-region comparison and affiliation inference. Result: Tarot-SAM3 performs strongly on explicit and implicit RES benchmarks and in open-world scenarios; ablations validate both ERI and MSR. Conclusion: Tarot-SAM3 is a training-free, general RES paradigm that markedly improves segmentation robustness and accuracy for arbitrary natural-language referring expressions. Abstract: Referring Expression Segmentation (RES) aims to segment image regions described by natural-language expressions, serving as a bridge between vision and language understanding. Existing RES methods, however, rely heavily on large annotated datasets and are limited to either explicit or implicit expressions, hindering their ability to generalize to any referring expression. Recently, the Segment Anything Model 3 (SAM3) has shown impressive robustness in Promptable Concept Segmentation. Nonetheless, applying it to RES remains challenging: (1) SAM3 struggles with longer or implicit expressions; (2) naive coupling of SAM3 with a multimodal large language model (MLLM) makes the final results overly dependent on the MLLM's reasoning capability, without enabling refinement of SAM3's segmentation outputs. To this end, we present Tarot-SAM3, a novel training-free framework that can accurately segment from any referring expression. Specifically, Tarot-SAM3 consists of two key phases. First, the Expression Reasoning Interpreter (ERI) phase introduces reasoning-assisted prompt options to support structured expression parsing and evaluation-aware rephrasing. This transforms arbitrary queries into robust heterogeneous prompts for generating reliable masks with SAM3. Second, the Mask Self-Refining (MSR) phase selects the best mask across prompt types and performs self-refinement by leveraging rich feature relationships from DINOv3 to compare discriminative regions among ERI outputs.
It then infers region affiliation to the target, thereby correcting over- and under-segmentation. Extensive experiments demonstrate that Tarot-SAM3 achieves strong performance on both explicit and implicit RES benchmarks, as well as open-world scenarios. Ablation studies further validate the effectiveness of each phase.

[105] Stitch4D: Sparse Multi-Location 4D Urban Reconstruction via Spatio-Temporal Interpolation

Hina Kogure,Kei Katsumata,Taiki Miyanishi,Komei Sugiura

Main category: cs.CV

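TL;DR: This paper proposes Stitch4D, a framework for 4D reconstruction of dynamic urban scenes from sparse, non-overlapping multi-location views. It synthesizes bridge views to extend spatial coverage and jointly optimizes real and synthesized observations in a unified coordinate frame, markedly improving geometric consistency and dynamic smoothness.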

Details

Motivation: Existing 4D reconstruction methods rely on densely overlapping views, whereas real urban captures are mostly sparse, non-overlapping multi-location observations, which causes failed reconstruction of intermediate regions and temporal artifacts. Method: Stitch4D has two parts: (i) it synthesizes intermediate bridge views to strengthen spatial constraints; (ii) it jointly optimizes real and synthesized observations in a unified coordinate frame under explicit inter-location consistency constraints. Result: On the authors' CARLA-based benchmark U-S4D, Stitch4D outperforms representative 4D reconstruction baselines, with more coherent geometry, smoother dynamics, and better visual quality. Conclusion: Recovering intermediate spatial coverage is essential for stable 4D reconstruction in sparse urban environments, and Stitch4D offers an effective, unified solution. Abstract: Dynamic urban environments are often captured by cameras placed at spatially separated locations with little or no view overlap. However, most existing 4D reconstruction methods assume densely overlapping views. When applied to such sparse observations, these methods fail to reconstruct intermediate regions and often introduce temporal artifacts. To address this practical yet underexplored sparse multi-location setting, we propose Stitch4D, a unified 4D reconstruction framework that explicitly compensates for missing spatial coverage in sparse observations. Stitch4D (i) synthesizes intermediate bridge views to densify spatial constraints and improve spatial coverage, and (ii) jointly optimizes real and synthesized observations within a unified coordinate frame under explicit inter-location consistency constraints. By restoring intermediate coverage before optimization, Stitch4D prevents geometric collapse and reconstructs coherent geometry and smooth scene dynamics even in sparsely observed environments. To evaluate this setting, we introduce Urban Sparse 4D (U-S4D), a CARLA-based benchmark designed to assess spatiotemporal alignment under sparse multi-location configurations. Experimental results on U-S4D show that Stitch4D surpasses representative 4D reconstruction baselines and achieves superior visual quality. These results indicate that recovering intermediate spatial coverage is essential for stable 4D reconstruction in sparse urban environments.

[106] Generative 3D Gaussian Splatting for Arbitrary-Resolution Atmospheric Downscaling and Forecasting

Tao Hana,Zhibin Wen,Zhenghao Chen,Fenghua Lin,Junyu Gao,Song Guo,Lei Bai

Main category: cs.CV

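TL;DR: This paper proposes GSSA-ViT, a vision transformer framework built on 3D Gaussian splatting and a scale-aware attention mechanism, for arbitrary-resolution numerical weather prediction and flexible downscaling of atmospheric fields.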

Details

Motivation: AI-based numerical weather prediction is fast, but generating high-resolution forecasts remains computationally expensive, mainly because of limited multi-scale adaptability and inefficient data representations. Method: Latitude-longitude grid points are modeled as centers of 3D Gaussians; a generative scheme predicts 3D Gaussian parameters (covariance, attributes, opacity); a scale-aware attention module models cross-scale dependencies and supports continuous resolution adaptation. Result: The method forecasts 87 atmospheric variables at arbitrary resolutions on ERA5 and outperforms existing methods at downscaling on ERA5 and CMIP6; it is the first to combine generative 3D Gaussian modeling with scale-aware attention for unified multi-scale NWP. Conclusion: GSSA-ViT offers an efficient, scalable new paradigm for high-resolution, multi-scale atmospheric prediction and downscaling. Abstract: While AI-based numerical weather prediction (NWP) enables rapid forecasting, generating high-resolution outputs remains computationally demanding due to limited multi-scale adaptability and inefficient data representations. We propose the 3D Gaussian splatting-based scale-aware vision transformer (GSSA-ViT), a novel framework for arbitrary-resolution forecasting and flexible downscaling of high-dimensional atmospheric fields. Specifically, latitude-longitude grid points are treated as centers of 3D Gaussians. A generative 3D Gaussian prediction scheme is introduced to estimate key parameters, including covariance, attributes, and opacity, for unseen samples, improving generalization and mitigating overfitting. In addition, a scale-aware attention module is designed to capture cross-scale dependencies, enabling the model to effectively integrate information across varying downscaling ratios and support continuous resolution adaptation. To our knowledge, this is the first NWP approach that combines generative 3D Gaussian modeling with scale-aware attention for unified multi-scale prediction. Experiments on ERA5 show that the proposed method accurately forecasts 87 atmospheric variables at arbitrary resolutions, while evaluations on ERA5 and CMIP6 demonstrate its superior performance in downscaling tasks. The proposed framework provides an efficient and scalable solution for high-resolution, multi-scale atmospheric prediction and downscaling. Code is available at: https://github.com/binbin2xs/weather-GS.

[107] Shortcut Learning in Glomerular AI: Adversarial Penalties Hurt, Entropy Helps

Mohammad Daouk,Jan Ulrich Becker,Neeraja Kambham,Anthony Chang,Hien Nguyen,Chandra Mohan

Main category: cs.CV

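TL;DR: This paper studies distribution shift and shortcut learning caused by stain variation in renal pathology AI and proposes an unsupervised entropy regularization method that needs no stain or site labels; its effectiveness and robustness for lupus nephritis glomerular lesion classification are validated on a multi-center, multi-stain dataset.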

Details

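Motivation: Stain variability is a pervasive source of distribution shift in renal pathology AI and can lead models to use stain information as a prediction shortcut, hurting generalization. The paper asks whether lupus nephritis glomerular lesion classifiers rely on stain shortcuts and proposes a mitigation that needs no stain or center labels. Method: A multi-center, multi-stain dataset of 9,674 glomerular patches is built from three centers and four stains (PAS, H&E, Jones, Trichrome); Bayesian CNN and ViT backbones are combined with Monte Carlo dropout; three settings are tested: (1) stain-only classification; (2) dual-head joint prediction with a supervised stain loss; (3) label-free stain regularization by maximizing the entropy of the stain head's predictions. Result: (1) Stain identity is trivially learnable, confirming stain as a strong candidate shortcut; (2) the supervised stain loss modulates stain performance but barely changes lesion classification metrics, indicating that this dataset is itself fairly robust to stain shortcuts; (3) entropy regularization drives stain predictions to near chance without harming lesion accuracy or calibration. Conclusion: A carefully curated multi-stain dataset can be inherently robust to stain shortcuts, and a Bayesian dual-head architecture with label-free entropy regularization is a simple, deployment-friendly safeguard against stain-related drift. Abstract: Stain variability is a pervasive source of distribution shift and potential shortcut learning in renal pathology AI. We ask whether lupus nephritis glomerular lesion classifiers exploit stain as a shortcut, and how to mitigate such bias without stain or site labels. We curate a multi-center, multi-stain dataset of 9,674 glomerular patches (224×224) from 365 WSIs across three centers and four stains (PAS, H&E, Jones, Trichrome), labeled as proliferative vs. non-proliferative. We evaluate Bayesian CNN and ViT backbones with Monte Carlo dropout in three settings: (1) stain-only classification; (2) a dual-head model jointly predicting lesion and stain with supervised stain loss; and (3) a dual-head model with label-free stain regularization via entropy maximization on the stain head. In (1), stain identity is trivially learnable, confirming a strong candidate shortcut. In (2), varying the strength and sign of stain supervision strongly modulates stain performance but leaves lesion metrics essentially unchanged, indicating no measurable stain-driven shortcut learning on this multi-stain, multi-center dataset, while overly adversarial stain penalties inflate predictive uncertainty. In (3), entropy-based regularization holds stain predictions near chance without degrading lesion accuracy or calibration.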
Overall, a carefully curated multi-stain dataset can be inherently robust to stain shortcuts, and a Bayesian dual-head architecture with label-free entropy regularization offers a simple, deployment-friendly safeguard against potential stain-related drift in glomerular AI.
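Setting (3), label-free regularization by entropy maximization on the stain head, can be sketched as a per-sample loss. This is a minimal sketch under assumptions, not the authors' implementation: the function names, the weighting factor `lam`, and the exact form of the objective (lesion cross-entropy minus a weighted stain-head entropy) are illustrative choices.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def entropy(probs, eps=1e-12):
    """Shannon entropy (nats) of a probability vector."""
    return -sum(p * math.log(p + eps) for p in probs)


def dual_head_loss(lesion_logits, lesion_label, stain_logits, lam=0.1):
    """Lesion cross-entropy minus lam * entropy of the stain head.

    No stain label is used: minimizing this loss pushes the stain head
    toward a uniform (chance-level) prediction. `lam` is hypothetical.
    """
    lesion_probs = softmax(lesion_logits)
    lesion_ce = -math.log(lesion_probs[lesion_label] + 1e-12)
    stain_entropy = entropy(softmax(stain_logits))
    return lesion_ce - lam * stain_entropy
```

With four stains, the entropy term is maximal at ln 4 when the stain head is uniform, so a confident stain prediction is penalized relative to an uninformative one at the same lesion accuracy.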

[108] ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks

Jiayang Xu,Fan Zhuo,Majun Zhang,Changhao Pan,Zehan Wang,Siyu Chen,Xiaoda Yang,Tao Jin,Zhou Zhao

Main category: cs.CV

TL;DR: ImVideoEdit是一种仅用图像对训练视频编辑能力的高效框架，通过冻结3D注意力模块、引入Predict-Update空间差异注意力和文本引导动态语义门控机制，在极低计算开销下实现媲美大规模视频训练模型的编辑保真度与时间一致性。

Details

Motivation: Existing video editing models depend on expensive paired video data and scale poorly, yet most video editing tasks are essentially decoupled spatiotemporal processes in which spatial content can be modified precisely while the pretrained model's temporal dynamics are preserved. Method: ImVideoEdit freezes the pretrained 3D attention modules and treats images as single-frame videos to decouple 2D spatial learning; a Predict-Update Spatial Difference Attention module progressively extracts and injects spatial differences; a Text-Guided Dynamic Semantic Gating mechanism enables adaptive, implicit text-driven edits without rigid external masks. Result: Trained on only 13K image pairs for 5 epochs at very low computational cost, it reaches editing fidelity and temporal consistency comparable to large models trained on extensive video data. Conclusion: High-quality video editing can be learned efficiently from image pairs alone, which supports decoupled spatiotemporal modeling for video editing and offers a new paradigm for lowering training cost. Abstract: Current video editing models often rely on expensive paired video data, which limits their practical scalability. In essence, most video editing tasks can be formulated as a decoupled spatiotemporal process, where the temporal dynamics of the pretrained model are preserved while spatial content is selectively and precisely modified. Based on this insight, we propose ImVideoEdit, an efficient framework that learns video editing capabilities entirely from image pairs. By freezing the pre-trained 3D attention modules and treating images as single-frame videos, we decouple the 2D spatial learning process to help preserve the original temporal dynamics. The core of our approach is a Predict-Update Spatial Difference Attention module that progressively extracts and injects spatial differences. Rather than relying on rigid external masks, we incorporate a Text-Guided Dynamic Semantic Gating mechanism for adaptive and implicit text-driven modifications. Despite training on only 13K image pairs for 5 epochs with exceptionally low computational overhead, ImVideoEdit achieves editing fidelity and temporal consistency comparable to larger models trained on extensive video datasets.

[109] TOOLCAD: Exploring Tool-Using Large Language Models in Text-to-CAD Generation with Reinforcement Learning

Yifei Gong,Xing Wu,Wenda Liu,Kang Tu

Main category: cs.CV

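TL;DR: This paper proposes ToolCAD, a framework that uses large language models as tool-using agents for text-to-CAD modeling and strengthens them with an interactive CAD modeling environment and online curriculum reinforcement learning.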

Details

Motivation: No prior work has investigated how tool-using large language models can best interact with CAD engines, which holds back LLM-based text-to-CAD modeling systems. Method: ToolCAD builds an interactive CAD modeling training environment and applies an end-to-end post-training strategy with online curriculum reinforcement learning, so that the LLM produces a refined CAD modeling chain of thought (CAD-CoT) and becomes a proficient CAD tool user. Result: ToolCAD fills the gap in applying and training open-source LLMs as CAD tool-using agents, bringing their performance close to that of proprietary models. Conclusion: ToolCAD paves the way for more accessible and robust autonomous text-to-CAD modeling systems. Abstract: Computer-Aided Design (CAD) is an expert-level task that relies on long-horizon reasoning and coherent modeling actions. Large Language Models (LLMs) have shown remarkable advancements in enabling language agents to tackle real-world tasks. Notably, there has been no investigation into how tool-using LLMs optimally interact with CAD engines, hindering the emergence of LLM-based agentic text-to-CAD modeling systems. We propose ToolCAD, a novel agentic CAD framework deploying LLMs as tool-using agents for text-to-CAD generation. Furthermore, we introduce an interactive CAD modeling gym to roll out reasoning and tool-augmented interaction trajectories with the CAD engine, incorporating hybrid feedback and human supervision. Meanwhile, an end-to-end post-training strategy is presented to enable the LLM agent to elicit refined CAD Modeling Chain of Thought (CAD-CoT) and evolve into a proficient CAD tool-using agent via online curriculum reinforcement learning. Our findings demonstrate ToolCAD fills the gap in adopting and training open-source LLMs for CAD tool-using agents, enabling them to perform comparably to proprietary models, paving the way for more accessible and robust autonomous text-to-CAD modeling systems.

[110] DSCA: Dynamic Subspace Concept Alignment for Lifelong VLM Editing

Gyanendra Das,Sai Satyam Jena

Main category: cs.CV

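TL;DR: This paper proposes Dynamic Subspace Concept Alignment (DSCA), which decomposes a vision-language model's representation space into orthogonal semantic subspaces so that concepts are structurally isolated, enabling precise, interference-free continual knowledge editing and markedly better stability and knowledge retention over long edit sequences.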

Details

Motivation: Existing knowledge editing methods for vision-language models (VLMs) operate in a shared representation space, which entangles concepts and causes edits to interfere, so they cope poorly with the catastrophic forgetting and cross-modal misalignment that continual, sequential updates bring. Method: Dynamic Subspace Concept Alignment (DSCA) builds orthogonal semantic subspaces from joint vision-language representations using incremental clustering and PCA; each edit is applied only in its corresponding subspace, guided by a multi-objective loss for task fidelity, edit locality, and cross-modal alignment. Result: With the base model frozen, single-edit success is 98% and remains above 95% after 1,000 sequential edits; hallucination drops by 3 to 5%; backward transfer (BWT) scores are the best; and continual-editing stability and knowledge retention reach SOTA across multiple datasets and benchmarks. Conclusion: Structurally separating semantic subspaces addresses edit interference more fundamentally than algorithmic optimization; DSCA turns concept isolation from a training objective into an architectural property, offering a more robust and scalable paradigm for lifelong VLM editing. Abstract: Model editing aims to update knowledge to add new concepts and change relevant information without retraining. Lifelong editing is a challenging task, prone to disrupting previously learned concepts, especially for Vision Language Models (VLMs), because sequential edits can lead to degraded reasoning and cross-modal misalignment. Existing VLM knowledge editing methods based on gated adapters, activation edits, and parameter merging techniques address catastrophic forgetting seen in full fine-tuning; however, they still operate in the shared representation space of the VLM, where concepts are entangled, so edits interfere with other non-relevant concepts. We hypothesize that this instability persists because current methods algorithmically control edits via optimization rather than structurally separating knowledge. We introduce Dynamic Subspace Concept Alignment (DSCA), which by design mitigates this limitation by decomposing the representation space into a set of orthogonal semantic subspaces and proposing edits only in those transformed spaces. These subspaces are obtained through incremental clustering and PCA on joint vision-language representations. This process structurally isolates concepts, enabling precise, non-interfering edits by turning isolation from a soft training objective into an architectural property. The surgical edits are guided by a multi-term loss function for maintaining task fidelity, edit locality, and cross-modal alignment.
With the base model frozen, our method achieves 98 percent single-edit success, remains over 95 percent after 1000 sequential edits, lowers hallucination by 3 to 5 percent, and achieves the best backward transfer (BWT) scores on continual instruction tuning benchmarks. Extensive experiments demonstrate DSCA's state-of-the-art stability and knowledge retention capability in continual lifelong editing across various datasets and benchmarks.
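The claim that edits confined to one orthogonal subspace cannot disturb another can be checked with a small linear-algebra sketch. It is illustrative only and makes several assumptions: each concept subspace is given as a list of orthonormal basis vectors (in the paper these would come from incremental clustering and PCA, which is not reproduced here), and an "edit" is simply an additive update projected onto that basis. The function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_onto_subspace(vec, basis):
    """Orthogonal projection of vec onto span(basis).

    Assumes the vectors in `basis` are orthonormal.
    """
    out = [0.0] * len(vec)
    for b in basis:
        coeff = dot(vec, b)
        out = [o + coeff * x for o, x in zip(out, b)]
    return out


def subspace_edit(representation, update, basis):
    """Apply only the component of `update` that lies in span(basis)."""
    delta = project_onto_subspace(update, basis)
    return [r + d for r, d in zip(representation, delta)]


# Two mutually orthogonal concept subspaces of a 4-D representation space.
basis_a = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
basis_b = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

rep = [1.0, 2.0, 3.0, 4.0]
edited = subspace_edit(rep, [0.5, -1.0, 9.0, 9.0], basis_a)
# The component of the representation in subspace B is unchanged by the edit.
unchanged = project_onto_subspace(edited, basis_b) == project_onto_subspace(rep, basis_b)
```

Because the two bases are orthogonal, the part of the update that falls outside subspace A is discarded, and the representation's coordinates in subspace B stay exactly where they were.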

[111] Lighting-grounded Video Generation with Renderer-based Agent Reasoning

Ziqi Cai,Taoyu Yang,Zheng Chang,Si Li,Han Jiang,Shuchen Weng,Boxin Shi

Main category: cs.CV

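TL;DR: This paper proposes LiVER, a framework that controls diffusion-based video generation through explicit 3D scene properties (layout, lighting, camera trajectory), achieving high-fidelity, temporally consistent, and editable controllable video generation.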

Details

Motivation: Existing video diffusion models lack explicit, disentangled control over key scene factors such as layout, lighting, and camera trajectory, and so fall short of the precise scene control required in filmmaking and virtual production. Method: Control signals are rendered from a unified 3D representation; a lightweight conditioning module and a progressive training strategy integrate the 3D scene parameters into a foundational video diffusion model; a scene agent automatically translates high-level user instructions into the 3D control signals. Result: On image-to-video and video-to-video tasks, LiVER achieves state-of-the-art photorealism and temporal consistency while supporting precise, disentangled control over layout, lighting, and camera trajectory. Conclusion: LiVER sets a new standard for controllable video generation and makes diffusion models considerably more practical and controllable for professional content creation. Abstract: Diffusion models have achieved remarkable progress in video generation, but their controllability remains a major limitation. Key scene factors such as layout, lighting, and camera trajectory are often entangled or only weakly modeled, restricting their applicability in domains like filmmaking and virtual production where explicit scene control is essential. We present LiVER, a diffusion-based framework for scene-controllable video generation. To achieve this, we introduce a novel framework that conditions video synthesis on explicit 3D scene properties, supported by a new large-scale dataset with dense annotations of object layout, lighting, and camera parameters. Our method disentangles these properties by rendering control signals from a unified 3D representation. We propose a lightweight conditioning module and a progressive training strategy to integrate these signals into a foundational video diffusion model, ensuring stable convergence and high fidelity. Our framework enables a wide range of applications, including image-to-video and video-to-video synthesis where the underlying 3D scene is fully editable. To further enhance usability, we develop a scene agent that automatically translates high-level user instructions into the required 3D control signals. Experiments show that LiVER achieves state-of-the-art photorealism and temporal consistency while enabling precise, disentangled control over scene factors, setting a new standard for controllable video generation.

[112] Object-Centric Stereo Ranging for Autonomous Driving: From Dense Disparity to Census-Based Template Matching

Qihao Huang

Main category: cs.CV

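TL;DR: This paper presents a comprehensive stereo ranging system for autonomous driving that combines dense stereo matching (BM/SGM), object-centric Census template matching, and monocular geometric priors, and adds an online calibration refinement framework to achieve real-time, robust long-range ranging.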

Details

Motivation: Traditional dense stereo matching methods (e.g., BM, SGM) are computationally expensive, sensitive to radiometric differences between the two cameras, and inaccurate at long range when used for highway vehicle detection, so a more robust and efficient ranging approach is needed. Method: A unified detection-ranging-tracking pipeline is built around a GPU-accelerated, object-centric Census template matching algorithm (far-close divide and conquer, forward-backward verification, occlusion-aware sampling, robust multi-block aggregation), combined with an online calibration refinement framework (auto-rectification offset search, radar-stereo voting for disparity correction, object-level radar-stereo association). Result: The system ranges robustly in difficult driving conditions such as nighttime, rain, and changing illumination, and reaches real-time performance through an asynchronous GPU pipeline. Conclusion: The multi-source ranging system markedly improves the robustness and real-time performance of depth estimation at long range, on low-texture targets, and under changing illumination, offering a practical solution for autonomous driving perception. Abstract: Accurate depth estimation is critical for autonomous driving perception systems, particularly for long-range vehicle detection on highways. Traditional dense stereo matching methods such as Block Matching (BM) and Semi-Global Matching (SGM) produce per-pixel disparity maps but suffer from high computational cost, sensitivity to radiometric differences between stereo cameras, and poor accuracy at long range where disparity values are small. In this report, we present a comprehensive stereo ranging system that integrates three complementary depth estimation approaches: dense BM/SGM disparity, object-centric Census-based template matching, and monocular geometric priors, within a unified detection-ranging-tracking pipeline. Our key contribution is a novel object-centric Census-based template matching algorithm that performs GPU-accelerated sparse stereo matching directly within detected bounding boxes, employing a far-close divide-and-conquer strategy, forward-backward verification, occlusion-aware sampling, and robust multi-block aggregation. We further describe an online calibration refinement framework that combines auto-rectification offset search, radar-stereo voting-based disparity correction, and object-level radar-stereo association for continuous extrinsic drift compensation. The complete system achieves real-time performance through asynchronous GPU pipeline design and delivers robust ranging across diverse driving conditions including nighttime, rain, and varying illumination.
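The basic building blocks named here, a Census signature, Hamming-distance matching along a rectified image row, and depth from disparity, can be sketched as follows. This is the textbook Census matching idea in plain Python, not the paper's GPU-accelerated object-centric algorithm: it omits the far-close strategy, forward-backward verification, occlusion handling, and multi-block aggregation, and all function names and parameters are hypothetical. Images are assumed rectified, so a point at column c in the left image appears at column c - d in the right image.

```python
def census_signature(patch):
    """Census bits for a square, odd-sized patch: 1 where a pixel is darker than the centre."""
    c = len(patch) // 2
    centre = patch[c][c]
    return [
        1 if patch[r][k] < centre else 0
        for r in range(len(patch))
        for k in range(len(patch))
        if (r, k) != (c, c)
    ]


def hamming(a, b):
    """Number of differing bits between two equal-length signatures."""
    return sum(x != y for x, y in zip(a, b))


def window(img, row, col, radius):
    """Square window of the image centred at (row, col)."""
    return [img[r][col - radius: col + radius + 1]
            for r in range(row - radius, row + radius + 1)]


def best_disparity(left, right, row, col, max_disp, radius=1):
    """Disparity in [0, max_disp] with the lowest Census Hamming cost."""
    ref = census_signature(window(left, row, col, radius))
    best_d, best_cost = 0, None
    for d in range(max_disp + 1):
        if col - d - radius < 0:
            break  # candidate window would leave the image
        cand = census_signature(window(right, row, col - d, radius))
        cost = hamming(ref, cand)
        if best_cost is None or cost < best_cost:
            best_d, best_cost = d, cost
    return best_d


def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Pinhole stereo relation: depth = focal length * baseline / disparity."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px
```

Because the Census signature depends only on the ordering of intensities within a window, it is insensitive to gain and offset differences between the two cameras, which is the radiometric sensitivity the abstract attributes to dense BM/SGM. The depth relation also shows why long range is hard: depth is inversely proportional to disparity, so a one-pixel error matters far more when the disparity is small.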

[113] DP-DeGauss: Dynamic Probabilistic Gaussian Decomposition for Egocentric 4D Scene Reconstruction

Tingxi Chen,Zhengxue Cheng,Houqiang Zhong,Su Wang,Rong Xie,Li Song

Main category: cs.CV

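TL;DR: This paper proposes DP-DeGauss, a framework for dynamic 4D scene reconstruction from egocentric video that uses probabilistic Gaussian decomposition to explicitly disentangle background, hands, and objects, improving reconstruction quality and editability.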

Details

Motivation: Existing methods struggle with the complex ego-motion, occlusions, and hand-object interactions in egocentric video; they assume fixed viewpoints or merge all dynamics into a single foreground, and so cannot disentangle dynamic components effectively. Method: DP-DeGauss, a dynamic probabilistic Gaussian decomposition framework, initializes a unified 3D Gaussian set from COLMAP, attaches a learnable category probability to each Gaussian, and dynamically routes Gaussians into category-specific deformation branches (background, hands, objects); category masks, brightness constraints, and motion-flow control improve static rendering and dynamic reconstruction. Result: PSNR improves by +1.70dB on average over baselines, with SSIM and LPIPS gains; the method is the first to achieve explicit, fine-grained disentanglement of background, hand, and object components and reaches state-of-the-art performance on the task. Conclusion: DP-DeGauss offers a more robust, interpretable, and editable modeling paradigm for egocentric 4D reconstruction, supporting scene understanding in AR/VR and embodied AI. Abstract: Egocentric video is crucial for next-generation 4D scene reconstruction, with applications in AR/VR and embodied AI. However, reconstructing dynamic first-person scenes is challenging due to complex ego-motion, occlusions, and hand-object interactions. Existing decomposition methods are ill-suited, assuming fixed viewpoints or merging dynamics into a single foreground. To address these limitations, we introduce DP-DeGauss, a dynamic probabilistic Gaussian decomposition framework for egocentric 4D reconstruction. Our method initializes a unified 3D Gaussian set from COLMAP priors, augments each with a learnable category probability, and dynamically routes them into specialized deformation branches for background, hands, or object modeling. We employ category-specific masks for better disentanglement and introduce brightness and motion-flow control to improve static rendering and dynamic reconstruction. Extensive experiments show that DP-DeGauss outperforms baselines by +1.70dB in PSNR on average with SSIM and LPIPS gains. More importantly, our framework achieves the first and state-of-the-art disentanglement of background, hand, and object components, enabling explicit, fine-grained separation, paving the way for more intuitive ego scene understanding and editing.

[114] SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations

Yunnan Wang,Kecheng Zheng,Jianyuan Wang,Minghao Chen,David Novotny,Christian Rupprecht,Yinghao Xu,Xing Zhu,Wenjun Zeng,Xin Jin,Yujun Shen

Main category: cs.CV

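TL;DR: This paper introduces SceneScribe-1M, a large-scale multimodal dataset of one million in-the-wild videos, each with text descriptions, camera parameters, depth maps, and 3D point tracks, intended to advance both 3D perception and video generation research.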

Details

Motivation: Existing datasets support either 3D understanding or video generation; no large-scale unified resource serves both. Method: The SceneScribe-1M dataset comprises one million in-the-wild videos, each annotated with text descriptions, camera parameters, dense depth maps, and consistent 3D point tracks; several downstream-task benchmarks are built on it. Result: The dataset's effectiveness and generalization are validated on monocular depth estimation, scene reconstruction, dynamic point tracking, and text-to-video synthesis with and without camera control. Conclusion: SceneScribe-1M bridges the data gap between 3D perception and video generation as an open, comprehensive benchmark resource and should support models that combine 3D understanding with controllable video generation. Abstract: The convergence of 3D geometric perception and video synthesis has created an unprecedented demand for large-scale video data that is rich in both semantic and spatio-temporal information. While existing datasets have advanced either 3D understanding or video generation, a significant gap remains in providing a unified resource that supports both domains at scale. To bridge this chasm, we introduce SceneScribe-1M, a new large-scale, multi-modal video dataset. It comprises one million in-the-wild videos, each meticulously annotated with detailed textual descriptions, precise camera parameters, dense depth maps, and consistent 3D point tracks. We demonstrate the versatility and value of SceneScribe-1M by establishing benchmarks across a wide array of downstream tasks, including monocular depth estimation, scene reconstruction, and dynamic point tracking, as well as generative tasks such as text-to-video synthesis, with or without camera control. By open-sourcing SceneScribe-1M, we aim to provide a comprehensive benchmark and a catalyst for research, fostering the development of models that can both perceive the dynamic 3D world and generate controllable, realistic video content.

[115] MotionScape: A Large-Scale Real-World Highly Dynamic UAV Video Dataset for World Models

Zile Guo,Zhan Chen,Enze Zhu,Kan Wei,Yongkang Zou,Xiaoxuan Liu,Lei Wang

Main category: cs.CV

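TL;DR: This paper presents MotionScape, described as the first large-scale real-world UAV-view video dataset for world models, with over 30 hours of 4K video (4.5M frames) paired with 6-DoF camera trajectories and fine-grained language descriptions; it addresses the poor physical consistency of existing world models under highly dynamic UAV viewpoints.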

Details

Motivation: Existing world models struggle to maintain spatiotemporal physical consistency under highly dynamic UAV viewpoints, mainly because of distribution bias in training data: mainstream datasets consist of 2.5D ground driving or smooth human-centric views and lack realistic 6-DoF UAV motion priors. Method: The MotionScape dataset is built from large-scale real UAV video with an automated multi-stage pipeline that combines CLIP-based relevance filtering, temporal segmentation, robust visual SLAM for 6-DoF trajectory recovery, and LLM-driven semantic annotation, so that samples are closely aligned semantically and geometrically. Result: Experiments show that adding MotionScape and its aligned annotations improves existing world models' ability to model complex 3D dynamics and to generalize under large viewpoint changes, benefiting decision-making and planning for UAV agents in complex environments. Conclusion: MotionScape fills the data gap for world modeling from highly dynamic UAV viewpoints and provides a foundational resource for embodied intelligence, aerial robotics in particular. Abstract: Recent advances in world models have demonstrated strong capabilities in simulating physical reality, making them an increasingly important foundation for embodied intelligence. For UAV agents in particular, accurate prediction of complex 3D dynamics is essential for autonomous navigation and robust decision-making in unconstrained environments. However, under the highly dynamic camera trajectories typical of UAV views, existing world models often struggle to maintain spatiotemporal physical consistency. A key reason lies in the distribution bias of current training data: most existing datasets exhibit restricted 2.5D motion patterns, such as ground-constrained autonomous driving scenes or relatively smooth human-centric egocentric videos, and therefore lack realistic high-dynamic 6-DoF UAV motion priors. To address this gap, we present MotionScape, a large-scale real-world UAV-view video dataset with highly dynamic motion for world modeling. MotionScape contains over 30 hours of 4K UAV-view videos, totaling more than 4.5M frames. This novel dataset features semantically and geometrically aligned training samples, where diverse real-world UAV videos are tightly coupled with accurate 6-DoF camera trajectories and fine-grained natural language descriptions. To build the dataset, we develop an automated multi-stage processing pipeline that integrates CLIP-based relevance filtering, temporal segmentation, robust visual SLAM for trajectory recovery, and large-language-model-driven semantic annotation. Extensive experiments show that incorporating such semantically and geometrically aligned annotations effectively improves the ability of existing world models to simulate complex 3D dynamics and handle large viewpoint shifts, thereby benefiting decision-making and planning for UAV agents in complex environments. The dataset is publicly available at https://github.com/Thelegendzz/MotionScape

[116] SAT: Selective Aggregation Transformer for Image Super-Resolution

Dinh Phu Tran,Thao Do,Saad Wazir,Seongah Kim,Seon Kwon Kim,Daeyoung Kim

Main category: cs.CV

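TL;DR: This paper proposes the Selective Aggregation Transformer (SAT), which selectively aggregates key-value matrices with a density-driven token aggregation algorithm, cutting computation substantially (97% fewer tokens) while enlarging the receptive field and preserving query resolution and reconstruction accuracy.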

Details

Motivation: Transformers for image super-resolution are limited by the quadratic complexity of self-attention; window attention improves efficiency but restricts the receptive field, so a method is needed that balances efficiency and global modeling. Method: The Selective Aggregation Transformer (SAT) uses a Density-driven Token Aggregation algorithm that selectively aggregates the key-value matrices based on density and isolation, keeping a single aggregation token per cluster while the query matrix stays at full resolution. Result: SAT outperforms the state-of-the-art method PFT by up to 0.22dB while reducing FLOPs by up to 27%. Conclusion: SAT enlarges the receptive field and preserves high-fidelity reconstruction at a much lower computational cost, offering a new paradigm for efficient global modeling. Abstract: Transformer-based approaches have revolutionized image super-resolution by modeling long-range dependencies. However, the quadratic computational complexity of vanilla self-attention mechanisms poses significant challenges, often leading to compromises between efficiency and global context exploitation. Recent window-based attention methods mitigate this by localizing computations, but they often yield restricted receptive fields. To mitigate these limitations, we propose Selective Aggregation Transformer (SAT). This novel transformer efficiently captures long-range dependencies, leading to an enlarged model receptive field by selectively aggregating key-value matrices (reducing the number of tokens by 97%) via our Density-driven Token Aggregation algorithm while maintaining the full resolution of the query matrix. This design significantly reduces computational costs, resulting in lower complexity and enabling scalable global interactions without compromising reconstruction fidelity. SAT identifies and represents each cluster with a single aggregation token, utilizing density and isolation metrics to ensure that critical high-frequency details are preserved. Experimental results demonstrate that SAT outperforms the state-of-the-art method PFT by up to 0.22dB, while the total number of FLOPs can be reduced by up to 27%.
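The core idea in the abstract, attention with full-resolution queries over a much smaller set of aggregated key-value tokens, can be sketched in plain Python. This is a minimal illustration only: the cluster assignment is taken as given, and mean pooling per cluster is an assumption standing in for the paper's Density-driven Token Aggregation, whose details the abstract does not give.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_by_cluster(tokens, cluster_ids):
    """Mean-pool tokens sharing a cluster id (hypothetical aggregation rule)."""
    sums, counts = {}, {}
    for token, cid in zip(tokens, cluster_ids):
        if cid not in sums:
            sums[cid] = [0.0] * len(token)
            counts[cid] = 0
        sums[cid] = [s + t for s, t in zip(sums[cid], token)]
        counts[cid] += 1
    return [[s / counts[cid] for s in sums[cid]] for cid in sorted(sums)]


def attention(queries, keys, values):
    """Scaled dot-product attention; output has one row per query."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs


# Four tokens, two assumed clusters: queries stay at 4 rows, keys/values shrink to 2.
tokens = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
pooled = aggregate_by_cluster(tokens, cluster_ids=[0, 0, 1, 1])
out = attention(queries=tokens, keys=pooled, values=pooled)
print(len(out), "query rows attend over", len(pooled), "aggregated tokens")
```

With N queries and M aggregated tokens the score matrix has N×M entries instead of N×N, so a 97% reduction in key-value tokens would leave roughly 3% of the original score entries while every query position still produces an output.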

[117] Few-Shot Incremental 3D Object Detection in Dynamic Indoor Environments

Yun Zhu,Jianjun Qian,Jian Yang,Jin Xie,Na Zhao

Main category: cs.CV

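TL;DR: This paper proposes FI3Det, described as the first framework for few-shot incremental 3D object detection; it uses vision-language models (VLMs) to mine unknown objects, fuses 2D semantic and 3D geometric features, and improves detection with a weighting mechanism and a gated multimodal prototype imprinting module, clearly outperforming baselines on ScanNet V2 and SUN RGB-D.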

Details

Motivation: Existing incremental 3D detection methods depend on extensive annotations for novel classes and so adapt poorly to dynamic indoor environments, where new categories must be learned from a few samples. Method: The FI3Det framework comprises (1) a VLM-guided unknown object learning module that mines unknown objects and extracts 2D semantic features and class-agnostic 3D boxes; (2) a weighting mechanism based on spatial location and within-box feature consistency that mitigates noise; and (3) a gated multimodal prototype imprinting module that builds category prototypes from aligned 2D semantic and 3D geometric features to compute classification scores, which are then fused by gating. Result: Under both batch and sequential evaluation settings on ScanNet V2 and SUN RGB-D, FI3Det improves strongly and consistently over existing baselines. Conclusion: FI3Det is presented as the first few-shot incremental 3D object detection method; it supports the use of VLMs to transfer semantic knowledge into 3D perception and offers a new approach to understanding dynamic environments in embodied intelligence. Abstract: Incremental 3D object perception is a critical step toward embodied intelligence in dynamic indoor environments. However, existing incremental 3D detection methods rely on extensive annotations of novel classes for satisfactory performance. To address this limitation, we propose FI3Det, a Few-shot Incremental 3D Detection framework that enables efficient 3D perception with only a few novel samples by leveraging vision-language models (VLMs) to learn knowledge of unseen categories. FI3Det introduces a VLM-guided unknown object learning module in the base stage to enhance perception of unseen categories. Specifically, it employs VLMs to mine unknown objects and extract comprehensive representations, including 2D semantic features and class-agnostic 3D bounding boxes. To mitigate noise in these representations, a weighting mechanism is further designed to re-weight the contributions of point- and box-level features based on their spatial locations and feature consistency within each box. Moreover, FI3Det proposes a gated multimodal prototype imprinting module, where category prototypes are constructed from aligned 2D semantic and 3D geometric features to compute classification scores, which are then fused via a multimodal gating mechanism for novel object detection. As the first framework for few-shot incremental 3D object detection, we establish both batch and sequential evaluation settings on two datasets, ScanNet V2 and SUN RGB-D, where FI3Det achieves strong and consistent improvements over baseline methods. Code is available at https://github.com/zyrant/FI3Det.

[118] SearchAD: Large-Scale Rare Image Retrieval Dataset for Autonomous Driving

Felix Embacher,Jonas Uhrig,Marius Cordts,Markus Enzweiler

Main category: cs.CV

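TL;DR: This paper introduces SearchAD, a large-scale rare image retrieval dataset for autonomous driving (AD) with 423k frames, 513k annotated bounding boxes, and 90 rare categories; it targets needle-in-a-haystack retrieval of rare scenarios and supports text-to-image and image-to-image retrieval, few-shot learning, and fine-tuning of multimodal models.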

Details

Motivation: As autonomous driving datasets grow, the key challenge shifts from collecting data to efficiently locating the most relevant samples, especially rare and safety-critical driving scenarios. Method: SearchAD covers 423k frames from 11 existing datasets with 513k high-quality manually annotated boxes across 90 rare categories, and defines a semantic image retrieval task with a standard split that supports several retrieval paradigms and model evaluation. Result: Text-based methods outperform image-based ones; models that directly align spatial visual features with language perform best zero-shot; a fine-tuning baseline improves performance substantially, yet absolute retrieval capability remains insufficient. Conclusion: SearchAD is presented as the first large-scale retrieval benchmark for data curation and long-tail perception research in autonomous driving, advancing rare-scenario recognition and multimodal retrieval. Abstract: Retrieving rare and safety-critical driving scenarios from large-scale datasets is essential for building robust autonomous driving (AD) systems. As dataset sizes continue to grow, the key challenge shifts from collecting more data to efficiently identifying the most relevant samples. We introduce SearchAD, a large-scale rare image retrieval dataset for AD containing over 423k frames drawn from 11 established datasets. SearchAD provides high-quality manual annotations of more than 513k bounding boxes covering 90 rare categories. It specifically targets the needle-in-a-haystack problem of locating extremely rare classes, with some appearing fewer than 50 times across the entire dataset. Unlike existing benchmarks, which focus on instance-level retrieval, SearchAD emphasizes semantic image retrieval with a well-defined data split, enabling text-to-image and image-to-image retrieval, few-shot learning, and fine-tuning of multi-modal retrieval models. Comprehensive evaluations show that text-based methods outperform image-based ones due to stronger inherent semantic grounding. While models directly aligning spatial visual features with language achieve the best zero-shot results, and our fine-tuning baseline significantly improves performance, absolute retrieval capabilities remain unsatisfactory. With a held-out test set on a public benchmark server, SearchAD establishes the first large-scale dataset for retrieval-driven data curation and long-tail perception research in AD: https://iis-esslingen.github.io/searchad/

[119] Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding

Xuezhen Tu,Jingyu Wu,Fangyu Kang,Qingpeng Nong,Kaijin Zhang,Chaoyue Niu,Fan Wu

Main category: cs.CV

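TL;DR: This paper proposes Bridge-STG, a framework that decouples temporal and spatial localization and adds semantic bridging and a query-guided spatial localization module to address entangled spatio-temporal alignment and dual-domain visual token redundancy in video grounding, reaching state-of-the-art results on several benchmarks.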

Details

Motivation: Multimodal large language models face two challenges in spatio-temporal video grounding: entangled spatio-temporal alignment (the temporal and spatial sub-tasks are coupled in one autoregressive output space) and dual-domain visual token redundancy (targets are sparse in both time and space, so most visual tokens are irrelevant). Method: Bridge-STG is an end-to-end framework in which (1) the Spatio-Temporal Semantic Bridging (STSB) mechanism with Explicit Temporal Alignment (ETA) distills the MLLM's temporal reasoning context into enriched bridging queries, and (2) the Query-Guided Spatial Localization (QGSL) module uses these queries to drive a purpose-built spatial decoder with multi-layer interactive queries and positive/negative frame sampling, which together eliminate redundant visual tokens. Result: Bridge-STG reaches state-of-the-art performance among MLLM-based methods on multiple benchmarks including VidSTG; average m_vIoU on VidSTG rises from 26.4 to 34.3; under unified multi-task training it also transfers well to fine-grained video understanding tasks. Conclusion: By decoupling the two sub-tasks and bridging them semantically, Bridge-STG mitigates spatio-temporal entanglement and visual redundancy, supporting a decoupled design with a strong interface for complex multimodal grounding and offering a new paradigm for MLLMs in fine-grained video understanding. Abstract: Spatio-Temporal Video Grounding requires jointly localizing target objects across both temporal and spatial dimensions based on natural language queries, posing fundamental challenges for existing Multimodal Large Language Models (MLLMs). We identify two core challenges: entangled spatio-temporal alignment, arising from coupling two heterogeneous sub-tasks within the same autoregressive output space, and dual-domain visual token redundancy, where target objects exhibit simultaneous temporal and spatial sparsity, rendering the overwhelming majority of visual tokens irrelevant to the grounding query. To address these, we propose Bridge-STG, an end-to-end framework that decouples temporal and spatial localization while maintaining semantic coherence. While decoupling is the natural solution to this entanglement, it risks creating a semantic gap between the temporal MLLM and the spatial decoder. Bridge-STG resolves this through two pivotal designs: the Spatio-Temporal Semantic Bridging (STSB) mechanism with Explicit Temporal Alignment (ETA) distills the MLLM's temporal reasoning context into enriched bridging queries as a robust semantic interface; and the Query-Guided Spatial Localization (QGSL) module leverages these queries to drive a purpose-built spatial decoder with multi-layer interactive queries and positive/negative frame sampling, jointly eliminating dual-domain visual token redundancy. Extensive experiments across multiple benchmarks demonstrate that Bridge-STG achieves state-of-the-art performance among MLLM-based methods. Bridge-STG improves average m_vIoU from 26.4 to 34.3 on VidSTG and demonstrates strong cross-task transfer across various fine-grained video understanding tasks under a unified multi-task training regime.
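The m_vIoU figure reported above is not defined in the abstract. The sketch below assumes the definition commonly used for spatio-temporal video grounding: per sample, box IoU is summed over frames shared by the predicted and ground-truth segments and divided by the number of frames in their union, and m_vIoU is the mean over samples. The function names and the toy tracks are hypothetical.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def v_iou(pred, gt):
    """pred and gt map frame index -> box; frames outside a segment are absent."""
    union_frames = set(pred) | set(gt)
    if not union_frames:
        return 0.0
    shared_frames = set(pred) & set(gt)
    return sum(box_iou(pred[t], gt[t]) for t in shared_frames) / len(union_frames)


def mean_v_iou(samples):
    """Average vIoU over (pred, gt) pairs."""
    return sum(v_iou(p, g) for p, g in samples) / len(samples)


# Perfect boxes but a temporally shifted segment: 2 shared frames out of 6.
box = (0.0, 0.0, 10.0, 10.0)
gt_track = {t: box for t in range(0, 4)}
pred_track = {t: box for t in range(2, 6)}
print(round(v_iou(pred_track, gt_track), 3))
```

Under this assumed definition a prediction is penalized for temporal misalignment even when every predicted box is spatially exact, which is why the metric is sensitive to both halves of the decoupled task.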

[120] Component-Adaptive and Lesion-Level Supervision for Improved Small Structure Segmentation in Brain MRI

Minh Sao Khue Luu,Evgeniy N. Pavlovskiy,Bair N. Tuchinov

Main category: cs.CV

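TL;DR: This paper proposes CATMIL, a unified objective that adds two auxiliary supervision terms at different levels to the base segmentation loss (a component-adaptive Tversky loss and lesion-level supervision based on multiple instance learning) to improve small lesion segmentation, raising small-lesion recall and lowering false positive volume under severe class imbalance.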

Details

Motivation: 解决医学图像中小病灶分割在高度类别不平衡场景下的挑战，特别是提升小病灶召回率、降低假阴性和假阳性。 Method: 提出CATMIL统一目标函数，包含：1）Component-Adaptive Tversky损失，按连通组件重加权体素贡献以平衡不同大小病灶影响；2）基于多实例学习（MIL）的病变级监督，鼓励每个病灶实例的检测；二者与标准nnU-Net损失联合优化。 Result: 在MSLesSeg数据集上，CATMIL显著提升Dice分数（0.7834）、降低边界误差、大幅提高小病灶召回率、减少假阴性，并保持最低的假阳性体积，实现分割精度、病灶检测与误差控制的最均衡性能。 Conclusion: 将组件级与病灶级监督整合进统一目标函数，是一种有效且实用的小病灶分割改进方法，尤其适用于高度不平衡的医学图像分割任务。 Abstract: We propose a unified objective function, termed CATMIL, that augments the base segmentation loss with two auxiliary supervision terms operating at different levels. The first term, Component-Adaptive Tversky, reweights voxel contributions based on connected components to balance the influence of lesions of different sizes. The second term, based on Multiple Instance Learning, introduces lesion-level supervision by encouraging the detection of each lesion instance. These terms are combined with the standard nnU-Net loss to jointly optimize voxel-level segmentation accuracy and lesion-level detection. We evaluate the proposed objective on the MSLesSeg dataset using a consistent nnU-Net framework and 5-fold cross-validation. The results show that CATMIL achieves the most balanced performance across segmentation accuracy, lesion detection, and error control. It improves Dice score (0.7834) and reduces boundary error compared to standard losses. More importantly, it substantially increases small lesion recall and reduces false negatives, while maintaining the lowest false positive volume among compared methods. These findings demonstrate that integrating component-level and lesion-level supervision within a unified objective provides an effective and practical approach for improving small lesion segmentation in highly imbalanced settings. All code and pretrained models are available at \href{https://github.com/luumsk/SmallLesionMRI}{this url}.

[121] Rotation Equivariant Convolutions in Deformable Registration of Brain MRI

Arghavan Rezvani,Kun Han,Anthony T. Wu,Pooya Khosravi,Xiaohui Xie

Main category: cs.CV

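TL;DR: This paper integrates rotation-equivariant convolutions into deformable brain MRI registration networks and, by replacing the encoders of three baseline architectures, shows higher registration accuracy, fewer parameters, greater robustness to rotated inputs, and better performance with limited training data.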

Details

Motivation: CNNs lack rotation equivariance and cannot exploit the rotational symmetries inherent in anatomical structures, particularly in brain MRI, which limits registration performance. Method: Rotation-equivariant convolutions are integrated into deformable brain MRI registration networks by replacing the standard encoders of three baseline architectures with equivariant ones, and the approach is evaluated on multiple public brain MRI datasets. Result: Equivariant encoders improve registration accuracy, reduce parameter count, increase robustness to rotated inputs, and retain better performance when training data is scarce. Conclusion: Incorporating geometric priors such as rotation equivariance is a key step toward more robust, accurate, and efficient medical image registration models. Abstract: Image registration is a fundamental task that aligns anatomical structures between images. While CNNs perform well, they lack rotation equivariance: a rotated input does not produce a correspondingly rotated output. This hinders performance by failing to exploit the rotational symmetries inherent in anatomical structures, particularly in brain MRI. In this work, we integrate rotation-equivariant convolutions into deformable brain MRI registration networks. We evaluate this approach by replacing standard encoders with equivariant ones in three baseline architectures, testing on multiple public brain MRI datasets. Our experiments demonstrate that equivariant encoders have three key advantages: 1) They achieve higher registration accuracy while reducing network parameters, confirming the benefit of this anatomical inductive bias. 2) They outperform baselines on rotated input pairs, demonstrating robustness to orientation variations common in clinical practice. 3) They show improved performance with less training data, indicating greater sample efficiency. Our results demonstrate that incorporating geometric priors is a critical step toward building more robust, accurate, and efficient registration models.

[122] Beyond Mamba: Enhancing State-space Models with Deformable Dilated Convolutions for Multi-scale Traffic Object Detection

Jun Li,Yingying Shi,Zhixuan Ruan,Nan Guo,Jianhua Xu

Main category: cs.CV

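TL;DR: This paper proposes MDDCNet, which combines deformable dilated convolutions with Mamba to strengthen local detail modeling and multi-scale feature fusion, improving small-object detection accuracy in complex traffic scenes.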

Details

Motivation: Mamba-based methods model long-range dependencies but struggle to capture small, detail-rich objects, and state-space models lack hierarchical feature representation and cross-scale interaction, which limits detection performance in complex traffic scenes. Method: MDDCNet uses a hybrid backbone of Multi-Scale Deformable Dilated Convolution (MSDDC) blocks and Mamba blocks, a Channel-Enhanced Feed-Forward Network (CE-FFN) to strengthen channel interaction, and a Mamba-based Attention-Aggregating Feature Pyramid Network (A²FPN) to enhance multi-scale feature fusion. Result: The method outperforms a range of advanced detectors on public benchmarks and real-world traffic datasets. Conclusion: By jointly modeling local structure and global semantics and strengthening cross-scale feature interaction, MDDCNet detects objects more accurately in complex traffic scenes. Abstract: In a real-world traffic scenario, varying-scale objects are usually distributed in a cluttered background, which poses great challenges to accurate detection. Although current Mamba-based methods can efficiently model long-range dependencies, they still struggle to capture small objects with abundant local details, which hinders joint modeling of local structures and global semantics. Moreover, state-space models exhibit limited hierarchical feature representation and weak cross-scale interaction due to flat sequential modeling and insufficient spatial inductive biases, leading to sub-optimal performance in complex scenes. To address these issues, we propose a Mamba with Deformable Dilated Convolutions Network (MDDCNet) for accurate traffic object detection in this study. In MDDCNet, a well-designed hybrid backbone with successive Multi-Scale Deformable Dilated Convolution (MSDDC) blocks and Mamba blocks enables hierarchical feature representation from local details to global semantics. Meanwhile, a Channel-Enhanced Feed-Forward Network (CE-FFN) is further devised to overcome the limited channel interaction capability of conventional feed-forward networks, whilst a Mamba-based Attention-Aggregating Feature Pyramid Network (A²FPN) is constructed to achieve enhanced multi-scale feature fusion and interaction. Extensive experimental results on public benchmark and real-world datasets demonstrate the superiority of our method over various advanced detectors. The code is available at https://github.com/Bettermea/MDDCNet.

[123] LINE: LLM-based Iterative Neuron Explanations for Vision Models

Vladimir Zaigrajew,Michał Piechota,Gaspar Sekula,Przemysław Biecek

Main category: cs.CV

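TL;DR: This paper proposes LINE, a training-free, iterative open-vocabulary concept labeling method for explaining what individual neurons in vision models encode; a large language model and a text-to-image model refine concept descriptions in a closed loop under a black-box setting, improving performance and supporting polysemanticity analysis and visual explanations.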

Details

Motivation: Existing neuron labeling methods are restricted to predefined vocabularies or produce overly specific descriptions, so they miss higher-order, global concepts, which hampers understanding of deep network decision-making and AI safety assurance. Method: LINE operates in a strictly black-box setting, using a large language model and a text-to-image generator to iteratively propose and refine concepts guided by the neuron's activation history, with no model training. Result: LINE reaches state-of-the-art performance across multiple model architectures, with AUC improvements of up to 0.18 on ImageNet and 0.05 on Places365; on average it discovers 29% new concepts not covered by large predefined vocabularies; it provides a complete generation history, supports polysemanticity evaluation, and produces visual explanations that rival gradient-based activation maximization methods. Conclusion: LINE is an efficient, general, and interpretable framework for open-vocabulary neuron concept labeling that offers a new route to understanding the internals of vision models and improving AI trustworthiness. Abstract: Interpreting the concepts encoded by individual neurons in deep neural networks is a crucial step towards understanding their complex decision-making processes and ensuring AI safety. Despite recent progress in neuron labeling, existing methods often limit the search space to predefined concept vocabularies or produce overly specific descriptions that fail to capture higher-order, global concepts. We introduce LINE, a novel, training-free iterative approach tailored for open-vocabulary concept labeling in vision models. Operating in a strictly black-box setting, LINE leverages a large language model and a text-to-image generator to iteratively propose and refine concepts in a closed loop, guided by activation history. We demonstrate that LINE achieves state-of-the-art performance across multiple model architectures, yielding AUC improvements of up to 0.18 on ImageNet and 0.05 on Places365, while discovering, on average, 29% of new concepts missed by massive predefined vocabularies. Beyond identifying the top concept, LINE provides a complete generation history, which enables polysemanticity evaluation and produces supporting visual explanations that rival gradient-dependent activation maximization methods.

[124] 3DrawAgent: Teaching LLM to Draw in 3D with Early Contrastive Experience

Hongcan Xiao,Xinyue Xiao,Yilin Wang,Yue Zhang,Yonggang Qi

Main category: cs.CV

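TL;DR: This paper proposes 3DrawAgent, a training-free, language-driven 3D sketch generation framework in which a large language model (LLM) sequentially draws 3D Bezier curves under geometric feedback and improves itself through a relative experience optimization strategy that combines CLIP-based and LLM-based assessment.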

Details

Motivation: Generating 3D sketches from natural language remains a major challenge; existing methods mostly rely on supervised training or are limited to 2D, and lack autonomous understanding of 3D spatial relationships and geometric structure. Method: The 3DrawAgent framework combines (1) LLM-driven sequential generation of 3D Bezier curves; (2) a relative experience optimization strategy that builds better/worse sample pairs from CLIP-based perceptual rewards and fine-grained qualitative LLM assessment; and (3) an adapted Group Reward Policy Optimization (GRPO) paradigm that achieves black-box reinforcement of 3D awareness without parameter updates. Result: Experiments show that 3DrawAgent generates complex, coherent 3D Bezier sketches from diverse text prompts, exhibits emergent geometric reasoning, and generalizes to novel shapes. Conclusion: The work establishes a new paradigm for training-free 3D sketch intelligence, offering a route to language-to-3D generation that relies on reasoning and feedback instead of parameter updates. Abstract: Sketching in 3D space enables expressive reasoning about shape, structure, and spatial relationships, yet generating 3D sketches through natural language remains a major challenge. In this work, we introduce 3DrawAgent, a training-free, language-driven framework for 3D sketch generation that leverages large language models (LLMs) to sequentially draw 3D Bezier curves under geometric feedback. Unlike prior 2D sketch agents, our method introduces a relative experience optimization strategy that adapts the recently proposed Group Reward Policy Optimization (GRPO) paradigm. Instead of relying on explicit ground-truth supervision, we construct pairwise comparisons among generated sketches, with each pair consisting of a relatively better and a worse result based on CLIP-based perceptual rewards and LLM-based fine-grained qualitative assessment. These experiences are then used to iteratively refine the prior knowledge of 3D drawing, enabling black-box reinforcement of the model's 3D awareness. This design allows our model to self-improve its spatial understanding and drawing quality without parameter updates. Experiments show that 3DrawAgent can generate complex and coherent 3D Bezier sketches from diverse textual prompts, exhibit emergent geometric reasoning, and generalize to novel shapes, establishing a new paradigm for advancing the field of training-free 3D sketch intelligence.
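The sketch primitive named in the abstract is the 3D Bezier curve. As a minimal illustration of what an agent would need to emit, the snippet below evaluates a curve from its control points with de Casteljau's algorithm, which works for any degree; the four control points are hypothetical, and the abstract does not state which degree or representation the paper uses.

```python
def bezier_point(control_points, t):
    """Evaluate a Bezier curve of any degree at t in [0, 1] (de Casteljau)."""
    points = [tuple(p) for p in control_points]
    while len(points) > 1:
        # Repeated linear interpolation between consecutive control points.
        points = [tuple((1 - t) * a + t * b for a, b in zip(p, q))
                  for p, q in zip(points, points[1:])]
    return points[0]


def sample_curve(control_points, num_samples=5):
    """Return evenly spaced (in t) points along the curve for rendering."""
    return [bezier_point(control_points, i / (num_samples - 1))
            for i in range(num_samples)]


# Hypothetical stroke: four 3D control points, i.e. one cubic curve.
stroke = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0), (1.0, 0.0, 1.0)]
for point in sample_curve(stroke):
    print(tuple(round(c, 3) for c in point))
```

A stroke is fully described by a handful of coordinates, which is a compact enough output for a language model to generate as text and for a renderer to turn into an image that can be scored.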

[125] Adapting Foundation Models for Annotation-Efficient Adnexal Mass Segmentation in Cine Images

Francesca Fati,Alberto Rota,Adriana V. Gregory,Anna Catozzo,Maria C. Giuliano,Mrinal Dhar,Luigi De Vitis,Annie T. Packard,Francesco Multinu,Elena De Momi,Carrie L. Langstraat,Timothy L. Kline

Main category: cs.CV

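TL;DR: This paper proposes a label-efficient framework for adnexal mass segmentation in ultrasound built on a DINOv3 vision transformer with a DPT-style decoder, which clearly outperforms conventional fully supervised CNN models in low-data and domain-shift settings.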

Details

Motivation: Ultrasound evaluation of adnexal masses is subjective and shows large inter-observer variability; conventional fully supervised segmentation models need many pixel-level annotations and degrade under the domain shifts common in medical imaging. Method: A pretrained DINOv3 backbone supplies robust semantic priors, and a Dense Prediction Transformer (DPT)-style decoder reassembles multi-scale features to combine global semantics with fine spatial detail. Result: On 7,777 clinical ultrasound frames the method reaches a Dice score of 0.945 with better boundary accuracy (95th-percentile Hausdorff Distance reduced by 11.4%), and it retains higher performance than fully supervised baselines when trained on only 25% of the data. Conclusion: Large-scale self-supervised foundation models can mitigate annotation scarcity and domain shift in medical image segmentation, giving an efficient, workable option for data-constrained clinical settings. Abstract: Adnexal mass evaluation via ultrasound is a challenging clinical task, often hindered by subjective interpretation and significant inter-observer variability. While automated segmentation is a foundational step for quantitative risk assessment, traditional fully supervised convolutional architectures frequently require large amounts of pixel-level annotations and struggle with domain shifts common in medical imaging. In this work, we propose a label-efficient segmentation framework that leverages the robust semantic priors of a pretrained DINOv3 foundational vision transformer backbone. By integrating this backbone with a Dense Prediction Transformer (DPT)-style decoder, our model hierarchically reassembles multi-scale features to combine global semantic representations with fine-grained spatial details. Evaluated on a clinical dataset of 7,777 annotated frames from 112 patients, our method achieves state-of-the-art performance compared to established fully supervised baselines, including U-Net, U-Net++, DeepLabV3, and MAnet. Specifically, we obtain a Dice score of 0.945 and improved boundary adherence, reducing the 95th-percentile Hausdorff Distance by 11.4% relative to the strongest convolutional baseline. Furthermore, we conduct an extensive efficiency analysis demonstrating that our DINOv3-based approach retains significantly higher performance under data starvation regimes, maintaining strong results even when trained on only 25% of the data. These results suggest that leveraging large-scale self-supervised foundations provides a promising and data-efficient solution for medical image segmentation in data-constrained clinical environments. Project Repository: https://github.com/FrancescaFati/MESA

[126] Guiding a Diffusion Model by Swapping Its Tokens

Weijia Zhang,Yuehao Liu,Shanyan Guan,Wu Ran,Yanhao Ge,Wei Li,Chao Ma

Main category: cs.CV

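TL;DR: This paper proposes Self-Swap Guidance (SSG), which swaps the most semantically dissimilar token latents along the spatial or channel dimension to obtain an effect similar to Classifier-Free Guidance (CFG); it applies to both conditional and unconditional diffusion generation, improves image fidelity and prompt alignment, and is more robust.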

Details

Motivation: Classifier-Free Guidance (CFG) improves diffusion image quality but depends on text conditions and cannot be used for unconditional generation; existing condition-free guidance methods perturb coarsely and globally, without fine-grained control. Method: Self-Swap Guidance (SSG) swaps pairs of the most semantically dissimilar token latents (along the spatial or channel dimension) during sampling to produce a perturbed prediction, then uses the direction between it and the clean prediction to steer sampling, giving fine-grained, selective latent perturbation. Result: On MS-COCO 2014/2017 and ImageNet, SSG outperforms previous condition-free guidance methods in image fidelity and prompt alignment, with fewer side effects and greater robustness across perturbation strengths. Conclusion: SSG extends the idea of CFG to a unified framework for conditional and unconditional generation and can be plugged into diffusion models without additional training. Abstract: Classifier-Free Guidance (CFG) is a widely used inference-time technique to boost the image quality of diffusion models. Yet, its reliance on text conditions prevents its use in unconditional generation. We propose a simple method to enable CFG-like guidance for both conditional and unconditional generation. The key idea is to generate a perturbed prediction via simple token swap operations, and use the direction between it and the clean prediction to steer sampling towards higher-fidelity distributions. In practice, we swap pairs of most semantically dissimilar token latents in either spatial or channel dimensions. Unlike existing methods that apply perturbation in a global or less constrained manner, our approach selectively exchanges and recomposes token latents, allowing finer control over perturbation and its influence on generated samples. Experiments on MS-COCO 2014, MS-COCO 2017, and ImageNet datasets demonstrate that the proposed Self-Swap Guidance (SSG), when applied to popular diffusion models, outperforms previous condition-free methods in image fidelity and prompt alignment under different set-ups. Its fine-grained perturbation granularity also improves robustness, reducing side-effects across a wider range of perturbation strengths. Overall, SSG extends CFG to a broader scope of applications including both conditional and unconditional generation, and can be readily inserted into any diffusion model as a plug-in to gain immediate improvements.
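The two steps the abstract describes, swapping dissimilar token latents and steering along the direction between clean and perturbed predictions, can be sketched as follows. This is a simplified illustration under stated assumptions: cosine similarity as the dissimilarity measure, a single swapped pair, and a CFG-style extrapolation formula are all assumptions, not details confirmed by the abstract.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def swap_most_dissimilar(tokens):
    """Return a copy of tokens with the least-similar pair exchanged."""
    best = None
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            similarity = cosine(tokens[i], tokens[j])
            if best is None or similarity < best[0]:
                best = (similarity, i, j)
    swapped = [list(t) for t in tokens]
    if best is not None:
        _, i, j = best
        swapped[i], swapped[j] = swapped[j], swapped[i]
    return swapped


def guide(clean_pred, perturbed_pred, scale):
    """Assumed CFG-style rule: move away from the perturbed prediction."""
    return [c + scale * (c - p) for c, p in zip(clean_pred, perturbed_pred)]


tokens = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]]
print(swap_most_dissimilar(tokens))        # tokens 0 and 2 trade places
print(guide([1.0, 2.0], [0.5, 2.5], 2.0))  # extrapolates past the clean prediction
```

In a real sampler the perturbed prediction would come from running the denoiser on the swapped latents at each step; here both predictions are passed in directly to keep the example self-contained.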

[127] ABMAMBA: Multimodal Large Language Model with Aligned Hierarchical Bidirectional Scan for Efficient Video Captioning

Daichi Yashima,Shuhei Kurita,Yusuke Oda,Shuntaro Suzuki,Seitaro Otsuki,Komei Sugiura

Main category: cs.CV

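TL;DR: This paper proposes ABMamba, a fully open video multimodal large language model built on state space models; a hierarchical bidirectional scan gives it linear computational complexity and improves the efficiency of video captioning while keeping performance competitive.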

Details

Motivation: Transformer-based video understanding models incur high computational cost and scale poorly on long video sequences because attention is quadratic in sequence length. Method: Aligned Hierarchical Bidirectional Scan Mamba (ABMamba) uses a deep state space model as its language backbone in place of self-attention and adds an aligned hierarchical bidirectional scan module that processes video sequences at multiple temporal resolutions. Result: On standard video captioning benchmarks such as VATEX and MSR-VTT, ABMamba performs competitively with typical multimodal large language models while achieving roughly three times higher throughput. Conclusion: ABMamba shows that linear-complexity state space models are effective for open video multimodal large language models and offers a new paradigm for efficient long-video understanding. Abstract: In this study, we focus on video captioning by fully open multimodal large language models (MLLMs). The comprehension of visual sequences is challenging because of their intricate temporal dependencies and substantial sequence length. The core attention mechanisms of existing Transformer-based approaches scale quadratically with the sequence length, making them computationally prohibitive. To address these limitations, we propose Aligned Hierarchical Bidirectional Scan Mamba (ABMamba), a fully open MLLM with linear computational complexity that enables the scalable processing of video sequences. ABMamba extends Deep State Space Models as its language backbone, replacing the costly quadratic attention mechanisms, and employs a novel Aligned Hierarchical Bidirectional Scan module that processes videos across multiple temporal resolutions. On standard video captioning benchmarks such as VATEX and MSR-VTT, ABMamba demonstrates competitive performance compared to typical MLLMs while achieving approximately three times higher throughput.

[128] EEG2Vision: A Multimodal EEG-Based Framework for 2D Visual Reconstruction in Cognitive Neuroscience

Emanuele Balloni,Emanuele Frontoni,Chiara Matti,Marina Paolanti,Roberto Pierdicca,Emiliano Santarnecchi

Main category: cs.CV

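TL;DR: This paper proposes EEG2Vision, a framework that combines EEG-conditioned diffusion reconstruction with a prompt-guided post-reconstruction boosting mechanism to markedly improve the quality of visual stimulus reconstruction from low-density EEG signals, remaining feasible even at low channel counts.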

Details

Motivation: Non-invasive electroencephalography (EEG) has low spatial resolution and high noise, which makes visual stimulus reconstruction very challenging, especially under realistic low-density electrode configurations. Method: Proposes EEG2Vision, a modular end-to-end EEG-to-image framework consisting of EEG-conditioned diffusion reconstruction and prompt-guided post-reconstruction boosting based on a multimodal large language model: a diffusion model first generates an initial image, an MLLM then extracts a semantic description, and that description drives an image-to-image diffusion model that refines geometry and perceptual coherence while preserving EEG-grounded structure. Result: Reducing the channel count sharply lowers semantic decoding accuracy (e.g., 50-way Top-1 accuracy from 89% to 38%) but degrades reconstruction quality only slightly (FID from 76.77 to 80.51); boosting improves perceptual metrics in every configuration, with Inception Score gains of up to 9.71% in low-channel settings; a user study confirms a clear perceptual preference for the boosted images. Conclusion: EEG2Vision substantially improves the feasibility of real-time brain-to-image applications on low-resolution EEG devices and could help move such technology out of the laboratory and into practical use. Abstract: Reconstructing visual stimuli from non-invasive electroencephalography (EEG) remains challenging due to its low spatial resolution and high noise, particularly under realistic low-density electrode configurations. To address this, we present EEG2Vision, a modular, end-to-end EEG-to-image framework that systematically evaluates reconstruction performance across different EEG resolutions (128, 64, 32, and 24 channels) and enhances visual quality through a prompt-guided post-reconstruction boosting mechanism. Starting from EEG-conditioned diffusion reconstruction, the boosting stage uses a multimodal large language model to extract semantic descriptions and leverages image-to-image diffusion to refine geometry and perceptual coherence while preserving EEG-grounded structure. Our experiments show that semantic decoding accuracy degrades significantly with channel reduction (e.g., 50-way Top-1 Acc from 89% to 38%), while reconstruction quality decreases only slightly (e.g., FID from 76.77 to 80.51). The proposed boosting consistently improves perceptual metrics across all configurations, achieving up to 9.71% IS gains in low-channel settings. A user study confirms the clear perceptual preference for boosted reconstructions. The proposed approach significantly boosts the feasibility of real-time brain-2-image applications using low-resolution EEG devices, potentially unlocking this type of application outside laboratory settings.

[129] Small Vision-Language Models are Smart Compressors for Long Video Understanding

Junjie Fei,Jun Chen,Zechun Liu,Yunyang Xiong,Chong Zhou,Wei Wen,Junlin Han,Mingchen Zhuge,Saksham Suri,Qi Qian,Shuming Liu,Lemeng Wu,Raghuraman Krishnamoorthi,Vikas Chandra,Mohamed Elhoseiny,Chenchen Zhu

Main category: cs.CV

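TL;DR: Tempo is an efficient query-aware compression framework for long video understanding. It uses a Small Vision-Language Model (SVLM) for early cross-modal distillation and a training-free Adaptive Token Allocation (ATA) scheme that dynamically allocates visual tokens, preserving semantic integrity and causality under a strict budget.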

Details

Motivation: Existing methods are limited by context length on hour-long videos: dense visual streams exceed token budgets and worsen the lost-in-the-middle phenomenon, while heuristic sampling strategies blindly discard key frames or waste bandwidth, failing to balance fidelity and efficiency. Method: Proposes Tempo: (1) a Small Vision-Language Model (SVLM) serves as a local temporal compressor, casting token reduction as cross-modal distillation in a single forward pass; (2) Adaptive Token Allocation (ATA) exploits the SVLM's zero-shot relevance prior and semantic front-loading to provide training-free, O(1) dynamic routing that allocates token density by query importance without breaking causality. Result: On LVBench (4101 s), the 6B model scores 52.3 under an 8K visual token budget, outperforming GPT-4o and Gemini 1.5 Pro; scaling to 2048 frames reaches 53.7; it supports aggressive dynamic compression at 0.5-16 tokens/frame, well below the theoretical token limit. Conclusion: The key to true long-form video understanding is intent-driven, efficient compression rather than ever-larger context windows; Tempo shows that a lightweight, query-aware, causality-preserving compression paradigm can overcome the bottleneck of conventional long-video modeling. Abstract: Adapting Multimodal Large Language Models (MLLMs) for hour-long videos is bottlenecked by context limits. Dense visual streams saturate token budgets and exacerbate the lost-in-the-middle phenomenon. Existing heuristics, like sparse sampling or uniform pooling, blindly sacrifice fidelity by discarding decisive moments and wasting bandwidth on irrelevant backgrounds. We propose Tempo, an efficient query-aware framework compressing long videos for downstream understanding. Tempo leverages a Small Vision-Language Model (SVLM) as a local temporal compressor, casting token reduction as an early cross-modal distillation process to generate compact, intent-aligned representations in a single forward pass. To enforce strict budgets without breaking causality, we introduce Adaptive Token Allocation (ATA). Exploiting the SVLM's zero-shot relevance prior and semantic front-loading, ATA acts as a training-free $O(1)$ dynamic router. It allocates dense bandwidth to query-critical segments while compressing redundancies into minimal temporal anchors to maintain the global storyline. Extensive experiments show our 6B architecture achieves state-of-the-art performance with aggressive dynamic compression (0.5-16 tokens/frame). On the extreme-long LVBench (4101s), Tempo scores 52.3 under a strict 8K visual budget, outperforming GPT-4o and Gemini 1.5 Pro. Scaling to 2048 frames reaches 53.7. Crucially, Tempo compresses hour-long videos substantially below theoretical limits, proving true long-form video understanding relies on intent-driven efficiency rather than greedily padded context windows.

[130] Brain3D: EEG-to-3D Decoding of Visual Representations via Multimodal Reasoning

Emanuele Balloni,Emanuele Frontoni,Chiara Matti,Marina Paolanti,Roberto Pierdicca,Emiliano Santarnecchi

Main category: cs.CV

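TL;DR: This paper proposes Brain3D, a framework that reconstructs 3D geometry from EEG signals through a staged multimodal pipeline (EEG → image → 3D description → diffusion-based generation → 3D mesh), achieving good results at both the semantic and geometric levels.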

Details

Motivation: Existing work focuses mainly on EEG-to-2D-image reconstruction, while EEG-to-3D reconstruction remains largely unexplored, limiting the potential of neural decoding for geometric understanding and practical applications. Method: Proposes the Brain3D architecture: EEG is first decoded into an image; a multimodal large language model then extracts a structured 3D-aware description; that description guides a diffusion model to generate an image; finally a single-image-to-3D model converts the result into a coherent 3D mesh. The pipeline avoids a direct EEG-to-3D mapping. Result: Experiments show up to 85.4% 10-way Top-1 EEG decoding accuracy and a CLIPScore of 0.648, supporting the feasibility of multimodal EEG-driven 3D reconstruction. Conclusion: Brain3D offers a scalable, geometry-aware 3D reconstruction paradigm for brain-computer interfaces and neuroscience, advancing the understanding and use of neural signals as embodied 3D representations. Abstract: Decoding visual information from electroencephalography (EEG) has recently achieved promising results, primarily focusing on reconstructing two-dimensional (2D) images from brain activity. However, the reconstruction of three-dimensional (3D) representations remains largely unexplored. This limits the geometric understanding and reduces the applicability of neural decoding in different contexts. To address this gap, we propose Brain3D, a multimodal architecture for EEG-to-3D reconstruction based on EEG-to-image decoding. It progressively transforms neural representations into the 3D domain using geometry-aware generative reasoning. Our pipeline first produces visually grounded images from EEG signals, then employs a multimodal large language model to extract structured 3D-aware descriptions, which guide a diffusion-based generation stage whose outputs are finally converted into coherent 3D meshes via a single-image-to-3D model. By decomposing the problem into structured stages, the proposed approach avoids direct EEG-to-3D mappings and enables scalable brain-driven 3D generation. We conduct a comprehensive evaluation comparing the reconstructed 3D outputs against the original visual stimuli, assessing both semantic alignment and geometric fidelity. Experimental results demonstrate strong performance of the proposed architecture, achieving up to 85.4% 10-way Top-1 EEG decoding accuracy and 0.648 CLIPScore, supporting the feasibility of multimodal EEG-driven 3D reconstruction.

[131] Can Vision Language Models Judge Action Quality? An Empirical Evaluation

Miguel Monte e Freitas,Rui Henriques,Ricardo Rei,Pedro Henrique Martins

Main category: cs.CV

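TL;DR: This paper systematically evaluates how well vision language models (VLMs) actually perform on Action Quality Assessment (AQA). Current state-of-the-art models do only marginally better than random chance, exhibit two systematic biases, and gain little from existing improvement strategies, revealing a fundamental difficulty with fine-grained movement quality assessment.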

Details

Motivation: Although VLMs hold considerable promise for AQA, their actual performance in this domain has not been systematically characterised; a comprehensive evaluation is needed to establish capability limits and failure modes. Method: Evaluates several frontier VLMs (e.g., Gemini 3.1 Pro, Qwen3-VL, InternVL3.5) across activity domains (fitness, figure skating, diving), task settings, input representations (such as skeleton information), and prompting strategies (grounding instructions, reasoning structures, in-context learning), and analyses prediction distributions to identify systematic biases. Result: All models perform only marginally above chance on AQA; skeleton information, grounding instructions, reasoning structures, and in-context learning yield only isolated gains, and none is consistently effective; two systematic biases emerge: a tendency to predict "correct" regardless of visual evidence, and sensitivity to linguistic framing; contrastive task reformulation also helps little. Conclusion: Current VLMs have fundamental limitations in fine-grained action quality assessment that go beyond the identified biases; the key failure modes must be addressed before reliable real-world deployment. Abstract: Action Quality Assessment (AQA) has broad applications in physical therapy, sports coaching, and competitive judging. Although Vision Language Models (VLMs) hold considerable promise for AQA, their actual performance in this domain remains largely uncharacterised. We present a comprehensive evaluation of state-of-the-art VLMs across activity domains (e.g. fitness, figure skating, diving), tasks, representations, and prompting strategies. Baseline results reveal that Gemini 3.1 Pro, Qwen3-VL and InternVL3.5 models perform only marginally above random chance, and although strategies such as incorporation of skeleton information, grounding instructions, reasoning structures and in-context learning lead to isolated gains, none is consistently effective. Analysis of prediction distributions uncovers two systematic biases: a tendency to predict correct execution regardless of visual evidence, and a sensitivity to superficial linguistic framing. Reformulating tasks contrastively to mitigate these biases yields minimal improvement, suggesting that the models' limitations go beyond these biases, pointing to a fundamental difficulty with fine-grained movement quality assessment. Our findings establish a rigorous baseline for future VLM-based AQA research and provide an actionable outline for failure modes requiring mitigation prior to reliable real-world deployment.

[132] AtlasOCR: Building the First Open-Source Darija OCR Model with Vision Language Models

Imane Momayiz,Soufiane Ait Elaouad,Abdeljalil Elmajjodi,Haitame Bouanane

Main category: cs.CV

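TL;DR: This paper presents AtlasOCR, the first open-source OCR model dedicated to Moroccan Arabic (Darija), obtained by fine-tuning the 3-billion-parameter vision language model (VLM) Qwen2.5-VL. A dedicated dataset is built from synthetic data (generated with the OCRSmith library) and real-world data, fine-tuning is made efficient with QLoRA and Unsloth, and the model reaches state-of-the-art performance on the new AtlasOCRBench benchmark and on KITAB-Bench.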

Details

Motivation: Moroccan Arabic (Darija) is rich in visual content but lacks dedicated OCR tools, and existing models support it poorly. Method: Fine-tunes the Qwen2.5-VL 3B vision language model parameter-efficiently with QLoRA and Unsloth; builds a Darija-specific dataset combining synthetic data generated with OCRSmith and carefully sourced real-world data; runs hyperparameter ablation studies. Result: Achieves state-of-the-art results on both the newly curated AtlasOCRBench and the established KITAB-Bench, outperforming larger models and showing strong robustness and generalization on Darija and standard Arabic OCR. Conclusion: AtlasOCR shows that lightweight, efficiently fine-tuned VLMs are feasible and effective for OCR in low-resource dialects, and provides a new paradigm and an open-source foundation for OCR research on Arabic dialects. Abstract: Darija, the Moroccan Arabic dialect, is rich in visual content yet lacks specialized Optical Character Recognition (OCR) tools. This paper introduces AtlasOCR, the first open-source Darija OCR model built by fine-tuning a 3B parameter Vision Language Model (VLM). We detail our comprehensive approach, from curating a unique Darija-specific dataset leveraging both synthetic generation with our OCRSmith library and carefully sourced real-world data, to implementing efficient fine-tuning strategies. We utilize QLoRA and Unsloth for parameter-efficient training of Qwen2.5-VL 3B and present comprehensive ablation studies optimizing key hyperparameters. Our evaluation on the newly curated AtlasOCRBench and the established KITAB-Bench demonstrates state-of-the-art performance, challenging larger models and highlighting AtlasOCR's robustness and generalization capabilities for both Darija and standard Arabic OCR tasks.

[133] Entropy-Gradient Grounding: Training-Free Evidence Retrieval in Vision-Language Models

Marcel Gröpl,Jaewoo Jung,Seungryong Kim,Marc Pollefeys,Sunghwan Hong

Main category: cs.CV

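TL;DR: This paper proposes a training-free visual grounding method driven by the model's intrinsic uncertainty: it generates relevance maps from entropy gradients and combines them with multi-region extraction and an iterative zoom-and-reground strategy, markedly improving understanding of fine-grained visual details and multi-clue compositional queries.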

Details

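Motivation: Pretrained vision-language models still perform poorly on tasks that depend on tiny visual details or on combining clues from multiple regions (e.g., document understanding and compositional queries). Method: Proposes a training-free, model-intrinsic grounding method: the entropy of the model's next-token distribution serves as the supervision signal and is backpropagated to the visual token embeddings to produce an entropy-gradient relevance map; multiple coherent regions are extracted and ranked to support multi-evidence queries; an iterative zoom-and-reground procedure with a spatial-entropy stopping rule is introduced. Result: Experiments on seven benchmarks and four VLM architectures show consistent improvements over existing methods, with the largest gains in detail-critical and high-resolution settings, along with more interpretable evidence localizations. Conclusion: Framing grounding as test-time evidence retrieval, with uncertainty as an intrinsic supervision signal, is an effective and general way to improve fine-grained visual reasoning in VLMs. Abstract: Despite rapid progress, pretrained vision-language models still struggle when answers depend on tiny visual details or on combining clues spread across multiple regions, as in documents and compositional queries. We address this by framing grounding as test-time evidence retrieval: given a query, the model should actively identify where to look next to resolve ambiguity. To this end, we propose a training-free, model-intrinsic grounding method that uses uncertainty as supervision. Specifically, we compute the entropy of the model's next-token distribution and backpropagate it to the visual token embeddings to obtain an entropy-gradient relevance map, without auxiliary detectors or attention-map heuristics. We then extract and rank multiple coherent regions to support multi-evidence queries, and introduce an iterative zoom-and-reground procedure with a spatial-entropy stopping rule to avoid over-refinement. Experiments on seven benchmarks across four VLM architectures demonstrate consistent improvements over existing methods, with the largest gains on detail-critical and high-resolution settings, while also producing more interpretable evidence localizations.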
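
The first link in the backpropagation chain described in the abstract is the gradient of the next-token entropy with respect to the logits. The sketch below computes that gradient in closed form, using H = -sum_i p_i log p_i with p = softmax(z), which gives dH/dz_i = -p_i (log p_i + H). Propagating it further to the visual token embeddings needs the model's own automatic differentiation and is not shown; the function names are illustrative assumptions, not the authors' code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)

def entropy_grad(logits):
    """Closed-form gradient dH/dz_i = -p_i * (log p_i + H)."""
    probs = softmax(logits)
    h = entropy(logits)
    return [-p * (math.log(p) + h) if p > 0.0 else 0.0 for p in probs]
```

The gradient components always sum to zero, because adding a constant to every logit leaves the softmax, and hence the entropy, unchanged.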

[134] Tensor-Augmented Convolutional Neural Networks: Enhancing Expressivity with Generic Tensor Kernels

Chia-Wei Hsing,Wei-Lin Tu

Main category: cs.CV

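TL;DR: This paper proposes a physically-guided shallow tensor-augmented convolutional neural network (TACNN) that replaces conventional convolution kernels with generic tensors to increase representational capacity and capture high-order feature correlations; with only two layers it matches the accuracy of VGG-16 and GoogLeNet on Fashion-MNIST.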

Details

Motivation: Conventional CNNs rely on deep architectures to capture complex correlations, which makes them computationally expensive and hard to interpret; a model that combines high expressivity with structural simplicity is needed. Method: Proposes the tensor-augmented CNN (TACNN), which generalizes convolution kernels to higher-order tensors so that each layer's output becomes a multilinear form able to model high-order feature correlations, and draws on the physical fact that tensors can represent quantum superposition states in Hilbert space to increase expressivity. Result: On Fashion-MNIST, a TACNN with only two convolution layers reaches 93.7% test accuracy, surpassing or matching VGG-16 (93.5%) and GoogLeNet (93.7%). Conclusion: Through physics-inspired tensor design, TACNN markedly increases expressivity while keeping a shallow architecture, opening a path toward more efficient and interpretable deep learning models. Abstract: Convolutional Neural Networks (CNNs) excel at extracting local features hierarchically, but their performance in capturing complex correlations hinges heavily on deep architectures, which are usually computationally demanding and difficult to interpret. To address these issues, we propose a physically-guided shallow model: tensor-augmented CNN (TACNN), which replaces conventional convolution kernels with generic tensors to enhance representational capacity. This choice is motivated by the fact that an order-$N$ tensor naturally encodes an arbitrary quantum superposition state in the Hilbert space of dimension $d^N$, where $d$ is the local physical dimension, thus offering substantially richer expressivity. Furthermore, in our design the convolution output of each layer becomes a multilinear form capable of capturing high-order feature correlations, thereby equipping a shallow multilayer architecture with an expressive power competitive with that of deep CNNs. On the Fashion-MNIST benchmark, TACNN demonstrates clear advantages over conventional CNNs, achieving remarkable accuracies with only a few layers. In particular, a TACNN with only two convolution layers attains a test accuracy of 93.7%, surpassing or matching considerably deeper models such as VGG-16 (93.5%) and GoogLeNet (93.7%). These findings highlight TACNN as a promising framework that strengthens model expressivity while preserving architectural simplicity, paving the way towards more interpretable and efficient deep learning models.

[135] What They Saw, Not Just Where They Looked: Semantic Scanpath Similarity via VLMs and NLP metric

Mohamed Amine Kerkouri,Marouane Tliba,Bin Wang,Aladine Chetouani,Ulas Bagci,Alessandro Bruno

Main category: cs.CV

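TL;DR: This paper proposes a semantic scanpath similarity framework based on vision-language models (VLMs): fixations are encoded as textual descriptions and compared by semantic similarity, complementing conventional scanpath analysis that relies only on spatial and temporal alignment.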

Details

Motivation: Existing scanpath similarity metrics focus on spatial and temporal alignment and neglect semantic equivalence between attended image regions. Method: A vision-language model encodes each fixation under controlled visual context (patch-based and marker-based) into a concise textual description, and the descriptions are aggregated into scanpath-level representations; semantic similarity is then computed with embedding-based and lexical NLP metrics and compared against established spatial measures such as MultiMatch and DTW. Result: Experiments show that semantic similarity captures variance partially independent of geometric alignment, revealing cases where content agrees despite large spatial differences; the contextual encoding strategy affects description fidelity and metric stability. Conclusion: Multimodal foundation models enable interpretable, content-aware extensions of classical scanpath analysis, providing a complementary dimension for gaze research in the ETRA community. Abstract: Scanpath similarity metrics are central to eye-movement research, yet existing methods predominantly evaluate spatial and temporal alignment while neglecting semantic equivalence between attended image regions. We present a semantic scanpath similarity framework that integrates vision-language models (VLMs) into eye-tracking analysis. Each fixation is encoded under controlled visual context (patch-based and marker-based strategies) and transformed into concise textual descriptions, which are aggregated into scanpath-level representations. Semantic similarity is then computed using embedding-based and lexical NLP metrics and compared against established spatial measures, including MultiMatch and DTW. Experiments on free-viewing eye-tracking data demonstrate that semantic similarity captures partially independent variance from geometric alignment, revealing cases of high content agreement despite spatial divergence. We further analyze the impact of contextual encoding on description fidelity and metric stability. Our findings suggest that multimodal foundation models enable interpretable, content-aware extensions of classical scanpath analysis, providing a complementary dimension for gaze research within the ETRA community.

[136] DinoRADE: Full Spectral Radar-Camera Fusion with Vision Foundation Model Features for Multi-class Object Detection in Adverse Weather

Christof Leitgeb,Thomas Puchleitner,Max Peter Ronecker,Daniel Watzenig

Main category: cs.CV

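TL;DR: This paper proposes DinoRADE, a Radar-centered detection framework that fuses FMCW Radar tensors with DINOv3 vision features via deformable cross-attention. On the K-Radar dataset it markedly improves detection in adverse weather, especially for vulnerable road users, outperforming existing Radar-camera methods by 12.1%.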

Details

Motivation: FMCW Radar detects well in adverse weather but struggles to resolve fine-grained spatial detail, which particularly hurts detection of small and vulnerable road users (VRUs); systematic study of VRU detection on adverse-weather datasets such as K-Radar is also lacking. Method: Proposes the DinoRADE detection pipeline, which processes dense Radar tensors and uses deformable cross-attention to aggregate vision features, extracted by a DINOv3 vision foundation model, around transformed reference points in the camera perspective. Result: Provides a comprehensive performance evaluation on K-Radar across all weather conditions and is among the first to report detection results separately for five object classes; in a comparison that also covers existing single-class detection approaches, it outperforms recent Radar-camera methods by 12.1%. Conclusion: DinoRADE improves the ability of Radar-centered multimodal perception to detect VRUs and other small objects in adverse weather, supporting the feasibility and benefit of combining vision foundation model features with Radar tensor modeling. Abstract: Reliable and weather-robust perception systems are essential for safe autonomous driving and typically employ multi-modal sensor configurations to achieve comprehensive environmental awareness. While recent automotive FMCW Radar-based approaches achieved remarkable performance on detection tasks in adverse weather conditions, they exhibited limitations in resolving fine-grained spatial details particularly critical for detecting smaller and vulnerable road users (VRUs). Furthermore, existing research has not adequately addressed VRU detection in adverse weather datasets such as K-Radar. We present DinoRADE, a Radar-centered detection pipeline that processes dense Radar tensors and aggregates vision features around transformed reference points in the camera perspective via deformable cross-attention. Vision features are provided by a DINOv3 Vision Foundation Model. We present a comprehensive performance evaluation on the K-Radar dataset in all weather conditions and are among the first to report detection performance individually for five object classes. Additionally, we compare our method with existing single-class detection approaches and outperform recent Radar-camera approaches by 12.1%. The code is available at https://github.com/chr-is-tof/RADE-Net.

[137] OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks

Wenbo Hu,Xin Chen,Yan Gao-Tian,Yihe Deng,Nanyun Peng,Kai-Wei Chang

Main category: cs.CV

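TL;DR: This paper proposes G²RPO, a new reinforcement learning training objective that uses non-linear distributional matching to make the advantage distribution converge to a standard normal distribution, addressing the large variance in reward topologies and the difficulty of balancing perception and reasoning in multimodal large models; combined with response length shaping and entropy shaping, it yields the high-performing open-source multimodal model OpenVLThinkerV2.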

Details

Motivation: Applying Group Relative Policy Optimization (GRPO) to open-source multimodal generalist models is constrained mainly by the extreme variance in reward topologies across visual tasks and by the difficulty of balancing fine-grained perception with multi-step reasoning. Method: Proposes Gaussian GRPO (G$^2$RPO), which replaces linear scaling with non-linear distributional matching to force each task's advantage distribution to converge to a standard normal distribution; introduces two task-level shaping mechanisms, response length shaping (dynamically regulating reasoning chain length) and entropy shaping (bounding the exploration range). Result: Builds OpenVLThinkerV2, which outperforms strong open-source and leading proprietary frontier models across 18 diverse benchmarks, supporting the method's gains in stability and generalization. Conclusion: G$^2$RPO and its accompanying shaping mechanisms mitigate gradient imbalance, outlier sensitivity, and perception-reasoning imbalance in multimodal RL training, offering a more robust and scalable RL training paradigm for open-source multimodal generalist models. Abstract: Group Relative Policy Optimization (GRPO) has emerged as the de facto Reinforcement Learning (RL) objective driving recent advancements in Multimodal Large Language Models. However, extending this success to open-source multimodal generalist models remains heavily constrained by two primary challenges: the extreme variance in reward topologies across diverse visual tasks, and the inherent difficulty of balancing fine-grained perception with multi-step reasoning capabilities. To address these issues, we introduce Gaussian GRPO (G$^2$RPO), a novel RL training objective that replaces standard linear scaling with non-linear distributional matching. By mathematically forcing the advantage distribution of any given task to strictly converge to a standard normal distribution, $\mathcal{N}(0,1)$, G$^2$RPO theoretically ensures inter-task gradient equity, mitigates vulnerabilities to heavy-tail outliers, and offers symmetric updates for positive and negative rewards. Leveraging the enhanced training stability provided by G$^2$RPO, we introduce two task-level shaping mechanisms to seamlessly balance perception and reasoning. First, response length shaping dynamically elicits extended reasoning chains for complex queries while enforcing direct outputs to bolster visual grounding. Second, entropy shaping tightly bounds the model's exploration zone, effectively preventing both entropy collapse and entropy explosion. Integrating these methodologies, we present OpenVLThinkerV2, a highly robust, general-purpose multimodal model. Extensive evaluations across 18 diverse benchmarks demonstrate its superior performance over strong open-source and leading proprietary frontier models.
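
One simple way to realise a non-linear mapping of group rewards onto $\mathcal{N}(0,1)$ is a rank-based inverse normal transform, shown below next to the usual GRPO mean/standard-deviation scaling for contrast. The abstract does not say which mapping G$^2$RPO actually uses, so the transform, the `(rank - 0.5) / n` offset, and the tie handling here are assumptions for illustration only.

```python
from statistics import NormalDist, mean, pstdev

def grpo_advantages(rewards, eps=1e-8):
    """Standard linear scaling: (r - mean) / (std + eps) within one group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def gaussianized_advantages(rewards):
    """Rank-based inverse normal transform (assumed stand-in for G^2RPO)."""
    n = len(rewards)
    order = sorted(range(n), key=lambda i: rewards[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied rewards starting at sorted position i.
        j = i
        while j + 1 < n and rewards[order[j + 1]] == rewards[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank shared by ties
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()
    # (rank - 0.5) / n lies strictly inside (0, 1), so inv_cdf is defined.
    return [std_normal.inv_cdf((r - 0.5) / n) for r in ranks]
```

Because only ranks enter the transform, a single extreme reward receives the same advantage as a mildly larger one, and the resulting values are symmetric around zero; under linear scaling the same outlier would dominate the group.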

[138] AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding

Handong Li,Zikang Liu,Longteng Guo,Tongtian Yue,Yepeng Tang,Xinxin Zhu,Chuanyang Zheng,Ziming Wang,Zhibin Wang,Jun Song,Cheng Yu,Bo Zheng,Jing Liu

Main category: cs.CV

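TL;DR: AdaSpark is an adaptive sparsity framework that lowers the computational cost of long-video processing by adaptively selecting video cubes and key tokens, while preserving fine-grained perception and long-range temporal modeling.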

Details

Motivation: Existing Video-LLMs are computationally expensive on long videos, and efficiency methods often sacrifice fine-grained perception or restrict long-range temporal modeling. Method: Proposes AdaSpark, which partitions video into 3D spatio-temporal cubes and uses two co-designed components, Adaptive Cube-Selective Attention (AdaS-Attn) and Adaptive Token-Selective FFN (AdaS-FFN), together with an entropy-based Top-p selection mechanism that allocates compute dynamically. Result: On hour-scale video benchmarks, FLOPs drop by up to 57% with performance comparable to dense models, while fine-grained perception and long-range dependency modeling are preserved. Conclusion: AdaSpark balances computational efficiency and modeling capacity, making it practical and promising for long-video understanding. Abstract: Processing long-form videos with Video Large Language Models (Video-LLMs) is computationally prohibitive. Current efficiency methods often compromise fine-grained perception through irreversible information disposal or inhibit long-range temporal modeling via rigid, predefined sparse patterns. This paper introduces AdaSpark, an adaptive sparsity framework designed to address these limitations. AdaSpark first partitions video inputs into 3D spatio-temporal cubes. It then employs two co-designed, context-aware components: (1) Adaptive Cube-Selective Attention (AdaS-Attn), which adaptively selects a subset of relevant video cubes to attend for each query token, and (2) Adaptive Token-Selective FFN (AdaS-FFN), which selectively processes only the most salient tokens within each cube. An entropy-based (Top-p) selection mechanism adaptively allocates computational resources based on input complexity. Experiments demonstrate that AdaSpark reduces computational load by up to 57% in FLOPs while maintaining comparable performance to dense models and preserving fine-grained, long-range dependencies, as validated on challenging hour-scale video benchmarks.
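
The Top-p idea can be sketched as nucleus-style selection over normalized relevance scores: keep the smallest set of cubes whose cumulative probability mass reaches `p`, so that a peaked (low-entropy) distribution selects few cubes and a flat (high-entropy) one selects many. How AdaSpark computes its scores and couples them to entropy is not specified in the abstract; the function names and the threshold value below are assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of relevance scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_select(scores, p=0.9):
    """Indices of the smallest set of items whose probability mass reaches p."""
    probs = softmax(scores)
    # Visit items from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    selected, mass = [], 0.0
    for i in order:
        selected.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return sorted(selected)
```

Unlike a fixed Top-k rule, the number of selected cubes here varies with the input, which is the property the abstract attributes to adaptive allocation.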

[139] AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation

Ziwei Zhou,Zeyuan Lai,Rui Wang,Yifan Yang,Zhen Xing,Yuqing Yang,Qi Dai,Lili Qiu,Chong Luo

Main category: cs.CV

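TL;DR: This paper presents AVGen-Bench, a task-driven benchmark for text-to-audio-video (T2AV) generation, together with a multi-granular evaluation framework; it reveals that current T2AV models have pronounced weaknesses in semantic reliability (e.g., text rendering, speech coherence, physical reasoning, and pitch control).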

Details

Motivation: Existing T2AV evaluation is fragmented and fails to capture the fine-grained joint audio-video correctness that realistic prompts require. Method: Builds the AVGen-Bench benchmark with high-quality prompts across 11 real-world categories; proposes a multi-granular evaluation framework combining lightweight specialist models with multimodal large language models (MLLMs), covering perceptual quality through fine-grained semantic controllability. Result: Current strong models do well on audio-visual aesthetics but show clear gaps in semantic reliability, with widespread failures in text rendering, speech coherence, physical reasoning, and musical pitch control. Conclusion: T2AV generation needs more precise, task-driven evaluation; AVGen-Bench provides a new standard and open resources to advance the field. Abstract: Text-to-Audio-Video (T2AV) generation is rapidly becoming a core interface for media creation, yet its evaluation remains fragmented. Existing benchmarks largely assess audio and video in isolation or rely on coarse embedding similarity, failing to capture the fine-grained joint correctness required by realistic prompts. We introduce AVGen-Bench, a task-driven benchmark for T2AV generation featuring high-quality prompts across 11 real-world categories. To support comprehensive assessment, we propose a multi-granular evaluation framework that combines lightweight specialist models with Multimodal Large Language Models (MLLMs), enabling evaluation from perceptual quality to fine-grained semantic controllability. Our evaluation reveals a pronounced gap between strong audio-visual aesthetics and weak semantic reliability, including persistent failures in text rendering, speech coherence, physical reasoning, and a universal breakdown in musical pitch control. Code and benchmark resources are available at http://aka.ms/avgenbench.

[140] DiffVC: A Non-autoregressive Framework Based on Diffusion Model for Video Captioning

Junbo Wang,Liangyu Fu,Yuke Li,Yining Zhu,Ya Jing,Xuecheng Wu,Jiangbin Zheng

Main category: cs.CV

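TL;DR: This paper proposes DiffVC, a diffusion-based non-autoregressive video captioning framework: parallel decoding speeds up generation and reduces cumulative error, and a discriminative conditional diffusion model improves caption quality; on several benchmark datasets it matches autoregressive methods while being faster.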

Details

Motivation: Existing autoregressive video captioning methods are slow and accumulate errors, while non-autoregressive methods produce lower-quality captions because they model multimodal interaction insufficiently. Method: Proposes DiffVC, a diffusion-based non-autoregressive framework: the video is first encoded into a visual representation; during training, Gaussian noise is added to the textual representation of the ground-truth caption, and a discriminative denoiser conditioned on the visual representation reconstructs the textual representation; the result is fed to a non-autoregressive language model to generate the caption; at inference, generation starts directly from noise sampled from a Gaussian distribution. Result: On MSVD, MSR-VTT, and VATEX, DiffVC outperforms previous non-autoregressive methods and matches autoregressive ones, with maximum improvements of 9.9 in CIDEr and 2.6 in BLEU@4, while generating faster. Conclusion: DiffVC balances generation quality and efficiency, demonstrating the potential of diffusion models for non-autoregressive video captioning. Abstract: Current video captioning methods usually use an encoder-decoder structure to generate text autoregressively. However, autoregressive methods have inherent limitations such as slow generation speed and large cumulative error. Furthermore, the few non-autoregressive counterparts suffer from deficiencies in generation quality due to the lack of sufficient multimodal interaction modeling. Therefore, we propose a non-autoregressive framework based on Diffusion model for Video Captioning (DiffVC) to address these issues. Its parallel decoding can effectively solve the problems of generation speed and cumulative error. At the same time, our proposed discriminative conditional diffusion model can generate higher-quality textual descriptions. Specifically, we first encode the video into a visual representation. During training, Gaussian noise is added to the textual representation of the ground-truth caption. Then, a new textual representation is generated via the discriminative denoiser with the visual representation as a conditional constraint. Finally, we input the new textual representation into a non-autoregressive language model to generate captions. During inference, we directly sample noise from the Gaussian distribution for generation. Experiments on MSVD, MSR-VTT, and VATEX show that our method can outperform previous non-autoregressive methods and achieve comparable performance to autoregressive methods, e.g., it achieves a maximum improvement of 9.9 on CIDEr and 2.6 on B@4, while having faster generation speed. The source code will be available soon.

[141] Seeing but Not Thinking: Routing Distraction in Multimodal Mixture-of-Experts

Haolei Xu,Haiwen Hong,Hongxing Li,Rui Zhou,Yang Zhang,Longtao Huang,Hui Xue,Yongliang Shen,Weiming Lu,Yueting Zhuang

Main category: cs.CV

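TL;DR: This paper finds that multimodal Mixture-of-Experts (MoE) models exhibit a "Seeing but Not Thinking" phenomenon: they perceive image content accurately yet reason worse over visual inputs than over the same problems posed as pure text. It proposes the Routing Distraction hypothesis, that visual inputs keep the routing mechanism from adequately activating task-relevant reasoning experts, and designs a routing-guided intervention to improve performance.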

Details

Motivation: To explain why multimodal MoE models, despite perceiving images accurately on vision-language tasks, reason markedly worse over visual inputs than over pure text, i.e., the "Seeing but Not Thinking" phenomenon. Method: Verifies through systematic analysis that cross-modal semantic sharing exists; reveals layer-wise separation between visual experts and domain experts and routing divergence in middle layers; proposes the Routing Distraction hypothesis; designs a routing-guided intervention that enhances domain expert activation. Result: Experiments on three multimodal MoE models across six benchmarks show gains of up to 3.17% on complex visual reasoning tasks; domain expert identification locates cognitive functions rather than sample-specific solutions, supporting transfer across tasks. Conclusion: "Seeing but Not Thinking" stems from routing distraction induced by visual inputs rather than from semantic alignment failure; adjusting the routing mechanism improves multimodal MoE reasoning, and domain experts localize transferable cognitive functions. Abstract: Multimodal Mixture-of-Experts (MoE) models have achieved remarkable performance on vision-language tasks. However, we identify a puzzling phenomenon termed Seeing but Not Thinking: models accurately perceive image content yet fail in subsequent reasoning, while correctly solving identical problems presented as pure text. Through systematic analysis, we first verify that cross-modal semantic sharing exists in MoE architectures, ruling out semantic alignment failure as the sole explanation. We then reveal that visual experts and domain experts exhibit layer-wise separation, with image inputs inducing significant routing divergence from text inputs in middle layers where domain experts concentrate. Based on these findings, we propose the Routing Distraction hypothesis: when processing visual inputs, the routing mechanism fails to adequately activate task-relevant reasoning experts. To validate this hypothesis, we design a routing-guided intervention method that enhances domain expert activation. Experiments on three multimodal MoE models across six benchmarks demonstrate consistent improvements, with gains of up to 3.17% on complex visual reasoning tasks. Our analysis further reveals that domain expert identification locates cognitive functions rather than sample-specific solutions, enabling effective transfer across tasks with different information structures.

[142] Coordinate-Based Dual-Constrained Autoregressive Motion Generation

Kang Ding,Hongsong Wang,Jie Gui,Liang Wang

Main category: cs.CV

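TL;DR: This paper proposes CDAMD, a coordinate-based dual-constrained autoregressive motion generation framework that combines the strengths of autoregressive and diffusion models to address error amplification and mode collapse, reaching state-of-the-art performance on text-to-motion generation and editing.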

Details

Motivation: Diffusion models suffer from error amplification during noise prediction, autoregressive models exhibit mode collapse due to motion discretization, and work on coordinate-based motion synthesis is limited. Method: Proposes Coordinate-based Dual-constrained Autoregressive Motion Generation (CDAMD): it takes motion coordinates as input and follows the autoregressive paradigm; diffusion-inspired multi-layer perceptrons improve motion fidelity; a Dual-Constrained Causal Mask guides generation, with motion tokens acting as priors concatenated with textual encodings. Result: On newly established benchmarks for text-to-motion generation and motion editing, CDAMD achieves state-of-the-art performance in both motion fidelity and semantic consistency. Conclusion: By combining an autoregressive structure with ideas from diffusion and adding a dual-constraint mechanism, CDAMD achieves high-fidelity, semantically faithful text-driven motion generation and advances coordinate-level motion modeling. Abstract: Text-to-motion generation has attracted increasing attention in the research community recently, with potential applications in animation, virtual reality, robotics, and human-computer interaction. Diffusion and autoregressive models are two popular and parallel research directions for text-to-motion generation. However, diffusion models often suffer from error amplification during noise prediction, while autoregressive models exhibit mode collapse due to motion discretization. To address these limitations, we propose a flexible, high-fidelity, and semantically faithful text-to-motion framework, named Coordinate-based Dual-constrained Autoregressive Motion Generation (CDAMD). With motion coordinates as input, CDAMD follows the autoregressive paradigm and leverages diffusion-inspired multi-layer perceptrons to enhance the fidelity of predicted motions. Furthermore, a Dual-Constrained Causal Mask is introduced to guide autoregressive generation, where motion tokens act as priors and are concatenated with textual encodings. Since there is limited work on coordinate-based motion synthesis, we establish new benchmarks for both text-to-motion generation and motion editing. Experimental results demonstrate that our approach achieves state-of-the-art performance in terms of both fidelity and semantic consistency on these benchmarks.

[143] EPIR: An Efficient Patch Tokenization, Integration and Representation Framework for Micro-expression Recognition

Junbo Wang,Liangyu Fu,Yuke Li,Yining Zhu,Xuecheng Wu,Kun Hu

Main category: cs.CV

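TL;DR: This paper proposes EPIR, an efficient micro-expression recognition framework that uses dual norm shifted tokenization, token integration, and discriminative token extraction to improve recognition performance while lowering computational complexity.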

Details

Motivation: Existing Transformer-based micro-expression recognition methods have high computational complexity, and the small scale of available datasets makes it hard for them to learn effective representations. Method: The EPIR framework comprises (1) a dual norm shifted tokenization (DNSPT) module that models spatial relationships between pixels in the face region; (2) a token integration module that reduces the number of tokens without information loss; and (3) a discriminative token extractor, consisting of an improved attention mechanism and a dynamic token selection module (DTSM), that captures more discriminative micro-expression features. Result: On four datasets (CASME II, SAMM, SMIC, and CAS(ME)$^3$) the method clearly outperforms SOTA methods, e.g., a 9.6% UF1 improvement on CAS(ME)$^3$ and a 4.58% UAR improvement on SMIC. Conclusion: EPIR keeps recognition performance high while substantially reducing computational cost, offering a new approach to micro-expression recognition in resource-constrained settings. Abstract: Micro-expression recognition can reveal an individual's genuine emotion at a given moment. Although deep learning-based methods, especially Transformer-based methods, have achieved impressive results, they have high computational complexity due to the large number of tokens in multi-head self-attention. In addition, existing micro-expression datasets are small-scale, which makes it difficult for Transformer-based models to learn effective micro-expression representations. Therefore, we propose a novel Efficient Patch tokenization, Integration and Representation framework (EPIR), which balances high recognition performance and low computational complexity. Specifically, we first propose a dual norm shifted tokenization (DNSPT) module to learn the spatial relationship between neighboring pixels in the face region, which is implemented by a refined spatial transformation and dual norm projection. Then, we propose a token integration module to integrate partial tokens among multiple cascaded Transformer blocks, thereby reducing the number of tokens without information loss. Furthermore, we design a discriminative token extractor, which first improves the attention in the Transformer block to reduce the unnecessary focus of the attention calculation on self-tokens, and then uses the dynamic token selection module (DTSM) to select key tokens, thereby capturing more discriminative micro-expression representations. We conduct extensive experiments on four popular public datasets (i.e., CASME II, SAMM, SMIC, and CAS(ME)$^3$). The experimental results show that our method achieves significant performance gains over state-of-the-art methods, such as a 9.6% improvement on the CAS(ME)$^3$ dataset in terms of UF1 and a 4.58% improvement on the SMIC dataset in terms of UAR.
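The two metrics quoted above, UF1 and UAR, are the unweighted (macro-averaged) F1 score and the unweighted average recall commonly used in micro-expression benchmarks. A minimal sketch of how they can be computed from a confusion matrix follows; the function name `uf1_uar` and the example matrix are illustrative assumptions, not taken from the paper.

```python
def uf1_uar(confusion):
    """Return (UF1, UAR) for a square confusion matrix.

    confusion[i][j] = number of samples with true class i predicted as class j.
    UF1 is the mean of per-class F1; UAR is the mean of per-class recall.
    """
    n = len(confusion)
    f1s, recalls = [], []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[r][c] for r in range(n)) - tp
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
        recalls.append(tp / (tp + fn) if (tp + fn) else 0.0)
    return sum(f1s) / n, sum(recalls) / n


# Hypothetical 3-class confusion matrix (rows: true class, columns: prediction).
example = [[8, 2, 0],
           [1, 3, 1],
           [0, 1, 4]]
uf1, uar = uf1_uar(example)
```

Because both metrics average over classes rather than samples, a rare class counts as much as a frequent one, which matters on small, imbalanced datasets.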

[144] OV-Stitcher: A Global Context-Aware Framework for Training-Free Open-Vocabulary Semantic Segmentation

Seungjae Moon,Seunghyun Oh,Youngmin Ro

Main category: cs.CV

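TL;DR: This paper proposes OV-Stitcher, a training-free open-vocabulary semantic segmentation framework that stitches sub-image features directly in the final encoder block to enable global attention, improving context aggregation and segmentation consistency.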

Details

Motivation: Existing training-free open-vocabulary semantic segmentation methods are limited by the input resolution of pretrained encoders and rely on a sliding-window strategy, which leads to a lack of global attention, fragmented features, and weak contextual reasoning. Method: In the final encoder block of a pretrained vision-language model, OV-Stitcher stitches the sub-image features extracted by the sliding window at the feature level and reconstructs the attention representations, restoring a global receptive field without any additional training. Result: Across eight benchmarks, mIoU improves from 48.7 to 50.7 over prior training-free baselines, with stronger spatial consistency and semantic alignment. Conclusion: OV-Stitcher shows that, without fine-tuning, structured feature fusion can strengthen the global modeling ability of large models on high-resolution dense prediction tasks, providing a scalable approach for TF-OVSS. Abstract: Training-free open-vocabulary semantic segmentation (TF-OVSS) has recently attracted attention for its ability to perform dense prediction by leveraging the pretrained knowledge of large vision and vision-language models, without requiring additional training. However, due to the limited input resolution of these pretrained encoders, existing TF-OVSS methods commonly adopt a sliding-window strategy that processes cropped sub-images independently. While effective for managing high-resolution inputs, this approach prevents global attention over the full image, leading to fragmented feature representations and limited contextual reasoning. We propose OV-Stitcher, a training-free framework that addresses this limitation by stitching fragmented sub-image features directly within the final encoder block. By reconstructing attention representations from fragmented sub-image features, OV-Stitcher enables global attention within the final encoder block, producing coherent context aggregation and spatially consistent, semantically aligned segmentation maps. Extensive evaluations across eight benchmarks demonstrate that OV-Stitcher establishes a scalable and effective solution for open-vocabulary segmentation, achieving a notable improvement in mean Intersection over Union (mIoU) from 48.7 to 50.7 compared with prior training-free baselines.
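The contrast between per-window attention and attention over stitched tokens can be illustrated with a toy example. This is only a sketch under strong assumptions: windows are non-overlapping, the same vectors serve as queries, keys, and values (no learned projections), and the token values are made up. It is not the authors' implementation.

```python
import math


def attention(queries, keys, values):
    """Plain scaled dot-product attention over lists of vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(weights)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))
        ])
    return outputs


# Two hypothetical sliding windows, each holding two 2-D tokens.
windows = [[[1.0, 0.0], [0.0, 1.0]],
           [[1.0, 1.0], [0.5, -0.5]]]

# Sliding-window baseline: each window attends only to its own tokens.
local_out = [attention(w, w, w) for w in windows]

# Stitched variant: concatenate all window tokens, then attend globally.
stitched = [tok for w in windows for tok in w]
global_out = attention(stitched, stitched, stitched)
```

In the stitched variant every token can aggregate context from every window, which is the property the abstract attributes to global attention in the final encoder block.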

[145] Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator

Luozheng Qin,Jia Gong,Qian Qiao,Tianjiao Li,Li Xu,Haoyu Pan,Chao Qu,Zhiyu Tan,Hao Li

Main category: cs.CV

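TL;DR: This paper proposes Uni-ViGU, a framework built on a video generation model that unifies video generation and understanding through unified continuous/discrete flow matching, a modality-driven MoE structure, and a bidirectional training mechanism, achieving competitive performance on both kinds of tasks.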

Details

Motivation: Conventional multimodal large models extend an understanding-centric model to support generation, but video generation is far more computationally expensive than understanding, creating an efficiency imbalance; the authors therefore invert the paradigm and use a generation model as the foundation for unifying generation and understanding. Method: (1) A unified flow method that applies continuous flow matching to video and discrete flow matching to text; (2) a modality-driven MoE architecture that inserts lightweight text-generation layers into the Transformer while preserving video-generation priors; (3) a bidirectional training mechanism with Knowledge Recall (reconstructing input prompts) and Capability Refinement (fine-tuning on detailed captions). Result: Uni-ViGU achieves competitive performance on both video generation and understanding, supporting the scalability of generation-centric architectures for unified multimodal intelligence. Conclusion: A generation-based unified architecture is a new route to efficient, scalable multimodal models and moves beyond the conventional understanding-centric paradigm. Abstract: Unified multimodal models integrating visual understanding and generation face a fundamental challenge: visual generation incurs substantially higher computational costs than understanding, particularly for video. This imbalance motivates us to invert the conventional paradigm: rather than extending understanding-centric MLLMs to support generation, we propose Uni-ViGU, a framework that unifies video generation and understanding by extending a video generator as the foundation. We introduce a unified flow method that performs continuous flow matching for video and discrete flow matching for text within a single process, enabling coherent multimodal generation. We further propose a modality-driven MoE-based framework that augments Transformer blocks with lightweight layers for text generation while preserving generative priors. To repurpose generation knowledge for understanding, we design a bidirectional training mechanism with two stages: Knowledge Recall reconstructs input prompts to leverage learned text-video correspondences, while Capability Refinement fine-tunes on detailed captions to establish discriminative shared representations. Experiments demonstrate that Uni-ViGU achieves competitive performance on both video generation and understanding, validating generation-centric architectures as a scalable path toward unified multimodal intelligence. Project Page and Code: https://fr0zencrane.github.io/uni-vigu-page/.
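For readers unfamiliar with continuous flow matching, the sketch below builds one training pair under a linear interpolation path between a noise sample and a data sample. The linear path and the function name `flow_matching_pair` are assumptions made for illustration; the abstract does not state which path or parameterization Uni-ViGU uses.

```python
def flow_matching_pair(x0, x1, t):
    """Build one continuous flow-matching training pair.

    x0: noise sample, x1: data sample, t: time in [0, 1].
    Assumes a linear path x_t = (1 - t) * x0 + t * x1, whose
    target velocity is the constant x1 - x0.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


# Hypothetical 2-D example: a model would be trained to predict
# `target_v` from (`x_mid`, t), e.g. with a squared-error loss.
x_mid, target_v = flow_matching_pair([0.0, 0.0], [2.0, 4.0], 0.5)
```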

[146] PolySLGen: Online Multimodal Speaking-Listening Reaction Generation in Polyadic Interaction

Zhi-Yi Lin,Thomas Markhorst,Jouh Yeong Chew,Xucong Zhang

Main category: cs.CV

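TL;DR: This paper proposes PolySLGen, a framework that generates a target participant's multimodal reaction (speech, body motion, and speaking state) in polyadic (multi-party) settings, modeling group interaction with a pose fusion module and a social cue encoder to improve contextual appropriateness, temporal coherence, and realism.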

Details

Motivation: Existing methods are limited to single-modality or speaking-only responses in dyadic interactions and overlook nonverbal cues and the complex dynamics of multi-party interaction, making them unsuitable for realistic social scenarios. Method: PolySLGen is an online framework with a pose fusion module and a social cue encoder that jointly aggregate the group's motion and social signals to generate the target participant's speech, body motion, and speaking state score. Result: Experiments show that PolySLGen outperforms several adapted and SOTA baselines in motion quality, motion-speech alignment, speaking state prediction, and human-perceived realism. Conclusion: PolySLGen effectively models multi-party multimodal interaction and offers a new approach for embodied AI to generate human-like reactions in natural group interactions. Abstract: Human-like multimodal reaction generation is essential for natural group interactions between humans and embodied AI. However, existing approaches are limited to single-modality or speaking-only responses in dyadic interactions, making them unsuitable for realistic social scenarios. Many also overlook nonverbal cues and complex dynamics of polyadic interactions, both critical for engagement and conversational coherence. In this work, we present PolySLGen, an online framework for Polyadic multimodal Speaking and Listening reaction Generation. Given past conversation and motion from all participants, PolySLGen generates a future speaking or listening reaction for a target participant, including speech, body motion, and speaking state score. To model group interactions effectively, we propose a pose fusion module and a social cue encoder that jointly aggregate motion and social signals from the group. Extensive experiments, along with quantitative and qualitative evaluations, show that PolySLGen produces contextually appropriate and temporally coherent multimodal reactions, outperforming several adapted and state-of-the-art baselines in motion quality, motion-speech alignment, speaking state prediction, and human-perceived realism.

[147] Bag of Bags: Adaptive Visual Vocabularies for Genizah Join Image Retrieval

Sharva Gogawale,Gal Grudka,Daria Vasyutinsky-Shapira,Omer Ventura,Berat Kurar-Barakat,Nachum Dershowitz

Main category: cs.CV

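TL;DR: This paper proposes Bag of Bags (BoB), an image-level representation for manuscript fragment join retrieval that uses per-fragment local visual vocabularies and set-to-set distances, outperforming classical Bag of Words baselines on Cairo Genizah data.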

Details

Motivation: The task is manuscript join retrieval: given a query image of a fragment, retrieve other fragments from the same manuscript. This is important for manuscript reconstruction and historical research, but existing methods such as BoW are limited in fine-grained matching. Method: The Bag of Bags (BoB) representation trains a sparse convolutional autoencoder on binarized fragment patches; encodes the connected components of each page and clusters them with per-image k-means to form a local vocabulary; compares images with set-to-set distances (e.g., Chamfer distance, optimal transport); introduces a mass-weighted BoB-OT variant with a formal approximation guarantee; and combines a BoW shortlist with BoB-OT reranking in a two-stage pipeline. Result: On Cairo Genizah data, the best BoB variant (Chamfer) reaches Hit@1 = 0.78 and MRR = 0.84, a 6.1% relative improvement in top-1 accuracy over the strongest BoW baseline (BoW-RawPatches-$\chi^2$); BoB-OT comes with an approximation guarantee; the two-stage pipeline balances retrieval strength and cost. Conclusion: By modeling the diversity of local structure within each fragment, BoB improves join retrieval accuracy and offers a scalable, theoretically supported approach for large manuscript collections. Abstract: A join is a set of manuscript fragments identified as originally emanating from the same manuscript. We study manuscript join retrieval: Given a query image of a fragment, retrieve other fragments originating from the same physical manuscript. We propose Bag of Bags (BoB), an image-level representation that replaces the global-level visual codebook of classical Bag of Words (BoW) with a fragment-specific vocabulary of local visual words. Our pipeline trains a sparse convolutional autoencoder on binarized fragment patches, encodes connected components from each page, clusters the resulting embeddings with per-image $k$-means, and compares images using set-to-set distances between their local vocabularies. Evaluated on fragments from the Cairo Genizah, the best BoB variant (viz. Chamfer) achieves Hit@1 of 0.78 and MRR of 0.84, compared to 0.74 and 0.80, respectively, for the strongest BoW baseline (BoW-RawPatches-$\chi^2$), a 6.1% relative improvement in top-1 accuracy. We furthermore study a mass-weighted BoB-OT variant that incorporates cluster population into prototype matching and present a formal approximation guarantee bounding its deviation from full component-level optimal transport. A two-stage pipeline using a BoW shortlist followed by BoB-OT reranking provides a practical compromise between retrieval strength and computational cost, supporting applicability to larger manuscript collections.
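The set-to-set comparison at the core of BoB can be sketched with a symmetric Chamfer distance between two local vocabularies (sets of cluster centroids). The averaging convention, the toy 2-D centroids, and the names `chamfer` and `vocabularies` are illustrative assumptions; the paper's exact normalization may differ.

```python
import math


def chamfer(vocab_a, vocab_b):
    """Symmetric Chamfer distance between two sets of prototype vectors.

    Each direction averages, over one set, the Euclidean distance to the
    nearest prototype in the other set.
    """
    def one_way(src, dst):
        return sum(min(math.dist(s, d) for d in dst) for s in src) / len(src)

    return one_way(vocab_a, vocab_b) + one_way(vocab_b, vocab_a)


# Hypothetical per-image vocabularies (k-means centroids in a 2-D embedding).
vocabularies = {
    "query": [(0.0, 0.0), (1.0, 0.0)],
    "frag_a": [(0.0, 0.1), (1.0, 0.1)],   # similar local words
    "frag_b": [(5.0, 5.0), (6.0, 5.0)],   # dissimilar local words
}

# Rank candidate fragments by distance to the query vocabulary.
ranked = sorted(
    ["frag_a", "frag_b"],
    key=lambda name: chamfer(vocabularies["query"], vocabularies[name]),
)
```

Since each image carries its own vocabulary, no global codebook is needed; the cost is a pairwise distance computation at query time, which is what motivates a cheaper shortlist stage before reranking.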

[148] Face-D$^2$CL: Multi-Domain Synergistic Representation with Dual Continual Learning for Facial DeepFake Detection

Yushuo Zhang,Yu Cheng,Yongkang Hu,Jiuan Zhou,Jiawei Chen,Yuan Xie,Zhaoxia Yin

Main category: cs.CV

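TL;DR: This paper proposes Face-D$^2$CL, a framework that fuses spatial and frequency-domain features through multi-domain synergistic representation and combines EWC with OGC in a dual continual learning mechanism, mitigating insufficient feature representation and catastrophic forgetting without historical data replay and improving both the stability and plasticity of DeepFake detection.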

Details

Motivation: The rapid development of facial forgery techniques threatens public trust and information security, while existing continual learning methods face two bottlenecks in real-world scenarios: insufficient feature representation and catastrophic forgetting. Method: The Face-D$^2$CL framework uses (1) multi-domain synergistic representation to fuse spatial and frequency-domain features, and (2) a dual continual learning mechanism combining Elastic Weight Consolidation (which distinguishes parameter importance for real versus fake samples) and Orthogonal Gradient Constraint (which keeps task-adapter updates from interfering with previously learned knowledge). Result: Compared with current SOTA methods, the average detection error rate is reduced by 60.7% in relative terms, and the average detection AUC on unseen forgery domains improves by 7.9%. Conclusion: Without historical data replay, Face-D$^2$CL achieves a dynamic balance between resistance to forgetting and adaptability to new forgery paradigms, improving DeepFake detection under continual learning. Abstract: The rapid advancement of facial forgery techniques poses severe threats to public trust and information security, making facial DeepFake detection a critical research priority. Continual learning provides an effective approach to adapt facial DeepFake detection models to evolving forgery patterns. However, existing methods face two key bottlenecks in real-world continual learning scenarios: insufficient feature representation and catastrophic forgetting. To address these issues, we propose Face-D$^2$CL, a framework for facial DeepFake detection. It leverages multi-domain synergistic representation to fuse spatial and frequency-domain features for the comprehensive capture of diverse forgery traces, and employs a dual continual learning mechanism that combines Elastic Weight Consolidation (EWC), which distinguishes parameter importance for real versus fake samples, and Orthogonal Gradient Constraint (OGC), which ensures updates to task-specific adapters do not interfere with previously learned knowledge. This synergy enables the model to achieve a dynamic balance between robust anti-forgetting capabilities and agile adaptability to emerging facial forgery paradigms, all without relying on historical data replay. Extensive experiments demonstrate that our method surpasses current SOTA approaches in both stability and plasticity, achieving a 60.7% relative reduction in average detection error rate. On unseen forgery domains, it further improves the average detection AUC by 7.9% compared to the current SOTA method.
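The two ingredients of the dual continual learning mechanism can be sketched in their generic textbook forms: a quadratic EWC penalty weighted by per-parameter importance, and a gradient projection that removes components along protected directions. Both functions below are simplified stand-ins with hypothetical names; the paper's real-versus-fake importance split and its exact OGC formulation are not reproduced here.

```python
def ewc_penalty(theta, theta_old, importance, lam=1.0):
    """Standard EWC regularizer: 0.5 * lam * sum_i F_i * (theta_i - theta_old_i)^2."""
    return 0.5 * lam * sum(
        f * (t - o) ** 2 for t, o, f in zip(theta, theta_old, importance)
    )


def project_orthogonal(grad, protected_basis):
    """Remove from `grad` its components along each protected direction.

    `protected_basis` is assumed to be an orthonormal list of vectors that
    span directions important to earlier tasks.
    """
    g = list(grad)
    for u in protected_basis:
        dot = sum(a * b for a, b in zip(g, u))
        g = [a - dot * b for a, b in zip(g, u)]
    return g


# Hypothetical 2-parameter example.
penalty = ewc_penalty([1.0, 2.0], [0.0, 2.0], importance=[2.0, 5.0])
safe_grad = project_orthogonal([3.0, 4.0], protected_basis=[[1.0, 0.0]])
```

The penalty discourages moving parameters that mattered for earlier tasks, while the projection guarantees that an update has no first-order effect along the protected directions.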

[149] T-Gated Adapter: A Lightweight Temporal Adapter for Vision-Language Medical Segmentation

Pranjal Khadka

Main category: cs.CV

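TL;DR: This paper proposes a temporal adapter that injects adjacent-slice context into a vision-language model (VLM), substantially improving 3D medical image segmentation, particularly in few-shot, zero-shot, and cross-modality settings.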

Details

Motivation: Conventional fully supervised 3D segmentation depends on large amounts of expensive voxel-level annotation, and applying existing VLMs directly to 2D slices yields poor anatomical continuity and noisy segmentations. Method: A temporal adapter consisting of (1) a temporal Transformer that attends across adjacent slices within a fixed window at the token level; (2) a spatial context block that refines within-slice representations; and (3) an adaptive gate that fuses temporal and single-slice features. Result: On FLARE22 (30 labeled volumes) the method reaches a mean Dice of 0.704 (+0.206 over the baseline); zero-shot transfer to BTCV and AMOS22 improves by +0.210 and +0.230; in the cross-modality setting (CT training, AMOS22 MRI evaluation) it reaches a Dice of 0.366, above a DynUNet trained only on CT (0.224). Conclusion: Injecting inter-slice temporal context improves the anatomical consistency and generalization of VLMs for 3D medical image segmentation, especially when annotations are scarce or the domain or modality shifts. Abstract: Medical image segmentation traditionally relies on fully supervised 3D architectures that demand a large amount of dense, voxel-level annotations from clinical experts, which is a prohibitively expensive process. Vision Language Models (VLMs) offer a powerful alternative by leveraging broad visual semantic representations learned from billions of images. However, when applied independently to 2D slices of a 3D scan, these models often produce noisy and anatomically implausible segmentations that violate the inherent continuity of anatomical structures. We propose a temporal adapter that addresses this by injecting adjacent-slice context directly into the model's visual token representations. The adapter comprises a temporal transformer attending across a fixed context window at the token level, a spatial context block refining within-slice representations, and an adaptive gate balancing temporal and single-slice features. Training on 30 labeled volumes from the FLARE22 dataset, our method achieves a mean Dice of 0.704 across 13 abdominal organs with a gain of +0.206 over the baseline VLM trained with no temporal context. Zero-shot evaluation on BTCV and AMOS22 datasets yields consistent improvements of +0.210 and +0.230, with the average cross-domain performance drop reducing from 38.0% to 24.9%. Furthermore, in a cross-modality evaluation on AMOS22 MRI with neither model receiving any MRI supervision, our method achieves a mean Dice of 0.366, outperforming a fully supervised 3D baseline (DynUNet, 0.224) trained exclusively on CT, suggesting that CLIP's visual semantic representations generalize more gracefully across imaging modalities than convolutional features.
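The adaptive gate that balances temporal and single-slice features can be sketched as an element-wise sigmoid blend. How the gate logits are produced is not described in the abstract, so here they are simply passed in; `gated_fuse` and the example values are hypothetical.

```python
import math


def gated_fuse(temporal, single, gate_logits):
    """Blend two feature vectors with a per-element sigmoid gate.

    out_i = g_i * temporal_i + (1 - g_i) * single_i, with g_i = sigmoid(logit_i).
    In a real adapter the logits would come from a learned layer (assumption).
    """
    fused = []
    for t, s, z in zip(temporal, single, gate_logits):
        g = 1.0 / (1.0 + math.exp(-z))
        fused.append(g * t + (1.0 - g) * s)
    return fused


# A zero logit gives an even mix; a large positive logit favors temporal context.
even_mix = gated_fuse([1.0], [0.0], [0.0])
mostly_temporal = gated_fuse([2.0], [4.0], [50.0])
```

A gate of this form lets the model fall back to single-slice features where neighboring slices are uninformative, for example at organ boundaries along the scan axis.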

[150] OceanMAE: A Foundation Model for Ocean Remote Sensing

Viola-Joanna Stamer,Panagiotis Agrafiotis,Behnood Rasti,Begüm Demir

Main category: cs.CV

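TL;DR: This paper proposes OceanMAE, a masked autoencoder for ocean remote sensing that is pre-trained in a self-supervised manner on multispectral Sentinel-2 imagery combined with physically meaningful ocean descriptors, improving downstream ocean tasks such as marine pollutant segmentation and bathymetry estimation.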

Details

Motivation: Ocean remote sensing is constrained by scarce labeled data and by the weak transferability of general remote-sensing pre-trained models, which are trained mainly on land imagery. Method: OceanMAE extends the standard MAE framework by jointly using multispectral Sentinel-2 imagery and physical ocean descriptors during self-supervised pre-training; downstream, a modified UNet architecture is used for marine segmentation and bathymetry estimation. Result: Experiments on MADOS, MARIDA, and MagicBathyNet show that OceanMAE yields its strongest gains on marine segmentation, while bathymetry results are competitive and task-dependent; an ablation shows that adding ocean descriptors improves segmentation quality. Conclusion: Ocean-specific self-supervised pre-training that incorporates physical priors improves ocean remote sensing tasks and highlights the value of domain-aligned pre-training. Abstract: Accurate ocean mapping is essential for applications such as bathymetry estimation, seabed characterization, marine litter detection, and ecosystem monitoring. However, ocean remote sensing (RS) remains constrained by limited labeled data and by the reduced transferability of models pre-trained mainly on land-dominated Earth observation imagery. In this paper, we propose OceanMAE, an ocean-specific masked autoencoder that extends standard MAE pre-training by integrating multispectral Sentinel-2 observations with physically meaningful ocean descriptors during self-supervised learning. By incorporating these auxiliary ocean features, OceanMAE is designed to learn more informative and ocean-aware latent representations from large-scale unlabeled data. To transfer these representations to downstream applications, we further employ a modified UNet-based framework for marine segmentation and bathymetry estimation. Pre-trained on the Hydro dataset, OceanMAE is evaluated on MADOS and MARIDA for marine pollutant and debris segmentation, and on MagicBathyNet for bathymetry regression. The experiments show that OceanMAE yields the strongest gains on marine segmentation, while bathymetry benefits are competitive and task-dependent. In addition, an ablation against a standard MAE on MARIDA indicates that incorporating auxiliary ocean descriptors during pre-training improves downstream segmentation quality. These findings highlight the value of physically informed and domain-aligned self-supervised pre-training for ocean RS. Code and weights are publicly available at https://git.tu-berlin.de/joanna.stamer/SSLORS2.
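Standard MAE pre-training, which OceanMAE extends, hides a random subset of image patches and asks the model to reconstruct them from the visible ones. A minimal sketch of the patch-masking step follows; the 75% ratio is the default from the original MAE work and is only an assumption here, as is the function name `random_mask`.

```python
import random


def random_mask(num_patches, mask_ratio=0.75, seed=0):
    """Split patch indices into (visible, masked) sets for MAE-style training.

    The encoder would see only the visible patches; the decoder would be
    trained to reconstruct the masked ones.
    """
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    n_visible = int(num_patches * (1.0 - mask_ratio))
    return sorted(indices[:n_visible]), sorted(indices[n_visible:])


# Hypothetical 4x4 patch grid: 4 patches stay visible, 12 are masked.
visible, masked = random_mask(16)
```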

[151] On the Global Photometric Alignment for Low-Level Vision

Mingjia Li,Tianle Du,Hainuo Wang,Qiming Hu,Xiaojie Guo

Main category: cs.CV

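TL;DR: This paper proposes Photometric Alignment Loss (PAL), which uses closed-form affine color alignment to reduce the interference caused by photometric inconsistency, improving the performance and generalization of low-level vision models.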

Details

Motivation: Supervised low-level vision models rely on pixel-wise losses against paired references, but paired training sets exhibit per-pair photometric inconsistency (e.g., differences in brightness, color, or white balance), arising from task-intrinsic photometric transfer or acquisition shifts, which makes optimization difficult. Method: Using a least-squares decomposition, the authors prove that the photometric and structural components of the prediction-target residual are orthogonal and that the photometric component dominates the gradient energy; on this basis they propose PAL, which preserves restoration-relevant supervision while performing photometric alignment using covariance statistics and a small matrix inversion. Result: Across 6 tasks, 16 datasets, and 16 architectures, PAL consistently improves quantitative metrics and generalization. Conclusion: PAL is an efficient, general modification to the supervision loss with negligible overhead that mitigates the optimization pathology caused by photometric inconsistency. Abstract: Supervised low-level vision models rely on pixel-wise losses against paired references, yet paired training sets exhibit per-pair photometric inconsistency; that is, different image pairs demand different global brightness, color, or white-balance mappings. This inconsistency enters through task-intrinsic photometric transfer (e.g., low-light enhancement) or unintended acquisition shifts (e.g., de-raining), and in either case causes an optimization pathology. Standard reconstruction losses allocate disproportionate gradient budget to conflicting per-pair photometric targets, crowding out content restoration. In this paper, we investigate this issue and prove that, under least-squares decomposition, the photometric and structural components of the prediction-target residual are orthogonal, and that the spatially dense photometric component dominates the gradient energy. Motivated by this analysis, we propose Photometric Alignment Loss (PAL). This flexible supervision objective discounts nuisance photometric discrepancy via closed-form affine color alignment while preserving restoration-relevant supervision, requiring only covariance statistics and a tiny matrix inversion with negligible overhead. Across 6 tasks, 16 datasets, and 16 architectures, PAL consistently improves metrics and generalization. The implementation is in the appendix.
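The idea of closed-form affine alignment from covariance statistics can be illustrated in a simplified single-channel form: fit a gain and a bias by least squares, then measure the residual after alignment. This sketch is deliberately reduced; the paper describes a color (multi-channel) alignment with a small matrix inversion, and the direction of alignment (prediction to target) and all names below are assumptions.

```python
def affine_align(pred, target, eps=1e-8):
    """Least-squares gain and bias such that gain * pred + bias ~ target.

    Closed form: gain = cov(pred, target) / var(pred),
                 bias = mean(target) - gain * mean(pred).
    """
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    cov_pt = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target)) / n
    gain = cov_pt / (var_p + eps)
    bias = mean_t - gain * mean_p
    return gain, bias


def aligned_l1(pred, target):
    """L1 residual after removing the best global affine photometric mapping."""
    gain, bias = affine_align(pred, target)
    return sum(abs(gain * p + bias - t) for p, t in zip(pred, target)) / len(pred)


def plain_l1(pred, target):
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


# Hypothetical pixels: the target differs from the prediction only by a
# global brightness/contrast change, so the aligned residual is near zero.
prediction = [0.1, 0.4, 0.7, 0.9]
reference = [2.0 * p + 0.05 for p in prediction]
```

In this example the plain L1 loss is large even though the content is identical, while the aligned residual is close to zero, which is the kind of discrepancy the abstract describes as crowding out content restoration.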

[152] MedVR: Annotation-Free Medical Visual Reasoning via Agentic Reinforcement Learning

Zheng Jiang,Heng Guo,Chengyu Fang,Changchen Xiao,Xinyang Hu,Lifeng Sun,Minfeng Xu

Main category: cs.CV

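TL;DR: This paper proposes MedVR, a reinforcement learning framework that needs no human annotation and improves the visual reasoning of medical vision-language models (VLMs) through Entropy-guided Visual Regrounding (EVR) and Consensus-based Credit Assignment (CCA), reaching SOTA performance on multiple medical VQA benchmarks.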

Details

Motivation: Existing medical VLMs are constrained by text-only reasoning paradigms and fail to ground inferences in visual evidence, which limits fine-grained visual analysis and introduces a risk of visual hallucination, affecting clinical safety and reliability. Method: The MedVR reinforcement learning framework has two core mechanisms: Entropy-guided Visual Regrounding (EVR) uses model uncertainty to direct exploration, and Consensus-based Credit Assignment (CCA) extracts pseudo-supervision from agreement across rollouts, with no human annotation of intermediate steps. Result: MedVR achieves SOTA performance on multiple public medical visual question answering (VQA) benchmarks, significantly outperforming existing methods, and improves robustness and interpretability. Conclusion: By reasoning directly from visual evidence, MedVR offers a more reliable and transparent path to vision-language understanding for clinical deployment of medical AI. Abstract: Medical Vision-Language Models (VLMs) hold immense promise for complex clinical tasks, but their reasoning capabilities are often constrained by text-only paradigms that fail to ground inferences in visual evidence. This limitation not only curtails performance on tasks requiring fine-grained visual analysis but also introduces risks of visual hallucination in safety-critical applications. Thus, we introduce MedVR, a novel reinforcement learning framework that enables annotation-free visual reasoning for medical VLMs. Its core innovation lies in two synergistic mechanisms: Entropy-guided Visual Regrounding (EVR) uses model uncertainty to direct exploration, while Consensus-based Credit Assignment (CCA) distills pseudo-supervision from rollout agreement. Without any human annotations for intermediate steps, MedVR achieves state-of-the-art performance on diverse public medical VQA benchmarks, significantly outperforming existing models. By learning to reason directly with visual evidence, MedVR promotes the robustness and transparency essential for accelerating the clinical deployment of medical AI.
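Two quantities underlie the mechanisms named above: predictive entropy as an uncertainty signal, and agreement among rollouts as a source of pseudo-labels. The sketch below shows one simple way each could be computed. Majority voting is an assumption chosen for illustration; the abstract does not specify how CCA turns rollout agreement into credit, and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def consensus_rewards(answers):
    """Reward each rollout 1.0 if it matches the majority answer, else 0.0.

    A majority-vote stand-in for consensus-based pseudo-supervision.
    """
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]


# High entropy could trigger another look at the image (assumed usage).
uncertain = entropy([0.5, 0.5])
confident = entropy([0.99, 0.01])

# Hypothetical final answers from three rollouts on the same question.
rewards = consensus_rewards(["A", "B", "A"])
```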

Yiduo Jia,Muzhi Zhu,Hao Zhong,Mingyu Liu,Yuling Xi,Hao Chen,Bin Qin,Yongjie Yang,Zhenbo Luo,Chunhua Shen

Main category: cs.CV

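TL;DR: This paper proposes OmniJigsaw, a generic self-supervised framework built on a temporal reordering proxy task to strengthen omni-modal (video-audio) understanding and collaborative reasoning. It promotes cross-modal integration through three strategies (joint modality integration, sample-level modality selection, and clip-level modality masking) and uses a two-stage data filtering pipeline to adapt to massive unannotated data. Experiments show substantial gains on 15 benchmarks and reveal and mitigate a "bi-modal shortcut phenomenon".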

Details

Motivation: Extend the reinforcement learning post-training paradigm to omni-modal models so as to improve video-audio understanding and collaborative reasoning at the same time, while remaining applicable to massive unannotated data. Method: The OmniJigsaw framework is centered on a proxy task that reconstructs the chronological order of shuffled audio-visual clips; it uses three cross-modal integration strategies (joint modality integration, sample-level modality selection, and clip-level modality masking) and a two-stage coarse-to-fine data filtering pipeline to ensure puzzle quality. Result: On 15 benchmarks, video, audio, and collaborative reasoning tasks all show substantial gains; the analysis identifies a "bi-modal shortcut phenomenon" and shows that clip-level modality masking is more effective than sample-level selection. Conclusion: OmniJigsaw is a scalable paradigm for self-supervised omni-modal learning that promotes audio-visual integration, with fine-grained masking in particular mitigating the modality shortcut problem. Abstract: To extend the reinforcement learning post-training paradigm to omni-modal models for concurrently bolstering video-audio understanding and collaborative reasoning, we propose OmniJigsaw, a generic self-supervised framework built upon a temporal reordering proxy task. Centered on the chronological reconstruction of shuffled audio-visual clips, this paradigm strategically orchestrates visual and auditory signals to compel cross-modal integration through three distinct strategies: Joint Modality Integration, Sample-level Modality Selection, and Clip-level Modality Masking. Recognizing that the efficacy of such proxy tasks is fundamentally tied to puzzle quality, we design a two-stage coarse-to-fine data filtering pipeline, which facilitates the efficient adaptation of OmniJigsaw to massive unannotated omni-modal data. Our analysis reveals a "bi-modal shortcut phenomenon" in joint modality integration and demonstrates that fine-grained clip-level modality masking mitigates this issue while outperforming sample-level modality selection. Extensive evaluations on 15 benchmarks show substantial gains in video, audio, and collaborative reasoning, validating OmniJigsaw as a scalable paradigm for self-supervised omni-modal learning.
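A temporal reordering proxy task needs two pieces: a way to build a puzzle from ordered clips, and a verifiable reward for a predicted ordering. The sketch below shows one possible construction. The position-wise accuracy reward and the names `make_puzzle` and `ordering_reward` are assumptions for illustration; the abstract does not define the reward actually used.

```python
import random


def make_puzzle(clips, seed=0):
    """Shuffle clips and return (shuffled_clips, order).

    order[k] is the original (chronological) index of the k-th shuffled clip,
    so it serves as the ground-truth answer without any human annotation.
    """
    rng = random.Random(seed)
    order = list(range(len(clips)))
    rng.shuffle(order)
    return [clips[i] for i in order], order


def ordering_reward(predicted, order):
    """Fraction of shuffled positions whose original index is predicted correctly."""
    return sum(p == o for p, o in zip(predicted, order)) / len(order)


# Hypothetical clip identifiers standing in for audio-visual segments.
clips = ["clip0", "clip1", "clip2", "clip3"]
shuffled, truth = make_puzzle(clips)
perfect = ordering_reward(truth, truth)
```

Because the answer is derived from the data itself, the reward can be checked automatically, which is what makes such a task usable for reinforcement learning on unannotated data.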

[154] SciFigDetect: A Benchmark for AI-Generated Scientific Figure Detection

You Hu,Chenzhuo Zhao,Changfa Mo,Haotian Liu,Xiaobai Li

Main category: cs.CV

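TL;DR: This paper presents the first benchmark for detecting AI-generated scientific figures and shows that existing AI-generated image detectors perform poorly in this setting, with failures in zero-shot transfer, generator-specific overfitting, and fragility under common post-processing corruptions.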

Details

Motivation: Modern multimodal generators can produce scientific figures at near-publishable quality, but existing AI-generated image detectors mainly target natural images and are not designed for structured, text-dense, semantically rigorous scientific figures, so a dedicated benchmark is needed. Method: An agent-based data pipeline retrieves licensed papers, performs multimodal understanding of text and figures, builds structured prompts, synthesizes candidate figures, and filters them through a review-driven refinement loop. The resulting benchmark covers multiple figure categories and generation sources with aligned real-synthetic pairs, and representative detectors are evaluated under zero-shot, cross-generator, and degraded-image settings. Result: Current detectors perform very poorly in zero-shot transfer, depend heavily on specific generators, and are fragile under common degradations such as compression and noise, revealing a substantial gap relative to the real distribution of scientific figures. Conclusion: Existing AI-generated image detection techniques struggle with high-quality scientific figures; the benchmark is intended as a foundation for research on robust and generalizable scientific-figure forensics. Abstract: Modern multimodal generators can now produce scientific figures at near-publishable quality, creating a new challenge for visual forensics and research integrity. Unlike conventional AI-generated natural images, scientific figures are structured, text-dense, and tightly aligned with scholarly semantics, making them a distinct and difficult detection target. However, existing AI-generated image detection benchmarks and methods are almost entirely developed for open-domain imagery, leaving this setting largely unexplored. We present the first benchmark for AI-generated scientific figure detection. To construct it, we develop an agent-based data pipeline that retrieves licensed source papers, performs multimodal understanding of paper text and figures, builds structured prompts, synthesizes candidate figures, and filters them through a review-driven refinement loop. The resulting benchmark covers multiple figure categories, multiple generation sources, and aligned real-synthetic pairs. We benchmark representative detectors under zero-shot, cross-generator, and degraded-image settings. Results show that current methods fail dramatically in zero-shot transfer, exhibit strong generator-specific overfitting, and remain fragile under common post-processing corruptions. These findings reveal a substantial gap between existing AIGI detection capabilities and the emerging distribution of high-quality scientific figures. We hope this benchmark can serve as a foundation for future research on robust and generalizable scientific-figure forensics. The dataset is available at https://github.com/Joyce-yoyo/SciFigDetect.

[155] Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment

Blessing Agyei Kyem,Joshua Kofi Asamoah,Anthony Dontoh,Armstrong Aboah

Main category: cs.CV

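TL;DR: This paper introduces the PaveInstruct dataset and the PaveGPT model, showing that domain-specific instruction tuning substantially improves vision-language models on pavement condition assessment and enables a unified assessment tool that conforms to engineering standards.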

Details

Motivation: General-purpose vision-language models perform poorly in specialized engineering domains such as pavement inspection, where precise terminology, structured reasoning, and adherence to engineering standards are required. Method: The authors build PaveInstruct, a dataset of 278,889 image-instruction-response pairs that unifies nine heterogeneous pavement datasets, train the domain foundation model PaveGPT on it, and compare its performance on perception, understanding, and reasoning tasks. Result: Instruction tuning improves spatial grounding, reasoning, and generation tasks by more than 20% and produces ASTM D6433-compliant outputs, allowing transportation agencies to replace multiple specialized systems with one conversational tool. Conclusion: Domain instruction tuning can equip vision-language models for specialized infrastructure assessment, and the approach can extend to bridges, railways, buildings, and other infrastructure inspection settings. Abstract: General-purpose vision-language models demonstrate strong performance in everyday domains but struggle with specialized technical fields requiring precise terminology, structured reasoning, and adherence to engineering standards. This work addresses whether domain-specific instruction tuning can enable comprehensive pavement condition assessment through vision-language models. PaveInstruct, a dataset containing 278,889 image-instruction-response pairs spanning 32 task types, was created by unifying annotations from nine heterogeneous pavement datasets. PaveGPT, a pavement foundation model trained on this dataset, was evaluated against state-of-the-art vision-language models across perception, understanding, and reasoning tasks. Instruction tuning transformed model capabilities, achieving improvements exceeding 20% in spatial grounding, reasoning, and generation tasks while producing ASTM D6433-compliant outputs. These results enable transportation agencies to deploy unified conversational assessment tools that replace multiple specialized systems, simplifying workflows and reducing technical expertise requirements. The approach establishes a pathway for developing instruction-driven AI systems across infrastructure domains including bridge inspection, railway maintenance, and building condition assessment.

[156] EditCaption: Human-Aligned Instruction Synthesis for Image Editing via Supervised Fine-Tuning and Direct Preference Optimization

Xiangyuan Wang,Honghao Cai,Yunhao Bai,Tianze Zhou,Haohua Chen,Yao Hu,Xu Tang,Yibo Chen,Wei Zhu

Main category: cs.CV

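TL;DR: This paper proposes EditCaption, a two-stage post-training pipeline that improves the accuracy of vision-language models (VLMs) at synthesizing image-editing instructions, reducing orientation, viewpoint, and attribute-description errors and improving the usability and human alignment of the generated instructions.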

Details

Motivation: High-quality image pairs with editing instructions are scarce, and instructions generated automatically by existing VLMs suffer from three systematic failure modes (orientation confusion, viewpoint ambiguity, and coarse attribute description), leaving nearly half of them unusable for training. Method: A two-stage EditCaption pipeline. Stage 1 builds a 100K-sample supervised fine-tuning (SFT) dataset combining GLM automatic annotation, EditScore-based filtering, and human refinement. Stage 2 collects 10K human preference pairs targeting the three failure modes and applies direct preference optimization (DPO) to further align with human intent. Result: Fine-tuned Qwen3-VL models outperform open-source baselines on Eval-400, ByteMorph-Bench, and HQ-Edit; the 235B model reaches 4.712 on Eval-400 (above Gemini-3-Pro), the critical error rate falls from 47.75% to 23%, and instruction correctness rises from 41.75% to 66%. Conclusion: EditCaption offers a practical route to scalable, high-fidelity, human-aligned instruction synthesis for image editing and eases the training-data bottleneck. Abstract: High-quality training triplets (source-target image pairs with precise editing instructions) are a critical bottleneck for scaling instruction-guided image editing models. Vision-language models (VLMs) are widely used for automated instruction synthesis, but we identify three systematic failure modes in image-pair settings: orientation inconsistency (e.g., left/right confusion), viewpoint ambiguity, and insufficient fine-grained attribute description. Human evaluation shows that over 47% of instructions from strong baseline VLMs contain critical errors that make them unusable for downstream training. We propose EditCaption, a scalable two-stage post-training pipeline for VLM-based instruction synthesis. Stage 1 builds a 100K supervised fine-tuning (SFT) dataset by combining GLM automatic annotation, EditScore-based filtering, and human refinement for spatial, directional, and attribute-level accuracy. Stage 2 collects 10K human preference pairs targeting the three failure modes and applies direct preference optimization (DPO) for alignment beyond SFT alone. On Eval-400, ByteMorph-Bench, and HQ-Edit, fine-tuned Qwen3-VL models outperform open-source baselines; the 235B model reaches 4.712 on Eval-400 (vs. Gemini-3-Pro 4.706, GPT-4.1 4.220, Kimi-K2.5 4.111) and 4.588 on ByteMorph-Bench (vs. Gemini-3-Pro 4.522, GPT-4.1 3.412). Human evaluation shows critical errors falling from 47.75% to 23% and correctness rising from 41.75% to 66%. The work offers a practical path to scalable, human-aligned instruction synthesis for image editing data.
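Stage 2 relies on direct preference optimization. For a single preference pair, the standard DPO objective is the negative log-sigmoid of a scaled difference in log-probability ratios between the policy and a frozen reference model. The sketch below evaluates that loss from given sequence log-probabilities; the value of `beta` and the example numbers are illustrative assumptions, not settings from the paper.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    margin = beta * [(log pi(y_w) - log pi_ref(y_w)) - (log pi(y_l) - log pi_ref(y_l))]
    loss   = -log(sigmoid(margin)) = log(1 + exp(-margin))
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    return math.log1p(math.exp(-margin))


# Hypothetical sequence log-probabilities for a preferred and a rejected instruction.
loss_good = dpo_loss(-1.0, -3.0, -2.0, -2.0)  # policy already favors the preferred one
loss_bad = dpo_loss(-3.0, -1.0, -2.0, -2.0)   # policy favors the rejected one
```

The loss decreases as the policy raises the preferred instruction's likelihood relative to the reference and lowers the rejected one's, which is how preference pairs targeting specific failure modes can shift the model beyond SFT.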

[157] Generalization Under Scrutiny: Cross-Domain Detection Progresses, Pitfalls, and Persistent Challenges

Saniya M. Deshmukh,Kailash A. Hambarde,Hugo Proença

Main category: cs.CV

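TL;DR: This survey of cross-domain object detection (CDOD) systematically reviews the field's challenges, method taxonomy, domain-shift propagation mechanisms, datasets, and evaluation standards, and identifies future research directions.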

Details

Motivation: Object detection models trained on a source domain degrade significantly on unseen target domains, and existing research is fragmented, lacking a unified view of the underlying challenges of domain shift and of how effective adaptation strategies are. Method: The survey proposes a multi-stage problem formulation, builds a conceptual taxonomy based on adaptation paradigms, modeling assumptions, and detection pipeline components, and analyzes how domain shift propagates through the detection stages. Result: It establishes a unified analytical framework for CDOD, summarizes mainstream methods, common datasets, and evaluation protocols, and explains why adaptation is harder for detection than for classification. Conclusion: The survey provides a structured basis and a guide to future research for understanding and advancing cross-domain object detection, supporting the development of more robust detection systems. Abstract: Object detection models trained on a source domain often exhibit significant performance degradation when deployed in unseen target domains, due to variations in sensing conditions, environments, and data distributions. Hence, despite recent breakthroughs in deep learning-based detection technology, cross-domain object detection (CDOD) remains a critical research area. Moreover, the existing literature remains fragmented, lacking a unified perspective on the structural challenges underlying domain shift and the effectiveness of adaptation strategies. This survey provides a comprehensive and systematic analysis of CDOD. We start from a problem formulation that highlights the multi-stage nature of object detection under domain shift. Then, we organize the existing methods through a conceptual taxonomy that categorizes approaches based on adaptation paradigms, modeling assumptions, and pipeline components. Furthermore, we analyze how domain shift propagates across detection stages and discuss why adaptation in object detection is inherently more complex than in classification. In addition, we review commonly used datasets, evaluation protocols, and benchmarking practices. Finally, we identify the key challenges and outline promising future research directions. Overall, this survey aims to provide a unified framework for understanding CDOD and to guide the development of more robust detection systems.

[158] $\oslash$ Source Models Leak What They Shouldn't $\nrightarrow$: Unlearning Zero-Shot Transfer in Domain Adaptation Through Adversarial Optimization

Arnav Devalapally,Poornima Jain,Kartik Srinivas,Vineeth N. Balasubramanian

Main category: cs.CV

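TL;DR: This paper introduces SCADA-UL, a new machine unlearning setting that addresses the unintended leakage of source-exclusive classes in source-free domain adaptation (SFDA), and designs an unlearning method combining adversarial sample generation with a rescaled labeling strategy, supported by both theory and experiments.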

Details

Motivation: Existing source-free domain adaptation (SFDA) methods unintentionally leak knowledge of source-exclusive classes into the target domain, creating a privacy risk, while conventional machine unlearning methods do not account for distribution shift and cannot be applied directly. Method: The paper proposes the SCADA-UL unlearning setting; designs a new unlearning method based on adversarially generated forget-class samples, a rescaled labeling strategy, and adversarial optimization; and extends the study to a continual variant and a variant with unknown forget classes. Result: The method clearly outperforms baselines in the SCADA-UL setting and reaches unlearning performance comparable to retraining on benchmark datasets. Conclusion: The paper is the first to systematically define and address privacy unlearning of source-exclusive classes in SFDA, offering a new approach to cross-domain model security. Abstract: The increasing adaptation of vision models across domains, such as satellite imagery and medical scans, has raised an emerging privacy risk: models may inadvertently retain and leak sensitive source-domain-specific information in the target domain. This creates a compelling use case for machine unlearning to protect the privacy of sensitive source-domain data. Among adaptation techniques, source-free domain adaptation (SFDA) creates an urgent need for machine unlearning (MU): the source data itself is protected, yet the source model exposed during adaptation encodes its influence. Our experiments reveal that existing SFDA methods exhibit strong zero-shot performance on source-exclusive classes in the target domain, indicating they inadvertently leak knowledge of these classes into the target domain, even when they are not represented in the target data. We identify and address this risk by proposing an MU setting called SCADA-UL: Unlearning Source-exclusive ClAsses in Domain Adaptation. Existing MU methods do not address this setting as they are not designed to handle data distribution shifts. We propose a new unlearning method, where an adversarially generated forget class sample is unlearned by the model during the domain adaptation process using a novel rescaled labeling strategy and adversarial optimization. We also extend our study to two variants: a continual version of this problem setting and one where the specific source classes to be forgotten may be unknown. Alongside theoretical interpretations, our comprehensive empirical results show that our method consistently outperforms baselines in the proposed setting while achieving retraining-level unlearning performance on benchmark datasets. Our code is available at https://github.com/D-Arnav/SCADA

[159] DBMF: A Dual-Branch Multimodal Framework for Out-of-Distribution Detection

Jiangbei Yue,Sharib Ali

Main category: cs.CV

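TL;DR: This paper proposes a dual-branch multimodal framework that combines a text-image branch with a vision branch to improve out-of-distribution (OOD) detection on endoscopic images, clearly outperforming existing methods.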

Details

Motivation: Existing OOD detection methods usually rely on a single visual modality or on simple image-text matching, fail to fully exploit multimodal information, and struggle with the complex, varied out-of-distribution data seen in clinical practice (such as unseen disease cases). Method: A dual-branch multimodal framework with a text-image branch (producing score $S_t$) and a pure vision branch (producing score $S_v$) that complement each other; after training, the two scores are fused into a final OOD score $S$, which is compared with a threshold. Result: On several public endoscopic image datasets the method is robust across backbones and improves OOD detection over the current best methods by up to 24.84%. Conclusion: The dual-branch multimodal design makes fuller use of image-text and visual features and improves the ability and reliability of clinical DL systems in recognizing out-of-distribution data. Abstract: The complex and dynamic real-world clinical environment demands reliable deep learning (DL) systems. Out-of-distribution (OOD) detection plays a critical role in enhancing the reliability and generalizability of DL models when encountering data that deviate from the training distribution, such as unseen disease cases. However, existing OOD detection methods typically rely either on a single visual modality or solely on image-text matching, failing to fully leverage multimodal information. To overcome this challenge, we propose a novel dual-branch multimodal framework by introducing a text-image branch and a vision branch. Our framework fully exploits multimodal representations to identify OOD samples through these two complementary branches. After training, we compute scores from the text-image branch ($S_t$) and vision branch ($S_v$), and integrate them to obtain the final OOD score $S$ that is compared with a threshold for OOD detection. Comprehensive experiments on publicly available endoscopic image datasets demonstrate that our proposed framework is robust across diverse backbones and improves state-of-the-art performance in OOD detection by up to 24.84%.
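The final decision step, fusing $S_t$ and $S_v$ into $S$ and comparing it with a threshold, can be sketched as follows. The abstract does not say how the scores are integrated, so the convex combination, the weight `alpha`, the threshold value, and the convention that a higher score means "more likely OOD" are all assumptions.

```python
def ood_decision(s_t, s_v, alpha=0.5, threshold=0.5):
    """Fuse two branch scores and flag a sample as OOD.

    s_t: score from the text-image branch, s_v: score from the vision branch.
    Returns (fused score S, is_ood). Assumes higher scores indicate OOD.
    """
    s = alpha * s_t + (1.0 - alpha) * s_v
    return s, s > threshold


# Hypothetical branch scores for two endoscopic images.
score_a, flag_a = ood_decision(0.9, 0.7)   # both branches suspicious
score_b, flag_b = ood_decision(0.1, 0.2)   # both branches confident it is in-distribution
```

In practice the threshold would be chosen on held-out in-distribution data, for example to fix a target true-positive rate.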

[160] Orion-Lite: Distilling LLM Reasoning into Efficient Vision-Only Driving Models

Jing Gu,Niccolò Cavagnero,Gijs Dubbelman

Main category: cs.CV

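TL;DR: This paper presents Orion-Lite, a lightweight vision-only model for autonomous driving. Using latent feature distillation and ground-truth trajectory supervision, it distills the knowledge of the large vision-language-action (VLA) model ORION into a compact model that surpasses its teacher in complex, interactive closed-loop scenarios and sets a new best result on the Bench2Drive benchmark.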

Details

Motivation: Large language models (LLMs) carry general world knowledge that could help autonomous driving systems handle rare and complex scenarios, but their size conflicts with low-latency, energy-efficient deployment; knowledge distillation can retain reasoning ability at lower compute cost, yet prior work has mostly been limited to simple scenarios and open-loop evaluation. Method: Combines latent feature distillation with ground-truth trajectory supervision to transfer knowledge from the VLA teacher ORION to the lightweight vision-only student Orion-Lite. Result: Orion-Lite reaches a Driving Score of 80.6 on the demanding Bench2Drive benchmark, surpassing its large VLA teacher ORION and setting a new state of the art; this shows that vision-only architectures still have substantial untapped potential for high-performance reactive planning. Conclusion: In complex closed-loop interactive scenarios, a distilled lightweight vision-only model can not only inherit the reasoning ability of a large model efficiently but even outperform it, indicating that top-tier driving planning performance is achievable without multimodal input. Abstract: Leveraging the general world knowledge of Large Language Models (LLMs) holds significant promise for improving the ability of autonomous driving systems to handle rare and complex scenarios. While integrating LLMs into Vision-Language-Action (VLA) models has yielded state-of-the-art performance, their massive parameter counts pose severe challenges for latency-sensitive and energy-efficient deployment. Distilling LLM knowledge into a compact driving model offers a compelling solution to retain these reasoning capabilities while maintaining a manageable computational footprint. Although previous works have demonstrated the efficacy of distillation, these efforts have primarily focused on relatively simple scenarios and open-loop evaluations. Therefore, in this work, we investigate LLM distillation in more complex, interactive scenarios under closed-loop evaluation. We demonstrate that through a combination of latent feature distillation and ground-truth trajectory supervision, an efficient vision-only student model, Orion-Lite, can even surpass the performance of its massive VLA teacher, ORION. Orion-Lite sets a new state of the art on the rigorous Bench2Drive benchmark, with a Driving Score of 80.6. Ultimately, this reveals that vision-only architectures still possess significant, untapped potential for high-performance reactive planning.
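
As a rough illustration of how latent feature distillation and ground-truth trajectory supervision might be combined into one training objective, here is a minimal NumPy sketch. The abstract does not specify the loss terms, so the mean-squared feature term, the L1 trajectory term, the weight `lam`, and all names are hypothetical choices for this example only.

```python
import numpy as np

def distillation_loss(student_feat, teacher_feat, pred_traj, gt_traj, lam=1.0):
    """Trajectory supervision plus a weighted latent-feature matching term.

    ASSUMPTIONS: MSE between student and teacher latents, L1 between
    predicted and ground-truth waypoints, and a scalar weight `lam`.
    None of these choices is confirmed by the abstract.
    """
    feat_term = np.mean((student_feat - teacher_feat) ** 2)  # match teacher latents
    traj_term = np.mean(np.abs(pred_traj - gt_traj))         # follow ground truth
    return traj_term + lam * feat_term

# Toy inputs: 4-dim latents and 3 waypoints with (x, y) coordinates.
loss = distillation_loss(
    student_feat=np.zeros(4),
    teacher_feat=np.ones(4),
    pred_traj=np.zeros((3, 2)),
    gt_traj=np.full((3, 2), 2.0),
    lam=0.5,
)
```

The teacher is used only to produce `teacher_feat` during training, which is why the student can be deployed without any language model at inference time.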

[161] Preventing Overfitting in Deep Image Prior for Hyperspectral Image Denoising

Panagiotis Gkotsis,Athanasios A. Rontogiannis

Main category: cs.CV

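TL;DR: This paper proposes a DIP method for hyperspectral image denoising that combines a robust data fidelity term with explicit sensitivity regularization, effectively mitigating overfitting and improving performance.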

Details

Motivation: DIP methods tend to overfit in inverse imaging tasks, which degrades performance and forces early stopping, so an improvement is needed. Method: Uses a Smooth $\ell_1$ data term, divergence-based regularization, and input optimization together to suppress DIP overfitting in HSI denoising. Result: Experiments on real HSIs corrupted by Gaussian, sparse, and stripe noise show that the method effectively prevents overfitting and outperforms existing DIP-based HSI denoising methods. Conclusion: Combining robust data fidelity with sensitivity regularization markedly improves the generalization and denoising quality of DIP for HSI denoising. Abstract: Deep image prior (DIP) is an unsupervised deep learning framework that has been successfully applied to a variety of inverse imaging problems. However, DIP-based methods are inherently prone to overfitting, which leads to performance degradation and necessitates early stopping. In this paper, we propose a method to mitigate overfitting in DIP-based hyperspectral image (HSI) denoising by combining robust data fidelity and explicit sensitivity regularization. The proposed approach employs a Smooth $\ell_1$ data term together with a divergence-based regularization and input optimization during training. Experimental results on real HSIs corrupted by Gaussian, sparse, and stripe noise demonstrate that the proposed method effectively prevents overfitting and achieves superior denoising performance compared to state-of-the-art DIP-based HSI denoising methods.
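
For readers unfamiliar with the Smooth $\ell_1$ data term mentioned above, the sketch below shows the common Huber-style definition: quadratic for small residuals and linear for large ones, which limits the influence of sparse and stripe outliers. The transition parameter `beta` and this exact parametrization are assumptions; the paper may use a different form.

```python
import numpy as np

def smooth_l1(residual, beta=1.0):
    """Element-wise Smooth L1 penalty (common Huber-style form).

    For |r| < beta the penalty is 0.5 * r**2 / beta (quadratic);
    otherwise it is |r| - 0.5 * beta (linear).
    ASSUMPTION: this parametrization and beta=1.0 are illustrative.
    """
    r = np.abs(np.asarray(residual, dtype=float))
    return np.where(r < beta, 0.5 * r ** 2 / beta, r - 0.5 * beta)

# A small residual is penalized quadratically, a large one linearly.
penalties = smooth_l1([0.5, 2.0])
```

A data term for denoising would then be the mean of `smooth_l1(network_output - noisy_image)` over all pixels and bands.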

[162] Revisiting Radar Perception With Spectral Point Clouds

Hamza Alsharif,Jing Gu,Pavol Jancura,Satish Ravindran,Gijs Dubbelman

Main category: cs.CV

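TL;DR: This paper proposes the spectral point cloud paradigm, which treats point clouds as sparse, compressed representations of the radar spectrum and enriches them with spectral information so that they match or even surpass dense range-Doppler (RD) spectrum inputs on radar perception tasks.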

Details

Motivation: Dense range-Doppler spectra are commonly assumed to outperform sparse point clouds, but they vary across sensors and configurations, which hinders model transfer; the potential of point clouds as a general representation remains underexplored. Method: Proposes the spectral point cloud paradigm, designs an experimental framework that compares point cloud models at different densities against a dense RD benchmark, and explores two basic spectral enrichment approaches that inject target-relevant spectral information into the point clouds. Result: At certain point cloud densities, spectral point cloud models match the RD benchmark, and with spectral enrichment they surpass it. Conclusion: Spectral point clouds are a robust, unified input representation for radar perception and could support future radar foundation models. Abstract: Radar perception models are trained with different inputs, from range-Doppler spectra to sparse point clouds. Dense spectra are assumed to outperform sparse point clouds, yet they can vary considerably across sensors and configurations, which hinders transfer. In this paper, we provide alternatives for incorporating spectral information into radar point clouds and show that point clouds need not underperform compared to spectra. We introduce the spectral point cloud paradigm, where point clouds are treated as sparse, compressed representations of the radar spectra, and argue that, when enriched with spectral information, they serve as strong candidates for a unified input representation that is more robust against sensor-specific differences. We develop an experimental framework that compares spectral point cloud (PC) models at varying densities against a dense range-Doppler (RD) benchmark, and report the density levels where the PC configurations meet the performance of the RD benchmark. Furthermore, we experiment with two basic spectral enrichment approaches that inject additional target-relevant information into the point clouds. Contrary to the common belief that the dense RD approach is superior, we show that point clouds can do just as well, and can surpass the RD benchmark when enrichment is applied. Spectral point clouds can therefore serve as strong candidates for unified radar perception, paving the way for future radar foundation models.

[163] CAMotion: A High-Quality Benchmark for Camouflaged Moving Object Detection in the Wild

Siyuan Yao,Hao Sun,Ruiqi Yu,Xiwei Jiang,Wenqi Ren,Xiaochun Cao

Main category: cs.CV

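TL;DR: This paper presents CAMotion, a high-quality benchmark dataset for camouflaged moving object detection in video, built to address the limited scale and diversity of existing VCOD datasets, and reports a comprehensive evaluation of current SOTA models.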

Details

Motivation: Existing video camouflaged object detection (VCOD) datasets are severely limited in scale and diversity, which hinders in-depth analysis and broad evaluation of data-driven deep learning algorithms. Method: Builds CAMotion, a new high-quality benchmark dataset for camouflaged moving object detection in video that covers many species and includes sequences with challenging attributes such as uncertain edges, occlusion, motion blur, and shape complexity; presents sequence annotation details and statistical distributions from multiple perspectives; and evaluates existing SOTA models on the dataset. Result: CAMotion covers complex in-the-wild scenes and supports in-depth analysis of motion characteristics; the experiments expose the main bottlenecks of current methods when handling motion blur, occlusion, and complex edges. Conclusion: CAMotion fills the gap of a high-quality, diverse video benchmark for VCOD and should advance camouflaged object detection, especially in dynamic scenes. Abstract: Discovering camouflaged objects is a challenging task in computer vision due to the high similarity between camouflaged objects and their surroundings. While the problem of camouflaged object detection over sequential video frames has received increasing attention, the scale and diversity of existing video camouflaged object detection (VCOD) datasets are greatly limited, which hinders the deeper analysis and broader evaluation of recent deep learning-based algorithms with data-hungry training strategies. To break this bottleneck, in this paper, we construct CAMotion, a high-quality benchmark covering a wide range of species for camouflaged moving object detection in the wild. CAMotion comprises various sequences with multiple challenging attributes such as uncertain edge, occlusion, motion blur, and shape complexity. The sequence annotation details and statistical distribution are presented from various perspectives, allowing CAMotion to provide in-depth analyses on the camouflaged object's motion characteristics in different challenging scenarios. Additionally, we conduct a comprehensive evaluation of existing SOTA models on CAMotion, and discuss the major challenges in the VCOD task. The benchmark is available at https://www.camotion.focuslab.net.cn. We hope that CAMotion can lead to further advancements in the research community.

[164] GroundingAnomaly: Spatially-Grounded Diffusion for Few-Shot Anomaly Synthesis

Yishen Liu,Hongcang Chen,Pengcheng Zhao,Yunfan Bao,Yuxi Tian,Jieming Zhang,Hao Chen,Zheng Zhi,Yongchun Liu,Ying Li,Dongpu Cao

Main category: cs.CV

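TL;DR: This paper proposes GroundingAnomaly, a framework that uses a spatial conditioning module and a gated self-attention module to generate high-quality, precisely localized industrial image anomalies from only a few anomalous samples, substantially improving downstream detection and segmentation.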

Details

Motivation: Industrial visual anomaly detection is limited by the scarcity of real anomalous samples, and existing anomaly synthesis methods suffer from either poor blending or inaccurate masks. Method: Proposes GroundingAnomaly, which contains a Spatial Conditioning Module (using per-pixel semantic maps for precise spatial placement of anomalies) and a Gated Self-Attention Module (injecting conditioning tokens into a frozen U-Net through gated attention, preserving pretrained priors while keeping few-shot adaptation stable). Result: On the MVTec AD and VisA datasets, the generated anomalies are of high quality and the method reaches SOTA performance on downstream tasks including anomaly detection, segmentation, and instance-level detection. Conclusion: GroundingAnomaly addresses inaccurate localization and poor blending in few-shot anomaly synthesis and provides a reliable data augmentation approach for industrial quality inspection. Abstract: The performance of visual anomaly inspection in industrial quality control is often constrained by the scarcity of real anomalous samples. Consequently, anomaly synthesis techniques have been developed to enlarge training sets and enhance downstream inspection. However, existing methods either suffer from poor integration caused by inpainting or fail to provide accurate masks. To address these limitations, we propose GroundingAnomaly, a novel few-shot anomaly image generation framework. Our framework introduces a Spatial Conditioning Module that leverages per-pixel semantic maps to enable precise spatial control over the synthesized anomalies. Furthermore, a Gated Self-Attention Module is designed to inject conditioning tokens into a frozen U-Net via gated attention layers. This design preserves pretrained priors while ensuring stable few-shot adaptation. Extensive evaluations on the MVTec AD and VisA datasets demonstrate that GroundingAnomaly generates high-quality anomalies and achieves state-of-the-art performance across multiple downstream tasks, including anomaly detection, segmentation, and instance-level detection.

[165] Weakly-Supervised Lung Nodule Segmentation via Training-Free Guidance of 3D Rectified Flow

Richard Petersen,Fredrik Kahl,Jennifer Alvén

Main category: cs.CV

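TL;DR: This paper proposes a weakly supervised lung nodule segmentation method that needs no dense annotations. It combines a pretrained 3D rectified flow generative model with a predictor and, using only image-level labels and no retraining of the generative model, improves segmentation quality for small targets such as lung nodules.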

Details

Motivation: Dense annotations (such as voxel-level segmentation masks) are extremely costly for 3D medical images, and existing weakly supervised methods, especially attribution-based ones, struggle to capture small structures such as lung nodules. Method: Combines a pretrained 3D rectified flow generative model with a predictor in a plug-and-play manner, using training-free guidance of the generative model; only the predictor is fine-tuned, and only image-level labels are required. Result: On LUNA16, the method outperforms baselines, consistently detects lung nodules of different sizes and shapes, and improves segmentation quality. Conclusion: Generative foundation models can serve as effective tools for weakly supervised 3D medical image segmentation, particularly for small-lesion detection. Abstract: Dense annotations, such as segmentation masks, are expensive and time-consuming to obtain, especially for 3D medical images where expert voxel-wise labeling is required. Weakly supervised approaches aim to address this limitation, but often rely on attribution-based methods that struggle to accurately capture small structures such as lung nodules. In this paper, we propose a weakly-supervised segmentation method for lung nodules by combining pretrained state-of-the-art rectified flow and predictor models in a plug-and-play manner. Our approach uses training-free guidance of a 3D rectified flow model, requiring only fine-tuning of the predictor using image-level labels and no retraining of the generative model. The proposed method produces improved-quality segmentations for two separate predictors, consistently detecting lung nodules of varying sizes and shapes. Experiments on LUNA16 demonstrate improvements over baseline methods, highlighting the potential of generative foundation models as tools for weakly supervised 3D medical image segmentation.

[166] Fundus-R1: Training a Fundus-Reading MLLM with Knowledge-Aware Reasoning on Public Data

Yuchuan Deng,Qijie Wei,Kaiheng Qian,Jiazhen Liu,Zijie Xin,Bangxiang Lan,Jingyu Liu,Jianfeng Dong,Xirong Li

Main category: cs.CV

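TL;DR: This paper proposes Fundus-R1, a reasoning-capable multimodal large language model for fundus image understanding trained purely on public data (94% of which carries only image-level labels). It uses RAG to generate knowledge-aware reasoning traces and adds a self-consistency process reward to RLVR, clearly outperforming baseline models.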

Details

Motivation: Existing fundus image understanding models depend on large amounts of private, high-quality data paired with clinical reports, which hurts reproducibility and raises the barrier to research; a way to build strong models from public, cheaply annotated data (such as image-level labels) is needed. Method: (1) Uses RAG to automatically generate image-specific, knowledge-driven reasoning traces that link the visual findings identified by a generic MLLM to the image labels through ophthalmic knowledge; (2) extends the RLVR framework with a process reward that encourages internal self-consistency of the reasoning trace. Result: On three benchmarks, FunBench, Omni-Fundus, and GMAI-Fundus, Fundus-R1 clearly outperforms baselines such as Qwen2.5-VL, including a stronger variant post-trained without the generated reasoning traces. Conclusion: Large-scale public, weakly annotated (image-level) data alone is enough to train a high-performing, interpretable fundus-reading MLLM, offering a more accessible research paradigm for the field. Abstract: Fundus imaging such as CFP, OCT and UWF is crucial for the early detection of retinal anomalies and diseases. Fundus image understanding, due to its knowledge-intensive nature, poses a challenging vision-language task. An emerging approach to addressing the task is to post-train a generic multimodal large language model (MLLM), either by supervised finetuning (SFT) or by reinforcement learning with verifiable rewards (RLVR), on a considerable amount of in-house samples paired with high-quality clinical reports. However, these valuable samples are not publicly accessible, which not only hinders reproducibility but also practically limits research to a few players. To overcome the barrier, we make a novel attempt to train a reasoning-enhanced fundus-reading MLLM, which we term Fundus-R1, using exclusively public datasets, wherein over 94% of the data are annotated with only image-level labels. Our technical contributions are two-fold. First, we propose a RAG-based method for composing image-specific, knowledge-aware reasoning traces. Such auto-generated traces link visual findings identified by a generic MLLM to the image labels in terms of ophthalmic knowledge. Second, we enhance RLVR with a process reward that encourages self-consistency of the generated reasoning trace in each rollout. Extensive experiments on three fundus-reading benchmarks, i.e., FunBench, Omni-Fundus and GMAI-Fundus, show that Fundus-R1 clearly outperforms multiple baselines, including its generic counterpart (Qwen2.5-VL) and a stronger edition post-trained without using the generated traces.
This work paves the way for training powerful fundus-reading MLLMs with publicly available data.

[167] Lost in the Hype: Revealing and Dissecting the Performance Degradation of Medical Multimodal Large Language Models in Image Classification

Xun Zhu,Fanbin Mo,Xi Chen,Kaili Zheng,Shaoshuai Yang,Yiming Shi,Jian Gao,Miao Li,Ji Wu

Main category: cs.CV

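TL;DR: Using feature probing, this paper systematically analyzes the root causes of performance degradation in 14 open-source medical multimodal large language models (MLLMs) on image classification, identifies four typical failure modes, and proposes quantitative evaluation scores.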

Details

Motivation: Although medical multimodal large language models (MLLMs) have large advantages in pre-training data and parameter count, they consistently lag behind traditional deep learning models on basic medical image classification; the source of this degradation needs investigation. Method: Runs large-scale experiments with 14 open-source medical MLLMs on three representative medical image classification datasets; uses module-by-module, layer-by-layer visual feature probing to trace information flow and visualize how classification signals are distorted, diluted, or overridden. Result: Identifies four failure modes for the first time: limited quality of visual representation, fidelity loss in connector projection, comprehension deficit in LLM reasoning, and misalignment of semantic mapping; also proposes quantitative scores for the health of feature evolution that allow objective comparison across models and datasets. Conclusion: Current medical MLLMs remain far from clinical usability; their bottleneck stems from a deep mismatch between architecture and task requirements, calling for redesign of feature modeling, modality alignment, and reasoning mechanisms. Abstract: The rise of multimodal large language models (MLLMs) has sparked an unprecedented wave of applications in the field of medical imaging analysis. However, as one of the earliest and most fundamental tasks integrated into this paradigm, medical image classification reveals a sobering reality: state-of-the-art medical MLLMs consistently underperform compared to traditional deep learning models, despite their overwhelming advantages in pre-training data and model parameters. This paradox prompts a critical rethinking: where exactly does the performance degradation originate? In this paper, we conduct extensive experiments on 14 open-source medical MLLMs across three representative image classification datasets. Moving beyond superficial performance benchmarking, we employ feature probing to track the information flow of visual features module-by-module and layer-by-layer throughout the entire MLLM pipeline, enabling explicit visualization of where and how classification signals are distorted, diluted, or overridden. As the first attempt to dissect classification performance degradation in medical MLLMs, our findings reveal four failure modes: 1) quality limitation in visual representation, 2) fidelity loss in connector projection, 3) comprehension deficit in LLM reasoning, and 4) misalignment of semantic mapping. Meanwhile, we introduce quantitative scores that characterize the health of feature evolution, enabling principled comparisons across diverse MLLMs and datasets.
Furthermore, we provide insightful discussions centered on the critical barriers that prevent current medical MLLMs from fulfilling their promised clinical potential. We hope that our work provokes rethinking within the community, highlighting that the road from high expectations to clinically deployable MLLMs remains long and winding.

[168] InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding

Ashutosh Kumar,Rajat Saini,Jingjing Pan,Mustafa Erdogan,Mingfang Zhang,Betty Le Dem,Norimasa Kobori,Quan Kong

Main category: cs.CV

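TL;DR: This paper proposes InstAP, an instance-aware pre-training framework that jointly optimizes global vision-text alignment and fine-grained instance-level contrastive alignment to improve instance-level reasoning. It also builds the large-scale InstVL dataset to support the framework; InstAP clearly outperforms existing VLP models on instance-level retrieval while also improving global understanding.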

Details

Motivation: Existing vision-language pre-training (VLP) paradigms are good at global scene understanding but are limited in instance-level reasoning because they use only global supervision. Method: Proposes the InstAP instance-aware pre-training framework, which jointly optimizes global vision-text alignment and fine-grained instance-level contrastive alignment (grounding textual mentions to specific spatial-temporal regions); builds InstVL, a large-scale dataset with dual-granularity annotations (holistic scene captions and dense, grounded instance descriptions). Result: On the InstVL benchmark, InstAP clearly surpasses existing VLP models on instance-level retrieval and also beats a strong baseline trained on the same data; it is competitive on zero-shot video benchmarks such as MSR-VTT and DiDeMo; visualizations show that it localizes the instances mentioned in text accurately. Conclusion: Instance-aware pre-training improves instance-level understanding and also benefits global understanding, supporting the value of fine-grained supervision for vision-language models. Abstract: Current vision-language pre-training (VLP) paradigms excel at global scene understanding but struggle with instance-level reasoning due to global-only supervision. We introduce InstAP, an Instance-Aware Pre-training framework that jointly optimizes global vision-text alignment and fine-grained, instance-level contrastive alignment by grounding textual mentions to specific spatial-temporal regions. To support this, we present InstVL, a large-scale dataset (2 million images, 50,000 videos) with dual-granularity annotations: holistic scene captions and dense, grounded instance descriptions. On the InstVL benchmark, InstAP substantially outperforms existing VLP models on instance-level retrieval, and also surpasses a strong VLP baseline trained on the exact same data corpus, isolating the benefit of our instance-aware objective. Moreover, instance-centric pre-training improves global understanding: InstAP achieves competitive zero-shot performance on multiple video benchmarks, including MSR-VTT and DiDeMo. Qualitative visualizations further show that InstAP localizes textual mentions to the correct instances, while global-only models exhibit more diffuse, scene-level attention.
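
Instance-level contrastive alignment of the kind described above is often implemented with an InfoNCE-style loss, in which each region embedding should score highest against the embedding of its own textual mention. The sketch below is a generic one-directional InfoNCE in NumPy, not the InstAP objective itself; the function name, the temperature value, and the assumption that row `i` of each matrix forms a matched pair are all illustrative.

```python
import numpy as np

def info_nce(region_emb, text_emb, temperature=0.07):
    """Region-to-text InfoNCE loss over a batch of matched pairs.

    ASSUMPTION: row i of region_emb is the positive for row i of text_emb;
    every other row in the batch acts as a negative.
    """
    # L2-normalize so the dot product is cosine similarity.
    r = region_emb / np.linalg.norm(region_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = r @ t.T / temperature
    # Subtract the row-wise max for a numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positives sit on the diagonal.
    return -np.mean(np.diag(log_prob))

emb = np.eye(3)                               # three toy orthogonal embeddings
aligned = info_nce(emb, emb)                  # each region paired with its own text
misaligned = info_nce(emb, emb[[1, 2, 0]])    # pairs deliberately shuffled
```

Correctly paired embeddings give a loss near zero, while shuffled pairs give a much larger loss, which is the signal that pushes mentions toward their own regions.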

[169] PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models

Ruizhi Zhang,Ye Huang,Yuangang Pan,Chuanfu Shen,Zhilin Liu,Ting Xie,Wen Li,Lixin Duan

Main category: cs.CV

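TL;DR: This paper presents PokeGym, a visually driven, long-horizon 3D embodied benchmark built on Pokemon Legends: Z-A that targets four shortcomings of existing VLM evaluation: interactivity, depth perception, state leakage, and scalability. With strict isolation of RGB input and memory-based verification, it evaluates visual grounding, semantic reasoning, and autonomous exploration on navigation and interaction tasks. It finds that physical deadlock recovery, not high-level planning, is the main bottleneck for current VLMs, and that models of different capability differ metacognitively, showing "unaware" versus "aware" deadlocks.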

Details

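Motivation: Vision-Language Models (VLMs) perform well on static visual understanding, but their deployment in complex 3D embodied environments remains severely limited; current benchmarks have four deficiencies (passive perception, 2D simplification, state leakage, and unscalable human evaluation), so a more realistic, strict, and scalable evaluation paradigm is needed. Method: Builds the PokeGym benchmark on Pokemon Legends: Z-A, a high-fidelity 3D open-world RPG, with 30 long-horizon tasks (30-220 steps) covering navigation, interaction, and mixed scenarios; uses three instruction granularities (Visual-Guided, Step-Guided, Goal-Only); enforces code-level isolation, where the agent receives only raw RGB frames and an independent evaluator determines success by memory scanning, which prevents state leakage and makes evaluation fully automatic. Result: Physical deadlock recovery is the core bottleneck for current VLMs and is strongly negatively correlated with task success; the study also finds a metacognitive divergence in which weaker models mostly fall into Unaware Deadlocks (not noticing they are stuck) while stronger models fall into Aware Deadlocks (noticing but failing to escape). Conclusion: Improving language or high-level planning alone is not enough to overcome the embodied vision bottleneck; explicit spatial intuition must be built into VLM architectures to improve the robustness of physical interaction and recovery in real 3D dynamic environments. Abstract: While Vision-Language Models (VLMs) have achieved remarkable progress in static visual understanding, their deployment in complex 3D embodied environments remains severely limited. Existing benchmarks suffer from four critical deficiencies: (1) passive perception tasks circumvent interactive dynamics; (2) simplified 2D environments fail to assess depth perception; (3) privileged state leakage bypasses genuine visual processing; and (4) human evaluation is prohibitively expensive and unscalable. We introduce PokeGym, a visually-driven long-horizon benchmark instantiated within Pokemon Legends: Z-A, a visually complex 3D open-world Role-Playing Game. PokeGym enforces strict code-level isolation: agents operate solely on raw RGB observations while an independent evaluator verifies success via memory scanning, ensuring pure vision-based decision-making and automated, scalable assessment. The benchmark comprises 30 tasks (30-220 steps) spanning navigation, interaction, and mixed scenarios, with three instruction granularities (Visual-Guided, Step-Guided, Goal-Only) to systematically deconstruct visual grounding, semantic reasoning, and autonomous exploration capabilities. Our evaluation reveals a key limitation of current VLMs: physical deadlock recovery, rather than high-level planning, constitutes the primary bottleneck, with deadlocks showing a strong negative correlation with task success.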
Furthermore, we uncover a metacognitive divergence: weaker models predominantly suffer from Unaware Deadlocks (oblivious to entrapment), whereas advanced models exhibit Aware Deadlocks (recognizing entrapment yet failing to recover). These findings highlight the need to integrate explicit spatial intuition into VLM architectures. The code and benchmark will be available on GitHub.

[170] MegaStyle: Constructing Diverse and Scalable Style Dataset via Consistent Text-to-Image Style Mapping

Junyao Gao,Sibo Liu,Jiaxing Li,Yanan Sun,Yuanpeng Tu,Fei Shen,Weidong Zhang,Cairong Zhao,Jun Zhang

Main category: cs.CV

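TL;DR: This paper presents MegaStyle, a novel and scalable data curation pipeline for building style datasets, and trains a style encoder and a style transfer model on the resulting dataset.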

Details

Motivation: Existing style datasets lack intra-style consistency, inter-style diversity, and high quality, which limits the performance of style transfer models. Method: Exploits the consistent text-to-image style mapping ability of large generative models to build a diverse prompt gallery of 170K style prompts and 400K content prompts, and combines them to generate the large-scale style dataset MegaStyle-1.4M; on this basis, fine-tunes the style encoder MegaStyle-Encoder with style-supervised contrastive learning and trains the FLUX-based style transfer model MegaStyle-FLUX. Result: Experiments confirm that MegaStyle-1.4M maintains intra-style consistency, inter-style diversity, and high quality; MegaStyle-Encoder gives a reliable style similarity measure, and MegaStyle-FLUX generalizes well in style transfer. Conclusion: MegaStyle provides a high-quality, large-scale, structured style dataset with accompanying models for the style transfer field. Abstract: In this paper, we introduce MegaStyle, a novel and scalable data curation pipeline that constructs an intra-style consistent, inter-style diverse and high-quality style dataset. We achieve this by leveraging the consistent text-to-image style mapping capability of current large generative models, which can generate images in the same style from a given style description. Building on this foundation, we curate a diverse and balanced prompt gallery with 170K style prompts and 400K content prompts, and generate a large-scale style dataset MegaStyle-1.4M via content-style prompt combinations. With MegaStyle-1.4M, we propose style-supervised contrastive learning to fine-tune a style encoder MegaStyle-Encoder for extracting expressive, style-specific representations, and we also train a FLUX-based style transfer model MegaStyle-FLUX. Extensive experiments demonstrate the importance of maintaining intra-style consistency, inter-style diversity and high quality for a style dataset, as well as the effectiveness of the proposed MegaStyle-1.4M. Moreover, when trained on MegaStyle-1.4M, MegaStyle-Encoder and MegaStyle-FLUX provide reliable style similarity measurement and generalizable style transfer, making a significant contribution to the style transfer community. More results are available at our project website https://jeoyal.github.io/MegaStyle/.

[171] SurfelSplat: Learning Efficient and Generalizable Gaussian Surfel Representations for Sparse-View Surface Reconstruction

Chensheng Dai,Shengjun Zhang,Min Chen,Yueqi Duan

Main category: cs.CV

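TL;DR: This paper proposes SurfelSplat, a feed-forward framework that generates pixel-aligned Gaussian surfel representations from sparse-view images, addressing the dense-view requirement and long runtime of existing optimization-based methods. A cross-view feature aggregation module based on the Nyquist sampling theorem improves the accuracy of recovered geometric attributes; on the DTU dataset the method is comparable to SOTA while running about 100x faster at inference.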

Details

Motivation: Existing optimization-based 3D Gaussian Splatting (3DGS) surface reconstruction methods need dense input views and slow per-scene optimization, which limits practical use. Method: Proposes the feed-forward SurfelSplat framework, which adapts the geometric form of Gaussian surfels with low-pass filters guided by the spatial sampling rate, obtains feature correlations by cross-view projection, and regresses Gaussian surfels with precise geometry through a dedicated feature fusion network. Result: On the DTU reconstruction benchmark, accuracy is comparable to the best current methods, prediction for a scene takes about 1 second, roughly 100x faster than optimization-based methods, and no per-scene training is required. Conclusion: SurfelSplat achieves efficient, generalizable, pixel-aligned sparse-view surface reconstruction and demonstrates the feasibility and practicality of feed-forward Gaussian surfel modeling. Abstract: 3D Gaussian Splatting (3DGS) has demonstrated impressive performance in 3D scene reconstruction. Beyond novel view synthesis, it shows great potential for multi-view surface reconstruction. Existing methods employ optimization-based reconstruction pipelines that achieve precise and complete surface extractions. However, these approaches typically require dense input views and time-consuming per-scene optimization. To address these limitations, we propose SurfelSplat, a feed-forward framework that generates efficient and generalizable pixel-aligned Gaussian surfel representations from sparse-view images. We observe that conventional feed-forward structures struggle to recover accurate geometric attributes of Gaussian surfels because the spatial frequency of pixel-aligned primitives exceeds Nyquist sampling rates. Therefore, we propose a cross-view feature aggregation module based on the Nyquist sampling theorem. Specifically, we first adapt the geometric forms of Gaussian surfels with spatial sampling rate-guided low-pass filters. We then project the filtered surfels across all input views to obtain cross-view feature correlations. By processing these correlations through a specially designed feature fusion network, we can finally regress Gaussian surfels with precise geometry. Extensive experiments on DTU reconstruction benchmarks demonstrate that our model achieves comparable results with state-of-the-art methods, and predicts Gaussian surfels within 1 second, offering a 100x speedup without costly per-scene training.

[172] Phantasia: Context-Adaptive Backdoors in Vision Language Models

Nam Duong Tran,Phi Le Nguyen

Main category: cs.CV

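TL;DR: This paper shows that the stealthiness of existing backdoor attacks on vision-language models (VLMs) has been overestimated, and proposes Phantasia, a new context-adaptive backdoor attack that produces semantically consistent malicious responses that are harder to detect.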

Details

Motivation: Existing VLM backdoor attacks mostly rely on fixed, easily recognizable poisoning patterns, and their stealthiness has been overestimated; truly stealthy, context-aware attacks are missing. Method: (1) Adapts defense techniques from other domains to evaluate and expose the detectability of existing attacks; (2) proposes the Phantasia attack, which uses the input context to dynamically generate semantically coherent but malicious responses and avoids static poisoning patterns. Result: Phantasia achieves SOTA attack success rates across multiple VLM architectures while maintaining normal performance on benign samples under various defensive settings. Conclusion: The practical stealthiness of VLM backdoor attacks is much lower than expected; Phantasia offers a more realistic and challenging backdoor threat model and underlines the need for stronger VLM security defenses. Abstract: Recent advances in Vision-Language Models (VLMs) have greatly enhanced the integration of visual perception and linguistic reasoning, driving rapid progress in multimodal understanding. Despite these achievements, the security of VLMs, particularly their vulnerability to backdoor attacks, remains significantly underexplored. Existing backdoor attacks on VLMs are still in an early stage of development, with most current methods relying on generating poisoned responses that contain fixed, easily identifiable patterns. In this work, we make two key contributions. First, we demonstrate for the first time that the stealthiness of existing VLM backdoor attacks has been substantially overestimated. By adapting defense techniques originally designed for other domains (e.g., vision-only and text-only models), we show that several state-of-the-art attacks can be detected with surprising ease. Second, to address this gap, we introduce Phantasia, a context-adaptive backdoor attack that dynamically aligns its poisoned outputs with the semantics of each input. Instead of producing static poisoned patterns, Phantasia encourages models to generate contextually coherent yet malicious responses that remain plausible, thereby significantly improving stealth and adaptability. Extensive experiments across diverse VLM architectures reveal that Phantasia achieves state-of-the-art attack success rates while maintaining benign performance under various defensive settings.

[173] SyncBreaker: Stage-Aware Multimodal Adversarial Attacks on Audio-Driven Talking Head Generation

Wenli Zhang,Xianglong Shi,Sirui Zhao,Xinqi Chen,Guo Cheng,Yifan Xu,Tong Xu,Yong Liao

Main category: cs.CV

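TL;DR: This paper proposes SyncBreaker, a stage-aware multimodal protection framework that jointly perturbs image and audio inputs to suppress speech-driven facial dynamics. It effectively degrades lip synchronization and facial dynamics while preserving the perceptual quality of the inputs and remaining robust to purification.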

Details

Motivation: Diffusion-based audio-driven talking-head generation is realistic but open to misuse for fraud and misinformation; existing single-modality protections (image-only or audio-only) cannot effectively suppress speech-driven facial dynamics. Method: Proposes the SyncBreaker framework: (1) the image stream uses nullifying supervision with Multi-Interval Sampling (MIS), aggregating guidance from multiple denoising stages to steer generation toward the static reference portrait; (2) the audio stream uses Cross-Attention Fooling (CAF) to suppress audio-conditioned cross-attention responses in specific intervals; the two streams are optimized independently and combined at inference. Result: In a white-box proactive protection setting, SyncBreaker degrades lip synchronization and facial dynamics more than strong single-modality baselines, while preserving input perceptual quality and remaining robust to purification. Conclusion: Through stage-aware, coordinated multimodal perturbation, SyncBreaker offers a more effective, flexible, and robust protection scheme against audio-driven talking-head generation. Abstract: Diffusion-based audio-driven talking-head generation enables realistic portrait animation, but also introduces risks of misuse, such as fraud and misinformation. Existing protection methods are largely limited to a single modality, and neither image-only nor audio-only attacks can effectively suppress speech-driven facial dynamics. To address this gap, we propose SyncBreaker, a stage-aware multimodal protection framework that jointly perturbs portrait and audio inputs under modality-specific perceptual constraints. Our key contributions are twofold. First, for the image stream, we introduce nullifying supervision with Multi-Interval Sampling (MIS) across diffusion stages to steer the generation toward the static reference portrait by aggregating guidance from multiple denoising intervals. Second, for the audio stream, we propose Cross-Attention Fooling (CAF), which suppresses interval-specific audio-conditioned cross-attention responses. Both streams are optimized independently and combined at inference time to enable flexible deployment. We evaluate SyncBreaker in a white-box proactive protection setting. Extensive experiments demonstrate that SyncBreaker more effectively degrades lip synchronization and facial dynamics than strong single-modality baselines, while preserving input perceptual quality and remaining robust under purification. Code: https://github.com/kitty384/SyncBreaker.

[174] BLaDA: Bridging Language to Functional Dexterous Actions within 3DGS Fields

Fan Yang,Wenrui Chen,Guorun Yan,Ruize Liao,Wanjun Jia,Dongsheng Luo,Kailun Yang,Zhiyong Li,Yaonan Wang

Main category: cs.CV

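TL;DR: This paper proposes the BLaDA framework, which achieves zero-shot, interpretable functional dexterous grasping through language parsing, triangular functional point localization, and 3D keypoint grasp matrix transformation.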

Details

Motivation: Existing modular methods rely on predefined affordance labels and lack tight coupling between semantics and pose, which makes functional dexterous manipulation under open-vocabulary instructions difficult. Method: BLaDA has three core modules: Knowledge-guided Language Parsing (KLP) converts natural language into a structured sextuple of constraints; Triangular Functional Point Localization (TriLocation) locates functional regions using 3D Gaussian Splatting and geometric constraints; 3D Keypoint Grasp Matrix Transformation Execution (KGT3D+) decodes the semantic-geometric constraints into wrist poses and finger-level commands. Result: On several complex benchmarks, BLaDA significantly outperforms existing methods in affordance grounding precision and functional manipulation success rate. Conclusion: BLaDA enables functional dexterous grasping that is driven by open-vocabulary instructions, tightly couples semantics and pose, and is physically interpretable, improving the generalization and controllability of modular methods. Abstract: In unstructured environments, functional dexterous grasping calls for the tight integration of semantic understanding, precise 3D functional localization, and physically interpretable execution. Modular hierarchical methods are more controllable and interpretable than end-to-end VLA approaches, but existing ones still rely on predefined affordance labels and lack the tight semantic-pose coupling needed for functional dexterous manipulation. To address this, we propose BLaDA (Bridging Language to Dexterous Actions in 3DGS fields), an interpretable zero-shot framework that grounds open-vocabulary instructions as perceptual and control constraints for functional dexterous manipulation. BLaDA establishes an interpretable reasoning chain by first parsing natural language into a structured sextuple of manipulation constraints via a Knowledge-guided Language Parsing (KLP) module. To achieve pose-consistent spatial reasoning, we introduce the Triangular Functional Point Localization (TriLocation) module, which utilizes 3D Gaussian Splatting as a continuous scene representation and identifies functional regions under triangular geometric constraints. Finally, the 3D Keypoint Grasp Matrix Transformation Execution (KGT3D+) module decodes these semantic-geometric constraints into physically plausible wrist poses and finger-level commands. Extensive experiments on complex benchmarks demonstrate that BLaDA significantly outperforms existing methods in both affordance grounding precision and the success rate of functional manipulation across diverse categories and tasks. Code will be publicly available at https://github.com/PopeyePxx/BLaDA.

[175] HST-HGN: Heterogeneous Spatial-Temporal Hypergraph Networks with Bidirectional State Space Models for Global Fatigue Assessment

Changdao Chen

Main category: cs.CV

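TL;DR: This paper proposes HST-HGN, a model that combines a hierarchical hypergraph network with a bidirectional state space model (Bi-Mamba). At low computational cost, it models the high-order synergies and long-range temporal evolution of driver facial expressions, improving both the accuracy and the real-time performance of fatigue detection in untrimmed videos.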

Details

Motivation: Under limited compute, existing methods struggle to model the long-range temporal dependencies of subtle facial expressions in untrimmed videos; heavy models are too expensive, while lightweight graph models cannot capture high-order synergies and global temporal context. Method: Proposes the heterogeneous spatial-temporal hypergraph network HST-HGN: spatially, a hierarchical hypergraph fuses pose-disentangled geometric topologies with multi-modal texture patches; temporally, a linear-complexity bidirectional Bi-Mamba module performs sequence modeling. Result: Achieves SOTA performance on multiple fatigue detection benchmarks with a good balance between discriminative power and computational efficiency, making it suitable for real-time in-cabin edge deployment. Conclusion: Through heterogeneous hypergraphs and bidirectional state space modeling, HST-HGN overcomes the limits of lightweight models in capturing high-order synergies and fine-grained temporal structure, offering a new approach to fatigue detection in resource-constrained settings. Abstract: It remains challenging to assess driver fatigue from untrimmed videos under constrained computational budgets, due to the difficulty of modeling long-range temporal dependencies in subtle facial expressions. Some existing approaches rely on computationally heavy architectures, whereas others employ traditional lightweight pairwise graph networks, despite their limited capacity to model high-order synergies and global temporal context. Therefore, we propose HST-HGN, a novel Heterogeneous Spatial-Temporal Hypergraph Network driven by Bidirectional State Space Models. Spatially, we introduce a hierarchical hypergraph network to fuse pose-disentangled geometric topologies with multi-modal texture patches dynamically. This formulation encapsulates high-order synergistic facial deformations, effectively overcoming the limitations of conventional methods. In temporal terms, a Bi-Mamba module with linear complexity is applied to perform bidirectional sequence modeling. This explicit temporal-evolution filtering enables the network to distinguish highly ambiguous transient actions, such as yawning versus speaking, while encompassing their complete physiological lifecycles. Extensive evaluations across diverse fatigue benchmarks demonstrate that HST-HGN achieves state-of-the-art performance. In particular, our method strikes a balance between discriminative power and computational efficiency, making it well-suited for real-time in-cabin edge deployment.

[176] CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning

Rui Gan,Junyi Ma,Pei Li,Xingyou Yang,Kai Chen,Sikai Chen,Bin Ran

Main category: cs.CV

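TL;DR: This paper presents CrashSight, a large-scale vision-language benchmark built from the roadside perspective for evaluating how well models understand roadway crash scenes, with particular focus on temporal and causal reasoning from the infrastructure viewpoint.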

Details

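Motivation: The performance of existing vision-language models (VLMs) in safety-critical traffic scenarios, especially from the roadside perspective, has not been adequately evaluated; existing benchmarks mostly focus on the ego-vehicle view and cannot support the infrastructure-side cooperative perception that cooperative autonomous driving requires. Method: Builds the CrashSight benchmark of 250 real roadside crash videos and 13K multiple-choice question-answer pairs under a two-tier taxonomy: Tier 1 evaluates visual grounding and identification of the involved parties, while Tier 2 evaluates higher-level reasoning such as crash mechanics, causal attribution, temporal progression, and outcome prediction; 8 SOTA VLMs are systematically evaluated and their failure cases analyzed. Result: Current mainstream VLMs perform well on scene description but fall markedly short on safety-critical capabilities such as temporal modeling and causal reasoning; the experiments reveal typical failure modes and point to directions for improvement. Conclusion: CrashSight fills the gap in roadside-view traffic understanding benchmarks, provides a standardized evaluation framework for infrastructure-assisted perception in cooperative autonomous driving, and pushes VLMs toward safer, more robust traffic scene understanding. Abstract: Cooperative autonomous driving requires traffic scene understanding from both vehicle and infrastructure perspectives. While vision-language models (VLMs) show strong general reasoning capabilities, their performance in safety-critical traffic scenarios remains insufficiently evaluated due to the ego-vehicle focus of existing benchmarks. To bridge this gap, we present CrashSight, a large-scale vision-language benchmark for roadway crash understanding using real-world roadside camera data. The dataset comprises 250 crash videos, annotated with 13K multiple-choice question-answer pairs organized under a two-tier taxonomy. Tier 1 evaluates the visual grounding of scene context and involved parties, while Tier 2 probes higher-level reasoning, including crash mechanics, causal attribution, temporal progression, and post-crash outcomes. We benchmark 8 state-of-the-art VLMs and show that, despite strong scene description capabilities, current models struggle with temporal and causal reasoning in safety-critical scenarios. We provide a detailed analysis of failure scenarios and discuss directions for improving VLM crash understanding. The benchmark provides a standardized evaluation framework for infrastructure-assisted perception in cooperative autonomous driving. The CrashSight benchmark, including the full dataset and code, is accessible at https://mcgrche.github.io/crashsight.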

[177] OVS-DINO: Open-Vocabulary Segmentation via Structure-Aligned SAM-DINO with Language Guidance

Haoxi Zeng,Qiankun Liu,Yi Bin,Haiyue Zhang,Yujuan Ding,Guoqing Wang,Deqiang Ouyang,Heng Tao Shen

Main category: cs.CV

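TL;DR: This paper proposes the OVS-DINO framework, which structurally aligns DINO with SAM to strengthen its boundary awareness, markedly improving open-vocabulary segmentation performance, especially in complex scenes.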

Details

Motivation: CLIP-style methods generalize well semantically but lack spatial detail; VFMs such as DINO improve on this yet still lack precise edge perception. Method: Finds that boundary information in DINO's deeper features progressively attenuates, and proposes the OVS-DINO framework: a Structure-Aware Encoder (SAE) and a Structure-Modulated Decoder (SMD) use SAM's structural priors to activate DINO's boundary features, supervised with SAM-generated pseudo-masks. Result: Achieves SOTA on multiple weakly-supervised OVS benchmarks, raising the average score by 2.1 points (44.8% → 46.9%) and the Cityscapes score by 6.3 points (36.6% → 42.9%). Conclusion: Structural alignment effectively restores DINO's latent boundary sensitivity, demonstrating the effectiveness and potential of fusing structural priors from multiple models for open-vocabulary segmentation. Abstract: Open-Vocabulary Segmentation (OVS) aims to segment image regions beyond predefined category sets by leveraging semantic descriptions. While CLIP-based approaches excel in semantic generalization, they frequently lack the fine-grained spatial awareness required for dense prediction. Recent efforts have incorporated Vision Foundation Models (VFMs) like DINO to alleviate these limitations. However, these methods still struggle with the precise edge perception necessary for high-fidelity segmentation. In this paper, we analyze internal representations of DINO and discover that its inherent boundary awareness is not absent but rather undergoes progressive attenuation as features transition into deeper transformer blocks. To address this, we propose OVS-DINO, a novel framework that revitalizes latent edge-sensitivity of DINO through structural alignment with the Segment Anything Model (SAM). Specifically, we introduce a Structure-Aware Encoder (SAE) and a Structure-Modulated Decoder (SMD) to effectively activate boundary features of DINO using SAM's structural priors, complemented by a supervision strategy utilizing SAM-generated pseudo-masks. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple weakly-supervised OVS benchmarks, improving the average score by 2.1% (from 44.8% to 46.9%). Notably, our approach significantly enhances segmentation accuracy in complex, cluttered scenarios, with a gain of 6.3% on Cityscapes (from 36.6% to 42.9%).

[178] LAMP: Lift Image-Editing as General 3D Priors for Open-world Manipulation

Jingjing Wang,Zhengdong Hong,Chong Bao,Yuke Zhu,Junhan Sun,Guofeng Zhang

Main category: cs.CV

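TL;DR: This paper proposes LAMP, which uses image editing as a 3D prior to extract continuous, geometry-aware 3D transformations between objects, improving generalization in open-world robotic manipulation.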

Details

Motivation: Existing learning-based methods (such as reinforcement learning, imitation learning, and vision-language-action models) generalize poorly to novel tasks and unseen environments, while large language models and vision-language models offer strong semantic reasoning but limited 3D awareness, making fine-grained manipulation hard to support. Method: Proposes the LAMP framework, which takes the rich 2D spatial cues implicit in image editing and 'lifts' them into continuous inter-object 3D transformations that serve as a geometry-aware representation. Result: Experiments show that LAMP provides precise 3D transformations and achieves strong zero-shot generalization on open-world manipulation tasks. Conclusion: Using image editing as a 3D prior is an effective new paradigm for building geometry-aware, generalizable representations, markedly improving the adaptability and robustness of open-world robotic manipulation. Abstract: Human-like generalization in the open world remains a fundamental challenge for robotic manipulation. Existing learning-based methods, including reinforcement learning, imitation learning, and vision-language-action models (VLAs), often struggle with novel tasks and unseen environments. Another promising direction is to explore generalizable representations that capture fine-grained spatial and geometric relations for open-world manipulation. While large language models (LLMs) and vision-language models (VLMs) provide strong semantic reasoning based on language or annotated 2D representations, their limited 3D awareness restricts their applicability to fine-grained manipulation. To address this, we propose LAMP, which lifts image-editing as 3D priors to extract inter-object 3D transformations as continuous, geometry-aware representations. Our key insight is that image-editing inherently encodes rich 2D spatial cues, and lifting these implicit cues into 3D transformations provides fine-grained and accurate guidance for open-world manipulation. Extensive experiments demonstrate that LAMP delivers precise 3D transformations and achieves strong zero-shot generalization in open-world manipulation. Project page: https://zju3dv.github.io/LAMP/.

[179] Faithful GRPO: Improving Visual Spatial Reasoning in Multimodal Language Models via Constrained Policy Optimization

Sai Srinivas Kancheti,Aditya Kanade,Rohit Sinha,Vineeth N Balasubramanian,Tanuja Ganu

Main category: cs.CV

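TL;DR: This paper proposes Faithful GRPO (FGRPO), an improved reinforcement learning method that uses Lagrangian dual ascent to keep chain-of-thought (CoT) reasoning faithful in both logical consistency and visual grounding, markedly improving the reasoning quality and answer accuracy of multimodal spatial reasoning models.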

Details

Motivation: Existing RL-based multimodal reasoning models (such as ViGoRL-Spatial, TreeVGR, and models trained with standard GRPO) improve answer accuracy, but the CoT they generate is often inconsistent with the final answer and unsupported by image evidence, so reasoning quality degrades. Method: Proposes Faithful GRPO (FGRPO), which adds batch-level logical consistency and visual grounding constraints to the Group Relative Policy Optimization (GRPO) framework, adaptively adjusts the constraint weights through Lagrangian dual ascent, and folds them into the within-group advantage computation. Result: Validated on Qwen2.5-VL-7B/3B models and seven spatial reasoning datasets: FGRPO cuts the CoT inconsistency rate from 24.5% to 1.7%, raises visual grounding scores by 13%, and also exceeds standard GRPO in answer accuracy. Conclusion: Enforcing logical consistency and visual grounding in the reasoning process not only improves reasoning quality but also feeds back into final answer accuracy, confirming that 'faithful reasoning' is a key route to better multimodal reasoning models. Abstract: Multimodal reasoning models (MRMs) trained with reinforcement learning with verifiable rewards (RLVR) show improved accuracy on visual reasoning benchmarks. However, we observe that accuracy gains often come at the cost of reasoning quality: generated Chain-of-Thought (CoT) traces are frequently inconsistent with the final answer and poorly grounded in the visual evidence. We systematically study this phenomenon across seven challenging real-world spatial reasoning benchmarks and find that it affects contemporary MRMs such as ViGoRL-Spatial, TreeVGR as well as our own models trained with standard Group Relative Policy Optimization (GRPO). We characterize CoT reasoning quality along two complementary axes: "logical consistency" (does the CoT entail the final answer?) and "visual grounding" (does each reasoning step accurately describe objects, attributes, and spatial relationships in the image?). To address this, we propose Faithful GRPO (FGRPO), a variant of GRPO that enforces consistency and grounding as constraints via Lagrangian dual ascent. FGRPO incorporates batch-level consistency and grounding constraints into the advantage computation within a group, adaptively adjusting the relative importance of constraints during optimization. We evaluate FGRPO on Qwen2.5-VL-7B and 3B backbones across seven spatial datasets. Our results show that FGRPO substantially improves reasoning quality, reducing the inconsistency rate from 24.5% to 1.7% and improving visual grounding scores by +13%. It also improves final answer accuracy over simple GRPO, demonstrating that faithful reasoning enables better answers.
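
The constraint handling described for FGRPO can be illustrated with a minimal sketch. The abstract does not give the update rule, so everything below is an assumption: the function names, the thresholds, the learning rate, and the choice to add multiplier-weighted constraint advantages to the task advantage are hypothetical, and only the general pattern (group-relative advantages plus multipliers adjusted by dual ascent on batch-level constraint satisfaction) follows the text.

```python
from statistics import mean, pstdev

def group_advantages(values):
    """GRPO-style advantage: standardize values within one sampled group."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + 1e-8) for v in values]

def constrained_advantages(task_rewards, constraint_scores, multipliers):
    """Task advantage plus multiplier-weighted constraint advantages (assumed form)."""
    total = group_advantages(task_rewards)
    for name, scores in constraint_scores.items():
        adv = group_advantages(scores)
        total = [t + multipliers[name] * a for t, a in zip(total, adv)]
    return total

def dual_ascent_step(multipliers, batch_means, thresholds, lr=0.5):
    """Raise a multiplier when the batch mean falls below its threshold,
    lower it otherwise, and clip at zero (standard projected dual ascent)."""
    return {
        name: max(0.0, lam + lr * (thresholds[name] - batch_means[name]))
        for name, lam in multipliers.items()
    }

# Hypothetical group of four sampled responses to one prompt.
task = [1.0, 0.0, 1.0, 0.0]                      # answer correctness
scores = {
    "consistency": [1.0, 1.0, 0.0, 0.0],         # does the CoT entail the answer?
    "grounding": [0.9, 0.2, 0.8, 0.1],           # is the CoT grounded in the image?
}
thresholds = {"consistency": 0.95, "grounding": 0.8}  # assumed targets
lams = {"consistency": 0.0, "grounding": 0.0}

batch_means = {k: mean(v) for k, v in scores.items()}
lams = dual_ascent_step(lams, batch_means, thresholds)
adv = constrained_advantages(task, scores, lams)
```

In this toy group, the first response (correct, consistent, grounded) ends up with a larger advantage than the third (correct but inconsistent), which is the qualitative effect the constraints are meant to have.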

[180] Novel View Synthesis as Video Completion

Qi Wu,Khiem Vuong,Minsik Jeon,Srinivasa Narasimhan,Deva Ramanan

Main category: cs.CV

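TL;DR: This paper proposes FrameCrafter, which casts sparse novel view synthesis (NVS) as a low-frame-rate video completion task and modifies video diffusion models (e.g., removing temporal positional encodings and introducing per-frame latent encodings) to make them invariant to the order of input views, thereby exploiting the multi-view knowledge implicit in video models to generate high-quality novel views from only about 5 multi-view images.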

Details

Motivation: Existing methods based on single-image diffusion models lack multi-view geometric priors, whereas video diffusion models inherently carry multi-view consistency knowledge and are easier to adapt to sparse NVS. Method: Reformulates sparse NVS as a view-sequence completion problem and designs the FrameCrafter architecture, which uses per-frame latent encodings, removes temporal positional embeddings, and strengthens invariance to input permutation so as to handle unordered sparse multi-view inputs. Result: Achieves competitive performance on sparse-view NVS benchmarks, indicating that video models can 'forget' temporal information with little supervision and focus on view geometry instead. Conclusion: Video diffusion models are an effective source of priors for sparse NVS; achieving permutation invariance through architectural changes allows their implicit multi-view knowledge to be transferred efficiently, without explicit 3D modeling or large amounts of multi-view training data. Abstract: We tackle the problem of sparse novel view synthesis (NVS) using video diffusion models; given $K$ ($\approx 5$) multi-view images of a scene and their camera poses, we predict the view from a target camera pose. Many prior approaches leverage generative image priors encoded via diffusion models. However, models trained on single images lack multi-view knowledge. We instead argue that video models already contain implicit multi-view knowledge and so should be easier to adapt for NVS. Our key insight is to formulate sparse NVS as a low frame-rate video completion task. However, one challenge is that sparse NVS is defined over an unordered set of inputs, often too sparse to admit a meaningful order, so the models should be $\textit{invariant}$ to permutations of that input set. To this end, we present FrameCrafter, which adapts video models (naturally trained with coherent frame orderings) to permutation-invariant NVS through several architectural modifications, including per-frame latent encodings and removal of temporal positional embeddings. Our results suggest that video models can be easily trained to "forget" about time with minimal supervision, producing competitive performance on sparse-view NVS benchmarks. Project page: https://frame-crafter.github.io/

[181] Quantifying Explanation Consistency: The C-Score Metric for CAM-Based Explainability in Medical Image Classification

Kabilan Elangovan,Daniel Ting

Main category: cs.CV

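TL;DR: This paper proposes a new metric, the C-Score, for evaluating the consistency (rather than only the localization accuracy) of class activation mapping (CAM) explanations in medical imaging, validates it across multiple CAM methods and CNN architectures, reveals several ways in which classification performance (AUC) and explanation consistency decouple, and shows that it can give early warning of model instability.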

Details

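Motivation: Existing evaluation frameworks look only at the localization fidelity of CAM explanations against radiologist annotations (correctness) and ignore whether the model applies a consistent spatial reasoning strategy to the same pathology (consistency), which is critical for clinical deployment. Method: Proposes the C-Score (Consistency Score), an annotation-free, confidence-weighted metric that quantifies the reproducibility of heatmaps across correctly predicted samples of the same class via intensity-emphasised pairwise soft IoU; on the Kermany chest X-ray dataset, 6 CAM methods and 3 CNN architectures are systematically evaluated over 30 training epochs covering transfer learning and fine-tuning phases. Result: Identifies three mechanisms by which AUC and consistency dissociate: threshold-mediated 'gold list collapse', technique-specific attribution collapse at peak AUC, and class-level consistency masking in global aggregation; the C-Score detects ScoreCAM degradation on ResNet50V2 one checkpoint before the catastrophic AUC drop and supports architecture selection recommendations grounded in explanation quality. Conclusion: The C-Score fills the gap of the 'consistency' dimension in explanation quality evaluation and serves as an effective early-warning tool for model stability, moving medical AI from the pursuit of predictive performance alone toward a clinically ready evaluation paradigm that also accounts for explainability and robustness. Abstract: Class Activation Mapping (CAM) methods are widely used to generate visual explanations for deep learning classifiers in medical imaging. However, existing evaluation frameworks assess whether explanations are correct, measured by localisation fidelity against radiologist annotations, rather than whether they are consistent: whether the model applies the same spatial reasoning strategy across different patients with the same pathology. We propose the C-Score (Consistency Score), a confidence-weighted, annotation-free metric that quantifies intra-class explanation reproducibility via intensity-emphasised pairwise soft IoU across correctly classified instances. We evaluate six CAM techniques: GradCAM, GradCAM++, LayerCAM, EigenCAM, ScoreCAM, and MS GradCAM++ across three CNN architectures (DenseNet201, InceptionV3, ResNet50V2) over thirty training epochs on the Kermany chest X-ray dataset, covering transfer learning and fine-tuning phases. We identify three distinct mechanisms of AUC-consistency dissociation, invisible to standard classification metrics: threshold-mediated gold list collapse, technique-specific attribution collapse at peak AUC, and class-level consistency masking in global aggregation. C-Score provides an early warning signal of impending model instability. ScoreCAM deterioration on ResNet50V2 is detectable one full checkpoint before catastrophic AUC collapse and yields architecture-specific clinical deployment recommendations grounded in explanation quality rather than predictive ranking alone.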
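
To make "confidence-weighted pairwise soft IoU" concrete, here is a minimal sketch of one way such a score could be computed. The abstract does not give the formula, so the details are assumptions: soft IoU is taken as the ratio of element-wise minima to element-wise maxima, "intensity emphasis" is modelled as raising heatmap values to a power `gamma`, and each pair is weighted by the product of the two prediction confidences. Heatmaps are flattened lists of values in [0, 1].

```python
from itertools import combinations

def soft_iou(a, b, gamma=2.0):
    """Soft IoU of two flattened heatmaps with values in [0, 1].
    gamma > 1 emphasises high-intensity pixels (an assumed form of emphasis)."""
    pa = [x ** gamma for x in a]
    pb = [x ** gamma for x in b]
    inter = sum(min(x, y) for x, y in zip(pa, pb))
    union = sum(max(x, y) for x, y in zip(pa, pb))
    return inter / union if union > 0 else 0.0

def consistency_score(heatmaps, confidences, gamma=2.0):
    """Confidence-weighted mean pairwise soft IoU over the correctly
    classified samples of one class (hypothetical aggregation)."""
    num = den = 0.0
    for i, j in combinations(range(len(heatmaps)), 2):
        w = confidences[i] * confidences[j]
        num += w * soft_iou(heatmaps[i], heatmaps[j], gamma)
        den += w
    return num / den if den > 0 else 0.0

# Toy example: two samples attend to the same region, a third to a different one.
maps = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
score = consistency_score(maps, confidences=[1.0, 1.0, 1.0])
```

Under these assumptions, identical heatmaps give a pairwise score of 1 and non-overlapping heatmaps give 0, so the toy class above scores one third: only one of its three pairs agrees.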

[182] Phantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics

Ying Shen,Jerry Xiong,Tianjiao Yu,Ismini Lourentzou

Main category: cs.CV

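TL;DR: This paper proposes Phantom, a model that jointly models visual content and latent physical dynamics during video generation to improve the physical consistency and visual realism of generated videos.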

Details

Motivation: Existing video generation models achieve high visual realism but do not understand or obey real-world physical laws, leading to unrealistic motion; the question is how to build inference of physical properties into the generation process to improve physical plausibility. Method: Proposes Phantom, a physics-infused video generation model that, conditioned on observed frames and inferred physical states, jointly predicts latent physical dynamics and generates future frames; it introduces a physics-aware video representation as an abstract yet informative physical embedding, with no need to explicitly model complex physical equations. Result: On standard video generation and physics-aware benchmarks, Phantom outperforms existing methods in adherence to physical dynamics while maintaining competitive perceptual fidelity. Conclusion: Embedding physical property inference directly into the video generation pipeline effectively improves the physical consistency and visual quality of the results, confirming the importance of physics-guided modeling for generative video models. Abstract: Recent advances in generative video modeling, driven by large-scale datasets and powerful architectures, have yielded remarkable visual realism. However, emerging evidence suggests that simply scaling data and model size does not endow these systems with an understanding of the underlying physical laws that govern real-world dynamics. Existing approaches often fail to capture or enforce such physical consistency, resulting in unrealistic motion and dynamics. In this work, we investigate whether integrating the inference of latent physical properties directly into the video generation process can equip models with the ability to produce physically plausible videos. To this end, we propose Phantom, a Physics-Infused Video Generation model that jointly models the visual content and latent physical dynamics. Conditioned on observed video frames and inferred physical states, Phantom jointly predicts latent physical dynamics and generates future video frames. Phantom leverages a physics-aware video representation that serves as an abstract yet informative embedding of the underlying physics, facilitating the joint prediction of physical dynamics alongside video content without requiring an explicit specification of a complex set of physical dynamics and properties. By integrating the inference of physics-aware video representation directly into the video generation process, Phantom produces video sequences that are both visually realistic and physically consistent. Quantitative and qualitative results on both standard video generation and physics-aware benchmarks demonstrate that Phantom not only outperforms existing methods in terms of adherence to physical dynamics but also delivers competitive perceptual fidelity.

[183] Visually-grounded Humanoid Agents

Hang Ye,Xiaoxuan Ma,Fan Lu,Wayne Wu,Kwan-Yee Lin,Yizhou Wang

Main category: cs.CV

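TL;DR: This paper proposes Visually-grounded Humanoid Agents, a two-layer architecture that lets digital humans carry out goal-directed behavior autonomously and naturally in novel 3D scenes using only visual observations and specified goals.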

Details

Motivation: Most existing digital human systems are passively driven, relying on state information or scripted control, and are hard to extend to new environments; this work aims for active, embodied digital human behavior based only on vision and goals. Method: Builds a World Layer (which reconstructs semantically rich 3D Gaussian scenes and supports animatable Gaussian human avatars) and an Agent Layer (which gives avatars first-person RGB-D perception, combines spatial awareness with iterative reasoning for embodied planning, and generates full-body actions to execute the plan). Result: Experiments on a self-built benchmark show that the method outperforms ablated variants and existing state-of-the-art planning methods in task success rate and collision rate. Conclusion: The work enables active digital humans that can be deployed at scale and advances human-centric embodied AI; data, code, and models will be open-sourced. Abstract: Digital human generation has been studied for decades and supports a wide range of real-world applications. However, most existing systems are passively animated, relying on privileged state or scripted control, which limits scalability to novel environments. We instead ask: how can digital humans actively behave using only visual observations and specified goals in novel scenes? Achieving this would enable populating any 3D environments with digital humans at scale that exhibit spontaneous, natural, goal-directed behaviors. To this end, we introduce Visually-grounded Humanoid Agents, a coupled two-layer (world-agent) paradigm that replicates humans at multiple levels: they look, perceive, reason, and behave like real people in real-world 3D scenes. The World Layer reconstructs semantically rich 3D Gaussian scenes from real-world videos via an occlusion-aware pipeline and accommodates animatable Gaussian-based human avatars. The Agent Layer transforms these avatars into autonomous humanoid agents, equipping them with first-person RGB-D perception and enabling them to perform accurate, embodied planning with spatial awareness and iterative reasoning, which is then executed at the low level as full-body actions to drive their behaviors in the scene. We further introduce a benchmark to evaluate humanoid-scene interaction in diverse reconstructed environments. Experiments show our agents achieve robust autonomous behavior, yielding higher task success rates and fewer collisions than ablations and state-of-the-art planning methods. This work enables active digital human population and advances human-centric embodied AI. Data, code, and models will be open-sourced.

[184] When Fine-Tuning Changes the Evidence: Architecture-Dependent Semantic Drift in Chest X-Ray Explanations

Kabilan Elangovan,Daniel Ting

Main category: cs.CV

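TL;DR: This paper studies the stability of model explanations (attribution structure) during transfer learning and fine-tuning in medical image classification, introduces the concept of 'semantic drift', and verifies its existence and influencing factors across multiple architectures and attribution methods.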

Details

Motivation: In multi-class medical image classification, fine-tuning improves accuracy, but the visual evidence the model relies on may be unstable; existing work lacks a systematic evaluation of explanation stability. Method: On five-class chest X-ray data, DenseNet201, ResNet50V2, and InceptionV3 are trained in two stages (transfer learning followed by full fine-tuning); reference-free metrics (such as IoU) quantify the spatial localization and structural consistency of attribution maps, and the stability of LayerCAM and GradCAM++ is compared. Result: Coarse anatomical localization remains stable, but the reorganization of evidential structure depends strongly on the network architecture; stability rankings can reverse between attribution methods (LayerCAM vs. GradCAM++), indicating that explanation stability results from the interaction of architecture, optimization phase, and attribution objective. Conclusion: Stable classification performance does not imply stable explanations; semantic drift exposes potential changes in the basis of a model's reasoning, suggesting that explanation stability should be part of how medical AI models are evaluated. Abstract: Transfer learning followed by fine-tuning is widely adopted in medical image classification due to consistent gains in diagnostic performance. However, in multi-class settings with overlapping visual features, improvements in accuracy do not guarantee stability of the visual evidence used to support predictions. We define semantic drift as systematic changes in the attribution structure supporting a model's predictions between transfer learning and full fine-tuning, reflecting potential shifts in underlying visual reasoning despite stable classification performance. Using a five-class chest X-ray task, we evaluate DenseNet201, ResNet50V2, and InceptionV3 under a two-stage training protocol and quantify drift with reference-free metrics capturing spatial localization and structural consistency of attribution maps. Across architectures, coarse anatomical localization remains stable, while overlap IoU reveals pronounced architecture-dependent reorganization of evidential structure. Beyond single-method analysis, stability rankings can reverse across LayerCAM and GradCAM++ under converged predictive performance, establishing explanation stability as an interaction between architecture, optimization phase, and attribution objective.

[185] MolmoWeb: Open Visual Web Agent and Open Data for the Open Web

Tanmay Gupta,Piper Wolters,Zixian Ma,Peter Sushko,Rock Yuren Pang,Diego Llanes,Yue Yang,Taira Anderson,Boyuan Zheng,Zhongzheng Ren,Harsh Trivedi,Taylor Blanton,Caleb Ouellette,Winson Han,Ali Farhadi,Ranjay Krishna

Main category: cs.CV

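TL;DR: This paper presents the MolmoWebMix dataset and MolmoWeb, an open-source family of multimodal web agents, to advance open and reproducible web agent research; the models predict browser actions from webpage screenshots and instructions alone, achieve SOTA performance on multiple benchmarks, and support test-time scaling to raise success rates.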

Details

Motivation: Today's high-performing web agents mostly rely on closed-source models, which limits scientific transparency, reproducibility, and community collaboration; the authors argue for building open-source agent systems for the open web. Method: Builds the large-scale mixed dataset MolmoWebMix (100K+ synthetic trajectories, 30K+ human demonstrations, and GUI perception data) and trains MolmoWeb, an instruction-conditioned visual-language action policy (4B/8B parameters) that takes only a webpage screenshot and a task instruction and directly predicts browser actions, with no HTML or accessibility tree. Result: On WebVoyager, Online-Mind2Web, DeepShop, and other benchmarks, MolmoWeb outperforms open models of similar scale (such as Fara-7B) as well as some SoM agents built on large closed models (such as GPT-4o); with parallel rollouts and best-of-N selection at test time, pass@4 rises substantially (94.7% on WebVoyager and 60.5% on Online-Mind2Web). Conclusion: The MolmoWeb models and accompanying open resources (models, data, code, and evaluation harness) provide a solid foundation for open web agent research and validate the effectiveness and scalability of the purely visual input approach. Abstract: Web agents--autonomous systems that navigate and execute tasks on the web on behalf of users--have the potential to transform how people interact with the digital world. However, the most capable web agents today rely on proprietary models with undisclosed training data and recipes, limiting scientific understanding, reproducibility, and community-driven progress. We believe agents for the open web should be built in the open. To this end, we introduce (1) MolmoWebMix, a large and diverse mixture of browser task demonstrations and web-GUI perception data and (2) MolmoWeb, a family of fully open multimodal web agents. Specifically, MolmoWebMix combines over 100K synthetic task trajectories from multiple complementary generation pipelines with 30K+ human demonstrations, atomic web-skill trajectories, and GUI perception data, including referring expression grounding and screenshot question answering. MolmoWeb agents operate as instruction-conditioned visual-language action policies: given a task instruction and a webpage screenshot, they predict the next browser action, requiring no access to HTML, accessibility trees, or specialized APIs. Available in 4B and 8B sizes, MolmoWeb agents achieve state-of-the-art results on browser-use benchmarks like WebVoyager, Online-Mind2Web, and DeepShop, outperforming similar-scale open-weight-only models such as Fara-7B, UI-Tars-1.5-7B, and Holo1-7B. MolmoWeb-8B also surpasses set-of-marks (SoM) agents built on much larger closed frontier models like GPT-4o. We further demonstrate consistent gains through test-time scaling via parallel rollouts with best-of-N selection, achieving 94.7% and 60.5% pass@4 (compared to 78.2% and 35.3% pass@1) on WebVoyager and Online-Mind2Web respectively. We will release model checkpoints, training data, code, and a unified evaluation harness to enable reproducibility and accelerate open research on web agents.
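
The pass@1 and pass@4 figures quoted above can be read through a small sketch of how such metrics are commonly computed from parallel rollouts. This is a generic illustration, not the paper's evaluation harness: the outcome matrix is made up, `best_of_n` takes a hypothetical scoring function, and how the abstract's pass@4 is computed in detail is not stated in the text.

```python
def pass_at_k(outcomes, k):
    """Fraction of tasks with at least one success among the first k rollouts."""
    return sum(any(task[:k]) for task in outcomes) / len(outcomes)

def best_of_n(rollouts, score):
    """Return the rollout that a (hypothetical) scoring function ranks highest."""
    return max(rollouts, key=score)

# Made-up success flags: one row per task, one column per parallel rollout.
outcomes = [
    [True, False, False, False],
    [False, False, True, False],
    [False, False, False, False],
    [True, True, True, True],
]

p1 = pass_at_k(outcomes, 1)  # 0.5: only first rollouts count
p4 = pass_at_k(outcomes, 4)  # 0.75: any of four rollouts may succeed
```

Note that pass@k counts a task as solved if any rollout succeeds, so it is an upper bound on what a best-of-N selector can achieve: the selector still has to pick a successful rollout.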

[186] UniversalVTG: A Universal and Lightweight Foundation Model for Video Temporal Grounding

Joungbin An,Agrim Jain,Kristen Grauman

Main category: cs.CV

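TL;DR: This paper proposes UniversalVTG, a lightweight, universal video temporal grounding model that uses cross-dataset pretraining and a query unification mechanism to reach SOTA performance on multiple benchmarks with less than 1/100 of the parameters of MLLM-based methods.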

Details

Motivation: Existing VTG methods rely on dataset-specific models and generalize poorly, while MLLM-based methods are computationally expensive and limited in video context, making long videos hard to handle. Method: Proposes UniversalVTG: (1) large-scale cross-dataset pretraining; (2) an offline Query Unifier that converts heterogeneous queries into a unified declarative representation, reducing linguistic mismatch and negative transfer; (3) an efficient grounding head that supports long-video processing. Result: On GoalStep-StepGrounding, Ego4D-NLQ, TACoS, Charades-STA, ActivityNet-Captions, and other benchmarks, a single checkpoint outperforms dedicated VTG models; with under 1% of the parameters of MLLM-based methods, it matches or exceeds their accuracy on multiple benchmarks. Conclusion: A lightweight paradigm with unified supervision can effectively replace parameter-heavy MLLMs, maintaining high performance while substantially improving practicality and scalability. Abstract: Video temporal grounding (VTG) is typically tackled with dataset-specific models that transfer poorly across domains and query styles. Recent efforts to overcome this limitation have adapted large multimodal language models (MLLMs) to VTG, but their high compute cost and limited video context still hinder long-video grounding. We instead scale unified supervision while keeping the model lightweight. We present UniversalVTG, a single VTG model trained with large-scale cross-dataset pretraining. An offline Query Unifier canonicalizes heterogeneous query formats into a shared declarative space, reducing linguistic mismatch and preventing the negative transfer observed under naïve joint training. Combined with an efficient grounding head, UniversalVTG scales to long, untrimmed videos. Across diverse benchmarks (GoalStep-StepGrounding, Ego4D-NLQ, TACoS, Charades-STA, and ActivityNet-Captions), one UniversalVTG checkpoint achieves state-of-the-art performance versus dedicated VTG models. Moreover, despite being $>100\times$ smaller than recent MLLM-based approaches, UniversalVTG matches or exceeds their accuracy on multiple benchmarks, offering a practical alternative to parameter-heavy MLLMs.

[187] FIT: A Large-Scale Dataset for Fit-Aware Virtual Try-On

Johanna Karras,Yuanhao Wang,Yingwei Li,Ira Kemelmacher-Shlizerman

Main category: cs.CV

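TL;DR: This paper presents the FIT dataset and the first virtual try-on (VTO) method that is aware of garment fit; it synthetically generates 1.13M image triplets with precise body and garment measurements, combining physics simulation with re-texturing to improve realism and identity consistency.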

Details

Motivation: Existing virtual try-on methods ignore garment fit (e.g., a large shirt on a small body), and there is no dataset with precise size annotations (especially for 'ill-fit' cases), so models default to generating well-fitted results. Method: Builds the FIT dataset: (1) generates 3D garments with GarmentCode and drapes them via physics simulation to capture realistic wear; (2) proposes a new re-texturing framework that turns synthetic renderings into photorealistic images while strictly preserving geometry; (3) adds person identity preservation to re-texturing to produce paired images of the same person in different garments for supervised training; finally, trains the first fit-aware VTO baseline on FIT. Result: Releases the first large-scale FIT dataset with precise size annotations (1.13M triplets), delivers the first VTO model that reflects real garment fit (too loose or too tight), and establishes a new fit-aware VTO benchmark with state-of-the-art performance. Conclusion: FIT fills the data and method gap for fit-aware virtual try-on, moving VTO from 'appearance visualization' toward 'modeling physical plausibility', and offers a reproducible benchmark and open resources for future research. Abstract: Given a person and a garment image, virtual try-on (VTO) aims to synthesize a realistic image of the person wearing the garment, while preserving their original pose and identity. Although recent VTO methods excel at visualizing garment appearance, they largely overlook a crucial aspect of the try-on experience: the accuracy of garment fit -- for example, depicting how an extra-large shirt looks on an extra-small person. A key obstacle is the absence of datasets that provide precise garment and body size information, particularly for "ill-fit" cases, where garments are significantly too large or too small. Consequently, current VTO methods default to generating well-fitted results regardless of the garment or person size. In this paper, we take the first steps towards solving this open problem. We introduce FIT (Fit-Inclusive Try-on), a large-scale VTO dataset comprising over 1.13M try-on image triplets accompanied by precise body and garment measurements. We overcome the challenges of data collection via a scalable synthetic strategy: (1) We programmatically generate 3D garments using GarmentCode and drape them via physics simulation to capture realistic garment fit. (2) We employ a novel re-texturing framework to transform synthetic renderings into photorealistic images while strictly preserving geometry. (3) We introduce person identity preservation into our re-texturing model to generate paired person images (same person, different garments) for supervised training. Finally, we leverage our FIT dataset to train a baseline fit-aware virtual try-on model. Our data and results set the new state-of-the-art for fit-aware virtual try-on, as well as offer a robust benchmark for future research. We will make all data and code publicly available on our project page: https://johannakarras.github.io/FIT.

[188] Self-Improving 4D Perception via Self-Distillation

Nan Huang,Pengcheng Yu,Weijia Zeng,James M. Rehg,Angjoo Kanazawa,Haiwen Feng,Qianqian Wang

Main category: cs.CV

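TL;DR: SelfEvo is a self-improving framework that needs no annotated data; it performs self-distillation by exploiting spatiotemporal context asymmetry to continually improve the 4D perception of multi-view reconstruction models in dynamic scenes.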

Details

Motivation: Existing large-scale multi-view reconstruction models depend heavily on expensive and scarce ground-truth 3D/4D annotations (especially for dynamic scenes), which limits scalability. Method: Proposes the SelfEvo framework, which uses a self-distillation mechanism based on spatiotemporal context asymmetry to continually optimize pretrained models on unlabeled videos, and systematically studies the key design choices for effective self-improvement (such as loss signals, forms of asymmetry, and training strategies). Result: Across eight benchmarks spanning different datasets and domains, SelfEvo consistently improves several base models (such as VGGT and π³), with relative gains of up to 36.5% in video depth estimation and 20.1% in camera pose estimation. Conclusion: SelfEvo shows that learning-based 4D perception models can keep improving themselves without external annotations, markedly strengthening dynamic scene modeling and generalization. Abstract: Large-scale multi-view reconstruction models have made remarkable progress, but most existing approaches still rely on fully supervised training with ground-truth 3D/4D annotations. Such annotations are expensive and particularly scarce for dynamic scenes, limiting scalability. We propose SelfEvo, a self-improving framework that continually improves pretrained multi-view reconstruction models using unlabeled videos. SelfEvo introduces a self-distillation scheme using spatiotemporal context asymmetry, enabling self-improvement for learning-based 4D perception without external annotations. We systematically study design choices that make self-improvement effective, including loss signals, forms of asymmetry, and other training strategies. Across eight benchmarks spanning diverse datasets and domains, SelfEvo consistently improves pretrained baselines and generalizes across base models (e.g. VGGT and $π^3$), with significant gains on dynamic scenes. Overall, SelfEvo achieves up to 36.5% relative improvement in video depth estimation and 20.1% in camera estimation, without using any labeled data. Project Page: https://self-evo.github.io/.

[189] RewardFlow: Generate Images by Optimizing What You Reward

Onkar Susladkar,Dong-Hwan Jang,Tushar Prakash,Adheesh Juvekar,Vedant Shah,Ayush Barik,Nabeel Bashir,Muntasir Wahed,Ritish Shrirao,Ismini Lourentzou

Main category: cs.CV

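TL;DR: RewardFlow is an inference-time guidance framework that needs no model inversion; it steers pretrained diffusion and flow-matching models with multi-reward Langevin dynamics, fuses multiple differentiable rewards, introduces a differentiable VQA reward, and uses a prompt-aware adaptive policy to achieve high-quality image editing and compositional generation.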

Details

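Motivation: Existing methods for image editing and compositional generation struggle to satisfy multiple objectives at once (semantic alignment, perceptual fidelity, localized grounding, object consistency, and human preference) and depend on model inversion, which limits flexibility and efficiency. Method: Proposes the RewardFlow framework: (1) multi-reward Langevin dynamics that integrate differentiable rewards for semantic alignment, perceptual fidelity, localized grounding, object consistency, and human preference; (2) a new differentiable VQA-based reward that provides fine-grained language-vision semantic supervision; (3) a prompt-aware adaptive policy that extracts semantic primitives from the instruction, infers edit intent, and dynamically adjusts each reward's weight and the sampling step size. Result: On multiple image editing and compositional generation benchmarks, RewardFlow achieves state-of-the-art edit fidelity and compositional alignment. Conclusion: RewardFlow demonstrates the effectiveness and generality of an inversion-free, multi-objective, reward-driven inference strategy for controllable image generation, offering a new paradigm for jointly optimizing semantics and perception. Abstract: We introduce RewardFlow, an inversion-free framework that steers pretrained diffusion and flow-matching models at inference time through multi-reward Langevin dynamics. RewardFlow unifies complementary differentiable rewards for semantic alignment, perceptual fidelity, localized grounding, object consistency, and human preference, and further introduces a differentiable VQA-based reward that provides fine-grained semantic supervision through language-vision reasoning. To coordinate these heterogeneous objectives, we design a prompt-aware adaptive policy that extracts semantic primitives from the instruction, infers edit intent, and dynamically modulates reward weights and step sizes throughout sampling. Across several image editing and compositional generation benchmarks, RewardFlow delivers state-of-the-art edit fidelity and compositional alignment.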
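
As a rough illustration of what "multi-reward Langevin dynamics" means, the sketch below runs a textbook Langevin update on a toy 2D point, ascending a weighted sum of reward gradients with optional Gaussian noise. It is not RewardFlow's sampler: the quadratic rewards, the fixed weights, the step size, and all function names are assumptions chosen for illustration, whereas the paper adapts weights and step sizes from the prompt and applies the idea to diffusion and flow-matching models.

```python
import math
import random

def langevin_step(x, reward_grads, weights, step, temperature, rng):
    """One Langevin update: follow the weighted sum of reward gradients,
    then add Gaussian noise scaled by sqrt(2 * step * temperature)."""
    grads = [g(x) for g in reward_grads]
    noise_scale = math.sqrt(2.0 * step * temperature)
    return [
        xi
        + step * sum(w * g[i] for w, g in zip(weights, grads))
        + noise_scale * rng.gauss(0.0, 1.0)
        for i, xi in enumerate(x)
    ]

def quadratic_reward_grad(centre):
    """Gradient of the toy reward r(x) = -0.5 * ||x - centre||^2."""
    return lambda x: [c - xi for c, xi in zip(centre, x)]

def sample(x0, reward_grads, weights, steps=200, step=0.1, temperature=0.0, seed=0):
    """Run a fixed number of Langevin steps from x0 with a seeded RNG."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        x = langevin_step(x, reward_grads, weights, step, temperature, rng)
    return x

# Two toy rewards pulling toward different points; weights set the compromise.
grads = [quadratic_reward_grad([0.0, 0.0]), quadratic_reward_grad([4.0, 4.0])]
x_final = sample([10.0, -10.0], grads, weights=[3.0, 1.0])
```

With the temperature set to zero the update is plain gradient ascent, and the toy example settles at the weight-averaged centre (1, 1); a positive temperature would make it sample around that point instead.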

[190] ParseBench: A Document Parsing Benchmark for AI Agents

Boyang Zhang,Sebastián G. Acosta,Preston Carlson,Sacha Bron,Pierre-Loïc Doulcet,Simon Suo

Main category: cs.CV

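TL;DR: This paper presents ParseBench, a new document parsing benchmark for enterprise automation scenarios that emphasizes semantic correctness and covers five dimensions (tables, charts, content faithfulness, semantic formatting, and visual grounding); it evaluates 14 methods and shows that current systems still have clear capability gaps.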

Details

Motivation: Existing document parsing benchmarks do not adequately reflect the high demands AI agents place on semantic correctness (such as table structure, chart data precision, and visual grounding); in enterprise document scenarios such as insurance, finance, and government in particular, coverage is insufficient and evaluation metrics are misaligned. Method: Builds ParseBench, a dataset of about 2,000 human-verified pages of enterprise documents, designs fine-grained evaluation protocols along the five capability dimensions, and systematically evaluates 14 mainstream parsing methods (including multimodal large models, specialized parsers, and LlamaParse). Result: The evaluation shows that current capabilities are highly fragmented: no method is consistently strong across all five dimensions; LlamaParse Agentic leads with the highest overall score (agenticoverall%), but significant capability gaps remain. Conclusion: ParseBench fills the gap in semantics-aware document parsing evaluation for AI agents and provides a reproducible, multi-dimensional benchmark framework grounded in real-world scenarios. Abstract: AI agents are changing the requirements for document parsing. What matters is semantic correctness: parsed output must preserve the structure and meaning needed for autonomous decisions, including correct table structure, precise chart data, semantically meaningful formatting, and visual grounding. Existing benchmarks do not fully capture this setting for enterprise automation, relying on narrow document distributions and text-similarity metrics that miss agent-critical failures. We introduce ParseBench, a benchmark of ~2,000 human-verified pages from enterprise documents spanning insurance, finance, and government, organized around five capability dimensions: tables, charts, content faithfulness, semantic formatting, and visual grounding. Across 14 methods spanning vision-language models, specialized document parsers, and LlamaParse, the benchmark reveals a fragmented capability landscape: no method is consistently strong across all five dimensions. LlamaParse Agentic achieves the highest overall score at \agenticoverall\%, and the benchmark highlights the remaining capability gaps across current systems. Dataset and evaluation code are available on HuggingFace (https://huggingface.co/datasets/llamaindex/ParseBench) and GitHub (https://github.com/run-llama/ParseBench).

[191] Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction

Tao Xie,Peishan Yang,Yudong Jin,Yingfeng Cai,Wei Yin,Weiqiang Ren,Qian Zhang,Wei Hua,Sida Peng,Xiaoyang Guo,Xiaowei Zhou

Main category: cs.CV

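TL;DR: This paper proposes a neural global context representation in which lightweight sub-networks are rapidly adapted at test time through self-supervision, improving the accuracy and consistency of large-scale 3D scene reconstruction from long video sequences.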

Details

Motivation: Existing feed-forward 3D reconstruction models lose accuracy and consistency on long video sequences because of limited memory and an inability to model global context; inspired by how humans use global scene understanding to inform local perception, an efficient long-range context representation is needed. Method: Designs a rapidly adaptable global context representation made of lightweight neural sub-networks, optimized at test time with self-supervised objectives, which substantially expands the model's memory capacity and integrates long-range scene information at low computational cost. Result: Achieves leading pose accuracy and state-of-the-art 3D reconstruction accuracy on large-scale benchmarks such as KITTI Odometry and Oxford Spires while remaining efficient. Conclusion: The proposed neural global context representation effectively mitigates accuracy degradation in long-sequence reconstruction and offers an efficient, scalable new paradigm for large-scale dynamic scene reconstruction. Abstract: This paper addresses the task of large-scale 3D scene reconstruction from long video sequences. Recent feed-forward reconstruction models have shown promising results by directly regressing 3D geometry from RGB images without explicit 3D priors or geometric constraints. However, these methods often struggle to maintain reconstruction accuracy and consistency over long sequences due to limited memory capacity and the inability to effectively capture global contextual cues. In contrast, humans can naturally exploit the global understanding of the scene to inform local perception. Motivated by this, we propose a novel neural global context representation that efficiently compresses and retains long-range scene information, enabling the model to leverage extensive contextual cues for enhanced reconstruction accuracy and consistency. The context representation is realized through a set of lightweight neural sub-networks that are rapidly adapted during test time via self-supervised objectives, which substantially increases memory capacity without incurring significant computational overhead. The experiments on multiple large-scale benchmarks, including the KITTI Odometry~\cite{Geiger2012CVPR} and Oxford Spires~\cite{tao2025spires} datasets, demonstrate the effectiveness of our approach in handling ultra-large scenes, achieving leading pose accuracy and state-of-the-art 3D reconstruction accuracy while maintaining efficiency. Code is available at https://zju3dv.github.io/scal3r.

[192] E-3DPSM: A State Machine for Event-Based Egocentric 3D Human Pose Estimation

Mayur Deshmukh,Hiroyasu Akada,Helge Rhodin,Christian Theobalt,Vladislav Golyanik

Main category: cs.CV

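TL;DR: This paper proposes E-3DPSM, a continuous pose state machine tailored to the properties of event streams, for monocular event-camera 3D human pose estimation from head-mounted devices, substantially improving accuracy and temporal stability.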

Details

Motivation: Existing methods are not fully tailored to the asynchronous, continuous nature of event streams, which leads to low 3D estimation accuracy and high sensitivity to self-occlusions and temporal jitter, falling short of what applications such as VR/AR require. Method: The paper proposes E-3DPSM, an event-driven continuous pose state machine that aligns continuous human motion with fine-grained event dynamics; it evolves latent states and predicts continuous changes in 3D joint positions associated with the observed events, then fuses these with direct 3D pose predictions to obtain stable, drift-free reconstructions. Result: The method reaches the state of the art on two benchmarks, improving accuracy (MPJPE) by up to 19% and temporal stability by up to 2.7x, and runs in real time at 80 Hz. Conclusion: E-3DPSM effectively overcomes the inherent challenges of modeling event streams and offers a more robust and efficient new paradigm for event-camera-based egocentric 3D pose estimation. Abstract: Event cameras offer multiple advantages in monocular egocentric 3D human pose estimation from head-mounted devices, such as millisecond temporal resolution, high dynamic range, and negligible motion blur. Existing methods effectively leverage these properties, but suffer from low 3D estimation accuracy, insufficient in many applications (e.g., immersive VR/AR). This is due to the design not being fully tailored towards event streams (e.g., their asynchronous and continuous nature), leading to high sensitivity to self-occlusions and temporal jitter in the estimates. This paper rethinks the setting and introduces E-3DPSM, an event-driven continuous pose state machine for event-based egocentric 3D human pose estimation. E-3DPSM aligns continuous human motion with fine-grained event dynamics; it evolves latent states and predicts continuous changes in 3D joint positions associated with observed events, which are fused with direct 3D human pose predictions, leading to stable and drift-free final 3D pose reconstructions. E-3DPSM runs in real time at 80 Hz on a single workstation and sets a new state of the art in experiments on two benchmarks, improving accuracy by up to 19% (MPJPE) and temporal stability by up to 2.7x. See our project page for the source code and trained models.
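The accuracy figure above is reported in MPJPE (mean per-joint position error), which is the average Euclidean distance between predicted and ground-truth 3D joints. The following is a minimal sketch of that metric and of a relative-improvement calculation; the function names `mpjpe` and `relative_improvement` are illustrative assumptions and are not taken from the authors' code, and the abstract does not specify any alignment step that the paper's evaluation protocol may apply beforehand.

```python
import math


def mpjpe(predicted, ground_truth):
    """Mean per-joint position error: average Euclidean distance per joint.

    Both arguments are equal-length sequences of (x, y, z) joint positions.
    No alignment (e.g. root centering) is applied in this sketch.
    """
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("need two non-empty joint lists of equal length")
    total = sum(math.dist(p, g) for p, g in zip(predicted, ground_truth))
    return total / len(predicted)


def relative_improvement(baseline_error, new_error):
    """Fraction by which an error metric dropped relative to a baseline."""
    return (baseline_error - new_error) / baseline_error
```

Under this definition, a reported improvement of 19% corresponds to `relative_improvement(...)` returning 0.19 for the pair of baseline and new MPJPE values being compared.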

[193] Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

Shilin Yan,Jintao Tong,Hongwei Xue,Xiaojun Tang,Yangyang Wang,Kunyu Shi,Guannan Zhang,Ruixuan Li,Yixiong Zou

Main category: cs.CV

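TL;DR: This paper proposes the HDPO framework, which decouples accuracy and efficiency into two optimization channels to address blind tool invocation in multimodal agents, sharply reducing the number of tool calls while improving reasoning accuracy.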

Details

Motivation: Existing multimodal agents lack meta-cognitive ability: they cannot judge when to rely on internal knowledge and when to call external tools, which leads to blind tool invocation, high latency, and added noise. Method: The paper proposes the HDPO framework, which abandons scalarized rewards in favor of two orthogonal optimization channels: an accuracy channel that maximizes task correctness, and an efficiency channel that enforces execution economy only within accurate trajectories via conditional advantage estimation, which together induce a cognitive curriculum. Result: The resulting model, Metis, reduces tool invocations by orders of magnitude across multiple evaluations while also improving reasoning accuracy. Conclusion: By making tool efficiency a conditional objective, HDPO effectively mitigates tool overuse and offers a new paradigm for efficient, robust reasoning in multimodal agents. Abstract: The advent of agentic multimodal models has empowered systems to actively interact with external environments. However, current agents suffer from a profound meta-cognitive deficit: they struggle to arbitrate between leveraging internal knowledge and querying external utilities. Consequently, they frequently fall prey to blind tool invocation, resorting to reflexive tool execution even when queries are resolvable from the raw visual context. This pathological behavior precipitates severe latency bottlenecks and injects extraneous noise that derails sound reasoning. Existing reinforcement learning protocols attempt to mitigate this via a scalarized reward that penalizes tool usage. Yet, this coupled formulation creates an irreconcilable optimization dilemma: an aggressive penalty suppresses essential tool use, whereas a mild penalty is entirely subsumed by the variance of the accuracy reward during advantage normalization, rendering it impotent against tool overuse. To transcend this bottleneck, we propose HDPO, a framework that reframes tool efficiency from a competing scalar objective to a strictly conditional one. By eschewing reward scalarization, HDPO maintains two orthogonal optimization channels: an accuracy channel that maximizes task correctness, and an efficiency channel that enforces execution economy exclusively within accurate trajectories via conditional advantage estimation. This decoupled architecture naturally induces a cognitive curriculum, compelling the agent to first master task resolution before refining its self-reliance.
Extensive evaluations demonstrate that our resulting model, Metis, reduces tool invocations by orders of magnitude while simultaneously elevating reasoning accuracy.
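The abstract describes the two channels only at a high level, so the following is a rough sketch of one way "conditional advantage estimation" could look, not the authors' implementation. It assumes group-normalized advantages over a batch of rollouts for the same query, a binary correctness reward, and an efficiency reward equal to the negative tool-call count; the function names, the normalization scheme, and the reward definitions are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def normalize(values, eps=1e-8):
    """Group normalization: subtract the mean, divide by the std (assumed scheme)."""
    if not values:
        return []
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def decoupled_advantages(correct, tool_calls):
    """Return (accuracy_advantages, efficiency_advantages) for one rollout group.

    correct:    list of bools, whether each trajectory solved the task
    tool_calls: list of ints, number of tool invocations per trajectory
    """
    # Accuracy channel: normalized over every trajectory in the group.
    acc_adv = normalize([1.0 if c else 0.0 for c in correct])

    # Efficiency channel: normalized only among the correct trajectories,
    # so incorrect ones never receive credit for using fewer tools.
    eff_adv = [0.0] * len(correct)
    correct_idx = [i for i, c in enumerate(correct) if c]
    if len(correct_idx) >= 2:
        eff = normalize([-float(tool_calls[i]) for i in correct_idx])
        for i, a in zip(correct_idx, eff):
            eff_adv[i] = a
    return acc_adv, eff_adv
```

Because the efficiency signal is normalized separately, it is not swamped by the variance of the accuracy reward, which is the failure mode the abstract attributes to a mild scalarized penalty. While few trajectories in a group are correct, the efficiency channel stays at or near zero, which is consistent with the curriculum effect the abstract describes.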

[194] When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models

Zhengyang Sun,Yu Chen,Xin Zhou,Xiaofan Li,Xiwu Chen,Dingkang Liang,Xiang Bai

Main category: cs.CV

TL;DR: This paper proposes NUMINA, a training-free framework that improves the accuracy of object counts in text-to-video diffusion models.

Details

Motivation: Text-to-video diffusion models struggle to generate the number of objects specified in a prompt. Method: NUMINA is a training-free identify-then-guide framework: it first identifies inconsistencies between the prompt and the layout by selecting discriminative self- and cross-attention heads and deriving a countable latent layout; it then refines this layout conservatively and modulates cross-attention to guide regeneration. Result: On the newly introduced CountBench benchmark, NUMINA improves counting accuracy by up to 7.4% on Wan2.1-1.3B, and by 4.9% and 5.5% on the 5B and 14B models, respectively, while also improving CLIP alignment and maintaining temporal consistency. Conclusion: Structural guidance is an effective complement to seed search and prompt enhancement, offering a practical path toward count-accurate text-to-video diffusion. Abstract: Text-to-video diffusion models have enabled open-ended video synthesis, but often struggle with generating the correct number of objects specified in a prompt. We introduce NUMINA, a training-free identify-then-guide framework for improved numerical alignment. NUMINA identifies prompt-layout inconsistencies by selecting discriminative self- and cross-attention heads to derive a countable latent layout. It then refines this layout conservatively and modulates cross-attention to guide regeneration. On the introduced CountBench, NUMINA improves counting accuracy by up to 7.4% on Wan2.1-1.3B, and by 4.9% and 5.5% on 5B and 14B models, respectively. Furthermore, CLIP alignment is improved while maintaining temporal consistency. These results demonstrate that structural guidance complements seed search and prompt enhancement, offering a practical path toward count-accurate text-to-video diffusion. The code is available at https://github.com/H-EmbodVis/NUMINA.

[195] GaussiAnimate: Reconstruct and Rig Animatable Categories with Level of Dynamics

Jiaxin Wang,Dongxin Lyu,Zeyu Cai,Zhiyang Dou,Cheng Lin,Anpei Chen,Yuliang Xiu

Main category: cs.CV

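TL;DR: This paper proposes Skelebones, a Scaffold-Skin rigging system with three steps: free-form bones (Bones), a Mean Curvature Skeleton (Skeleton), and non-parametric partwise motion matching (PartMM). It compresses the dynamics of 4D shapes into an efficient, controllable representation and clearly outperforms LBS and BoB in reanimation performance and reconstruction fidelity.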

Details

Motivation: Free-form bones capture non-rigid deformations effectively but lack the kinematic structure needed for intuitive control; existing methods struggle to balance controllability and expressiveness. Method: The Skelebones system (1) compresses temporally consistent deformable Gaussians into free-form bones that approximate non-rigid surface deformations; (2) extracts a Mean Curvature Skeleton from the canonical Gaussians and refines it temporally to obtain a category-agnostic, motion-adaptive, and topology-correct kinematic structure; (3) binds the skeleton and bones via non-parametric partwise motion matching (PartMM), synthesizing novel bone motions by matching, retrieving, and blending existing ones. Result: On synthetic and real-world datasets, reanimation performance improves significantly, with PSNR gains of 17.3% over LBS and 21.7% over BoB; in the low-data regime (~1000 frames), PartMM improves RMSE by 48.4% over robust LBS and outperforms GRU- and MLP-based methods by more than 20%; it also generalizes well to both Gaussian and mesh representations. Conclusion: Skelebones compresses the dynamic complexity of 4D shapes into a compact, controllable, and expressive representation, offering a new paradigm for non-rigid character animation. Abstract: Free-form bones that conform closely to the surface can effectively capture non-rigid deformations, but lack a kinematic structure necessary for intuitive control. Thus, we propose a Scaffold-Skin Rigging System, termed "Skelebones", with three key steps: (1) Bones: compress temporally-consistent deformable Gaussians into free-form bones, approximating non-rigid surface deformations; (2) Skeleton: extract a Mean Curvature Skeleton from canonical Gaussians and refine it temporally, ensuring a category-agnostic, motion-adaptive, and topology-correct kinematic structure; (3) Binding: bind the skeleton and bones via non-parametric partwise motion matching (PartMM), synthesizing novel bone motions by matching, retrieving, and blending existing ones. Collectively, these three steps enable us to compress the Level of Dynamics of 4D shapes into compact skelebones that are both controllable and expressive. We validate our approach on both synthetic and real-world datasets, achieving significant improvements in reanimation performance across unseen poses (17.3% PSNR gains over Linear Blend Skinning (LBS) and 21.7% over Bag-of-Bones (BoB)) while maintaining excellent reconstruction fidelity, particularly for characters exhibiting complex non-rigid surface dynamics.
Our Partwise Motion Matching algorithm demonstrates strong generalization to both Gaussian and mesh representations, especially in the low-data regime (~1000 frames), achieving a 48.4% RMSE improvement over robust LBS and outperforming GRU- and MLP-based learning methods by >20%. Code will be made publicly available for research purposes at cookmaker.cn/gaussianimate.
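For context on the LBS baseline named in the comparison, the sketch below shows standard Linear Blend Skinning for a single vertex: each bone applies its own rigid transform, and the results are blended with per-vertex weights. This illustrates the baseline only, not the Skelebones or PartMM method; the function names and the (rotation, translation) data layout are assumptions chosen for this example.

```python
def transform(rotation, translation, point):
    """Apply a rigid transform: rotation (3x3 nested lists) then translation."""
    return tuple(
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    )


def linear_blend_skinning(point, bones, weights):
    """Blend the per-bone transformed positions of one rest-pose vertex.

    bones:   list of (rotation, translation) pairs, one per bone
    weights: skinning weights for this vertex, expected to sum to 1
    """
    blended = [0.0, 0.0, 0.0]
    for (rotation, translation), w in zip(bones, weights):
        moved = transform(rotation, translation, point)
        for axis in range(3):
            blended[axis] += w * moved[axis]
    return tuple(blended)
```

For example, a vertex weighted equally between a stationary bone and a bone translated by 2 units along x ends up shifted by 1 unit along x. The blend is linear in the transformed positions, which is what limits how well this baseline can follow complex non-rigid surface motion.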

[196] ETCH-X: Robustify Expressive Body Fitting to Clothed Humans with Composable Datasets

Xiaoben Li,Jingyi Wu,Zeyu Cai,Yu Siyuan,Boqian Li,Yuliang Xiu

Main category: cs.CV

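TL;DR: This paper proposes ETCH-X, an improved human body fitting method that removes clothing interference with a tightness-aware fitting paradigm ('undress'), adopts SMPL-X for greater local expressiveness, and replaces explicit sparse markers with implicit dense correspondences ('dense fit'), improving robustness and detail under complex clothing, pose variation, and incomplete inputs.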

Details

Motivation: Existing methods struggle to combine local expressiveness (e.g., hands and face) with global robustness (to clothing dynamics, pose variation, and noisy or partial point clouds), and no all-in-one solution exists. Method: ETCH is upgraded to ETCH-X by (1) introducing a tightness-aware fitting module that filters out the influence of clothing ('undress'); (2) adopting the SMPL-X model for greater expressiveness; (3) replacing explicit sparse markers with implicit dense correspondences ('dense fit'); and (4) disentangling 'undress' and 'dense fit' into independently trainable modules that are trained scalably on multiple data sources such as CLOTH3D, AMASS, and InterHand2.6M. Result: ETCH-X substantially outperforms the original ETCH on multiple benchmarks: on seen data, MPJPE-All improves by 33.0% on 4D-Dress and V2V-Hands by 35.8% on CAPE; on unseen data (BEDLAM2.0), MPJPE-All improves by 80.8% and V2V-All by 80.5%. Conclusion: ETCH-X delivers more robust and fine-grained body fitting, particularly for diverse clothing, complex poses, and incomplete point cloud inputs, providing a more reliable geometric basis for downstream tasks such as animation and texturing. Abstract: Human body fitting, which aligns parametric body models such as SMPL to raw 3D point clouds of clothed humans, serves as a crucial first step for downstream tasks like animation and texturing. An effective fitting method should be both locally expressive (capturing fine details such as hands and facial features) and globally robust to handle real-world challenges, including clothing dynamics, pose variations, and noisy or partial inputs. Existing approaches typically excel in only one aspect, lacking an all-in-one solution. We upgrade ETCH to ETCH-X, which leverages a tightness-aware fitting paradigm to filter out clothing dynamics ("undress"), extends expressiveness with SMPL-X, and replaces explicit sparse markers (which are highly sensitive to partial data) with implicit dense correspondences ("dense fit") for more robust and fine-grained body fitting. Our disentangled "undress" and "dense fit" modular stages enable separate and scalable training on composable data sources, including diverse simulated garments (CLOTH3D), large-scale full-body motions (AMASS), and fine-grained hand gestures (InterHand2.6M), improving outfit generalization and pose robustness of both bodies and hands.
Our approach achieves robust and expressive fitting across diverse clothing, poses, and levels of input completeness, delivering a substantial performance improvement over ETCH on both (1) seen data, such as 4D-Dress (MPJPE-All, 33.0%) and CAPE (V2V-Hands, 35.8%), and (2) unseen data, such as BEDLAM2.0 (MPJPE-All, 80.8%; V2V-All, 80.5%). Code and models will be released at https://xiaobenli00.github.io/ETCH-X/.

Table of Contents

cs.CL [Back]

[1] Contextual Earnings-22: A Speech Recognition Benchmark with Custom Vocabulary in the Wild

[2] Hybrid CNN-Transformer Architecture for Arabic Speech Emotion Recognition

[3] Cross-Tokenizer LLM Distillation through a Byte-Level Interface

[4] Lexical Tone is Hard to Quantize: Probing Discrete Speech Units in Mandarin and Yorùbá

[5] Enabling Intrinsic Reasoning over Dense Geospatial Embeddings with DFR-Gemma

[6] Decompose, Look, and Reason: Reinforced Latent Reasoning for VLMs

[7] EMSDialog: Synthetic Multi-person Emergency Medical Service Dialogue Generation from Electronic Patient Care Reports via Multi-LLM Agents

[8] TR-EduVSum: A Turkish-Focused Dataset and Consensus Framework for Educational Video Summarization

[9] Reasoning-Based Refinement of Unsupervised Text Clusters with LLMs

[10] CAMO: A Class-Aware Minority-Optimized Ensemble for Robust Language Model Evaluation on Imbalanced Data

[11] ADAG: Automatically Describing Attribution Graphs

[12] DIVERSED: Relaxed Speculative Decoding via Dynamic Ensemble Verification

[13] Efficient and Effective Internal Memory Retrieval for LLM-Based Healthcare Prediction

[14] Detecting HIV-Related Stigma in Clinical Narratives Using Large Language Models

[15] SepSeq: A Training-Free Framework for Long Numerical Sequence Processing in LLMs

[16] Beyond Social Pressure: Benchmarking Epistemic Attack in Large Language Models

[17] An Empirical Analysis of Static Analysis Methods for Detection and Mitigation of Code Library Hallucinations

[18] Sensitivity-Positional Co-Localization in GQA Transformers

[19] TEMPER: Testing Emotional Perturbation in Quantitative Reasoning

[20] GRASS: Gradient-based Adaptive Layer-wise Importance Sampling for Memory-efficient Large Language Model Fine-tuning

[21] AsyncTLS: Efficient Generative LLM Inference with Asynchronous Two-level Sparse Attention

[22] Tool Retrieval Bridge: Aligning Vague Instructions with Retriever Preferences via Bridge Model

[23] Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers

[24] Why Are We Lonely? Leveraging LLMs to Measure and Understand Loneliness in Caregivers and Non-caregivers

[25] MemReader: From Passive to Active Extraction for Long-Term Agent Memory

[26] Contextualising (Im)plausible Events Triggers Figurative Language

[27] Linear Representations of Hierarchical Concepts in Language Models

[28] Data Selection for Multi-turn Dialogue Instruction Tuning

[29] TSUBASA: Improving Long-Horizon Personalization via Evolving Memory and Self-Learning with Context Distillation

[30] HCRE: LLM-based Hierarchical Classification for Cross-Document Relation Extraction with a Prediction-then-Verification Strategy

[31] Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning

[32] Rethinking Data Mixing from the Perspective of Large Language Models

[33] AtomEval: Atomic Evaluation of Adversarial Claims in Fact Verification

[34] Kathleen: Oscillator-Based Byte-Level Text Classification Without Tokenization or Attention

[35] A Decomposition Perspective to Long-context Reasoning for LLMs

[36] Rag Performance Prediction for Question Answering

[37] Guaranteeing Knowledge Integration with Joint Decoding for Retrieval-Augmented Generation

[38] Efficient Provably Secure Linguistic Steganography via Range Coding

[39] Dual-Pool Token-Budget Routing for Cost-Efficient and Reliable LLM Serving

[40] Quantum Vision Theory Applied to Audio Classification for Deepfake Speech Detection

[41] Initialisation Determines the Basin: Efficient Codebook Optimisation for Extreme LLM Quantization

[42] LLM-Based Data Generation and Clinical Skills Evaluation for Low-Resource French OSCEs

[43] Graph Neural Networks for Misinformation Detection: Performance-Efficiency Trade-offs

[44] Clickbait detection: quick inference with maximum impact

[45] Training Data Size Sensitivity in Unsupervised Rhyme Recognition

[46] Self-Debias: Self-correcting for Debiasing Large Language Models

[47] HyperMem: Hypergraph Memory for Long-Term Conversations

[48] Behavior-Aware Item Modeling via Dynamic Procedural Solution Representations for Knowledge Tracing

[49] Floating or Suggesting Ideas? A Large-Scale Contrastive Analysis of Metaphorical and Literal Verb-Object Constructions

[50] When to Trust Tools? Adaptive Tool Trust Calibration For Tool-Integrated Math Reasoning

[51] Distributed Multi-Layer Editing for Rule-Level Knowledge in Large Language Models

[52] SeLaR: Selective Latent Reasoning in Large Language Models

[53] Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces

[54] A GAN and LLM-Driven Data Augmentation Framework for Dynamic Linguistic Pattern Modeling in Chinese Sarcasm Detection

[55] Synthetic Data for any Differentiable Target

[56] AfriVoices-KE: A Multilingual Speech Dataset for Kenyan Languages

[57] AI generates well-liked but templatic empathic responses

[58] What do Language Models Learn and When? The Implicit Curriculum Hypothesis

[59] Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts

[60] ClawBench: Can AI Agents Complete Everyday Online Tasks?

[61] Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models

cs.CV [Back]

[62] FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios

[63] Personalizing Text-to-Image Generation to Individual Taste

[64] GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents

[65] HY-Embodied-0.5: Embodied Foundation Models for Real-World Agents

[66] SMFD-UNet: Semantic Face Mask Is The Only Thing You Need To Deblur Faces

[67] Training-free Spatially Grounded Geometric Shape Encoding (Technical Report)

[68] On the Uphill Battle of Image frequency Analysis

[69] Mathematical Analysis of Image Matching Techniques

[70] Event-Level Detection of Surgical Instrument Handovers in Videos with Interpretable Vision Models

[71] MSGL-Transformer: A Multi-Scale Global-Local Transformer for Rodent Social Behavior Recognition

[72] Bootstrapping Sign Language Annotations with Sign Language Models

[73] VSAS-BENCH: Real-Time Evaluation of Visual Streaming Assistant Models

[74] Monocular Depth Estimation From the Perspective of Feature Restoration: A Diffusion Enhanced Depth Restoration Approach

[75] Adaptive Depth-converted-Scale Convolution for Self-supervised Monocular Depth Estimation

[76] Weight Group-wise Post-Training Quantization for Medical Foundation Model

[77] FireSenseNet: A Dual-Branch CNN with Cross-Attentive Feature Interaction for Next-Day Wildfire Spread Prediction