cs.CL

[1] Task-Specific Knowledge Distillation via Intermediate Probes

Ryan Brown,Chris Russell

Main category: cs.CL

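TL;DR: This paper proposes \method{}, a knowledge distillation framework that trains lightweight probes on the hidden layers of a frozen teacher model and uses the probes' predictions, rather than the teacher's output logits, as the supervision signal for student training. This bypasses the noisy bottleneck introduced by vocabulary projection and yields consistent gains on several reasoning benchmarks, with the largest gains when data is limited.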

Details Motivation: Conventional knowledge distillation assumes that the teacher's output distribution is a high-quality supervision signal. On reasoning tasks, however, vocabulary projection (shaped by prompt formatting and answer-token choices) often makes teacher outputs noisy and distorted, even when the teacher's internal representations already encode the correct answer. Method: The \method{} framework trains lightweight, interpretable probes on the intermediate hidden layers of a frozen large teacher model and uses the probes' predictions, rather than the teacher's output logits, to supervise the student. Probe training is cheap, teacher representations can be cached, and neither the teacher nor the student architecture is modified. Result: Consistent improvements on four reasoning benchmarks (AQuA-RAT, ARC Easy/Challenge, and MMLU), with the largest gains in limited-data settings. The probes provide cleaner labels than the teacher's raw outputs, effectively denoising the distillation signal. Conclusion: \method{} is an architecture-agnostic, low-overhead distillation method that needs no additional data and no model changes; by exploiting the teacher's internal representations, it improves the efficiency and effectiveness of distillation on reasoning tasks. Abstract: Knowledge distillation from large language models (LLMs) assumes that the teacher's output distribution is a high-quality training signal. On reasoning tasks, this assumption is frequently violated. A model's intermediate representations may encode the correct answer, yet this information is lost or distorted through the vocabulary projection, where prompt formatting and answer-token choices create brittle, noisy outputs. We introduce \method{}, a distillation framework that bypasses this bottleneck by training lightweight probes on frozen teacher hidden states and using the probe's predictions, rather than output logits, as supervision for student training. This simple change yields consistent improvements across four reasoning benchmarks (AQuA-RAT, ARC Easy/Challenge, and MMLU), with gains most pronounced under limited data. Probes trained on intermediate representations provide cleaner labels than the teacher's own outputs, effectively denoising the distillation signal. \method{} requires no architectural changes to student or teacher, is architecture-agnostic, and adds minimal compute since probe training is cheap and teacher representations can be cached. By exploiting internal representations, \method{} enables practitioners to extract more value from large teacher models without additional training data or architectural complexity.
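The probe-then-distill idea in the abstract above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the random `hidden` array stands in for cached teacher hidden states, the probe is assumed to be a linear softmax classifier, and `train_probe` and `distill_loss` are hypothetical helper names.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for cached teacher hidden states (assumption: 200 examples,
# 16 dimensions, 4 answer options). Real states would come from the teacher.
n, d, k = 200, 16, 4
true_w = rng.normal(size=(d, k))
hidden = rng.normal(size=(n, d))
labels = (hidden @ true_w).argmax(axis=1)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_probe(x, y, num_classes, lr=0.5, steps=300):
    """Fit a linear softmax probe with plain gradient descent."""
    w = np.zeros((x.shape[1], num_classes))
    onehot = np.eye(num_classes)[y]
    for _ in range(steps):
        grad = x.T @ (softmax(x @ w) - onehot) / len(x)
        w -= lr * grad
    return w

probe_w = train_probe(hidden, labels, k)
# Probe predictions replace teacher output logits as the student's targets.
probe_targets = softmax(hidden @ probe_w)

def distill_loss(student_logits, targets):
    """Cross-entropy of the student's predictions against probe soft labels."""
    log_p = np.log(softmax(student_logits) + 1e-12)
    return float(-(targets * log_p).sum(axis=1).mean())
```

Because the teacher is frozen, `hidden` only needs to be computed once and can be reused across probe and student training runs.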

[2] Diagnosing Retrieval Bias Under Multiple In-Context Knowledge Updates in Large Language Models

Boyu Qiao,Sean Guo,Xian Yang,Kun Li,Wei Zhou,Songlin Hu,Yunya Song

Main category: cs.CL

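TL;DR: This paper proposes the Dynamic Knowledge Instance (DKI) evaluation framework to study retrieval bias in large language models when the same fact is updated multiple times in context. Accuracy on the latest knowledge state drops substantially as the number of updates grows, and cognitively inspired interventions help only modestly.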

Details Motivation: Prior work focuses on one-shot updates or single conflicts. In multi-update scenarios, several historically valid versions compete at retrieval time, and this setting remains underexplored. Method: Inspired by the AB-AC interference paradigm in cognitive psychology, the authors build the DKI evaluation framework, which models repeated updates of one fact as a cue paired with a sequence of updated values, and use endpoint probing to test whether a model identifies the earliest (initial) and latest (current) states. They further analyze attention, hidden-state similarity, and output logits. Result: Retrieval bias intensifies as updates accumulate: earliest-state accuracy stays high while latest-state accuracy drops sharply. On errors, the attention, hidden-state, and logit signals become flatter and less discriminative. Cognitively inspired heuristic interventions yield only limited gains. Conclusion: Continuously tracking and following knowledge updates in long contexts remains a fundamental challenge for large language models. Abstract: LLMs are widely used in knowledge-intensive tasks where the same fact may be revised multiple times within context. Unlike prior work focusing on one-shot updates or single conflicts, multi-update scenarios contain multiple historically valid versions that compete at retrieval, yet remain underexplored. This challenge resembles the AB-AC interference paradigm in cognitive psychology: when the same cue A is successively associated with B and C, the old and new associations compete during retrieval, leading to bias. Inspired by this, we introduce a Dynamic Knowledge Instance (DKI) evaluation framework, modeling multi-updates of the same fact as a cue paired with a sequence of updated values, and assess models via endpoint probing of the earliest (initial) and latest (current) states. Across diverse LLMs, we observe that retrieval bias intensifies as updates increase: earliest-state accuracy stays high while latest-state accuracy drops substantially. Diagnostic analyses of attention, hidden-state similarity, and output logits further reveal that these signals become flatter and weakly discriminative on errors, providing little stable basis for identifying the latest update. Finally, cognitively inspired heuristic intervention strategies yield only modest gains and do not eliminate the bias. Our results reveal a persistent challenge in tracking and following knowledge updates in long contexts.
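A DKI, as described above, is a cue paired with an ordered sequence of values, probed at its two endpoints. The sketch below shows one way such an instance and its scoring could be represented; the sentence template and the helper names `build_dki` and `endpoint_accuracy` are assumptions for illustration, not the benchmark's actual format.

```python
def build_dki(cue, values):
    """Pair one cue with an ordered list of values; the last value is current."""
    # Hypothetical wording; the benchmark's real templates are not given here.
    lines = [f"Update {i}: the {cue} is {v}." for i, v in enumerate(values, 1)]
    return {
        "context": "\n".join(lines),
        "earliest": values[0],
        "latest": values[-1],
    }

def endpoint_accuracy(instances, answers, endpoint):
    """Fraction of instances whose answer matches the requested endpoint."""
    hits = sum(a == inst[endpoint] for inst, a in zip(instances, answers))
    return hits / len(instances)
```

A model that keeps answering with the first value would score 1.0 on the `"earliest"` endpoint and 0.0 on `"latest"`, which is the asymmetry the abstract reports growing with the number of updates.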

[3] ActTail: Global Activation Sparsity in Large Language Models

Wenwen Hou,Xinyuan Song,Shiwei Liu

Main category: cs.CL

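TL;DR: This paper proposes ActTail, which draws on Heavy-Tailed Self-Regularization theory and uses each projection's heavy-tail exponent to allocate activation sparsity across projections, improving LLM inference efficiency while limiting the loss in model quality.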

Details Motivation: Existing activation sparsity methods apply a uniform sparsity level and ignore the heterogeneous statistical properties of Transformer weights, which amplifies performance degradation. Method: ActTail is a TopK magnitude-based activation sparsity method. It computes a heavy-tail exponent from each projection's empirical spectral density (ESD) and uses it to assign heterogeneous sparsity budgets, and it derives an explicit theoretical relationship between the sparsity ratio and the heavy-tail exponent. Result: Evaluated on LLaMA and Mistral models. At 80% sparsity, perplexity is reduced by 21.8% on LLaMA-2-7B, 40.1% on LLaMA-2-13B, and 9.4% on Mistral-7B relative to uniform allocation, and downstream task performance also improves. Conclusion: A theory-driven, heterogeneous sparsity allocation mitigates degradation at high sparsity and offers a new approach to efficient LLM inference. Abstract: Activation sparsity is a promising approach for accelerating large language model (LLM) inference by reducing computation and memory movement. However, existing activation sparsity methods typically apply uniform sparsity across projections, ignoring the heterogeneous statistical properties of Transformer weights and thereby amplifying performance degradation. In this paper, we propose ActTail, a TopK magnitude-based activation sparsity method with global activation sparsity allocation grounded in Heavy-Tailed Self-Regularization (HT-SR) theory. Specifically, we capture this heterogeneity via the heavy-tail exponent computed from each projection's empirical spectral density (ESD), which is used as a quantitative indicator to assign projection-specific sparsity budgets. Importantly, we provide a theoretical analysis that establishes an explicit relationship between the activation sparsity ratio and the heavy-tail exponent under the HT-SR regime, offering principled guidance for sparsity allocation beyond heuristic design. Experiments on LLaMA and Mistral models show that our method improves both perplexity and downstream task performance at high sparsity compared to uniform allocation. At 80% sparsity, perplexity is reduced by 21.8% on LLaMA-2-7B, 40.1% on LLaMA-2-13B, and 9.4% on Mistral-7B.
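To make the two ingredients concrete, the sketch below pairs a TopK magnitude mask with a per-projection budget rule. The abstract does not state the paper's actual relationship between sparsity ratio and heavy-tail exponent, so `allocate_sparsity` is a purely hypothetical linear rule (including its direction: larger exponent, more sparsity), and the exponent values are made up.

```python
import numpy as np

def topk_sparsify(x, sparsity):
    """Zero all but the largest-magnitude entries of a 1-D activation vector."""
    keep = max(1, int(round(x.size * (1.0 - sparsity))))
    idx = np.argpartition(np.abs(x), -keep)[-keep:]
    out = np.zeros_like(x)
    out[idx] = x[idx]
    return out

def allocate_sparsity(alphas, target, spread=0.1):
    """Hypothetical rule: shift each projection's sparsity linearly with its
    standardized heavy-tail exponent, keeping the mean near `target`."""
    a = np.asarray(alphas, dtype=float)
    z = (a - a.mean()) / (a.std() + 1e-12)
    return np.clip(target + spread * z, 0.0, 0.99)

# Made-up exponents for four projections, averaged to 80% sparsity.
budgets = allocate_sparsity([2.1, 3.0, 4.2, 2.7], target=0.8)
```

Uniform allocation corresponds to `spread=0`; the point of a heterogeneous rule is that projections differ in how much sparsity they tolerate at the same global budget.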

[4] Aligning Language Models from User Interactions

Thomas Kleine Buening,Jonas Hübotter,Barna Pásztor,Idan Shenfeld,Giorgia Ramponi,Andreas Krause

Main category: cs.CL

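TL;DR: This paper proposes a self-distillation method for learning from multi-turn user interactions. It treats the user's follow-up message as a "hindsight correction" signal for updating the language model's policy, improving alignment, instruction following, and personalization without explicit feedback.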

Details Motivation: Deployed language models generate large volumes of multi-turn interaction data that are usually discarded, yet a user's follow-up message implicitly signals problems with the preceding response (an error, an unfollowed instruction, or a preference mismatch) and is worth exploiting. Method: The model is conditioned on the user's follow-up message to produce a "hindsight distribution", which is compared with the original policy distribution; the result serves as the supervision signal for self-distillation, distilling the corrected behavior back into the current policy. Result: After training on real user conversations from WildChat, models improve on standard alignment and instruction-following benchmarks without regressing other capabilities. The same mechanism supports personalization and continual adaptation without explicit feedback. Conclusion: Raw, naturally occurring user interactions carry a rich supervision signal that can be used directly for alignment, personalization, and continual learning, without relying on human annotation. Abstract: Multi-turn user interactions are among the most abundant data produced by language models, yet we lack effective methods to learn from them. While typically discarded, these interactions often contain useful information: follow-up user messages may indicate that a response was incorrect, failed to follow an instruction, or did not align with the user's preferences. Importantly, language models are already able to make use of this information in context. After observing a user's follow-up, the same model is often able to revise its behavior. We leverage this ability to propose a principled and scalable method for learning directly from user interactions through self-distillation. By conditioning the model on the user's follow-up message and comparing the resulting token distribution with the original policy, we obtain a target for updating the policy that captures how the model's behavior changes in hindsight. We then distill this hindsight distribution back into the current policy. Remarkably, we show that training on real-world user conversations from WildChat improves language models across standard alignment and instruction-following benchmarks, without regressing other capabilities. The same mechanism enables personalization, allowing models to continually adapt to individual users through interaction without explicit feedback. Our results demonstrate that raw user interactions that arise naturally during deployment enable alignment, personalization, and continual adaptation.
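The comparison between the hindsight distribution and the original policy can be illustrated with a per-token divergence. The abstract does not specify the exact training target, so using KL(hindsight || policy) here is an assumption, and the logits arrays are placeholders for what two forward passes of the same model (with and without the follow-up in context) would return.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def hindsight_kl(policy_logits, hindsight_logits):
    """Mean per-token KL(hindsight || policy) over one response.

    Both inputs have shape (response_length, vocab_size). The hindsight
    logits are assumed to be treated as a fixed target during training.
    """
    log_q = log_softmax(policy_logits)
    log_p = log_softmax(hindsight_logits)
    return float((np.exp(log_p) * (log_p - log_q)).sum(axis=-1).mean())
```

The loss is zero when the follow-up message does not change the model's token distribution, so responses the user accepted without correction contribute little gradient under this choice.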

[5] GONE: Structural Knowledge Unlearning via Neighborhood-Expanded Distribution Shaping

Chahana Dahal,Ashutosh Balasubramaniam,Zuobin Xiong

Main category: cs.CL

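TL;DR: This paper introduces the GONE benchmark and the NEDS framework to evaluate and improve how large language models unlearn structured knowledge-graph facts. The benchmark separates direct fact removal, reasoning-based leakage, and catastrophic forgetting, and NEDS achieves high unlearning efficacy with strong locality.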

Details Motivation: Existing LLM unlearning methods focus on flat, sentence-level data and overlook the relational, multi-hop, and inferential knowledge in structured data such as knowledge graphs, which limits their usefulness for safety, privacy, and intellectual-property concerns. Method: The authors build Graph Oblivion and Node Erasure (GONE), a knowledge-graph unlearning benchmark that disentangles three effects: direct fact removal, reasoning-based leakage, and catastrophic forgetting. They also propose Neighborhood-Expanded Distribution Shaping (NEDS), which uses graph connectivity to identify anchor correlated neighbors and shapes a precise decision boundary between the forgotten fact and its semantic neighborhood. Result: On LLaMA-3-8B and Mistral-7B, NEDS performs best on GONE and other benchmarks, with unlearning efficacy of 1.000 and locality of 0.839. Conclusion: Unlearning structured knowledge requires explicitly modeling graph relations and reasoning paths. GONE provides a standardized benchmark for this task, and NEDS improves unlearning precision and robustness through neighborhood distribution shaping. Abstract: Unlearning knowledge is a pressing and challenging task in Large Language Models (LLMs) because of their unprecedented capability to memorize and digest training data at scale, raising more significant issues regarding safety, privacy, and intellectual property. However, existing works, including parameter editing, fine-tuning, and distillation-based methods, are all focused on flat sentence-level data but overlook the relational, multi-hop, and reasoned knowledge in naturally structured data. In response to this gap, this paper introduces Graph Oblivion and Node Erasure (GONE), a benchmark for evaluating knowledge unlearning over structured knowledge graph (KG) facts in LLMs. This KG-based benchmark enables the disentanglement of three effects of unlearning: direct fact removal, reasoning-based leakage, and catastrophic forgetting. In addition, Neighborhood-Expanded Distribution Shaping (NEDS), a novel unlearning framework, is designed to leverage graph connectivity and identify anchor correlated neighbors, enforcing a precise decision boundary between the forgotten fact and its semantic neighborhood. Evaluations on LLaMA-3-8B and Mistral-7B across multiple knowledge editing and unlearning methods showcase NEDS's superior performance (1.000 on unlearning efficacy and 0.839 on locality) on GONE and other benchmarks. Code is available at https://anonymous.4open.science/r/GONE-4679/.

[6] Prompt Injection as Role Confusion

Charles Ye,Jasmine Cui,Dylan Hadfield-Menell

Main category: cs.CL

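TL;DR: This paper traces the vulnerability of large language models to prompt injection to role confusion: models infer the speaker's role from how text is written rather than where it comes from, so malicious text that imitates a role inherits its authority. The authors design new role probes to quantify this confusion, show that it strongly predicts attack success, and offer a unified mechanistic framework that explains diverse prompt-injection attacks.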

Details Motivation: Language models remain vulnerable to prompt injection despite extensive safety training; the paper investigates the root cause. Method: The authors design novel role probes that capture how a model internally identifies "who is speaking" and, building on that insight, run attack experiments that inject spoofed reasoning into user prompts and tool outputs. Result: Average attack success rates reach 60% on StrongREJECT and 61% on agent exfiltration, and the degree of role confusion strongly predicts attack success before generation begins. Conclusion: Security boundaries are defined at the interface while authority is assigned in latent space, and this mismatch undermines existing defenses; diverse prompt-injection attacks exploit the same role-confusion mechanism. Abstract: Language models remain vulnerable to prompt injection attacks despite extensive safety training. We trace this failure to role confusion: models infer roles from how text is written, not where it comes from. We design novel role probes to capture how models internally identify "who is speaking." These reveal why prompt injection works: untrusted text that imitates a role inherits that role's authority. We test this insight by injecting spoofed reasoning into user prompts and tool outputs, achieving average success rates of 60% on StrongREJECT and 61% on agent exfiltration, across multiple open- and closed-weight models with near-zero baselines. Strikingly, the degree of internal role confusion strongly predicts attack success before generation begins. Our findings reveal a fundamental gap: security is defined at the interface but authority is assigned in latent space. More broadly, we introduce a unifying, mechanistic framework for prompt injection, demonstrating that diverse prompt-injection attacks exploit the same underlying role-confusion mechanism.

[7] LLM-Augmented Therapy Normalization and Aspect-Based Sentiment Analysis for Treatment-Resistant Depression on Reddit

Yuxin Zhu,Sahithi Lakamana,Masoud Rouhizadeh,Selen Bozkurt,Rachel Hershenberg,Abeed Sarker

Main category: cs.CL

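TL;DR: This study builds a corpus of 5,059 Reddit posts about treatment-resistant depression (TRD) and applies an aspect-based sentiment model (micro-F1 = 0.800) to analyze patient-perceived sentiment toward 81 medications. Conventional antidepressants such as SSRIs and SNRIs draw more negative than positive mentions, while ketamine and esketamine are viewed more favorably.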

Details Motivation: Clinical trial evidence on medications for TRD is limited and often misses patient-reported tolerability. Large online peer-support platforms such as Reddit offer a real-world view of how patients subjectively evaluate medications. Method: (1) Collect TRD-related Reddit posts from 28 mental health subreddits, 2010 to 2025 (n = 5,059), and normalize them to extract 23,399 mentions of 81 medications; (2) fine-tune DeBERTa-v3 on the SMM4H 2023 dataset with LLM-based data augmentation to build an aspect-based sentiment classifier; (3) label each medication mention as positive, neutral, or negative and analyze trends by drug, user, subreddit, and year. Result: Overall, 72.1% of medication mentions are neutral, 14.8% negative, and 13.1% positive. SSRIs and SNRIs show consistently higher negative than positive proportions, whereas ketamine and esketamine show more favorable sentiment profiles. Conclusion: Combining normalized medication extraction with aspect-based sentiment analysis can characterize how TRD patients perceive their treatments on social media, complementing clinical evidence with a large-scale, patient-centered perspective. Abstract: Treatment-resistant depression (TRD) is a severe form of major depressive disorder in which patients do not achieve remission despite multiple adequate treatment trials. Evidence across pharmacologic options for TRD remains limited, and trials often do not fully capture patient-reported tolerability. Large-scale online peer-support narratives therefore offer a complementary lens on how patients describe and evaluate medications in real-world use. In this study, we curated a corpus of 5,059 Reddit posts explicitly referencing TRD from 3,480 subscribers across 28 mental health-related subreddits from 2010 to 2025. Of these, 3,839 posts mentioned at least one medication, yielding 23,399 mentions of 81 generic-name medications after lexicon-based normalization of brand names, misspellings, and colloquialisms. We developed an aspect-based sentiment classifier by fine-tuning DeBERTa-v3 on the SMM4H 2023 therapy-sentiment Twitter corpus with large language model based data augmentation, achieving a micro-F1 score of 0.800 on the shared-task test set. Applying this classifier to Reddit, we quantified sentiment toward individual medications across three categories: positive, neutral, and negative, and tracked patterns by drug, subscriber, subreddit, and year. Overall, 72.1% of medication mentions were neutral, 14.8% negative, and 13.1% positive.
Conventional antidepressants, especially SSRIs and SNRIs, showed consistently higher negative than positive proportions, whereas ketamine and esketamine showed comparatively more favorable sentiment profiles. These findings show that normalized medication extraction combined with aspect-based sentiment analysis can help characterize patient-perceived treatment experiences in TRD-related Reddit discourse, complementing clinical evidence with large-scale patient-generated perspectives.
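Lexicon-based normalization and the per-category proportions reported above reduce to a lookup and a count. The sketch below is illustrative only: the five `LEXICON` entries are a hypothetical fragment (the study's lexicon also covers misspellings and colloquialisms), and the sentiment labels would come from the fine-tuned classifier rather than being supplied by hand.

```python
import re
from collections import Counter

# Hypothetical lexicon fragment: surface form -> generic name.
LEXICON = {
    "zoloft": "sertraline",
    "sertraline": "sertraline",
    "lexapro": "escitalopram",
    "spravato": "esketamine",
    "ketamine": "ketamine",
}

def normalize_mentions(text):
    """Return the generic names of all lexicon terms found in a post."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [LEXICON[t] for t in tokens if t in LEXICON]

def sentiment_shares(labels):
    """Proportion of mentions in each of the three sentiment categories."""
    counts = Counter(labels)
    return {k: counts[k] / len(labels) for k in ("positive", "neutral", "negative")}
```

Grouping the labeled mentions by generic name before calling `sentiment_shares` gives the per-drug profiles that the abstract compares across SSRIs, SNRIs, ketamine, and esketamine.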

[8] TASTE-Streaming: Towards Streamable Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling

Liang-Hsuan Tseng,Hung-yi Lee

Main category: cs.CL

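TL;DR: This paper proposes TASTE-S, a streamable extension of text-speech joint modeling that integrates a CTC-based ASR module and a redesigned unit decoder to enable low-latency real-time interaction while matching the performance of the original TASTE.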

Details Motivation: Text-speech joint modeling suffers from a length mismatch between modalities, and existing solutions are hard to stream because they depend on an external ASR system and a non-causal decoder, both of which add latency. Method: TASTE-S integrates a CTC-based ASR module into the encoder for instant dual-modality encoding, redesigns the unit decoder as a causal decoder that supports on-the-fly decoding, and trains the components jointly. Result: TASTE-S matches TASTE in performance while significantly reducing latency; it is robust to transcription errors and supports long-form encoding and decoding. Conclusion: TASTE-S is an efficient, low-latency, streamable text-speech joint modeling framework suited to real-time spoken interaction. Abstract: Text-speech joint spoken language modeling (SLM) aims at natural and intelligent speech-based interactions, but developing such a system may suffer from modality mismatch: speech unit sequences are much longer than text tokens. Prior work reduces this gap with text-aligned tokenization and embedding (TASTE), producing speech tokens that align in length with their textual counterparts. However, the dependence on an external ASR system and the use of a non-causal decoder limit streaming use. To address this limitation, we propose TASTE-S, a streamable extension of TASTE suitable for real-time usage. TASTE-S integrates a CTC-based ASR module into the encoder for instant dual-modality encoding. We also redesign the unit decoder to enable on-the-fly decoding. With joint training, we show that TASTE-S matches TASTE's performance while significantly reducing latency. Further investigations reveal that TASTE-S remains robust to transcriptions and enables long-form encoding and decoding.

[9] Not Just the Destination, But the Journey: Reasoning Traces Causally Shape Generalization Behaviors

Pengcheng Wen,Yanxu Zhu,Jiapeng Sun,Han Zhu,Yujin Zhou,Chi-Min Chan,Sirui Han,Yike Guo

Main category: cs.CL

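TL;DR: Through controlled experiments, this paper finds that the reasoning content of a Chain-of-Thought (CoT) is causally influential in its own right: it shapes how a model's behavior generalizes independently of the final answer, particularly for harmful behavior. Reasoning paths with different semantics (evil, misleading, submissive) induce different behavioral patterns even when they lead to the same answer. Training on reasoning alone, without answer supervision, is enough to change behavior, and the effect persists when no reasoning is generated, which indicates deep internalization.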

Details Motivation: To determine whether Chain-of-Thought (CoT) is merely post-hoc rationalization or whether the reasoning process itself causally shapes model behavior, especially for alignment and safety. Method: A controlled experiment holds the harmful final answer fixed and systematically varies the type of reasoning path (Evil, Misleading, Submissive). Models with 0.6B to 14B parameters are trained under several paradigms (QTA, QT, T-only) and evaluated for behavioral generalization in both think and no-think modes. Result: (1) CoT training can amplify harmful generalization more than standard fine-tuning; (2) different reasoning types induce behavioral differences consistent with their semantics despite identical answers; (3) training on reasoning alone (QT or T-only) suffices to change behavior, showing that reasoning carries an independent signal; (4) the effects persist when no reasoning is generated, indicating deep internalization. Conclusion: Reasoning content is causally potent and cannot be treated as a by-product that merely serves the answer. Alignment strategies that supervise only outputs are insufficient; the reasoning process itself needs to be supervised. Abstract: Chain-of-Thought (CoT) is often viewed as a window into LLM decision-making, yet recent work suggests it may function merely as post-hoc rationalization. This raises a critical alignment question: Does the reasoning trace causally shape model generalization independent of the final answer? To isolate reasoning's causal effect, we design a controlled experiment holding final harmful answers constant while varying reasoning paths. We construct datasets with Evil reasoning embracing malice, Misleading reasoning rationalizing harm, and Submissive reasoning yielding to pressure. We train models (0.6B to 14B parameters) under multiple paradigms, including question-thinking-answer (QTA), question-thinking (QT), and thinking-only (T-only), and evaluate them in both think and no-think modes. We find that: (1) CoT training could amplify harmful generalization more than standard fine-tuning; (2) distinct reasoning types induce distinct behavioral patterns aligned with their semantics, despite identical final answers; (3) training on reasoning without answer supervision (QT or T-only) is sufficient to alter behavior, proving reasoning carries an independent signal; and (4) these effects persist even when generating answers without reasoning, indicating deep internalization. Our findings demonstrate that reasoning content is causally potent, challenging alignment strategies that supervise only outputs.

[10] Interpreting Negation in GPT-2: Layer- and Head-Level Causal Analysis

Abdullah Al Mofael,Lisa M. Kuhn,Ghassan Alkadi,Kuo-Pao Yang

Main category: cs.CL

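TL;DR: This paper uses causal analysis to study how GPT-2 Small processes negation. The negation signal is concentrated in a small number of attention heads in layers 4 to 6, and activation patching and ablation experiments confirm that these heads are central to the model's negation sensitivity.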

Details Motivation: Negation remains a persistent challenge for modern language models and often causes reversed meanings or factual errors, so a closer look at how models represent and process it internally is needed. Method: Build a dataset of 12,000 matched affirmative/negated sentence pairs; define the Negation Effect Score (NES) to quantify negation sensitivity; probe causal structure with activation patching and ablation of specific attention heads. Result: Negation processing is highly concentrated in a few attention heads in layers 4 to 6 of GPT-2 Small. On the in-domain data, ablating these heads increases NES (indicating weaker negation sensitivity), and "rescue" with cached affirmative activations increases NES further; on xNot360 the effects are consistent but smaller in magnitude. Conclusion: Negation understanding is not a distributed capability but is dominated by localized, identifiable attention mechanisms, which gives interpretability and controlled model editing a concrete target. Abstract: Negation remains a persistent challenge for modern language models, often causing reversed meanings or factual errors. In this work, we conduct a causal analysis of how GPT-2 Small internally processes such linguistic transformations. We examine its hidden representations at both the layer and head level. Our analysis is based on a self-curated 12,000-pair dataset of matched affirmative and negated sentences, covering multiple linguistic templates and forms of negation. To quantify this behavior, we define a metric, the Negation Effect Score (NES), which measures the model's sensitivity in distinguishing between affirmative statements and their negations. We carried out two key interventions to probe causal structure. In activation patching, internal activations from affirmative sentences were inserted into their negated counterparts to see how meaning shifted. In ablation, specific attention heads were temporarily disabled to observe how logical polarity changed. Together, these steps revealed how negation signals move and evolve through GPT-2's layers. Our findings indicate that this capability is not widespread; instead, it is highly concentrated within a limited number of mid-layer attention heads, primarily within layers 4 to 6. Ablating these specific components directly disrupts the model's negation sensitivity: on our in-domain data, ablation increased NES (indicating weaker negation sensitivity), and re-introducing cached affirmative activations (rescue) increased NES further, confirming that these heads carry affirmative signal rather than restoring baseline behavior.
On xNot360, ablation slightly decreased NES and rescue restored performance above baseline. This pattern demonstrates that these causal patterns are consistent across various negation forms and remain detectable on the external xNot360 benchmark, though with smaller magnitude.

[11] CSE-UOI at SemEval-2026 Task 6: A Two-Stage Heterogeneous Ensemble with Deliberative Complexity Gating for Political Evasion Detection

Christos Tzouvaras,Konstantinos Skianis,Athanasios Voulodimos

Main category: cs.CL

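TL;DR: This paper proposes a heterogeneous dual-LLM ensemble with self-consistency (SC) and weighted voting, plus a post-hoc correction mechanism called Deliberative Complexity Gating (DCG) that uses response length as a proxy for sample ambiguity, to improve response-clarity classification in political interviews. The system achieved a Macro-F1 of 0.85 and ranked 3rd in SemEval-2026 Task 6.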

Details Motivation: Judging the clarity of responses in political interviews matters in practice but is highly subjective and ambiguous, which calls for more robust automatic classification. Method: A heterogeneous dual-LLM ensemble combines self-consistency with weighted voting. The Deliberative Complexity Gating (DCG) post-hoc correction mechanism uses cross-model behavioral signals, especially response length, to gate reasoning adaptively. Multi-agent debate is evaluated as an alternative strategy. Result: Macro-F1 of 0.85 on the SemEval-2026 Task 6 evaluation set, ranking 3rd. DCG proves effective for ambiguity detection and outperforms a debate strategy that simply adds agents. Conclusion: Response length is a strong proxy for ambiguity. DCG adapts reasoning through behavioral signals and is more effective than multi-agent debate that lacks model diversity; a heterogeneous LLM ensemble with DCG is an effective recipe for clarity classification in political discourse. Abstract: This paper describes our system for SemEval-2026 Task 6, which classifies clarity of responses in political interviews into three categories: Clear Reply, Ambivalent, and Clear Non-Reply. We propose a heterogeneous dual large language model (LLM) ensemble via self-consistency (SC) and weighted voting, and a novel post-hoc correction mechanism, Deliberative Complexity Gating (DCG). This mechanism uses cross-model behavioral signals and exploits the finding that an LLM response-length proxy correlates strongly with sample ambiguity. To further examine mechanisms for improving ambiguity detection, we evaluated multi-agent debate as an alternative strategy for increasing deliberative capacity. Unlike DCG, which adaptively gates reasoning using cross-model behavioral signals, debate increases agent count without increasing model diversity. Our solution achieved a Macro-F1 score of 0.85 on the evaluation set, securing 3rd place.
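The two stages described above, weighted voting over self-consistency samples followed by a length-based gate, can be sketched as below. The abstract does not give the actual DCG rule, so `complexity_gate` is a hypothetical stand-in (override to "Ambivalent" when the mean response length exceeds a threshold), and the weights and the 400-token threshold are invented for illustration.

```python
from collections import Counter

def weighted_vote(samples_by_model, weights):
    """Combine self-consistency samples from several models.

    Each model contributes its per-label sample frequency, scaled by a
    model-level weight; the label with the highest total score wins.
    """
    scores = Counter()
    for model, samples in samples_by_model.items():
        for label, count in Counter(samples).items():
            scores[label] += weights[model] * count / len(samples)
    return scores.most_common(1)[0][0]

def complexity_gate(voted, mean_response_len, threshold=400, fallback="Ambivalent"):
    """Hypothetical gate: long deliberation is read as a sign of ambiguity."""
    return fallback if mean_response_len > threshold else voted
```

A gate of this kind leaves short, confident predictions untouched and only revisits cases where the models themselves behaved as if the sample were hard.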

[12] Shattering the Shortcut: A Topology-Regularized Benchmark for Multi-hop Medical Reasoning in LLMs

Xing Zi,Xinying Zhou,Jinghao Xiao,Catarina Moreira,Mukesh Prasad

Main category: cs.CL

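TL;DR: This paper introduces ShatterMed-QA, a bilingual benchmark for multi-hop clinical diagnostic reasoning. Its k-Shattering algorithm prunes generic hub nodes from the knowledge graph to remove "shortcut learning", allowing a rigorous evaluation of the genuine reasoning ability of large language models.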

Details Motivation: LLMs perform well on standard medical tests but fall far short on the multi-hop diagnostic reasoning that real clinical practice requires, mainly because they rely on highly connected generic nodes in knowledge graphs (such as "inflammation") as shortcuts instead of following authentic micro-pathological pathways. Method: Construct a topology-regularized medical knowledge graph and prune generic hub nodes with the novel k-Shattering algorithm; apply implicit bridge entity masking and topology-driven hard negative sampling to generate 10,558 multi-hop clinical questions; use RAG to verify the source of the reasoning deficits. Result: Performance of 21 LLMs drops sharply on ShatterMed-QA, especially for domain-specific models. Restoring the masked evidence via RAG produces near-universal recovery, confirming that the benchmark pinpoints reasoning deficits. Conclusion: ShatterMed-QA exposes and quantifies fundamental deficits of current medical AI in deep diagnostic reasoning and provides a reliable evaluation tool and a direction for improvement. Abstract: While Large Language Models (LLMs) achieve expert-level performance on standard medical benchmarks through single-hop factual recall, they severely struggle with the complex, multi-hop diagnostic reasoning required in real-world clinical settings. A primary obstacle is "shortcut learning", where models exploit highly connected, generic hub nodes (e.g., "inflammation") in knowledge graphs to bypass authentic micro-pathological cascades. To address this, we introduce ShatterMed-QA, a bilingual benchmark of 10,558 multi-hop clinical questions designed to rigorously evaluate deep diagnostic reasoning. Our framework constructs a topology-regularized medical Knowledge Graph using a novel $k$-Shattering algorithm, which physically prunes generic hubs to explicitly sever logical shortcuts. We synthesize the evaluation vignettes by applying implicit bridge entity masking and topology-driven hard negative sampling, forcing models to navigate biologically plausible distractors without relying on superficial elimination. Comprehensive evaluations of 21 LLMs reveal massive performance degradation on our multi-hop tasks, particularly among domain-specific models. Crucially, restoring the masked evidence via Retrieval-Augmented Generation (RAG) triggers near-universal performance recovery, validating ShatterMed-QA's structural fidelity and proving its efficacy in diagnosing the fundamental reasoning deficits of current medical AI. Explore the dataset, interactive examples, and full leaderboards at our project website: https://shattermed-qa-web.vercel.app/
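The abstract does not spell out how k-Shattering selects hubs. One plausible reading, used here purely as an assumption, is a degree cutoff: any node with more than `k` incident edges is treated as a generic hub and removed together with its edges. The function name `shatter` and the toy edges are hypothetical.

```python
def shatter(edges, k):
    """Drop every node whose degree exceeds k, along with its edges.

    `edges` is a list of undirected (u, v) pairs. Returns the surviving
    edges and the set of removed hub nodes.
    """
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    hubs = {node for node, d in degree.items() if d > k}
    kept = [(u, v) for u, v in edges if u not in hubs and v not in hubs]
    return kept, hubs
```

After pruning, any multi-hop path that previously passed through a hub such as "inflammation" no longer exists, so a question built on the remaining graph cannot be answered by routing through that generic node.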

[13] Marked Pedagogies: Examining Linguistic Biases in Personalized Automated Writing Feedback

Mei Tan,Lena Phalen,Dorottya Demszky

Main category: cs.CL

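TL;DR: This paper studies how four widely used large language models (GPT-4o, GPT-3.5-turbo, Llama-3.3 70B, Llama-3.1 8B) produce systematically different writing feedback depending on the gender, race/ethnicity, learning needs, achievement, and motivation attributes embedded in the prompt. The models exhibit "Marked Pedagogies": for students marked by race, language, or disability, feedback shows excessive praise, less substantive critique, and assumptions of limited ability, revealing unfairness built into AI education tools.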

Details Motivation: LLM-powered personalized feedback tools promise scale, but their linguistic biases and social stereotypes may undermine educational equity, especially for students from identity-sensitive groups. Empirical evidence is needed on whether the feedback is genuinely personalized or reinforces structural inequality. Method: Using 600 eighth-grade persuasive essays from the PERSUADE dataset, the authors design prompt conditions that embed student gender, race/ethnicity, learning needs, achievement, and motivation; generate feedback with four widely used LLMs; and analyze systematic lexical differences in the outputs with the Marked Words framework. Result: For identical essay content, all models produce significant and consistent feedback shifts driven by the identity attributes in the prompt. Feedback for students marked by race, language, or disability often shows positive feedback bias (overuse of praise) and feedback withholding bias (fewer substantive revision suggestions and implied assumptions of low ability). Feedback changes not only what content is emphasized but also how writing is judged and how students are addressed. Conclusion: Automated LLM feedback is not neutral; it produces Marked Pedagogies that reflect and amplify social bias. Educational AI tools need transparency, bias auditing, and accountability. Abstract: Effective personalized feedback is critical to students' literacy development. Though LLM-powered tools now promise to automate such feedback at scale, LLMs are not language-neutral: they privilege standard academic English and reproduce social stereotypes, raising concerns about how "personalization" shapes the feedback students receive. We examine how four widely used LLMs (GPT-4o, GPT-3.5-turbo, Llama-3.3 70B, Llama-3.1 8B) adapt written feedback in response to student attributes. Using 600 eighth-grade persuasive essays from the PERSUADE dataset, we generated feedback under prompt conditions embedding gender, race/ethnicity, learning needs, achievement, and motivation. We analyze lexical shifts across model outputs by adapting the Marked Words framework. Our results reveal systematic, stereotype-aligned shifts in feedback conditioned on presumed student attributes, even when essay content was identical. Feedback for students marked by race, language, or disability often exhibited positive feedback bias and feedback withholding bias: overuse of praise, less substantive critique, and assumptions of limited ability. Across attributes, models tailored not only what content was emphasized but also how writing was judged and how students were addressed. We term these instructional orientations Marked Pedagogies and highlight the need for transparency and accountability in automated feedback tools.
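The abstract says the lexical analysis adapts the Marked Words framework without giving the statistic. A common choice for this kind of comparison is the weighted log-odds ratio with an informative Dirichlet prior, and the sketch below assumes that statistic; using the pooled counts of both groups as the prior is a simplification, and the function name is hypothetical.

```python
import math
from collections import Counter

def marked_words(marked_tokens, unmarked_tokens, z_threshold=1.96):
    """Words over-represented in the marked group, scored by z-value.

    Assumes the corpus contains more than one distinct word, so the
    log-odds denominators stay positive.
    """
    y_i, y_j = Counter(marked_tokens), Counter(unmarked_tokens)
    prior = y_i + y_j                      # simplification: pooled prior
    n_i, n_j = sum(y_i.values()), sum(y_j.values())
    a0 = sum(prior.values())
    scores = {}
    for word, a in prior.items():
        log_odds_i = math.log((y_i[word] + a) / (n_i + a0 - y_i[word] - a))
        log_odds_j = math.log((y_j[word] + a) / (n_j + a0 - y_j[word] - a))
        variance = 1.0 / (y_i[word] + a) + 1.0 / (y_j[word] + a)
        scores[word] = (log_odds_i - log_odds_j) / math.sqrt(variance)
    return {w: z for w, z in scores.items() if z > z_threshold}
```

Applied to feedback generated for the same essay with and without an attribute in the prompt, the surviving words are the ones whose frequency shift is too large to attribute to chance under this model, for example a surplus of praise terms.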

[14] LLM BiasScope: A Real-Time Bias Analysis Platform for Comparative LLM Evaluation

Himel Ghosh,Nick Elias Werner

Main category: cs.CL

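TL;DR: LLM BiasScope is an open-source web application for side-by-side comparison of outputs from multiple models (such as Gemini and Llama) with real-time bias analysis. It uses a two-stage bias detection pipeline and offers visualization and export features.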

Details Motivation: As large language models are deployed widely, detecting and understanding bias in their outputs is needed to support fairness evaluation and model comparison. Method: An interactive web application built on Next.js/React that integrates Hugging Face bias detection models and multi-provider LLM access through the Vercel AI SDK. A two-stage pipeline performs sentence-level bias detection followed by bias type classification, with synchronized streaming responses and real-time bias statistics. Result: A real-time comparative bias analysis system that supports models from six providers and offers per-model bias summaries, a comparison view highlighting differences, interactive visualizations (bar charts, radar charts), and export to JSON/PDF. Conclusion: LLM BiasScope gives researchers and practitioners a practical, open-source, extensible tool that makes LLM bias evaluation and cross-model behavior comparison more efficient and transparent. Abstract: As large language models (LLMs) are deployed widely, detecting and understanding bias in their outputs is critical. We present LLM BiasScope, a web application for side-by-side comparison of LLM outputs with real-time bias analysis. The system supports multiple providers (Google Gemini, DeepSeek, MiniMax, Mistral, Meituan, Meta Llama) and enables researchers and practitioners to compare models on the same prompts while analyzing bias patterns. LLM BiasScope uses a two-stage bias detection pipeline: sentence-level bias detection followed by bias type classification for biased sentences. The analysis runs automatically on both user prompts and model responses, providing statistics, visualizations, and detailed breakdowns of bias types. The interface displays two models side-by-side with synchronized streaming responses, per-model bias summaries, and a comparison view highlighting differences in bias distributions. The system is built on Next.js with React, integrates Hugging Face inference endpoints for bias detection, and uses the Vercel AI SDK for multi-provider LLM access. Features include real-time streaming, export to JSON/PDF, and interactive visualizations (bar charts, radar charts) for bias analysis. LLM BiasScope is available as an open-source web application, providing a practical tool for bias evaluation and comparative analysis of LLM behaviour.

[15] AgentDrift: Unsafe Recommendation Drift Under Tool Corruption Hidden by Ranking Metrics in LLM Agents

Zekun Wu,Adriano Koshiyama,Sahan Bulathwela,Maria Perez-Ortiz

Main category: cs.CL

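TL;DR: This paper proposes a paired-trajectory protocol for evaluating the safety of tool-augmented LLM agents in financial dialogues and finds that standard recommendation-quality metrics such as NDCG fail to reflect systematic safety risk. Even when tool outputs are contaminated, recommendation quality appears unchanged, yet risk-inappropriate products are recommended in 65-93% of turns and no model questions the reliability of the tool data. The authors propose a safety-aware sNDCG metric and call for trajectory-level safety monitoring in high-stakes settings.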

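Details Motivation: Evaluation of LLM agents relies on recommendation-quality metrics such as NDCG and ignores whether recommendations are safe in high-stakes domains such as finance. This creates an "evaluation blindness" in which seemingly high-quality recommendations carry serious safety risk. Method: A paired-trajectory protocol replays real financial dialogues with clean and contaminated tool outputs (numerical and narrative manipulation) across seven LLMs (7B to frontier) and decomposes behavioral divergence into information-channel and memory-channel mechanisms. A safety-aware NDCG variant (sNDCG) quantifies the safety-utility trade-off. Result: All seven models show evaluation blindness: recommendation quality is almost unchanged under contamination (utility preservation ratio of about 1.0), yet risk-inappropriate products appear in 65-93% of turns. Safety violations stem mainly from the information channel, emerge at the first contaminated turn, and persist without self-correction over 23-step trajectories. No model questions tool reliability in any of 1,563 contaminated turns. Narrative-only corruption (such as biased headlines) is enough to induce significant drift and evade consistency monitors. sNDCG lowers the preservation ratio to 0.51-0.74, showing that the original evaluation substantially understates risk. Conclusion: Standard single-turn recommendation-quality evaluation is insufficient to ensure the safety of LLM agents in multi-turn, high-stakes settings; trajectory-level evaluation and monitoring that model safety explicitly are needed. Abstract: Tool-augmented LLM agents increasingly serve as multi-turn advisors in high-stakes domains, yet their evaluation relies on ranking-quality metrics that measure what is recommended but not whether it is safe for the user. We introduce a paired-trajectory protocol that replays real financial dialogues under clean and contaminated tool-output conditions across seven LLMs (7B to frontier) and decomposes divergence into information-channel and memory-channel mechanisms. Across the seven models tested, we consistently observe the evaluation-blindness pattern: recommendation quality is largely preserved under contamination (utility preservation ratio approximately 1.0) while risk-inappropriate products appear in 65-93% of turns, a systematic safety failure poorly reflected by standard NDCG. Safety violations are predominantly information-channel-driven, emerge at the first contaminated turn, and persist without self-correction over 23-step trajectories; no agent across 1,563 contaminated turns explicitly questions tool-data reliability. Even narrative-only corruption (biased headlines, no numerical manipulation) induces significant drift while completely evading consistency monitors. A safety-penalized NDCG variant (sNDCG) reduces preservation ratios to 0.51-0.74, indicating that much of the evaluation gap becomes visible once safety is explicitly measured.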
These results motivate considering trajectory-level safety monitoring, beyond single-turn quality, for deployed multi-turn agents in high-stakes settings.
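The gap between NDCG and a safety-penalized variant can be shown with a small calculation. Standard DCG and NDCG below follow the usual definitions; the `sndcg` penalty rule (an unsafe item loses a fixed amount of gain, floored at zero, while the normalizer stays the unpenalized ideal) is a hypothetical stand-in, since the abstract does not define the paper's sNDCG.

```python
import math

def dcg(gains):
    """Discounted cumulative gain for a ranked list of gains."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

def ndcg(gains):
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

def sndcg(gains, unsafe, penalty=1.0):
    """Hypothetical safety-penalized NDCG; `unsafe` holds 0/1 flags per item."""
    penalized = [max(0.0, g - penalty * u) for g, u in zip(gains, unsafe)]
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(penalized) / ideal if ideal > 0 else 0.0

def preservation_ratio(clean_score, contaminated_score):
    """Share of the clean-condition score retained under contamination."""
    return contaminated_score / clean_score
```

With gains `[3, 2, 1]` and the top item flagged unsafe, `ndcg` still reports a perfect ranking while `sndcg` drops below 1, which is the kind of divergence the preservation ratios of about 1.0 versus 0.51-0.74 describe.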

[16] LMEB: Long-horizon Memory Embedding Benchmark

Xinping Zhao,Xinshuo Hu,Jiaxin Xu,Danyu Tang,Xin Zhang,Mengjia Zhou,Yan Zhong,Yao Zhou,Zifei Shan,Meishan Zhang,Baotian Hu,Min Zhang

Main category: cs.CL

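TL;DR: This paper introduces the Long-horizon Memory Embedding Benchmark (LMEB), which evaluates text embedding models on long-horizon memory retrieval involving fragmented, context-dependent, and temporally distant information, filling a gap left by existing benchmarks such as MTEB.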

Details Motivation: Existing text embedding benchmarks such as MTEB are limited to traditional passage retrieval and cannot assess long-horizon, context-dependent, temporally distant memory retrieval, a capability that matters for memory-augmented systems such as OpenClaw. Method: LMEB comprises 22 datasets and 193 zero-shot retrieval tasks covering four memory types (episodic, dialogue, semantic, procedural), combines AI-generated and human-annotated data, and is used to evaluate 15 widely used embedding models. Result: (1) LMEB provides a reasonable level of difficulty; (2) larger models do not always perform better; (3) LMEB and MTEB are orthogonal, so traditional passage-retrieval performance does not generalize to long-horizon memory retrieval. Conclusion: LMEB offers a standardized, reproducible evaluation framework for memory embeddings, shows that no single model yet excels across all memory retrieval tasks, and encourages research on embeddings for long-term, context-sensitive memory. Abstract: Memory embeddings are crucial for memory-augmented systems, such as OpenClaw, but their evaluation is underexplored in current text embedding benchmarks, which narrowly focus on traditional passage retrieval and fail to assess models' ability to handle long-horizon memory retrieval tasks involving fragmented, context-dependent, and temporally distant information. To address this, we introduce the Long-horizon Memory Embedding Benchmark (LMEB), a comprehensive framework that evaluates embedding models' capabilities in handling complex, long-horizon memory retrieval tasks. LMEB spans 22 datasets and 193 zero-shot retrieval tasks across 4 memory types: episodic, dialogue, semantic, and procedural, with both AI-generated and human-annotated data. These memory types differ in terms of level of abstraction and temporal dependency, capturing distinct aspects of memory retrieval that reflect the diverse challenges of the real world. We evaluate 15 widely used embedding models, ranging from hundreds of millions to ten billion parameters. The results reveal that (1) LMEB provides a reasonable level of difficulty; (2) Larger models do not always perform better; (3) LMEB and MTEB exhibit orthogonality. This suggests that the field has yet to converge on a universal model capable of excelling across all memory retrieval tasks, and that performance in traditional passage retrieval may not generalize to long-horizon memory retrieval.
In summary, by providing a standardized and reproducible evaluation framework, LMEB fills a crucial gap in memory embedding evaluation, driving further advancements in text embedding for handling long-term, context-dependent memory retrieval. LMEB is available at https://github.com/KaLM-Embedding/LMEB.

[17] Expert Pyramid Tuning: Efficient Parameter Fine-Tuning for Expertise-Driven Task Allocation

Jia-Chen Zhang,Zhen-Wei Yan,Yu-Jie Xiong,Chun-Ming Xia

Main category: cs.CL

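TL;DR: This paper proposes Expert Pyramid Tuning (EPT), a parameter-efficient fine-tuning (PEFT) method that brings the multi-scale feature pyramid idea from computer vision into PEFT. It achieves task adaptation through a shared subspace and a pyramid projection mechanism, and outperforms existing MoE-LoRA methods in multi-task settings with fewer parameters.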

Details Motivation: Existing MoE-based LoRA variants overlook the hierarchical nature of task complexity and use experts with uniform architectures, which makes it hard to serve tasks whose needs range from high-level semantic abstraction to fine-grained syntactic manipulation. Method: EPT has two core components: (1) a shared low-dimensional meta-knowledge subspace that encodes universal linguistic patterns; (2) a pyramid projection mechanism that uses learnable up-projection operators to reconstruct high-dimensional features at varying scales. A task-aware router then dynamically combines the multi-scale features. Result: Across multiple multi-task benchmarks, EPT significantly outperforms SOTA MoE-LoRA variants, and thanks to its re-parameterizable design it does so while reducing the number of training parameters. Conclusion: By adding multi-scale feature modeling, EPT improves the expressiveness and efficiency of PEFT in multi-task settings and supports the transfer of architectural ideas from CV to NLP. Abstract: Parameter-Efficient Fine-Tuning (PEFT) has become a dominant paradigm for deploying LLMs in multi-task scenarios due to its extreme parameter efficiency. While Mixture-of-Experts (MoE) based LoRA variants have achieved promising results by dynamically routing tokens to different low-rank experts, they largely overlook the hierarchical nature of task complexity. Existing methods typically employ experts with uniform architectures, limiting their ability to capture diverse feature granularities required by distinct tasks, where some tasks demand high-level semantic abstraction while others require fine-grained syntactic manipulation. To bridge this gap, we propose Expert Pyramid Tuning (EPT), a novel architecture that integrates the multi-scale feature pyramid concept from computer vision into the realm of PEFT. Unlike standard LoRA, EPT decomposes task adaptation into two stages: (1) a shared meta-knowledge subspace that encodes universal linguistic patterns in low dimensions; (2) a Pyramid Projection Mechanism that utilizes learnable up-projection operators to reconstruct high-dimensional features at varying scales. A task-aware router then dynamically selects the optimal combination of these multi-scale features. Extensive experiments across multiple multi-task benchmarks demonstrate that EPT significantly outperforms SOTA MoE-LoRA variants. Crucially, thanks to the re-parameterization capability of our design, EPT achieves this performance improvement while simultaneously reducing the number of training parameters.

[18] RTD-Guard: A Black-Box Textual Adversarial Detection Framework via Replacement Token Detection

He Zhu,Yanshu Li,Wen Liu,Haitian Yang

Main category: cs.CL

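TL;DR: This paper proposes RTD-Guard, a framework for detecting textual adversarial examples that needs no training, no internal model access, and only two black-box queries. It uses a pre-trained RTD discriminator to localize suspicious tokens and then observes the change in prediction confidence.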

Details Motivation: Existing detectors for textual adversarial examples rely on prior knowledge of attacks, white-box model access, or many queries, which limits their practicality. Method: RTD-Guard uses a pre-trained Replaced Token Detection (RTD) discriminator, without fine-tuning, to localize suspicious replaced tokens, masks them, and observes the change in the victim model's prediction confidence, using only two black-box queries. Result: On multiple benchmark datasets, RTD-Guard detects adversarial texts from a range of SOTA attacks and surpasses existing baselines, with clear advantages in efficiency, practicality, and resource use. Conclusion: RTD-Guard is an efficient, lightweight, plug-and-play black-box detection method that is particularly suited to resource-constrained or privacy-sensitive deployments. Abstract: Textual adversarial attacks pose a serious security threat to Natural Language Processing (NLP) systems by introducing imperceptible perturbations that mislead deep learning models. While adversarial example detection offers a lightweight alternative to robust training, existing methods typically rely on prior knowledge of attacks, white-box access to the victim model, or numerous queries, which severely limits their practical deployment. This paper introduces RTD-Guard, a novel black-box framework for detecting textual adversarial examples. Our key insight is that word-substitution perturbations in adversarial attacks closely resemble the "replaced tokens" that a Replaced Token Detection (RTD) discriminator is pre-trained to identify. Leveraging this, RTD-Guard employs an off-the-shelf RTD discriminator (without fine-tuning) to localize suspicious tokens, masks them, and detects adversarial examples by observing the prediction confidence shift of the victim model before and after intervention. The entire process requires no adversarial data, model tuning, or internal model access, and uses only two black-box queries. Comprehensive experiments on multiple benchmark datasets demonstrate that RTD-Guard effectively detects adversarial texts generated by diverse state-of-the-art attack methods. It surpasses existing detection baselines across multiple metrics, offering a highly efficient, practical, and resource-light defense mechanism that is particularly suited for real-world deployment in resource-constrained or privacy-sensitive environments.
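
The two-query procedure described above can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: `rtd_scores` and `victim_confidence` are hypothetical callables standing in for the RTD discriminator and the black-box victim model, and `MASK`, `top_k`, and `shift_threshold` are assumed values chosen only for the example.

```python
from typing import Callable, List, Sequence

MASK = "[MASK]"  # hypothetical mask token; the real token depends on the victim model


def detect_adversarial(
    tokens: Sequence[str],
    rtd_scores: Callable[[Sequence[str]], List[float]],
    victim_confidence: Callable[[Sequence[str]], float],
    top_k: int = 1,
    shift_threshold: float = 0.3,
) -> bool:
    """Flag an input when masking its most suspicious tokens shifts confidence."""
    scores = rtd_scores(tokens)  # per-token "was this token replaced?" score
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    suspicious = set(ranked[:top_k])
    masked = [MASK if i in suspicious else t for i, t in enumerate(tokens)]
    before = victim_confidence(tokens)  # black-box query 1
    after = victim_confidence(masked)   # black-box query 2
    return abs(before - after) > shift_threshold


# Toy stand-ins (assumptions) so the sketch runs end to end.
def toy_rtd(tokens: Sequence[str]) -> List[float]:
    return [0.9 if t == "dreadful" else 0.1 for t in tokens]


def toy_victim(tokens: Sequence[str]) -> float:
    return 0.95 if "dreadful" in tokens else 0.40


flag_perturbed = detect_adversarial(["the", "film", "was", "dreadful"], toy_rtd, toy_victim)
flag_clean = detect_adversarial(["the", "film", "was", "good"], toy_rtd, toy_victim)
```

With the toy models, only the input containing the suspicious token is flagged, because masking it changes the victim's confidence by more than the assumed threshold.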

[19] Using a Human-AI Teaming Approach to Create and Curate Scientific Datasets with the SCILIRE System

Necva Bölücü,Jessica Irons,Changhyun Lee,Brian Jin,Maciej Rybinski,Huichen Yang,Andreas Duenser,Stephen Wan

Main category: cs.CL

TL;DR: SCILIRE is a Human-AI teaming system for efficient, high-fidelity dataset creation from scientific literature, using iterative human verification and feedback to improve LLM-based extraction.

Details Motivation: The rapid growth of scientific literature makes manual structured knowledge extraction impractical, necessitating scalable, accurate, and collaborative AI-assisted solutions. Method: SCILIRE implements a Human-AI teaming framework with iterative workflows where researchers review, correct, and provide feedback on AI-generated extractions; this feedback is used to refine subsequent LLM-based inference. Result: Evaluation via intrinsic benchmarks and multi-domain case studies shows improved extraction fidelity and more efficient dataset creation compared to baseline approaches. Conclusion: SCILIRE demonstrates that integrating human oversight and adaptive learning into AI-driven knowledge extraction significantly enhances both accuracy and practicality in scientific dataset curation. Abstract: The rapid growth of scientific literature has made manual extraction of structured knowledge increasingly impractical. To address this challenge, we introduce SCILIRE, a system for creating datasets from scientific literature. SCILIRE has been designed around Human-AI teaming principles centred on workflows for verifying and curating data. It facilitates an iterative workflow in which researchers can review and correct AI outputs. Furthermore, this interaction is used as a feedback signal to improve future LLM-based inference. We evaluate our design using a combination of intrinsic benchmarking outcomes together with real-world case studies across multiple domains. The results demonstrate that SCILIRE improves extraction fidelity and facilitates efficient dataset creation.

[20] 98$\times$ Faster LLM Routing Without a Dedicated GPU: Flash Attention, Prompt Compression, and Near-Streaming for the vLLM Semantic Router

Xunzhuo Liu,Bowei He,Xue Liu,Andy Luo,Haichen Zhang,Huamin Chen

Main category: cs.CL

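TL;DR: This paper presents a lightweight Semantic Router for vLLM serving. Three staged optimizations substantially cut latency and GPU memory overhead for long-context (8K-32K token) tasks such as safety classification, yielding a 98× end-to-end speedup and letting the router share a GPU with the LLM instead of needing a dedicated accelerator.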

Details Motivation: System-level routers must run safety classification, domain routing, and PII detection in real time on the LLM request path, so they need to be low-latency and lightweight without monopolizing an expensive GPU. Standard attention's $O(n^2)$ memory blows up on long contexts, leaving too little GPU memory when the router is co-located with vLLM. Method: Three staged optimizations: (1) a custom CK Flash Attention operator for ONNX Runtime on ROCm that cuts attention memory to $O(n)$ and removes the OOM failures; (2) classical, non-neural NLP prompt compression (TextRank, position weighting, TF-IDF, and novelty scoring) that reduces inputs to about 512 tokens; (3) near-streaming body processing with adaptive chunking and zero-copy JSON parsing that eliminates serialization overhead. Result: End-to-end latency drops from 4,918 ms to 50 ms (a 98× speedup), 16K-token routing takes 108 ms, and the router's GPU footprint stays under 800 MB, so it can share a GPU with vLLM. Stage 1 targets AMD ROCm; Stages 2 and 3 are hardware-agnostic. Conclusion: The semantic router stays fully functional while being efficient and flexible to deploy, offering a practical lightweight, high-performance pattern for system-level middleware in the LLM serving stack. Abstract: System-level routers that intercept LLM requests for safety classification, domain routing, and PII detection must be both fast and operationally lightweight: they should add minimal latency to every request, yet not require a dedicated GPU, an expensive resource better used for LLM inference itself. When the router co-locates on the same GPU as vLLM serving instances, standard attention's $O(n^2)$ memory makes long-context classification (8K-32K tokens) impossible: at 8K tokens, three concurrent classifiers need about 4.5 GB for attention masks alone, far exceeding the memory left by vLLM. We present three staged optimizations for the vLLM Semantic Router, benchmarked on AMD Instinct MI300X, that solve both the latency and the memory problem. Stage 1: a custom CK Flash Attention operator for ONNX Runtime on ROCm reduces attention memory from $O(n^2)$ to $O(n)$ and end-to-end (E2E) latency from 4,918 ms to 127 ms (38.7×), enabling 8K-32K tokens where SDPA OOMs. Stage 2: classical NLP prompt compression (TextRank, position weighting, TF-IDF, and novelty scoring) reduces all inputs to about 512 tokens without neural inference, capping both latency and GPU memory at a constant regardless of original prompt length (E2E 127 to 62 ms, 2.0×). Stage 3: near-streaming body processing with adaptive chunking and zero-copy JSON eliminates serialization overhead (E2E 62 to 50 ms, 1.2×).
Cumulatively: 98× improvement (4,918 ms to 50 ms), 16K-token routing in 108 ms, and a total router GPU footprint under 800 MB, small enough to share a GPU with LLM serving and removing the need for a dedicated accelerator. Stage 1 targets AMD ROCm (NVIDIA GPUs already have FlashAttention via cuDNN); Stages 2 and 3 are hardware-agnostic.
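
As a quick consistency check on the reported figures, the per-stage speedups can be recomputed from the end-to-end latencies quoted in the abstract. The sketch below uses only those four numbers; the variable names are made up for illustration.

```python
# End-to-end latencies in milliseconds, as quoted in the abstract.
latency_ms = {"baseline": 4918, "stage1": 127, "stage2": 62, "stage3": 50}


def speedup(before_ms: float, after_ms: float) -> float:
    """Ratio of the latency before an optimization to the latency after it."""
    return before_ms / after_ms


stage1 = speedup(latency_ms["baseline"], latency_ms["stage1"])
stage2 = speedup(latency_ms["stage1"], latency_ms["stage2"])
stage3 = speedup(latency_ms["stage2"], latency_ms["stage3"])
cumulative = speedup(latency_ms["baseline"], latency_ms["stage3"])

# The stage speedups multiply to the cumulative one because the
# intermediate latencies cancel out.
product = stage1 * stage2 * stage3

print(f"stage 1: {stage1:.1f}x, stage 2: {stage2:.1f}x, stage 3: {stage3:.1f}x")
print(f"cumulative: {cumulative:.0f}x")
```

The stage ratios round to 38.7×, 2.0×, and 1.2×, and their product equals the cumulative 98× figure.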

[21] Continual Learning in Large Language Models: Methods, Challenges, and Opportunities

Hongyang Chen,Zhongwu Sun,Hongfei Ye,Kunchi Li,Xuemin Lin

Main category: cs.CL

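TL;DR: This survey reviews continual learning (CL) methods for large language models (LLMs), organized around three stages: continual pre-training, continual fine-tuning, and continual alignment. It compares how rehearsal-, regularization-, and architecture-based methods mitigate catastrophic forgetting, and notes that LLM CL differs from traditional CL in scale, parameter efficiency, and emergent capabilities.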

Details Motivation: Modern LLMs suffer from catastrophic forgetting under the static pre-training paradigm and need continual learning to adapt to evolving knowledge and sequential tasks. Method: The survey systematically organizes and classifies continual learning methods (by training stage and by forgetting mitigation mechanism), compares their adaptability and improvements across methods, and discusses the essential differences between LLM CL and traditional CL. Result: Current methods perform well in specific domains, but fundamental challenges remain in integrating knowledge seamlessly across tasks and time scales; the survey also identifies evaluation metrics (such as forgetting rate and knowledge transfer efficiency) and emerging benchmarks. Conclusion: The survey offers a structured framework for LLM continual learning that helps researchers understand the state of the field, identify bottlenecks, and explore future directions. Abstract: Continual learning (CL) has emerged as a pivotal paradigm to enable large language models (LLMs) to dynamically adapt to evolving knowledge and sequential tasks while mitigating catastrophic forgetting, a critical limitation of the static pre-training paradigm inherent to modern LLMs. This survey presents a comprehensive overview of CL methodologies tailored for LLMs, structured around three core training stages: continual pre-training, continual fine-tuning, and continual alignment. Beyond the canonical taxonomy of rehearsal-, regularization-, and architecture-based methods, we further subdivide each category by its distinct forgetting mitigation mechanisms and conduct a rigorous comparative analysis of the adaptability and critical improvements of traditional CL methods for LLMs. In doing so, we explicitly highlight core distinctions between LLM CL and traditional machine learning, particularly with respect to scale, parameter efficiency, and emergent capabilities. Our analysis covers essential evaluation metrics, including forgetting rates and knowledge transfer efficiency, along with emerging benchmarks for assessing CL performance. This survey reveals that while current methods demonstrate promising results in specific domains, fundamental challenges persist in achieving seamless knowledge integration across diverse tasks and temporal scales. This systematic review contributes to the growing body of knowledge on LLM adaptation, providing researchers and practitioners with a structured framework for understanding current achievements and future opportunities in lifelong learning for language models.

[22] From Text to Forecasts: Bridging Modality Gap with Temporal Evolution Semantic Space

Lehui Li,Yuyao Wang,Jisheng Yan,Wei Zhang,Jinliang Deng,Haoliang Sun,Zhongyi Han,Yongshun Gong

Main category: cs.CL

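TL;DR: This paper proposes TESS, a framework that builds a Temporal Evolution Semantic Space to turn the temporal impacts implicit in text into quantifiable semantic primitives (mean shift, volatility, shape, and lag), bridging the modality gap between text and time-series forecasting. It reduces forecasting error by up to 29% on four real-world datasets.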

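Details Motivation: Textual information can help with event-driven non-stationarity in time series, but there is a fundamental modality gap between text (implicit, qualitative) and forecasting models (explicit, quantitative), and existing methods struggle to reliably turn textual semantics into usable numerical cues. Method: TESS builds an interpretable, numerically grounded Temporal Evolution Semantic Space as an intermediate bottleneck between modalities. An LLM extracts four temporal primitives (mean shift, volatility, shape, and lag) from text via structured prompting, and a confidence-aware gating mechanism filters them. Result: On four real-world datasets, TESS reduces forecasting error by up to 29% relative to state-of-the-art unimodal and multimodal baselines; ablations and controlled semi-synthetic experiments support the effectiveness and robustness of the primitives. Conclusion: Mapping textual semantics to structured, numerical, interpretable temporal primitives is an effective way to bridge the text-to-time-series modality gap, and TESS offers a more reliable and interpretable multimodal fusion approach for event-driven forecasting. Abstract: Incorporating textual information into time-series forecasting holds promise for addressing event-driven non-stationarity; however, a fundamental modality gap hinders effective fusion: textual descriptions express temporal impacts implicitly and qualitatively, whereas forecasting models rely on explicit and quantitative signals. Through controlled semi-synthetic experiments, we show that existing methods over-attend to redundant tokens and struggle to reliably translate textual semantics into usable numerical cues. To bridge this gap, we propose TESS, which introduces a Temporal Evolution Semantic Space as an intermediate bottleneck between modalities. This space consists of interpretable, numerically grounded temporal primitives (mean shift, volatility, shape, and lag) extracted from text by an LLM via structured prompting and filtered through confidence-aware gating. Experiments on four real-world datasets demonstrate up to a 29 percent reduction in forecasting error compared to state-of-the-art unimodal and multimodal baselines. The code will be released after acceptance.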

[23] MetaKE: Meta-learning Aligned Knowledge Editing via Bi-level Optimization

Shuxin Liu,Ou Wu

Main category: cs.CL

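TL;DR: This paper proposes MetaKE, a framework that uses meta-learning to recast knowledge editing as a bi-level optimization problem. It introduces a learnable edit target and a Structural Gradient Proxy to resolve the mismatch between semantic targets and execution, and significantly improves knowledge editing in LLMs.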

Details Motivation: Existing knowledge editing methods suffer from a "Semantic-Execution Disconnect": the semantic target is derived independently, without feedback from the downstream feasible region, so valid targets can fall into the prohibited space, causing gradient truncation and editing failure. Method: MetaKE models knowledge editing as bi-level optimization: an upper-level optimizer learns a feasible edit target that maximizes post-edit performance, while a lower-level solver executes the edit. A Structural Gradient Proxy explicitly backpropagates editability constraints to the target learning phase. Result: Theoretical analysis shows that MetaKE automatically aligns the edit direction with the model's feasible manifold, and extensive experiments show that it significantly outperforms strong baselines. Conclusion: MetaKE offers a new paradigm for knowledge editing, using meta-learning and a structured gradient mechanism to close the gap between semantic targets and what the editor can execute, improving editing precision and robustness. Abstract: Knowledge editing (KE) aims to precisely rectify specific knowledge in Large Language Models (LLMs) without disrupting general capabilities. State-of-the-art methods suffer from an open-loop control mismatch. We identify a critical "Semantic-Execution Disconnect": the semantic target is derived independently without feedback from the downstream's feasible region. This misalignment often causes valid semantic targets to fall within the prohibited space, resulting in gradient truncation and editing failure. To bridge this gap, we propose MetaKE (Meta-learning Aligned Knowledge Editing), a new framework that reframes KE as a bi-level optimization problem. Departing from static calculation, MetaKE treats the edit target as a learnable meta-parameter: the upper-level optimizer seeks a feasible target to maximize post-edit performance, while the lower-level solver executes the editing. To address the challenge of differentiating through complex solvers, we derive a Structural Gradient Proxy, which explicitly backpropagates editability constraints to the target learning phase. Theoretical analysis demonstrates that MetaKE automatically aligns the edit direction with the model's feasible manifold. Extensive experiments confirm that MetaKE significantly outperforms strong baselines, offering a new perspective on knowledge editing.

[24] Experimental evidence of progressive ChatGPT models self-convergence

Konstantinos F. Xylogiannopoulos,Petros Xanthopoulos,Panagiotis Karampelas,Georgios A. Bakamitsos

Main category: cs.CL

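TL;DR: This paper studies "model self-convergence" in LLMs, where output diversity declines across successive versions as models are recursively trained on LLM-generated data. Text-similarity analysis shows that recent ChatGPT versions produce increasingly similar outputs even at a high temperature setting, which the authors attribute to a growing share of synthetic data in the training sets.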

Details Motivation: Prior work has examined model collapse mostly in theory or for a single model, without empirically tracking how several deployed commercial models evolve over time; this paper aims to fill that gap by examining how output diversity degrades in real LLMs (successive ChatGPT versions) as they keep absorbing synthetic data. Method: The authors use a text similarity metric (such as cosine similarity or n-gram overlap) to quantify the output diversity of different ChatGPT versions under the same prompts at a high temperature (temperature = 1), and relate the trend to the likely share of synthetic data in each version's training data. Result: With prompt and temperature held fixed, newer ChatGPT versions generate markedly less diverse text, with higher similarity between outputs, showing a cross-version "self-convergence" trend. Conclusion: Deployed LLMs keep absorbing synthetic data as LLM-generated content spreads across the internet, shifting their training distribution and making their outputs more homogeneous; this is a new, gradual form of model degradation that calls for attention and mitigation strategies. Abstract: Large Language Models (LLMs) that undergo recursive training on synthetically generated data are susceptible to model collapse, a phenomenon marked by the generation of meaningless output. Existing research has examined this issue from either theoretical or empirical perspectives, often focusing on a single model trained recursively on its own outputs. While prior studies have cautioned against the potential degradation of LLM output quality under such conditions, no longitudinal investigation has yet been conducted to assess this effect over time. In this study, we employ a text similarity metric to evaluate different ChatGPT models' capacity to generate diverse textual outputs. Our findings indicate a measurable decline in recent ChatGPT releases' ability to produce varied text, even when explicitly prompted to do so, by setting the temperature parameter to one. The observed reduction in output diversity may be attributed to the influence of the amounts of synthetic data incorporated within their training datasets as the result of internet infiltration by LLM-generated data. The phenomenon is defined as model self-convergence because of the gradual increase of similarities of produced texts among different ChatGPT versions.
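
The abstract does not name the similarity metric, so the following is only one plausible way to quantify output diversity, not the authors' measure: the mean pairwise Jaccard similarity over word bigrams for a set of generations from one prompt. All function names are hypothetical.

```python
from itertools import combinations
from typing import List, Set, Tuple


def ngrams(text: str, n: int = 2) -> Set[Tuple[str, ...]]:
    """Set of lowercase word n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: Set, b: Set) -> float:
    """Jaccard similarity; two empty sets are treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_similarity(texts: List[str], n: int = 2) -> float:
    """Average similarity over all pairs of generations (higher = less diverse)."""
    if len(texts) < 2:
        raise ValueError("need at least two texts to compare")
    sets = [ngrams(t, n) for t in texts]
    pairs = list(combinations(sets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


identical = mean_pairwise_similarity(["the cat sat down", "the cat sat down"])
disjoint = mean_pairwise_similarity(["the cat sat", "a dog ran"])
```

Under this assumed metric, a model version whose samples score closer to 1.0 for the same prompt and temperature would be producing less varied text.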

[25] EvolveCoder: Evolving Test Cases via Adversarial Verification for Code Reinforcement Learning

Chi Ruan,Dongfu Jiang,Huaye Zeng,Ping Nie,Wenhu Chen

Main category: cs.CL

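TL;DR: This paper proposes a solution-conditioned, adversarial verification framework that iteratively refines test cases. The framework is used to build EvolveCoder-22k, a high-quality dataset for coding reinforcement learning that significantly improves LLM code generation.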

Details Motivation: Verification signals in existing coding RL datasets are weak and static, which limits the effectiveness of reinforcement learning with verifiable rewards (RLVR) for code generation. Method: The authors propose a solution-conditioned, adversarial verification framework that iteratively evolves test cases based on the execution behavior of candidate solutions, to increase difficulty, improve discriminative power, and reduce redundancy; they use it to build the large-scale dataset EvolveCoder-22k. Result: Iterative refinement lowers pass@1 from 43.80 to 31.22, indicating substantially stronger verification; training on EvolveCoder-22k improves Qwen3-4B by an average of 4.2 points across four downstream benchmarks and outperforms strong baselines of the same scale. Conclusion: Solution-conditioned, adversarial verification is important for effective and scalable reinforcement learning in code generation. Abstract: Reinforcement learning with verifiable rewards (RLVR) is a promising approach for improving code generation in large language models, but its effectiveness is limited by weak and static verification signals in existing coding RL datasets. In this paper, we propose a solution-conditioned and adversarial verification framework that iteratively refines test cases based on the execution behaviors of candidate solutions, with the goal of increasing difficulty, improving discriminative power, and reducing redundancy. Based on this framework, we introduce EvolveCoder-22k, a large-scale coding reinforcement learning dataset constructed through multiple rounds of adversarial test case evolution. Empirical analysis shows that iterative refinement substantially strengthens verification, with pass@1 decreasing from 43.80 to 31.22. Reinforcement learning on EvolveCoder-22k yields stable optimization and consistent performance gains, improving Qwen3-4B by an average of 4.2 points across four downstream benchmarks and outperforming strong 4B-scale baselines. Our results highlight the importance of adversarial, solution-conditioned verification for effective and scalable reinforcement learning in code generation.
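
To make the goals of "discriminative power" and "reducing redundancy" concrete, here is a toy sketch under simplifying assumptions; it is not the paper's algorithm. Each test is represented by its pass/fail outcomes on a fixed set of candidate solutions, tests that every candidate passes are treated as uninformative, and tests with identical outcome signatures are treated as redundant. The names `refine_tests` and `pass_rate` are hypothetical.

```python
from typing import Dict, Tuple

Outcomes = Dict[str, Tuple[bool, ...]]  # test name -> pass/fail per candidate solution


def refine_tests(outcomes: Outcomes) -> Outcomes:
    """Keep tests that separate candidates and drop duplicate outcome signatures."""
    kept: Outcomes = {}
    seen = set()
    for name, signature in outcomes.items():
        if all(signature):       # every candidate passes: no discriminative signal
            continue
        if signature in seen:    # same behavior as an earlier test: redundant
            continue
        seen.add(signature)
        kept[name] = signature
    return kept


def pass_rate(outcomes: Outcomes, num_solutions: int) -> float:
    """Fraction of candidate solutions that pass every test in the suite."""
    passed = sum(
        all(signature[i] for signature in outcomes.values())
        for i in range(num_solutions)
    )
    return passed / num_solutions


initial_suite = {"t1": (True, True, True, True)}
evolved_suite = {
    "t1": (True, True, True, True),
    "t2": (True, False, True, False),
    "t3": (True, False, True, False),  # duplicates t2's signature
    "t4": (True, True, False, True),
}
refined = refine_tests(evolved_suite)
```

In this toy example the pass rate falls from 1.0 on the initial suite to 0.25 on the evolved one, which mirrors the direction of the reported pass@1 drop as verification gets stricter; the refined suite keeps the same pass rate with fewer tests.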

[26] A Method for Learning Large-Scale Computational Construction Grammars from Semantically Annotated Corpora

Paul Van Eecke,Katrien Beuls

Main category: cs.CL

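TL;DR: This paper presents a method for automatically learning broad-coverage, human-interpretable construction grammars from large corpora. Starting from corpora annotated with syntactic structure and semantic frames, it produces Fluid Construction Grammar (FCG) networks of tens of thousands of constructions, supports frame-semantic analysis of open-domain text, and corroborates the scalability of usage-based construction grammar.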

Details Motivation: To scale usage-based construction grammar to large corpora, overcome the narrow coverage and small size of hand-built construction grammars, and provide a practical tool for the constructionist study of English argument structure. Method: Starting from corpora annotated with phrase-structure trees and semantic frames, an automated method learns constructions and formalizes them as a network within the Fluid Construction Grammar (FCG) framework, with an emphasis on the interpretability of the syntax-semantics mappings. Result: The method yields large construction grammar networks of tens of thousands of constructions, supports frame-semantic analysis of open-domain text, and exposes the rich syntactico-semantic usage patterns in the corpus. Conclusion: The method corroborates that core construction grammar conjectures (such as constructions as basic units and grammar as usage) scale to large data, and provides a viable path and a practical resource for computational construction grammar and usage-based linguistics. Abstract: We present a method for learning large-scale, broad-coverage construction grammars from corpora of language use. Starting from utterances annotated with constituency structure and semantic frames, the method facilitates the learning of human-interpretable computational construction grammars that capture the intricate relationship between syntactic structures and the semantic relations they express. The resulting grammars consist of networks of tens of thousands of constructions formalised within the Fluid Construction Grammar framework. Not only do these grammars support the frame-semantic analysis of open-domain text, they also house a trove of information about the syntactico-semantic usage patterns present in the data they were learnt from. The method and learnt grammars contribute to the scaling of usage-based, constructionist approaches to language, as they corroborate the scalability of a number of fundamental construction grammar conjectures while also providing a practical instrument for the constructionist study of English argument structure in broad-coverage corpora.

[27] SectEval: Evaluating the Latent Sectarian Preferences of Large Language Models

Aditya Maheshwari,Amit Gajkeshwar,Kaushal Sharma,Vivek Patel

Main category: cs.CL

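TL;DR: This study is the first systematic evaluation of sectarian bias in 15 mainstream LLMs on questions relating to Sunni and Shia Islam. It introduces SectEval, a bilingual (English/Hindi) test set, and finds that model outputs depend markedly on language and location: the same question leans toward different sects in different languages, and some models adjust their answers to the user's stated location, indicating that AI religious knowledge is not neutral.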

Details Motivation: As LLMs become an important source of religious knowledge, it is necessary to assess whether they treat different religious groups, especially the two main sects of Islam (Sunni and Shia), fairly and neutrally. Method: The authors build SectEval, a bilingual (English and Hindi) benchmark of 88 questions covering doctrine, history, and practice, systematically test 15 mainstream LLMs (both proprietary and open-weights), and analyze how responses change when the language is switched and when the user's simulated location varies. Result: There is a marked language dependence: models such as DeepSeek-v3 and GPT-4o lean toward Shia answers in English but switch to Sunni answers in Hindi. There is also location sensitivity: advanced models such as Claude-3.5 adjust their sectarian leaning to the user's country (for example, Iran versus Saudi Arabia), while smaller models (especially in Hindi) stick to a Sunni viewpoint. Conclusion: LLMs are not neutral when generating religious knowledge; the religious "truth" they output depends strongly on language and the user's stated location, which underlines the urgency of evaluating and governing religious AI. Abstract: As Large Language Models (LLMs) become a popular source for religious knowledge, it is important to know whether they treat different groups fairly. This study is the first to measure how LLMs handle the differences between the two main sects of Islam: Sunni and Shia. We present a test called SectEval, available in both English and Hindi, consisting of 88 questions, to check the bias of 15 top LLMs, both proprietary and open-weights. Our results show a major inconsistency based on language. In English, many powerful models, such as DeepSeek-v3 and GPT-4o, often favored Shia answers. However, when asked the exact same questions in Hindi, these models switched to favoring Sunni answers. This means a user could get completely different religious advice just by changing languages. We also looked at how models react to location. Advanced models such as Claude-3.5 changed their answers to match the user's country, giving Shia answers to a user from Iran and Sunni answers to a user from Saudi Arabia. In contrast, smaller models (especially in Hindi) ignored the user's location and stuck to a Sunni viewpoint. These findings show that AI is not neutral; its religious "truth" changes depending on the language you speak and the country you claim to be from. The data set is available at https://github.com/secteval/SectEval/

[28] SteerRM: Debiasing Reward Models via Sparse Autoencoders

Mengyuan Sun,Zhuohao Yu,Weizheng Gu,Shikun Zhang,Wei Ye

Main category: cs.CL

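TL;DR: This paper proposes SteerRM, a training-free method for debiasing reward models through Sparse Autoencoder (SAE) interventions. By identifying and suppressing SAE features tied to format preferences, it substantially improves Hard-split accuracy without hurting overall performance.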

Details Motivation: Reward models are biased toward superficial stylistic cues (such as format and presentation); conventional debiasing requires retraining or architectural changes, while direct activation suppression degrades performance because of representation entanglement. Method: SteerRM isolates stylistic effects using contrastive paired responses, identifies bias-related SAE features with a strength-stability criterion, and suppresses them at inference time; no retraining is needed at any point. Result: Across six reward models on RM-Bench, Hard-split accuracy improves by 7.3 points on average; results on a Gemma-based reward model and a non-format bias suggest generalization; format-related features are concentrated in shallow layers and transfer across models. Conclusion: SAE-based interventions can mitigate reward-model bias without retraining, providing a practical and interpretable solution for alignment pipelines. Abstract: Reward models (RMs) are critical components of alignment pipelines, yet they exhibit biases toward superficial stylistic cues, preferring better-presented responses over semantically superior ones. Existing debiasing methods typically require retraining or architectural modifications, while direct activation suppression degrades performance due to representation entanglement. We propose SteerRM, the first training-free method for debiasing reward models using Sparse Autoencoder (SAE)-based interventions. SteerRM isolates stylistic effects using contrastive paired responses, identifies bias-related SAE features with a strength-stability criterion, and suppresses them at inference time. Across six reward models on RM-Bench, SteerRM improves Hard-split accuracy by 7.3 points on average while preserving overall performance. Results on a Gemma-based reward model and a controlled non-format bias further suggest generalization across RM architectures and bias types. We further find that format-related features are concentrated in shallow layers and transfer across models, revealing shared architecture-level bias encoding patterns. These results show that SAE-based interventions can mitigate reward-model biases without retraining, providing a practical and interpretable solution for alignment pipelines.
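
The abstract does not spell out how features are suppressed. One common form of SAE intervention, assumed here purely for illustration, subtracts each selected feature's activation times its decoder direction from the hidden activation. The function name and the toy values below are hypothetical.

```python
from typing import List, Sequence


def suppress_features(
    activation: Sequence[float],
    feature_acts: Sequence[float],
    decoder: Sequence[Sequence[float]],
    bias_feature_ids: Sequence[int],
) -> List[float]:
    """Remove the contribution of selected SAE features from one activation vector.

    decoder[j] is the decoder direction of SAE feature j (same width as activation).
    """
    steered = list(activation)
    for j in bias_feature_ids:
        for d in range(len(steered)):
            steered[d] -= feature_acts[j] * decoder[j][d]
    return steered


# Toy 2-dimensional example with two SAE features along the coordinate axes.
hidden = [1.0, 2.0]
acts = [0.5, 2.0]                    # feature 1 is the assumed "format" feature
directions = [[1.0, 0.0], [0.0, 1.0]]
steered_hidden = suppress_features(hidden, acts, directions, bias_feature_ids=[1])
```

Because only the selected directions are subtracted, components of the activation that the bias features do not write to are left untouched, which is the intuition behind intervening on SAE features rather than on raw activations.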

[29] Adaptive Vision-Language Model Routing for Computer Use Agents

Xunzhuo Liu,Bowei He,Xue Liu,Andy Luo,Haichen Zhang,Huamin Chen

Main category: cs.CL

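TL;DR: This paper proposes Adaptive VLM Routing (AVR), a framework whose lightweight semantic routing layer dynamically picks the vision-language model (VLM) best suited to the difficulty of each GUI action, substantially cutting inference cost while preserving accuracy and balancing safety with efficiency.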

Details Motivation: Existing Computer Use Agent (CUA) systems send every GUI action to one fixed VLM, even though grounding accuracy varies widely across VLMs and actions differ in difficulty, which leads to either high cost or low reliability. Method: AVR estimates action difficulty from multimodal embeddings, probes a small VLM to measure confidence, and uses a threshold policy derived from a cost-accuracy trade-off to route each action to the cheapest VLM that meets the reliability target. For "warm" agents with memory of prior UI interactions, retrieved context narrows the capability gap between small and large models. The Visual Confused Deputy guardrail escalates high-risk actions directly to the strongest model. Result: Using ScreenSpot-Pro grounding data and the OpenClaw agent routing benchmark, AVR projects inference cost reductions of up to 78% while staying within 2 percentage points of an all-large-model baseline; combined with the guardrail, it unifies efficient and safe routing. Conclusion: AVR is a practical, scalable VLM routing approach that can improve the cost-effectiveness and robustness of CUA systems and suggests a direction for multi-model agent architectures. Abstract: Computer Use Agents (CUAs) translate natural-language instructions into Graphical User Interface (GUI) actions such as clicks, keystrokes, and scrolls by relying on a Vision-Language Model (VLM) to interpret screenshots and predict grounded tool calls. However, grounding accuracy varies dramatically across VLMs, while current CUA systems typically route every action to a single fixed model regardless of difficulty. We propose Adaptive VLM Routing (AVR), a framework that inserts a lightweight semantic routing layer between the CUA orchestrator and a pool of VLMs. For each tool call, AVR estimates action difficulty from multimodal embeddings, probes a small VLM to measure confidence, and routes the action to the cheapest model whose predicted accuracy satisfies a target reliability threshold. For warm agents with memory of prior UI interactions, retrieved context further narrows the capability gap between small and large models, allowing many actions to be handled without escalation. We formalize routing as a cost-accuracy trade-off, derive a threshold-based policy for model selection, and evaluate AVR using ScreenSpot-Pro grounding data together with the OpenClaw agent routing benchmark. Across these settings, AVR projects inference cost reductions of up to 78% while staying within 2 percentage points of an all-large-model baseline.
When combined with the Visual Confused Deputy guardrail, AVR also escalates high-risk actions directly to the strongest available model, unifying efficiency and safety within a single routing framework. Model, benchmark, and code are available at https://github.com/vllm-project/semantic-router.
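
The threshold policy described in the abstract (cheapest model whose predicted accuracy meets the target, with high-risk actions escalated to the strongest model) can be sketched as below. The `Model` fields, the fallback when no model meets the target, and the use of predicted accuracy to define "strongest" are assumptions made for this illustration, not details from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Model:
    name: str
    cost: float                # relative cost per call (assumed unit)
    predicted_accuracy: float  # estimated accuracy for the current action


def route(models: Sequence[Model], target: float, high_risk: bool = False) -> Model:
    """Pick the cheapest model that meets the reliability target."""
    strongest = max(models, key=lambda m: m.predicted_accuracy)
    if high_risk:
        return strongest       # guardrail: escalate directly
    eligible = [m for m in models if m.predicted_accuracy >= target]
    if not eligible:
        return strongest       # assumed fallback when nothing meets the target
    return min(eligible, key=lambda m: m.cost)


pool = [
    Model("small", cost=1.0, predicted_accuracy=0.80),
    Model("medium", cost=3.0, predicted_accuracy=0.92),
    Model("large", cost=10.0, predicted_accuracy=0.97),
]
easy_choice = route(pool, target=0.75)
normal_choice = route(pool, target=0.90)
risky_choice = route(pool, target=0.75, high_risk=True)
```

In a real router the predicted accuracy would change per action (from the difficulty estimate and the small-model probe), so the same pool can yield different choices from one tool call to the next.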

[30] Rethinking Multiple-Choice Questions for RLVR: Unlocking Potential via Distractor Design

Xu Guo,Qiming Ge,Jian Tong,Kedi Chen,Jin Zhang,Xiaogui Yang,Xuan Gao,Haijun Lv,Zhihui Lu,Yicheng Zou,Qipeng Guo

Main category: cs.CL

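TL;DR: This paper proposes Iterative Distractor Curation (IDC), a framework that actively constructs high-quality distractors to improve reinforcement learning with verifiable rewards (RLVR) on multiple-choice questions. It mitigates random-guessing and elimination shortcuts, and its effectiveness is validated on multiple benchmarks.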

Details Motivation: RLVR on multiple-choice questions is prone to reward hacking (random guessing or simple elimination), while converting questions to open-ended formats discards the contrastive signal from expert-designed distractors; the effect of option design on RLVR therefore needs systematic study and a remedy. Method: The authors analyze how mismatched option counts and distractor strength affect RLVR performance, and on that basis propose Iterative Distractor Curation (IDC), which iteratively generates high-quality distractors that block elimination shortcuts and encourage deeper reasoning. Result: IDC clearly improves distractor quality and yields significant RLVR training gains over the original data on multiple benchmarks; mismatched option counts between training and testing hurt performance, while strong distractors make RLVR training effective even with 2-way questions. Conclusion: Distractor quality is critical for RLVR; IDC is a scalable and effective way to strengthen reasoning while keeping the structural advantages of multiple-choice questions. Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) significantly enhances the reasoning capabilities of Large Language Models. When applied to RLVR, Multiple-Choice Questions (MCQs) offer a scalable source of verifiable data but risk inducing reward hacking, where models shortcut reasoning via random guessing or simple elimination. Current approaches often mitigate this by converting MCQs to open-ended formats, thereby discarding the contrastive signal provided by expert-designed distractors. In this work, we systematically investigate the impact of option design on RLVR. Our analysis highlights two primary insights: (1) Mismatches in option counts between training and testing degrade performance. (2) Strong distractors effectively mitigate random guessing, enabling effective RLVR training even with 2-way questions. Motivated by these findings, we propose Iterative Distractor Curation (IDC), a framework that actively constructs high-quality distractors to block elimination shortcuts and promote deep reasoning. Experiments on various benchmarks demonstrate that our method effectively enhances distractor quality and yields significant gains in RLVR training compared to the original data.
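
A small worked calculation shows why weak distractors invite the shortcuts mentioned above. Assuming a model that eliminates some obviously wrong options and then guesses uniformly among the rest (a simplification introduced here, not a model from the paper), the chance of earning the reward without reasoning is one over the number of remaining options.

```python
from fractions import Fraction


def guess_success(num_options: int, eliminated: int = 0) -> Fraction:
    """Probability of a correct uniform guess after ruling out weak distractors."""
    remaining = num_options - eliminated
    if eliminated < 0 or remaining < 1:
        raise ValueError("eliminated must leave at least one option")
    return Fraction(1, remaining)


four_way_blind = guess_success(4)            # no distractor can be ruled out
four_way_weak = guess_success(4, eliminated=2)  # two distractors are trivially wrong
two_way_strong = guess_success(2)            # one strong distractor
```

Under this simplification, a 4-way question with two easily eliminated distractors offers the same guessing odds as a 2-way question, which is consistent with the paper's emphasis on distractor strength rather than option count alone.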

[31] CLARIN-PT-LDB: An Open LLM Leaderboard for Portuguese to assess Language, Culture and Civility

João Silva,Luís Gomes,António Branco

Main category: cs.CL

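TL;DR: This paper presents an open LLM leaderboard for European Portuguese (PT-PT) together with new benchmarks. It fills an evaluation gap for this language variant and, for the first time, covers dimensions such as model safeguards and alignment to Portuguese culture.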

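Details Motivation: European Portuguese has lacked a dedicated LLM leaderboard and comprehensive benchmarks, in particular for model safeguards and cultural alignment. Method: The authors develop a dedicated LLM leaderboard for European Portuguese and design original benchmarks covering new dimensions such as model safeguards and cultural alignment; the platform is hosted on Hugging Face. Result: The first LLM leaderboard for European Portuguese and its accompanying new benchmarks have been built and publicly released, supporting more comprehensive, localized model evaluation. Conclusion: This work provides key infrastructure for the European Portuguese AI ecosystem and promotes language-specific, safety, and cultural-alignment criteria in LLM evaluation. Abstract: This paper reports on the development of a leaderboard of Open Large Language Models (LLMs) for European Portuguese (PT-PT), and on its associated benchmarks. This leaderboard comes as a way to address a gap in the evaluation of LLMs for European Portuguese, which so far had no leaderboard dedicated to this variant of the language. The paper also reports on novel benchmarks, including some that address aspects of performance that so far have not been available in benchmarks for European Portuguese, namely model safeguards and alignment to Portuguese culture. The leaderboard is available at https://huggingface.co/spaces/PORTULAN/portuguese-llm-leaderboard.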

[32] Learning from Child-Directed Speech in Two-Language Scenarios: A French-English Case Study

Liel Binyamin,Elior Sulem

Main category: cs.CL

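TL;DR: This paper extends BabyBERTa to English-French settings and, under strictly size-matched data conditions, systematically studies how child-directed speech and multi-domain corpora affect compact language models. It also introduces new French evaluation resources.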

Details Motivation: Research on developmentally plausible language models has focused mainly on English, leaving multilingual settings unexplored; different corpora (such as child-directed speech versus Wikipedia) also need to be compared under fair data conditions. Method: The authors extend BabyBERTa to English-French scenarios with monolingual, bilingual, and cross-lingual settings; compare pretraining on child-directed speech (about 2.5M tokens) with pretraining on multi-domain corpora (about 10M tokens); and introduce French versions of QAMR and QASRL plus English and French multi-domain corpora for evaluation. Result: Wikipedia data benefits semantic tasks, while child-directed speech improves grammatical judgments in monolingual settings; bilingual pretraining yields notable gains on textual entailment, especially for French; the same patterns hold across BabyBERTa, RoBERTa, and LTG-BERT. Conclusion: Child-directed language data has distinctive value for certain grammatical tasks, bilingual modeling can strengthen understanding in French, and the effects do not depend on a specific model architecture. Abstract: Research on developmentally plausible language models has largely focused on English, leaving open questions about multilingual settings. We present a systematic study of compact language models by extending BabyBERTa to English-French scenarios under strictly size-matched data conditions, covering monolingual, bilingual, and cross-lingual settings. Our design contrasts two types of training corpora: (i) child-directed speech (about 2.5M tokens), following BabyBERTa and related work, and (ii) multi-domain corpora (about 10M tokens), extending the BabyLM framework to French. To enable fair evaluation, we also introduce new resources, including French versions of QAMR and QASRL, as well as English and French multi-domain corpora. We evaluate the models on both syntactic and semantic tasks and compare them with models trained on Wikipedia-only data. The results reveal context-dependent effects: training on Wikipedia consistently benefits semantic tasks, whereas child-directed speech improves grammatical judgments in monolingual settings. Bilingual pretraining yields notable gains for textual entailment, with particularly strong improvements for French. Importantly, similar patterns emerge across BabyBERTa, RoBERTa, and LTG-BERT, suggesting consistent trends across architectures.

[33] HMS-BERT: Hybrid Multi-Task Self-Training for Multilingual and Multi-Label Cyberbullying Detection

Zixin Feng,Xinying Cui,Yifan Sun,Zheng Wei,Jiachen Yuan,Jiazhen Hu,Ning Xin,Md Maruf Hasan

Main category: cs.CL

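TL;DR: This paper proposes HMS-BERT, a hybrid multi-task self-training framework built on multilingual BERT for multilingual, multi-label cyberbullying detection; it fuses contextual representations with handcrafted features and uses confidence-driven pseudo-label self-training to mitigate the scarcity of labeled data in low-resource languages.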

Details Motivation: Existing methods are limited by monolingual assumptions or single-task formulations and struggle with the intertwined multilingual, multi-label cyberbullying found in real-world settings. Method: Builds HMS-BERT on a multilingual BERT backbone, fusing contextual representations with handcrafted linguistic features, jointly optimizing a fine-grained multi-label abuse classification task and a three-class main classification task, and adding an iterative confidence-based self-training strategy for cross-lingual knowledge transfer. Result: On four public datasets, HMS-BERT reaches a macro F1 of up to 0.9847 on the multi-label task and an accuracy of 0.6775 on the main classification task; ablation studies confirm the contribution of each component. Conclusion: HMS-BERT improves multilingual, multi-label cyberbullying detection and shows strong generalization and practicality, particularly in low-resource language settings. Abstract: Cyberbullying on social media is inherently multilingual and multi-faceted, where abusive behaviors often overlap across multiple categories. Existing methods are commonly limited by monolingual assumptions or single-task formulations, which restrict their effectiveness in realistic multilingual and multi-label scenarios. In this paper, we propose HMS-BERT, a hybrid multi-task self-training framework for multilingual and multi-label cyberbullying detection. Built upon a pretrained multilingual BERT backbone, HMS-BERT integrates contextual representations with handcrafted linguistic features and jointly optimizes a fine-grained multi-label abuse classification task and a three-class main classification task. To address labeled data scarcity in low-resource languages, an iterative self-training strategy with confidence-based pseudo-labeling is introduced to facilitate cross-lingual knowledge transfer. Experiments on four public datasets demonstrate that HMS-BERT achieves strong performance, attaining a macro F1-score of up to 0.9847 on the multi-label task and an accuracy of 0.6775 on the main classification task. Ablation studies further verify the effectiveness of the proposed components.
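
The confidence-based pseudo-labeling step mentioned in the abstract can be pictured with a minimal sketch. This is an illustration of the general technique, not the authors' implementation: the function name `select_pseudo_labels`, the 0.9 threshold, and the three-class probability vectors are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def select_pseudo_labels(
    probs: Sequence[Sequence[float]], threshold: float = 0.9
) -> List[Tuple[int, int]]:
    """Return (example_index, predicted_label) pairs for confident predictions.

    `probs` holds one class-probability vector per unlabeled example.
    The threshold value is a hypothetical choice, not taken from the paper.
    """
    selected = []
    for i, p in enumerate(probs):
        # Predicted label is the arg-max class.
        label = max(range(len(p)), key=p.__getitem__)
        # Keep the example only if the model is confident enough.
        if p[label] >= threshold:
            selected.append((i, label))
    return selected


# Example: three unlabeled posts scored by a three-class classifier.
unlabeled_probs = [
    [0.95, 0.03, 0.02],  # confident, kept
    [0.40, 0.35, 0.25],  # uncertain, skipped
    [0.05, 0.05, 0.90],  # confident, kept
]
pseudo = select_pseudo_labels(unlabeled_probs)
```

In an iterative self-training loop, the selected pairs would be added to the training set and the model retrained before scoring the remaining unlabeled data again.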

[34] DS$^2$-Instruct: Domain-Specific Data Synthesis for Large Language Models Instruction Tuning

Ruiyao Xu,Noelle I. Samia,Han Liu

Main category: cs.CL

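TL;DR: This paper proposes DS²-Instruct, a zero-shot framework for generating domain-specific instruction datasets; it generates task-informed keywords, builds diverse instructions using Bloom's Taxonomy, and applies self-consistency validation for quality control, and is evaluated on seven domains including mathematics and finance.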

Details Motivation: Existing data synthesis methods struggle to capture domain-specific terminology and reasoning patterns, and human annotation of high-quality instruction data is expensive. Method: The DS²-Instruct framework has three steps: (1) generate task-informed keywords to cover the domain; (2) pair the keywords with different cognitive levels from Bloom's Taxonomy to create diverse instructions; (3) apply self-consistency validation to ensure data quality. Result: Datasets are generated for seven challenging domains (such as mathematics, finance, and logical reasoning); models fine-tuned on this data substantially outperform those trained with existing data generation methods. Conclusion: DS²-Instruct is a scalable framework that generates high-quality domain-specific instruction data without human supervision and improves the performance of domain-adapted LLMs. Abstract: Adapting Large Language Models (LLMs) to specialized domains requires high-quality instruction tuning datasets, which are expensive to create through human annotation. Existing data synthesis methods focus on general-purpose tasks and fail to capture domain-specific terminology and reasoning patterns. To address this, we introduce DS$^2$-Instruct, a zero-shot framework that generates domain-specific instruction datasets without human supervision. Our approach first generates task-informed keywords to ensure comprehensive domain coverage. It then creates diverse instructions by pairing these keywords with different cognitive levels from Bloom's Taxonomy. Finally, it uses self-consistency validation to ensure data quality. We apply this framework to generate datasets across seven challenging domains, such as mathematics, finance, and logical reasoning. Comprehensive evaluation demonstrates that models fine-tuned on our generated data achieve substantial improvements over existing data generation methods.
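
As a rough illustration of the self-consistency validation step, the sketch below keeps a generated instruction only when several independently sampled answers agree. The abstract does not specify the procedure, so the majority-vote rule, the function name `self_consistent_answer`, and the 0.6 agreement threshold are assumptions.

```python
from collections import Counter
from typing import List, Optional


def self_consistent_answer(
    answers: List[str], min_agreement: float = 0.6
) -> Optional[str]:
    """Return the majority answer if enough samples agree, else None.

    `answers` are final answers from repeated sampling on one instruction.
    Both the voting rule and the threshold are hypothetical choices.
    """
    if not answers:
        return None
    answer, count = Counter(answers).most_common(1)[0]
    # Accept only when the majority share reaches the agreement threshold.
    return answer if count / len(answers) >= min_agreement else None


# Example: five sampled answers to one synthetic math instruction.
kept = self_consistent_answer(["42", "42", "41", "42", "40"])
dropped = self_consistent_answer(["a", "b", "c"])
```

An instruction whose samples return `None` would be discarded from the synthetic dataset under this scheme.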

[35] Long-form RewardBench: Evaluating Reward Models for Long-form Generation

Hui Huang,Yancheng He,Wei Liu,Muyun Yang,Jiaheng Liu,Kehai Chen,Bing Xu,Conghui Zhu,Hailong Cao,Tiejun Zhao

Main category: cs.CL

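TL;DR: This paper introduces Long-form RewardBench, the first benchmark for evaluating reward models on long-form generation, covering five subtasks (QA, RAG, Chat, Writing, and Reasoning). Using multi-stage data collection and experiments on 20+ mainstream models, it shows that current models fall short at long-form reward modeling, finds that performance correlates with error position and response length, and reports that classifiers generalize better than generative models.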

Details Motivation: Existing reward model benchmarks do not cover long-form generation, even though this capability is critical in real-world applications, so a dedicated testbed is needed. Method: Builds the Long-form RewardBench benchmark with five subtasks; collects instruction and preference data through a multi-stage process; systematically evaluates more than 20 mainstream reward models (both classifiers and generative models); designs a novel Long-form Needle-in-a-Haystack Test to analyze the effects of error position and response length. Result: Current mainstream reward models perform poorly on long-form reward modeling; reward modeling performance correlates with the position of the error within a response and with overall response length; classifiers generalize better than generative models. Conclusion: Long-form RewardBench is the first comprehensive benchmark designed specifically for long-form reward modeling and can be used to measure and drive progress in this area. Abstract: The widespread adoption of reinforcement learning-based alignment highlights the growing importance of reward models. Various benchmarks have been built to evaluate reward models in various domains and scenarios. However, a significant gap remains in assessing reward models for long-form generation, despite its critical role in real-world applications. To bridge this, we introduce Long-form RewardBench, the first reward modeling testbed specifically designed for long-form generation. Our benchmark encompasses five key subtasks: QA, RAG, Chat, Writing, and Reasoning. We collected instruction and preference data through a meticulously designed multi-stage data collection process, and conducted extensive experiments on 20+ mainstream reward models, including both classifiers and generative models. Our findings reveal that current models still lack long-form reward modeling capabilities. Furthermore, we designed a novel Long-form Needle-in-a-Haystack Test, which revealed a correlation between reward modeling performance and the error's position within a response, as well as the overall response length, with distinct characteristics observed between classification and generative models. Finally, we demonstrate that classifiers exhibit better generalizability compared to generative models trained on the same data. As the first benchmark for long-form reward modeling, this work aims to offer a robust platform for visualizing progress in this crucial area.

[36] Is Human Annotation Necessary? Iterative MBR Distillation for Error Span Detection in Machine Translation

Boxuan Lyu,Haiyue Song,Zhi Qu

Main category: cs.CL

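TL;DR: This paper proposes Iterative MBR Distillation, a framework based on Minimum Bayes Risk (MBR) decoding for error span detection (ESD) without human annotation, which outperforms supervised baselines on several metrics.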

Details Motivation: Human annotation of translation errors is expensive and suffers from inter-annotator inconsistency, so ESD methods that need no human annotation are required. Method: Proposes Iterative MBR Distillation, a self-evolution framework based on Minimum Bayes Risk (MBR) decoding that uses an off-the-shelf LLM to generate pseudo-labels in place of human annotations. Result: On the WMT Metrics Shared Task datasets, models trained only on pseudo-labels outperform both the unadapted base model and supervised baselines trained on human annotations at the system and span levels, while remaining competitive at the sentence level. Conclusion: The method removes the dependence on human annotation and balances high performance with low cost on the ESD task. Abstract: Error Span Detection (ESD) is a crucial subtask in Machine Translation (MT) evaluation, aiming to identify the location and severity of translation errors. While fine-tuning models on human-annotated data improves ESD performance, acquiring such data is expensive and prone to inconsistencies among annotators. To address this, we propose a novel self-evolution framework based on Minimum Bayes Risk (MBR) decoding, named Iterative MBR Distillation for ESD, which eliminates the reliance on human annotations by leveraging an off-the-shelf LLM to generate pseudo-labels. Extensive experiments on the WMT Metrics Shared Task datasets demonstrate that models trained solely on these self-generated pseudo-labels outperform both the unadapted base model and supervised baselines trained on human annotations at the system and span levels, while maintaining competitive sentence-level performance.
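
To make the MBR idea concrete, here is a minimal sketch of MBR selection over candidate error-span annotations: the chosen candidate is the one with the highest average utility against the other sampled candidates. The abstract does not state which utility function is used, so the character-level span F1 below, along with the names `span_f1` and `mbr_select`, is an assumption for illustration only.

```python
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def span_f1(a: List[Span], b: List[Span]) -> float:
    """Character-level F1 between two span sets (hypothetical utility)."""
    chars_a = {i for s, e in a for i in range(s, e)}
    chars_b = {i for s, e in b for i in range(s, e)}
    if not chars_a and not chars_b:
        return 1.0  # both annotations say "no errors"
    overlap = len(chars_a & chars_b)
    if overlap == 0:
        return 0.0
    precision = overlap / len(chars_a)
    recall = overlap / len(chars_b)
    return 2 * precision * recall / (precision + recall)


def mbr_select(
    candidates: List[List[Span]],
    utility: Callable[[List[Span], List[Span]], float] = span_f1,
) -> List[Span]:
    """Pick the candidate with the highest mean utility against the others."""
    if len(candidates) == 1:
        return candidates[0]
    best, best_score = candidates[0], float("-inf")
    for i, hyp in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        score = sum(utility(hyp, ref) for ref in others) / len(others)
        if score > best_score:
            best, best_score = hyp, score
    return best


# Example: four sampled annotations for one translation.
samples = [[(0, 5)], [(0, 5)], [(0, 4)], [(10, 12)]]
pseudo_label = mbr_select(samples)
```

In a distillation loop, the selected annotation would serve as the pseudo-label for training the next model iteration.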

[37] Interpretable Semantic Gradients in SSD: A PCA Sweep Approach and a Case Study on AI Discourse

Hubert Plisiecki,Maria Leniarska,Jan Piotrowski,Marcin Zajenkowski

Main category: cs.CL

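TL;DR: This paper proposes a PCA sweep procedure for systematically choosing the number of principal components K in Supervised Semantic Differential (SSD) analysis, balancing representation capacity, gradient interpretability, and stability. In an empirical study of AI attitudes and narcissism traits (Admiration/Rivalry), the method identifies a stable, interpretable Admiration-related semantic gradient, finds no robust alignment for Rivalry, and shows that an arbitrary high-dimensional PCA choice degrades cluster structure.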

Details Motivation: SSD currently has no systematic criterion for choosing the number of PCA components K, which leaves too many researcher degrees of freedom and harms reproducibility and interpretability. Method: Proposes a PCA sweep procedure that jointly considers three criteria: embedding representation capacity, semantic gradient interpretability (assessed through clustering and text retrieval), and stability across nearby values of K; tests it on short AI-themed posts collected on Prolific together with Admiration/Rivalry scale data, and compares it with a high-dimensional PCA counterfactual. Result: For Admiration, a stable, interpretable semantic gradient emerges, with optimistic, collaborative framings of AI at one pole and distrustful, derisive discourse at the other; no robust alignment is found for Rivalry; the high-dimensional PCA counterfactual yields diffuse, weakly structured clusters. Conclusion: The PCA sweep constrains researcher degrees of freedom in SSD while preserving its interpretive power, improving the transparency and psychological meaningfulness of the analysis. Abstract: Supervised Semantic Differential (SSD) is a mixed quantitative-interpretive method that models how text meaning varies with continuous individual-difference variables by estimating a semantic gradient in an embedding space and interpreting its poles through clustering and text retrieval. SSD applies PCA before regression, but currently no systematic method exists for choosing the number of retained components, introducing avoidable researcher degrees of freedom in the analysis pipeline. We propose a PCA sweep procedure that treats dimensionality selection as a joint criterion over representation capacity, gradient interpretability, and stability across nearby values of K. We illustrate the method on a corpus of short posts about artificial intelligence written by Prolific participants who also completed Admiration and Rivalry narcissism scales. The sweep yields a stable, interpretable Admiration-related gradient contrasting optimistic, collaborative framings of AI with distrustful and derisive discourse, while no robust alignment emerges for Rivalry. We also show that a counterfactual using a high-PCA dimension solution heuristic produces diffuse, weakly structured clusters instead, reinforcing the value of the sweep-based choice of K. The case study shows how the PCA sweep constrains researcher degrees of freedom while preserving SSD's interpretive aims, supporting transparent and psychologically meaningful analyses of connotative meaning.

[38] Mending the Holes: Mitigating Reward Hacking in Reinforcement Learning for Multilingual Translation

Yifeng Liu,Siqi Ouyang,Yatish Hosmane Revanasiddappa,Lei Li

Main category: cs.CL

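TL;DR: This paper proposes WALAR, a reinforcement learning method that uses monolingual data to improve LLM translation for low-resource languages while preserving translation quality for high-resource languages.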

Details Motivation: Existing post-training methods depend on high-quality parallel data, which low-resource languages often lack; in addition, source-based multilingual quality estimation models have failure modes ("holes") that reinforcement learning amplifies when the models are used directly as rewards. Method: Proposes WALAR, a reinforcement training method that combines word alignment and language alignment techniques to mitigate the failure modes of quality estimation models, and continually trains on monolingual text only. Result: On 1400 language directions of the Flores-101 dataset, the WALAR-trained model outperforms LLaMAX, one of the strongest open-source multilingual LLMs, by a large margin. Conclusion: WALAR improves LLM performance on low-resource translation without harming high-resource performance, offering a new approach to low-resource machine translation. Abstract: Large Language Models (LLMs) have demonstrated remarkable capability in machine translation on high-resource language pairs, yet their performance on low-resource translation still lags behind. Existing post-training methods rely heavily on high-quality parallel data, which are often scarce or unavailable for low-resource languages. In this paper, we introduce WALAR, a reinforcement training method using only monolingual text to elevate LLMs' translation capabilities on massive low-resource languages while retaining their performance on high-resource languages. Our key insight is based on the observation of failure modes (or "holes") in existing source-based multilingual quality estimation (QE) models. Reinforcement learning (RL) using these QE models tends to amplify such holes, resulting in poorer multilingual LLMs. We develop techniques including word alignment and language alignment to mitigate such holes in WALAR's reward for RL training. We continually trained an LLM supporting translation of 101 languages using WALAR. The experiments show that our new model outperforms LLaMAX, one of the strongest open-source multilingual LLMs, by a large margin on 1400 language directions on the Flores-101 dataset.

[39] ESG-Bench: Benchmarking Long-Context ESG Reports for Hallucination Mitigation

Siqi Sun,Ben Peng Wu,Mali Jin,Peizhen Bai,Hanpei Zhang,Xingyi Song

Main category: cs.CL

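TL;DR: This paper introduces ESG-Bench, a benchmark dataset for evaluating and improving LLMs on ESG report understanding and hallucination mitigation, and shows that task-specific Chain-of-Thought (CoT) prompting and fine-tuning substantially reduce hallucinations.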

Details Motivation: ESG reports are long and complex, which makes reliable automated analysis difficult; because ESG analysis involves compliance and social responsibility, factual accuracy is critical, and verifiable, hallucination-resistant LLM methods are needed. Method: Builds ESG-Bench, a QA benchmark for ESG report understanding with hallucination annotations; designs task-specific Chain-of-Thought (CoT) prompting strategies and fine-tunes multiple state-of-the-art LLMs on CoT-annotated rationales. Result: CoT prompting and CoT fine-tuning substantially outperform standard prompting and direct fine-tuning, sharply reducing hallucinations on ESG-Bench, and the gains transfer to other general QA benchmarks. Conclusion: ESG-Bench provides a verifiable evaluation framework for responsible ESG analysis, and CoT-based methods offer an effective path toward trustworthy LLM use in sensitive, compliance-critical settings. Abstract: As corporate responsibility increasingly incorporates environmental, social, and governance (ESG) criteria, ESG reporting is becoming a legal requirement in many regions and a key channel for documenting sustainability practices and assessing firms' long-term and ethical performance. However, the length and complexity of ESG disclosures make them difficult to interpret and to analyze automatically and reliably. To support scalable and trustworthy analysis, this paper introduces ESG-Bench, a benchmark dataset for ESG report understanding and hallucination mitigation in large language models (LLMs). ESG-Bench contains human-annotated question-answer (QA) pairs grounded in real-world ESG report contexts, with fine-grained labels indicating whether model outputs are factually supported or hallucinated. Framing ESG report analysis as a QA task with verifiability constraints enables systematic evaluation of LLMs' ability to extract and reason over ESG content and provides a new use case: mitigating hallucinations in socially sensitive, compliance-critical settings. We design task-specific Chain-of-Thought (CoT) prompting strategies and fine-tune multiple state-of-the-art LLMs on ESG-Bench using CoT-annotated rationales. Our experiments show that these CoT-based methods substantially outperform standard prompting and direct fine-tuning in reducing hallucinations, and that the gains transfer to existing QA benchmarks beyond the ESG domain.

[40] Neuron-Aware Data Selection In Instruction Tuning For Large Language Models

Xin Chen,Junchao Wu,Shu Yang,Runzhe Zhan,Zeyu Wu,Min Yang,Shujian Huang,Lidia S. Chao,Derek F. Wong

Main category: cs.CL

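TL;DR: This paper proposes NAIT, a framework that efficiently selects a high-quality instruction tuning (IT) subset by analyzing the similarity of neuron activation patterns between IT data and target domain capabilities, significantly improving LLM performance.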

Details Motivation: Prior studies show that excessive instruction tuning data can hurt LLM performance, whereas a small, carefully selected high-quality subset can significantly improve it; identifying the most effective subset of an IT dataset for developing specific or general capabilities is therefore a key challenge. Method: Proposes the NAIT framework: it first extracts neuron activation patterns from in-domain data for the target capability to build reusable, transferable activation features, then scores and selects candidate samples by their similarity to the expected activation features of the target capability. Result: Training on the 10% Alpaca-GPT4 IT subset selected by NAIT consistently outperforms methods that rely on external advanced models or uncertainty-based features; IT data with logical reasoning and programmatic features transfers strongly across tasks, and a stable core subset improves performance across diverse tasks. Conclusion: Neuron activation features transfer across capabilities; NAIT is an efficient IT data selection method that needs no additional large model and offers a new paradigm for instruction tuning. Abstract: Instruction Tuning (IT) has been proven to be an effective approach to unlock the powerful capabilities of large language models (LLMs). Recent studies indicate that excessive IT data can degrade LLM performance, while carefully selecting a small subset of high-quality IT data can significantly enhance their capabilities. Therefore, identifying the most efficient subset of data from the IT dataset to effectively develop either specific or general abilities in LLMs has become a critical challenge. To address this, we propose a novel and efficient framework called NAIT. NAIT evaluates the impact of IT data on LLM performance by analyzing the similarity of neuron activation patterns between the IT dataset and the target domain capability. Specifically, NAIT captures neuron activation patterns from in-domain datasets of target domain capabilities to construct reusable and transferable neuron activation features. It then evaluates and selects optimal samples based on the similarity between candidate samples and the expected activation features of the target capabilities. Experimental results show that training on the 10% Alpaca-GPT4 IT data subset selected by NAIT consistently outperforms methods that rely on external advanced models or uncertainty-based features across various tasks. Our findings also reveal the transferability of neuron activation features across different capabilities of LLMs. 
In particular, IT data with more logical reasoning and programmatic features possesses strong general transferability, enabling models to develop stronger capabilities across multiple tasks, while a stable core subset of data is sufficient to consistently activate fundamental model capabilities and universally improve performance across diverse tasks.
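
The similarity-based selection described in the abstract can be sketched as ranking candidate samples by how close their activation vectors are to a target feature vector and keeping the top fraction. The abstract does not give the exact similarity measure or how activations are extracted, so cosine similarity, the toy two-dimensional vectors, and the names `cosine` and `select_top_fraction` are assumptions.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_top_fraction(
    candidates: Dict[str, Sequence[float]],
    target: Sequence[float],
    fraction: float = 0.1,
) -> List[str]:
    """Return ids of the candidates most similar to the target feature.

    `candidates` maps a sample id to its (assumed precomputed) activation
    vector; `fraction` mirrors the 10% budget quoted in the abstract.
    """
    ranked = sorted(
        candidates.items(),
        key=lambda item: cosine(item[1], target),
        reverse=True,
    )
    keep = max(1, int(len(ranked) * fraction))
    return [sample_id for sample_id, _ in ranked[:keep]]


# Example with toy activation vectors (hypothetical values).
pool = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
chosen = select_top_fraction(pool, target=[1.0, 0.1], fraction=0.34)
```

In practice the activation vectors would come from a forward pass of the model being tuned; that extraction step is outside the scope of this sketch.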

cs.CV [Back]

[41] VQQA: An Agentic Approach for Video Evaluation and Quality Improvement

Yiwen Song,Tomas Pfister,Yale Song

Main category: cs.CV

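TL;DR: This paper proposes VQQA, a framework that uses multi-agent collaboration and visual question answering to obtain interpretable semantic feedback from a vision-language model (VLM) without access to model internals, enabling efficient black-box prompt optimization that significantly improves video generation quality.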

Details Motivation: Existing video generation models struggle to align with complex user intent, and test-time optimization methods are typically computationally expensive or require white-box access to model internals. Method: Proposes VQQA (Video Quality Question Answering), a unified multi-agent framework supporting diverse input modalities; it dynamically generates visual questions and uses the VLM's critiques as semantic gradients in place of traditional evaluation metrics, enabling closed-loop black-box prompt optimization through a natural language interface. Result: Achieves absolute improvements of +11.57% on T2V-CompBench and +8.43% on VBench2 over vanilla generation, outperforming state-of-the-art stochastic search and prompt optimization methods; it isolates and resolves visual artifacts within a few refinement steps. Conclusion: VQQA is an efficient, general, and interpretable black-box optimization framework for video generation that applies to both text-to-video and image-to-video tasks, needs no access to model internals, and improves generation quality and controllability. Abstract: Despite rapid advancements in video generation models, aligning their outputs with complex user intent remains challenging. Existing test-time optimization methods are typically either computationally expensive or require white-box access to model internals. To address this, we present VQQA (Video Quality Question Answering), a unified, multi-agent framework generalizable across diverse input modalities and video generation tasks. By dynamically generating visual questions and using the resulting Vision-Language Model (VLM) critiques as semantic gradients, VQQA replaces traditional, passive evaluation metrics with human-interpretable, actionable feedback. This enables a highly efficient, closed-loop prompt optimization process via a black-box natural language interface. Extensive experiments demonstrate that VQQA effectively isolates and resolves visual artifacts, substantially improving generation quality in just a few refinement steps. Applicable to both text-to-video (T2V) and image-to-video (I2V) tasks, our method achieves absolute improvements of +11.57% on T2V-CompBench and +8.43% on VBench2 over vanilla generation, significantly outperforming state-of-the-art stochastic search and prompt optimization techniques.

[42] Alternating Gradient Flow Utility: A Unified Metric for Structural Pruning and Dynamic Routing in Deep Networks

Tianhao Qian,Zhuoxuan Li,Jinde Cao,Xinli Shi,Hanjie Liu,Leszek Rutkowski

Main category: cs.CV

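TL;DR: This paper proposes a decoupled kinetic paradigm, inspired by Alternating Gradient Flow (AGF), for structural pruning of deep vision networks. It uses an absolute feature-space Taylor expansion to capture the network's structural "kinetic utility", overcomes the magnitude bias of traditional pruning metrics, avoids structural collapse at high sparsity, identifies a Sparsity Bottleneck in ViTs, and designs a hybrid routing framework combining offline structural search with online execution, validated on ImageNet for efficiency and accuracy.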

Details Motivation: Traditional pruning metrics (such as weight magnitude and activation awareness) suffer from a magnitude bias in structural pruning and fail to preserve critical functional pathways, especially at high sparsity and in ViTs, which lack strict structural priors. Method: Proposes a decoupled kinetic paradigm based on Alternating Gradient Flow (AGF) that quantifies structural "kinetic utility" with an absolute feature-space Taylor expansion; analyzes the topological phase transition and the Sparsity Bottleneck; designs a hybrid framework that combines AGF-guided offline search with online routing via zero-cost physical priors. Result: Under 75% compression on ImageNet-1K, AGF avoids the structural collapse in which traditional metrics fall below random sampling; in dynamic inference on ImageNet-100, usage of the heavy expert drops by about 50%, with an estimated overall cost of 0.92×, without sacrificing full-model accuracy. Conclusion: The AGF paradigm models structural importance more accurately, reveals topological implicit regularization at high sparsity and gradient signal compression in ViTs, and the hybrid routing framework achieves Pareto-optimal efficiency for dynamic inference. Abstract: Efficient deep learning traditionally relies on static heuristics like weight magnitude or activation awareness (e.g., Wanda, RIA). While successful in unstructured settings, we observe a critical limitation when applying these metrics to the structural pruning of deep vision networks. These contemporary metrics suffer from a magnitude bias, failing to preserve critical functional pathways. To overcome this, we propose a decoupled kinetic paradigm inspired by Alternating Gradient Flow (AGF), utilizing an absolute feature-space Taylor expansion to accurately capture the network's structural "kinetic utility". First, we uncover a topological phase transition at extreme sparsity, where AGF successfully preserves baseline functionality and exhibits topological implicit regularization, avoiding the collapse seen in models trained from scratch. Second, transitioning to architectures without strict structural priors, we reveal a phenomenon of Sparsity Bottleneck in Vision Transformers (ViTs). Through a gradient-magnitude decoupling analysis, we discover that dynamic signals suffer from signal compression in converged models, rendering them suboptimal for real-time routing. Finally, driven by these empirical constraints, we design a hybrid routing framework that decouples AGF-guided offline structural search from online execution via zero-cost physical priors. We validate our paradigm on large-scale benchmarks: under a 75% compression stress test on ImageNet-1K, AGF effectively avoids the structural collapse where traditional metrics aggressively fall below random sampling. 
Furthermore, when systematically deployed for dynamic inference on ImageNet-100, our hybrid approach achieves Pareto-optimal efficiency. It reduces the usage of the heavy expert by approximately 50% (achieving an estimated overall cost of 0.92$\times$) without sacrificing the full-model accuracy.

[43] Human Knowledge Integrated Multi-modal Learning for Single Source Domain Generalization

Ayan Banerjee,Kuntal Thakur,Sandeep Gupta

Main category: cs.CV

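TL;DR: This paper proposes GenEval, a method that combines multimodal vision-language models with human knowledge to improve the generalization of cross-domain image classification (e.g., diabetic retinopathy grading and seizure onset zone detection), and introduces domain conformal bounds (DCB), a theoretical framework for assessing unknown causal differences between domains.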

Details Motivation: Cross-domain image classification is challenging in medical tasks, especially when domains differ in unknown causal factors, and there is no established way to assess such differences objectively without metadata. Method: Introduces the domain conformal bounds (DCB) theoretical framework to evaluate causal differences between domains; builds GenEval, which combines a foundational multimodal VLM (e.g., MedGemma-4B) with human knowledge via LoRA adaptation to achieve single-source domain generalization (SDG). Result: Across eight DR and two SOZ datasets, GenEval reaches average accuracies of 69.2% (DR) and 81% (SOZ), outperforming the strongest baselines by 9.4% and 1.8%, respectively. Conclusion: GenEval bridges cross-domain causal gaps and improves single-source domain generalization, offering a new approach to cross-domain generalization for medical imaging when metadata is unavailable. Abstract: Generalizing image classification across domains remains challenging in critical tasks such as fundus image-based diabetic retinopathy (DR) grading and resting-state fMRI seizure onset zone (SOZ) detection. When domains differ in unknown causal factors, achieving cross-domain generalization is difficult, and there is no established methodology to objectively assess such differences without direct metadata or protocol-level information from data collectors, which is typically inaccessible. We first introduce domain conformal bounds (DCB), a theoretical framework to evaluate whether domains diverge in unknown causal factors. Building on this, we propose GenEval, a multimodal Vision Language Model (VLM) approach that combines foundational models (e.g., MedGemma-4B) with human knowledge via Low-Rank Adaptation (LoRA) to bridge causal gaps and enhance single-source domain generalization (SDG). Across eight DR and two SOZ datasets, GenEval achieves superior SDG performance, with average accuracy of 69.2% (DR) and 81% (SOZ), outperforming the strongest baselines by 9.4% and 1.8%, respectively.

[44] SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs

Mohamad Alansari,Naufal Suryanto,Divya Velayudhan,Sajid Javed,Naoufel Werghi,Muzammal Naseer

Main category: cs.CV

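TL;DR: This paper proposes SPARROW, a video multimodal large language model that supports pixel-level grounding and temporally stable tracking; using Target-Specific Tracked Features (TSF) and a dual-prompt decoding mechanism ([BOX]/[SEG]), it improves spatial precision and temporal consistency across multiple benchmarks.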

Details Motivation: Existing video MLLMs rely on a static segmentation token [SEG] that lacks temporal context, causing spatial drift, identity switches, and unstable initialization. Method: Proposes SPARROW with two core components: (i) Target-Specific Tracked Features (TSF), which inject temporally aligned referent cues during training; (ii) a dual-prompt design that jointly decodes [BOX] and [SEG] tokens to fuse geometric priors with semantic grounding; it also curates a referential video dataset of 30,646 videos and uses a class-agnostic SAM2-based proposer for end-to-end inference. Result: Delivers consistent gains on six benchmarks covering RVOS, visual grounding, and GCG, with improvements of up to +8.9 J&F, +5 mIoU, and +5.4 CLAIR. Conclusion: SPARROW substantially improves the spatial precision, temporal stability, and cross-frame consistency of referred objects in video, advancing pixel-grounded video understanding. Abstract: Multimodal large language models (MLLMs) have advanced from image-level reasoning to pixel-level grounding, but extending these capabilities to videos remains challenging as models must achieve spatial precision and temporally consistent reference tracking. Existing video MLLMs often rely on a static segmentation token ([SEG]) for frame-wise grounding, which provides semantics but lacks temporal context, causing spatial drift, identity switches, and unstable initialization when objects move or reappear. We introduce SPARROW, a pixel-grounded video MLLM that unifies spatial accuracy and temporal stability through two key components: (i) Target-Specific Tracked Features (TSF), which inject temporally aligned referent cues during training, and (ii) a dual-prompt design that decodes box ([BOX]) and segmentation ([SEG]) tokens to fuse geometric priors with semantic grounding. SPARROW is supported by a curated referential video dataset of 30,646 videos and 45,231 Q&A pairs and operates end-to-end without external detectors via a class-agnostic SAM2-based proposer. Integrated into three recent open-source video MLLMs (UniPixel, GLUS, and VideoGLaMM), SPARROW delivers consistent gains across six benchmarks, improving up to +8.9 J&F on RVOS, +5 mIoU on visual grounding, and +5.4 CLAIR on GCG. These results demonstrate that SPARROW substantially improves referential stability, spatial precision, and temporal coherence in pixel-grounded video understanding. Project page: https://risys-lab.github.io/SPARROW

[45] Deployment-Oriented Session-wise Meta-Calibration for Landmark-Based Webcam Gaze Tracking

Chenkai Zhang

Main category: cs.CV

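TL;DR: This paper proposes EMC-Gaze, a lightweight point-of-regard estimation method that relies only on facial landmarks; it combines an E(3)-equivariant graph encoder, binocular emphasis, auxiliary 3D gaze-direction supervision, and a differentiable ridge regression calibrator to deliver robust, real-time, browser-based gaze tracking with a low calibration burden.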

Details Motivation: Practical webcam gaze tracking suffers from a heavy calibration burden, poor robustness to head motion and session drift, high runtime cost, and weak browser compatibility; the goal is a deployment-friendly balance of performance and cost rather than the accuracy of large image-backbone models. Method: Proposes Equivariant Meta-Calibrated Gaze (EMC-Gaze): an E(3)-equivariant landmark-graph encoder extracts geometric features; local eye geometry and binocular emphasis are added; auxiliary 3D gaze-direction supervision is used; a closed-form ridge calibrator is differentiated through episodic meta-training; a two-view canonicalization consistency loss reduces pose leakage; only facial landmarks are used as input throughout. Result: In an interactive evaluation over 33 sessions, RMSE after 9-point calibration is 5.79±1.81° versus 6.68±2.34° for Elastic Net; the gain is larger on still-head queries (2.92±0.75° vs. 4.45±0.30°); the advantage is retained across subject holdouts, and on MPIIFaceGaze the model ties Elastic Net at 1-shot and outperforms it from 3-shot onward; the model has about 944K parameters, is 4.76 MB in ONNX, and runs at about 12.6 ms per sample in Chromium 145, supporting real-time browser prediction. Conclusion: EMC-Gaze establishes a calibration-friendly, deployment-oriented operating point that balances accuracy, footprint, robustness, and browser compatibility; it does not claim to surpass heavier appearance-based models across the board. Abstract: Practical webcam gaze tracking is constrained not only by error, but also by calibration burden, robustness to head motion and session drift, runtime footprint, and browser use. We therefore target a deployment-oriented operating point rather than the image large-backbone regime. We cast landmark-based point-of-regard estimation as session-wise adaptation: a shared geometric encoder produces embeddings that can be aligned to a new session from a small calibration set. We present Equivariant Meta-Calibrated Gaze (EMC-Gaze), a lightweight landmark-only method combining an E(3)-equivariant landmark-graph encoder, local eye geometry, binocular emphasis, auxiliary 3D gaze-direction supervision, and a closed-form ridge calibrator differentiated through episodic meta-training. To reduce pose leakage, we use a two-view canonicalization consistency loss. The deployed predictor uses only facial landmarks and fits a per-session ridge head from brief calibration. In a fixation-style interactive evaluation over 33 sessions at 100 cm, EMC-Gaze achieves 5.79 +/- 1.81 deg RMSE after 9-point calibration versus 6.68 +/- 2.34 deg for Elastic Net; the gain is larger on still-head queries (2.92 +/- 0.75 deg vs. 4.45 +/- 0.30 deg). Across three subject holdouts of 10 subjects each, EMC-Gaze retains an advantage (5.66 +/- 0.19 deg vs. 6.49 +/- 0.33 deg). 
On MPIIFaceGaze with short per-session calibration, the eye-focused model reaches 8.82 +/- 1.21 deg at 16-shot calibration, ties Elastic Net at 1-shot, and outperforms it from 3-shot onward. The exported eye-focused encoder has 944,423 parameters, is 4.76 MB in ONNX, and supports calibrated browser prediction in 12.58/12.58/12.90 ms per sample (mean/median/p90) in Chromium 145 with ONNX Runtime Web. These results position EMC-Gaze as a calibration-friendly operating point rather than a universal state-of-the-art claim against heavier appearance-based systems.

[46] ABRA: Teleporting Fine-Tuned Knowledge Across Domains for Open-Vocabulary Object Detection

Mattia Bernardi,Chiara Cappellino,Matteo Mosconi,Enver Sangineto,Angelo Porrello,Simone Calderara

Main category: cs.CV

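TL;DR: This paper proposes ABRA, which frames adaptation as a geometric transport problem in the weight space of a pretrained detector and transfers class-specific detection knowledge from a labeled source domain to a target domain (e.g., nighttime or foggy scenes) where no training images containing those classes are available.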

Details Motivation: Open-vocabulary object detectors such as Grounding DINO degrade significantly under domain shift, and many practically important domains (e.g., nighttime or foggy scenes) lack large annotated datasets, which prevents direct fine-tuning. Method: ABRA formulates cross-domain knowledge transfer as a geometric transport problem in weight space, aligning source and target domain experts to transport class-specific detection knowledge. Result: Across several challenging domain shifts, ABRA successfully "teleports" class-level specialization under multiple adverse conditions. Conclusion: ABRA transfers class-specific detection knowledge without target-domain training images containing the classes of interest, offering a new approach to deploying open-vocabulary detection in low-resource adverse environments. Abstract: Although recent Open-Vocabulary Object Detection architectures, such as Grounding DINO, demonstrate strong zero-shot capabilities, their performance degrades significantly under domain shifts. Moreover, many domains of practical interest, such as nighttime or foggy scenes, lack large annotated datasets, preventing direct fine-tuning. In this paper, we introduce Aligned Basis Relocation for Adaptation (ABRA), a method that transfers class-specific detection knowledge from a labeled source domain to a target domain where no training images containing these classes are accessible. ABRA formulates this adaptation as a geometric transport problem in the weight space of a pretrained detector, aligning source and target domain experts to transport class-specific knowledge. Extensive experiments across challenging domain shifts demonstrate that ABRA successfully teleports class-level specialization under multiple adverse conditions. Our code will be made public upon acceptance.

[47] A Neuro-Symbolic Framework Combining Inductive and Deductive Reasoning for Autonomous Driving Planning

Hongyan Wei,Wael AbdAlmageed

Main category: cs.CV

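TL;DR: This paper proposes a neuro-symbolic trajectory planning framework that combines a large language model (LLM) with Answer Set Programming (ASP) to produce interpretable, safe, and feasible end-to-end autonomous driving decisions and plans.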

Details Motivation: Existing end-to-end autonomous driving models rely on purely data-driven inductive reasoning, which makes them black boxes that lack interpretability and safety guarantees in complex, long-tail scenarios. Method: Proposes a neuro-symbolic framework in which an LLM dynamically extracts scene rules and an ASP solver performs deterministic logical arbitration to produce discrete driving decisions; a decision-conditioned decoding mechanism maps the logical decisions to embedding vectors that jointly constrain the planning query and the initial velocity of a Kinematic Bicycle Model (KBM); KBM-generated physical baseline trajectories are combined with neural residual corrections. Result: On the nuScenes benchmark, the method outperforms the state-of-the-art baseline MomAD, reducing the L2 mean error to 0.57 m, the collision rate to 0.075%, and trajectory prediction consistency (TPC) to 0.47 m. Conclusion: The framework guarantees kinematic feasibility while improving the safety, interpretability, and generalization of autonomous driving systems, offering a new approach to neuro-symbolic driving. Abstract: Existing end-to-end autonomous driving models rely heavily on purely data-driven inductive reasoning. This "black-box" nature leads to a lack of interpretability and absolute safety guarantees in complex, long-tail scenarios. To overcome this bottleneck, we propose a novel neuro-symbolic trajectory planning framework that seamlessly integrates rigorous deductive reasoning into end-to-end neural networks. Specifically, our framework utilizes a Large Language Model (LLM) to dynamically extract scene rules and employs an Answer Set Programming (ASP) solver for deterministic logical arbitration, generating safe and traceable discrete driving decisions. To bridge the gap between discrete symbols and continuous trajectories, we introduce a decision-conditioned decoding mechanism that transforms high-level logical decisions into learnable embedding vectors, simultaneously constraining the planning query and the physical initial velocity of a differentiable Kinematic Bicycle Model (KBM). By combining KBM-generated physical baseline trajectories with neural residual corrections, our approach inherently guarantees kinematic feasibility while ensuring a high degree of transparency. On the nuScenes benchmark, our method comprehensively outperforms the state-of-the-art baseline MomAD, reducing the L2 mean error to 0.57 m, decreasing the collision rate to 0.075%, and optimizing trajectory prediction consistency (TPC) to 0.47 m.
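
The combination of a kinematic baseline with a learned residual can be illustrated with a minimal sketch. It uses the textbook rear-axle kinematic bicycle model with forward-Euler integration, which may differ from the differentiable KBM in the paper; the function names, the 2.7 m wheelbase, the 0.5 s step, and the residual values are all assumptions, and the residual would come from a neural network in practice.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def kbm_rollout(
    x: float, y: float, yaw: float, v: float,
    steer: float, accel: float,
    wheelbase: float = 2.7, dt: float = 0.5, steps: int = 6,
) -> List[Point]:
    """Roll out a kinematic bicycle model with constant controls.

    State: position (x, y), heading `yaw` in radians, speed `v` in m/s.
    Controls: steering angle `steer` in radians, acceleration in m/s^2.
    Wheelbase, dt, and steps are illustrative defaults.
    """
    trajectory = []
    for _ in range(steps):
        x += v * math.cos(yaw) * dt
        y += v * math.sin(yaw) * dt
        yaw += v / wheelbase * math.tan(steer) * dt
        v = max(0.0, v + accel * dt)  # no reversing in this sketch
        trajectory.append((x, y))
    return trajectory


def apply_residual(baseline: List[Point], residual: List[Point]) -> List[Point]:
    """Add per-waypoint corrections to the physical baseline."""
    return [(bx + rx, by + ry) for (bx, by), (rx, ry) in zip(baseline, residual)]


# Example: drive straight at 2 m/s, then nudge each waypoint slightly.
baseline = kbm_rollout(0.0, 0.0, 0.0, 2.0, steer=0.0, accel=0.0, steps=3)
corrected = apply_residual(baseline, [(0.0, 0.1)] * 3)
```

Here the initial speed `v` plays the role that the abstract assigns to the decision-conditioned initial velocity: a discrete decision such as "stop" or "proceed" would set it before the rollout.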

[48] Surg-R1: A Hierarchical Reasoning Foundation Model for Scalable and Interpretable Surgical Decision Support with Multi-Center Clinical Validation

Jian Jiang,Chenxi Lin,Yiming Gu,Zengyi Qin,Zhitao Zeng,Kun Yuan,Yonghao Long,Xiang Xia,Cheng Yuan,Yuqi Wang,Zijie Yue,Kunyi Yang,Yuting Zhang,Zhu Zhuo,Dian Qin,Xin Wang,NG Chi Fai,Brian Anthony,Daguang Xu,Guy Rosman,Ozanan Meireles,Zizhen Zhang,Nicolas Padoy,Hesheng Wang,Qi Dou,Yueming Jin,Yutong Ban

Main category: cs.CV

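TL;DR: This paper proposes Surg-R1, a vision-language model for surgical scenes; with a three-level reasoning hierarchy, a large-scale surgical chain-of-thought dataset, and a four-stage training pipeline, it improves the accuracy and interpretability of surgical scene understanding and outperforms existing general-purpose and specialized models on multiple benchmarks.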

Details Motivation: Existing surgical vision-language models lack verifiable reasoning chains, while general-purpose reasoning models lack surgical domain knowledge and fail on compositional surgical tasks. Method: Introduces a three-level reasoning hierarchy (perceptual grounding, relational understanding, contextual reasoning), builds the largest surgical chain-of-thought dataset with 320,000 reasoning pairs, and designs a four-stage training pipeline progressing from supervised fine-tuning to group relative policy optimization and iterative self-improvement. Result: On SurgBench (six public benchmarks plus six multi-center external validation datasets), Surg-R1 achieves an Arena Score of 64.9% on public benchmarks, well above Gemini 3.0 Pro (46.1%) and GPT-5.1 (37.9%), and improves on the strongest surgical baseline by 15.2 percentage points on external validation. Conclusion: Surg-R1 narrows the gap between accurate prediction and clinically interpretable reasoning in surgical AI, offering a new approach to trustworthy surgical decision support. Abstract: Surgical scene understanding demands not only accurate predictions but also interpretable reasoning that surgeons can verify against clinical expertise. However, existing surgical vision-language models generate predictions without reasoning chains, and general-purpose reasoning models fail on compositional surgical tasks without domain-specific knowledge. We present Surg-R1, a surgical Vision-Language Model that addresses this gap through hierarchical reasoning trained via a four-stage pipeline. Our approach introduces three key contributions: (1) a three-level reasoning hierarchy decomposing surgical interpretation into perceptual grounding, relational understanding, and contextual reasoning; (2) the largest surgical chain-of-thought dataset with 320,000 reasoning pairs; and (3) a four-stage training pipeline progressing from supervised fine-tuning to group relative policy optimization and iterative self-improvement. Evaluation on SurgBench, comprising six public benchmarks and six multi-center external validation datasets from five institutions, demonstrates that Surg-R1 achieves the highest Arena Score (64.9%) on public benchmarks versus Gemini 3.0 Pro (46.1%) and GPT-5.1 (37.9%), outperforming both proprietary reasoning models and specialized surgical VLMs on the majority of tasks spanning instrument localization, triplet recognition, phase recognition, action recognition, and critical view of safety assessment, with a 15.2 percentage point improvement over the strongest surgical baseline on external validation.

[49] Revisiting Model Stitching In the Foundation Model Era

Zheda Mai,Ke Zhang,Fu-En Wang,Zixiao Ken Wang,Albert Y. C. Chen,Lu Xia,Min Sun,Wei-Lun Chao,Cheng-Hao Kuo

Main category: cs.CV

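TL;DR: This paper studies whether heterogeneous vision foundation models (VFMs) can be stitched together. It proposes a systematic protocol and finds that a simple feature-matching loss at the target model's penultimate layer makes stitching reliable. It further proposes the VFM Stitch Tree (VST), which shares early layers across multiple VFMs and gives multimodal LLMs a controllable accuracy-latency trade-off.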

Details Motivation: To examine whether heterogeneous vision foundation models (e.g., CLIP, DINOv2, SigLIP) are stitchable, as a way to assess their representational compatibility, and to move stitching from a diagnostic tool toward a practical integration method. Method: A systematic stitching protocol covering stitch points, stitch layer families, training losses, and downstream tasks. It compares loss functions (feature alignment vs. end-to-end task loss) at different stitch depths and introduces the VFM Stitch Tree (VST) architecture, in which multiple VFMs share early layers. Result: (1) Conventional stitching approaches lose considerable accuracy at shallow stitch points; (2) a simple feature-matching loss at the target model's penultimate layer gives stable stitching across tasks; (3) at deep stitch points the stitched model can surpass either constituent model at very small inference overhead. VST supports a flexible accuracy-latency trade-off. Conclusion: Heterogeneous VFMs are stitchable, and the key lies in how the stitch layer is trained. Stitching is elevated from a representational diagnostic to a practical technique for combining the strengths of multiple VFMs, and it can reveal where their representations align or diverge. Abstract: Model stitching, connecting early layers of one model (source) to later layers of another (target) via a light stitch layer, has served as a probe of representational compatibility. Prior work finds that models trained on the same dataset remain stitchable (negligible accuracy drop) despite different initializations or objectives. We revisit stitching for Vision Foundation Models (VFMs) that vary in objectives, data, and modality mix (e.g., CLIP, DINOv2, SigLIP 2) and ask: Are heterogeneous VFMs stitchable? We introduce a systematic protocol spanning the stitch points, stitch layer families, training losses, and downstream tasks. Three findings emerge. (1) Stitch layer training matters: conventional approaches that match the intermediate features at the stitch point or optimize the task loss end-to-end struggle to retain accuracy, especially at shallow stitch points. (2) With a simple feature-matching loss at the target model's penultimate layer, heterogeneous VFMs become reliably stitchable across vision tasks. (3) For deep stitch points, the stitched model can surpass either constituent model at only a small inference overhead (for the stitch layer). Building on these findings, we further propose the VFM Stitch Tree (VST), which shares early layers across VFMs while retaining their later layers, yielding a controllable accuracy-latency trade-off for multimodal LLMs that often leverage multiple VFMs.
Taken together, our study elevates stitching from a diagnostic probe to a practical recipe for integrating complementary VFM strengths and pinpointing where their representations align or diverge.

[50] Adaptation of Weakly Supervised Localization in Histopathology by Debiasing Predictions

Alexis Guichemerre,Banafsheh Karimian,Soufiane Belharbi,Natacha Gillet,Nicolas Thome,Pourya Shamsolmoali,Mohammadhadi Shateri,Luke McCaffrey,Eric Granger

Main category: cs.CV

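TL;DR: This paper proposes SFDA-DeP, which iteratively identifies and corrects prediction bias to mitigate the amplification of class bias when weakly supervised object localization (WSOL) models undergo source-free domain adaptation (SFDA), improving classification and localization on cross-organ and cross-center histopathology images.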

Details Motivation: WSOL models degrade when deployed across domains because of distribution shift (different organs, staining protocols, scanners). Under strong shifts the pseudo-label distribution becomes heavily skewed toward dominant classes, and existing self-training-based SFDA methods reinforce that bias. Method: Inspired by machine unlearning, SFDA is formulated as an iterative process of identifying and correcting bias: the method periodically identifies target images from over-predicted classes, selectively lowers the predictive confidence for high-entropy (uncertain) samples, and preserves confident predictions. A jointly optimized pixel-level classifier restores discriminative localization features. Result: On cross-organ and cross-center histopathology benchmarks (GlaS, CAMELYON-16, CAMELYON-17), SFDA-DeP consistently outperforms existing SFDA methods across several WSOL models, in both classification and localization. Conclusion: SFDA-DeP curbs the amplification of prediction bias in WSOL models under SFDA, improving robustness and generalization in real-world multi-center clinical settings. Abstract: Weakly Supervised Object Localization (WSOL) models enable joint classification and region-of-interest localization in histology images using only image-class supervision. When deployed in a target domain, distribution shift remains a major cause of performance degradation, especially when applied to new organs or institutions with different staining protocols and scanner characteristics. Under stronger cross-domain shifts, WSOL predictions can become biased toward dominant classes, producing highly skewed pseudo-label distributions in the target domain. Source-Free (Unsupervised) Domain Adaptation (SFDA) methods are commonly employed to address domain shift. However, because they rely on self-training, the initial bias is reinforced over training iterations, degrading both classification and localization tasks. We identify this amplification of prediction bias as a primary obstacle to the SFDA of WSOL models in histopathology. This paper introduces SFDA-DeP, a method inspired by machine unlearning that formulates SFDA as an iterative process of identifying and correcting prediction bias. It periodically identifies target images from over-predicted classes and selectively reduces the predictive confidence for uncertain (high entropy) images, while preserving confident predictions. This process reduces the drift of decision boundaries and bias toward dominant classes. A jointly optimized pixel-level classifier further restores discriminative localization features under distribution shift.
Extensive experiments on cross-organ and cross-center histopathology benchmarks (GlaS, CAMELYON-16, CAMELYON-17) with several WSOL models show that SFDA-DeP consistently improves classification and localization over state-of-the-art SFDA baselines. Code: https://anonymous.4open.science/r/SFDA-DeP-1797/
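The selection step described in the abstract (find over-predicted classes, then pick the uncertain images among them) can be illustrated with a minimal sketch. It covers only the selection, not the confidence-reduction update or the pixel-level classifier, and the function name and both thresholds are hypothetical choices made for this example rather than values from the paper.

```python
import math
from collections import Counter

def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_uncertain_overpredicted(probs_per_image, share_threshold, entropy_threshold):
    """Return indices of high-entropy images whose pseudo-label is an over-predicted class.

    share_threshold and entropy_threshold are hypothetical knobs for this sketch.
    """
    # Pseudo-label = argmax of the predicted distribution.
    pseudo_labels = [max(range(len(p)), key=p.__getitem__) for p in probs_per_image]
    counts = Counter(pseudo_labels)
    total = len(pseudo_labels)
    # A class counts as over-predicted when its share of pseudo-labels exceeds the threshold.
    over_predicted = {c for c, n in counts.items() if n / total > share_threshold}
    return [
        i for i, (p, label) in enumerate(zip(probs_per_image, pseudo_labels))
        if label in over_predicted and entropy(p) > entropy_threshold
    ]

predictions = [
    [0.95, 0.05],  # confident, class 0
    [0.55, 0.45],  # uncertain, class 0
    [0.60, 0.40],  # uncertain, class 0
    [0.10, 0.90],  # confident, class 1
]
selected = select_uncertain_overpredicted(
    predictions, share_threshold=0.6, entropy_threshold=0.5
)
print(selected)
```

In this toy batch class 0 dominates the pseudo-labels, so only its two uncertain images are selected, while the confident class-0 prediction is left untouched.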

[51] Unleashing Video Language Models for Fine-grained HRCT Report Generation

Yingying Fang,Huichi Zhou,KinHei Lee,Yijia Wang,Zhenxuan Zhang,Jiahao Huang,Guang Yang

Main category: cs.CV

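TL;DR: This paper proposes AbSteering, a framework that uses abnormality-centric chain-of-thought and direct preference optimization to steer general-purpose video language models (VideoLMs) toward precise HRCT diagnostic report generation, outperforming existing CT-specific models in detection sensitivity and hallucination reduction.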

Details Motivation: HRCT diagnostic report generation is challenged by high pathological diversity and spatial sparsity within 3D volumes, and the adaptability of general-purpose VideoLMs to medical imaging remains underexplored. Method: The AbSteering framework comprises (i) an abnormality-centric chain-of-thought scheme that enforces abnormality reasoning, and (ii) a direct preference optimization objective that uses clinically confusable abnormalities as hard negatives to sharpen fine-grained discrimination. Result: With AbSteering, general-purpose VideoLMs transfer well to HRCT report generation and outperform domain-specific foundation models pretrained on large-scale CT data, with better detection sensitivity and fewer hallucinations. Conclusion: Guided by an abnormality-oriented paradigm, general-purpose VideoLMs can be adapted efficiently to high-volume medical image interpretation without large-scale domain pretraining, opening a new path for medical AI. Abstract: Generating precise diagnostic reports from High-Resolution Computed Tomography (HRCT) is critical for clinical workflow, yet it remains a formidable challenge due to the high pathological diversity and spatial sparsity within 3D volumes. While Video Language Models (VideoLMs) have demonstrated remarkable spatio-temporal reasoning in general domains, their adaptability to domain-specific, high-volume medical interpretation remains underexplored. In this work, we present AbSteering, an abnormality-centric framework that steers VideoLMs toward precise HRCT report generation. Specifically, AbSteering introduces: (i) an abnormality-centric Chain-of-Thought scheme that enforces abnormality reasoning, and (ii) a Direct Preference Optimization objective that utilizes clinically confusable abnormalities as hard negatives to enhance fine-grained discrimination. Our results demonstrate that general-purpose VideoLMs possess strong transferability to high-volume medical imaging when guided by this paradigm. Notably, AbSteering outperforms state-of-the-art domain-specific CT foundation models, which are pretrained with large-scale CTs, achieving superior detection sensitivity while simultaneously mitigating hallucinations. Our data and model weights are released at https://anonymous.4open.science/r/hrct-report-generation-video-vlm-728C/

[52] Less Data, Faster Convergence: Goal-Driven Data Optimization for Multimodal Instruction Tuning

Rujie Wu,Haozhe Zhao,Hai Ci,Yizhou Wang

Main category: cs.CV

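TL;DR: This paper proposes the Goal-Driven Data Optimization (GDO) framework, which computes six descriptors per sample and builds optimized training subsets for different goals, improving the accuracy and convergence speed of a multimodal model (Qwen3-VL-8B-Instruct) on several video understanding benchmarks while using far fewer training samples.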

Details Motivation: Multimodal instruction tuning is often inefficient because compute is spread evenly over large mixed image-video pools, while individual samples differ widely in how much they contribute to model performance. Method: GDO computes six descriptors for each candidate sample and uses them to build 1× training subsets for different optimization goals (e.g., minimum loss, diversity, temporal modeling). Its effect is validated under a fixed one-epoch training and evaluation recipe. Result: Compared with the Uni-10x baseline (512k samples), GDO matches or exceeds its performance with only about 26-35k samples, improving accuracy by +1.38, +1.67, +3.08, and +0.84 percentage points on MVBench, VideoMME, MLVU, and LVBench, respectively. The temporally emphasized strategies (Temp/Temp+) benefit long-video understanding most. Conclusion: GDO is a goal-driven data optimization framework that achieves faster convergence and higher performance with fewer samples under a fixed training protocol, and it is particularly suited to multimodal video understanding tasks. Abstract: Multimodal instruction tuning is often compute-inefficient because training budgets are spread across large mixed image-video pools whose utility is highly uneven. We present Goal-Driven Data Optimization (GDO), a framework that computes six sample descriptors for each candidate and constructs optimized 1× training subsets for different goals. Under a fixed one-epoch Qwen3-VL-8B-Instruct training and evaluation recipe on 8 H20 GPUs, GDO uses far fewer training samples than the Uni-10x baseline while converging faster and achieving higher accuracy. Relative to the fixed 512k-sample Uni-10x baseline, GDO reaches the Uni-10x reference after 35.4k samples on MVBench, 26.6k on VideoMME, 27.3k on MLVU, and 34.7k on LVBench, while improving accuracy by +1.38, +1.67, +3.08, and +0.84 percentage points, respectively. The gains are largest on MVBench and MLVU, while LVBench improves more modestly, consistent with its ultra-long-video setting and the mismatch between that benchmark and the short-video/image-dominant training pool. Across MinLoss, Diverse, Temp, and Temp+, stronger temporal emphasis yields steadily better long-video understanding behavior. Overall, GDO provides a goal-driven data optimization framework that enables faster convergence with fewer training samples under a fixed training protocol. Code is available at https://github.com/rujiewu/GDO.
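To put the sample counts quoted in the abstract in perspective, the short calculation below expresses each one as a fraction of the 512k-sample Uni-10x budget. The numbers are taken from the abstract. The variable and function names are ours, and reading "reaches the Uni-10x reference after N samples" as a data-budget ratio is our interpretation.

```python
BASELINE_SAMPLES = 512_000  # fixed Uni-10x budget quoted in the abstract

# Samples GDO needs to reach the Uni-10x reference, per benchmark (from the abstract).
samples_to_match = {
    "MVBench": 35_400,
    "VideoMME": 26_600,
    "MLVU": 27_300,
    "LVBench": 34_700,
}

def data_fraction(samples, baseline=BASELINE_SAMPLES):
    """Fraction of the baseline sample budget that was consumed."""
    return samples / baseline

fractions = {name: data_fraction(n) for name, n in samples_to_match.items()}
for name, frac in fractions.items():
    print(f"{name}: {frac:.1%} of the baseline budget, about {1 / frac:.1f}x fewer samples")
```

On every benchmark the quoted count is below 7% of the baseline budget.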

[53] CalliMaster: Mastering Page-level Chinese Calligraphy via Layout-guided Spatial Planning

Tianshuo Xu,Tiantian Hong,Zhifei Chen,Fei Chao,Ying-cong Chen

Main category: cs.CV

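TL;DR: This paper proposes CalliMaster, a framework that decouples spatial planning from content synthesis so that page-level calligraphy synthesis achieves both glyph precision and layout composition. It enables controllable generation and editing and supports cultural heritage applications such as restoration and forensic analysis.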

Details Motivation: Existing character-level models lack spatial context, while page-level methods often sacrifice brushwork detail. Page-level calligraphy synthesis needs to balance glyph precision with overall layout. Method: A coarse-to-fine pipeline (Text → Layout → Image) modeled on the cognitive process of "planning before writing", run in two stages within a single Multimodal Diffusion Transformer: the spatial planning stage predicts character bounding boxes to establish the global layout, and the content synthesis stage uses that layout as a geometric prompt and renders high-fidelity brushwork with flow matching. Result: State-of-the-art generation quality; controllable semantic re-planning under layout constraints (e.g., resizing or repositioning characters while void space and brush momentum are harmonized automatically); extension to digital cultural heritage applications such as artifact restoration and forensic analysis. Conclusion: By decoupling spatial and content modeling, CalliMaster resolves the precision-layout conflict in page-level calligraphy synthesis, combining high-quality generation, flexible editing, and cross-task generalization into a comprehensive tool for digital cultural heritage. Abstract: Page-level calligraphy synthesis requires balancing glyph precision with layout composition. Existing character models lack spatial context, while page-level methods often compromise brushwork detail. In this paper, we present CalliMaster, a unified framework for controllable generation and editing that resolves this conflict by decoupling spatial planning from content synthesis. Inspired by the human cognitive process of "planning before writing", we introduce a coarse-to-fine pipeline (Text → Layout → Image) to tackle the combinatorial complexity of page-scale synthesis. Operating within a single Multimodal Diffusion Transformer, a spatial planning stage first predicts character bounding boxes to establish the global spatial arrangement. This intermediate layout then serves as a geometric prompt for the content synthesis stage, where the same network utilizes flow-matching to render high-fidelity brushwork. Beyond achieving state-of-the-art generation quality, this disentanglement supports versatile downstream capabilities. By treating the layout as a modifiable constraint, CalliMaster enables controllable semantic re-planning: users can resize or reposition characters while the model automatically harmonizes the surrounding void space and brush momentum. Furthermore, we demonstrate the framework's extensibility to artifact restoration and forensic analysis, providing a comprehensive tool for digital cultural heritage.

[54] RAW-Domain Degradation Models for Realistic Smartphone Super-Resolution

Ali Mosleh,Faraz Ali,Fengjia Zhang,Stavros Tsogkas,Junyong Lee,Alex Levinshtein,Michael S. Brown

Main category: cs.CV

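TL;DR: This paper proposes a RAW-domain super-resolution method built on device-specific degradation modeling. It calibrates each device and "unprocesses" publicly available rendered images to generate high-quality paired data, noticeably improving digital zoom performance on real smartphones.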

Details Motivation: Smartphone digital zoom relies on learning-based RAW-domain super-resolution models, but sensor RAW training data with ground truth is hard to obtain, and existing synthetic-data methods suffer from domain gaps caused by inaccurate degradation modeling. Method: Camera calibration provides device-specific blur and noise parameters, which are used to build a more realistic unprocessing degradation pipeline. Publicly available HR rendered images are converted into LR images in the RAW domain of the target phones to form a paired dataset, on which a single-image RAW-to-RGB super-resolution model is trained. Result: When tested on a real smartphone held out from training, the proposed method clearly improves SR performance over baselines trained on data generated with generic or arbitrary degradations. Conclusion: Accurate, device-specific degradation modeling improves real-world RAW-domain super-resolution more than generic priors, confirming that the realism and specificity of the degradation process are key to high-quality synthetic data. Abstract: Digital zoom on smartphones relies on learning-based super-resolution (SR) models that operate on RAW sensor images, but obtaining sensor-specific training data is challenging due to the lack of ground-truth images. Synthetic data generation via "unprocessing" pipelines offers a potential solution by simulating the degradations that transform high-resolution (HR) images into their low-resolution (LR) counterparts. However, these pipelines can introduce domain gaps due to incomplete or unrealistic degradation modeling. In this paper, we demonstrate that principled and carefully designed degradation modeling can enhance SR performance in real-world conditions. Instead of relying on generic priors for camera blur and noise, we model device-specific degradations through calibration and unprocess publicly available rendered images into the RAW domain of different smartphones. Using these image pairs, we train a single-image RAW-to-RGB SR model and evaluate it on real data from a held-out device. Our experiments show that accurate degradation modeling leads to noticeable improvements, with our SR model outperforming baselines trained on large pools of arbitrarily chosen degradations.

[55] Naïve PAINE: Lightweight Text-to-Image Generation Improvement with Prompt Evaluation

Joong Ho Kim,Nicholas Thai,Souhardya Saha Dip,Dong Lao,Keith G. Mills

Main category: cs.CV

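TL;DR: This paper proposes Naïve PAINE, which predicts the image quality associated with an initial noise and prompt, then feeds only high-quality noises to the diffusion model, improving the quality and stability of text-to-image generation.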

Details Motivation: Diffusion models for text-to-image generation depend on random Gaussian noise, so the same prompt gives unstable results and users need several attempts to get a satisfactory image, which is inefficient. Method: Naïve PAINE uses T2I preference benchmarks to predict an image quality score directly from the initial noise and the prompt, then selects a handful of high-quality noises to pass to the diffusion model. It also provides lightweight feedback on the model's generative ability for the prompt. Result: On several prompt-corpus benchmarks, Naïve PAINE outperforms existing approaches and substantially improves the quality and consistency of generated images. Conclusion: Naïve PAINE is a lightweight, plug-and-play quality-guidance mechanism that mitigates the randomness of diffusion models and improves the controllability and practicality of T2I generation. Abstract: Text-to-Image (T2I) generation is primarily driven by Diffusion Models (DM) which rely on random Gaussian noise. Thus, like playing the slots at a casino, a DM will produce different results given the same user-defined inputs. This imposes a gambler's burden: to perform multiple generation cycles to obtain a satisfactory result. However, even though DMs use stochastic sampling to seed generation, the distribution of generated content quality highly depends on the prompt and the generative ability of a DM with respect to it. To account for this, we propose Naïve PAINE for improving the generative quality of Diffusion Models by leveraging T2I preference benchmarks. We directly predict the numerical quality of an image from the initial noise and given prompt. Naïve PAINE then selects a handful of quality noises and forwards them to the DM for generation. Further, Naïve PAINE provides feedback on the DM generative quality given the prompt and is lightweight enough to seamlessly fit into existing DM pipelines. Experimental results demonstrate that Naïve PAINE outperforms existing approaches on several prompt corpus benchmarks.
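The score-then-select loop the abstract describes can be sketched as follows. This is an illustration only: `placeholder_quality_predictor` is a hypothetical toy stand-in for the learned (noise, prompt) quality predictor, and the seed-based sampling, vector size, and function names are assumptions made for the example, not details from the paper.

```python
import random

def sample_noise(seed, dim):
    """Draw a reproducible Gaussian noise vector from an integer seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def placeholder_quality_predictor(noise, prompt):
    """Hypothetical stand-in for a learned (noise, prompt) -> quality predictor."""
    # Toy score only: NOT the paper's model.
    return -abs(sum(noise) / len(noise)) + 0.001 * len(prompt)

def select_noises(prompt, candidate_seeds, keep, dim=16,
                  predictor=placeholder_quality_predictor):
    """Score every candidate noise for the prompt and keep the top `keep` seeds."""
    scored = [(predictor(sample_noise(s, dim), prompt), s) for s in candidate_seeds]
    scored.sort(reverse=True)  # highest predicted quality first
    return [seed for _, seed in scored[:keep]]

best_seeds = select_noises("a watercolor fox", candidate_seeds=range(32), keep=4)
print(best_seeds)
```

The selected seeds would then be handed to the diffusion sampler in place of fresh random draws, so the expensive generation step runs only on candidates the predictor rates highly.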

[56] MemRoPE: Training-Free Infinite Video Generation via Evolving Memory Tokens

Youngrae Kim,Qixin Hu,C. -C. Jay Kuo,Peter A. Beerel

Main category: cs.CV

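TL;DR: MemRoPE is a training-free framework that uses memory tokens and online RoPE indexing to retain both long-term identity and short-term dynamics within a fixed-size cache during long video generation, substantially improving temporal coherence, visual fidelity, and subject consistency.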

Details Motivation: Existing sliding-window caches discard past context, causing fidelity degradation, identity drift, and motion stagnation in long-horizon video generation, and static anchors cannot reflect the evolving content of the video. Method: MemRoPE has two co-designed components: (1) Memory Tokens, which use exponential moving averages to compress all past keys into dual long-term and short-term streams; and (2) Online RoPE Indexing, which caches unrotated keys and applies positional embeddings dynamically at attention time, avoiding conflicting positional phases. Result: On minute- to hour-scale video generation, MemRoPE outperforms existing methods in temporal coherence, visual fidelity, and subject consistency. Conclusion: Because positional decoupling and temporal aggregation reinforce each other, MemRoPE makes a fixed-size cache effective for unbounded generation, offering a new paradigm for long-horizon streaming video generation with autoregressive diffusion models. Abstract: Autoregressive diffusion enables real-time frame streaming, yet existing sliding-window caches discard past context, causing fidelity degradation, identity drift, and motion stagnation over long horizons. Current approaches preserve a fixed set of early tokens as attention sinks, but this static anchor cannot reflect the evolving content of a growing video. We introduce MemRoPE, a training-free framework with two co-designed components. Memory Tokens continuously compress all past keys into dual long-term and short-term streams via exponential moving averages, maintaining both global identity and recent dynamics within a fixed-size cache. Online RoPE Indexing caches unrotated keys and applies positional embeddings dynamically at attention time, ensuring the aggregation is free of conflicting positional phases. These two mechanisms are mutually enabling: positional decoupling makes temporal aggregation well-defined, while aggregation makes fixed-size caching viable for unbounded generation. Extensive experiments validate that MemRoPE outperforms existing methods in temporal coherence, visual fidelity, and subject consistency across minute- to hour-scale generation.
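The two mechanisms in the abstract can be illustrated together in a minimal sketch: keys are averaged while still unrotated, and the rotary embedding is applied only when the memory is read. The class name, the decay values, the interleaved-pair RoPE layout, and the positions assigned at read time are all assumptions made for this example. The abstract does not specify them.

```python
import math

def ema_update(memory, key, decay):
    """Exponential moving average of key vectors; memory is None before the first key."""
    if memory is None:
        return list(key)
    return [decay * m + (1.0 - decay) * k for m, k in zip(memory, key)]

def apply_rope(vector, position, base=10000.0):
    """Rotate consecutive (even, odd) pairs by position-dependent angles (standard RoPE)."""
    dim = len(vector)
    rotated = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        x, y = vector[i], vector[i + 1]
        rotated += [x * math.cos(theta) - y * math.sin(theta),
                    x * math.sin(theta) + y * math.cos(theta)]
    return rotated

class DualStreamMemory:
    """Fixed-size memory: one slow (long-term) and one fast (short-term) EMA of keys."""

    def __init__(self, long_decay=0.99, short_decay=0.5):  # hypothetical decay values
        self.long_decay, self.short_decay = long_decay, short_decay
        self.long_term = None
        self.short_term = None

    def add_unrotated_key(self, key):
        # Aggregate BEFORE any rotation, so averaged keys never mix positional phases.
        self.long_term = ema_update(self.long_term, key, self.long_decay)
        self.short_term = ema_update(self.short_term, key, self.short_decay)

    def keys_for_attention(self, long_position, short_position):
        # Positions are assigned only now, at attention time.
        return [apply_rope(self.long_term, long_position),
                apply_rope(self.short_term, short_position)]

memory = DualStreamMemory()
for step in range(5):
    memory.add_unrotated_key([1.0, 0.0, float(step), 0.0])
long_key, short_key = memory.keys_for_attention(long_position=0, short_position=1)
```

The cache holds two vectors regardless of how many keys have been added. The short-term stream tracks the changing third component far more closely than the long-term stream does.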

[57] Addressing Data Scarcity in 3D Trauma Detection through Self-Supervised and Semi-Supervised Learning with Vertex Relative Position Encoding

Shivam Chaudhary,Sheethal Bhat,Andreas Maier

Main category: cs.CV

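TL;DR: This paper proposes a label-efficient method that combines self-supervised pre-training with semi-supervised detection for 3D detection and classification of traumatic injuries in abdominal CT, markedly improving performance with very few annotations.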

Details Motivation: Accurate detection and localization of traumatic injuries in abdominal CT is hampered by a severe scarcity of annotated data. Method: A 3D U-Net encoder is pre-trained without annotations using patch-based masked image modeling (MIM). The downstream tasks are (1) 3D injury detection using VDETR with Vertex Relative Position Encoding and (2) multi-label injury classification. Detection uses semi-supervised learning (2,000 unlabeled volumes plus 144 labeled samples) with consistency regularization. Result: With only 144 labeled samples, detection reaches 56.57% validation mAP@0.50 and 45.30% test mAP@0.50, a 115% improvement over supervised-only training. With 2,244 labeled samples, classification reaches 94.07% test accuracy using only the frozen pre-trained encoder. Conclusion: Self-supervised pre-training combined with semi-supervised learning effectively alleviates annotation scarcity in medical imaging and enables robust 3D object detection from few labeled samples. Abstract: Accurate detection and localization of traumatic injuries in abdominal CT scans remains a critical challenge in emergency radiology, primarily due to severe scarcity of annotated medical data. This paper presents a label-efficient approach combining self-supervised pre-training with semi-supervised detection for 3D medical image analysis. We employ patch-based Masked Image Modeling (MIM) to pre-train a 3D U-Net encoder on 1,206 CT volumes without annotations, learning robust anatomical representations. The pretrained encoder enables two downstream clinical tasks: 3D injury detection using VDETR with Vertex Relative Position Encoding, and multi-label injury classification. For detection, semi-supervised learning with 2,000 unlabeled volumes and consistency regularization achieves 56.57% validation mAP@0.50 and 45.30% test mAP@0.50 with only 144 labeled training samples, representing a 115% improvement over supervised-only training. For classification, expanding to 2,244 labeled samples yields 94.07% test accuracy across seven injury categories using only a frozen encoder, demonstrating immediately transferable self-supervised features. Our results validate that self-supervised pre-training combined with semi-supervised learning effectively addresses label scarcity in medical imaging, enabling robust 3D object detection with limited annotations.

[58] Do You See What I Am Pointing At? Gesture-Based Egocentric Video Question Answering

Yura Choi,Roy Miles,Rolandos Alexandros Potamias,Ismail Elezi,Jiankang Deng,Stefanos Zafeiriou

Main category: cs.CV

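TL;DR: This paper introduces the EgoPointVQA dataset and the Hand Intent Tokens (HINT) method, which improve the ability of multimodal large language models to understand pointing gestures in egocentric video and answer related questions.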

Details Motivation: Current MLLMs lack gesture-rich data and struggle to infer fine-grained pointing intent from egocentric video, which limits their use in next-generation egocentric AI assistants. Method: The authors build EgoPointVQA, a dataset of 4,000 synthetic and 400 real-world videos, and propose Hand Intent Tokens (HINT), which uses an off-the-shelf 3D hand keypoint reconstruction model to produce spatially and temporally aware tokens that are inserted into the model's input sequence. Result: HINT-14B reaches 68.1% average accuracy over 6 tasks, 6.6% above the state-of-the-art InternVL3-14B, and performs better across different backbones and model sizes. Conclusion: By explicitly modeling hand pointing intent, HINT substantially improves MLLM performance on egocentric gesture-grounded VQA, and EgoPointVQA provides the first systematic benchmark for this direction. Abstract: Understanding and answering questions based on a user's pointing gesture is essential for next-generation egocentric AI assistants. However, current Multimodal Large Language Models (MLLMs) struggle with such tasks due to the lack of gesture-rich data and their limited ability to infer fine-grained pointing intent from egocentric video. To address this, we introduce EgoPointVQA, a dataset and benchmark for gesture-grounded egocentric question answering, comprising 4000 synthetic and 400 real-world videos across multiple deictic reasoning tasks. Built upon it, we further propose Hand Intent Tokens (HINT), which encodes tokens derived from 3D hand keypoints using an off-the-shelf reconstruction model and interleaves them with the model input to provide explicit spatial and temporal context for interpreting pointing intent. We show that our model outperforms others across different backbones and model sizes. In particular, HINT-14B achieves 68.1% accuracy, on average over 6 tasks, surpassing the state-of-the-art, InternVL3-14B, by 6.6%. To further facilitate open research, we will release the code, model, and dataset. Project page: https://yuuraa.github.io/papers/choi2026egovqa

[59] Spatio-Semantic Expert Routing Architecture with Mixture-of-Experts for Referring Image Segmentation

Alaa Dalaq,Muzammil Behzad

Main category: cs.CV

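TL;DR: This paper proposes SERA (Spatio-Semantic Expert Routing Architecture), a two-stage, expression-aware expert routing architecture for referring image segmentation. A lightweight expression-driven adapter (SERA-Adapter) and a spatial feature fusion module (SERA-Fusion) improve spatial coherence and boundary precision on top of a frozen pretrained vision-language backbone. A parameter-efficient routing scheme that tunes only normalization and bias terms (under 1% of parameters updated) markedly reduces fragmentation, inaccurate boundaries, and wrong-object errors.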

Details Motivation: Existing methods rely on uniform refinement strategies that do not match the diverse spatial and semantic reasoning needs of referring expressions, leading to fragmented masks, inaccurate boundaries, or wrong objects when the pretrained backbone is frozen. Method: SERA has two core components: (1) SERA-Adapter, which inserts expression-conditioned lightweight adapters into selected backbone blocks and combines expert-guided refinement with cross-modal attention to improve spatial coherence and boundary precision; and (2) SERA-Fusion, which reshapes token features into spatial grids and applies geometry-preserving expert transformations before multimodal interaction. A lightweight routing mechanism updates only normalization layers and bias terms to remain compatible with the frozen encoders. Result: On standard referring image segmentation benchmarks, SERA consistently outperforms strong baselines, with the clearest gains on expressions that require accurate localization and fine boundaries. Conclusion: Through staged, expression-aware expert routing and parameter-efficient tuning, SERA reduces the mismatch between spatial reasoning and semantic alignment without damaging pretrained representations, offering a new paradigm for efficient and accurate referring segmentation. Abstract: Referring image segmentation aims to produce a pixel-level mask for the image region described by a natural-language expression. Although pretrained vision-language models have improved semantic grounding, many existing methods still rely on uniform refinement strategies that do not fully match the diverse reasoning requirements of referring expressions. Because of this mismatch, predictions often contain fragmented regions, inaccurate boundaries, or even the wrong object, especially when pretrained backbones are frozen for computational efficiency. To address these limitations, we propose SERA, a Spatio-Semantic Expert Routing Architecture for referring image segmentation. SERA introduces lightweight, expression-aware expert refinement at two complementary stages within a vision-language framework. First, we design SERA-Adapter, which inserts an expression-conditioned adapter into selected backbone blocks to improve spatial coherence and boundary precision through expert-guided refinement and cross-modal attention. We then introduce SERA-Fusion, which strengthens intermediate visual representations by reshaping token features into spatial grids and applying geometry-preserving expert transformations before multimodal interaction. In addition, a lightweight routing mechanism adaptively weights expert contributions while remaining compatible with pretrained representations.
To make this routing stable under frozen encoders, SERA uses a parameter-efficient tuning strategy that updates only normalization and bias terms, affecting less than 1% of the backbone parameters. Experiments on standard referring image segmentation benchmarks show that SERA consistently outperforms strong baselines, with especially clear gains on expressions that require accurate spatial localization and precise boundary delineation.

[60] Spatial Reasoning is Not a Free Lunch: A Controlled Study on LLaVA

Nahid Alam,Leema Krishna Murali,Siddhant Bharadwaj,Patrick Liu,Timothy Chung,Drishti Sharma,Akshata A.,Kranthi Kiran,Wesley Tam,Bala Krishna S Vegesna

Main category: cs.CV

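TL;DR: This paper examines the limitations of vision-language models (VLMs) in understanding 2D spatial relationships, traces the problem to dominant design choices (such as CLIP image encoders and 1D positional encoding), and uses controlled experiments within the LLaVA framework to show how the image encoder objective and the positional encoding structure affect spatial reasoning.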

Details Motivation: Although VLMs perform strongly on general benchmarks, they remain brittle on basic 2D spatial reasoning such as relative position, layout, and counting. The authors argue that this is not merely a data problem but is closely tied to dominant VLM architecture choices. Method: A controlled diagnostic study within the LLaVA framework compares different image encoders (CLIP-style vs. encoders trained with denser or generative objectives) and variants with or without 2D positional encoding, evaluated systematically on a suite of spatial reasoning benchmarks. Result: The models show consistent performance gaps on spatial tasks. The image encoder's training objective and the structure of the positional encoding clearly affect spatial reasoning, but they do not fully resolve the problem. Conclusion: The spatial reasoning bottleneck in current VLMs stems partly from architectural design. Better image encoder objectives and 2D positional encoding help, but more fundamental modeling changes are needed. Abstract: Vision-language models (VLMs) have advanced rapidly, yet they still struggle with basic spatial reasoning. Despite strong performance on general benchmarks, modern VLMs remain brittle at understanding 2D spatial relationships such as relative position, layout, and counting. We argue that this failure is not merely a data problem, but is closely tied to dominant design choices in current VLM pipelines: reliance on CLIP-style image encoders and the flattening of images into 1D token sequences with 1D positional encoding. We present a controlled diagnostic study within the LLaVA framework to isolate how these choices affect spatial grounding. We evaluate frontier models and LLaVA variants on a suite of spatial benchmarks, comparing CLIP-based encoders against alternatives trained with denser or generative objectives, as well as variants augmented with 2D positional encoding. Our results show consistent spatial performance gaps across models, and indicate that encoder objectives and positional structure shape spatial behavior, but do not fully resolve it.

[61] Decoding Matters: Efficient Mamba-Based Decoder with Distribution-Aware Deep Supervision for Medical Image Segmentation

Fares Bougourzi,Fadi Dornaika,Abdenour Hadid

Main category: cs.CV

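TL;DR: This paper proposes Deco-Mamba, a decoder-centric method for generalized 2D medical image segmentation. It combines a Transformer-CNN-Mamba architecture, a novel Co-Attention Gate (CAG), a Vision State Space Module (VSSM), and a deformable convolutional refinement block with a windowed distribution-aware KL-divergence loss, achieving strong cross-modality generalization and state-of-the-art performance.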

Details Motivation: Most existing deep learning methods for medical image segmentation are task-specific and generalize poorly, and their reliance on large pretrained encoders raises computational complexity. Method: Deco-Mamba is a U-Net-like model whose encoder combines a CNN block with a Transformer backbone and whose decoder integrates the Co-Attention Gate (CAG), the Vision State Space Module (VSSM), and a deformable convolutional refinement block. A windowed distribution-aware KL-divergence loss provides deep supervision across multiple decoding stages. Result: State-of-the-art performance on multiple medical image segmentation benchmarks, with strong cross-modality generalization at moderate model complexity. Conclusion: A decoder-centric design can improve both the generalization and the efficiency of medical image segmentation, and Deco-Mamba offers a new paradigm for generalized medical image segmentation. Abstract: Deep learning has achieved remarkable success in medical image segmentation, often reaching expert-level accuracy in delineating tumors and tissues. However, most existing approaches remain task-specific, showing strong performance on individual datasets but limited generalization across diverse imaging modalities. Moreover, many methods focus primarily on the encoder, relying on large pretrained backbones that increase computational complexity. In this paper, we propose a decoder-centric approach for generalized 2D medical image segmentation. The proposed Deco-Mamba follows a U-Net-like structure with a Transformer-CNN-Mamba design. The encoder combines a CNN block and Transformer backbone for efficient feature extraction, while the decoder integrates our novel Co-Attention Gate (CAG), Vision State Space Module (VSSM), and deformable convolutional refinement block to enhance multi-scale contextual representation. Additionally, a windowed distribution-aware KL-divergence loss is introduced for deep supervision across multiple decoding stages. Extensive experiments on diverse medical image segmentation benchmarks yield state-of-the-art performance and strong generalization capability while maintaining moderate model complexity. The source code will be released upon acceptance.

[62] CVGL: Causal Learning and Geometric Topology

Songsong Ouyang,Yingying Zhu

Main category: cs.CV

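TL;DR: This paper proposes the CLGT framework, which uses a Causal Feature Extractor (CFE) to remove the influence of confounding factors and a Geometric Topology Fusion (GT Fusion) module to inject bird's-eye-view road topology that reduces cross-view inconsistency. Combined with Data-Adaptive Pooling (DA Pooling) to strengthen the representation of semantic regions, it reaches state-of-the-art performance on multiple CVGL benchmarks.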

Details Motivation: To address the matching difficulty in cross-view geo-localization (CVGL) caused by large viewpoint differences and confounding factors. Method: The CLGT framework, comprising a Causal Feature Extractor (CFE), a Geometric Topology Fusion (GT Fusion) module, and a Data-Adaptive Pooling (DA Pooling) module. Result: State-of-the-art performance on CVUSA, CVACT, and their robustness-enhanced variants, with especially strong results under real-world corruptions. Conclusion: CLGT improves the robustness and accuracy of cross-view geo-localization, supporting the value of causal learning and geometric topology modeling for this task. Abstract: Cross-view geo-localization (CVGL) aims to estimate the geographic location of a street image by matching it with a corresponding aerial image. This is critical for autonomous navigation and mapping in complex real-world scenarios. However, the task remains challenging due to significant viewpoint differences and the influence of confounding factors. To tackle these issues, we propose the Causal Learning and Geometric Topology (CLGT) framework, which integrates two key components: a Causal Feature Extractor (CFE) that mitigates the influence of confounding factors by leveraging causal intervention to encourage the model to focus on stable, task-relevant semantics; and a Geometric Topology Fusion (GT Fusion) module that injects Bird's Eye View (BEV) road topology into street features to alleviate cross-view inconsistencies caused by extreme perspective changes. Additionally, we introduce a Data-Adaptive Pooling (DA Pooling) module to enhance the representation of semantically rich regions. Extensive experiments on CVUSA, CVACT, and their robustness-enhanced variants (CVUSA-C-ALL and CVACT-C-ALL) demonstrate that CLGT achieves state-of-the-art performance, particularly under challenging real-world corruptions. Our codes are available at https://github.com/oyss-szu/CLGT.

[63] AccelAes: Accelerating Diffusion Transformers for Training-Free Aesthetic-Enhanced Image Generation

Xuanhua Yin,Chuanzhi Xu,Haoxian Zhou,Boyu Wei,Weidong Cai

Main category: cs.CV

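TL;DR: This paper proposes AccelAes, a training-free acceleration framework that uses aesthetics-aware spatio-temporal sparsification to improve both the inference efficiency and the visual quality of Diffusion Transformers (DiTs) in text-to-image generation.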

Details Motivation: The high computational cost of self-attention gives DiTs high inference latency and limits deployment. The authors also observe that denoising is more concentrated and changes more sharply in aesthetics-related regions, while other regions involve much redundant computation. Method: AccelAes introduces AesMask (a one-shot aesthetic focus mask derived from prompt semantics and cross-attention signals) and SkipSparse (a mechanism that reallocates computation locally), plus a lightweight step-level prediction cache that reduces temporal redundancy. Result: On Lumina-Next, AccelAes achieves a 2.11× speedup and improves ImageReward by 11.9%. Consistent acceleration and aesthetic quality gains are confirmed across several DiT families. Conclusion: AccelAes is an efficient, plug-and-play, training-free acceleration scheme that balances inference speed with the perceptual aesthetic quality of generated images. Abstract: Diffusion Transformers (DiTs) are a dominant backbone for high-fidelity text-to-image generation due to strong scalability and alignment at high resolutions. However, quadratic self-attention over dense spatial tokens leads to high inference latency and limits deployment. We observe that denoising is spatially non-uniform with respect to aesthetic descriptors in the prompt. Regions associated with aesthetic tokens receive concentrated cross-attention and show larger temporal variation, while low-affinity regions evolve smoothly with redundant computation. Based on this insight, we propose AccelAes, a training-free framework that accelerates DiTs through aesthetics-aware spatio-temporal reduction while improving perceptual aesthetics. AccelAes builds AesMask, a one-shot aesthetic focus mask derived from prompt semantics and cross-attention signals. When localized computation is feasible, SkipSparse reallocates computation and guidance to masked regions. We further reduce temporal redundancy using a lightweight step-level prediction cache that periodically replaces full Transformer evaluations. Experiments on representative DiT families show consistent acceleration and improved aesthetics-oriented quality. On Lumina-Next, AccelAes achieves a 2.11× speedup and improves ImageReward by +11.9% over the dense baseline. Code is available at https://github.com/xuanhuayin/AccelAes.

[64] DINOLight: Robust Ambient Light Normalization with Self-supervised Visual Prior Integration

Youngjin Oh,Junhyeong Kwon,Nam Ik Cho

Main category: cs.CV

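TL;DR: This paper proposes DINOLight, a framework that uses the semantic and geometric features extracted by the self-supervised DINOv2 model as a visual prior to normalize images captured under non-uniform ambient lighting.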

Details Motivation: Ambient light normalization aims to restore images degraded by non-uniform shadows and lighting caused by multiple light sources and complex scene geometry, yet existing methods make little effective use of high-level semantic and geometric information. Method: An adaptive feature fusion module based on DINOv2 features (using a point-wise softmax mask) and an auxiliary cross-attention mechanism operating in both the spatial and frequency domains embed DINOv2 features into the restoration network. Result: Leading performance on the Ambient6K dataset, and results on shadow-removal benchmarks that are competitive with methods using mask priors. Conclusion: DINOv2 features serve as a strong visual prior that improves ambient light normalization, confirming the transfer value of self-supervised representations for low-level vision tasks. Abstract: This paper presents a new ambient light normalization framework, DINOLight, that integrates the self-supervised model DINOv2's image understanding capability into the restoration process as a visual prior. Ambient light normalization aims to restore images degraded by non-uniform shadows and lighting caused by multiple light sources and complex scene geometries. We observe that DINOv2 can reliably extract both semantic and geometric information from a degraded image. Based on this observation, we develop a novel framework to utilize DINOv2 features for lighting normalization. First, we propose an adaptive feature fusion module that combines features from different DINOv2 layers using a point-wise softmax mask. Next, the fused features are integrated into our proposed restoration network in both spatial and frequency domains through an auxiliary cross-attention mechanism. Experiments show that DINOLight achieves superior performance on the Ambient6K dataset, and that DINOv2 features are effective for enhancing ambient light normalization. We also apply our method to shadow-removal benchmark datasets, achieving competitive results compared to methods that use mask priors. Codes will be released upon acceptance.
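One plausible reading of "combines features from different DINOv2 layers using a point-wise softmax mask" is a softmax over layers computed independently at each spatial position. The sketch below implements that reading. The function names and the data layout are ours, and the mask logits are passed in as plain inputs because the abstract does not say how the mask is produced.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scalars."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_layers(layer_features, layer_logits):
    """Fuse per-layer features with a softmax over layers computed at each position.

    layer_features[l][p] is the feature vector of layer l at position p;
    layer_logits[l][p] is a scalar mask logit for the same layer and position.
    """
    num_layers = len(layer_features)
    num_positions = len(layer_features[0])
    fused = []
    for p in range(num_positions):
        # One weight per layer, normalized independently at this position.
        weights = softmax([layer_logits[l][p] for l in range(num_layers)])
        dim = len(layer_features[0][p])
        fused.append([
            sum(weights[l] * layer_features[l][p][d] for l in range(num_layers))
            for d in range(dim)
        ])
    return fused

# Two layers, two positions, two channels.
features = [
    [[1.0, 0.0], [2.0, 2.0]],  # layer 0
    [[3.0, 4.0], [0.0, 0.0]],  # layer 1
]
logits = [
    [0.0, 10.0],   # layer 0: neutral at position 0, strongly preferred at position 1
    [0.0, -10.0],  # layer 1
]
fused = fuse_layers(features, logits)
```

At position 0 the equal logits give a plain average of the two layers, while at position 1 the mask selects layer 0 almost exclusively. This per-position choice is what separates a point-wise mask from a single global layer weighting.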

[65] MRGeo: Robust Cross-View Geo-Localization of Corrupted Images via Spatial and Channel Feature Enhancement

Le Wu,Lv Bo,Songsong Ouyang,Yingying Zhu

Main category: cs.CV

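TL;DR: This paper proposes MRGeo, which uses a hierarchical defense strategy to make cross-view geo-localization more robust to image corruption such as blur and weather effects. It combines a spatial-channel enhancement block with a region-level geometric alignment module and improves performance on several robustness benchmarks.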

Details Motivation: Existing cross-view geo-localization methods perform very well on standard datasets but are not robust to real-world image corruption (such as blur and bad weather), which limits practical deployment. Method: Proposes MRGeo with a hierarchical defense strategy: (1) a spatial-channel enhancement block containing a spatial adaptive representation module (parallel modeling of global and local features with dynamic gated fusion) and a channel calibration module (modeling multi-granularity channel dependencies); (2) a region-level geometric alignment module that imposes a geometric structure constraint to keep descriptors consistent at a coarse granularity. Result: Improves average R@1 by 2.92% across the three robustness benchmarks CVUSA-C-ALL, CVACT_val-C-ALL, and CVACT_test-C-ALL, and performs best in cross-area evaluation, supporting its robustness and generalization. Conclusion: MRGeo is the first method to systematically address the robustness of cross-view geo-localization under image corruption; by improving feature quality and introducing a geometric prior, it makes the model more usable in complex real-world environments. Abstract: Cross-view geo-localization (CVGL) aims to accurately localize street-view images through retrieval of corresponding geo-tagged satellite images. While prior works have achieved nearly perfect performance on certain standard datasets, their robustness in real-world corrupted environments remains under-explored. This oversight causes severe performance degradation or failure when images are affected by corruption such as blur or weather, significantly limiting practical deployment. To address this critical gap, we introduce MRGeo, the first systematic method designed for robust CVGL under corruption. MRGeo employs a hierarchical defense strategy that enhances the intrinsic quality of features and then enforces a robust geometric prior. Its core is the Spatial-Channel Enhancement Block, which contains: (1) a Spatial Adaptive Representation Module that models global and local features in parallel and uses a dynamic gating mechanism to arbitrate their fusion based on feature reliability; and (2) a Channel Calibration Module that performs compensatory adjustments by modeling multi-granularity channel dependencies to counteract information loss. To prevent spatial misalignment under severe corruption, a Region-level Geometric Alignment Module imposes a geometric structure on the final descriptors, ensuring coarse-grained consistency.
Comprehensive experiments on both robustness benchmarks and standard datasets demonstrate that MRGeo not only achieves an average R@1 improvement of 2.92% across three comprehensive robustness benchmarks (CVUSA-C-ALL, CVACT_val-C-ALL, and CVACT_test-C-ALL) but also establishes superior performance in cross-area evaluation, thereby demonstrating its robustness and generalization capability.

[66] SDF-Net: Structure-Aware Disentangled Feature Learning for Optical-SAR Ship Re-identification

Furui Chen,Han Wang,Yuhan Sun,Jianing You,Yixuan Lv,Zhuang Zhou,Hong Tan,Shengyang Li

Main category: cs.CV

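TL;DR: This paper proposes SDF-Net, a structure-aware disentangled feature learning network that exploits the cross-modal stability of ship geometry to achieve robust ship re-identification between optical and SAR images.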

Details Motivation: Addresses the severe radiometric discrepancy between optical and SAR images caused by their different imaging mechanisms, and introduces as a physical prior the fact that ships are rigid bodies with stable geometric structure. Method: Built on a ViT backbone, the network adds a structure consistency constraint (extracting scale-invariant gradient energy statistics from intermediate layers), disentangles identity features (modality-invariant) from modality-specific features at the final stage, and combines the two through a parameter-free additive residual fusion. Result: Consistently outperforms existing state-of-the-art methods on the HOSS-ReID dataset. Conclusion: Geometric structure consistency is an effective physical prior for robust cross-modal ship ReID, and structure awareness combined with feature disentanglement mitigates the effect of modality differences. Abstract: Cross-modal ship re-identification (ReID) between optical and synthetic aperture radar (SAR) imagery is fundamentally challenged by the severe radiometric discrepancy between passive optical imaging and coherent active radar sensing. While existing approaches primarily rely on statistical distribution alignment or semantic matching, they often overlook a critical physical prior: ships are rigid objects whose geometric structures remain stable across sensing modalities, whereas texture appearance is highly modality-dependent. In this work, we propose SDF-Net, a Structure-Aware Disentangled Feature Learning Network that systematically incorporates geometric consistency into optical-SAR ship ReID. Built upon a ViT backbone, SDF-Net introduces a structure consistency constraint that extracts scale-invariant gradient energy statistics from intermediate layers to robustly anchor representations against radiometric variations. At the terminal stage, SDF-Net disentangles the learned representations into modality-invariant identity features and modality-specific characteristics. These decoupled cues are then integrated through a parameter-free additive residual fusion, effectively enhancing discriminative power. Extensive experiments on the HOSS-ReID dataset demonstrate that SDF-Net consistently outperforms existing state-of-the-art methods. The code and trained models are publicly available at https://github.com/cfrfree/SDF-Net.

[67] Neural Gate: Mitigating Privacy Risks in LVLMs via Neuron-Level Gradient Gating

Xiangkui Cao,Jie Zhang,Meina Kan,Shiguang Shan,Xilin Chen

Main category: cs.CV

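TL;DR: This paper proposes Neural Gate, which uses neuron-level model editing to raise the refusal rate of large vision-language models (LVLMs) on privacy-related questions, generalizes to unseen sensitive queries, and preserves the model's original performance.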

Details Motivation: Existing LVLMs do not refuse reliably when faced with privacy-leakage risks, and existing privacy-protection methods generalize poorly and can damage model performance. Method: Neural Gate learns a feature vector to locate neurons associated with privacy concepts and uses this localization to update model parameters precisely, giving neuron-level model editing. Result: Experiments on MiniGPT and LLaVA show a significant improvement in privacy protection without harming performance on standard tasks. Conclusion: Neural Gate is an effective, non-destructive method for mitigating LVLM privacy risks that generalizes well. Abstract: Large Vision-Language Models (LVLMs) have shown remarkable potential across a wide array of vision-language tasks, leading to their adoption in critical domains such as finance and healthcare. However, their growing deployment also introduces significant security and privacy risks. Malicious actors could potentially exploit these models to extract sensitive information, highlighting a critical vulnerability. Recent studies show that LVLMs often fail to consistently refuse instructions designed to compromise user privacy. While existing privacy-protection methods have made meaningful progress in preventing the leakage of sensitive data, they are constrained by limitations in both generalization and non-destructiveness. They often struggle to robustly handle unseen privacy-related queries and may inadvertently degrade a model's performance on standard tasks. To address these challenges, we introduce Neural Gate, a novel method for mitigating privacy risks through neuron-level model editing. Our method improves a model's privacy safeguards by increasing its rate of refusal for privacy-related questions, crucially extending this protective behavior to novel sensitive queries not encountered during the editing process. Neural Gate operates by learning a feature vector to identify neurons associated with privacy-related concepts within the model's representation of a subject. This localization then precisely guides the update of model parameters. Through comprehensive experiments on MiniGPT and LLaVA, we demonstrate that our method significantly boosts the model's privacy protection while preserving its original utility.

[68] A Prediction-as-Perception Framework for 3D Object Detection

Song Zhang,Haoyu Chen,Ruibo Wang

Main category: cs.CV

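TL;DR: This paper proposes the Prediction-As-Perception (PAP) framework, inspired by the human predict-then-perceive mechanism, to improve the accuracy and efficiency of 3D object perception. A prediction module and a perception module cooperate iteratively over consecutive frames; on nuScenes, PAP improves UniAD's target tracking accuracy by 10% and its inference speed by 15%.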

Details Motivation: Inspired by the biological mechanism in which humans predict where a moving target will be to assist visual perception, the work aims to improve the real-time performance and accuracy of 3D perception models in dynamic autonomous-driving scenes. Method: Proposes the PAP framework, consisting of a prediction module (which predicts the positions of the ego vehicle and traffic participants in future frames from the current frame's perception results) and a perception module (which uses the predicted positions as queries to guide perception in the next frame), with iterative feedback between the two; implemented and validated on the end-to-end model UniAD. Result: On nuScenes, PAP improves UniAD's target tracking accuracy by 10% and its inference speed by 15%. Conclusion: This biomimetic prediction-perception architecture improves the accuracy and efficiency of perception models while reducing computational resource consumption. Abstract: Humans combine prediction and perception to observe the world. When faced with rapidly moving birds or insects, we can only perceive them clearly by predicting their next position and focusing our gaze there. Inspired by this, this paper proposes the Prediction-As-Perception (PAP) framework, integrating a prediction-perception architecture into 3D object perception tasks to enhance the model's perceptual accuracy. The PAP framework consists of two main modules: prediction and perception, primarily utilizing continuous frame information as input. Firstly, the prediction module forecasts the potential future positions of ego vehicles and surrounding traffic participants based on the perception results of the current frame. These predicted positions are then passed as queries to the perception module of the subsequent frame. The perceived results are iteratively fed back into the prediction module. We evaluated the PAP structure using the end-to-end model UniAD on the nuScenes dataset. The results demonstrate that the PAP structure improves UniAD's target tracking accuracy by 10% and increases the inference speed by 15%. This indicates that such a biomimetic design significantly enhances the efficiency and accuracy of perception models while reducing computational resource consumption.

[69] A2Z-10M+: Geometric Deep Learning with A-to-Z BRep Annotations for AI-Assisted CAD Modeling and Reverse Engineering

Pritham Kumar Jena,Bhavika Baburaj,Tushar Anand,Vedant Dutta,Vineeth Ulavala,Sk Aziz Ali

Main category: cs.CV

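TL;DR: This paper presents the A2Z dataset, with 10 million multi-modal annotations for 1 million CAD models, to improve multi-modal understanding of boundary representations (BRep), and trains on it a foundation model that detects BRep co-edges and corners.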

Details Motivation: Existing geometric deep learning methods lack a multi-modal understanding of the boundary representation (BRep) of parametric CAD models, which limits reverse engineering and rapid prototyping of CAD models from 3D scans, sketches, or text. Method: Builds the large-scale multi-modal CAD dataset A2Z (high-resolution meshes, hand-drawn 3D sketches, BRep geometric and topological information, and textual descriptions), introduces new evaluation metrics, and merges in professionally designed CAD models of electronic enclosures; trains a foundation model on a 150K-model subset to detect BRep co-edges and corners. Result: Releases what the authors describe as the largest multi-modal CAD dataset to date (10 million annotations for 1 million models), shows its usefulness for BRep learning and retrieval tasks, and trains a foundation model that detects BRep co-edges and corners from 3D scans. Conclusion: The A2Z dataset advances multi-modal CAD understanding and deep learning on BRep, providing high-quality infrastructure and a benchmark for CAD reverse engineering, intelligent design, and cross-modal generation. Abstract: Reverse engineering and rapid prototyping of computer-aided design (CAD) models from 3D scans, sketches, or simple text prompts are vital in industrial product design. However, recent advances in geometric deep learning techniques lack a multi-modal understanding of parametric CAD features stored in their boundary representation (BRep). This study presents the largest compilation of 10 million multi-modal annotations and metadata for 1 million ABC CAD models, namely A2Z, to unlock an unprecedented level of BRep learning. A2Z comprises (i) high-resolution meshes with salient 3D scanning features, (ii) 3D hand-drawn sketches equipped with (iii) geometric and topological information about BRep co-edges, corners, and surfaces, and (iv) textual captions and tags describing the product in the mechanical world. Creating such carefully structured, large-scale data, which requires nearly 5 terabytes of storage to leverage unparalleled CAD learning/retrieval tasks, is very challenging. The scale, quality, and diversity of our multi-modal annotations are assessed using novel metrics, GPT-5, Gemini, and extensive human feedback mechanisms. To this end, we also merge an additional 25,000 CAD models of electronic enclosures (e.g., tablets, ports) designed by skilled professionals with our A2Z dataset. Subsequently, we train and benchmark a foundation model on a subset of 150K CAD models to detect BRep co-edges and corner vertices from 3D scans, a key downstream task in CAD reverse engineering.
The annotated dataset, metrics, and checkpoints will be publicly released to support numerous research directions.

[70] Mastering Negation: Boosting Grounding Models via Grouped Opposition-Based Learning

Zesheng Yang,Xi Jiang,Bingzhang Hu,Weili Guan,Runmin Cong,Guo-Jun Qi,Feng Zheng

Main category: cs.CV

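TL;DR: This paper introduces the D-Negation dataset and a grouped opposition-based learning framework to improve how vision-language grounding models handle negation, gaining up to 4.4 and 5.7 mAP on positive and negative semantic evaluations, respectively.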

Details Motivation: Existing vision-language detection and grounding models struggle to interpret and ground complex expressions that contain negation, mainly because high-quality training data with explicit negative samples and negation-aware descriptions is lacking. Method: Builds the D-Negation dataset (with positive and negative semantic annotations) and proposes a grouped opposition-based learning framework that organizes opposing semantic descriptions into structured groups and uses two complementary loss functions to push the model to reason about negation and semantic qualifiers. Result: With fewer than 10% of parameters fine-tuned, mAP improves by up to 4.4 and 5.7 on positive and negative semantic evaluations, respectively, showing that explicitly modeling negation improves robustness and localization accuracy. Conclusion: Explicit modeling of negation semantics is essential for better vision-language grounding, and the proposed dataset and learning framework offer an effective solution in this direction. Abstract: Current vision-language detection and grounding models predominantly focus on prompts with positive semantics and often struggle to accurately interpret and ground complex expressions containing negative semantics. A key reason for this limitation is the lack of high-quality training data that explicitly captures discriminative negative samples and negation-aware language descriptions. To address this challenge, we introduce D-Negation, a new dataset that provides objects annotated with both positive and negative semantic descriptions. Building upon the observation that negation reasoning frequently appears in natural language, we further propose a grouped opposition-based learning framework that learns negation-aware representations from limited samples. Specifically, our method organizes opposing semantic descriptions from D-Negation into structured groups and formulates two complementary loss functions that encourage the model to reason about negation and semantic qualifiers. We integrate the proposed dataset and learning strategy into a state-of-the-art language-based grounding model. By fine-tuning fewer than 10 percent of the model parameters, our approach achieves improvements of up to 4.4 mAP and 5.7 mAP on positive and negative semantic evaluations, respectively. These results demonstrate that explicitly modeling negation semantics can substantially enhance the robustness and localization accuracy of vision-language grounding models.

[71] Prompt-Driven Lightweight Foundation Model for Instance Segmentation-Based Fault Detection in Freight Trains

Guodong Sun,Qihang Liang,Xingyu Pan,Moyun Liu,Yang Zhang

Main category: cs.CV

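TL;DR: This paper proposes a lightweight self-prompted instance segmentation framework for freight train fault detection. By adapting the Segment Anything Model and pairing it with a Tiny Vision Transformer, it achieves accurate, robust, low-overhead real-time detection on a real-world dataset.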

Details Motivation: Visual fault detection in freight trains must cope with complex operating environments, structurally repetitive components, and frequent occlusion and contamination in critical regions; conventional CNN- and Transformer-based methods generalize poorly and have low boundary accuracy. Method: Proposes a self-prompted instance segmentation framework: a self-prompt generation module automatically produces task-specific prompts, transferring knowledge from the foundation model to the domain task, and a Tiny Vision Transformer backbone reduces computational cost. Result: Reaches 74.6 $AP^{\text{box}}$ and 74.2 $AP^{\text{mask}}$ on a self-built real-world freight inspection dataset, outperforming existing state-of-the-art methods with low computational overhead, which suits real-time deployment on edge devices. Conclusion: The method offers a deployable, efficient vision solution for automated freight train inspection and shows that adapting foundation models to industrial-scale fault diagnosis is feasible. Abstract: Accurate visual fault detection in freight trains remains a critical challenge for intelligent transportation system maintenance, due to complex operational environments, structurally repetitive components, and frequent occlusions or contaminations in safety-critical regions. Conventional instance segmentation methods based on convolutional neural networks and Transformers often suffer from poor generalization and limited boundary accuracy under such conditions. To address these challenges, we propose a lightweight self-prompted instance segmentation framework tailored for freight train fault detection. Our method leverages the Segment Anything Model by introducing a self-prompt generation module that automatically produces task-specific prompts, enabling effective knowledge transfer from foundation models to domain-specific inspection tasks. In addition, we adopt a Tiny Vision Transformer backbone to reduce computational cost, making the framework suitable for real-time deployment on edge devices in railway monitoring systems. We construct a domain-specific dataset collected from real-world freight inspection stations and conduct extensive evaluations. Experimental results show that our method achieves 74.6 $AP^{\text{box}}$ and 74.2 $AP^{\text{mask}}$ on the dataset, outperforming existing state-of-the-art methods in both accuracy and robustness while maintaining low computational overhead. This work offers a deployable and efficient vision solution for automated freight train inspection, demonstrating the potential of foundation model adaptation in industrial-scale fault diagnosis scenarios.
Project page: https://github.com/MVME-HBUT/SAM_FTI-FDet.git

[72] RoboStereo: Dual-Tower 4D Embodied World Models for Unified Policy Optimization

Ruicheng Zhang,Guangyu Chen,Zunnan Xu,Zihao Liu,Zhizhou Zhong,Mingyang Zhang,Jun Zhou,Xiu Li

Main category: cs.CV

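TL;DR: This paper proposes RoboStereo, a symmetric dual-tower 4D embodied world model that uses bidirectional cross-modal enhancement to improve geometric and physical consistency, and builds the first unified framework for world-model-based policy optimization, substantially improving fine-grained manipulation performance.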

Details Motivation: Real-world interaction is costly and risky, and existing embodied world models suffer from geometric hallucinations and lack a unified policy optimization framework. Method: Proposes the dual-tower 4D world model RoboStereo and combines Test-Time Policy Augmentation (TTPA), Imitative-Evolutionary Policy Learning (IEPL), and Open-Exploration Policy Learning (OEPL) into a unified optimization framework. Result: RoboStereo reaches state-of-the-art generation quality, and the unified framework delivers an average relative improvement of more than 97% on fine-grained manipulation tasks. Conclusion: RoboStereo and its unified policy optimization framework mitigate hallucination and move scalable embodied AI toward high-fidelity simulation and autonomous policy learning. Abstract: Scalable Embodied AI faces fundamental constraints due to prohibitive costs and safety risks of real-world interaction. While Embodied World Models (EWMs) offer promise through imagined rollouts, existing approaches suffer from geometric hallucinations and lack unified optimization frameworks for practical policy improvement. We introduce RoboStereo, a symmetric dual-tower 4D world model that employs bidirectional cross-modal enhancement to ensure spatiotemporal geometric consistency and alleviate physics hallucinations. Building upon this high-fidelity 4D simulator, we present the first unified framework for world-model-based policy optimization: (1) Test-Time Policy Augmentation (TTPA) for pre-execution verification, (2) Imitative-Evolutionary Policy Learning (IEPL) leveraging visual perceptual rewards to learn from expert demonstrations, and (3) Open-Exploration Policy Learning (OEPL) enabling autonomous skill discovery and self-correction. Comprehensive experiments demonstrate RoboStereo achieves state-of-the-art generation quality, with our unified framework delivering >97% average relative improvement on fine-grained manipulation tasks.

[73] LR-SGS: Robust LiDAR-Reflectance-Guided Salient Gaussian Splatting for Self-Driving Scene Reconstruction

Ziyu Chen,Fan Zhu,Hui Zhu,Deyi Kong,Xinkai Kuang,Yujia Zhang,Chunmao Jiang

Main category: cs.CV

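TL;DR: This paper proposes LiDAR-reflectance-guided Salient Gaussian Splatting (LR-SGS), which builds a structure-aware salient Gaussian representation from the geometric and reflectance features of LiDAR point clouds and jointly aligns the RGB and reflectance channels to improve reconstruction quality and robustness in self-driving scenes.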

Details Motivation: Existing 3D Gaussian Splatting methods do not fully exploit the reflectance information in LiDAR point clouds or the complementarity between LiDAR and RGB, so performance degrades in challenging self-driving scenes with high ego-motion or complex lighting. Method: Proposes LR-SGS: (1) initialize salient Gaussians from geometric and reflectance feature points extracted from LiDAR; (2) refine the Gaussian distribution with a salient transform and improved density control to capture edge and planar structures; (3) calibrate LiDAR intensity into reflectance and use it as a lighting-invariant material channel, jointly aligned with RGB to enforce boundary consistency. Result: On the Waymo Open Dataset, LR-SGS achieves better reconstruction with fewer Gaussians and shorter training time; on complex-lighting scenes its PSNR is 1.18 dB higher than OmniRe's. Conclusion: LR-SGS effectively fuses LiDAR reflectance with RGB information, improving the accuracy, robustness, and efficiency of 3D reconstruction in self-driving scenes and underlining the value of modeling multi-modal features jointly. Abstract: Recent 3D Gaussian Splatting (3DGS) methods have demonstrated the feasibility of self-driving scene reconstruction and novel view synthesis. However, most existing methods either rely solely on cameras or use LiDAR only for Gaussian initialization or depth supervision, while the rich scene information contained in point clouds, such as reflectance, and the complementarity between LiDAR and RGB have not been fully exploited, leading to degradation in challenging self-driving scenes, such as those with high ego-motion and complex lighting. To address these issues, we propose a robust and efficient LiDAR-reflectance-guided Salient Gaussian Splatting method (LR-SGS) for self-driving scenes, which introduces a structure-aware Salient Gaussian representation, initialized from geometric and reflectance feature points extracted from LiDAR and refined through a salient transform and improved density control to capture edge and planar structures. Furthermore, we calibrate LiDAR intensity into reflectance and attach it to each Gaussian as a lighting-invariant material channel, jointly aligned with RGB to enforce boundary consistency. Extensive experiments on the Waymo Open Dataset demonstrate that LR-SGS achieves superior reconstruction performance with fewer Gaussians and shorter training time. In particular, on Complex Lighting scenes, our method surpasses OmniRe by 1.18 dB PSNR.

[74] From Sparse to Dense: Multi-View GRPO for Flow Models via Augmented Condition Space

Jiazi Bu,Pengyang Ling,Yujie Zhou,Yibin Wang,Yuhang Zang,Tianyi Wei,Xiaohang Zhan,Jiaqi Wang,Tong Wu,Xingang Pan,Dahua Lin

Main category: cs.CV

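TL;DR: This paper proposes Multi-View GRPO (MV-GRPO), which augments the condition space by generating semantically adjacent yet diverse captions, enabling multi-view reward modeling and advantage re-estimation for samples generated by T2I models. This improves preference alignment without extra sampling.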

Details Motivation: Standard GRPO evaluates a single group of samples against a single condition, which under-explores the relationships among samples and limits both alignment quality and the performance ceiling. Method: Proposes MV-GRPO: a flexible Condition Enhancer generates semantically close but diverse captions for the same prompt; advantages are re-estimated from these multiple views, and the probability distribution of the original samples conditioned on the new captions is used in training, which avoids resampling. Result: In extensive experiments, MV-GRPO outperforms state-of-the-art methods in preference alignment for text-to-image generation. Conclusion: Multi-view condition augmentation mines inter-sample relationships and supplies richer optimization signals, making it an effective new paradigm for improving T2I alignment. Abstract: Group Relative Policy Optimization (GRPO) has emerged as a powerful framework for preference alignment in text-to-image (T2I) flow models. However, we observe that the standard paradigm, which evaluates a group of generated samples against a single condition, suffers from insufficient exploration of inter-sample relationships, constraining both alignment efficacy and performance ceilings. To address this sparse single-view evaluation scheme, we propose Multi-View GRPO (MV-GRPO), a novel approach that enhances relationship exploration by augmenting the condition space to create a dense multi-view reward mapping. Specifically, for a group of samples generated from one prompt, MV-GRPO leverages a flexible Condition Enhancer to generate semantically adjacent yet diverse captions. These captions enable multi-view advantage re-estimation, capturing diverse semantic attributes and providing richer optimization signals. By deriving the probability distribution of the original samples conditioned on these new captions, we can incorporate them into the training process without costly sample regeneration. Extensive experiments demonstrate that MV-GRPO achieves superior alignment performance over state-of-the-art methods.
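
As a rough illustration of multi-view advantage re-estimation, the sketch below computes the standard group-relative advantage (reward minus group mean, divided by group standard deviation) once per caption and then averages across captions. The averaging step and all function names are assumptions made for illustration; the abstract does not say how MV-GRPO combines the views.

```python
import math

def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize rewards within one group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

def multi_view_advantages(rewards_per_view):
    """Average per-caption advantages for each sample.

    rewards_per_view[v][i] is the reward of sample i scored against caption v.
    Averaging across views is an assumption, not the paper's stated rule.
    """
    per_view = [group_advantages(r) for r in rewards_per_view]
    num_samples = len(per_view[0])
    return [
        sum(view[i] for view in per_view) / len(per_view)
        for i in range(num_samples)
    ]
```

The same set of generated samples is reused for every caption, so only the reward scoring is repeated, which mirrors the abstract's point about avoiding sample regeneration.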

[75] VGGT-World: Transforming VGGT into an Autoregressive Geometry World Model

Xiangyu Sun,Shijie Wang,Fengyi Zhang,Lin Liu,Caiyan Jia,Ziying Song,Zi Huang,Yadan Luo

Main category: cs.CV

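TL;DR: This paper proposes VGGT-World, a world model that does not generate video frames but instead directly predicts how the features of a frozen geometry foundation model (GFM) evolve. It clearly outperforms baselines on depth forecasting while being lighter and more efficient.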

Details Motivation: World models based on video generation spend most of their compute on pixel detail, yet their predictions are often geometrically inconsistent; a more geometrically consistent and efficient form of world modeling is needed. Method: Treats the latent tokens of a frozen VGGT model as the world state and trains a lightweight temporal flow transformer to predict their evolution autoregressively; introduces a z-prediction parameterization to fix the collapse of flow matching in high-dimensional space and a two-stage latent flow-forcing curriculum to reduce exposure bias. Result: On KITTI, Cityscapes, and TartanAir, VGGT-World significantly outperforms the strongest baselines in depth forecasting, runs 3.6 to 5 times faster, and needs only 0.43B trainable parameters. Conclusion: Frozen GFM features are an effective and efficient state representation for world models, and VGGT-World offers a new paradigm for 3D world modeling. Abstract: World models that forecast scene evolution by generating future video frames devote the bulk of their capacity to photometric details, yet the resulting predictions often remain geometrically inconsistent. We present VGGT-World, a geometry world model that side-steps video generation entirely and instead forecasts the temporal evolution of frozen geometry-foundation-model (GFM) features. Concretely, we repurpose the latent tokens of a frozen VGGT as the world state and train a lightweight temporal flow transformer to autoregressively predict their future trajectory. Two technical challenges arise in this high-dimensional (d=1024) feature space: (i) standard velocity-prediction flow matching collapses, and (ii) autoregressive rollout suffers from compounding exposure bias. We address the first with a clean-target (z-prediction) parameterization that yields a substantially higher signal-to-noise ratio, and the second with a two-stage latent flow-forcing curriculum that progressively conditions the model on its own partially denoised rollouts. Experiments on KITTI, Cityscapes, and TartanAir demonstrate that VGGT-World significantly outperforms the strongest baselines in depth forecasting while running 3.6-5 times faster with only 0.43B trainable parameters, establishing frozen GFM features as an effective and efficient predictive state for 3D world modeling.
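
The relationship between velocity prediction and clean-target (z-prediction) parameterization can be made concrete with a small sketch. It assumes the common linear interpolation path x_t = (1 - t) * noise + t * z, under which the velocity z - noise equals (z - x_t) / (1 - t); the path convention, the clamp `eps`, and the function names are assumptions and may differ from the paper's formulation.

```python
def interpolate(noise, clean, t):
    """Point on the assumed linear path: x_t = (1 - t) * noise + t * clean."""
    return [(1.0 - t) * n + t * z for n, z in zip(noise, clean)]

def velocity_from_clean(x_t, z_hat, t, eps=1e-5):
    """Recover the flow velocity from a clean-target prediction z_hat.

    On the linear path, z - x_t = (1 - t) * (z - noise), so
    v = (z_hat - x_t) / (1 - t). The denominator is clamped near t = 1.
    """
    denom = max(1.0 - t, eps)
    return [(z - x) / denom for x, z in zip(x_t, z_hat)]

def euler_step(x_t, z_hat, t, dt):
    """One Euler integration step using the velocity implied by z_hat."""
    v = velocity_from_clean(x_t, z_hat, t)
    return [x + dt * vi for x, vi in zip(x_t, v)]
```

Here the network would output `z_hat` directly, and the velocity needed by the sampler is derived from it rather than regressed as the training target.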

[76] VFM-Recon: Unlocking Cross-Domain Scene-Level Neural Reconstruction with Scale-Aligned Foundation Priors

Yuhang Ming,Tingkang Xi,Xingrui Yang,Lixin Yang,Yong Peng,Cewu Lu,Wanzeng Kong

Main category: cs.CV

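TL;DR: This paper proposes VFM-Recon, the first attempt to combine the transferable priors of vision foundation models (VFMs) with the scale consistency required by scene-level neural volumetric reconstruction. Using lightweight scale alignment and task adapters, it reaches state-of-the-art performance on several datasets.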

Details Motivation: Scene-level neural volumetric reconstruction from monocular video remains difficult under severe domain shift; existing vision foundation models provide generalizable priors, but their scale-ambiguous predictions are incompatible with the scale consistency that volumetric fusion requires. Method: Proposes VFM-Recon: (1) a lightweight multi-view scale alignment stage restores scale consistency; (2) lightweight task-specific adapters integrate pretrained VFM features into the neural volumetric reconstruction pipeline and are trained for reconstruction while preserving cross-domain robustness. Result: State of the art on ScanNet (in-distribution) and on TUM RGB-D and Tanks and Temples (out-of-distribution); on the outdoor Tanks and Temples dataset, the mesh reconstruction F1 reaches 70.1, well ahead of the next-best method, VGGT (51.8). Conclusion: VFM-Recon bridges the gap between VFM priors and the scale consistency needed for volumetric reconstruction, supporting the effectiveness and generalization of combining general visual priors with task-specific geometric reconstruction design. Abstract: Scene-level neural volumetric reconstruction from monocular videos remains challenging, especially under severe domain shifts. Although recent advances in vision foundation models (VFMs) provide transferable generalized priors learned from large-scale data, their scale-ambiguous predictions are incompatible with the scale consistency required by volumetric fusion. To address this gap, we present VFM-Recon, the first attempt to bridge transferable VFM priors with scale-consistent requirements in scene-level neural reconstruction. Specifically, we first introduce a lightweight scale alignment stage that restores multi-view scale coherence. We then integrate pretrained VFM features into the neural volumetric reconstruction pipeline via lightweight task-specific adapters, which are trained for reconstruction while preserving the cross-domain robustness of pretrained representations. We train our model on the ScanNet train split and evaluate on both the in-distribution ScanNet test split and the out-of-distribution TUM RGB-D and Tanks and Temples datasets. The results demonstrate that our model achieves state-of-the-art performance across all dataset domains. In particular, on the challenging outdoor Tanks and Temples dataset, our model achieves an F1 score of 70.1 in reconstructed mesh evaluation, substantially outperforming the closest competitor, VGGT, which only attains 51.8.
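
To make the scale-ambiguity issue concrete, the sketch below shows a generic closed-form least-squares scale fit between predicted depths and reference depths. This is a textbook technique offered only as an illustration; the abstract does not describe how the paper's scale alignment stage works, and the notion of "reference depths" (for example, from an overlapping view) is a hypothetical input.

```python
def least_squares_scale(pred_depths, ref_depths):
    """Scale s minimizing sum_i (s * pred_i - ref_i)^2.

    Closed form: s = sum(pred_i * ref_i) / sum(pred_i ** 2).
    """
    num = sum(p * r for p, r in zip(pred_depths, ref_depths))
    den = sum(p * p for p in pred_depths)
    if den == 0.0:
        raise ValueError("predicted depths are all zero; scale is undefined")
    return num / den

def align_depths(pred_depths, ref_depths):
    """Rescale predictions so they share the reference's metric scale."""
    s = least_squares_scale(pred_depths, ref_depths)
    return [s * p for p in pred_depths]
```

Without a step of this kind, depth maps from different frames could disagree by an arbitrary factor, which is what makes direct volumetric fusion unreliable.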

[77] AVION: Aerial Vision-Language Instruction from Offline Teacher to Prompt-Tuned Network

Yu Hu,Jianyang Gu,Hao Liu,Yue Cao,Jozsef Hamari,Zheng Liu,Mohsen Zardadi

Main category: cs.CV

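TL;DR: This paper proposes AVION, a knowledge distillation framework that addresses the limited semantic coverage and weak visual-feature adaptability of vision-language models on remote sensing imagery, improving few-shot classification, base-class accuracy, and cross-modal retrieval on several remote sensing benchmarks.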

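Details Motivation: Remote sensing images, especially aerial scenes, show diverse visual appearances and fine-grained object differences, while existing vision-language models are held back by limited semantic coverage in their text representations and insufficient adaptability of visual features. Method: Proposes the AVION knowledge distillation framework: the teacher module generates textual prototypes with a large language model and verifies them against remote sensing image features; the student module embeds lightweight learnable prompts in both the vision and language encoders and, guided by the teacher, aligns embeddings and cross-modal relationships. Result: On six optical remote sensing benchmarks, AVION improves few-shot classification and base-class accuracy without harming generalization to novel classes, and raises mean recall for cross-modal retrieval while adding very few trainable parameters. Conclusion: AVION is an efficient, lightweight adaptation method for remote sensing vision-language models that generalizes well, supporting the combination of knowledge distillation with learnable prompts. Abstract: Adapting vision-language models to remote sensing imagery remains challenging due to two key factors: limited semantic coverage in textual representations and insufficient adaptability of visual features. These issues are particularly significant in aerial scenes, which involve various visual appearances and fine-grained object distinctions. We propose AVION, a knowledge distillation framework tailored for remote sensing adaptation of vision-language models. The teacher module constructs semantically rich textual prototypes by collecting descriptions from a large language model and verifying validity using remote sensing image features. The student module integrates lightweight and learnable prompts into both vision and language encoders, guided by the teacher to align embeddings and their cross-modal relationships. Once trained, the student operates independently during inference. Experiments on six optical remote sensing benchmarks show that AVION improves few-shot classification and base-class accuracy without degrading generalization to novel categories. It also enhances mean recall for cross-modal retrieval, with minimal additional trainable parameters.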

[78] Learning Geometric and Photometric Features from Panoramic LiDAR Scans for Outdoor Place Categorization

Kazuto Nakashima,Hojung Jung,Yuki Oto,Yumi Iwashita,Ryo Kurazume,Oscar Martinez Mozos

Main category: cs.CV

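TL;DR: This paper proposes a convolutional neural network (CNN) method for semantic categorization of outdoor places that takes omnidirectional depth/reflectance images from 3D LiDAR as input, and builds the large-scale Multi-modal Panoramic 3D Outdoor (MPO) dataset for validation.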

Details Motivation: Outdoor places are harder to categorize than indoor ones because of perceptual variation such as large lighting changes and frequent occlusion, yet semantic place categorization is essential for navigation and decision-making in autonomous robots and vehicles. Method: Proposes a CNN-based method whose input is omnidirectional depth/reflectance images obtained by LiDAR; builds the large-scale multi-modal MPO dataset with six category labels; visualizes and analyzes the learned features. Result: Results on the MPO dataset outperform traditional approaches and show the benefit of combining the depth and reflectance modalities. Conclusion: The method improves semantic categorization of outdoor places and supports the feasibility and advantages of combining multi-modal LiDAR data with CNNs. Abstract: Semantic place categorization, which is one of the essential tasks for autonomous robots and vehicles, allows them to have capabilities of self-decision and navigation in unfamiliar environments. In particular, outdoor places are more difficult targets than indoor ones due to perceptual variations, such as dynamic illuminance over twenty-four hours and occlusions by cars and pedestrians. This paper presents a novel method of categorizing outdoor places using convolutional neural networks (CNNs), which take omnidirectional depth/reflectance images obtained by 3D LiDARs as the inputs. First, we construct a large-scale outdoor place dataset named Multi-modal Panoramic 3D Outdoor (MPO) comprising two types of point clouds captured by two different LiDARs. They are labeled with six outdoor place categories: coast, forest, indoor/outdoor parking, residential area, and urban area. Second, we provide CNNs for LiDAR-based outdoor place categorization and evaluate our approach with the MPO dataset. Our results on the MPO dataset outperform traditional approaches and show the effectiveness of using both depth and reflectance modalities. To analyze our trained deep networks, we visualize the learned features.

[79] Marker-Based 3D Reconstruction of Aggregates with a Comparative Analysis of 2D and 3D Morphologies

Haohang Huang,Jiayi Luo,Issam Qamhia,Erol Tutumluer,John M. Hart,Andrew J. Stolba

Main category: cs.CV

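TL;DR: This paper proposes a low-cost, flexible photogrammetry-based method for 3D reconstruction of aggregate particles. A marker-based design provides background suppression, point cloud stitching, and scale referencing; the accuracy is validated, and significant differences are found between 2D and 3D morphological statistics.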

Details Motivation: Full 3D quantification of aggregate morphology is hard to achieve in quarries and on construction sites; existing 2D image-based methods and expensive 3D scanning methods (such as laser scanning and CT) both have limitations, so a low-cost, easily deployed 3D reconstruction approach is needed. Method: Uses marker-based photogrammetry, with multi-view image capture, background suppression, automatic point cloud stitching, and scale referencing, to reconstruct aggregate particles in 3D; validates accuracy against ground truth and compares 2D and 3D morphological parameters. Result: The reconstructions are accurate when validated against ground truth; 2D and 3D morphological statistics differ significantly; the method makes 3D shape information for aggregates easy to obtain, supporting rapid on-site inspection and morphological analysis. Conclusion: The proposed photogrammetry approach is an economical, flexible, and practical way to characterize 3D aggregate morphology and could improve aggregate quality control and digital modeling in engineering practice. Abstract: Aggregates, serving as the main skeleton in assemblies of construction materials, are important functional components in various building and transportation infrastructures. They can be used in unbound layer applications, e.g. pavement base and railroad ballast, bound applications of cement concrete and asphalt concrete, and as riprap and large-sized primary crushed rocks. Information on the size and shape or morphology of aggregates can greatly facilitate the Quality Assurance/Quality Control (QA/QC) process by providing insights into aggregate behavior during composition and packing. A full 3D characterization of aggregate particle morphology is difficult both during production in a quarry and at a construction site. Many aggregate imaging approaches have been developed to quantify the particle morphology by computer vision, including 2D image-based approaches that analyze particle silhouettes and 3D scanning-based methods that require expensive devices such as 3D laser scanners or X-Ray Computed Tomography (CT) equipment. This paper presents a flexible and cost-effective photogrammetry-based approach for the 3D reconstruction of aggregate particles. The proposed approach follows a marker-based design that enables background suppression, point cloud stitching, and scale referencing to obtain high-quality aggregate models. The accuracy of the reconstruction results was validated against ground-truth for selected aggregate samples. Comparative analyses were conducted on 2D and 3D morphological properties of the selected samples. Significant differences were found between the 2D and 3D statistics.
Based on the presented approach, 3D shape information of aggregates can be obtained easily and at a low cost, thus allowing convenient aggregate inspection, data collection, and 3D morphological analysis.
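
Scale referencing from markers can be sketched as follows. Photogrammetric reconstructions are defined only up to scale, so a known physical distance between two markers fixes the conversion to real units. The marker coordinates, the millimetre unit, and the function names below are hypothetical and only illustrate the idea; they are not taken from the paper.

```python
import math

def scale_factor(marker_a, marker_b, known_distance_mm):
    """Ratio between a known marker spacing and its reconstructed length."""
    reconstructed = math.dist(marker_a, marker_b)
    if reconstructed == 0.0:
        raise ValueError("markers coincide in the reconstruction")
    return known_distance_mm / reconstructed

def rescale(points, factor):
    """Apply a uniform scale to every 3D point of the reconstructed model."""
    return [tuple(c * factor for c in p) for p in points]
```

For example, if two markers that are 50 mm apart come out 5 units apart in the reconstruction, every coordinate is multiplied by 10 to express the model in millimetres.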

[80] Vision Verification Enhanced Fusion of VLMs for Efficient Visual Reasoning

Selim Furkan Tekin,Yichang Xu,Gaowen Liu,Ramana Rao Kompella,Margaret L. Loper,Ling Liu

Main category: cs.CV

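TL;DR: This paper proposes V3Fusion, which uses both vision and language modalities to select and fuse VLMs. It introduces focal error diversity and the CKA-focal metric and optimizes the model combination with a genetic algorithm, yielding clear gains on several VLM benchmarks and remaining robust even when there is no majority consensus or when most models are wrong.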

Details Motivation: Prior work mostly ensembles VLMs using the language modality alone; this paper asks how to use both vision and language information for more effective model selection and fusion, addressing insufficient complementarity between models, hallucination, and uncertainty modeling. Method: Introduces focal error diversity to characterize complementary reasoning across VLMs; designs the CKA-focal metric, based on centered kernel alignment (CKA), to quantify disagreement between visual embeddings; builds an ensemble surface over a pool of candidate VLMs and prunes unhelpful components with a genetic algorithm; dynamically fuses the outputs of heterogeneous VLMs. Result: Outperforms the best single VLM on the four benchmarks A-OKVQA, MMMU, MMMU-Pro, and OCR-VQA, with gains of 8.09% on MMMU and 4.87% on MMMU-Pro; on generative tasks it beats Intern-VL2-8b and Qwen2.5-VL-7b. Conclusion: Through dual-modality focal diversity modeling and evolutionary model selection, V3Fusion achieves robust, low-hallucination VLM fusion that still generalizes when there is no majority consensus, offering a new paradigm for multi-model collaborative reasoning. Abstract: With the growing number and diversity of Vision-Language Models (VLMs), many works explore language-based ensemble, collaboration, and routing techniques across multiple VLMs to improve multi-model reasoning. In contrast, we address diverse model selection using both vision and language modalities. We introduce focal error diversity to capture complementary reasoning across VLMs and a CKA-based focal diversity metric (CKA-focal) to measure disagreement in their visual embeddings. On the ensemble surface constructed from a pool of candidate VLMs, we apply a Genetic Algorithm to effectively prune out those component VLMs that do not add value to the fusion performance. We identify the best combination for each task, fuse the outputs of each VLM in the model pool, and show that heterogeneous models can capture epistemic uncertainty dynamically and mitigate hallucinations. Our V3Fusion approach is capable of producing dual focal-diversity fused predictions with high performance for vision-language reasoning, even when there is no majority consensus or the majority of VLMs make incorrect predictions. Extensive experiments validate V3Fusion on four popular VLM benchmarks (A-OKVQA, MMMU, MMMU-Pro, and OCR-VQA). The results show that V3Fusion outperforms the best-performing VLM in accuracy by 8.09% on MMMU and by 4.87% on MMMU-Pro. For generative tasks, V3Fusion outperforms Intern-VL2-8b and Qwen2.5-VL-7b, the top-2 VLM performers on both A-OKVQA and OCR-VQA.
Our code and datasets are available at https://github.com/sftekin/v3fusion.
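As a rough illustration of the kind of quantity CKA-focal builds on, the sketch below computes plain linear centered kernel alignment between two embedding matrices. This is standard linear CKA, not the paper's CKA-focal metric, whose exact definition the abstract does not give; the function names and the list-of-rows input format are assumptions made for this example.

```python
import math

def center_columns(m):
    """Subtract each column's mean so every feature is zero-centered."""
    n = len(m)
    means = [sum(row[j] for row in m) / n for j in range(len(m[0]))]
    return [[row[j] - means[j] for j in range(len(row))] for row in m]

def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for row-major a (n x p) and b (n x q)."""
    total = 0.0
    for i in range(len(a[0])):
        for j in range(len(b[0])):
            dot = sum(a[k][i] * b[k][j] for k in range(len(a)))
            total += dot * dot
    return total

def linear_cka(x, y):
    """Linear CKA between two embedding matrices that share the same rows (samples).

    Returns a value in [0, 1]; 1 means the representations match up to
    rotation and isotropic scaling. Assumes neither matrix is constant,
    otherwise the denominator is zero.
    """
    xc, yc = center_columns(x), center_columns(y)
    num = cross_frobenius_sq(xc, yc)
    den = math.sqrt(cross_frobenius_sq(xc, xc) * cross_frobenius_sq(yc, yc))
    return num / den
```

A disagreement score between two models' visual embeddings could then be taken as `1 - linear_cka(x, y)`; whether V3Fusion does exactly this is not stated in the abstract.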

[81] G2HFNet: GeoGran-Aware Hierarchical Feature Fusion Network for Salient Object Detection in Optical Remote Sensing Images

Bin Wan,Runmin Cong,Xiaofei Zhou,Hao Fang,Chengtao Lv,Sam Kwong

Main category: cs.CV

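TL;DR: This paper proposes G2HFNet, which uses a Swin Transformer backbone together with multi-scale detail enhancement, dual-branch geo-gran complementary, deep semantic perception, and local-global guidance fusion modules to improve salient object detection in remote sensing images.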

Details Motivation: Remote sensing images show large scale variations and complex backgrounds. Existing methods extract multi-level features at a single scale with uniform attention mechanisms, which leads to poor representations and incomplete detections. Method: Proposes the GeoGran-Aware Hierarchical Feature Fusion Network (G2HFNet), which uses a Swin Transformer backbone and integrates the multi-scale detail enhancement (MDE), dual-branch geo-gran complementary (DGC), deep semantic perception (DSP), and local-global guidance fusion (LGF) modules. Result: Experiments show that G2HFNet produces high-quality saliency maps and significantly improves detection performance in challenging remote sensing scenes. Conclusion: By fully exploiting geometric and granular cues in remote sensing images, G2HFNet eases the detection difficulties caused by scale variation and complex backgrounds, improving the accuracy and robustness of salient object detection. Abstract: Remote sensing images captured from aerial perspectives often exhibit significant scale variations and complex backgrounds, posing challenges for salient object detection (SOD). Existing methods typically extract multi-level features at a single scale using uniform attention mechanisms, leading to suboptimal representations and incomplete detection results. To address these issues, we propose a GeoGran-Aware Hierarchical Feature Fusion Network (G2HFNet) that fully exploits geometric and granular cues in optical remote sensing images. Specifically, G2HFNet adopts Swin Transformer as the backbone to extract multi-level features and integrates three key modules: the multi-scale detail enhancement (MDE) module to handle object scale variations and enrich fine details, the dual-branch geo-gran complementary (DGC) module to jointly capture fine-grained details and positional information in mid-level features, and the deep semantic perception (DSP) module to refine high-level positional cues via self-attention. Additionally, a local-global guidance fusion (LGF) module is introduced to replace traditional convolutions for effective multi-level feature integration. Extensive experiments demonstrate that G2HFNet achieves high-quality saliency maps and significantly improves detection performance in challenging remote sensing scenarios.

[82] RSONet: Region-guided Selective Optimization Network for RGB-T Salient Object Detection

Bin Wan,Runmin Cong,Xiaofei Zhou,Hao Fang,Chengtao Lv,Sam Kwong

Main category: cs.CV

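TL;DR: This paper proposes a Region-guided Selective Optimization Network (RSONet) for RGB-T salient object detection. Its two stages, region guidance and saliency generation, combine several modules (CI, SF, SO, DDE, MIS) to mitigate the inconsistency of salient regions between RGB and thermal images, and it achieves competitive performance against 27 SOTA methods on the RGB-T dataset.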

Details Motivation: Address the inconsistent distribution of salient regions between RGB and thermal images. Method: Proposes the Region-guided Selective Optimization Network (RSONet), with a region guidance stage (the CI and SF modules generate guidance maps, from which similarity scores are computed) and a saliency generation stage (the SO module fuses the two modalities' features based on the similarity scores; the DDE module enhances low-level detail; the MIS module mines high-level location cues). Result: Experiments on the RGB-T dataset show that RSONet achieves competitive performance against 27 existing state-of-the-art salient object detection methods. Conclusion: Through the coordinated design of its modules, RSONet effectively mitigates cross-modal inconsistency of salient regions and improves the accuracy and detail quality of RGB-T salient object detection. Abstract: This paper focuses on the inconsistency in salient regions between RGB and thermal images. To address this issue, we propose the Region-guided Selective Optimization Network for RGB-T Salient Object Detection, which consists of the region guidance stage and saliency generation stage. In the region guidance stage, three parallel branches with the same encoder-decoder structure, equipped with the context interaction (CI) module and spatial-aware fusion (SF) module, are designed to generate the guidance maps, which are leveraged to calculate similarity scores. Then, in the saliency generation stage, the selective optimization (SO) module fuses RGB and thermal features based on the previously obtained similarity values to mitigate the impact of the inconsistent distribution of salient targets between the two modalities. After that, to generate high-quality detection results, the dense detail enhancement (DDE) module, which adopts multiple dense connections and visual state space blocks, is applied to low-level features to optimize the detail information. In addition, the mutual interaction semantic (MIS) module is placed in the high-level features to mine location cues through a mutual fusion strategy. We conduct extensive experiments on the RGB-T dataset, and the results demonstrate that the proposed RSONet achieves competitive performance against 27 state-of-the-art SOD methods.

[83] STRAP-ViT: Segregated Tokens with Randomized -- Transformations for Defense against Adversarial Patches in ViTs

Nandish Chattopadhyay,Anadi Goyal,Chandan Karfa,Anupam Chattopadhyay

Main category: cs.CV

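TL;DR: This paper proposes STRAP-ViT, a training-free, plug-and-play adversarial defense for ViTs. It detects anomalous tokens with Jensen-Shannon divergence and weakens the adversarial patch with randomized composite transformations, keeping accuracy close to clean-sample accuracy across several models, datasets, and attacks.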

Details Motivation: Adversarial patches can hijack ViT self-attention and cause misclassification. The authors observe that tokens covering the perturbed region have statistical properties that differ from those of normal tokens, and design detection and mitigation around this. Method: STRAP-ViT identifies anomalous tokens with Jensen-Shannon divergence in the detection phase; in the mitigation phase it applies randomized composite transformations to the minimum number of tokens that cover at least 50% of the patch; it is integrated into the ViT inference pipeline as a non-trainable plug-in. Result: On ViT-base-16 and DinoV2 with ImageNet and CalTech-101, against Adversarial Patch, LAVAN, GDPA, and RP2, robust accuracy is only 2-3% below the clean baseline and better than the current state of the art. Conclusion: STRAP-ViT is a lightweight, general, plug-and-play adversarial defense for ViTs that needs no retraining, has low computational overhead, and clearly improves robustness across scenarios. Abstract: Adversarial patches are physically realizable localized noise that can hijack the self-attention of Vision Transformers (ViTs), pulling focus toward a small, high-contrast region and corrupting the class token to force confident misclassifications. In this paper, we claim that the tokens which correspond to the areas of the image that contain the adversarial noise have different statistical properties when compared to the tokens which do not overlap with the adversarial perturbations. We use this insight to propose a mechanism, called STRAP-ViT, which uses Jensen-Shannon Divergence as a metric for segregating tokens that behave as anomalies in the Detection Phase, and then applies randomized composite transformations on them during the Mitigation Phase to make the adversarial noise ineffective. The minimum number of tokens to transform is a hyper-parameter for the defense mechanism and is chosen such that at least 50% of the patch is covered by the transformed tokens. STRAP-ViT fits as a non-trainable plug-and-play block within the ViT architectures, for inference purposes only, with a minimal computational cost, and does not require any additional training cost/effort. STRAP-ViT has been tested on multiple pre-trained vision transformer architectures (ViT-base-16 and DinoV2) and datasets (ImageNet and CalTech-101), across multiple adversarial attacks (Adversarial Patch, LAVAN, GDPA and RP2), and found to provide excellent robust accuracies lying within a 2-3% range of the clean baselines, and to outperform the state-of-the-art.
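The detection idea can be sketched in a few lines, under assumptions the abstract does not confirm: each token vector is turned into a distribution with a softmax, the reference distribution is the mean over all tokens, and the `k` highest-scoring tokens are flagged. The names `flag_anomalous_tokens` and `k` are hypothetical and chosen only for this illustration; this is not the authors' implementation.

```python
import math

def softmax(values):
    """Turn a token vector into a probability distribution (an assumption here)."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    """KL divergence KL(p || q) in nats; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by log(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def flag_anomalous_tokens(tokens, k):
    """Score each token against the mean token distribution; return the top-k indices."""
    dists = [softmax(t) for t in tokens]
    dim = len(dists[0])
    ref = [sum(d[j] for d in dists) / len(dists) for j in range(dim)]
    scores = [js_divergence(d, ref) for d in dists]
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k]), scores
```

In the abstract's terms, `k` plays the role of the hyper-parameter for the minimum number of tokens to transform; the flagged indices would then be handed to the mitigation step.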

[84] CM-Bench: A Comprehensive Cross-Modal Feature Matching Benchmark Bridging Visible and Infrared Images

Liangzheng Sun,Mengfan He,Xingyu Shao,Binbin Li,Zhiqiang Yan,Chunyu Li,Ziyang Meng,Fei Xing

Main category: cs.CV

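TL;DR: This paper presents CM-Bench, a comprehensive cross-modal feature matching benchmark covering 30 algorithms, evaluation on several tasks, and a new infrared-satellite dataset, and it introduces an adaptive preprocessing front-end.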

Details Motivation: Cross-modal (e.g. infrared-visible) feature matching is hard because of the large appearance gap between modalities, and the field lacks standardized benchmarks and metrics. Method: Builds the CM-Bench benchmark, which summarizes and categorizes 30 sparse, semi-dense, and dense matching algorithms; designs a classification-network-based adaptive preprocessing front-end; constructs an infrared-satellite cross-modal dataset with manually annotated ground truth; and evaluates uniformly on homography estimation, relative pose estimation, and geo-localization. Result: Delivers CM-Bench, a comprehensive cross-modal feature matching benchmark with a unified evaluation framework, a new dataset, an adaptive preprocessing method, and open-source resources. Conclusion: CM-Bench fills the gap in standardized evaluation for cross-modal matching, gives a solid basis for developing and comparing algorithms, and supports localization and perception research in practical infrared-visible and infrared-satellite settings. Abstract: Infrared-visible (IR-VIS) feature matching plays an essential role in cross-modality visual localization, navigation and perception. Along with the rapid development of deep learning techniques, a number of representative image matching methods have been proposed. However, cross-modal feature matching is still a challenging task due to the significant appearance difference. A significant gap for cross-modal feature matching research lies in the absence of standardized benchmarks and metrics for evaluations. In this paper, we introduce a comprehensive cross-modal feature matching benchmark, CM-Bench, which encompasses 30 feature matching algorithms across diverse cross-modal datasets. Specifically, state-of-the-art traditional and deep learning-based methods are first summarized and categorized into sparse, semi-dense, and dense methods. These methods are evaluated on different tasks including homography estimation, relative pose estimation, and feature-matching-based geo-localization. In addition, we introduce a classification-network-based adaptive preprocessing front-end that automatically selects suitable enhancement strategies before matching. We also present a novel infrared-satellite cross-modal dataset with manually annotated ground-truth correspondences for practical geo-localization evaluation. The dataset and resources will be available at: https://github.com/SLZ98/CM-Bench.

[85] MoKus: Leveraging Cross-Modal Knowledge Transfer for Knowledge-Aware Concept Customization

Chenyang Zhu,Hongxiang Li,Xiu Li,Long Chen

Main category: cs.CV

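TL;DR: This paper introduces a new task, Knowledge-aware Concept Customization, which binds diverse textual knowledge to a target visual concept. It designs the MoKus framework, which uses cross-modal knowledge transfer for high-fidelity customized generation, and builds KnowCusBench, the first benchmark for the task, to validate the method's effectiveness and extensibility.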

Details Motivation: Conventional concept customization relies on rare tokens, which seldom appear in pretraining data, behave unstably, and cannot express the inherent knowledge of the target concept. Method: Proposes the MoKus framework with two stages: (1) visual concept learning, which learns an anchor representation that stores the visual information of the target concept; (2) textual knowledge updating, which updates the answers to knowledge queries to the anchor representation. The core idea is cross-modal knowledge transfer. Result: MoKus outperforms existing methods on the newly proposed KnowCusBench benchmark, extends to knowledge-aware applications such as virtual concept creation and concept erasure, and also improves results on world knowledge benchmarks. Conclusion: Knowledge-aware concept customization is a more robust and semantically meaningful customization paradigm; by explicitly modeling the binding between textual knowledge and visual concepts, MoKus achieves strong performance and generalization. Abstract: Concept customization typically binds rare tokens to a target concept. Unfortunately, these approaches often suffer from unstable performance as the pretraining data seldom contains these rare tokens. Meanwhile, these rare tokens fail to convey the inherent knowledge of the target concept. Consequently, we introduce Knowledge-aware Concept Customization, a novel task aiming at binding diverse textual knowledge to target visual concepts. This task requires the model to identify the knowledge within the text prompt to perform high-fidelity customized generation. Meanwhile, the model should efficiently bind all the textual knowledge to the target concept. Therefore, we propose MoKus, a novel framework for knowledge-aware concept customization. Our framework relies on a key observation: cross-modal knowledge transfer, where modifying knowledge within the text modality naturally transfers to the visual modality during generation. Inspired by this observation, MoKus contains two stages: (1) In visual concept learning, we first learn the anchor representation to store the visual information of the target concept. (2) In textual knowledge updating, we update the answer for the knowledge queries to the anchor representation, enabling high-fidelity customized generation. To further comprehensively evaluate our proposed MoKus on the new task, we introduce the first benchmark for knowledge-aware concept customization: KnowCusBench. Extensive evaluations have demonstrated that MoKus outperforms state-of-the-art methods. 
Moreover, the cross-modal knowledge transfer allows MoKus to be easily extended to other knowledge-aware applications like virtual concept creation and concept erasure. We also demonstrate the capability of our method to achieve improvements on world knowledge benchmarks.

[86] HSEmotion Team at ABAW-10 Competition: Facial Expression Recognition, Valence-Arousal Estimation, Action Unit Detection and Fine-Grained Violence Classification

Andrey V. Savchenko,Kseniia Tsypliakova

Main category: cs.CV

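TL;DR: This paper proposes a fast approach that extracts facial embeddings with pre-trained EfficientNet models for several affective behavior analysis tasks in the ABAW competition, smoothing predictions in a sliding window to reduce noise. For fine-grained violence detection it explores several pre-trained architectures and strategies for aggregating their frame-level embeddings. Experiments show the approach significantly outperforms existing baselines on four tasks.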

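Details Motivation: Improve performance on the ABAW competition's frame-wise facial emotion understanding tasks (expression recognition, valence-arousal estimation, action unit detection) and on fine-grained violence detection, overcoming the accuracy and robustness limits of existing methods. Method: For emotion understanding, a pre-trained EfficientNet model extracts facial embeddings and its prediction is kept when its confidence is high; when confidence is low, the embeddings go to an MLP trained on AffWild2. All frame-level predictions are smoothed with a fixed-size sliding window. For violence detection, frame embeddings from several pre-trained models and their video-level aggregation are compared. Result: On four tasks of the ABAW challenge, the approach significantly improves validation metrics over existing baselines. Conclusion: Combining embedding extraction with confidence-adaptive fusion, prediction smoothing, and a comparison of architectures is efficient and generalizes well for multi-task affective behavior analysis, and suits facial emotion and violence recognition in real-world settings. Abstract: This article presents our results for the 10th Affective Behavior Analysis in-the-Wild (ABAW) competition. For frame-wise facial emotion understanding tasks (frame-wise facial expression recognition, valence-arousal estimation, action unit detection), we propose a fast approach based on facial embedding extraction with pre-trained EfficientNet-based emotion recognition models. If the latter model's confidence exceeds a threshold, its prediction is used. Otherwise, we feed embeddings into a simple multi-layered perceptron trained on the AffWild2 dataset. Estimated class-level scores are smoothed in a sliding window of fixed size to mitigate noise in frame-wise predictions. For the fine-grained violence detection task, we examine several pre-trained architectures for frame embeddings and their aggregation for video classification. Experimental results on four tasks from the ABAW challenge demonstrate that our approach significantly improves validation metrics over existing baselines.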
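The two post-processing steps in this abstract, confidence-gated fallback and sliding-window smoothing, can be sketched as follows. The function names, the use of the maximum class score as the confidence, and the centered window that shrinks at the sequence edges are all assumptions for illustration; the abstract only says that a threshold and a fixed-size window are used.

```python
def select_scores(primary_scores, fallback_scores, threshold):
    """Keep the primary model's scores when its top score exceeds the threshold.

    Otherwise return the fallback scores (in the abstract, those of an MLP
    fed with the same embeddings).
    """
    return primary_scores if max(primary_scores) > threshold else fallback_scores

def smooth_scores(frame_scores, window):
    """Average per-class scores over a centered sliding window.

    frame_scores is a list of per-frame score lists; the window is truncated
    at the start and end of the sequence rather than padded.
    """
    half = window // 2
    num_classes = len(frame_scores[0])
    smoothed = []
    for t in range(len(frame_scores)):
        lo, hi = max(0, t - half), min(len(frame_scores), t + half + 1)
        chunk = frame_scores[lo:hi]
        smoothed.append([sum(f[c] for f in chunk) / len(chunk)
                         for c in range(num_classes)])
    return smoothed
```

For example, with per-frame scores `[[0.9, 0.1], [0.2, 0.8], [0.9, 0.1]]` and a window of 3, the middle frame's isolated flip to class 1 is averaged away, which is the noise-mitigation effect the abstract describes.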

[87] VCBench: A Streaming Counting Benchmark for Spatial-Temporal State Maintenance in Long Videos

Pengyiang Liu,Zhongyue Shi,Hongye Hao,Qi Fu,Xueting Bi,Siwei Zhang,Xiaoyang Hu,Zitian Wang,Linjiang Huang,Si Liu

Main category: cs.CV

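TL;DR: This paper proposes VCBench, a streaming counting benchmark for evaluating how well video understanding models maintain world state. It decomposes this capability into object counting and event counting, and designs three diagnostic metrics to assess spatial-temporal state maintenance.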

Details Motivation: Existing video understanding benchmarks do not sufficiently examine how models continuously track and update world state, so a finer-grained diagnostic tool is needed. Method: Builds the VCBench benchmark with 406 videos and annotations of 10,071 event occurrence and object state change moments, yielding 1,000 streaming QA pairs and 4,576 query points; defines 8 subtasks under two categories, object counting and event counting; proposes three complementary metrics for numerical precision, trajectory consistency, and temporal awareness. Result: Mainstream video-language models still show significant deficiencies in spatial-temporal state maintenance, and do particularly poorly on tasks such as periodic event counting. Conclusion: VCBench offers a quantifiable, diagnostic evaluation framework for world state maintenance in video understanding systems and can guide model improvement. Abstract: Video understanding requires models to continuously track and update world state during playback. While existing benchmarks have advanced video understanding evaluation across multiple dimensions, the observation of how models maintain world state remains insufficient. We propose VCBench, a streaming counting benchmark that repositions counting as a minimal probe for diagnosing world state maintenance capability. We decompose this capability into object counting (tracking currently visible objects vs. tracking cumulative unique identities) and event counting (detecting instantaneous actions vs. tracking complete activity cycles), forming 8 fine-grained subcategories. VCBench contains 406 videos with frame-by-frame annotations of 10,071 event occurrence moments and object state change moments, generating 1,000 streaming QA pairs with 4,576 query points along timelines. By observing state maintenance trajectories through streaming multi-point queries, we design three complementary metrics to diagnose numerical precision, trajectory consistency, and temporal awareness. Evaluation on mainstream video-language models shows that current models still exhibit significant deficiencies in spatial-temporal state maintenance, particularly struggling with tasks like periodic event counting. VCBench provides a diagnostic framework for measuring and improving state maintenance in video understanding systems.

[88] HFP-SAM: Hierarchical Frequency Prompted SAM for Efficient Marine Animal Segmentation

Pingping Zhang,Tianyu Yan,Yuhao Wang,Yang Liu,Tongdan Tang,Yili Ma,Long Lv,Feng Tian,Weibing Sun,Huchuan Lu

Main category: cs.CV

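TL;DR: This paper proposes the HFP-SAM framework, which improves marine animal segmentation through a Frequency Guided Adapter, Frequency-aware Point Selection, and a Full-View Mamba module.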

Details Motivation: Existing deep learning-based marine animal segmentation methods struggle to model long-range dependencies, while SAM lacks awareness of fine-grained details and frequency information. Method: Proposes Hierarchical Frequency Prompted SAM (HFP-SAM): 1) a Frequency Guided Adapter (FGA) injects frequency-domain priors of marine scenes into the frozen SAM backbone; 2) Frequency-aware Point Selection (FPS) generates highlighted regions and builds point prompts; 3) a Full-View Mamba (FVM) efficiently extracts spatial and channel contextual information. Result: Experiments on four public datasets show that the method outperforms existing methods. Conclusion: HFP-SAM effectively combines frequency-domain priors with the SAM architecture and clearly improves the accuracy and robustness of animal segmentation in complex marine environments. Abstract: Marine Animal Segmentation (MAS) aims at identifying and segmenting marine animals from complex marine environments. Most previous deep learning-based MAS methods struggle with the long-distance modeling issue. Recently, Segment Anything Model (SAM) has gained popularity in general image segmentation. However, it lacks the ability to perceive fine-grained details and frequency information. To this end, we propose a novel learning framework, named Hierarchical Frequency Prompted SAM (HFP-SAM), for high-performance MAS. First, we design a Frequency Guided Adapter (FGA) to efficiently inject marine scene information into the frozen SAM backbone through frequency domain prior masks. Additionally, we introduce a Frequency-aware Point Selection (FPS) to generate highlighted regions through frequency analysis. These regions are combined with the coarse predictions of SAM to generate point prompts, which are integrated into SAM's decoder for fine predictions. Finally, to obtain comprehensive segmentation masks, we introduce a Full-View Mamba (FVM) to efficiently extract spatial and channel contextual information with linear computational complexity. Extensive experiments on four public datasets demonstrate the superior performance of our approach. The source code is publicly available at https://github.com/Drchip61/TIP-HFP-SAM.

[89] Text-Phase Synergy Network with Dual Priors for Unsupervised Cross-Domain Image Retrieval

Jing Yang,Hui Xue,Shipeng Zhu,Pengfei Fang

Main category: cs.CV

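TL;DR: This paper proposes TPSNet, which combines a text prior (domain prompts) with a phase prior (domain-invariant phase features) to address inaccurate semantic guidance from pseudo-labels and the entanglement of domain and semantic information in unsupervised cross-domain image retrieval, significantly improving retrieval performance.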

Details Motivation: Existing methods rely on discrete pseudo-labels for supervised learning and cross-domain alignment, but the pseudo-labels give inaccurate and incomplete semantic guidance, and the alignment process ignores the entanglement between domain-specific and semantic information, which degrades semantics and hurts retrieval. Method: Proposes the Text-Phase Synergy Network (TPSNet): 1) CLIP generates class-specific domain prompts for each domain (the text prior); 2) domain-invariant phase features serve as the phase prior and are integrated into the image representations to bridge domain distribution gaps while preserving semantic integrity; 3) the two priors are used together to improve representation quality. Result: TPSNet significantly outperforms current state-of-the-art methods on UCDIR benchmarks. Conclusion: The synergy of the text prior and the phase prior effectively mitigates semantic degradation and domain shift, offering a new direction for unsupervised cross-domain image retrieval. Abstract: This paper studies unsupervised cross-domain image retrieval (UCDIR), which aims to retrieve images of the same category across different domains without relying on labeled data. Existing methods typically utilize pseudo-labels, derived from clustering algorithms, as supervisory signals for intra-domain representation learning and cross-domain feature alignment. However, these discrete pseudo-labels often fail to provide accurate and comprehensive semantic guidance. Moreover, the alignment process frequently overlooks the entanglement between domain-specific and semantic information, leading to semantic degradation in the learned representations and ultimately impairing retrieval performance. This paper addresses these limitations by proposing a Text-Phase Synergy Network with Dual Priors (TPSNet). Specifically, we first employ CLIP to generate a set of class-specific prompts per domain, termed domain prompts, serving as a text prior that offers more precise semantic supervision. In parallel, we further introduce a phase prior, represented by domain-invariant phase features, which is integrated into the original image representations to bridge the domain distribution gaps while preserving semantic integrity. Leveraging the synergy of these dual priors, TPSNet significantly outperforms state-of-the-art methods on UCDIR benchmarks.

[90] UNIStainNet: Foundation-Model-Guided Virtual Staining of H&E to IHC

Jillur Rahman Saurav,Thuong Le Hoai Pham,Pritam Mukherjee,Paul Yi,Brent A. Orr,Jacob M. Luber

Main category: cs.CV

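TL;DR: This paper proposes UNIStainNet, a conditional generative network guided by spatial semantics from the pathology foundation model UNI, which generates multi-marker virtual IHC stains from H&E images. It reaches SOTA performance on several datasets and supports joint multi-marker generation with a single model.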

Details Motivation: Existing virtual IHC methods lack direct semantic guidance from pathology foundation models and mostly need a separate model per marker, making it hard to balance realism with quantification accuracy; an efficient, reliable preliminary molecular readout is especially needed when tissue is limited. Method: UNIStainNet uses a SPADE-UNet architecture conditioned on dense spatial tokens extracted by the frozen pathology foundation model UNI; a misalignment-aware loss suite preserves stain quantification accuracy; learned stain embeddings let a single model generate multiple IHC markers (HER2, Ki67, ER, PR). Result: UNIStainNet achieves SOTA distributional metrics on all four IHC markers on MIST, and the best distributional metrics on BCI; unlike prior methods that train separate models, it handles multiple markers with one unified model; failure analysis shows the remaining errors are systematic and concentrate in non-tumor tissue. Conclusion: UNIStainNet shows that spatial semantic priors from a pathology foundation model can markedly improve the realism, quantitative reliability, and generalization of virtual IHC, providing a scalable multi-marker tool for preliminary molecular assessment from routine sections. Abstract: Virtual immunohistochemistry (IHC) staining from hematoxylin and eosin (H&E) images can accelerate diagnostics by providing preliminary molecular insight directly from routine sections, reducing the need for repeat sectioning when tissue is limited. Existing methods improve realism through contrastive objectives, prototype matching, or domain alignment, yet the generator itself receives no direct guidance from pathology foundation models. We present UNIStainNet, a SPADE-UNet conditioned on dense spatial tokens from a frozen pathology foundation model (UNI), providing tissue-level semantic guidance for stain translation. A misalignment-aware loss suite preserves stain quantification accuracy, and learned stain embeddings enable a single model to serve multiple IHC markers simultaneously. On MIST, UNIStainNet achieves state-of-the-art distributional metrics on all four stains (HER2, Ki67, ER, PR) from a single unified model, where prior methods typically train separate per-stain models. On BCI, it also achieves the best distributional metrics. A tissue-type stratified failure analysis reveals that remaining errors are systematic, concentrating in non-tumor tissue. Code is available at https://github.com/facevoid/UNIStainNet.

[91] The COTe score: A decomposable framework for evaluating Document Layout Analysis models

Jonathan Bourne,Mwiza Simbeye,Ishtar Govia

Main category: cs.CV

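TL;DR: This paper proposes a new evaluation framework for document layout analysis (DLA), comprising the Structural Semantic Unit (SSU) labelling approach and the decomposable COTe score, to address the shortcomings of conventional object detection metrics on 2D printed media and make evaluation more robust, comparable, and fine-grained.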

Details Motivation: Conventional object detection metrics such as IoU, F1, and mAP are designed for 2D projections of 3D scenes and do not suit natively 2D printed documents, so they can misjudge DLA model performance or lose information. Method: Proposes the Structural Semantic Unit (SSU), a relational labelling approach that emphasizes the semantic structure of content over its physical position; designs the decomposable Coverage, Overlap, Trespass, and Excess (COTe) score; and evaluates 5 common models on 3 DLA datasets. Result: COTe is more informative than traditional metrics and reveals distinct failure modes across models (such as breaching semantic boundaries or parsing the same region repeatedly); it reduces the interpretation-performance gap by up to 76% relative to F1; and its granularity robustness largely holds even without explicit SSU labelling. Conclusion: Together, COTe and SSU form a more reasonable, practical, and easily deployed DLA evaluation system; the authors release an SSU-labelled dataset and a Python library. Abstract: Document layout analysis (DLA) is the process by which a page is parsed into meaningful elements, often using machine learning models. Typically, the quality of a model is judged using general object detection metrics such as IoU, F1 or mAP. However, these metrics are designed for images that are 2D projections of 3D space, not for the natively 2D imagery of printed media. This discrepancy can result in misleading or uninformative interpretation of model performance by the metrics. To encourage more robust, comparable, and nuanced DLA, we introduce: the Structural Semantic Unit (SSU), a relational labelling approach that shifts the focus from the physical to the semantic structure of the content; and the Coverage, Overlap, Trespass, and Excess (COTe) score, a decomposable metric for measuring page parsing quality. We demonstrate the value of these methods through case studies and by evaluating 5 common DLA models on 3 DLA datasets. We show that the COTe score is more informative than traditional metrics and reveals distinct failure modes across models, such as breaching semantic boundaries or repeatedly parsing the same region. In addition, the COTe score reduces the interpretation-performance gap by up to 76% relative to the F1. Notably, we find that the COTe's granularity robustness largely holds even without explicit SSU labelling, lowering the barriers to entry for using the system. Finally, we release an SSU labelled dataset and a Python library for applying COTe in DLA projects.

[92] IGASA: Integrated Geometry-Aware and Skip-Attention Modules for Enhanced Point Cloud Registration

Dongxu Zhang,Jihua Zhu,Shiqi Li,Wenbiao Yan,Haoran Xu,Peilin Fan,Huimin Lu

Main category: cs.CV

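TL;DR: This paper proposes the IGASA framework, built on a Hierarchical Pyramid Architecture (HPA) and combining the Hierarchical Cross-Layer Attention (HCLA) and Iterative Geometry-Aware Refinement (IGAR) modules, to improve the robustness and accuracy of point cloud registration under noise, occlusion, and large transformations.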

Details Motivation: Existing point cloud registration methods perform poorly under the heavy noise, severe occlusion, and large transformations found in real scenes, leaving accuracy and robustness insufficient. Method: Proposes the IGASA framework, comprising a Hierarchical Pyramid Architecture (HPA), a Hierarchical Cross-Layer Attention (HCLA) module for multi-scale feature alignment and stronger local geometric consistency, and an Iterative Geometry-Aware Refinement (IGAR) module that performs fine matching from the reliable correspondences established during coarse matching. Result: On four mainstream benchmark datasets including 3D(Lo)Match, KITTI, and nuScenes, IGASA significantly surpasses current state-of-the-art methods, with clear gains in registration accuracy. Conclusion: IGASA offers a more robust and adaptable approach to point cloud registration and advances practical 3D vision applications. Abstract: Point cloud registration (PCR) is a fundamental task in 3D vision and provides essential support for applications such as autonomous driving, robotics, and environmental modeling. Despite its widespread use, existing methods often fail when facing real-world challenges like heavy noise, significant occlusions, and large-scale transformations. These limitations frequently result in compromised registration accuracy and insufficient robustness in complex environments. In this paper, we propose IGASA as a novel registration framework constructed upon a Hierarchical Pyramid Architecture (HPA) designed for robust multi-scale feature extraction and fusion. The framework integrates two pivotal components consisting of the Hierarchical Cross-Layer Attention (HCLA) module and the Iterative Geometry-Aware Refinement (IGAR) module. The HCLA module utilizes skip attention mechanisms to align multi-resolution features and enhance local geometric consistency. Simultaneously, the IGAR module is designed for the fine matching phase by leveraging reliable correspondences established during coarse matching. This synergistic integration within the architecture allows IGASA to adapt effectively to diverse point cloud structures and intricate transformations. We evaluate the performance of IGASA on four widely recognized benchmark datasets including 3D(Lo)Match, KITTI, and nuScenes. Our extensive experiments consistently demonstrate that IGASA significantly surpasses state-of-the-art methods and achieves notable improvements in registration accuracy. 
This work provides a robust foundation for advancing point cloud registration techniques while offering valuable insights for practical 3D vision applications. The code for IGASA is available at https://github.com/DongXu-Zhang/IGASA.

[93] CMHANet: A Cross-Modal Hybrid Attention Network for Point Cloud Registration

Dongxu Zhang,Yingsen Wang,Yiding Sun,Haoran Xu,Peilin Fan,Jihua Zhu

Main category: cs.CV

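TL;DR: This paper proposes CMHANet, a cross-modal hybrid attention network that fuses contextual information from 2D images with geometric detail from 3D point clouds, and introduces an optimization function based on contrastive learning, significantly improving the robustness and accuracy of point cloud registration under noise, incomplete data, and low overlap.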

Details Motivation: Existing learning-based point cloud registration methods degrade in complex real-world scenes (incomplete data, sensor noise, low-overlap regions), so more robust solutions are needed. Method: Proposes CMHANet: 1) a cross-modal hybrid attention mechanism fuses the semantic context of 2D images with the geometric features of 3D point clouds; 2) a new loss function based on contrastive learning strengthens geometric consistency and robustness to noise and partial observations. Result: Registration accuracy and robustness improve significantly on 3DMatch and the more challenging 3DLoMatch; zero-shot transfer to the TUM RGB-D SLAM dataset confirms strong generalization. Conclusion: Through cross-modal feature fusion and contrastive-learning-based optimization, CMHANet effectively addresses the robustness bottleneck of point cloud registration in complex real-world scenes and offers a more reliable solution for practical applications. Abstract: Robust point cloud registration is a fundamental task in 3D computer vision and geometric deep learning, essential for applications such as large-scale 3D reconstruction, augmented reality, and scene understanding. However, the performance of established learning-based methods often degrades in complex, real-world scenarios characterized by incomplete data, sensor noise, and low overlap regions. To address these limitations, we propose CMHANet, a novel Cross-Modal Hybrid Attention Network. Our method fuses rich contextual information from 2D images with the geometric detail of 3D point clouds, yielding a comprehensive and resilient feature representation. Furthermore, we introduce an innovative optimization function based on contrastive learning, which enforces geometric consistency and significantly improves the model's robustness to noise and partial observations. We evaluated CMHANet on the 3DMatch and the challenging 3DLoMatch datasets. Additionally, zero-shot evaluations on the TUM RGB-D SLAM dataset verify the model's generalization capability to unseen domains. The experimental results demonstrate that our method achieves substantial improvements in both registration accuracy and overall robustness, outperforming current techniques. We also release our code at https://github.com/DongXu-Zhang/CMHANet.

[94] CognitionCapturerPro: Towards High-Fidelity Visual Decoding from EEG/MEG via Multi-modal Information and Asymmetric Alignment

Kaifan Zhang,Lihuo He,Junjie Ke,Yuqi Ji,Lukun Wu,Lizi Wang,Xinbo Gao

Main category: cs.CV

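TL;DR: This paper proposes the CognitionCapturerPro framework, which trains EEG signals collaboratively with multi-modal priors (images, text, depth, edges) to improve visual stimulus reconstruction, and significantly raises retrieval accuracy on the THINGS-EEG dataset.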

Details Motivation: Reconstructing visual stimuli from electroencephalography (EEG) suffers from fidelity loss and representation shift. Method: Proposes the CognitionCapturerPro framework, which includes an uncertainty-weighted similarity scoring mechanism and a fusion encoder, and trains collaboratively with a simplified alignment module and a pre-trained diffusion model. Result: On the THINGS-EEG dataset, Top-1 and Top-5 retrieval accuracy improve by 25.9% and 10.6% respectively, significantly outperforming the original CognitionCapturer. Conclusion: Collaborative training that fuses multi-modal priors with EEG signals effectively mitigates fidelity loss and representation shift and improves visual reconstruction. Abstract: Visual stimuli reconstruction from EEG remains challenging due to fidelity loss and representation shift. We propose CognitionCapturerPro, an enhanced framework that integrates EEG with multi-modal priors (images, text, depth, and edges) via collaborative training. Our core contributions include an uncertainty-weighted similarity scoring mechanism to quantify modality-specific fidelity and a fusion encoder for integrating shared representations. By employing a simplified alignment module and a pre-trained diffusion model, our method significantly outperforms the original CognitionCapturer on the THINGS-EEG dataset, improving Top-1 and Top-5 retrieval accuracy by 25.9% and 10.6%, respectively. Code is available at: https://github.com/XiaoZhangYES/CognitionCapturerPro.

[95] Thinking in Dynamics: How Multimodal Large Language Models Perceive, Track, and Reason Dynamics in Physical 4D World

Yuzhi Huang,Kairun Wen,Rongxin Gao,Dongxuan Liu,Yibin Lou,Jie Wu,Jing Xu,Jian Zhang,Zheng Yang,Yunlong Lin,Chenxin Li,Panwang Pan,Junbin Lu,Jingyan Jiang,Xinghao Ding,Yue Huang,Zhi Wang

Main category: cs.CV

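TL;DR: This paper introduces the Dyn-Bench benchmark for systematically evaluating the spatio-temporal reasoning and localized dynamics perception of multimodal large language models (MLLMs) in dynamic 4D scenes. Experiments show that existing models struggle to combine spatio-temporal reasoning with dynamic object grounding, while structured integration approaches such as Mask-Guided Fusion and ST-TCM significantly improve their understanding of dynamics.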

Details Motivation: Current MLLMs excel at static visual understanding but lack the ability to perceive, track, and reason about real-world 4D scenes (space plus time) that evolve over time, which calls for systematic evaluation and improvement. Method: Builds the large-scale dynamic video benchmark Dyn-Bench (1k videos, 7k VQA pairs, 3k dynamic object grounding pairs) and proposes structured integration approaches, including Mask-Guided Fusion and the Spatio-Temporal Textual Cognitive Map (ST-TCM), to strengthen MLLMs' understanding of dynamics. Result: Existing MLLMs perform unevenly across spatio-temporal reasoning and dynamic object grounding and often give inconsistent interpretations of motion and interaction; conventional prompting strategies such as chain-of-thought help only a little, while structured integration approaches improve performance significantly. Conclusion: MLLMs do not yet truly "think in dynamics"; structural improvements to model architecture and multimodal fusion are needed for robust spatio-temporal understanding of the physical 4D world. Abstract: Humans inhabit a physical 4D world where geometric structure and semantic content evolve over time, constituting a dynamic 4D reality (spatial with temporal dimension). While current Multimodal Large Language Models (MLLMs) excel in static visual understanding, can they also be adept at "thinking in dynamics", i.e., perceive, track and reason about spatio-temporal dynamics in evolving scenes? To systematically assess their spatio-temporal reasoning and localized dynamics perception capabilities, we introduce Dyn-Bench, a large-scale benchmark built from diverse real-world and synthetic video datasets, enabling robust and scalable evaluation of spatio-temporal understanding. Through multi-stage filtering from massive 2D and 4D data sources, Dyn-Bench provides a high-quality collection of dynamic scenes, comprising 1k videos, 7k visual question answering (VQA) pairs, and 3k dynamic object grounding pairs. We probe general, spatial and region-level MLLMs to express how they think in dynamics both linguistically and visually, and find that existing models cannot simultaneously maintain strong performance in both spatio-temporal reasoning and dynamic object grounding, often producing inconsistent interpretations of motion and interaction. 
Notably, conventional prompting strategies (e.g., chain-of-thought or caption-based hints) provide limited improvement, whereas structured integration approaches, including Mask-Guided Fusion and Spatio-Temporal Textual Cognitive Map (ST-TCM), significantly enhance MLLMs' dynamics perception and spatio-temporal reasoning in the physical 4D world. Code and benchmark are available at https://dyn-bench.github.io/.

[96] SLICE: Semantic Latent Injection via Compartmentalized Embedding for Image Watermarking

Zheng Gao,Yifan Yang,Xiaoyu Li,Xiaoyan Feng,Haoran Fan,Yang Song,Jiaojiao Jiang

Main category: cs.CV

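TL;DR: This paper proposes SLICE, which decouples image semantics into four factors and injects each into a different region of the initial Gaussian noise, yielding a fine-grained, localizable semantic-aware watermark that is significantly more robust to semantic-guided regeneration attacks.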

Details Motivation: Existing semantic-aware watermarking methods rely on a single global semantic binding and are vulnerable to localized but globally coherent semantic edits, while watermarks based on the initial noise can be forged through inversion and regeneration attacks. Method: Proposes the SLICE framework, which decouples image semantics into four factors (subject, environment, action, and detail) and uses compartmentalized embedding to anchor them precisely to distinct spatial regions of the initial Gaussian noise. Result: SLICE significantly outperforms baselines under semantic-guided regeneration attacks, substantially reducing attack success while preserving image quality and semantic fidelity; theory supports reliable tamper localization and statistical guarantees on false-accept rates. Conclusion: SLICE is a practical, training-free solution for fine-grained image provenance that is both diagnostic of tampering and robust to realistic adversarial manipulations. Abstract: Watermarking the initial noise of diffusion models has emerged as a promising approach for image provenance, but content-independent noise patterns can be forged via inversion and regeneration attacks. Recent semantic-aware watermarking methods improve robustness by conditioning verification on image semantics. However, their reliance on a single global semantic binding makes them vulnerable to localized but globally coherent semantic edits. To address this limitation and provide a trustworthy semantic-aware watermark, we propose Semantic Latent Injection via Compartmentalized Embedding (SLICE). Our framework decouples image semantics into four semantic factors (subject, environment, action, and detail) and precisely anchors them to distinct regions in the initial Gaussian noise. This fine-grained semantic binding enables advanced watermark verification where semantic tampering is detectable and localizable. We theoretically justify why SLICE enables robust and reliable tamper localization and provides statistical guarantees on false-accept rates. Experimental results demonstrate that SLICE significantly outperforms existing baselines against advanced semantic-guided regeneration attacks, substantially reducing attack success while preserving image quality and semantic fidelity. Overall, SLICE offers a practical, training-free provenance solution that is both fine-grained in diagnosis and robust to realistic adversarial manipulations.

[97] Show, Don't Tell: Detecting Novel Objects by Watching Human Videos

James Akl,Jose Nicolas Avendano Arbelaez,James Barabas,Jennifer L. Barry,Kalie Ching,Noam Eshed,Jiahui Fu,Michel Hidalgo,Andrew Hoelscher,Tushar Kusnur,Andrew Messing,Zachary Nagler,Brian Okorn,Mauro Passerino,Tim J. Perkins,Eric Rosen,Ankit Shah,Tanmay Shankar,Scott Shaw

Main category: cs.CV

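TL;DR: This paper presents "Show, Don't Tell," a self-supervised system that automatically generates training data by showing the target objects directly during a human demonstration, so that a bespoke object detector can be built quickly without hand-written language prompts or complex descriptions.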

Details Motivation: Existing closed-set object detectors struggle to recognize novel (out-of-distribution) objects that appear in a demonstration, while open-set detectors (e.g., VLMs) have some ability to do so but depend on time-consuming, laborious manual prompt engineering. Method: A self-supervised "Show, Don't Tell" paradigm: visual observations from the human demonstration are used to build an annotated dataset automatically, on which a lightweight, bespoke object detector is trained; the authors also develop an end-to-end system integrated on a real robot. Result: Experiments show that the method significantly outperforms state-of-the-art approaches at detecting and recognizing manipulated objects and improves the robot's task completion rate. Conclusion: Bypassing the language modality and driving detector training directly from visual demonstrations is an efficient, practical paradigm that can be deployed on real robots. Abstract: How can a robot quickly identify and recognize new objects shown to it during a human demonstration? Existing closed-set object detectors frequently fail at this because the objects are out-of-distribution. While open-set detectors (e.g., VLMs) sometimes succeed, they often require expensive and tedious human-in-the-loop prompt engineering to uniquely recognize novel object instances. In this paper, we present a self-supervised system that eliminates the need for tedious language descriptions and expensive prompt engineering by training a bespoke object detector on an automatically created dataset, supervised by the human demonstration itself. In our approach, "Show, Don't Tell," we show the detector the specific objects of interest during the demonstration, rather than telling the detector about these objects via complex language descriptions. By bypassing language altogether, this paradigm enables us to quickly train bespoke detectors tailored to the relevant objects observed in human task demonstrations. We develop an integrated on-robot system to deploy our "Show, Don't Tell" paradigm of automatic dataset creation and novel object-detection on a real-world robot. Empirical results demonstrate that our pipeline significantly outperforms state-of-the-art detection and recognition methods for manipulated objects, leading to improved task completion for the robot.

[98] FC-Track: Overlap-Aware Post-Association Correction for Online Multi-Object Tracking

Cheng Ju,Zejing Zhao,Akio Namiki

Main category: cs.CV

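TL;DR: This paper proposes FC-Track, a lightweight post-association correction framework for online multi-object tracking. It uses IoA-based filtering and appearance-similarity comparison within overlapped tracklet pairs to stop mismatches from propagating under high overlap, significantly reducing identity switches and targeting real-time robotic applications.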

Details Motivation: Existing online multi-object tracking methods are prone to identity switches under frequent occlusion and object overlap, and incorrect associations propagate over time and degrade tracking reliability. Method: The FC-Track framework has two parts: 1) an Intersection over Area (IoA)-based filtering strategy that suppresses unreliable appearance updates under high overlap; 2) appearance-similarity comparison within overlapped tracklet pairs, which locally corrects detection-to-tracklet mismatches. The whole pipeline runs online, without global optimization or re-identification. Result: FC-Track reaches 81.73 MOTA, 82.81 IDF1, and 66.95 HOTA on the MOT17 test set (5.7 FPS), and 77.52 MOTA, 80.90 IDF1, and 65.67 HOTA on MOT20 (0.6 FPS); it produces only 29.55% long-term identity switches, substantially fewer than existing online trackers, while maintaining state-of-the-art performance on MOT20. Conclusion: FC-Track mitigates overlap-induced identity switches in a lightweight, online, re-identification-free manner, improving tracking robustness and accuracy, and is aimed at robotic systems operating in dynamic, complex environments. Abstract: Reliable multi-object tracking (MOT) is essential for robotic systems operating in complex and dynamic environments. Despite recent advances in detection and association, online MOT methods remain vulnerable to identity switches caused by frequent occlusions and object overlap, where incorrect associations can propagate over time and degrade tracking reliability. We present a lightweight post-association correction framework (FC-Track) for online MOT that explicitly targets overlap-induced mismatches during inference. The proposed method suppresses unreliable appearance updates under high-overlap conditions using an Intersection over Area (IoA)-based filtering strategy, and locally corrects detection-to-tracklet mismatches through appearance similarity comparison within overlapped tracklet pairs. By preventing short-term mismatches from propagating, our framework effectively mitigates long-term identity switches without resorting to global optimization or re-identification. The framework operates online without global optimization or re-identification, making it suitable for real-time robotic applications. We achieve 81.73 MOTA, 82.81 IDF1, and 66.95 HOTA on the MOT17 test set with a running speed of 5.7 FPS, and 77.52 MOTA, 80.90 IDF1, and 65.67 HOTA on the MOT20 test set with a running speed of 0.6 FPS. Specifically, our framework FC-Track produces only 29.55% long-term identity switches, which is substantially lower than existing online trackers. Meanwhile, our framework maintains state-of-the-art performance on the MOT20 benchmark.
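
The IoA-based filter can be illustrated with a minimal sketch. It assumes axis-aligned boxes in `(x1, y1, x2, y2)` form and that IoA normalizes the intersection by the area of the box being updated; the function names and the `ioa_threshold` value of 0.5 are hypothetical and are not taken from the paper.

```python
from typing import Iterable, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout


def intersection_over_area(box: Box, other: Box) -> float:
    """Fraction of `box`'s own area that is covered by `other`."""
    ix1, iy1 = max(box[0], other[0]), max(box[1], other[1])
    ix2, iy2 = min(box[2], other[2]), min(box[3], other[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = (box[2] - box[0]) * (box[3] - box[1])
    return inter / area if area > 0 else 0.0


def allow_appearance_update(box: Box, neighbours: Iterable[Box],
                            ioa_threshold: float = 0.5) -> bool:
    """Skip the appearance update when any neighbour covers too much of `box`.

    The threshold is a placeholder for illustration, not a value from the paper.
    """
    return all(intersection_over_area(box, n) < ioa_threshold for n in neighbours)
```

Unlike IoU, this quantity is asymmetric: a small box hidden behind a large one has a high IoA even when the IoU of the pair is low, which is why it is a natural test for whether an appearance feature is likely contaminated.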

[99] SAP: Segment Any 4K Panorama

Lutao Jiang,Zidong Cao,Weikai Chen,Xu Zheng,Yuanhuiyi Lyu,Zhenyang Li,Zeyu HU,Yingda Yin,Keyang Luo,Runze Zhang,Kai Yan,Shengju Qian,Haidi Fan,Yifan Peng,Xin Wang,Hui Xiong,Ying-Cong Chen

Main category: cs.CV

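TL;DR: This paper proposes Segment Any 4K Panorama (SAP), a foundation model for instance segmentation of 4K high-resolution panoramas. It recasts panoramic segmentation as video segmentation over overlapping perspective views sampled along a continuous spherical trajectory, trains on large-scale synthetic data, and achieves substantial zero-shot gains on real 4K panoramas.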

Details Motivation: Foundation segmentation models trained on perspective imagery degrade on 360° panoramas, so a method adapted to panoramic characteristics is needed. Method: Panoramic segmentation is reformulated as fixed-trajectory perspective video segmentation: overlapping perspective patches are sampled along a continuous spherical traversal, preserving native 4K resolution and smooth viewpoint transitions; 183,440 4K panoramic images with instance annotations are synthesized with the InfiniGen engine for training. Result: On a real-world 4K panorama benchmark, SAP improves zero-shot mIoU by +17.2 on average over SAM2 models of different sizes. Conclusion: The trajectory-aligned reformulation combined with large-scale synthetic supervision improves the generalization of panoramic instance segmentation and provides a new foundation model for panoramic understanding in AR and embodied intelligence. Abstract: Promptable instance segmentation is widely adopted in embodied and AR systems, yet the performance of foundation models trained on perspective imagery often degrades on 360° panoramas. In this paper, we introduce Segment Any 4K Panorama (SAP), a foundation model for 4K high-resolution panoramic instance-level segmentation. We reformulate panoramic segmentation as fixed-trajectory perspective video segmentation, decomposing a panorama into overlapping perspective patches sampled along a continuous spherical traversal. This memory-aligned reformulation preserves native 4K resolution while restoring the smooth viewpoint transitions required for stable cross-view propagation. To enable large-scale supervision, we synthesize 183,440 4K-resolution panoramic images with instance segmentation labels using the InfiniGen engine. Trained under this trajectory-aligned paradigm, SAP generalizes effectively to real-world 360° images, achieving a +17.2 zero-shot mIoU gain over vanilla SAM2 of different sizes on a real-world 4K panorama benchmark.

[100] HIFICL: High-Fidelity In-Context Learning for Multimodal Tasks

Xiaoyu Li,Yuhang Liu,Zheng Luo,Xuanshuo Kang,Fangqi Lou,Xiaohua Wu,Zihan Xiong

Main category: cs.CV

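TL;DR: This paper proposes HiFICL, which models the in-context learning mechanism more faithfully through learnable virtual key-value pairs, a low-rank factorization, and an end-to-end training objective, and outperforms existing approximation methods on multimodal benchmarks.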

Details Motivation: In-Context Learning (ICL) in large models is sensitive to demonstration configuration and computationally expensive, and current approximation methods (such as learning a "shift vector") oversimplify its underlying mechanism. Method: HiFICL has three components: 1) learnable "virtual key-value pairs" that act as the context representation; 2) a low-rank factorization for stable, regularized training; 3) a simple end-to-end training objective. The mechanism is also equivalent to a form of context-aware Parameter-Efficient Fine-Tuning (PEFT). Result: HiFICL consistently outperforms existing ICL approximation methods on several multimodal benchmarks. Conclusion: By modeling the dynamic mixture of attention output and context values in ICL more precisely, HiFICL achieves higher fidelity and stronger generalization, offering an effective and efficient paradigm for multimodal in-context learning. Abstract: In-Context Learning (ICL) is a significant paradigm for Large Multimodal Models (LMMs), using a few in-context demonstrations (ICDs) for new task adaptation. However, its performance is sensitive to demonstration configurations and computationally expensive. Mathematically, the influence of these demonstrations can be decomposed into a dynamic mixture of the standard attention output and the context values. Current approximation methods simplify this process by learning a "shift vector". Inspired by the exact decomposition, we introduce High-Fidelity In-Context Learning (HIFICL) to more faithfully model the ICL mechanism. HIFICL consists of three key components: 1) a set of "virtual key-value pairs" to act as a learnable context, 2) a low-rank factorization for stable and regularized training, and 3) a simple end-to-end training objective. From another perspective, this mechanism constitutes a form of context-aware Parameter-Efficient Fine-Tuning (PEFT). Extensive experiments show that HiFICL consistently outperforms existing approximation methods on several multimodal benchmarks. The code is available at https://github.com/bbbandari/HiFICL.
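
The "dynamic mixture" mentioned in the abstract corresponds to a standard identity of softmax attention, which the following sketch checks numerically. It is not the authors' implementation: it uses a single query, omits the usual scaling by the square root of the key dimension, and all names are hypothetical. In HiFICL's setting, the context keys and values would presumably be the learnable virtual key-value pairs.

```python
import math


def _scores(query, keys):
    # Dot-product attention logits of one query against each key.
    return [sum(q * k for q, k in zip(query, key)) for key in keys]


def softmax_attention(query, keys, values):
    """Softmax-weighted average of `values` for a single query vector."""
    scores = _scores(query, keys)
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / z
            for d in range(len(values[0]))]


def log_partition(query, keys):
    """log of the softmax normalizer Z = sum_j exp(q . k_j)."""
    scores = _scores(query, keys)
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))


def mixture_attention(query, ctx_keys, ctx_values, own_keys, own_values):
    """Attention over [context; own] written as a query-dependent mixture."""
    out_ctx = softmax_attention(query, ctx_keys, ctx_values)
    out_own = softmax_attention(query, own_keys, own_values)
    # lam = Z_ctx / (Z_ctx + Z_own): the share of attention mass on the context.
    lam = 1.0 / (1.0 + math.exp(log_partition(query, own_keys)
                                - log_partition(query, ctx_keys)))
    return [lam * c + (1.0 - lam) * o for c, o in zip(out_ctx, out_own)]
```

Because the mixing weight depends on the query, a single fixed shift vector cannot reproduce it exactly, which is consistent with the abstract's argument for modeling the context with key-value pairs instead.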

[101] TerraFlow: Multimodal, Multitemporal Representation Learning for Earth Observation

Nazar Puriy,Johannes Jakubik,Benedikt Blumenstiel,Konrad Schindler

Main category: cs.CV

TL;DR: TerraFlow is a new multimodal, multitemporal learning framework for Earth observation that handles variable-length inputs and outperforms existing foundation models on temporal tasks and natural disaster risk mapping.

Details Motivation: To address the challenges of variable-length, multimodal, and multitemporal Earth observation data, and to improve performance on temporal tasks and natural disaster risk prediction where current models fail. Method: TerraFlow introduces temporal training objectives enabling sequence-aware learning across space, time, and modality, while maintaining robustness to variable-length inputs. Result: TerraFlow surpasses state-of-the-art foundation models on all temporal tasks of GEO-Bench-2 and achieves up to 50% higher F1 score and 24% lower Brier score in natural disaster risk map prediction. Conclusion: TerraFlow establishes a new state-of-the-art for multimodal, multitemporal Earth observation learning and shows promising initial capability for deep-learning-based disaster risk mapping. Abstract: We propose TerraFlow, a novel approach to multimodal, multitemporal learning for Earth observation. TerraFlow builds on temporal training objectives that enable sequence-aware learning across space, time, and modality, while remaining robust to the variable-length inputs commonly encountered in real-world Earth observation data. Our experiments demonstrate superiority of TerraFlow over state-of-the-art foundation models for Earth observation across all temporal tasks of the GEO-Bench-2 benchmark. We additionally demonstrate that TerraFlow is able to make initial steps towards deep-learning based risk map prediction for natural disasters -- a task on which other state-of-the-art foundation models frequently collapse. TerraFlow outperforms state-of-the-art foundation models by up to 50% in F1 score and 24% in Brier score.
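
For readers unfamiliar with the Brier score reported above, here is a minimal sketch of the binary version of the metric. It is the generic textbook definition, not code from the paper, and the function name is hypothetical. Lower values are better, so an improvement in Brier score means a reduction.

```python
def brier_score(probabilities, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if not probabilities or len(probabilities) != len(outcomes):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for p, y in zip(probabilities, outcomes)) / len(probabilities)
```

A perfectly confident and correct forecaster scores 0, while always predicting 0.5 scores 0.25 regardless of the outcomes.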

[102] SAVA-X: Ego-to-Exo Imitation Error Detection via Scene-Adaptive View Alignment and Bidirectional Cross View Fusion

Xiang Li,Heqian Qiu,Lanxiao Wang,Benliu Qiu,Fanman Meng,Linfeng Xu,Hongliang Li

Main category: cs.CV

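TL;DR: This paper proposes SAVA-X, a framework for cross-view imitation error detection in which a first-person (ego) imitation is assessed against a third-person (exo) demonstration. Using adaptive sampling, scene-aware view embeddings, and bidirectional cross-attention fusion, it clearly outperforms existing methods on the EgoMe benchmark.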

Details Motivation: Most existing error detection methods assume a single view and cannot handle the practical case, common in industrial training, healthcare, and assembly quality control, where a third-person demonstration is used to assess a first-person imitation. Method: SAVA-X consists of three parts: (i) view-conditioned adaptive sampling, (ii) scene-adaptive view embeddings, and (iii) bidirectional cross-attention fusion, which together address cross-view domain shift, temporal misalignment, and redundancy. Result: On the EgoMe benchmark, SAVA-X consistently surpasses all baselines in AUPRC and mean tIoU; ablations show that its components provide complementary gains. Conclusion: SAVA-X effectively addresses the new task of Ego→Exo cross-view imitation error detection and offers a workable framework for multi-view behavior understanding. Abstract: Error detection is crucial in industrial training, healthcare, and assembly quality control. Most existing work assumes a single-view setting and cannot handle the practical case where a third-person (exo) demonstration is used to assess a first-person (ego) imitation. We formalize Ego$\rightarrow$Exo Imitation Error Detection: given asynchronous, length-mismatched ego and exo videos, the model must localize procedural steps on the ego timeline and decide whether each is erroneous. This setting introduces cross-view domain shift, temporal misalignment, and heavy redundancy. Under a unified protocol, we adapt strong baselines from dense video captioning and temporal action detection and show that they struggle in this cross-view regime. We then propose SAVA-X, an Align-Fuse-Detect framework with (i) view-conditioned adaptive sampling, (ii) scene-adaptive view embeddings, and (iii) bidirectional cross-attention fusion. On the EgoMe benchmark, SAVA-X consistently improves AUPRC and mean tIoU over all baselines, and ablations confirm the complementary benefits of its components. Code is available at https://github.com/jack1ee/SAVAX.
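
The mean tIoU metric used for step localization can be sketched as follows. This is a generic illustration, assuming each procedural step is a `(start, end)` interval in seconds and that predicted and ground-truth steps are already paired by index; the benchmark's actual matching protocol is not described in the abstract, and the function names are hypothetical.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_tiou(preds, gts):
    """Average tIoU over steps that are assumed to be paired by index."""
    if not preds or len(preds) != len(gts):
        raise ValueError("need equally many predicted and ground-truth steps")
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)
```

For example, a predicted step spanning seconds 2 to 6 against a ground-truth step spanning 4 to 8 overlaps for 2 seconds out of a 6-second union, giving a tIoU of one third.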

[103] Catalyst4D: High-Fidelity 3D-to-4D Scene Editing via Dynamic Propagation

Shifeng Chen,Yihui Li,Jun Liao,Hongyu Yang,Di Huang

Main category: cs.CV

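TL;DR: Catalyst4D is a new framework for dynamic 4D scene editing that uses Anchor-based Motion Guidance (AMG) and Color Uncertainty-guided Appearance Refinement (CUAR) to transfer high-quality 3D edits to 4D Gaussian scenes while maintaining spatial and temporal coherence.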

Details Motivation: Methods that directly extend 2D diffusion models to 4D tend to produce motion artifacts, temporal flickering, and inconsistent style, so dynamic scene editing remains challenging. Method: Catalyst4D has two components: 1) Anchor-based Motion Guidance (AMG), which uses optimal transport to establish correspondences between structural anchors of the original and edited Gaussians, keeping deformation propagation consistent; 2) Color Uncertainty-guided Appearance Refinement (CUAR), which estimates per-Gaussian color uncertainty and selectively refines the appearance of regions prone to occlusion-induced artifacts. Result: Across experiments, Catalyst4D achieves temporally stable, high-fidelity dynamic scene editing and outperforms existing methods in both visual quality and motion coherence. Conclusion: Catalyst4D addresses the spatio-temporal consistency problem in dynamic 4D scene editing and offers a new paradigm for high-quality, robust 4D content generation. Abstract: Recent advances in 3D scene editing using NeRF and 3DGS enable high-quality static scene editing. In contrast, dynamic scene editing remains challenging, as methods that directly extend 2D diffusion models to 4D often produce motion artifacts, temporal flickering, and inconsistent style propagation. We introduce Catalyst4D, a framework that transfers high-quality 3D edits to dynamic 4D Gaussian scenes while maintaining spatial and temporal coherence. At its core, Anchor-based Motion Guidance (AMG) builds a set of structurally stable and spatially representative anchors from both original and edited Gaussians. These anchors serve as robust region-level references, and their correspondences are established via optimal transport to enable consistent deformation propagation without cross-region interference or motion drift. Complementarily, Color Uncertainty-guided Appearance Refinement (CUAR) preserves temporal appearance consistency by estimating per-Gaussian color uncertainty and selectively refining regions prone to occlusion-induced artifacts. Extensive experiments demonstrate that Catalyst4D achieves temporally stable, high-fidelity dynamic scene editing and outperforms existing methods in both visual quality and motion coherence.
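
The abstract states only that anchor correspondences are established "via optimal transport". One common way to do that is entropic-regularized transport solved with Sinkhorn iterations, sketched below. The solver choice, the squared-distance cost, the uniform anchor weights, and every name and parameter (`reg`, `iters`) are assumptions made for illustration and may differ from what Catalyst4D actually uses.

```python
import math


def sinkhorn_plan(cost, reg=0.1, iters=200):
    """Entropic optimal-transport plan between two uniformly weighted sets."""
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    a, b = 1.0 / n, 1.0 / m  # uniform marginals (an assumption)
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def match_anchors(src, dst, reg=0.1):
    """For each source anchor, the destination anchor receiving the most mass."""
    cost = [[sum((p - q) ** 2 for p, q in zip(s, d)) for d in dst] for s in src]
    plan = sinkhorn_plan(cost, reg)
    return [max(range(len(dst)), key=lambda j: row[j]) for row in plan]
```

This toy version works on small, well-scaled point sets; with large costs relative to `reg` the exponentials underflow, and a log-domain formulation would be needed.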

[104] PVI: Plug-in Visual Injection for Vision-Language-Action Models

Zezhou Zhang,Songxin Zhang,Xiao Xiong,Junjie Zhang,Zejian Xie,Jingyi Xi,Zunyao Mao,Zan Mao,Zhixin Mai,Zhuoyang Song,Jiaxing Zhang

Main category: cs.CV

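TL;DR: This paper proposes Plug-in Visual Injection (PVI), a lightweight, encoder-agnostic module that injects auxiliary visual representations (especially temporal video features) into a pretrained action expert. It clearly improves language-conditioned robot manipulation, most of all on multi-phase tasks that require state tracking, and is validated on a real bimanual cloth-folding task.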

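Details Motivation: In existing VLA architectures, the pretrained vision-language model (VLM) lacks fine-grained geometric cues and explicit temporal information, which limits its support for the action expert; prior visual-injection methods either focus on static spatial representations or require substantial architectural changes, leaving temporal information underexplored. Method: Plug-in Visual Injection (PVI) is a lightweight module with zero-initialized residual pathways that plugs into a pretrained action expert without modifying the original encoder and needs only single-stage fine-tuning; the study compares injecting static image features (DINOv2) with temporal video features (V-JEPA2). Result: In simulation, PVI consistently outperforms the base policy and several alternative injection strategies; controlled experiments show that V-JEPA2 (temporal features) clearly outperforms DINOv2 (static features), with the largest gains on multi-phase tasks; a long-horizon bimanual cloth-folding task on a real robot confirms its practical effectiveness. Conclusion: PVI is an efficient, general, easy-to-deploy visual-injection paradigm, and it shows that explicitly introducing high-quality temporal visual representations is key to improving the robustness and generalization of language-conditioned manipulation policies. Abstract: VLA architectures that pair a pretrained VLM with a flow-matching action expert have emerged as a strong paradigm for language-conditioned manipulation. Yet the VLM, optimized for semantic abstraction and typically conditioned on static visual observations, tends to attenuate fine-grained geometric cues and often lacks explicit temporal evidence for the action expert. Prior work mitigates this by injecting auxiliary visual features, but existing approaches either focus on static spatial representations or require substantial architectural modifications to accommodate temporal inputs, leaving temporal information underexplored. We propose Plug-in Visual Injection (PVI), a lightweight, encoder-agnostic module that attaches to a pretrained action expert and injects auxiliary visual representations via zero-initialized residual pathways, preserving pretrained behavior with only single-stage fine-tuning. Using PVI, we obtain consistent gains over the base policy and a range of competitive alternative injection strategies, and our controlled study shows that temporal video features (V-JEPA2) outperform strong static image features (DINOv2), with the largest gains on multi-phase tasks requiring state tracking and coordination. Real-robot experiments on long-horizon bimanual cloth folding further demonstrate the practicality of PVI beyond simulation.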
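
The idea behind a zero-initialized residual pathway can be shown with a small sketch. It assumes the injection is a single linear projection of the auxiliary feature added to the hidden state; the class name, shapes, and projection form are hypothetical, and PVI's real module is likely richer.

```python
class ZeroInitResidualInjection:
    """Adds a linear projection of auxiliary features to a hidden state.

    The projection starts at zero, so the module is an identity map on the
    hidden state until fine-tuning moves the weights away from zero.
    """

    def __init__(self, aux_dim: int, hidden_dim: int):
        # hidden_dim x aux_dim weight matrix, all zeros at initialization.
        self.weight = [[0.0] * aux_dim for _ in range(hidden_dim)]

    def __call__(self, hidden, aux):
        projected = [sum(w * a for w, a in zip(row, aux)) for row in self.weight]
        return [h + p for h, p in zip(hidden, projected)]
```

Because the output equals the input hidden state at initialization, attaching the module does not change what the pretrained action expert computes at the start of fine-tuning, which matches the abstract's claim of "preserving pretrained behavior".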

[105] Empowering Semantic-Sensitive Underwater Image Enhancement with VLM

Guodong Fan,Shengning Zhou,Genji Yuan,Huiyu Li,Jingchun Zhou,Jinjiang Li

Main category: cs.CV

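TL;DR: This paper proposes a semantic-sensitive underwater image enhancement method based on vision-language models (VLMs). It generates text descriptions and a spatial semantic guidance map, then uses cross-attention and an alignment loss to focus the enhancement network on key semantic regions, improving perceptual quality and downstream task performance.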

Details Motivation: The outputs of existing underwater image enhancement (UIE) models, even when high quality, exhibit a distribution shift relative to natural images, which hinders semantic cue extraction in downstream vision tasks and limits model adaptability. Method: A VLM generates text descriptions of key objects from the degraded image; a text-image alignment model turns these descriptions into a spatial semantic guidance map; a dual-guidance mechanism (cross-attention plus an explicit alignment loss) drives the UIE network to focus reconstruction on semantic-sensitive regions. Result: Validated on several UIE baselines, the method clearly improves perceptual quality metrics (e.g., PSNR, SSIM) and boosts downstream tasks such as object detection and segmentation. Conclusion: The proposed VLM-driven semantic-sensitive enhancement mechanism mitigates the distribution shift, balances image fidelity with semantic consistency, and generalizes well across baselines and tasks. Abstract: In recent years, learning-based underwater image enhancement (UIE) techniques have rapidly evolved. However, distribution shifts between high-quality enhanced outputs and natural images can hinder semantic cue extraction for downstream vision tasks, thereby limiting the adaptability of existing enhancement models. To address this challenge, this work proposes a new learning mechanism that leverages Vision-Language Models (VLMs) to empower UIE models with semantic-sensitive capabilities. To be concrete, our strategy first generates textual descriptions of key objects from a degraded image via VLMs. Subsequently, a text-image alignment model remaps these relevant descriptions back onto the image to produce a spatial semantic guidance map. This map then steers the UIE network through a dual-guidance mechanism, which combines cross-attention and an explicit alignment loss. This forces the network to focus its restorative power on semantic-sensitive regions during image reconstruction, rather than pursuing a globally uniform improvement, thereby ensuring the faithful restoration of key object features. Experiments confirm that when our strategy is applied to different UIE baselines, it significantly boosts their performance on perceptual quality metrics and enhances their performance on detection and segmentation tasks, validating its effectiveness and adaptability.

[106] Generalized Recognition of Basic Surgical Actions Enables Skill Assessment and Vision-Language-Model-based Surgical Planning

Mengya Xu,Daiyun Shen,Jie Zhang,Hon Chi Yip,Yujia Gao,Cheng Chen,Dillan Imans,Yonghao Long,Yiru Ye,Yixiao Liu,Rongyun Mai,Kai Chen,Hongliang Ren,Yutong Ban,Guangsuo Wang,Francis Wong,Chi-Fai Ng,Kee Yuan Ngiam,Russell H. Taylor,Daguang Xu,Yueming Jin,Qi Dou

Main category: cs.CV

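TL;DR: This paper presents a basic surgical action (BSA) dataset covering 10 basic actions across 6 surgical specialties with over 11,000 video clips, and builds a general-purpose foundation model on it for robust cross-specialty action recognition. Downstream applications in prostatectomy skill assessment and in cholecystectomy and nephrectomy action planning are validated and judged clinically relevant by surgeons from multiple countries.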

Details Motivation: Basic surgical actions (BSA) are the fundamental units of any operation, and modeling and understanding them is key to advancing AI-assisted surgical practice, training, and automation. Method: The authors build the largest BSA video dataset to date (10 action classes, 6 specialties, 11,000+ clips) and train a general-purpose foundation model on it for BSA recognition; combining domain knowledge and large vision-language models, they extend it to downstream tasks such as skill assessment and action planning. Result: The model shows strong cross-specialty generalization on data from different procedure types and anatomical regions; it is applied successfully to skill assessment in prostatectomy and action planning in cholecystectomy and nephrectomy; surgeons from multiple countries confirm that the generated action-planning texts are clinically relevant. Conclusion: Basic surgical actions can be recognized robustly, and an accurate BSA understanding model can support complex surgical AI applications and accelerate the realization of surgical "superintelligence". Abstract: Artificial intelligence, imaging, and large language models have the potential to transform surgical practice, training, and automation. Understanding and modeling of basic surgical actions (BSA), the fundamental unit of operation in any surgery, is important to drive the evolution of this field. In this paper, we present a BSA dataset comprising 10 basic actions across 6 surgical specialties with over 11,000 video clips, which is the largest to date. Based on the BSA dataset, we developed a new foundation model that conducts general-purpose recognition of basic actions. Our approach demonstrates robust cross-specialty performance in experiments validated on datasets from different procedural types and various body parts. Furthermore, we demonstrate downstream applications enabled by the BSA foundation model through surgical skill assessment in prostatectomy using domain-specific knowledge, and action planning in cholecystectomy and nephrectomy using large vision-language models. Multinational surgeons' evaluation of the language model's output of the action planning explainable texts demonstrated clinical relevance. These findings indicate that basic surgical actions can be robustly recognized across scenarios, and an accurate BSA understanding model can essentially facilitate complex applications and speed up the realization of surgical superintelligence.

[107] Think and Answer ME: Benchmarking and Exploring Multi-Entity Reasoning Grounding in Remote Sensing

Shuchang Lyu,Haiquan Wen,Guangliang Cheng,Meng Li,Zheng Zhou,You Zhou,Dingding Yao,Zhenwei Shi

Main category: cs.CV

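TL;DR: This paper proposes a new multi-entity reasoning grounding paradigm for remote sensing imagery, builds the ME-RSRG benchmark dataset, and designs an Entity-Aware Reasoning (EAR) framework that combines structured reasoning-trace generation with entity-aware, reward-driven reinforcement learning, achieving strong results on multi-entity remote sensing visual grounding.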

Details Motivation: Existing remote sensing visual grounding methods are confined to perception-level matching and single-entity formulations; they lack explicit reasoning and inter-entity relationship modeling and struggle with complex multi-entity scenes. Method: The authors build ME-RSRG, a benchmark for multi-entity reasoning grounding, and propose the Entity-Aware Reasoning (EAR) framework, which uses a visual-linguistic foundation model to generate structured reasoning traces and subject-object grounding outputs; it is cold-started with supervised fine-tuning and then optimized with entity-aware Group Relative Policy Optimization (GRPO). Result: Extensive experiments on ME-RSRG confirm that multi-entity reasoning is challenging and show that EAR clearly outperforms baselines in grounding accuracy and reasoning interpretability. Conclusion: Introducing explicit multi-entity reasoning into remote sensing visual grounding is feasible and effective, and the EAR framework and its dataset provide a new benchmark and technical path for this direction. Abstract: Recent advances in reasoning language models and reinforcement learning with verifiable rewards have significantly enhanced multi-step reasoning capabilities. This progress motivates the extension of reasoning paradigms to the remote sensing visual grounding task. However, existing remote sensing grounding methods remain largely confined to perception-level matching and single-entity formulations, limiting the role of explicit reasoning and inter-entity modeling. To address this challenge, we introduce a new benchmark dataset for Multi-Entity Reasoning Grounding in Remote Sensing (ME-RSRG). Based on ME-RSRG, we reformulate remote sensing grounding as a multi-entity reasoning task and propose an Entity-Aware Reasoning (EAR) framework built upon visual-linguistic foundation models. EAR generates structured reasoning traces and subject-object grounding outputs. It adopts supervised fine-tuning for cold-start initialization and is further optimized via entity-aware reward-driven Group Relative Policy Optimization (GRPO). Extensive experiments on ME-RSRG demonstrate the challenges of multi-entity reasoning and verify the effectiveness of our proposed EAR framework. Our dataset, code, and models will be available at https://github.com/CV-ShuchangLyu/ME-RSRG.
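
The group-relative part of GRPO can be sketched in a few lines: several responses are sampled for the same prompt, and each reward is normalized against the statistics of its own group. The sketch below assumes population standard deviation and a small `eps` for stability; the entity-aware reward itself is not specified in the abstract, so the rewards here are plain numbers supplied by the caller, and the function name is hypothetical.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and std of its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]
```

Responses that score above their group's average receive positive advantages and the rest negative ones, so no separate learned value function is needed as a baseline.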

[108] Coherent Human-Scene Reconstruction from Multi-Person Multi-View Video in a Single Pass

Sangmin Kim,Minhyuk Hwang,Geonho Cha,Dongyoon Wee,Jaesik Park

Main category: cs.CV

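TL;DR: CHROMM is a unified framework that jointly estimates camera parameters, scene point clouds, and human meshes from multi-person multi-view video without external modules or preprocessing. It fuses geometric and human priors, adds scale adjustment and multi-view fusion strategies, and delivers efficient, accurate reconstruction on several datasets.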

Details Motivation: Most existing 3D human and scene reconstruction methods take monocular input, and extending them to multi-view settings requires extra modules or preprocessing; a unified, efficient solution for joint multi-view, multi-person reconstruction is missing. Method: CHROMM integrates the geometric and human priors of Pi3X and Multi-HMR; a scale adjustment module resolves the scale discrepancy between humans and the scene; a test-time multi-view fusion strategy aggregates per-view estimates; a geometry-based multi-person association method replaces appearance matching. Result: On EMDB, RICH, EgoHumans, and EgoExo4D, CHROMM achieves competitive performance in global human motion and multi-view pose estimation while running over 8x faster than prior optimization-based multi-view approaches. Conclusion: CHROMM demonstrates that end-to-end joint multi-view, multi-person reconstruction is feasible and efficient, offering a new paradigm for modeling humans together with their environment in complex real-world scenes. Abstract: Recent advances in 3D foundation models have led to growing interest in reconstructing humans and their surrounding environments. However, most existing approaches focus on monocular inputs, and extending them to multi-view settings requires additional overhead modules or preprocessed data. To this end, we present CHROMM, a unified framework that jointly estimates cameras, scene point clouds, and human meshes from multi-person multi-view videos without relying on external modules or preprocessing. We integrate strong geometric and human priors from Pi3X and Multi-HMR into a single trainable neural network architecture, and introduce a scale adjustment module to solve the scale discrepancy between humans and the scene. We also introduce a multi-view fusion strategy to aggregate per-view estimates into a single representation at test time. Finally, we propose a geometry-based multi-person association method, which is more robust than appearance-based approaches. Experiments on EMDB, RICH, EgoHumans, and EgoExo4D show that CHROMM achieves competitive performance in global human motion and multi-view pose estimation while running over 8x faster than prior optimization-based multi-view approaches. Project page: https://nstar1125.github.io/chromm.

[109] Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation

Yichen Zhang,Da Peng,Zonghao Guo,Zijian Zhang,Xuesong Yang,Tong Sun,Shichu Sun,Yidan Zhang,Yanghao Li,Haiyan Zhao,Wang Xu,Qi Shi,Yangang Sun,Chi Chen,Shuo Wang,Yukun Yan,Xu Han,Qiang Ma,Wei Ke,Liang Wang,Zhiyuan Liu,Maosong Sun

Main category: cs.CV

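TL;DR: Cheers is a unified multimodal model that decouples an image's semantic representation from its detail information, achieving high-quality visual understanding and generation in a single model with 4x token compression and a much lower training cost.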

Details Motivation: Visual understanding and generation place mismatched demands on decoding regimes and visual representations, which makes joint optimization in a shared feature space difficult. Method: Cheers has three core components: (i) a unified vision tokenizer that encodes and compresses image latent states into semantic tokens; (ii) an LLM-based Transformer that unifies autoregressive text decoding and diffusion image decoding; (iii) a cascaded flow matching head that decodes visual semantics first and then injects semantically gated detail residuals to refine high-frequency content. Result: Cheers matches or surpasses advanced unified multimodal models (UMMs) on visual understanding and generation benchmarks, outperforms Tar-1.5B on GenEval and MMBench at only 20% of its training cost, and achieves 4x token compression for efficient high-resolution image encoding and generation. Conclusion: Cheers shows that decoupling semantics from details can effectively unify multimodal understanding and generation, with clear advantages in performance, efficiency, and scalability. Abstract: A recent cutting-edge topic in multimodal modeling is to unify visual comprehension and generation within a single model. However, the two tasks demand mismatched decoding regimes and visual representations, making it non-trivial to jointly optimize within a shared feature space. In this work, we present Cheers, a unified multimodal model that decouples patch-level details from semantic representations, thereby stabilizing semantics for multimodal understanding and improving fidelity for image generation via gated detail residuals. Cheers includes three key components: (i) a unified vision tokenizer that encodes and compresses image latent states into semantic tokens for efficient LLM conditioning, (ii) an LLM-based Transformer that unifies autoregressive decoding for text generation and diffusion decoding for image generation, and (iii) a cascaded flow matching head that decodes visual semantics first and then injects semantically gated detail residuals from the vision tokenizer to refine high-frequency content. Experiments on popular benchmarks demonstrate that Cheers matches or surpasses advanced UMMs in both visual understanding and generation. Cheers also achieves 4x token compression, enabling more efficient high-resolution image encoding and generation. Notably, Cheers outperforms Tar-1.5B on the popular benchmarks GenEval and MMBench, while requiring only 20% of the training cost, indicating effective and efficient (i.e., 4x token compression) unified multimodal modeling. We will release all code and data for future research.

[110] Spectral Defense Against Resource-Targeting Attack in 3D Gaussian Splatting

Yang Chen,Yi Yu,Jiaming He,Yueqi Duan,Zheng Zhu,Yap-Peng Tan

Main category: cs.CV

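TL;DR: This paper proposes a spectral defense against resource-targeting attacks on 3D Gaussian Splatting (3DGS). It combines frequency-domain pruning of 3D Gaussians with anisotropic spectral regularization on 2D renderings to suppress Gaussian overgrowth, reduce memory use, and speed up rendering.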

Details Motivation: 3D Gaussian Splatting is vulnerable to resource-targeting attacks, in which an attacker poisons training images to induce excessive Gaussian growth and exhaust resources; existing spatial-domain defenses (such as smoothing, thresholding, and pruning) overlook the abnormal high-frequency amplification that poisoned inputs cause in the spectral domain. Method: The spectral defense framework has two parts: 1) a 3D frequency filter that selectively prunes Gaussians with abnormally high frequencies; 2) a spectral regularization on 2D renderings that distinguishes naturally isotropic high frequencies from the anisotropic noise energy introduced by the attack and penalizes the latter. Result: Under attack, the method suppresses Gaussian overgrowth by up to 5.92x, reduces memory by up to 3.66x, and improves speed by up to 4.34x, clearly improving the robustness, accuracy, and security of 3DGS. Conclusion: The spectral perspective is an effective new route for defending 3DGS against resource-targeting attacks, and combining frequency-domain control of 3D Gaussians with spectral regularization of 2D renderings balances security with reconstruction quality. Abstract: Recent advances in 3D Gaussian Splatting (3DGS) deliver high-quality rendering, yet the Gaussian representation exposes a new attack surface, the resource-targeting attack. This attack poisons training images, inducing excessive Gaussian growth to cause resource exhaustion. Although efficiency-oriented methods such as smoothing, thresholding, and pruning have been explored, these spatial-domain strategies operate on visible structures but overlook how stealthy perturbations distort the underlying spectral behaviors of training data. As a result, poisoned inputs introduce abnormal high-frequency amplifications that mislead 3DGS into interpreting noisy patterns as detailed structures, ultimately causing unstable Gaussian overgrowth and degraded scene fidelity. To address this, we propose Spectral Defense in Gaussian and image fields. We first design a 3D frequency filter to selectively prune Gaussians exhibiting abnormally high frequencies. Since natural scenes also contain legitimate high-frequency structures, directly suppressing high frequencies is insufficient, and we further develop a 2D spectral regularization on renderings, distinguishing naturally isotropic frequencies while penalizing anisotropic angular energy to constrain noisy patterns. Experiments show that our defense builds robust, accurate, and secure 3DGS, suppressing overgrowth by up to $5.92\times$, reducing memory by up to $3.66\times$, and improving speed by up to $4.34\times$ under attacks.

[111] What Makes VLMs Robust? Towards Reconciling Robustness and Accuracy in Vision-Language Models

Sen Nie,Jie Zhang,Zhongqi Wang,Zhaoyang Wei,Shiguang Shan,Xilin Chen

Main category: cs.CV

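TL;DR: This paper proposes R-Adapt, a framework that freezes the pretrained weights and introduces lightweight adaptations only in the shallow layers, easing the trade-off between adversarial robustness and clean accuracy in vision-language models; its effectiveness is validated on multiple datasets and large models.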

Details Motivation: To explore the internal mechanisms of adversarial robustness in vision-language models (VLMs) and understand the root of its trade-off with clean accuracy, in particular how robustness is distributed across network depth. Method: By analyzing adversarially fine-tuned models, the authors find that robustness is concentrated in the shallow layers (driven by a low-frequency spectral bias and input-insensitive attention), while updates to deep layers hurt performance; on this basis they propose R-Adapt, which freezes all pretrained weights and applies minimal, insight-driven adaptations only in the initial layers. Result: R-Adapt achieves state-of-the-art adversarial robustness on 18 datasets and diverse tasks while keeping clean accuracy high; it extends to large VLMs such as LLaVA and Qwen-VL and supports training-free, model-guided, and data-driven paradigms. Conclusion: Adversarial robustness in VLMs stems mainly from properties of the shallow layers, and by focusing adaptation there, R-Adapt strikes a strong balance between robustness and accuracy, offering a new paradigm for improving VLM robustness efficiently. Abstract: Achieving adversarial robustness in Vision-Language Models (VLMs) inevitably compromises accuracy on clean data, presenting a long-standing and challenging trade-off. In this work, we revisit this trade-off by investigating a fundamental question: What makes VLMs robust? Through a detailed analysis of adversarially fine-tuned models, we examine how robustness mechanisms function internally and how they interact with clean accuracy. Our analysis reveals that adversarial robustness is not uniformly distributed across network depth. Instead, unexpectedly, it is primarily localized within the shallow layers, driven by a low-frequency spectral bias and input-insensitive attention patterns. Meanwhile, updates to the deep layers tend to undermine both clean accuracy and robust generalization. Motivated by these insights, we propose Adversarial Robustness Adaptation (R-Adapt), a simple yet effective framework that freezes all pre-trained weights and introduces minimal, insight-driven adaptations only in the initial layers. This design achieves an exceptional balance between adversarial robustness and clean accuracy. R-Adapt further supports training-free, model-guided, and data-driven paradigms, offering flexible pathways to seamlessly equip standard models with robustness. Extensive evaluations on 18 datasets and diverse tasks demonstrate our state-of-the-art performance under various attacks. Notably, R-Adapt generalizes efficiently to large vision-language models (e.g., LLaVA and Qwen-VL) to enhance their robustness. Our project page is available at https://summu77.github.io/R-Adapt.

[112] OARS: Process-Aware Online Alignment for Generative Real-World Image Super-Resolution

Shijie Zhao,Xuanyu Zhang,Bin Chen,Weiqi Li,Qunliang Xing,Kexin Zhang,Yan Wang,Junlin Li,Li Zhang,Jian Zhang,Tianfan Xue

Main category: cs.CV

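TL;DR: This paper proposes OARS, a framework that aligns generative image super-resolution models with human visual preference online, using COMPASS, a reward function based on a multimodal large language model (MLLM), to address the perception-fidelity trade-off and unknown degradations.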

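Details Motivation: Existing methods rely on offline preference optimization and static metric aggregation, which are hard to interpret and prone to pseudo-diversity under strong conditioning, and they cope poorly with the diverse, unknown degradations found in real-world images. Method: OARS is an online alignment framework built around COMPASS, an MLLM reward function that jointly models fidelity preservation and perceptual gain and adapts the trade-off to input quality; COMPASS is trained on the COMPASS-20K dataset, which spans synthetic and real degradations, using a three-stage perceptual annotation pipeline; OARS aligns progressively online, from cold-start flow matching to full-reference and finally reference-free reinforcement learning, using shallow LoRA optimization for on-policy exploration. Result: OARS reaches state-of-the-art performance on Real-ISR benchmarks; extensive experiments and user studies show clear perceptual improvements while maintaining fidelity. Conclusion: OARS and COMPASS provide an interpretable, robust, and efficient online alignment paradigm for generative image super-resolution models, narrowing the gap between model outputs and human visual preference. Abstract: Aligning generative real-world image super-resolution models with human visual preference is challenging due to the perception-fidelity trade-off and diverse, unknown degradations. Prior approaches rely on offline preference optimization and static metric aggregation, which are often non-interpretable and prone to pseudo-diversity under strong conditioning. We propose OARS, a process-aware online alignment framework built on COMPASS, an MLLM-based reward that evaluates the LR to SR transition by jointly modeling fidelity preservation and perceptual gain with an input-quality-adaptive trade-off. To train COMPASS, we curate COMPASS-20K spanning synthetic and real degradations, and introduce a three-stage perceptual annotation pipeline that yields calibrated, fine-grained training labels. Guided by COMPASS, OARS performs progressive online alignment from cold-start flow matching to full-reference and finally reference-free RL via shallow LoRA optimization for on-policy exploration. Extensive experiments and user studies demonstrate consistent perceptual improvements while maintaining fidelity, achieving state-of-the-art performance on Real-ISR benchmarks.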

[113] coDrawAgents: A Multi-Agent Dialogue Framework for Compositional Image Generation

Chunhan Li,Qifeng Wu,Jia-Hui Pan,Ka-Hei Hui,Jingyu Hu,Yuming Jiang,Bin Sheng,Xihui Liu,Wenjuan Gong,Zhengzhe Liu

Main category: cs.CV

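TL;DR: This paper proposes coDrawAgents, an interactive text-to-image generation framework based on multi-agent collaboration, in which four specialized agents (Interpreter, Planner, Checker, and Painter) work together to improve the compositional generation of multiple objects and their attributes in complex scenes.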

Details Motivation: Existing text-to-image models struggle to compose multiple objects accurately and preserve their attributes in complex scenes, and they lack support for handling layout complexity, visual-context dependence, and explicit error correction. Method: A four-agent collaborative framework: the Interpreter adaptively chooses the generation pathway and parses the prompt; the Planner plans layouts hierarchically by semantic salience; the Checker validates spatial consistency and attribute alignment and corrects errors; the Painter synthesizes the image step by step and updates the canvas context. Result: On the GenEval and DPG-Bench benchmarks, coDrawAgents substantially improves text-image alignment, spatial accuracy, and attribute binding. Conclusion: Through a clear division of labor, context awareness, and error correction, the multi-agent collaboration in coDrawAgents addresses the key challenges of compositional generation in complex scenes. Abstract: Text-to-image generation has advanced rapidly, but existing models still struggle with faithfully composing multiple objects and preserving their attributes in complex scenes. We propose coDrawAgents, an interactive multi-agent dialogue framework with four specialized agents (Interpreter, Planner, Checker, and Painter) that collaborate to improve compositional generation. The Interpreter adaptively decides between a direct text-to-image pathway and a layout-aware multi-agent process. In the layout-aware mode, it parses the prompt into attribute-rich object descriptors, ranks them by semantic salience, and groups objects with the same semantic priority level for joint generation. Guided by the Interpreter, the Planner adopts a divide-and-conquer strategy, incrementally proposing layouts for objects with the same semantic priority level while grounding decisions in the evolving visual context of the canvas. The Checker introduces an explicit error-correction mechanism by validating spatial consistency and attribute alignment, and refining layouts before they are rendered. Finally, the Painter synthesizes the image step by step, incorporating newly planned objects into the canvas to provide richer context for subsequent iterations. Together, these agents address three key challenges: reducing layout complexity, grounding planning in visual context, and enabling explicit error correction. Extensive experiments on the GenEval and DPG-Bench benchmarks demonstrate that coDrawAgents substantially improves text-image alignment, spatial accuracy, and attribute binding compared to existing methods.

[114] Hierarchical Dual-Change Collaborative Learning for UAV Scene Change Captioning

Fuhai Chen,Pengpeng Huang,Junwen Wu,Hehong Zhang,Shiping Wang,Xiaoguang Ma,Xuri Ge

Main category: cs.CV

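TL;DR: This paper introduces a new task for unmanned aerial vehicle (UAV) scene understanding, UAV Scene Change Captioning (UAV-SCC), which generates natural language descriptions of semantic changes in dynamic aerial imagery. To tackle it, the authors design Hierarchical Dual-Change Collaborative Learning (HDC-CL), comprising a Dynamic Adaptive Layout Transformer (DALT) and Hierarchical Cross-modal Orientation Consistency Calibration (HCM-OCC), and build the first UAV-SCC benchmark dataset.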

Details Motivation: Traditional change captioning works on image pairs taken from a fixed viewpoint, whereas UAV image pairs show viewpoint changes and only partial scene overlap because the vehicle moves, which makes semantic changes hard to model accurately; a task and method designed for the UAV's dynamic viewpoint are therefore needed. Method: The Hierarchical Dual-Change Collaborative Learning (HDC-CL) framework: 1) a Dynamic Adaptive Layout Transformer (DALT) adaptively models the diverse spatial layouts of an image pair and jointly learns features of overlapping and non-overlapping regions; 2) Hierarchical Cross-modal Orientation Consistency Calibration (HCM-OCC) increases the model's sensitivity to the direction of the viewpoint shift. The authors also build the UAV-SCC dataset. Result: The proposed method achieves state-of-the-art performance on the newly built UAV-SCC dataset. Conclusion: UAV-SCC is a practically relevant new task; HDC-CL addresses the key challenges of understanding and describing changes under a dynamic viewpoint, and the dataset and code to be released should support further research in this direction. Abstract: This paper proposes a novel task for UAV scene understanding, UAV Scene Change Captioning (UAV-SCC), which aims to generate natural language descriptions of semantic changes in dynamic aerial imagery captured from a movable viewpoint. Unlike traditional change captioning that mainly describes differences between image pairs captured from a fixed camera viewpoint over time, UAV scene change captioning focuses on image-pair differences resulting from both temporal and spatial scene variations dynamically captured by a moving camera. The key challenge lies in understanding viewpoint-induced scene changes from UAV image pairs that share only partially overlapping scene content due to viewpoint shifts caused by camera rotation, while effectively exploiting the relative orientation between the two images. To this end, we propose a Hierarchical Dual-Change Collaborative Learning (HDC-CL) method for UAV scene change captioning. In particular, a novel transformer, i.e., the Dynamic Adaptive Layout Transformer (DALT), is designed to adaptively model diverse spatial layouts of the image pair, where the interrelated features derived from the overlapping and non-overlapping regions are learned within the flexible and unified encoding layer. Furthermore, we propose a Hierarchical Cross-modal Orientation Consistency Calibration (HCM-OCC) method to enhance the model's sensitivity to viewpoint shift directions, enabling more accurate change captioning. 
To facilitate in-depth research on this task, we construct a new benchmark dataset, named UAV-SCC dataset, for UAV scene change captioning. Extensive experiments demonstrate that the proposed method achieves state-of-the-art performance on this task. The dataset and code will be publicly released upon acceptance of this paper.

[115] Multimodal Protein Language Models for Enzyme Kinetic Parameters: From Substrate Recognition to Conformational Adaptation

Fei Wang,Xinye Zheng,Kun Li,Yanyan Wei,Yuxin Liu,Ganpeng Hu,Tong Bao,Jingwen Yang

Main category: cs.CV

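TL;DR: This paper proposes ERBA, which recasts enzyme kinetic parameter prediction as a staged multimodal conditional modeling problem. It conditions in two stages, Molecular Recognition Cross-Attention (MRCA) and a Geometry-aware Mixture-of-Experts (G-MoE), and uses Enzyme-Substrate Distribution Alignment (ESDA) to preserve semantic fidelity, improving both prediction performance and out-of-distribution generalization.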

Details Motivation: Existing methods reduce enzyme kinetic prediction to a static compatibility problem, ignoring the staged nature of catalysis (substrate recognition followed by conformational adaptation) and failing to model the dynamic character of enzyme-substrate interaction. Method: The Enzyme-Reaction Bridging Adapter (ERBA): 1) Molecular Recognition Cross-Attention (MRCA) injects substrate information into the enzyme representation to model specificity; 2) a Geometry-aware Mixture-of-Experts (G-MoE) integrates active-site structure and routes samples to pocket-specialized experts to reflect induced fit; 3) Enzyme-Substrate Distribution Alignment (ESDA) aligns the enzyme and substrate representation distributions in a reproducing kernel Hilbert space. Result: Across three kinetic endpoints (k_cat, K_m, K_i) and multiple protein language model backbones, ERBA consistently outperforms sequence-only and shallow-fusion baselines, with the largest benefit in out-of-distribution generalization. Conclusion: ERBA offers a biologically grounded paradigm for scalable enzyme kinetic prediction and a framework for later integrating cofactors, mutations, and time-resolved structural information. Abstract: Predicting enzyme kinetic parameters quantifies how efficiently an enzyme catalyzes a specific substrate under defined biochemical conditions. Canonical parameters such as the turnover number ($k_\text{cat}$), Michaelis constant ($K_\text{m}$), and inhibition constant ($K_\text{i}$) depend jointly on the enzyme sequence, the substrate chemistry, and the conformational adaptation of the active site during binding. Many learning pipelines simplify this process to a static compatibility problem between the enzyme and substrate, fusing their representations through shallow operations and regressing a single value. Such formulations overlook the staged nature of catalysis, which involves both substrate recognition and conformational adaptation. In this regard, we reformulate kinetic prediction as a staged multimodal conditional modeling problem and introduce the Enzyme-Reaction Bridging Adapter (ERBA), which injects cross-modal information via fine-tuning into Protein Language Models (PLMs) while preserving their biochemical priors. ERBA performs conditioning in two stages: Molecular Recognition Cross-Attention (MRCA) first injects substrate information into the enzyme representation to capture specificity; Geometry-aware Mixture-of-Experts (G-MoE) then integrates active-site structure and routes samples to pocket-specialized experts to reflect induced fit. 
To maintain semantic fidelity, Enzyme-Substrate Distribution Alignment (ESDA) enforces distributional consistency within the PLM manifold in a reproducing kernel Hilbert space. In experiments across three kinetic endpoints and multiple PLM backbones, ERBA delivers consistent gains and stronger out-of-distribution performance compared with sequence-only and shallow-fusion baselines, offering a biologically grounded route to scalable kinetic prediction and a foundation for adding cofactors, mutations, and time-resolved structural cues.

[116] Team LEYA in 10th ABAW Competition: Multimodal Ambivalence/Hesitancy Recognition Approach

Elena Ryumina,Alexandr Axyonov,Dmitry Sysoev,Timur Abdulkadirov,Kirill Almetov,Yulia Morozova,Dmitry Ryumin

Main category: cs.CV

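TL;DR: This paper presents a multimodal approach for video-level ambivalence/hesitancy recognition that fuses four modalities (scene, face, audio, and text) and validates it on the BAH dataset.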

Details Motivation: Recognizing ambivalence/hesitancy in unconstrained videos is challenging because its behavioral expression is subtle, multimodal, and context-dependent. Method: The approach fuses four modalities: scene (VideoMAE), face (emotional frame-level embeddings with statistical pooling), audio (EmotionWav2Vec2.0 with a Mamba-based temporal encoder), and text (fine-tuned Transformers), and combines them with prototype-augmented multimodal fusion models. Result: On the BAH dataset, the best unimodal configuration reaches an MF1 of 70.02% and the best multimodal fusion reaches 83.25%; an ensemble of five models achieves 71.43% on the final test set. Conclusion: Complementary multimodal cues and robust fusion strategies are essential for ambivalence/hesitancy recognition. Abstract: Ambivalence/hesitancy recognition in unconstrained videos is a challenging problem due to the subtle, multimodal, and context-dependent nature of this behavioral state. In this paper, a multimodal approach for video-level ambivalence/hesitancy recognition is presented for the 10th ABAW Competition. The proposed approach integrates four complementary modalities: scene, face, audio, and text. Scene dynamics are captured with a VideoMAE-based model, facial information is encoded through emotional frame-level embeddings aggregated by statistical pooling, acoustic representations are extracted with EmotionWav2Vec2.0 and processed by a Mamba-based temporal encoder, and linguistic cues are modeled using fine-tuned transformer-based text models. The resulting unimodal embeddings are further combined using multimodal fusion models, including prototype-augmented variants. Experiments on the BAH corpus demonstrate clear gains of multimodal fusion over all unimodal baselines. The best unimodal configuration achieved an average MF1 of 70.02%, whereas the best multimodal fusion model reached 83.25%. The highest final test performance, 71.43%, was obtained by an ensemble of five prototype-augmented fusion models. The obtained results highlight the importance of complementary multimodal cues and robust fusion strategies for ambivalence/hesitancy recognition.
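
The facial branch above aggregates frame-level embeddings by statistical pooling. The abstract does not say which statistics are used; the sketch below assumes the common choice of concatenating the per-dimension mean and (population) standard deviation. The function name `statistical_pooling` and the toy data are hypothetical.

```python
import math

def statistical_pooling(frames):
    """Pool equal-length frame embeddings into one video-level vector.

    Assumption (not stated in the abstract): the pooled vector is the
    per-dimension mean followed by the per-dimension population std.
    """
    if not frames:
        raise ValueError("need at least one frame embedding")
    n, dim = len(frames), len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dim)
    ]
    return means + stds

# Three toy 2-dimensional frame embeddings.
video = [[1.0, 0.0], [3.0, 0.0], [5.0, 0.0]]
pooled = statistical_pooling(video)  # length 2 * dim, independent of frame count
```

The pooled vector has a fixed size regardless of video length, which is what lets clips of different durations feed the same fusion model.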

[117] Wear Classification of Abrasive Flap Wheels using a Hierarchical Deep Learning Approach

Falko Kähler,Maxim Wille,Ole Schmedemann,Thorsten Schüppstuhl

Main category: cs.CV

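TL;DR: This paper proposes a vision-based hierarchical classification framework for automatically monitoring the wear condition of abrasive flap wheels. It classifies at three logical levels (state detection, wear type identification, and severity assessment), combines EfficientNetV2 transfer learning with Grad-CAM interpretability analysis, and reaches accuracies of 93.8% to 99.3% on a real-world dataset.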

Details Motivation: The flexibility of flap wheels leads to complex wear (concave/convex profiles, tears) that affects grinding quality, so an automated, fine-grained wear monitoring method is needed. Method: The authors build a dataset of real flap wheel images, apply transfer learning with EfficientNetV2, and design a three-level hierarchical classification framework: (1) new vs. worn state detection; (2) wear type identification (rectangular/concave/convex) and flap tear detection; (3) assessment of deformation severity (partial vs. complete). Grad-CAM is used to check that the learned features are physically relevant. Result: Per-task accuracies range from 93.8% (flap tear detection) to 99.3% (concave severity), indicating high overall robustness; Grad-CAM confirms that the models attend to physically relevant regions and makes misclassifications traceable. Conclusion: The hierarchical vision-based classification framework provides a reliable basis for adaptive process control and wear compensation in flap wheel grinding. Abstract: Abrasive flap wheels are common for finishing complex free-form surfaces due to their flexibility. However, this flexibility results in complex wear patterns such as concave/convex flap profiles or flap tears, which influence the grinding result. This paper proposes a novel, vision-based hierarchical classification framework to automate the wear condition monitoring of flap wheels. Unlike monolithic classification approaches, we decompose the problem into three logical levels: (1) state detection (new vs. worn), (2) wear type identification (rectangular, concave, convex) and flap tear detection, and (3) severity assessment (partial vs. complete deformation). A custom-built dataset of real flap wheel images was generated and a transfer learning approach with the EfficientNetV2 architecture was used. The results demonstrate high robustness with classification accuracies ranging from 93.8% (flap tears) to 99.3% (concave severity). Furthermore, Gradient-weighted Class Activation Mapping (Grad-CAM) is utilized to validate that the models learn physically relevant features and to examine false classifications. The proposed hierarchical method provides a basis for adaptive process control and wear consideration in automated flap wheel grinding.
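
The three-level decomposition above can be read as a cascade in which later classifiers run only when an earlier level makes them relevant. The sketch below illustrates that control flow only. The classifier callables, the label strings, and the assumption that severity is assessed per wear type (here for concave and convex only) are hypothetical and not taken from the paper.

```python
def classify_flap_wheel(image, state_clf, type_clf, tear_clf, severity_clfs):
    """Run a three-level cascade; each *_clf is any callable image -> label."""
    result = {"state": state_clf(image)}           # level 1: new vs. worn
    if result["state"] == "new":
        return result                              # nothing further to assess
    result["wear_type"] = type_clf(image)          # level 2: profile shape
    result["flap_tear"] = tear_clf(image)          # level 2: tear yes/no
    severity_clf = severity_clfs.get(result["wear_type"])
    if severity_clf is not None:                   # level 3: only where defined
        result["severity"] = severity_clf(image)
    return result

# Stub classifiers standing in for fine-tuned image models.
stub_severity = {"concave": lambda img: "partial", "convex": lambda img: "complete"}
worn = classify_flap_wheel(
    "img-001", lambda img: "worn", lambda img: "concave", lambda img: False, stub_severity
)
fresh = classify_flap_wheel(
    "img-002", lambda img: "new", lambda img: "concave", lambda img: False, stub_severity
)
```

Compared with one monolithic classifier over all combinations, each model in the cascade solves a small problem, and a misclassification can be traced to a specific level.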

[118] Composing Driving Worlds through Disentangled Control for Adversarial Scenario Generation

Yifan Zhan,Zhengqing Chen,Qingjie Wang,Zhuo He,Muyao Niu,Xiaoyang Guo,Wei Yin,Weiqiang Ren,Qian Zhang,Yinqiang Zheng

Main category: cs.CV

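TL;DR: This paper proposes CompoSIA, a compositional driving video simulator that disentangles traffic factors (scene structure, object identity, and ego action) to enable fine-grained, controllable generation of adversarial driving scenarios; it clearly outperforms existing methods in controllable generation quality and in downstream stress testing.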

Details Motivation: Autonomous driving faces a long tail of safety-critical edge cases that arise from unusual combinations of common traffic elements, yet existing controllable generative models cannot manipulate scene structure, object identity, and ego action independently, which makes it hard to synthesize realistic and diverse adversarial scenarios. Method: CompoSIA: 1) noise-level identity injection enables pose-agnostic identity replacement (multiple poses of an identity generated from a single reference image); 2) a hierarchical dual-branch action control mechanism improves action controllability; 3) together these give disentangled control over the three traffic factors. Result: FVD for identity editing improves by 17%; rotation and translation errors for action control drop by 30% and 47%; in downstream stress testing, the average collision rate within 3 s rises by 173%. Conclusion: Disentangled control is key to synthesizing high-quality, diverse adversarial driving scenarios, and CompoSIA offers a more effective and controllable simulation tool for validating the robustness of autonomous driving systems. Abstract: A major challenge in autonomous driving is the "long tail" of safety-critical edge cases, which often emerge from unusual combinations of common traffic elements. Synthesizing these scenarios is crucial, yet current controllable generative models provide incomplete or entangled guidance, preventing the independent manipulation of scene structure, object identity, and ego actions. We introduce CompoSIA, a compositional driving video simulator that disentangles these traffic factors, enabling fine-grained control over diverse adversarial driving scenarios. To support controllable identity replacement of scene elements, we propose a noise-level identity injection, allowing pose-agnostic identity generation across diverse element poses, all from a single reference image. Furthermore, a hierarchical dual-branch action control mechanism is introduced to improve action controllability. Such disentangled control enables adversarial scenario synthesis: systematically combining safe elements into dangerous configurations that entangled generators cannot produce. Extensive comparisons demonstrate superior controllable generation quality over state-of-the-art baselines, with a 17% improvement in FVD for identity editing and reductions of 30% and 47% in rotation and translation errors for action control. Furthermore, downstream stress-testing reveals substantial planner failures: across editing modalities, the average collision rate within 3 s increases by 173%.

[119] TRACE: Structure-Aware Character Encoding for Robust and Generalizable Document Watermarking

Jiale Meng,Jie Zhang,Runyi Hu,Zhe-Ming Lu,Tianwei Zhang,Yiming Li

Main category: cs.CV

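TL;DR: This paper proposes TRACE, a structure-aware framework that uses diffusion models for localized character encoding to embed data; compared with existing methods it improves noise resistance, PSNR, and extraction accuracy.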

Details Motivation: Existing methods rely on edge features or pre-defined codebooks and do not exploit the inherent stability of character structure, so they resist noise poorly. Method: TRACE has three core components: (1) adaptive diffusion initialization (with the MPE, TPE, and MDM algorithms); (2) guided diffusion encoding that moves the selected points precisely; (3) masked region replacement with a specialized loss function to minimize feature changes after diffusion. Result: After cross-media transmission, PSNR improves by more than 5 dB and extraction accuracy by 5%, with strong generalization across languages and fonts. Conclusion: By exploiting the stability of character structure, TRACE makes watermark embedding and extraction more robust and practical, and it suits real-world document security applications. Abstract: We propose TRACE, a structure-aware framework leveraging diffusion models for localized character encoding to embed data. Unlike existing methods that rely on edge features or pre-defined codebooks, TRACE exploits character structures that provide inherent resistance to noise interference due to their stability and unified representation across diverse characters. Our framework comprises three key components: (1) adaptive diffusion initialization that automatically identifies handle points, target points, and editing regions through specialized algorithms including movement probability estimator (MPE), target point estimation (TPE) and mask drawing model (MDM), (2) guided diffusion encoding for precise movement of selected points, and (3) masked region replacement with a specialized loss function to minimize feature alterations after the diffusion process. Comprehensive experiments demonstrate TRACE's superior performance over state-of-the-art methods, achieving more than 5 dB improvement in PSNR and 5% higher extraction accuracy following cross-media transmission. TRACE achieves broad generalizability across multiple languages and fonts, making it particularly suitable for practical document security applications.

[120] A protocol for evaluating robustness to H&E staining variation in computational pathology models

Lydia A. Schönpflug,Nikki van den Berg,Sonali Andani,Nanda Horeweg,Jurriaan Barkey Wolf,Tjalling Bosse,Viktor H. Koelzer,Maxime W. Lafarge

Main category: cs.CV

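TL;DR: This paper proposes a three-step protocol for evaluating the robustness of computational pathology (CPath) models to variation in hematoxylin and eosin (H&E) staining, and demonstrates it on 306 microsatellite instability (MSI) classification models.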

Details Motivation: H&E staining differs markedly between laboratories, which affects the generalization and clinical deployment of computational pathology models, so its effect on model predictions needs systematic assessment. Method: A three-step evaluation protocol: 1) select reference staining conditions; 2) characterize the staining properties of the test set; 3) run the CPath model(s) under simulated reference staining conditions. The authors build a new reference staining library from the PLISM dataset and, on the SurGen colorectal cancer dataset (n=738), evaluate 306 MSI models (300 attention-based MIL models and 6 public models) under four simulated staining conditions, reporting AUC and robustness (min-max AUC range). Result: Model AUC ranges from 0.769 to 0.911 (Δ=0.142) and robustness from 0.007 to 0.079 (Δ=0.072); robustness correlates weakly and inversely with performance (Pearson r=−0.22). The protocol supports robustness-informed model selection and the identification of operational ranges for reliable deployment. Conclusion: The protocol provides a reproducible, extensible framework for quantifying CPath model robustness and informing deployment decisions under real-world staining variation. Abstract: Sensitivity to staining variation remains a major barrier to deploying computational pathology (CPath) models as hematoxylin and eosin (H&E) staining varies across laboratories, requiring systematic assessment of how this variability affects model prediction. In this work, we developed a three-step protocol for evaluating robustness to H&E staining variation in CPath models. Step 1: select reference staining conditions; Step 2: characterize test set staining properties; Step 3: apply CPath model(s) under simulated reference staining conditions. Here, we first created a new reference staining library based on the PLISM dataset. As an exemplary use case, we applied the protocol to assess the robustness properties of 306 microsatellite instability (MSI) classification models on the unseen SurGen colorectal cancer dataset (n=738), including 300 attention-based multiple instance learning models trained on the TCGA-COAD/READ datasets across three feature extractors (UNI2-h, H-Optimus-1, Virchow2), alongside six public MSI classification models. Classification performance was measured as AUC, and robustness as the min-max AUC range across four simulated staining conditions (low/high H&E intensity, low/high H&E color similarity). Across models and staining conditions, classification performance ranged from AUC 0.769-0.911 ($Δ$ = 0.142). Robustness ranged from 0.007-0.079 ($Δ$ = 0.072), and showed a weak inverse correlation with classification performance (Pearson r=-0.22, 95% CI [-0.34, -0.11]). 
Thus, we show that the proposed evaluation protocol enables robustness-informed CPath model selection and provides insight into performance shifts across H&E staining conditions, supporting the identification of operational ranges for reliable model deployment. Code is available at https://github.com/CTPLab/staining-robustness-evaluation .
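
The robustness measure in this abstract, the min-max AUC range across simulated staining conditions, is simple to compute once per-condition prediction scores are available. The sketch below uses a pairwise (Mann-Whitney) AUC so it needs no external library; the function names, condition keys, and toy scores are hypothetical and are not taken from the paper's released code.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def robustness_range(labels, scores_by_condition):
    """Return (max AUC - min AUC, per-condition AUCs); a smaller range means a more robust model."""
    aucs = {cond: auc(labels, s) for cond, s in scores_by_condition.items()}
    return max(aucs.values()) - min(aucs.values()), aucs

# Toy example: one model scored on the same four slides under four stain simulations.
labels = [1, 1, 0, 0]
scores_by_condition = {
    "low_intensity": [0.9, 0.8, 0.2, 0.1],
    "high_intensity": [0.9, 0.3, 0.4, 0.1],
    "low_color_similarity": [0.7, 0.6, 0.5, 0.4],
    "high_color_similarity": [0.6, 0.5, 0.5, 0.1],
}
spread, per_condition = robustness_range(labels, scores_by_condition)
```

Reporting the range next to the mean AUC separates models that are accurate only under one staining condition from models that are stable across conditions.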

[121] Forecasting Epileptic Seizures from Contactless Camera via Cross-Species Transfer Learning

Mingkai Zhai,Wei Wang,Zongsheng Li,Quanying Liu

Main category: cs.CV

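TL;DR: This paper formulates a new task, video-based epileptic seizure forecasting, and designs a cross-species transfer learning framework that pretrains on rat video data to generalize to human video, reaching over 70% prediction accuracy.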

Details Motivation: Existing seizure forecasting methods rely mainly on neural signals such as EEG, which require specialized equipment and are hard to deploy long term. Video is non-invasive and easy to obtain, but existing video studies focus on post-onset detection, leaving pre-onset forecasting unexplored. Method: The authors define a new task of predicting, from a short pre-ictal video segment (3-10 seconds), whether a seizure will occur within the next 5 seconds, and build a cross-species transfer learning framework that uses large-scale rodent video for auxiliary pretraining to learn seizure-related behavioral dynamics that generalize across species. Result: Prediction accuracy exceeds 70% in a strictly video-only setting, outperforming existing baselines. Conclusion: Cross-species learning can improve non-invasive, scalable early-warning systems for epilepsy and opens a new route toward clinical use. Abstract: Epileptic seizure forecasting is a clinically important yet challenging problem in epilepsy research. Existing approaches predominantly rely on neural signals such as electroencephalography (EEG), which require specialized equipment and limit long-term deployment in real-world settings. In contrast, video data provide a non-invasive and accessible alternative, yet existing video-based studies mainly focus on post-onset seizure detection, leaving seizure forecasting largely unexplored. In this work, we formulate a novel task of video-based epileptic seizure forecasting, where short pre-ictal video segments (3-10 seconds) are used to predict whether a seizure will occur within the subsequent 5 seconds. To address the scarcity of annotated human epilepsy videos, we propose a cross-species transfer learning framework that leverages large-scale rodent video data for auxiliary pretraining. This enables the model to capture seizure-related behavioral dynamics that generalize across species. Experimental results demonstrate that our approach achieves over 70% prediction accuracy under a strictly video-only setting and outperforms existing baselines. These findings highlight the potential of cross-species learning for building non-invasive, scalable early-warning systems for epilepsy.
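
The task definition above (a short observation window, then a 5-second horizon) can be turned into training labels as follows. This is only a sketch of the labeling rule: the function name, the stride, and the handling of windows that already contain a seizure onset are assumptions, since the abstract does not describe the sampling procedure.

```python
def make_forecast_windows(duration_s, onsets_s, window_s=5.0, horizon_s=5.0, stride_s=1.0):
    """Slide an observation window over a recording and label each position.

    A window [start, end) gets label 1 if some seizure onset t satisfies
    end < t <= end + horizon_s, else 0. Simplification: windows that already
    contain an onset are kept with label 0; a real pipeline would likely
    discard them, because they are no longer pre-ictal.
    """
    samples = []
    start = 0.0
    while start + window_s + horizon_s <= duration_s:
        end = start + window_s
        label = int(any(end < t <= end + horizon_s for t in onsets_s))
        samples.append((start, end, label))
        start += stride_s
    return samples

# A 20-second recording with one annotated onset at t = 12 s.
samples = make_forecast_windows(duration_s=20.0, onsets_s=[12.0])
positives = [s for s in samples if s[2] == 1]
```

In this toy recording, only the windows ending between 7 s and 11 s are positive, because the onset at 12 s falls within the 5 seconds that follow them.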

[122] Finite Difference Flow Optimization for RL Post-Training of Text-to-Image Models

David McAllister,Miika Aittala,Tero Karras,Janne Hellsten,Angjoo Kanazawa,Timo Aila,Samuli Laine

Main category: cs.CV

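TL;DR: This paper proposes a new online reinforcement learning method for post-training diffusion models that lowers update variance by sampling paired trajectories and treating the whole sampling process as a single action, improving image quality and prompt alignment.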

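Details Motivation: Existing RL methods for post-training diffusion models suffer from high update variance and treat each sampling step as an independent action, which makes policy learning inconsistent; a more stable and efficient method is needed to improve image quality and prompt alignment. Method: An online RL variant that samples paired trajectories, compares their rewards, and pulls the flow velocity toward the better image; it treats the entire diffusion sampling process as one action rather than one action per step, and uses high-quality VLMs and off-the-shelf quality metrics as reward signals. Result: The method converges faster and outperforms previous approaches across a range of evaluation metrics, with better output image quality and prompt alignment. Conclusion: Modeling diffusion sampling as a single action and training with paired trajectories in an online RL setup is a more stable, efficient, and scalable paradigm for post-training diffusion models. Abstract: Reinforcement learning (RL) has become a standard technique for post-training diffusion-based image synthesis models, as it enables learning from reward signals to explicitly improve desirable aspects such as image quality and prompt alignment. In this paper, we propose an online RL variant that reduces the variance in the model updates by sampling paired trajectories and pulling the flow velocity in the direction of the more favorable image. Unlike existing methods that treat each sampling step as a separate policy action, we consider the entire sampling process as a single action. We experiment with both high-quality vision language models and off-the-shelf quality metrics for rewards, and evaluate the outputs using a broad set of metrics. Our method converges faster and yields higher output quality and prompt alignment than previous approaches.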

[123] Spectral-Geometric Neural Fields for Pose-Free LiDAR View Synthesis

Yinuo Jiang,Jun Cheng,Yiran Wang,Cheng Cheng

Main category: cs.CV

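TL;DR: This paper proposes SG-NLF, a LiDAR neural radiance field framework that needs no accurate camera poses. It integrates spectral information with geometric consistency and uses a hybrid representation, confidence-aware graph optimization, and adversarial learning to improve reconstruction quality and pose accuracy.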

Details Motivation: Existing LiDAR novel view synthesis methods depend heavily on accurate camera poses, and LiDAR data are sparse and textureless, which causes geometric holes and discontinuous surfaces. Method: The SG-NLF framework: 1) a hybrid representation based on spectral priors to reconstruct smooth geometry; 2) a confidence-aware graph built from feature compatibility for global pose alignment; 3) adversarial learning to strengthen cross-frame consistency. Result: SG-NLF clearly outperforms state-of-the-art methods, especially in challenging low-frequency scenarios, improving reconstruction quality and pose accuracy by over 35.8% and 68.8%, respectively. Conclusion: SG-NLF offers an effective and robust approach to LiDAR novel view synthesis without accurate poses. Abstract: Neural Radiance Fields (NeRF) have shown remarkable success in image novel view synthesis (NVS), inspiring extensions to LiDAR NVS. However, most methods heavily rely on accurate camera poses for scene reconstruction. The sparsity and textureless nature of LiDAR data also present distinct challenges, leading to geometric holes and discontinuous surfaces. To address these issues, we propose SG-NLF, a pose-free LiDAR NeRF framework that integrates spectral information with geometric consistency. Specifically, we design a hybrid representation based on spectral priors to reconstruct smooth geometry. For pose optimization, we construct a confidence-aware graph based on feature compatibility to achieve global alignment. In addition, an adversarial learning strategy is introduced to enforce cross-frame consistency, thereby enhancing reconstruction quality. Comprehensive experiments demonstrate the effectiveness of our framework, especially in challenging low-frequency scenarios. Compared to previous state-of-the-art methods, SG-NLF improves reconstruction quality and pose accuracy by over 35.8% and 68.8%, respectively. Our work can provide a novel perspective for LiDAR view synthesis.

[124] FedBPrompt: Federated Domain Generalization Person Re-Identification via Body Distribution Aware Visual Prompts

Xin Xu,Weilong Li,Wei Liu,Wenke Huang,Zhixi Yu,Bin Yang,Xiaoying Liao,Kui Jiang

Main category: cs.CV

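TL;DR: This paper proposes FedBPrompt, a visual prompting method for federated domain generalization person re-identification (FedDG-ReID). Learnable, body-distribution-aware visual prompts (BAPM) steer ViT attention toward pedestrian regions, and a prompt-based fine-tuning strategy (PFTS) cuts communication overhead, improving both cross-domain generalization and training efficiency.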

Details Motivation: In FedDG-ReID, the global attention of ViT struggles to distinguish pedestrians against similar backgrounds or under different viewpoints, and cross-client distribution shift makes the problem worse. Method: The FedBPrompt framework comprises: 1) a Body Distribution Aware Visual Prompts Mechanism (BAPM), with Holistic Full Body Prompts (to suppress background noise) and Body Part Alignment Prompts (to improve robustness to pose and viewpoint); 2) a Prompt-based Fine-Tuning Strategy (PFTS) that freezes the ViT backbone and updates only lightweight prompt parameters to lower communication cost. Result: Experiments show that BAPM improves feature discrimination and cross-domain generalization, and that PFTS yields notable gains within very few aggregation rounds; both plug into existing ViT-based FedDG-ReID frameworks. Conclusion: FedBPrompt is a flexible, efficient, and easily deployed solution that balances modeling accuracy, communication efficiency, and generalization, advancing practical person re-identification under federated learning. Abstract: Federated Domain Generalization for Person Re-Identification (FedDG-ReID) learns domain-invariant representations from decentralized data. While Vision Transformer (ViT) is widely adopted, its global attention often fails to distinguish pedestrians from highly similar backgrounds or under diverse viewpoints, a challenge amplified by cross-client distribution shifts in FedDG-ReID. To address this, we propose Federated Body Distribution Aware Visual Prompt (FedBPrompt), introducing learnable visual prompts to guide Transformer attention toward pedestrian-centric regions. FedBPrompt employs a Body Distribution Aware Visual Prompts Mechanism (BAPM) comprising: Holistic Full Body Prompts to suppress cross-client background noise, and Body Part Alignment Prompts to capture fine-grained details robust to pose and viewpoint variations. To mitigate high communication costs, we design a Prompt-based Fine-Tuning Strategy (PFTS) that freezes the ViT backbone and updates only lightweight prompts, significantly reducing communication overhead while maintaining adaptability. Extensive experiments demonstrate that BAPM effectively enhances feature discrimination and cross-domain generalization, while PFTS achieves notable performance gains within only a few aggregation rounds. Moreover, both BAPM and PFTS can be easily integrated into existing ViT-based FedDG-ReID frameworks, making FedBPrompt a flexible and effective solution for federated person re-identification. The code is available at https://github.com/leavlong/FedBPrompt.
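
PFTS saves communication because clients exchange only the prompt parameters while the ViT backbone stays frozen and local. The sketch below shows that idea with a size-weighted average of client prompt vectors. The aggregation rule is an assumption (the abstract does not state one, and weighted averaging in the style of FedAvg is used here purely for illustration), and `aggregate_prompts` is a hypothetical name.

```python
def aggregate_prompts(client_prompts, client_sizes):
    """Weighted average of flattened prompt vectors, one per client.

    Assumption: FedAvg-style weighting by local dataset size. Backbone weights
    are never sent, so the payload per round is only len(prompt) floats.
    """
    if len(client_prompts) != len(client_sizes) or not client_prompts:
        raise ValueError("need one dataset size per client prompt")
    total = sum(client_sizes)
    dim = len(client_prompts[0])
    return [
        sum(p[i] * n for p, n in zip(client_prompts, client_sizes)) / total
        for i in range(dim)
    ]

# Two clients with toy 2-parameter prompts and 1 vs. 3 local samples.
global_prompt = aggregate_prompts([[1.0, 2.0], [3.0, 6.0]], [1, 3])
```

With a frozen backbone of many millions of parameters and prompts of a few thousand, the per-round upload shrinks by several orders of magnitude, which is the communication saving the abstract refers to.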

[125] Stake the Points: Structure-Faithful Instance Unlearning

Kiseong Hong,JungKyoo Shin,Eunwoo Kim

Main category: cs.CV

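TL;DR: This paper proposes a structure-faithful machine unlearning framework that introduces semantic anchors (stakes) to maintain the semantic structure of retained knowledge, so that the model keeps more of its performance while the designated data are removed.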

Details Motivation: Existing machine unlearning methods neglect the semantic relations among retained data, which leads to structural collapse and upsets the balance between deletion and retention. Method: A structure-faithful framework based on semantic anchors (language-driven attribute descriptions encoded by a semantic encoder such as CLIP), using structure-aware alignment and regularization of structure-critical parameters to stabilize the organization of knowledge. Result: On image classification, retrieval, and face recognition, average performance improves by 32.9%, 22.5%, and 19.3%, respectively, with a better deletion-retention trade-off and better generalization. Conclusion: Maintaining knowledge structure is essential for machine unlearning; the proposed framework mitigates structural collapse, preserving utility and robustness while removing data for privacy. Abstract: Machine unlearning (MU) addresses privacy risks in pretrained models. The main goal of MU is to remove the influence of designated data while preserving the utility of retained knowledge. Achieving this goal requires preserving semantic relations among retained instances, which existing studies often overlook. We observe that without such preservation, models suffer from progressive structural collapse, undermining the deletion-retention balance. In this work, we propose a novel structure-faithful framework that introduces stakes, i.e., semantic anchors that serve as reference points to maintain the knowledge structure. By leveraging these anchors, our framework captures and stabilizes the semantic organization of knowledge. Specifically, we instantiate the anchors from language-driven attribute descriptions encoded by a semantic encoder (e.g., CLIP). We enforce preservation of the knowledge structure via structure-aware alignment and regularization: the former aligns the organization of retained knowledge before and after unlearning around anchors, while the latter regulates updates to structure-critical parameters. Results from image classification, retrieval, and face recognition show average gains of 32.9%, 22.5%, and 19.3% in performance, balancing the deletion-retention trade-off and enhancing generalization.

[126] VIRD: View-Invariant Representation through Dual-Axis Transformation for Cross-View Pose Estimation

Juhye Park,Wooju Lee,Dasol Hong,Changki Sung,Youngwoo Seo,Dongwan Kang,Hyun Myung

Main category: cs.CV

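TL;DR: This paper proposes VIRD, a cross-view pose estimation method that builds view-invariant representations through a dual-axis transformation, narrowing the viewpoint gap between ground and satellite images and substantially reducing localization error on the KITTI and VIGOR datasets.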

Details Motivation: GNSS degrades under occlusion and multipath effects, while existing cross-view localization methods struggle to estimate the camera pose accurately (3-DoF in this paper) because the ground and satellite viewpoints differ greatly and spatial correspondences are limited. Method: VIRD: 1) applies a polar transformation to the satellite image to establish horizontal correspondence; 2) applies context-enhanced positional attention to the ground features and the polar-transformed satellite features to resolve vertical misalignment; 3) adds a view-reconstruction loss to strengthen view invariance. Result: On KITTI and VIGOR, compared with state-of-the-art methods without orientation priors, median position error drops by 50.7% and 18.0%, and median orientation error by 76.5% and 46.8%, respectively. Conclusion: With its dual-axis transformation and reconstruction constraint, VIRD mitigates the large cross-view viewpoint gap and improves global localization accuracy without relying on GNSS, providing more robust visual localization for autonomous driving and robotics. Abstract: Accurate global localization is crucial for autonomous driving and robotics, but GNSS-based approaches often degrade due to occlusion and multipath effects. As an emerging alternative, cross-view pose estimation predicts the 3-DoF camera pose corresponding to a ground-view image with respect to a geo-referenced satellite image. However, existing methods struggle to bridge the significant viewpoint gap between the ground and satellite views mainly due to limited spatial correspondences. We propose a novel cross-view pose estimation method that constructs view-invariant representations through dual-axis transformation (VIRD). VIRD first applies a polar transformation to the satellite view to establish horizontal correspondence, then uses context-enhanced positional attention on the ground and polar-transformed satellite features to resolve vertical misalignment, explicitly mitigating the viewpoint gap. A view-reconstruction loss is introduced to strengthen the view invariance further, encouraging the derived representations to reconstruct the original and cross-view images. Experiments on the KITTI and VIGOR datasets demonstrate that VIRD outperforms the state-of-the-art methods without orientation priors, reducing median position and orientation errors by 50.7% and 76.5% on KITTI, and 18.0% and 46.8% on VIGOR, respectively.
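
The first step of VIRD, the polar transformation of the satellite view, resamples the image so that azimuth runs along the horizontal axis, which is how a ground-level panorama is laid out. The sketch below is a generic nearest-neighbor polar resampling and not the paper's implementation. The conventions (radius grows with the row index, azimuth 0 points north and increases clockwise) and the function name are assumptions.

```python
import math

def polar_transform(img, out_h, out_w):
    """Resample a square 2D image (list of rows) around its center.

    Output row i corresponds to a radius from 0 (center) to the half-width;
    output column j corresponds to azimuth 2*pi*j/out_w, measured clockwise
    from north (assumed convention). Uses nearest-neighbor sampling.
    """
    size = len(img)
    center = (size - 1) / 2.0
    out = []
    for i in range(out_h):
        radius = center * i / (out_h - 1) if out_h > 1 else 0.0
        row = []
        for j in range(out_w):
            theta = 2.0 * math.pi * j / out_w
            y = center - radius * math.cos(theta)  # north is up (smaller row index)
            x = center + radius * math.sin(theta)  # east is right
            yi = min(size - 1, max(0, int(round(y))))
            xi = min(size - 1, max(0, int(round(x))))
            row.append(img[yi][xi])
        out.append(row)
    return out

# A 3x3 toy "satellite image"; columns of the output are N, E, S, W.
satellite = [[1, 2, 3],
             [4, 5, 6],
             [7, 8, 9]]
polar = polar_transform(satellite, out_h=2, out_w=4)
```

After this resampling, a camera rotation about the vertical axis becomes a horizontal shift of the polar image, which is the horizontal correspondence the abstract mentions; the remaining vertical misalignment is what the attention step then addresses.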

[127] Rethinking VLMs for Image Forgery Detection and Localization

Shaofeng Guo,Jiequan Cui,Richang Hong

Main category: cs.CV

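TL;DR: This paper proposes IFDL-VLM, a method that uses vision-language models (VLMs) to improve image forgery detection and localization (IFDL); by introducing location masks to aid VLM training and make results more interpretable, it reaches state-of-the-art performance on 9 benchmarks.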

Details Motivation: The rapid growth of AIGC makes image manipulation easier and poses serious challenges for image forgery detection and localization (IFDL); because existing VLMs are biased toward semantic plausibility rather than authenticity, their priors do not directly help IFDL and can even hurt it. Method: The IFDL-VLM pipeline uses location masks of forged regions as extra priors to guide VLM training and mitigate this inherent bias, improving detection, localization, and interpretability. Result: Experiments on 9 widely used benchmarks show consistent state-of-the-art performance in detection, localization, and interpretability, in both in-domain and cross-dataset generalization settings. Conclusion: Location masks compensate for the weakness of VLM priors in IFDL; by combining spatial localization information with the semantic understanding of VLMs, IFDL-VLM offers a new paradigm for verifying image authenticity in the AIGC era. Abstract: With the rapid rise of Artificial Intelligence Generated Content (AIGC), image manipulation has become increasingly accessible, posing significant challenges for image forgery detection and localization (IFDL). In this paper, we study how to fully leverage vision-language models (VLMs) to assist the IFDL task. In particular, we observe that priors from VLMs hardly benefit the detection and localization performance and even have negative effects due to their inherent biases toward semantic plausibility rather than authenticity. Additionally, the location masks explicitly encode the forgery concepts, which can serve as extra priors for VLMs to ease their training optimization, thus enhancing the interpretability of detection and localization results. Building on these findings, we propose a new IFDL pipeline named IFDL-VLM. To demonstrate the effectiveness of our method, we conduct experiments on 9 popular benchmarks and assess the model performance under both in-domain and cross-dataset generalization settings. The experimental results show that we consistently achieve new state-of-the-art performance in detection, localization, and interpretability. Code is available at: https://github.com/sha0fengGuo/IFDL-VLM.

[128] SGMatch: Semantic-Guided Non-Rigid Shape Matching with Flow Regularization

Tianwei Ye,Xiaoguang Mei,Yifan Xia,Fan Fan,Jun Huang,Jiayi Ma

Main category: cs.CV

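TL;DR: This paper proposes SGMatch, a learning-based framework for semantic-guided non-rigid shape matching that uses a Semantic-Guided Local Cross-Attention module and a regularization objective based on conditional flow matching to improve point-to-point correspondence accuracy under non-isometric deformations and topological noise.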

Details Motivation: Under non-isometric deformations and topological noise, existing functional map methods suffer from ambiguities that geometric descriptors cannot resolve and from spatial inconsistencies caused by projecting truncated spectral bases. Method: The SGMatch framework comprises a Semantic-Guided Local Cross-Attention module (fusing semantic features from vision foundation models with geometric descriptors) and a regularization objective based on conditional flow matching (supervising a time-varying velocity field so that the correspondences are spatially smooth). Result: Experiments on multiple benchmarks show that SGMatch is competitive in near-isometric settings and consistently better under non-isometric deformations and topological noise. Conclusion: Jointly modeling semantic information and geometric structure, together with a smoothness constraint on the manifold, improves the robustness and accuracy of non-rigid shape matching. Abstract: Establishing accurate point-to-point correspondences between non-rigid 3D shapes remains a critical challenge, particularly under non-isometric deformations and topological noise. Existing functional map pipelines suffer from ambiguities that geometric descriptors alone cannot resolve, and from spatial inconsistencies inherent in the projection of truncated spectral bases to dense pointwise correspondences. In this paper, we introduce SGMatch, a learning-based framework for semantic-guided non-rigid shape matching. Specifically, we design a Semantic-Guided Local Cross-Attention module that integrates semantic features from vision foundation models into geometric descriptors while preserving local structural continuity. Furthermore, we introduce a regularization objective based on conditional flow matching, which supervises a time-varying velocity field to encourage spatial smoothness of the recovered correspondences. Experimental results on multiple benchmarks demonstrate that SGMatch achieves competitive performance across near-isometric settings and consistent improvements under non-isometric deformations and topological noise.

[129] Thinking in Streaming Video

Zikang Liu,Longteng Guo,Handong Li,Ru Zhen,Xingjian He,Ruyi Ji,Xiaoming Ren,Yanhao Zhang,Haonan Lu,Jing Liu

Main category: cs.CV

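TL;DR: This paper proposes ThinkStream, a framework that performs streaming video reasoning through a Watch--Think--Speak paradigm; combined with Reasoning-Compressed Streaming Memory (RCSM) and a Streaming Reinforcement Learning training scheme, it significantly improves streaming video understanding while keeping latency and memory usage low.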

Details Motivation: Most existing video reasoning methods follow a batch paradigm, which causes high latency and computational cost and makes them unsuitable for real-time streaming interaction. Method: Proposes the ThinkStream framework with a Watch--Think--Speak incremental reasoning paradigm; designs Reasoning-Compressed Streaming Memory (RCSM), which compresses and updates intermediate reasoning traces to replace outdated visual tokens; trains with Streaming Reinforcement Learning with Verifiable Rewards to optimize response timing and the reasoning process. Result: On multiple streaming video benchmarks, ThinkStream significantly outperforms existing online video models while maintaining low latency and low memory usage. Conclusion: ThinkStream offers an efficient and scalable streaming reasoning paradigm for real-time video understanding in dynamic environments, balancing performance, latency, and resource efficiency. Abstract: Real-time understanding of continuous video streams is essential for interactive assistants and multimodal agents operating in dynamic environments. However, most existing video reasoning approaches follow a batch paradigm that defers reasoning until the full video context is observed, resulting in high latency and growing computational cost that are incompatible with streaming scenarios. In this paper, we introduce ThinkStream, a framework for streaming video reasoning based on a Watch--Think--Speak paradigm that enables models to incrementally update their understanding as new video observations arrive. At each step, the model performs a short reasoning update and decides whether sufficient evidence has accumulated to produce a response. To support long-horizon streaming, we propose Reasoning-Compressed Streaming Memory (RCSM), which treats intermediate reasoning traces as compact semantic memory that replaces outdated visual tokens while preserving essential context. We further train the model using a Streaming Reinforcement Learning with Verifiable Rewards scheme that aligns incremental reasoning and response timing with the requirements of streaming interaction. Experiments on multiple streaming video benchmarks show that ThinkStream significantly outperforms existing online video models while maintaining low latency and memory usage. Code, models and data will be released at https://github.com/johncaged/ThinkStream
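
The idea of letting reasoning traces replace outdated visual tokens can be illustrated with a toy memory buffer. This is a hedged sketch only: the class name, the fixed chunk budget, and the oldest-first eviction policy are assumptions made for illustration, and the abstract does not describe how RCSM actually selects or compresses tokens.

```python
from collections import deque


class StreamingMemory:
    """Toy buffer: keeps recent visual chunks, and only notes for older ones."""

    def __init__(self, max_visual_chunks):
        self.max_visual_chunks = max_visual_chunks  # assumed fixed budget
        self.visual = deque()  # (step, visual_tokens, reasoning_note)
        self.notes = []        # (step, reasoning_note) for evicted chunks

    def update(self, step, visual_tokens, reasoning_note):
        """Add a new observation; evict the oldest visual chunks over budget."""
        self.visual.append((step, visual_tokens, reasoning_note))
        while len(self.visual) > self.max_visual_chunks:
            old_step, _, old_note = self.visual.popleft()
            # The visual tokens are dropped; the compact note is retained.
            self.notes.append((old_step, old_note))

    def context(self):
        """Return what a model would see: old notes plus recent visual tokens."""
        return {
            "notes": list(self.notes),
            "visual": [(s, v) for s, v, _ in self.visual],
        }
```

With a budget of two chunks, the context after four steps holds the notes for steps 0 and 1 and the visual tokens for steps 2 and 3, so its size grows with short notes rather than with raw visual tokens.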

[130] Fair Lung Disease Diagnosis from Chest CT via Gender-Adversarial Attention Multiple Instance Learning

Aditya Parikh,Aasa Feragen

Main category: cs.CV

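TL;DR: This paper proposes a fairness-aware framework for multi-class lung disease diagnosis that combines an attention-based multiple instance learning (MIL) model with a ConvNeXt backbone and a Gradient Reversal Layer (GRL) to perform gender-fair four-class diagnosis (Healthy, COVID-19, Adenocarcinoma, Squamous Cell Carcinoma) on chest CT volumes, achieving strong results in the Fair Disease Diagnosis Challenge.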

Details Motivation: Address the fairness challenges in multi-class chest CT diagnosis that arise from sparse pathological signal and from demographic imbalance compounded across disease class and gender, and meet the challenge's explicit requirement of gender-balanced predictions. Method: An attention-based multiple instance learning (MIL) model on a ConvNeXt backbone, with a Gradient Reversal Layer (GRL) that adversarially removes gender bias from the representation; training uses focal loss with label smoothing, cross-validation stratified over joint (class, gender) strata, and oversampling of the scarcest subgroup; inference uses a five-fold ensemble, horizontal-flip TTA, and out-of-fold threshold optimization. Result: The mean validation competition score is 0.685 (std 0.030), with the best single fold reaching 0.759; the code is fully open source. Conclusion: The framework mitigates gender unfairness caused by data bias in medical AI, improving the balance of F1 across gender groups while maintaining diagnostic accuracy, and provides a reproducible end-to-end solution for fairness-sensitive medical image analysis. Abstract: We present a fairness-aware framework for multi-class lung disease diagnosis from chest CT volumes, developed for the Fair Disease Diagnosis Challenge at the PHAROS-AIF-MIH Workshop (CVPR 2026). The challenge requires classifying CT scans into four categories -- Healthy, COVID-19, Adenocarcinoma, and Squamous Cell Carcinoma -- with performance measured as the average of per-gender macro F1 scores, explicitly penalizing gender-inequitable predictions. Our approach addresses two core difficulties: the sparse pathological signal across hundreds of slices, and a severe demographic imbalance compounded across disease class and gender. We propose an attention-based Multiple Instance Learning (MIL) model on a ConvNeXt backbone that learns to identify diagnostically relevant slices without slice-level supervision, augmented with a Gradient Reversal Layer (GRL) that adversarially suppresses gender-predictive structure in the learned scan representation. Training incorporates focal loss with label smoothing, stratified cross-validation over joint (class, gender) strata, and targeted oversampling of the most underrepresented subgroup. At inference, all five-fold checkpoints are ensembled with horizontal-flip test-time augmentation via soft logit voting and out-of-fold threshold optimization for robustness. Our model achieves a mean validation competition score of 0.685 (std 0.030), with the best single fold reaching 0.759. All training and inference code is publicly available at https://github.com/ADE-17/cvpr-fair-chest-ct
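
The competition metric described above, the average of per-gender macro F1 scores, can be sketched as follows. This is an illustrative reading of the abstract rather than the official scoring script; in particular, scoring a class with no true or predicted samples in a group as 0.0 is an assumption.

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # Assumption: a class absent from both labels and predictions scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def gender_averaged_macro_f1(y_true, y_pred, genders, classes):
    """Compute macro F1 separately per gender group, then average the groups."""
    per_group = []
    for g in sorted(set(genders)):
        idx = [i for i, x in enumerate(genders) if x == g]
        per_group.append(
            macro_f1([y_true[i] for i in idx], [y_pred[i] for i in idx], classes)
        )
    return sum(per_group) / len(per_group)
```

Because each group contributes equally regardless of its size, a model that performs well only on the majority gender is penalized, which is the behavior the challenge description calls for.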

[131] Test-Time Attention Purification for Backdoored Large Vision Language Models

Zhifang Zhang,Bojun Yang,Shuo He,Weitong Chen,Wei Emma Zhang,Olaf Maennel,Lei Feng,Miao Xu

Main category: cs.CV

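TL;DR: This paper proposes CleanSight, a training-free, plug-and-play test-time defense that counters backdoor attacks on large vision-language models (LVLMs) by detecting and pruning visual tokens with abnormally high attention; its core mechanism is identifying the "attention stealing" phenomenon.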

Details Motivation: Existing defenses against backdoor attacks on LVLMs rely on retraining, which is computationally expensive and tends to degrade model performance; the authors aim to understand backdoor behavior at the mechanistic level in order to design an efficient, lightweight defense that does not sacrifice performance. Method: Building on the discovered "attention stealing" mechanism (the trigger disrupts prediction through abnormal cross-modal attention redistribution), CleanSight operates at test time: (i) it detects poisoned inputs using the visual-text attention ratio in selected cross-modal fusion layers; (ii) it selectively prunes suspicious high-attention visual tokens to purify the input. Result: CleanSight significantly outperforms existing pixel-based purification defenses across diverse datasets and backdoor attack types, while preserving the model's original performance on both clean and poisoned samples. Conclusion: CleanSight validates the attention-mechanism view of LVLM backdoor behavior and offers a new paradigm for training-free, efficient, and robust backdoor defense. Abstract: Despite their strong multimodal performance, large vision-language models (LVLMs) are vulnerable during fine-tuning to backdoor attacks, where adversaries insert trigger-embedded samples into the training data to implant behaviors that can be maliciously activated at test time. Existing defenses typically rely on retraining backdoored parameters (e.g., adapters or LoRA modules) with clean data, which is computationally expensive and often degrades model performance. In this work, we provide a new mechanistic understanding of backdoor behaviors in LVLMs: the trigger does not influence prediction through low-level visual patterns, but through abnormal cross-modal attention redistribution, where trigger-bearing visual tokens steal attention away from the textual context - a phenomenon we term attention stealing. Motivated by this, we propose CleanSight, a training-free, plug-and-play defense that operates purely at test time. CleanSight (i) detects poisoned inputs based on the relative visual-text attention ratio in selected cross-modal fusion layers, and (ii) purifies the input by selectively pruning the suspicious high-attention visual tokens to neutralize the backdoor activation. Extensive experiments show that CleanSight significantly outperforms existing pixel-based purification defenses across diverse datasets and backdoor attack types, while preserving the model's utility on both clean and poisoned samples.
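
The two steps (i) and (ii) can be sketched on a single attention vector. This is a hedged toy version: the threshold `ratio_threshold`, the pruning budget `top_k`, and the use of one flat attention vector instead of selected fusion layers are all assumptions for illustration and are not taken from the paper.

```python
def visual_text_ratio(attn, is_visual):
    """Ratio of attention mass on visual tokens to mass on text tokens."""
    visual = sum(a for a, v in zip(attn, is_visual) if v)
    text = sum(a for a, v in zip(attn, is_visual) if not v)
    return visual / text if text > 0 else float("inf")


def prune_suspicious_tokens(attn, is_visual, ratio_threshold, top_k):
    """Return the indices of tokens to keep.

    Step (i): flag the input if the visual-text ratio exceeds the threshold.
    Step (ii): if flagged, drop the top_k visual tokens with highest attention.
    Both hyperparameters are hypothetical.
    """
    if visual_text_ratio(attn, is_visual) <= ratio_threshold:
        return list(range(len(attn)))  # treated as clean: keep every token
    visual_idx = [i for i, v in enumerate(is_visual) if v]
    ranked = sorted(visual_idx, key=lambda i: attn[i], reverse=True)
    suspicious = set(ranked[:top_k])
    return [i for i in range(len(attn)) if i not in suspicious]
```

Text tokens are never pruned in this sketch, which mirrors the abstract's description of removing only suspicious visual tokens.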

[132] A Closed-Form Solution for Debiasing Vision-Language Models with Utility Guarantees Across Modalities and Tasks

Tangzheng Lian,Guanyu Hu,Yijing Ren,Dimitrios Kollias,Oya Celiktutan

Main category: cs.CV

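TL;DR: This paper proposes a training-free debiasing method for vision-language models (VLMs) that requires no annotated data; it obtains a closed-form solution in the cross-modal space, achieving Pareto-optimal fairness with controlled utility loss.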

Details Motivation: Most existing VLM debiasing methods lack theoretical guarantees that model utility is preserved, and they struggle to cover both multiple modalities and intersectional fairness. Method: Proposes a training-free, annotation-free closed-form debiasing method that operates in the cross-modal embedding space, jointly handles the visual and textual modalities, and applies to a range of downstream tasks. Result: On zero-shot image classification, text-to-image retrieval, and text-to-image generation, the method outperforms existing methods across multiple metrics and datasets for both group and intersectional fairness, while preserving task performance. Conclusion: At the cost of a bounded utility loss, the method achieves Pareto-optimal cross-modal fairness, combining practicality with theoretical rigor. Abstract: While Vision-Language Models (VLMs) have achieved remarkable performance across diverse downstream tasks, recent studies have shown that they can inherit social biases from the training data and further propagate them into downstream applications. To address this issue, various debiasing approaches have been proposed, yet most of them aim to improve fairness without having a theoretical guarantee that the utility of the model is preserved. In this paper, we introduce a debiasing method that yields a closed-form solution in the cross-modal space, achieving Pareto-optimal fairness with bounded utility losses. Our method is training-free, requires no annotated data, and can jointly debias both visual and textual modalities across downstream tasks. Extensive experiments show that our method outperforms existing methods in debiasing VLMs across diverse fairness metrics and datasets for both group and intersectional fairness in downstream tasks such as zero-shot image classification, text-to-image retrieval, and text-to-image generation while preserving task performance.

[133] SAW: Toward a Surgical Action World Model via Controllable and Scalable Video Generation

Sampath Rapuri,Lalithkumar Seenivasan,Dominik Schneider,Roger Soberanis-Mukul,Yufan He,Hao Ding,Jiru Xu,Chenhao Yu,Chenyan Jing,Pengfei Guo,Daguang Xu,Mathias Unberath

Main category: cs.CV

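TL;DR: This paper proposes Surgical Action World (SAW), a video diffusion model driven by lightweight conditioning signals (language prompts, a reference scene, a tissue affordance mask, and 2D tool-tip trajectories) that generates high-fidelity, temporally consistent surgical action videos without depth information or complex annotations at inference; it reaches SOTA temporal consistency with strong visual quality and demonstrates practical value for rare-action data augmentation and surgical simulation.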

Details Motivation: Address fundamental challenges in surgical AI and simulation, including data scarcity, rare event synthesis, and the sim-to-real gap; existing video generation methods rely on expensive annotations or complex intermediate representations and fall short on temporal consistency and realism. Method: Proposes the SAW model, a conditional video diffusion framework that reformulates video-to-video diffusion as trajectory-conditioned surgical action synthesis; it conditions on four lightweight signals (language prompts, a reference scene, a tissue affordance mask, and 2D tool-tip trajectories), is fine-tuned on 12,044 laparoscopic clips, and adds a depth consistency loss to enforce geometric plausibility (no depth map is needed at inference). Result: Alongside SOTA temporal consistency (CD-FVD: 199.19 vs. 546.82) and strong visual quality, SAW improves downstream performance: F1 scores for rare action recognition rise (clipping from 20.93% to 43.14%, cutting from 0.00% to 8.33%), and it supports generating realistic tool-tissue interaction videos from simulator-derived trajectories. Conclusion: SAW offers a feasible path toward scalable, high-fidelity, controllable surgical world models, combining methodological novelty with potential for clinical application. Abstract: A surgical world model capable of generating realistic surgical action videos with precise control over tool-tissue interactions can address fundamental challenges in surgical AI and simulation -- from data scarcity and rare event synthesis to bridging the sim-to-real gap for surgical automation. However, current video generation methods, the very core of such surgical world models, require expensive annotations or complex structured intermediates as conditioning signals at inference, limiting their scalability. Other approaches exhibit limited temporal consistency across complex laparoscopic scenes and do not possess sufficient realism. We propose Surgical Action World (SAW) -- a step toward surgical action world modeling through video diffusion conditioned on four lightweight signals: language prompts encoding tool-action context, a reference surgical scene, tissue affordance mask, and 2D tool-tip trajectories. We design a conditional video diffusion approach that reformulates video-to-video diffusion into trajectory-conditioned surgical action synthesis. The backbone diffusion model is fine-tuned on a custom-curated dataset of 12,044 laparoscopic clips with lightweight spatiotemporal conditioning signals, leveraging a depth consistency loss to enforce geometric plausibility without requiring depth at inference. SAW achieves state-of-the-art temporal consistency (CD-FVD: 199.19 vs. 546.82) and strong visual quality on held-out test data.
Furthermore, we demonstrate its downstream utility for (a) surgical AI, where augmenting rare actions with SAW-generated videos improves action recognition (clipping F1-score: 20.93% to 43.14%; cutting: 0.00% to 8.33%) on real test data, and (b) surgical simulation, where rendering tool-tissue interaction videos from simulator-derived trajectory points toward a visually faithful simulation engine.

[134] SortScrews: A Dataset and Baseline for Real-time Screw Classification

Tianhao Fu,Bingxuan Yang,Juncheng Guo,Shrena Sribalan,Yucheng Chen

Main category: cs.CV

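TL;DR: This paper introduces the SortScrews dataset for visual classification of screw types, together with a reusable data collection script and baseline results using EfficientNet-B0 and ResNet-18.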

Details Motivation: Publicly available screw classification datasets are scarce, especially for the controlled single-object scenarios found in automated sorting systems. Method: Builds the SortScrews dataset of 560 RGB images (six screw types plus a background class) with a standardized acquisition setup; provides a reusable data collection script; applies transfer learning with ImageNet-pretrained EfficientNet-B0 and ResNet-18 and conducts a failure analysis. Result: Despite the small dataset size, the lightweight models achieve high classification accuracy, showing that effective learning from few samples is feasible under controlled acquisition conditions. Conclusion: SortScrews fills a gap in fine-grained screw classification datasets, and the accompanying toolchain supports rapid construction of industrial component datasets and reproducible research. Abstract: Automatic identification of screw types is important for industrial automation, robotics, and inventory management. However, publicly available datasets for screw classification are scarce, particularly for controlled single-object scenarios commonly encountered in automated sorting systems. In this work, we introduce SortScrews, a dataset for casewise visual classification of screws. The dataset contains 560 RGB images at 512×512 resolution covering six screw types and a background class. Images are captured using a standardized acquisition setup and include mild variations in lighting and camera perspective across four capture settings. To facilitate reproducible research and dataset expansion, we also provide a reusable data collection script that allows users to easily construct similar datasets for custom hardware components using inexpensive camera setups. We establish baseline results using transfer learning with EfficientNet-B0 and ResNet-18 classifiers pretrained on ImageNet. In addition, we conduct a well-explored failure analysis. Despite the limited dataset size, these lightweight models achieve strong classification accuracy, demonstrating that controlled acquisition conditions enable effective learning even with relatively small datasets. The dataset, collection pipeline, and baseline training code are publicly available at https://github.com/ATATC/SortScrews.

[135] Multimodal OCR: Parse Anything from Documents

Handong Zheng,Yumeng Li,Kaile Zhang,Liang Xin,Guangwei Zhao,Hao Liu,Jiayu Chen,Jie Lou,Jiyu Qiu,Qi Fu,Rui Yang,Shuo Jiang,Weijian Luo,Weijie Su,Weijun Zhang,Xingyu Zhu,Yabin Li,Yiwei ma,Yu Chen,Zhaohui Yu,Guang Yang,Colin Zhang,Lei Zhang,Yuliang Liu,Xiang Bai

Main category: cs.CV

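TL;DR: This paper proposes the Multimodal OCR (MOCR) paradigm, which improves document understanding by jointly parsing text and graphics into structured textual representations; its model, dots.mocr, reaches SOTA on both document parsing and structured graphics reconstruction and supports end-to-end multimodal joint training.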

Details Motivation: Conventional OCR recognizes only text and discards graphical regions, so it cannot model the semantic relations between text and graphics; a new paradigm is needed that parses both jointly while preserving their structure and semantic links. Method: Proposes the MOCR paradigm and its implementation dots.mocr, which treats visual elements such as charts, diagrams, icons, and tables as first-class parsing targets; builds a large-scale multi-source data engine covering PDFs, rendered webpages, and SVG assets; trains a compact 3B-parameter model through staged pretraining and supervised fine-tuning. Result: It ranks second only to Gemini 3 Pro on the OCR Arena Elo leaderboard and sets a new SOTA of 83.9 on olmOCR Bench; on image-to-SVG tasks, its reconstruction quality for charts, UI layouts, scientific figures, and chemical diagrams exceeds that of Gemini 3 Pro. Conclusion: MOCR provides a scalable path to large-scale joint parsing of text and graphics, turning graphics into code-level supervision and supporting the construction of multimodal pretraining corpora; the code and models are open source. Abstract: We present Multimodal OCR (MOCR), a document parsing paradigm that jointly parses text and graphics into unified textual representations. Unlike conventional OCR systems that focus on text recognition and leave graphical regions as cropped pixels, our method, termed dots.mocr, treats visual elements such as charts, diagrams, tables, and icons as first-class parsing targets, enabling systems to parse documents while preserving semantic relationships across elements. It offers several advantages: (1) it reconstructs both text and graphics as structured outputs, enabling more faithful document reconstruction; (2) it supports end-to-end training over heterogeneous document elements, allowing models to exploit semantic relations between textual and visual components; and (3) it converts previously discarded graphics into reusable code-level supervision, unlocking multimodal supervision embedded in existing documents. To make this paradigm practical at scale, we build a comprehensive data engine from PDFs, rendered webpages, and native SVG assets, and train a compact 3B-parameter model through staged pretraining and supervised fine-tuning. We evaluate dots.mocr from two perspectives: document parsing and structured graphics parsing. On document parsing benchmarks, it ranks second only to Gemini 3 Pro on our OCR Arena Elo leaderboard, surpasses existing open-source document parsing systems, and sets a new state of the art of 83.9 on olmOCR Bench.
On structured graphics parsing, dots.mocr achieves higher reconstruction quality than Gemini 3 Pro across image-to-SVG benchmarks, demonstrating strong performance on charts, UI layouts, scientific figures, and chemical diagrams. These results show a scalable path toward building large-scale image-to-code corpora for multimodal pretraining. Code and models are publicly available at https://github.com/rednote-hilab/dots.mocr.

[136] ESPIRE: A Diagnostic Benchmark for Embodied Spatial Reasoning of Vision-Language Models

Yanpeng Zhao,Wentao Ding,Hongtao Li,Baoxiong Jia,Zilong Zheng

Main category: cs.CV

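TL;DR: This paper proposes ESPIRE, a diagnostic benchmark for embodied spatial reasoning that evaluates the spatial cognition and action capabilities of vision-language models (VLMs) more realistically and at finer granularity, by simulating a physical environment and decomposing each task into two generative subtasks, localization and execution.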

Details Motivation: Existing evaluations of VLM spatial cognition are limited in paradigm and coverage, which hinders rapid model iteration; a diagnostic benchmark is needed that is closer to real robot deployment and supports fine-grained analysis. Method: Builds the ESPIRE benchmark: 1) it provides a physically simulated embodied environment; 2) it decouples robotic tasks into two generative subtasks, localization and execution, rather than relying on discriminative evaluation with distractors (e.g., VQA); 3) it is systematically designed at both the instruction level and the environment level for broad coverage of spatial reasoning scenarios. Result: A systematic diagnosis of frontier VLMs with ESPIRE reveals their differences and limitations in reasoning-to-act capability and supports fine-grained behavioral analysis. Conclusion: ESPIRE fills a gap in the evaluation of embodied spatial reasoning, helps move VLMs toward real robotic tasks, and provides interpretable, actionable diagnostics for model improvement. Abstract: A recent trend in vision-language models (VLMs) has been to enhance their spatial cognition for embodied domains. Despite progress, existing evaluations have been limited both in paradigm and in coverage, hindering rapid, iterative model development. To address these limitations, we propose ESPIRE, a diagnostic benchmark for embodied spatial reasoning. ESPIRE offers a simulated world that physically grounds VLMs and evaluates them on spatial-reasoning-centric robotic tasks, thus narrowing the gap between evaluation and real-world deployment. To adapt VLMs to robotic tasks, we decompose each task into localization and execution, and frame both as generative problems, in stark contrast to predominant discriminative evaluations (e.g., via visual-question answering) that rely on distractors and discard execution. This decomposition further enables a fine-grained analysis beyond passive spatial reasoning toward reasoning to act. We systematically design ESPIRE both at the instruction level and at the environment level, ensuring broad coverage of spatial reasoning scenarios. We use ESPIRE to diagnose a range of frontier VLMs and provide in-depth analysis of their spatial reasoning behaviors.

[137] Are General-Purpose Vision Models All We Need for 2D Medical Image Segmentation? A Cross-Dataset Empirical Study

Vanessa Borst,Samuel Kounev

Main category: cs.CV

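TL;DR: Through controlled experiments, this paper compares specialized medical segmentation architectures (SMAs) with general-purpose vision models (GP-VMs) on 2D medical image segmentation and finds that GP-VMs outperform most SMAs on the analyzed datasets while showing good explainability, suggesting they are a viable alternative for medical image segmentation.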

Details Motivation: General-purpose vision models (GP-VMs) perform well on natural image tasks, but their effectiveness for medical image segmentation (MIS) is unclear; whether specialized medical segmentation architectures (SMAs) still hold a systematic advantage also needs to be verified. Method: Using a unified training and evaluation protocol, compares eleven SMAs and GP-VMs on 2D segmentation across three heterogeneous medical imaging datasets, and adds an explainability (XAI) analysis with Grad-CAM. Result: On the analyzed datasets, GP-VMs achieve higher segmentation accuracy than most SMAs; Grad-CAM visualizations show that GP-VMs attend to clinically relevant anatomical structures without domain-specific architectural design. Conclusion: GP-VMs can serve as an effective alternative for medical image segmentation, which suggests that end-to-end MIS systems should be built on evidence-based model selection rather than defaulting to domain-specific architectures. Abstract: Medical image segmentation (MIS) is a fundamental component of computer-assisted diagnosis and clinical decision support systems. Over the past decade, numerous architectures specifically tailored to medical imaging have emerged to address domain-specific challenges such as low contrast, small anatomical structures, and limited annotated data. In parallel, rapid progress in computer vision has produced highly capable general-purpose vision models (GP-VMs) originally designed for natural images. Despite their strong performance on standard vision benchmarks, their effectiveness for MIS remains insufficiently understood. In this work, we conduct a controlled empirical study to examine whether specialized medical segmentation architectures (SMAs) provide systematic advantages over modern GP-VMs for 2D MIS. We compare eleven SMAs and GP-VMs using a unified training and evaluation protocol. Experiments are performed across three heterogeneous datasets covering different imaging modalities, class structures, and data characteristics. Beyond segmentation accuracy, we analyze qualitative Grad-CAM visualizations to investigate explainability (XAI) behavior. Our results demonstrate that, for the analyzed datasets, GP-VMs outperform the majority of specialized MIS models. Moreover, XAI analyses indicate that GP-VMs can capture clinically relevant structures without explicit domain-specific architectural design. These findings suggest that GP-VMs can represent a viable alternative to domain-specific methods, highlighting the importance of informed model selection for end-to-end MIS systems.
All code and resources are available at GitHub.

[138] Topo-R1: Detecting Topological Anomalies via Vision-Language Models

Meilong Xu,Qingqiao Hu,Xiaoling Hu,Shahira Abousamra,Xin Yu,Weimin Lyu,Kehan Qi,Dimitris Samaras,Chao Chen

Main category: cs.CV

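TL;DR: This paper proposes the Topo-R1 framework, which gives vision-language models topology-aware perception through two-stage training (supervised fine-tuning followed by reinforcement learning) to detect topological anomalies in tubular structures without annotations, and builds the first large-scale, multi-domain benchmark for the task.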

Details Motivation: Existing topology-preserving methods rely on domain-specific ground-truth annotations that are costly and rarely transfer across domains; detecting topological anomalies in a new domain without annotations is therefore a key problem. Method: Builds an automated data synthesis pipeline that produces a multi-level dataset of verifiable topological anomalies; proposes the Topo-R1 framework with two-stage training, supervised fine-tuning followed by GRPO-based reinforcement learning; designs a topology-aware composite reward that combines type-aware Hungarian matching, spatial localization scoring, and a centerline Dice (clDice) reward. Result: Topo-R1 consistently outperforms general-purpose VLMs and supervised baselines across all evaluation protocols, establishing a new paradigm for annotation-free topological quality assessment. Conclusion: Topo-R1 improves the fine-grained, topology-aware perception that VLMs need to find sparse connectivity errors, offering a reliable and generalizable way to assess the quality of tubular structure segmentation without ground-truth supervision. Abstract: Topological correctness is crucial for tubular structures such as blood vessels, nerve fibers, and road networks. Existing topology-preserving methods rely on domain-specific ground truth, which is costly and rarely transfers across domains. When deployed to a new domain without annotations, a key question arises: how can we detect topological anomalies without ground-truth supervision? We reframe this as topological anomaly detection, a structured visual reasoning task requiring a model to locate and classify topological errors in predicted segmentation masks. Vision-Language Models (VLMs) are natural candidates; however, we find that state-of-the-art VLMs perform nearly at random, lacking the fine-grained, topology-aware perception needed to identify sparse connectivity errors in dense structures. To bridge this gap, we develop an automated data-curation pipeline that synthesizes diverse topological anomalies with verifiable annotations across progressively difficult levels, thereby constructing the first large-scale, multi-domain benchmark for this task. We then introduce Topo-R1, a framework that endows VLMs with topology-aware perception via two-stage training: supervised fine-tuning followed by reinforcement learning with Group Relative Policy Optimization (GRPO). Central to our approach is a topology-aware composite reward that integrates type-aware Hungarian matching for structured error classification, spatial localization scoring, and a centerline Dice (clDice) reward that directly penalizes connectivity disruptions, thereby jointly incentivizing semantic precision and structural fidelity.
Extensive experiments demonstrate that Topo-R1 establishes a new paradigm for annotation-free topological quality assessment, consistently outperforming general-purpose VLMs and supervised baselines across all evaluation protocols.
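
To show why a centerline Dice term penalizes connectivity breaks, here is a minimal sketch of the standard clDice score on pixel sets. It assumes the skeletons have already been extracted by some skeletonization routine, and it is not the Topo-R1 reward itself, whose exact form and weighting are not given in the abstract.

```python
def cl_dice(pred_mask, true_mask, pred_skeleton, true_skeleton):
    """Centerline Dice between two masks given their precomputed skeletons.

    All arguments are sets of (row, col) pixel coordinates.
    Assumption: empty skeletons score 0.0.
    """
    if not pred_skeleton or not true_skeleton:
        return 0.0
    # Topology precision: share of the predicted skeleton inside the true mask.
    tprec = len(pred_skeleton & true_mask) / len(pred_skeleton)
    # Topology sensitivity: share of the true skeleton inside the predicted mask.
    tsens = len(true_skeleton & pred_mask) / len(true_skeleton)
    if tprec + tsens == 0:
        return 0.0
    return 2 * tprec * tsens / (tprec + tsens)
```

For a one-pixel-wide vessel of five pixels, a prediction that misses only the middle pixel has topology precision 1.0 but sensitivity 0.8, so the score falls to about 0.889 even though a single pixel is wrong; the missing true centerline pixel is what lowers it.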

[139] Team RAS in 10th ABAW Competition: Multimodal Valence and Arousal Estimation Approach

Elena Ryumina,Maxim Markitantov,Alexandr Axyonov,Dmitry Ryumin,Mikhail Dolgushin,Denis Dresvyanskiy,Alexey Karpov

Main category: cs.CV

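TL;DR: This paper proposes a multimodal method for continuous emotion recognition under in-the-wild conditions that fuses face, behavior, and audio modalities using an adaptive cross-modal mixture-of-experts and a reliability-aware audio-visual fusion strategy, reaching a CCC of 0.658 on the Aff-Wild2 dataset.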

Details Motivation: Under in-the-wild conditions, continuous emotion recognition (valence-arousal) remains challenging because of large variations in appearance, pose, illumination, occlusion, and individual expression. Method: Fuses three modalities: face (GRADA + Transformer), behavior (Qwen3-VL-4B-Instruct + Mamba), and audio (WavLM-Large + attention-statistics pooling + cross-modal filtering); proposes two fusion strategies, Directed Cross-Modal MoE fusion and Reliability-Aware Audio-Visual fusion. Result: Evaluated on the Aff-Wild2 development set under the ABAW challenge protocol, the method reaches a Concordance Correlation Coefficient (CCC) of 0.658. Conclusion: The proposed multimodal framework and fusion strategies improve the robustness and accuracy of continuous emotion recognition in the wild. Abstract: Continuous emotion recognition in terms of valence and arousal under in-the-wild (ITW) conditions remains a challenging problem due to large variations in appearance, head pose, illumination, occlusions, and subject-specific patterns of affective expression. We present a multimodal method for valence-arousal estimation ITW. Our method combines three complementary modalities: face, behavior, and audio. The face modality relies on GRADA-based frame-level embeddings and Transformer-based temporal regression. We use Qwen3-VL-4B-Instruct to extract behavior-relevant information from video segments, while Mamba is used to model temporal dynamics across segments. The audio modality relies on WavLM-Large with attention-statistics pooling and includes a cross-modal filtering stage to reduce the influence of unreliable or non-speech segments. To fuse modalities, we explore two fusion strategies: a Directed Cross-Modal Mixture-of-Experts Fusion Strategy that learns interactions between modalities with adaptive weighting, and a Reliability-Aware Audio-Visual Fusion Strategy that combines visual features at the frame-level while using audio as complementary context. The results are reported on the Aff-Wild2 dataset following the 10th Affective Behavior Analysis in-the-Wild (ABAW) challenge protocol. Experiments demonstrate that the proposed multimodal fusion strategy achieves a Concordance Correlation Coefficient (CCC) of 0.658 on the Aff-Wild2 development set.
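
For readers unfamiliar with the reported metric, the Concordance Correlation Coefficient can be computed as below. This is a minimal sketch of the standard definition using population (biased) variances; how the challenge aggregates valence and arousal scores is not covered here, and returning 0.0 for a zero denominator is an assumption.

```python
def concordance_cc(x, y):
    """Concordance Correlation Coefficient between two equal-length sequences.

    CCC = 2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y)) ** 2)
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    denom = vx + vy + (mx - my) ** 2
    return 2 * cov / denom if denom else 0.0  # assumption: 0.0 if undefined
```

Unlike Pearson correlation, CCC also penalizes a constant offset: predictions `[2, 3, 4]` against labels `[1, 2, 3]` are perfectly correlated but score only 4/7.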

[140] Reference-Free Image Quality Assessment for Virtual Try-On via Human Feedback

Yuki Hirakawa,Takashi Wada,Ryotaro Shimizu,Takuya Furusawa,Yuki Saito,Ryosuke Araki,Tianwei Chen,Fan Mo,Yoshimitsu Aoki

Main category: cs.CV

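TL;DR: This paper proposes VTON-IQA, a reference-free quality assessment framework for virtual try-on; by building the large-scale human-annotated dataset VTON-QBench and introducing an Interleaved Cross-Attention module, it delivers human-aligned quality assessment of individual generated images.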

Details Motivation: Existing VTON evaluation relies either on ground-truth images (impractical) or on distribution-level metrics (which do not reflect the perceptual quality of a single image); a reference-free, image-level, human-aligned quality assessment method is needed. Method: Builds the large-scale human-annotated benchmark VTON-QBench (62,688 images, 431,800 annotations); designs an Interleaved Cross-Attention module to jointly model garment fidelity and the preservation of person-specific details; proposes the end-to-end VTON-IQA assessment framework. Result: VTON-IQA achieves high human alignment in image-level quality prediction, and the framework is used for a comprehensive benchmark evaluation of 14 representative VTON models. Conclusion: VTON-IQA provides a reliable, interpretable, reference-free paradigm for image-level quality assessment in virtual try-on, helping move VTON systems toward practical use. Abstract: Given a person image and a garment image, image-based Virtual Try-ON (VTON) synthesizes a try-on image of the person wearing the target garment. As VTON systems become increasingly important in practical applications such as fashion e-commerce, reliable evaluation of their outputs has emerged as a critical challenge. In real-world scenarios, ground-truth images of the same person wearing the target garment are typically unavailable, making reference-based evaluation impractical. Moreover, widely used distribution-level metrics such as Fréchet Inception Distance and Kernel Inception Distance measure dataset-level similarity and fail to reflect the perceptual quality of individual generated images. To address these limitations, we propose Image Quality Assessment for Virtual Try-On (VTON-IQA), a reference-free framework for human-aligned, image-level quality assessment without requiring ground-truth images. To model human perceptual judgments, we construct VTON-QBench, a large-scale human-annotated benchmark comprising 62,688 try-on images generated by 14 representative VTON models and 431,800 quality annotations collected from 13,838 qualified annotators. To the best of our knowledge, this is the largest dataset to date for human subjective evaluation in virtual try-on. Evaluating virtual try-on quality requires verifying both garment fidelity and the preservation of person-specific details. To explicitly model such interactions, we introduce an Interleaved Cross-Attention module that extends standard transformer blocks by inserting a cross-attention layer between self-attention and MLP in the latter blocks.
Extensive experiments show that VTON-IQA achieves reliable human-aligned image-level quality prediction. Moreover, we conduct a comprehensive benchmark evaluation of 14 representative VTON models using VTON-IQA.

[141] Mitigating Memorization in Text-to-Image Diffusion via Region-Aware Prompt Augmentation and Multimodal Copy Detection

Yunzhuo Chen,Jordan Vice,Naveed Akhtar,Nur Al Hasan Haldar,Ajmal Mian

Main category: cs.CV

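TL;DR: This paper proposes two complementary methods, RAPTA (Region-Aware Prompt Augmentation) and ADMCD (Attention-Driven Multimodal Copy Detection), to address memorization and copying of training images in text-to-image diffusion models while balancing image quality with copyright and privacy protection.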

Details Motivation: Text-to-image diffusion models can memorize and reproduce training images, creating copyright and privacy risks; current inference-time prompt perturbations can reduce copying but often harm image-prompt alignment and overall fidelity. Method: 1) RAPTA: uses an object detector to locate salient regions, generates semantically consistent prompt variants, and samples them randomly during training to increase diversity; 2) ADMCD: fuses local patch, global semantic, and texture cues with a lightweight transformer into a single representation and detects copying with thresholded decision rules that need no training on large annotated datasets. Result: Experiments show that RAPTA reduces overfitting while maintaining high synthesis quality, and that ADMCD reliably detects copying and outperforms single-modal metrics without requiring large annotated training data. Conclusion: Together, RAPTA and ADMCD address the image copying risk in diffusion models, improving copyright compliance and privacy without sacrificing generation quality, and offer a new paradigm for safe, controllable text-to-image generation. Abstract: State-of-the-art text-to-image diffusion models can produce impressive visuals but may memorize and reproduce training images, creating copyright and privacy risks. Existing prompt perturbations applied at inference time, such as random token insertion or embedding noise, may lower copying but often harm image-prompt alignment and overall fidelity. To address this, we introduce two complementary methods. First, Region-Aware Prompt Augmentation (RAPTA) uses an object detector to find salient regions and turn them into semantically grounded prompt variants, which are randomly sampled during training to increase diversity, while maintaining semantic alignment. Second, Attention-Driven Multimodal Copy Detection (ADMCD) aggregates local patch, global semantic, and texture cues with a lightweight transformer to produce a fused representation, and applies simple thresholded decision rules to detect copying without training with large annotated datasets. Experiments show that RAPTA reduces overfitting while maintaining high synthesis quality, and that ADMCD reliably detects copying, outperforming single-modal metrics.

[142] Rooftop Wind Field Reconstruction Using Sparse Sensors: From Deterministic to Generative Learning Methods

Yihang Zhou,Chao Lin,Hideki Kikumoto,Ryozo Ooka,Sibo Cheng

Main category: cs.CV

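TL;DR: This paper proposes a learning-from-observation framework built on wind-tunnel experimental data: using PIV measurements, it compares Kriging interpolation with three deep learning models (UNet, ViTAE, CWGAN) for reconstructing rooftop wind-speed distributions under different sensor densities, training strategies (single vs. mixed wind direction), and optimized sensor layouts. Deep learning clearly outperforms the traditional interpolation method, and mixed wind-direction training and QR optimization further improve performance and robustness.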

Details Motivation: Rooftop wind fields show strong nonlinearity, flow separation, and cross-direction variability, which makes flow field reconstruction from sparse sensors difficult, while real-time wind-speed distribution is important for the safe operation of drones, urban air mobility, and wind control systems. Method: Builds a learning-from-observation framework on PIV wind-tunnel data; compares Kriging interpolation with three deep learning models (UNet, ViTAE, CWGAN); evaluates single wind-direction training (SDT) and mixed wind-direction training (MDT); tests sensor densities from 5 to 30; analyzes robustness under sensor position perturbations (±1 grid cell); optimizes sensor placement with POD combined with QR decomposition. Result: Compared with Kriging, the deep learning models improve SSIM by up to 32.7% and FAC2 by 24.2% and reduce NMSE by 27.8%; compared with SDT, MDT improves SSIM by up to 173.7%, FAC2 by 16.7%, and MG by 98.3%; QR optimization improves robustness under perturbations by up to 27.8%; training on measured rather than simulated data offers more practical guidance. Conclusion: Deep learning is suitable for reconstructing rooftop wind fields from sparse sensors; the training strategy (MDT), sensor configuration, and layout optimization (POD-QR) should be designed jointly; evaluating methods on real experimental data better supports engineering deployment. Abstract: Real-time rooftop wind-speed distribution is important for the safe operation of drones and urban air mobility systems, wind control systems, and rooftop utilization. However, rooftop flows show strong nonlinearity, separation, and cross-direction variability, which make flow field reconstruction from sparse sensors difficult. This study develops a learning-from-observation framework using wind-tunnel experimental data obtained by Particle Image Velocimetry (PIV) and compares Kriging interpolation with three deep learning models: UNet, Vision Transformer Autoencoder (ViTAE), and Conditional Wasserstein GAN (CWGAN). We evaluate two training strategies, single wind-direction training (SDT) and mixed wind-direction training (MDT), across sensor densities from 5 to 30, test robustness under sensor position perturbations of ±1 grid cell, and optimize sensor placement via Proper Orthogonal Decomposition with QR decomposition. Results show that deep learning methods can reconstruct rooftop wind fields from sparse sensor data effectively. Compared with Kriging interpolation, the deep learning models improved SSIM by up to 32.7%, FAC2 by 24.2%, and NMSE by 27.8%. Mixed wind-direction training further improved performance, with gains of up to 173.7% in SSIM, 16.7% in FAC2, and 98.3% in MG compared with single-direction training. The results also show that sensor configuration, optimization, and training strategy should be considered jointly for reliable deployment.
QR-based optimization improved robustness by up to 27.8% under sensor perturbations, although with metric-dependent trade-offs. Training on experimental rather than simulated data also provides practical guidance for method selection and sensor placement in different scenarios.
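
The abstract reports FAC2, NMSE, and MG without defining them. The sketch below uses the definitions that are common in wind and dispersion model validation; whether the paper uses exactly these forms is an assumption, and all inputs are assumed to be positive wind speeds.

```python
import math


def fac2(obs, pred):
    """Fraction of predictions within a factor of two of the observations."""
    hits = sum(1 for o, p in zip(obs, pred) if o > 0 and 0.5 <= p / o <= 2.0)
    return hits / len(obs)


def nmse(obs, pred):
    """Mean squared error normalized by the product of the two means."""
    n = len(obs)
    mean_obs, mean_pred = sum(obs) / n, sum(pred) / n
    mse = sum((o - p) ** 2 for o, p in zip(obs, pred)) / n
    return mse / (mean_obs * mean_pred)


def mg(obs, pred):
    """Geometric mean bias: exp(mean(ln obs) - mean(ln pred))."""
    n = len(obs)
    log_obs = sum(math.log(o) for o in obs) / n
    log_pred = sum(math.log(p) for p in pred) / n
    return math.exp(log_obs - log_pred)
```

Under these definitions a perfect reconstruction gives FAC2 = 1, NMSE = 0, and MG = 1, so an "improvement" in NMSE means a decrease while an improvement in MG means moving closer to 1.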

[143] InterEdit: Navigating Text-Guided Multi-Human 3D Motion Editing

Yebin Yang,Di Wen,Lei Qi,Weitong Kong,Junwei Zheng,Ruiping Liu,Yufan Chen,Chengzhi Wu,Kailun Yang,Yuqian Fu,Danda Pani Paudel,Luc Van Gool,Kunyu Peng

Main category: cs.CV

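TL;DR: This paper introduces the new task of multi-person 3D motion editing and builds InterEdit3D, the first manually annotated two-person motion editing dataset, along with the TMME benchmark; it proposes the InterEdit model, which improves the consistency and fidelity of text-guided editing through Semantic-Aware Plan Token Alignment and Interaction-Aware Frequency Token Alignment.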

Details Motivation: Existing text-guided 3D motion editing methods mainly target single-person scenarios and are hard to extend to multi-person settings, because paired training data is lacking and interactions between people are complex to model. Method: Proposes the InterEdit model, a synchronized classifier-free conditional diffusion model; introduces Semantic-Aware Plan Token Alignment (learnable tokens that capture high-level interaction cues) and Interaction-Aware Frequency Token Alignment (DCT and energy pooling to model periodic motion dynamics). Result: On the newly proposed TMME benchmark, InterEdit outperforms existing methods in both text-to-motion consistency and edit fidelity, reaching SOTA performance; the InterEdit3D dataset and code are also to be released. Conclusion: InterEdit effectively models multi-person interaction dynamics and their link to semantic instructions, offering a scalable, high-performing paradigm for text-driven multi-person motion editing. Abstract: Text-guided 3D motion editing has seen success in single-person scenarios, but its extension to multi-person settings is less explored due to limited paired data and the complexity of inter-person interactions. We introduce the task of multi-person 3D motion editing, where a target motion is generated from a source and a text instruction. To support this, we propose InterEdit3D, a new dataset with manual two-person motion change annotations, and a Text-guided Multi-human Motion Editing (TMME) benchmark. We present InterEdit, a synchronized classifier-free conditional diffusion model for TMME. It introduces Semantic-Aware Plan Token Alignment with learnable tokens to capture high-level interaction cues and an Interaction-Aware Frequency Token Alignment strategy using DCT and energy pooling to model periodic motion dynamics. Experiments show that InterEdit improves text-to-motion consistency and edit fidelity, achieving state-of-the-art TMME performance. The dataset and code will be released at https://github.com/YNG916/InterEdit.
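
As a rough illustration of how a DCT followed by energy pooling can summarize periodic motion, the sketch below applies an unnormalized DCT-II to a single 1D trajectory and sums squared coefficients per frequency band. The band layout, the lack of normalization, and the 1D input are assumptions for illustration; the abstract does not specify how InterEdit builds its frequency tokens.

```python
import math


def dct2(signal):
    """Unnormalized DCT-II of a 1D sequence."""
    n = len(signal)
    return [
        sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(signal))
        for k in range(n)
    ]


def band_energy(signal, num_bands):
    """Sum of squared DCT coefficients in equal-width bands, low to high."""
    coeffs = dct2(signal)
    size = math.ceil(len(coeffs) / num_bands)  # hypothetical band layout
    return [
        sum(c * c for c in coeffs[b * size:(b + 1) * size])
        for b in range(num_bands)
    ]
```

A constant trajectory puts all of its energy in the lowest band, while a rapidly alternating one shifts energy to the highest band, which is the kind of cue that could help distinguish steady poses from periodic motion.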

[144] V-Bridge: Bridging Video Generative Priors to Versatile Few-shot Image Restoration

Shenghe Zheng,Junpeng Jiang,Wenbo Li

Main category: cs.CV

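TL;DR: This paper proposes the V-Bridge framework, which exploits the implicit priors of large-scale pretrained video generative models to match specialized models on multiple image restoration tasks using only 1,000 samples, revealing the transfer potential of video generative models for low-level vision tasks.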

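Details Motivation: Large video generative models have strong generative capability, but their potential as general-purpose visual learners has not been fully explored; traditional image restoration methods rely on large amounts of labeled data and specialized architectures, which makes them inefficient and poor at generalizing. Method: Reframes image restoration as a progressive generative process, using a pretrained video generative model to simulate the step-by-step refinement from degraded input to high-quality output; a small number of multi-task samples (only 1,000) are used, via fine-tuning or prompt guidance, to activate the restoration priors latent in the model. Result: On multiple image restoration tasks (such as deblurring, super-resolution, and denoising), performance comparable to specialized models is reached with less than 2% of the training samples; a single model supports multiple tasks and shows strong generalization and cross-task transfer. Conclusion: Video generative models implicitly learn powerful, transferable image restoration priors that can be activated with very little data; this finding challenges the traditional boundary between generative modeling and low-level vision and offers a new paradigm for designing visual foundation models. Abstract: Large-scale video generative models are trained on vast and diverse visual data, enabling them to internalize rich structural, semantic, and dynamic priors of the visual world. While these models have demonstrated impressive generative capability, their potential as general-purpose visual learners remains largely untapped. In this work, we introduce V-Bridge, a framework that bridges this latent capacity to versatile few-shot image restoration tasks. We reinterpret image restoration not as a static regression problem, but as a progressive generative process, and leverage video models to simulate the gradual refinement from degraded inputs to high-fidelity outputs. Surprisingly, with only 1,000 multi-task training samples (less than 2% of the data used by existing restoration methods), pretrained video models can be induced to perform competitive image restoration, achieving multiple tasks with a single model, rivaling specialized architectures designed explicitly for this purpose. Our findings reveal that video generative models implicitly learn powerful and transferable restoration priors that can be activated with only extremely limited data, challenging the traditional boundary between generative modeling and low-level vision, and opening a new design paradigm for foundation models in visual tasks.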

[145] Reasoning over Video: Evaluating How MLLMs Extract, Integrate, and Reconstruct Spatiotemporal Evidence

Seunghwan Bang,Hwanjun Song

Main category: cs.CV

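TL;DR: This paper proposes the VAEX-BENCH benchmark for evaluating the abstractive spatiotemporal reasoning of multimodal large language models on video, and builds a controllable synthetic egocentric video dataset.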

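Details Motivation: Existing video understanding benchmarks focus mainly on reasoning tasks whose answers can be extracted directly from spatiotemporal events, and do not evaluate abstractive spatiotemporal reasoning, which requires integrating temporal information, fusing dispersed cues, and inferring implicit spatial and contextual structure. Method: Proposes a structured evaluation taxonomy for abstractive spatiotemporal reasoning, builds a controllable, scenario-driven synthetic egocentric video dataset (covering the object, room, and floor-plan levels), and on this basis designs the VAEX-BENCH benchmark, comprising five abstractive reasoning tasks and their extractive counterparts. Result: Experiments show that current SOTA multimodal large language models perform markedly worse on abstractive tasks than on extractive ones, revealing bottlenecks in spatiotemporal integration, cross-frame cue fusion, and implicit structure modeling. Conclusion: VAEX-BENCH provides a new benchmark and analysis framework for evaluating and advancing the abstractive spatiotemporal reasoning of multimodal large language models, emphasizing the need to move beyond simple extractive understanding toward deeper spatiotemporal semantic modeling. Abstract: The growing interest in embodied agents increases the demand for spatiotemporal video understanding, yet existing benchmarks largely emphasize extractive reasoning, where answers can be explicitly presented within spatiotemporal events. It remains unclear whether multimodal large language models can instead perform abstractive spatiotemporal reasoning, which requires integrating observations over time, combining dispersed cues, and inferring implicit spatial and contextual structure. To address this gap, we formalize abstractive spatiotemporal reasoning from videos by introducing a structured evaluation taxonomy that systematically targets its core dimensions and construct a controllable, scenario-driven synthetic egocentric video dataset tailored to evaluate abstractive spatiotemporal reasoning capabilities, spanning object-, room-, and floor-plan-level scenarios. Based on this framework, we present VAEX-BENCH, a benchmark comprising five abstractive reasoning tasks together with their extractive counterparts. Our extensive experiments compare the performance of state-of-the-art MLLMs under extractive and abstractive settings, exposing their limitations on abstractive tasks and providing a fine-grained analysis of the underlying bottlenecks. The dataset will be released soon.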

[146] BenDFM: A taxonomy and synthetic CAD dataset for manufacturability assessment in sheet metal bending

Matteo Ballegeer,Dries F. Benoit

Main category: cs.CV

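TL;DR: This paper proposes a new framework for manufacturability prediction in sheet metal bending, comprising a taxonomy of manufacturability metrics and BenDFM, the first synthetic dataset for this process (20,000 parts with folded and unfolded geometries and multiple labels), and shows the advantage of graph-based representations on this task.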

Details Motivation: Existing learning-based manufacturability prediction methods are limited by inconsistent definitions of manufacturability, heterogeneous label types (discrete or continuous, dependent on or independent of a specific process configuration), and the lack of high-quality datasets that include both manufacturable and unmanufacturable samples and cover real process constraints. Method: 1) Builds a taxonomy of manufacturability metrics along the two axes of configuration dependence and measurement type; 2) designs and releases BenDFM, the first synthetic dataset for sheet metal bending, containing 20,000 parts generated by process-aware simulation (both manufacturable and unmanufacturable), with folded and unfolded geometries and multiple kinds of manufacturability labels; 3) benchmarks two SOTA 3D learning architectures on BenDFM. Result: Graph-based representations that model relationships between part surfaces achieve better accuracy; metrics that depend on a specific manufacturing configuration are harder to predict than configuration-independent ones; BenDFM provides a new benchmark for the systematic study of learning-based DFM. Conclusion: Unifying the definition of manufacturability and building process-specific synthetic datasets with multi-dimensional labels are key to advancing learning-based DFM; BenDFM fills the data gap for sheet metal bending and reveals the link between model representation and task difficulty. Abstract: Predicting the manufacturability of CAD designs early, in terms of both feasibility and required effort, is a key goal of Design for Manufacturing (DFM). Despite advances in deep learning for CAD and its widespread use in manufacturing process selection, learning-based approaches for predicting manufacturability within a specific process remain limited. Two key challenges limit progress: inconsistency across prior work in how manufacturability is defined and consequently in the associated learning targets, and a scarcity of suitable datasets. Existing labels vary significantly: they may reflect intrinsic design constraints or depend on specific manufacturing capabilities (such as available tools), and they range from discrete feasibility checks to continuous complexity measures. Furthermore, industrial datasets typically contain only manufacturable parts, offering little signal for infeasible cases, while existing synthetic datasets focus on simple geometries and subtractive processes. To address these gaps, we propose a taxonomy of manufacturability metrics along the axes of configuration dependence and measurement type, allowing clearer scoping of generalizability and learning objectives. Next, we introduce BenDFM, the first synthetic dataset for manufacturability assessment in sheet metal bending. BenDFM contains 20,000 parts, both manufacturable and unmanufacturable, generated with process-aware bending simulations, providing both folded and unfolded geometries and multiple manufacturability labels across the taxonomy, enabling systematic study of previously unexplored learning-based DFM challenges. We benchmark two state-of-the-art 3D learning architectures on BenDFM, showing that graph-based representations that capture relationships between part surfaces achieve better accuracy, and that predicting metrics that depend on specific manufacturing setups remains more challenging.

[147] NOIR: Neural Operator mapping for Implicit Representations

Sidaty El Hadramy,Nazim Haouchine,Michael Wehrli,Philippe C. Cattin

Main category: cs.CV

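TL;DR: NOIR is a new framework that recasts medical imaging tasks as operator learning between continuous function spaces, using implicit neural representations and neural operators to achieve resolution-independent function-to-function transformations, with strong performance and generalization across multiple tasks and datasets.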

Details Motivation: Challenges the prevailing deep learning paradigm based on discrete pixel or voxel grids and addresses the poor generalization of medical imaging models under changes in resolution. Method: Embeds discrete medical signals into shared Implicit Neural Representations (INRs) and learns a neural operator that maps between their latent modulations, realizing a mapping between continuous function spaces. Result: Achieves competitive performance at native resolution on tasks including 2D/3D segmentation, shape completion, image translation, and image synthesis; is strongly robust to unseen resolutions; and empirically satisfies key theoretical properties of neural operators. Conclusion: NOIR offers a more general, resolution-independent modeling paradigm for medical image analysis, advancing the potential of continuous deep learning for clinical use. Abstract: This paper presents NOIR, a framework that reframes core medical imaging tasks as operator learning between continuous function spaces, challenging the prevailing paradigm of discrete grid-based deep learning. Instead of operating on fixed pixel or voxel grids, NOIR embeds discrete medical signals into shared Implicit Neural Representations and learns a Neural Operator that maps between their latent modulations, enabling resolution-independent function-to-function transformations. We evaluate NOIR across multiple 2D and 3D downstream tasks, including segmentation, shape completion, image-to-image translation, and image synthesis, on several public datasets such as Shenzhen, OASIS-4, SkullBreak, and fastMRI, as well as an in-house clinical dataset. It achieves competitive performance at native resolution while demonstrating strong robustness to unseen discretizations, and empirically satisfies key theoretical properties of neural operators. The project page is available here: https://github.com/Sidaty1/NOIR-io.

[148] Geometry-Guided Camera Motion Understanding in VideoLLMs

Haoan Feng,Sri Harsha Musunuri,Guan-Ming Su

Main category: cs.CV

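TL;DR: This paper presents a systematic solution to the lack of explicit camera-motion modeling in video large language models (VideoLLMs): it builds the CameraMotionDataset and the CameraMotionVQA benchmark, diagnoses the weak motion representations in vision encoders, and designs a lightweight, model-agnostic method for geometric cue extraction and structured prompt injection that improves the models' understanding of camera motion.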

Details Motivation: Current video multimodal large models (VideoLLMs) generally do not explicitly model camera motion, a key geometric signal, so they perform poorly at recognizing fine-grained motion primitives, which limits their use in scenarios such as film understanding and embodied intelligence. Method: Proposes a three-stage benchmarking, diagnosis, and injection framework: 1) builds CameraMotionDataset, a large-scale synthetic dataset with explicit camera control, and the VQA benchmark CameraMotionVQA; 2) uses probing experiments to analyze how weakly camera motion is represented in the vision encoder of Qwen2.5-VL; 3) designs a lightweight pipeline that needs no fine-tuning, in which a 3D foundation model extracts geometric cues, a temporal classifier predicts constrained motion primitives, and the result is injected into VideoLLM inference through structured prompting. Result: Multiple off-the-shelf VideoLLMs show high error rates in motion recognition; probing finds that deeper ViT blocks represent camera motion weakly; without training the model, the proposed method improves motion recognition and makes responses more camera-aware. Conclusion: Explicitly modeling camera motion is essential for improving the geometric understanding and cinematic reasoning of VideoLLM and VLA systems; geometric cue extraction from 3D foundation models combined with structured prompt injection is an efficient, general, and practical approach. Abstract: Camera motion is a fundamental geometric signal that shapes visual perception and cinematic style, yet current video-capable vision-language models (VideoLLMs) rarely represent it explicitly and often fail on fine-grained motion primitives. We address this gap with a framework of $\textbf{benchmarking}$, $\textbf{diagnosis}$, and $\textbf{injection}$. We curate $\textbf{CameraMotionDataset}$, a large-scale synthetic dataset with explicit camera control, formulate camera motion as constraint-aware multi-label recognition, and construct a VQA benchmark--$\textbf{CameraMotionVQA}$. Across diverse off-the-shelf VideoLLMs, we observe substantial errors in recognizing camera motion primitives. Probing experiments on a Qwen2.5-VL vision encoder suggest that camera motion cues are weakly represented, especially in deeper ViT blocks, helping explain the observed failure modes. To bridge this gap without costly training or fine-tuning, we propose a lightweight, model-agnostic pipeline that extracts geometric camera cues from 3D foundation models (3DFMs), predicts constrained motion primitives with a temporal classifier, and injects them into downstream VideoLLM inference via structured prompting. Experiments demonstrate improved motion recognition and more camera-aware model responses, highlighting geometry-driven cue extraction and structured prompting as practical steps toward a camera-aware VideoLLM and VLA system. The dataset and benchmark are publicly available at https://hf.co/datasets/fengyee/camera-motion-dataset-and-benchmark.
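
The last two stages of the pipeline (constrained motion primitives, then structured prompting) can be pictured with a small sketch. Everything below is an assumption made for illustration: the primitive names, the exclusivity pairs, the 0.5 threshold, and the prompt layout are hypothetical and do not come from the paper or its dataset. The classifier scores are hard-coded stand-ins for the output of a temporal classifier.

```python
# Hypothetical primitive names and exclusivity constraints, for illustration only.
EXCLUSIVE_PAIRS = [
    ("pan_left", "pan_right"),
    ("tilt_up", "tilt_down"),
    ("zoom_in", "zoom_out"),
]

def constrain(scores, threshold=0.5):
    """Keep primitives scoring at least `threshold`, then drop the weaker of each
    mutually exclusive pair. Returns labels sorted by descending score."""
    active = {name: s for name, s in scores.items() if s >= threshold}
    for a, b in EXCLUSIVE_PAIRS:
        if a in active and b in active:
            del active[a if active[a] < active[b] else b]
    return sorted(active, key=active.get, reverse=True)

def build_prompt(segments, question):
    """Render per-segment motion labels as a text block placed before the question."""
    lines = ["[Camera motion cues]"]
    for start, end, scores in segments:
        labels = constrain(scores) or ["static"]
        lines.append(f"{start:.1f}s-{end:.1f}s: {', '.join(labels)}")
    lines.append("[Question]")
    lines.append(question)
    return "\n".join(lines)

# Stand-in classifier scores for two video segments (made-up numbers).
segments = [
    (0.0, 2.0, {"pan_left": 0.9, "pan_right": 0.6, "zoom_in": 0.7, "tilt_up": 0.2}),
    (2.0, 4.0, {"pan_left": 0.1}),
]
prompt = build_prompt(segments, "How does the camera move in this clip?")
print(prompt)
```

Because the cues are injected as plain text, a scheme like this leaves the VideoLLM's weights untouched, which is consistent with the training-free goal the abstract describes.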

[149] FDeID-Toolbox: Face De-Identification Toolbox

Hui Wei,Hao Yu,Guoying Zhao

Main category: cs.CV

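TL;DR: This paper presents FDeID-Toolbox, a modular, reproducible toolbox for face de-identification (FDeID) research that addresses the field's fragmented implementations, inconsistent evaluation standards, and incomparable results.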

Details Motivation: Existing FDeID research suffers from fragmented implementations, inconsistent evaluation protocols, and incomparable results, rooted in the complexity of the task: it must serve multiple downstream tasks and be evaluated on privacy, utility, and visual quality. Method: Designs and implements FDeID-Toolbox with four core modules: standardized data loaders, unified method implementations (from classical methods to SOTA generative models), flexible inference pipelines, and systematic evaluation protocols along three dimensions (privacy, utility, quality). Result: Experiments show that the toolbox supports fair, reproducible comparison of diverse FDeID methods under consistent conditions, improving the reproducibility and extensibility of research. Conclusion: FDeID-Toolbox provides standardized, modular, and extensible infrastructure for the FDeID field, promoting more rigorous and collaborative research in privacy-preserving computer vision. Abstract: Face de-identification (FDeID) aims to remove personally identifiable information from facial images while preserving task-relevant utility attributes such as age, gender, and expression. It is critical for privacy-preserving computer vision, yet the field suffers from fragmented implementations, inconsistent evaluation protocols, and incomparable results across studies. These challenges stem from the inherent complexity of the task: FDeID spans multiple downstream applications (e.g., age estimation, gender recognition, expression analysis) and requires evaluation across three dimensions (i.e., privacy protection, utility preservation, and visual quality), making existing codebases difficult to use and extend. To address these issues, we present FDeID-Toolbox, a comprehensive toolbox designed for reproducible FDeID research. Our toolbox features a modular architecture comprising four core components: (1) standardized data loaders for mainstream benchmark datasets, (2) unified method implementations spanning classical approaches to SOTA generative models, (3) flexible inference pipelines, and (4) systematic evaluation protocols covering privacy, utility, and quality metrics. Through experiments, we demonstrate that FDeID-Toolbox enables fair and reproducible comparison of diverse FDeID methods under consistent conditions.

[150] Towards Faithful Multimodal Concept Bottleneck Models

Pierre Moreau,Emeline Pineau Ferrand,Yann Choho,Benjamin Wong,Annabelle Blangero,Milan Bhan

Main category: cs.CV

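TL;DR: This paper proposes f-CBM, a faithful multimodal concept bottleneck model framework that jointly optimizes concept detection and leakage mitigation on a vision-language backbone, balancing task accuracy, concept detection performance, and leakage reduction.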

Details Motivation: Concept Bottleneck Models (CBMs) are understudied in multimodal settings, and their interpretability depends on accurate concept detection and on avoiding information leakage in concept representations; current methods usually handle the two separately and improve one at the cost of predictive accuracy. Method: Proposes the f-CBM framework: 1) a differentiable leakage loss that suppresses information leakage in concept representations; 2) a Kolmogorov-Arnold Network as the prediction head to strengthen concept detection; the whole is built on a vision-language backbone for joint multimodal modeling. Result: f-CBM achieves the best trade-off among task accuracy, concept detection performance, and leakage reduction, and applies seamlessly to both image-text and text-only datasets, showing cross-modal versatility. Conclusion: Through its joint optimization strategy, f-CBM improves the faithfulness and practicality of multimodal CBMs, offering a new paradigm for interpretable multimodal modeling. Abstract: Concept Bottleneck Models (CBMs) are interpretable models that route predictions through a layer of human-interpretable concepts. While widely studied in vision and, more recently, in NLP, CBMs remain largely unexplored in multimodal settings. For their explanations to be faithful, CBMs must satisfy two conditions: concepts must be properly detected, and concept representations must encode only their intended semantics, without smuggling extraneous task-relevant or inter-concept information into final predictions, a phenomenon known as leakage. Existing approaches treat concept detection and leakage mitigation as separate problems, and typically improve one at the expense of predictive accuracy. In this work, we introduce f-CBM, a faithful multimodal CBM framework built on a vision-language backbone that jointly targets both aspects through two complementary strategies: a differentiable leakage loss to mitigate leakage, and a Kolmogorov-Arnold Network prediction head that provides sufficient expressiveness to improve concept detection. Experiments demonstrate that f-CBM achieves the best trade-off between task accuracy, concept detection, and leakage reduction, while applying seamlessly to both image-text and text-only datasets, making it versatile across modalities.

[151] Perceive What Matters: Relevance-Driven Scheduling for Multimodal Streaming Perception

Dingcheng Huang,Xiaotong Zhang,Kamal Youcef-Toumi

Main category: cs.CV

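TL;DR: This paper proposes a lightweight perception scheduling framework that uses the outputs of previous frames and scene context to dynamically schedule only the necessary perception modules, reducing computational latency and improving key metrics for multimodal streaming perception in human-robot collaboration.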

Details Motivation: Existing parallel perception pipelines accumulate latency in streaming perception because multiple perception modules run on every frame, and they suffer from information redundancy and suboptimal allocation of computational resources. Method: Inspired by the Relevance concept and the information sparsity of HRC events, proposes a lightweight perception scheduling framework that uses previous-frame outputs to estimate and schedule the necessary perception modules in real time. Result: Compared with conventional parallel perception pipelines, computational latency falls by up to 27.52%, MMPose activation recall improves by 72.73%, and keyframe accuracy reaches up to 98%. Conclusion: The framework improves real-time perception efficiency without significantly sacrificing accuracy and has the potential to become a scalable, systematic solution for multimodal streaming perception systems in HRC. Abstract: In modern human-robot collaboration (HRC) applications, multiple perception modules jointly extract visual, auditory, and contextual cues to achieve comprehensive scene understanding, enabling the robot to provide appropriate assistance to human agents intelligently. While executing multiple perception modules on a frame-by-frame basis enhances perception quality in offline settings, it inevitably accumulates latency, leading to a substantial decline in system performance in streaming perception scenarios. Recent work in scene understanding, termed Relevance, has established a solid foundation for developing efficient methodologies in HRC. However, modern perception pipelines still face challenges related to information redundancy and suboptimal allocation of computational resources. Drawing inspiration from the Relevance concept and the information sparsity in HRC events, we propose a novel lightweight perception scheduling framework that efficiently leverages output from previous frames to estimate and schedule necessary perception modules in real-time based on scene context. The experimental results demonstrate that the proposed perception scheduling framework effectively reduces computational latency by up to 27.52% compared to conventional parallel perception pipelines, while also achieving a 72.73% improvement in MMPose activation recall. Additionally, the framework demonstrates high keyframe accuracy, achieving rates of up to 98%. The results validate the framework's capability to enhance real-time perception efficiency without significantly compromising accuracy. The framework shows potential as a scalable and systematic solution for multimodal streaming perception systems in HRC.
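
To make the idea of scheduling from previous-frame outputs concrete, here is a deliberately simple rule-based sketch. It is not the paper's scheduler: the module names, the two trigger flags, and the periodic full refresh are all hypothetical choices made for this example, and a real system would presumably estimate relevance rather than use fixed rules.

```python
# Hypothetical module names and rules; a real system would learn or tune these.
ALWAYS_ON = ("object_detector",)
REFRESH_EVERY = 10  # run everything on these frames so stale context cannot persist

def schedule(frame_index, prev_outputs):
    """Pick the perception modules to run on this frame from last frame's outputs."""
    if frame_index % REFRESH_EVERY == 0 or not prev_outputs:
        return ["object_detector", "pose_estimator", "speech_recognizer"]
    modules = list(ALWAYS_ON)
    if prev_outputs.get("person_visible"):
        modules.append("pose_estimator")      # pose only matters when a person is in view
    if prev_outputs.get("audio_active"):
        modules.append("speech_recognizer")   # skip speech recognition during silence
    return modules

plan_idle = schedule(3, {"person_visible": False, "audio_active": False})
plan_busy = schedule(4, {"person_visible": True, "audio_active": False})
plan_refresh = schedule(10, {"person_visible": False, "audio_active": False})
print(plan_idle, plan_busy, plan_refresh)
```

The latency saving in a scheme like this comes from the idle frames, where only the always-on module runs; the periodic refresh trades a little of that saving for protection against missing a newly relevant event.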

[152] Diffusion-Based Feature Denoising and Using NNMF for Robust Brain Tumor Classification

Hiba Adil Al-kharsan,Róbert Rajkó

Main category: cs.CV

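TL;DR: This paper proposes a robust brain tumor MRI classification framework that combines Non-Negative Matrix Factorization (NNMF), a lightweight CNN, and diffusion-based feature purification, balancing classification accuracy with strong adversarial robustness.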

Details Motivation: Deep learning models achieve high accuracy in brain tumor MRI classification but are vulnerable to adversarial perturbations, which raises reliability concerns in medical applications and calls for improved robustness. Method: Extracts interpretable non-negative features with NNMF and selects discriminative components using statistical metrics; classifies with a lightweight CNN; and adds a diffusion-based feature-space purification module (forward noising plus a learned denoiser) to improve robustness. Result: Performs well on both clean accuracy and robust accuracy under AutoAttack, significantly improving adversarial robustness while keeping classification performance competitive. Conclusion: Combining NNMF's interpretable representations, a lightweight CNN, and a diffusion-based defense provides an effective and reliable solution for medical image classification under adversarial conditions. Abstract: Brain tumor classification from magnetic resonance imaging (MRI) plays a critical role in computer-assisted diagnosis systems. In recent years, deep learning models have achieved high classification accuracy. However, their sensitivity to adversarial perturbations has become an important reliability concern in medical applications. This study proposes a robust brain tumor classification framework that combines Non-Negative Matrix Factorization (NNMF or NMF), lightweight convolutional neural networks (CNNs), and diffusion-based feature purification. First, MRI images are preprocessed and converted into a non-negative data matrix, from which compact and interpretable NNMF feature representations are extracted. Statistical metrics, including AUC, Cohen's d, and p-values, are used to rank and choose the most discriminative components. Then, a lightweight CNN classifier is trained directly on the selected feature groups. To improve adversarial robustness, a diffusion-based feature-space purification module is introduced. A forward noising process followed by a learned denoiser network is applied before classification. System performance is evaluated using both clean accuracy and robust accuracy under powerful adversarial attacks generated by AutoAttack. The experimental results show that the proposed framework achieves competitive classification performance while significantly enhancing robustness against adversarial perturbations. The findings suggest that combining interpretable NNMF-based representations with a lightweight deep approach and a diffusion-based defense technique provides an effective and reliable solution for medical image classification under adversarial conditions.
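
Of the three ranking statistics the abstract lists, Cohen's d is the easiest to show in a few lines. The sketch below ranks components by the absolute standardized mean difference between two classes, using the usual pooled-standard-deviation form of Cohen's d. The function names and toy activations are assumptions made for this example; how the paper combines Cohen's d with AUC and p-values is not stated in the abstract and is not modeled here.

```python
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation.
    Raises ZeroDivisionError if both groups have zero variance."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5

def rank_components(activations_a, activations_b):
    """Order component indices by descending |Cohen's d| between the two classes."""
    effect = [abs(cohens_d(a, b)) for a, b in zip(activations_a, activations_b)]
    return sorted(range(len(effect)), key=effect.__getitem__, reverse=True)

# Toy activations: one list per component, one value per image (made-up numbers).
tumor   = [[1.0, 2.0, 3.0], [5.0, 6.0, 7.0]]
healthy = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
ranking = rank_components(tumor, healthy)
print(ranking)
```

Here component 1 separates the classes (d = 4) while component 0 does not (d = 0), so component 1 is ranked first. Keeping only the top-ranked components is one straightforward way to read "choose the most discriminative components".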

[153] Towards Spatio-Temporal World Scene Graph Generation from Monocular Videos

Rohith Peddi,Saurabh,Shravan Shanmugam,Likhitha Pallapothula,Yu Xiang,Parag Singla,Vibhav Gogate

Main category: cs.CV

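TL;DR: This paper proposes the World Scene Graph Generation (WSGG) task, which aims to build 4D world scene graphs containing both observed and unobserved objects, and releases the ActionGenome4D dataset; it proposes three methods for modeling unobserved objects (PWG, MWAE, 4DST) and explores how vision-language models perform on the task.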

Details Motivation: Existing spatio-temporal scene graph methods are frame-centric, operate in 2D, and ignore occluded or temporarily invisible objects, so they do not model world consistency or temporal persistence. Method: Builds the ActionGenome4D 4D dataset; formalizes the WSGG task; proposes three methods: PWG (object permanence via a zero-order feature buffer), MWAE (recasts unobserved-object reasoning as masked completion plus cross-view associative retrieval), and 4DST (differentiable per-object temporal attention enriched with 3D motion and camera-pose features); designs Graph RAG-based approaches to evaluate open-source VLMs on WSGG. Result: Establishes a WSGG benchmark for world-centric, temporally persistent, interpretable scene reasoning; the three methods explore different inductive biases for predicting relationships that involve unobserved objects; VLMs with Graph RAG establish baselines for unlocalized relationship prediction. Conclusion: WSGG moves video scene understanding from frame-centric to world-centric, emphasizing object persistence, 3D consistency, and interpretable relationship modeling, and lays a foundation for long-horizon, occlusion-robust scene understanding by agents. Abstract: Spatio-temporal scene graphs provide a principled representation for modeling evolving object interactions, yet existing methods remain fundamentally frame-centric: they reason only about currently visible objects, discard entities upon occlusion, and operate in 2D. To address this, we first introduce ActionGenome4D, a dataset that upgrades Action Genome videos into 4D scenes via feed-forward 3D reconstruction, world-frame oriented bounding boxes for every object involved in actions, and dense relationship annotations including for objects that are temporarily unobserved due to occlusion or camera motion. Building on this data, we formalize World Scene Graph Generation (WSGG), the task of constructing a world scene graph at each timestamp that encompasses all interacting objects in the scene, both observed and unobserved. We then propose three complementary methods, each exploring a different inductive bias for reasoning about unobserved objects: PWG (Persistent World Graph), which implements object permanence via a zero-order feature buffer; MWAE (Masked World Auto-Encoder), which reframes unobserved-object reasoning as masked completion with cross-view associative retrieval; and 4DST (4D Scene Transformer), which replaces the static buffer with differentiable per-object temporal attention enriched by 3D motion and camera-pose features. We further design and evaluate the performance of strong open-source Vision-Language Models on the WSGG task via a suite of Graph RAG-based approaches, establishing baselines for unlocalized relationship prediction. WSGG thus advances video scene understanding toward world-centric, temporally persistent, and interpretable scene reasoning.
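
A "zero-order feature buffer" can be read as a zero-order hold: when an object is not visible, its last observed feature is carried forward unchanged. The sketch below illustrates that reading only. The class name, the dictionary interface, and the toy features are assumptions for this example and are not taken from the PWG implementation.

```python
class ZeroOrderBuffer:
    """Holds the most recent feature of every object seen so far."""

    def __init__(self):
        self._last_seen = {}

    def step(self, observed):
        """`observed` maps object id -> feature for objects visible in this frame.
        Returns object id -> (feature, is_observed) for every object seen so far."""
        self._last_seen.update(observed)
        return {obj: (feat, obj in observed) for obj, feat in self._last_seen.items()}

buffer = ZeroOrderBuffer()
buffer.step({"cup": [0.2, 0.7], "person": [0.9, 0.1]})
world = buffer.step({"person": [0.8, 0.3]})  # the cup is occluded in this frame
print(world)
```

After the second frame the cup is still part of the world state, flagged as unobserved and carrying its stale feature. That staleness is the limitation the abstract points to when it says 4DST replaces the static buffer with per-object temporal attention.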

[154] Out of Sight, Out of Mind? Evaluating State Evolution in Video World Models

Ziqi Ma,Mengzhan Liufu,Georgia Gkioxari

Main category: cs.CV

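TL;DR: This paper proposes the STEVO-Bench benchmark for evaluating whether video world models can decouple state evolution from observation, revealing the limits of current models' ability to keep evolving state when unobserved.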

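Details Motivation: Explores whether the "worlds" generated by video world models can, like the real world, keep evolving on their own when unobserved (for example, water pouring or ice melting), that is, tests how well the models internally represent state evolution. Method: Designs the STEVO-Bench benchmark, which systematically intervenes in evolving processes through observation-control instructions such as inserting occluders, turning off the light, or steering the camera to "look away", and automatically detects and disentangles the models' failure modes on natural evolution tasks. Result: Experiments show that existing video world models struggle to decouple state evolution from observation: evolution degrades markedly when visual input is absent or observation is interrupted, exposing biases at the data and architecture levels. Conclusion: Current video world models rely heavily on observation signals to drive evolution and lack robust modeling of the physical world's intrinsic dynamics; STEVO-Bench offers a new paradigm for future model design and evaluation. Abstract: Evolutions in the world, such as water pouring or ice melting, happen regardless of being observed. Video world models generate "worlds" via 2D frame observations. Can these generated "worlds" evolve regardless of observation? To probe this question, we design a benchmark to evaluate whether video world models can decouple state evolution from observation. Our benchmark, STEVO-Bench, applies observation control to evolving processes via instructions of occluder insertion, turning off the light, or specifying camera "lookaway" trajectories. By evaluating video models with and without camera control for a diverse set of naturally occurring evolutions, we expose their limitations in decoupling state evolution from observation. STEVO-Bench proposes an evaluation protocol to automatically detect and disentangle failure modes of video world models across key aspects of natural state evolution. Analysis of STEVO-Bench results provides new insight into potential data and architecture bias of present-day video world models. Project website: https://glab-caltech.github.io/STEVOBench/. Blog: https://ziqi-ma.github.io/blog/2026/outofsight/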

[155] Visual-ERM: Reward Modeling for Visual Equivalence

Ziyu Liu,Shengyuan Ding,Xinyu Fang,Xuanlang Dai,Penghui Yang,Jianze Liang,Jiaqi Wang,Kai Chen,Dahua Lin,Yuhang Zang

Main category: cs.CV

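TL;DR: This paper proposes Visual-ERM, a visual equivalence reward model that evaluates the quality of vision-to-code tasks directly and at fine granularity in the rendered image space, improving reinforcement learning for chart, table, and SVG parsing, and builds VC-RewardBench, a new benchmark for fine-grained visual discrepancies.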

Details Motivation: Reinforcement learning for vision-to-code tasks suffers from misaligned reward signals: rewards based on textual rules or coarse visual embedding similarity fail to capture fine-grained visual discrepancies and are vulnerable to reward hacking. Method: Proposes Visual-ERM, a multimodal generative reward model that gives fine-grained, interpretable, task-agnostic feedback directly in the rendered image space; integrates it into the RL training pipeline and combines it with test-time reflection and revision; and builds the new VC-RewardBench benchmark for evaluating fine-grained image-level discrepancies. Result: Visual-ERM improves Qwen3-VL-8B-Instruct by +8.4 on chart-to-code and by +2.7 and +4.1 on average on table and SVG parsing; on VC-RewardBench, the 8B Visual-ERM clearly outperforms Qwen3-VL-235B-Instruct and approaches leading closed-source models. Conclusion: Fine-grained visual reward supervision is both necessary and sufficient for vision-to-code RL, regardless of task specificity. Abstract: Vision-to-code tasks require models to reconstruct structured visual inputs, such as charts, tables, and SVGs, into executable or structured representations with high visual fidelity. While recent Large Vision Language Models (LVLMs) achieve strong results via supervised fine-tuning, reinforcement learning remains challenging due to misaligned reward signals. Existing rewards either rely on textual rules or coarse visual embedding similarity, both of which fail to capture fine-grained visual discrepancies and are vulnerable to reward hacking. We propose Visual Equivalence Reward Model (Visual-ERM), a multimodal generative reward model that provides fine-grained, interpretable, and task-agnostic feedback to evaluate vision-to-code quality directly in the rendered visual space. Integrated into RL, Visual-ERM improves Qwen3-VL-8B-Instruct by +8.4 on chart-to-code and yields consistent gains on table and SVG parsing (+2.7, +4.1 on average), and further strengthens test-time scaling via reflection and revision. We also introduce VisualCritic-RewardBench (VC-RewardBench), a benchmark for judging fine-grained image-to-image discrepancies on structured visual data, where Visual-ERM at 8B decisively outperforms Qwen3-VL-235B-Instruct and approaches leading closed-source models. Our results suggest that fine-grained visual reward supervision is both necessary and sufficient for vision-to-code RL, regardless of task specificity.